TestForge Blog

RAG-Based AI Stock Investment Agent Part 2 — Building a Market Data, News, and Filing Knowledge Base

A practical guide to building the RAG data layer for an AI stock investment Agent. Covers price data, news, SEC filings, earnings transcripts, normalization, chunking, metadata, and freshness-aware retrieval.

TestForge Team

Investment RAG Is Different From General Document RAG

In a stock analysis system, information value decays quickly.

Examples:

  • A news article from two hours ago can matter far more than one from two weeks ago
  • A fresh earnings guidance update can override older narratives
  • A filing type may matter more than a general news article

That means the RAG layer must care deeply about:

  • freshness
  • event type
  • source quality
  • symbol relevance

What Data Should Be Collected?

A practical first version usually needs four data groups.

1. Market Price Data

  • daily OHLCV
  • intraday bars if needed
  • volume
  • volatility-related calculations
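One volatility-related calculation can be sketched in a few lines: annualized realized volatility from daily closes. This is a stdlib-only illustration; the 252-trading-day annualization and the window you feed it are choices to make for your own system.

```python
import math

def realized_vol(closes: list[float], trading_days: int = 252) -> float:
    """Annualized standard deviation of daily log returns.

    Assumes `closes` is an ordered list of daily closing prices with
    at least three entries (two returns).
    """
    rets = [math.log(closes[i] / closes[i - 1]) for i in range(1, len(closes))]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)  # sample variance
    return math.sqrt(var) * math.sqrt(trading_days)
```

Storing a few derived series like this alongside raw bars saves recomputing them at retrieval time.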

2. News Data

  • stock-specific news
  • sector news
  • macro news
  • analyst report summaries if available

3. Regulatory and Company Filings

  • 10-K
  • 10-Q
  • 8-K
  • earnings materials
  • guidance updates

4. Earnings Call Transcripts

  • prepared remarks
  • management commentary
  • analyst Q&A

Transcripts are especially useful for capturing nuance that structured metrics miss.

Storage Layers Should Reflect Data Type

A good split looks like this:

PostgreSQL
- symbol
- price_bar
- corporate_event
- news_article
- filing_document
- transcript_document
- analysis_run

Object Storage
- raw JSON
- original HTML/PDF
- transcript source files

pgvector
- news chunk embeddings
- filing chunk embeddings
- transcript chunk embeddings

Redis
- short-lived analysis cache
- recent symbol context cache

Keep raw and processed forms separate from the start.
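As a sketch, the news portion of the split above might translate into DDL like the following. Table names, column names, and the embedding dimension are assumptions, not a required schema; the point is the separation between the document row (with a pointer back to raw object storage) and the embedded chunks.

```python
# Illustrative DDL strings for a news table plus its pgvector chunk table.
# All identifiers here are assumptions; adapt them to your own schema.

NEWS_ARTICLE_DDL = """
CREATE TABLE IF NOT EXISTS news_article (
    id           BIGSERIAL PRIMARY KEY,
    symbol       TEXT NOT NULL,
    title        TEXT NOT NULL,
    published_at TIMESTAMPTZ NOT NULL,
    source       TEXT,
    raw_uri      TEXT  -- pointer to the raw JSON/HTML kept in object storage
);
"""

NEWS_CHUNK_DDL = """
CREATE TABLE IF NOT EXISTS news_chunk (
    id          BIGSERIAL PRIMARY KEY,
    article_id  BIGINT REFERENCES news_article(id),
    symbol      TEXT NOT NULL,
    chunk_text  TEXT NOT NULL,
    embedding   vector(1536)  -- pgvector column; dimension depends on the model
);
CREATE INDEX IF NOT EXISTS news_chunk_symbol_idx ON news_chunk (symbol);
"""
```

The `raw_uri` column is what keeps raw and processed forms separate: the processed row can always be rebuilt from the original payload.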

Everything Should Be Anchored to the Symbol

The main organizing principle is the stock symbol.

symbol: NVDA
 ├─ price bars
 ├─ earnings events
 ├─ filing documents
 ├─ news articles
 ├─ transcript chunks
 └─ embedding chunks

Without symbol-centric organization, retrieval becomes much less coherent.

News Ingestion Pipeline

A typical pipeline looks like this:

News API Fetch
 -> duplicate filtering
 -> ticker mapping
 -> body extraction
 -> event tagging
 -> chunking
 -> embedding
 -> vector upsert
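The steps above can be sketched as a small driver function. The fetcher, ticker mapper, embedder, and upsert are passed in as callables because in a real system they wrap external APIs; event tagging is omitted for brevity, and the article dict shape is an assumption.

```python
# A minimal sketch of the news ingestion pipeline: dedup -> ticker mapping
# -> chunking -> embedding -> upsert. Hooks are injected as callables.
import hashlib

def article_fingerprint(title: str, body: str) -> str:
    """Hash used for duplicate filtering across syndicated copies."""
    key = title.strip().lower() + body.strip()[:500]
    return hashlib.sha256(key.encode()).hexdigest()

def ingest_articles(articles, seen_hashes, map_tickers, embed, upsert):
    for art in articles:
        fp = article_fingerprint(art["title"], art["body"])
        if fp in seen_hashes:       # duplicate filtering
            continue
        seen_hashes.add(fp)
        symbols = map_tickers(art)  # ticker mapping -- the hardest step
        if not symbols:
            continue
        # paragraph-level chunking, with the article as the main unit
        chunks = [p for p in art["body"].split("\n\n") if p.strip()]
        for chunk in chunks:
            upsert({
                "symbols": symbols,
                "text": chunk,
                "title": art["title"],
                "published_at": art["published_at"],
                "embedding": embed(chunk),
            })
```

Keeping each stage behind a callable makes it easy to swap the news provider or embedding model without touching the pipeline shape.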

Ticker mapping is especially important.

Examples:

  • Does “Apple” refer to the company or the fruit?
  • Does a headline actually concern the target stock?
  • Has the same article already been syndicated elsewhere?

Weak ticker mapping can ruin the whole knowledge base.

Filings Should Be Treated by Type

Not every filing deserves the same weight.

Examples:

  • 8-K: event-heavy
  • 10-Q: quarterly update
  • 10-K: broad annual business context

Useful metadata:

{
  "symbol": "AAPL",
  "doc_type": "10-Q",
  "filing_date": "2026-04-15",
  "period": "Q1 2026",
  "importance": "high",
  "source": "sec"
}

This helps both retrieval and ranking.

Why Transcripts Matter

Transcripts often surface insights not present in structured financial data.

Examples:

  • whether management sounds confident or cautious
  • how often demand or supply constraints are mentioned
  • how analysts challenge guidance

That makes transcripts extremely useful for qualitative investment analysis.

Chunking Should Vary by Source Type

News

  • one article as the main unit
  • paragraph-level splitting when needed
  • preserve title and publish time

Filings

  • split by major sections
  • keep Risk Factors, Guidance, and Financial Results separate
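Section splitting can be sketched with a heading regex. This assumes the filing text has already been extracted to plain text with "Item 1A. Risk Factors"-style headings on their own lines; the heading pattern is an assumption and real filings need more robust parsing.

```python
# A sketch of section-based filing chunking on "Item N." headings.
import re

SECTION_RE = re.compile(r"^(Item\s+\d+[A-Z]?\.\s+.+)$", re.MULTILINE)

def split_filing(text: str) -> list[dict]:
    """Split filing text on Item headings; each chunk keeps its heading."""
    parts = SECTION_RE.split(text)
    # parts = [preamble, heading1, body1, heading2, body2, ...]
    return [
        {"section": parts[i].strip(), "text": parts[i + 1].strip()}
        for i in range(1, len(parts) - 1, 2)
    ]
```

Because each chunk carries its heading, Risk Factors and Financial Results stay separate and can be filtered on directly at retrieval time.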

Transcripts

  • chunk by speaker turn
  • separate prepared remarks and Q&A
  • preserve speaker role metadata

Example transcript metadata:

{
  "symbol": "MSFT",
  "source_type": "transcript",
  "speaker": "CEO",
  "section": "Prepared Remarks",
  "event_date": "2026-04-12"
}
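Speaker-turn chunking can be sketched as below, assuming the transcript is already plain text with lines like `CEO: ...` or `Analyst (Bank): ...`. That line format is an assumption; real transcript feeds vary, and a line that merely contains a colon can be mis-detected as a new turn.

```python
# A sketch of speaker-turn chunking that emits one chunk per turn,
# carrying the metadata fields shown in the example above.
import re

# Assumes turns start with "Speaker Name:"; this pattern is an assumption.
TURN_RE = re.compile(r"^([A-Z][\w .()-]{1,40}):\s*(.*)$")

def chunk_transcript(lines, symbol, event_date, section):
    chunks, speaker, buf = [], None, []

    def flush():
        if speaker and buf:
            chunks.append({
                "symbol": symbol,
                "source_type": "transcript",
                "speaker": speaker,
                "section": section,
                "event_date": event_date,
                "text": " ".join(buf).strip(),
            })

    for line in lines:
        m = TURN_RE.match(line.strip())
        if m:                       # a new speaker turn begins
            flush()
            speaker, buf = m.group(1), [m.group(2)]
        elif speaker:               # continuation of the current turn
            buf.append(line.strip())
    flush()
    return chunks
```

Keeping the speaker role in metadata is what later lets retrieval distinguish management commentary from analyst questions.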

Retrieval Should Include Time Weighting

In investing, newer information often matters more, but not always.

Examples:

  • A rumor from yesterday may matter less than an official filing from three days ago
  • A fresh article may be less important than a recent earnings release

A practical ranking formula can mix:

  • semantic similarity
  • document-type weight
  • recency
  • symbol match quality
  • source credibility (often folded into the document-type weight)

final_score = (
    similarity * 0.5
    + recency_score * 0.2
    + doc_importance * 0.2
    + symbol_match * 0.1
)
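The blend above can be sketched with an exponential recency decay. The 72-hour half-life and the per-type importance weights are assumptions to tune against your own backtests, not recommended values.

```python
# A sketch of the blended ranking score with exponential recency decay.
import math
from datetime import datetime, timezone

# Per-type weights are assumptions; they also stand in for source credibility.
DOC_IMPORTANCE = {"filing": 1.0, "transcript": 0.8, "news": 0.5}

def recency_score(published_at: datetime, half_life_hours: float = 72.0) -> float:
    """Decay from 1.0 toward 0.0; halves every `half_life_hours`."""
    age_h = (datetime.now(timezone.utc) - published_at).total_seconds() / 3600
    return math.exp(-math.log(2) * max(age_h, 0.0) / half_life_hours)

def final_score(similarity: float, published_at: datetime,
                doc_type: str, symbol_match: float) -> float:
    return (
        similarity * 0.5
        + recency_score(published_at) * 0.2
        + DOC_IMPORTANCE.get(doc_type, 0.3) * 0.2
        + symbol_match * 0.1
    )
```

The decay shape matters: an exponential half-life down-weights stale news smoothly instead of cutting it off at an arbitrary date, which matches the "newer usually, but not always, matters more" rule.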

Retrieval Needs Strong Filters

Example query:

Summarize Tesla risk factors over the last two weeks

Useful filters:

  • symbol = TSLA
  • date >= now - 14d
  • source_type in [news, filing, transcript]

Without these filters, the context set becomes too noisy.
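With pgvector, those filters sit directly in the similarity query. The sketch below assumes a `doc_chunk` table with `symbol`, `published_at`, `source_type`, and `embedding` columns (all assumed names), and uses pgvector's `<=>` cosine-distance operator; the query vector itself would come from embedding the user question elsewhere.

```python
# An illustrative filtered pgvector query for the Tesla example above,
# written as a parametrized SQL string (psycopg-style placeholders).
FILTERED_SEARCH_SQL = """
SELECT chunk_text, published_at, source_type,
       embedding <=> %(query_vec)s AS distance
FROM   doc_chunk
WHERE  symbol = %(symbol)s
  AND  published_at >= now() - interval '14 days'
  AND  source_type = ANY(%(source_types)s)
ORDER BY distance
LIMIT  20;
"""

params = {
    "symbol": "TSLA",
    "source_types": ["news", "filing", "transcript"],
    # "query_vec": the embedded query, bound via the pgvector adapter
}
```

Filtering in SQL before ranking keeps the candidate set small and on-topic, so the time and importance weighting only has to reorder relevant chunks.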

Build a Normalized Event Layer

Instead of treating each document as isolated, build event records like:

corporate_event
- symbol
- event_type
- event_time
- source_ids
- summary

Useful event types:

  • earnings_release
  • guidance_change
  • analyst_downgrade
  • regulation_news
  • product_launch

This event layer makes recent context far easier for the Agent to reason about.
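The event record above maps naturally onto a small validated type. This is a sketch; the field names mirror the columns listed, and the closed set of event types is the suggested list, which a real system would extend.

```python
# A sketch of the normalized corporate_event record as a dataclass.
from dataclasses import dataclass, field
from datetime import datetime

EVENT_TYPES = {
    "earnings_release", "guidance_change", "analyst_downgrade",
    "regulation_news", "product_launch",
}

@dataclass
class CorporateEvent:
    symbol: str
    event_type: str
    event_time: datetime
    summary: str
    source_ids: list[str] = field(default_factory=list)  # links back to documents

    def __post_init__(self):
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event_type: {self.event_type}")
```

Validating `event_type` at creation keeps the event layer a closed vocabulary the Agent can reason over, rather than free text.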

Practical Scheduling

Different feeds deserve different cadence:

  • price data: minute-level or daily
  • news: every 5 to 15 minutes
  • filings: every 10 to 30 minutes
  • transcripts: event-driven or delayed batch

Full real-time infrastructure is not required on day one, but freshness monitoring is.

Common Mistakes

  • treating recency and credibility as the same thing
  • weak symbol mapping
  • ranking general news and official filings equally

These mistakes degrade analysis quality very quickly.

Closing Thoughts

The RAG layer in an investment Agent is not just a document store. It is closer to a market-event knowledge base.

The foundations are:

  • organize everything around the symbol
  • preserve structure by source type
  • make retrieval freshness-aware and importance-aware

That is what makes downstream stock analysis much more useful.