TestForge Blog

RAG-Based AI Stock Investment Agent Part 2 — Building a Market Data, News, and Filing Knowledge Base

A practical guide to building the RAG data layer for an AI stock investment Agent. Covers price data, news, SEC filings, earnings transcripts, normalization, chunking, metadata, and freshness-aware retrieval.

TestForge Team

Investment RAG Is Different From General Document RAG

In a stock analysis system, information value decays quickly.

Examples:

  • A news article from two hours ago can matter far more than one from two weeks ago
  • A fresh earnings guidance update can override older narratives
  • A filing type may matter more than a general news article

That means the RAG layer must care deeply about:

  • freshness
  • event type
  • source quality
  • symbol relevance

What Data Should Be Collected?

A practical first version usually needs four data groups.

1. Market Price Data

  • daily OHLCV
  • intraday bars if needed
  • volume
  • volatility-related calculations
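One volatility-related calculation can be sketched in a few lines: annualized realized volatility from daily closes. This is a stdlib-only illustration; the 252-trading-day annualization and the window you feed it are choices to make for your own system.

```python
import math

def realized_vol(closes: list[float], trading_days: int = 252) -> float:
    """Annualized standard deviation of daily log returns.

    Assumes `closes` is an ordered list of daily closing prices with
    at least three entries (two returns).
    """
    rets = [math.log(closes[i] / closes[i - 1]) for i in range(1, len(closes))]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)  # sample variance
    return math.sqrt(var) * math.sqrt(trading_days)
```

Storing a few derived series like this alongside raw bars saves recomputing them at retrieval time.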

2. News Data

  • stock-specific news
  • sector news
  • macro news
  • analyst report summaries if available

3. Regulatory and Company Filings

  • 10-K
  • 10-Q
  • 8-K
  • earnings materials
  • guidance updates

4. Earnings Call Transcripts

  • prepared remarks
  • management commentary
  • analyst Q&A

Transcripts are especially useful for capturing nuance that structured metrics miss.

Storage Layers Should Reflect Data Type

A good split looks like this:

PostgreSQL
- symbol
- price_bar
- corporate_event
- news_article
- filing_document
- transcript_document
- analysis_run

Object Storage
- raw JSON
- original HTML/PDF
- transcript source files

pgvector
- news chunk embeddings
- filing chunk embeddings
- transcript chunk embeddings

Redis
- short-lived analysis cache
- recent symbol context cache

Keep raw and processed forms separate from the start.
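As a sketch, the news portion of the split above might translate into DDL like the following. Table names, column names, and the embedding dimension are assumptions, not a required schema; the point is the separation between the document row (with a pointer back to raw object storage) and the embedded chunks.

```python
# Illustrative DDL strings for a news table plus its pgvector chunk table.
# All identifiers here are assumptions; adapt them to your own schema.

NEWS_ARTICLE_DDL = """
CREATE TABLE IF NOT EXISTS news_article (
    id           BIGSERIAL PRIMARY KEY,
    symbol       TEXT NOT NULL,
    title        TEXT NOT NULL,
    published_at TIMESTAMPTZ NOT NULL,
    source       TEXT,
    raw_uri      TEXT  -- pointer to the raw JSON/HTML kept in object storage
);
"""

NEWS_CHUNK_DDL = """
CREATE TABLE IF NOT EXISTS news_chunk (
    id          BIGSERIAL PRIMARY KEY,
    article_id  BIGINT REFERENCES news_article(id),
    symbol      TEXT NOT NULL,
    chunk_text  TEXT NOT NULL,
    embedding   vector(1536)  -- pgvector column; dimension depends on the model
);
CREATE INDEX IF NOT EXISTS news_chunk_symbol_idx ON news_chunk (symbol);
"""
```

The `raw_uri` column is what keeps raw and processed forms separate: the processed row can always be rebuilt from the original payload.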

Everything Should Be Anchored to the Symbol

The main organizing principle is the stock symbol.

symbol: NVDA
 ├─ price bars
 ├─ earnings events
 ├─ filing documents
 ├─ news articles
 ├─ transcript chunks
 └─ embedding chunks

Without symbol-centric organization, retrieval becomes much less coherent.

News Ingestion Pipeline

A typical pipeline looks like this:

News API Fetch
 -> duplicate filtering
 -> ticker mapping
 -> body extraction
 -> event tagging
 -> chunking
 -> embedding
 -> vector upsert
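The steps above can be sketched as a small driver function. The fetcher, ticker mapper, embedder, and upsert are passed in as callables because in a real system they wrap external APIs; event tagging is omitted for brevity, and the article dict shape is an assumption.

```python
# A minimal sketch of the news ingestion pipeline: dedup -> ticker mapping
# -> chunking -> embedding -> upsert. Hooks are injected as callables.
import hashlib

def article_fingerprint(title: str, body: str) -> str:
    """Hash used for duplicate filtering across syndicated copies."""
    key = title.strip().lower() + body.strip()[:500]
    return hashlib.sha256(key.encode()).hexdigest()

def ingest_articles(articles, seen_hashes, map_tickers, embed, upsert):
    for art in articles:
        fp = article_fingerprint(art["title"], art["body"])
        if fp in seen_hashes:       # duplicate filtering
            continue
        seen_hashes.add(fp)
        symbols = map_tickers(art)  # ticker mapping -- the hardest step
        if not symbols:
            continue
        # paragraph-level chunking, with the article as the main unit
        chunks = [p for p in art["body"].split("\n\n") if p.strip()]
        for chunk in chunks:
            upsert({
                "symbols": symbols,
                "text": chunk,
                "title": art["title"],
                "published_at": art["published_at"],
                "embedding": embed(chunk),
            })
```

Keeping each stage behind a callable makes it easy to swap the news provider or embedding model without touching the pipeline shape.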

Ticker mapping is especially important.

Examples:

  • Does “Apple” refer to the company or the fruit?
  • Does a headline actually concern the target stock?
  • Has the same article already been syndicated elsewhere?

Weak ticker mapping can ruin the whole knowledge base.

Filings Should Be Treated by Type

Not every filing deserves the same weight.

Examples:

  • 8-K: event-heavy
  • 10-Q: quarterly update
  • 10-K: broad annual business context

Useful metadata:

{
  "symbol": "AAPL",
  "doc_type": "10-Q",
  "filing_date": "2026-04-15",
  "period": "Q1 2026",
  "importance": "high",
  "source": "sec"
}

This helps both retrieval and ranking.

Why Transcripts Matter

Transcripts often surface insights not present in structured financial data.

Examples:

  • whether management sounds confident or cautious
  • how often demand or supply constraints are mentioned
  • how analysts challenge guidance

That makes transcripts extremely useful for qualitative investment analysis.

Chunking Should Vary by Source Type

News

  • one article as the main unit
  • paragraph-level splitting when needed
  • preserve title and publish time

Filings

  • split by major sections
  • keep Risk Factors, Guidance, and Financial Results separate
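Section splitting can be sketched with a heading regex. This assumes the filing text has already been extracted to plain text with "Item 1A. Risk Factors"-style headings on their own lines; the heading pattern is an assumption and real filings need more robust parsing.

```python
# A sketch of section-based filing chunking on "Item N." headings.
import re

SECTION_RE = re.compile(r"^(Item\s+\d+[A-Z]?\.\s+.+)$", re.MULTILINE)

def split_filing(text: str) -> list[dict]:
    """Split filing text on Item headings; each chunk keeps its heading."""
    parts = SECTION_RE.split(text)
    # parts = [preamble, heading1, body1, heading2, body2, ...]
    return [
        {"section": parts[i].strip(), "text": parts[i + 1].strip()}
        for i in range(1, len(parts) - 1, 2)
    ]
```

Because each chunk carries its heading, Risk Factors and Financial Results stay separate and can be filtered on directly at retrieval time.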

Transcripts

  • chunk by speaker turn
  • separate prepared remarks and Q&A
  • preserve speaker role metadata

Example transcript metadata:

{
  "symbol": "MSFT",
  "source_type": "transcript",
  "speaker": "CEO",
  "section": "Prepared Remarks",
  "event_date": "2026-04-12"
}
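Speaker-turn chunking can be sketched as below, assuming the transcript is already plain text with lines like `CEO: ...` or `Analyst (Bank): ...`. That line format is an assumption; real transcript feeds vary, and a line that merely contains a colon can be mis-detected as a new turn.

```python
# A sketch of speaker-turn chunking that emits one chunk per turn,
# carrying the metadata fields shown in the example above.
import re

# Assumes turns start with "Speaker Name:"; this pattern is an assumption.
TURN_RE = re.compile(r"^([A-Z][\w .()-]{1,40}):\s*(.*)$")

def chunk_transcript(lines, symbol, event_date, section):
    chunks, speaker, buf = [], None, []

    def flush():
        if speaker and buf:
            chunks.append({
                "symbol": symbol,
                "source_type": "transcript",
                "speaker": speaker,
                "section": section,
                "event_date": event_date,
                "text": " ".join(buf).strip(),
            })

    for line in lines:
        m = TURN_RE.match(line.strip())
        if m:                       # a new speaker turn begins
            flush()
            speaker, buf = m.group(1), [m.group(2)]
        elif speaker:               # continuation of the current turn
            buf.append(line.strip())
    flush()
    return chunks
```

Keeping the speaker role in metadata is what later lets retrieval distinguish management commentary from analyst questions.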

Retrieval Should Include Time Weighting

In investing, newer information often matters more, but not always.

Examples:

  • A rumor from yesterday may matter less than an official filing from three days ago
  • A fresh article may be less important than a recent earnings release

A practical ranking formula can mix:

  • semantic similarity
  • document-type weight
  • recency
  • symbol match quality
  • source credibility (often folded into the document-type weight)

final_score = (
    similarity * 0.5
    + recency_score * 0.2
    + doc_importance * 0.2
    + symbol_match * 0.1
)
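The blend above can be sketched with an exponential recency decay. The 72-hour half-life and the per-type importance weights are assumptions to tune against your own backtests, not recommended values.

```python
# A sketch of the blended ranking score with exponential recency decay.
import math
from datetime import datetime, timezone

# Per-type weights are assumptions; they also stand in for source credibility.
DOC_IMPORTANCE = {"filing": 1.0, "transcript": 0.8, "news": 0.5}

def recency_score(published_at: datetime, half_life_hours: float = 72.0) -> float:
    """Decay from 1.0 toward 0.0; halves every `half_life_hours`."""
    age_h = (datetime.now(timezone.utc) - published_at).total_seconds() / 3600
    return math.exp(-math.log(2) * max(age_h, 0.0) / half_life_hours)

def final_score(similarity: float, published_at: datetime,
                doc_type: str, symbol_match: float) -> float:
    return (
        similarity * 0.5
        + recency_score(published_at) * 0.2
        + DOC_IMPORTANCE.get(doc_type, 0.3) * 0.2
        + symbol_match * 0.1
    )
```

The decay shape matters: an exponential half-life down-weights stale news smoothly instead of cutting it off at an arbitrary date, which matches the "newer usually, but not always, matters more" rule.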

Retrieval Needs Strong Filters

Example query:

Summarize Tesla risk factors over the last two weeks

Useful filters:

  • symbol = TSLA
  • date >= now - 14d
  • source_type in [news, filing, transcript]

Without these filters, the context set becomes too noisy.
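With pgvector, those filters sit directly in the similarity query. The sketch below assumes a `doc_chunk` table with `symbol`, `published_at`, `source_type`, and `embedding` columns (all assumed names), and uses pgvector's `<=>` cosine-distance operator; the query vector itself would come from embedding the user question elsewhere.

```python
# An illustrative filtered pgvector query for the Tesla example above,
# written as a parametrized SQL string (psycopg-style placeholders).
FILTERED_SEARCH_SQL = """
SELECT chunk_text, published_at, source_type,
       embedding <=> %(query_vec)s AS distance
FROM   doc_chunk
WHERE  symbol = %(symbol)s
  AND  published_at >= now() - interval '14 days'
  AND  source_type = ANY(%(source_types)s)
ORDER BY distance
LIMIT  20;
"""

params = {
    "symbol": "TSLA",
    "source_types": ["news", "filing", "transcript"],
    # "query_vec": the embedded query, bound via the pgvector adapter
}
```

Filtering in SQL before ranking keeps the candidate set small and on-topic, so the time and importance weighting only has to reorder relevant chunks.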

Build a Normalized Event Layer

Instead of treating each document as isolated, build event records like:

corporate_event
- symbol
- event_type
- event_time
- source_ids
- summary

Useful event types:

  • earnings_release
  • guidance_change
  • analyst_downgrade
  • regulation_news
  • product_launch

This event layer makes recent context far easier for the Agent to reason about.
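The event record above maps naturally onto a small validated type. This is a sketch; the field names mirror the columns listed, and the closed set of event types is the suggested list, which a real system would extend.

```python
# A sketch of the normalized corporate_event record as a dataclass.
from dataclasses import dataclass, field
from datetime import datetime

EVENT_TYPES = {
    "earnings_release", "guidance_change", "analyst_downgrade",
    "regulation_news", "product_launch",
}

@dataclass
class CorporateEvent:
    symbol: str
    event_type: str
    event_time: datetime
    summary: str
    source_ids: list[str] = field(default_factory=list)  # links back to documents

    def __post_init__(self):
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event_type: {self.event_type}")
```

Validating `event_type` at creation keeps the event layer a closed vocabulary the Agent can reason over, rather than free text.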

Practical Scheduling

Different feeds deserve different cadence:

  • price data: minute-level or daily
  • news: every 5 to 15 minutes
  • filings: every 10 to 30 minutes
  • transcripts: event-driven or delayed batch

Full real-time infrastructure is not required on day one, but freshness monitoring is.

Common Mistakes

  • treating recency and credibility as the same thing
  • weak symbol mapping
  • ranking general news and official filings equally

These mistakes degrade analysis quality very quickly.

Closing Thoughts

The RAG layer in an investment Agent is not just a document store. It is closer to a market-event knowledge base.

The foundations are:

  • organize everything around the symbol
  • preserve structure by source type
  • make retrieval freshness-aware and importance-aware

That is what makes downstream stock analysis much more useful.