RAG-Based AI Stock Investment Agent Part 2 — Building a Market Data, News, and Filing Knowledge Base
A practical guide to building the RAG data layer for an AI stock investment Agent. Covers price data, news, SEC filings, earnings transcripts, normalization, chunking, metadata, and freshness-aware retrieval.
Investment RAG Is Different From General Document RAG
In a stock analysis system, information value decays quickly.
Examples:
- A news article from two hours ago can matter far more than one from two weeks ago
- A fresh earnings guidance update can override older narratives
- An official filing may matter more than a general news article
That means the RAG layer must care deeply about:
- freshness
- event type
- source quality
- symbol relevance
What Data Should Be Collected?
A practical first version usually needs four data groups.
1. Market Price Data
- daily OHLCV
- intraday bars if needed
- volume
- volatility-related calculations
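As a concrete example of a volatility-related calculation, here is a minimal sketch that derives annualized realized volatility from daily closing prices. The 252-trading-day convention and the use of log returns are standard assumptions, not requirements.

```python
import math
import statistics

def realized_volatility(closes: list[float], trading_days: int = 252) -> float:
    """Annualized realized volatility from daily closing prices."""
    # Daily log returns: ln(close_t / close_{t-1})
    returns = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    # Sample standard deviation of returns, scaled to an annual horizon
    return statistics.stdev(returns) * math.sqrt(trading_days)
```

A value like this can be stored alongside each price_bar row so the Agent never recomputes it at query time.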
2. News Data
- stock-specific news
- sector news
- macro news
- analyst report summaries if available
3. Regulatory and Company Filings
- 10-K
- 10-Q
- 8-K
- earnings materials
- guidance updates
4. Earnings Call Transcripts
- prepared remarks
- management commentary
- analyst Q&A
Transcripts are especially useful for capturing nuance that structured metrics miss.
Storage Layers Should Reflect Data Type
A good split looks like this:
PostgreSQL
- symbol
- price_bar
- corporate_event
- news_article
- filing_document
- transcript_document
- analysis_run
Object Storage
- raw JSON
- original HTML/PDF
- transcript source files
pgvector
- news chunk embeddings
- filing chunk embeddings
- transcript chunk embeddings
Redis
- short-lived analysis cache
- recent symbol context cache
Keep raw and processed forms separate from the start.
Everything Should Be Anchored to the Symbol
The main organizing principle is the stock symbol.
symbol: NVDA
├─ price bars
├─ earnings events
├─ filing documents
├─ news articles
├─ transcript chunks
└─ embedding chunks
Without symbol-centric organization, retrieval becomes much less coherent.
News Ingestion Pipeline
A typical pipeline looks like this:
News API Fetch
-> duplicate filtering
-> ticker mapping
-> body extraction
-> event tagging
-> chunking
-> embedding
-> vector upsert
Ticker mapping is especially important.
Examples:
- Does “Apple” refer to the company or the fruit?
- Does a headline actually concern the target stock?
- Has the same article already been syndicated elsewhere?
Weak ticker mapping can ruin the whole knowledge base.
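A minimal sketch of alias-based ticker mapping with a context check for the "Apple the company vs. apple the fruit" problem. The alias table and context keywords are assumptions for illustration; production systems use a maintained symbology feed.

```python
# Illustrative alias table: lowercase alias -> ticker symbol
ALIASES = {
    "apple": "AAPL",
    "apple inc": "AAPL",
    "nvidia": "NVDA",
}

# Words suggesting the text is about the company, not a homonym
COMPANY_CONTEXT = {"shares", "stock", "earnings", "ceo", "guidance", "nasdaq"}

def map_tickers(text: str) -> set:
    lowered = text.lower()
    hits = {symbol for alias, symbol in ALIASES.items() if alias in lowered}
    # Reject matches when no company-context word appears anywhere in the text
    if hits and COMPANY_CONTEXT.isdisjoint(lowered.split()):
        return set()
    return hits
```

Even this crude context gate removes a large share of false positives; the next refinement is usually per-alias ambiguity flags so that unambiguous names like "Nvidia" skip the check.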
Filings Should Be Treated by Type
Not every filing deserves the same weight.
Examples:
- 8-K: event-heavy
- 10-Q: quarterly update
- 10-K: broad annual business context
Useful metadata:
{
"symbol": "AAPL",
"doc_type": "10-Q",
"filing_date": "2026-04-15",
"period": "Q1 2026",
"importance": "high",
"source": "sec"
}
This helps both retrieval and ranking.
Why Transcripts Matter
Transcripts often surface insights not present in structured financial data.
Examples:
- whether management sounds confident or cautious
- how often demand or supply constraints are mentioned
- how analysts challenge guidance
That makes transcripts extremely useful for qualitative investment analysis.
Chunking Should Vary by Source Type
News
- one article as the main unit
- paragraph-level splitting when needed
- preserve title and publish time
Filings
- split by major sections
- keep Risk Factors, Guidance, and Financial Results separate
Transcripts
- chunk by speaker turn
- separate prepared remarks and Q&A
- preserve speaker role metadata
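A minimal speaker-turn chunker, assuming a plain-text transcript with lines shaped like "Satya Nadella (CEO): ..." — that line format is an assumption; real transcript feeds vary and need their own parsers.

```python
import re

# One turn per line: "Name (Role): spoken text"
TURN_RE = re.compile(r"^(?P<speaker>[^(:]+)\((?P<role>[^)]+)\):\s*(?P<text>.*)$")

def chunk_transcript(raw: str, symbol: str, section: str) -> list:
    chunks = []
    for line in raw.splitlines():
        m = TURN_RE.match(line.strip())
        if not m:
            # Continuation line: append to the previous speaker turn
            if chunks and line.strip():
                chunks[-1]["text"] += " " + line.strip()
            continue
        chunks.append({
            "symbol": symbol,
            "source_type": "transcript",
            "speaker": m["role"].strip(),  # store the role, e.g. "CEO"
            "section": section,
            "text": m["text"].strip(),
        })
    return chunks
```

Chunking at speaker-turn granularity keeps each embedded chunk attributable, which matters when the Agent needs to distinguish management statements from analyst questions.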
Example transcript metadata:
{
"symbol": "MSFT",
"source_type": "transcript",
"speaker": "CEO",
"section": "Prepared Remarks",
"event_date": "2026-04-12"
}
Retrieval Should Include Time Weighting
In investing, newer information often matters more, but not always.
Examples:
- A rumor from yesterday may matter less than an official filing from three days ago
- A fresh article may be less important than a recent earnings release
A practical ranking formula can mix:
- semantic similarity
- document-type weight
- recency
- symbol match quality
- source credibility
final_score = (
similarity * 0.5
+ recency_score * 0.2
+ doc_importance * 0.2
+ symbol_match * 0.1
)
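The formula above can be sketched as follows. The 48-hour half-life for recency decay and the per-type importance values are assumptions to tune per use case, not recommendations.

```python
from datetime import datetime, timezone

# Illustrative document-type weights (higher = more trusted / more important)
DOC_IMPORTANCE = {"filing": 1.0, "transcript": 0.9, "news": 0.6}

def recency_score(published_at: datetime, half_life_hours: float = 48.0) -> float:
    """Exponential decay: 1.0 when brand new, 0.5 after one half-life."""
    age_h = (datetime.now(timezone.utc) - published_at).total_seconds() / 3600
    return 0.5 ** (max(age_h, 0.0) / half_life_hours)

def final_score(similarity: float, published_at: datetime,
                doc_type: str, symbol_match: float) -> float:
    return (
        similarity * 0.5
        + recency_score(published_at) * 0.2
        + DOC_IMPORTANCE.get(doc_type, 0.5) * 0.2
        + symbol_match * 0.1
    )
```

Exponential decay, rather than a hard cutoff, lets an important older filing still outrank a marginal fresh article.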
Retrieval Needs Strong Filters
Example query:
Summarize Tesla risk factors over the last two weeks
Useful filters:
- symbol = TSLA
- date >= now - 14d
- source_type in [news, filing, transcript]
Without these filters, the context set becomes too noisy.
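As a sketch, the filters above can be expressed as a metadata predicate applied before (or alongside) the vector search; in a pgvector setup the same conditions would typically live in the SQL WHERE clause instead.

```python
from datetime import datetime, timedelta, timezone

def passes_filters(doc: dict, symbol: str, max_age_days: int,
                   source_types: set) -> bool:
    """Hard metadata filter applied before semantic ranking."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return (
        doc["symbol"] == symbol
        and doc["published_at"] >= cutoff
        and doc["source_type"] in source_types
    )
```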
Build a Normalized Event Layer
Instead of treating each document as isolated, build event records like:
corporate_event
- symbol
- event_type
- event_time
- source_ids
- summary
Useful event types:
- earnings_release
- guidance_change
- analyst_downgrade
- regulation_news
- product_launch
This event layer makes recent context far easier for the Agent to reason about.
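The event record above can be sketched as a simple dataclass; the field names follow the list, while the roll-up helper is an illustrative addition showing how several documents can share one event.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class CorporateEvent:
    symbol: str
    event_type: str          # e.g. "earnings_release", "guidance_change"
    event_time: datetime
    source_ids: list = field(default_factory=list)
    summary: str = ""

def add_source(event: CorporateEvent, doc_id: str) -> None:
    # A filing, its news coverage, and a transcript can all roll up
    # into the same normalized event record
    if doc_id not in event.source_ids:
        event.source_ids.append(doc_id)
```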
Practical Scheduling
Different feeds deserve different cadence:
- price data: minute-level or daily
- news: every 5 to 15 minutes
- filings: every 10 to 30 minutes
- transcripts: event-driven or delayed batch
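The polling cadences above can be sketched as a simple due-check; the specific intervals are placeholders, and a production system would hand this to a real scheduler (cron, Celery beat, and so on) rather than polling in a loop.

```python
from datetime import datetime, timedelta, timezone

# Illustrative per-feed cadences matching the list above
FEED_INTERVALS = {
    "prices": timedelta(minutes=1),
    "news": timedelta(minutes=10),
    "filings": timedelta(minutes=20),
}

def due_feeds(last_run: dict, now: datetime) -> list:
    """Return the feeds whose interval has elapsed since their last run."""
    never = datetime.min.replace(tzinfo=timezone.utc)
    return [
        feed for feed, interval in FEED_INTERVALS.items()
        if now - last_run.get(feed, never) >= interval
    ]
```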
Full real-time infrastructure is not required on day one, but freshness monitoring is.
Common Mistakes
- treating recency and credibility as the same thing
- weak symbol mapping
- ranking general news and official filings equally
These mistakes degrade analysis quality very quickly.
Closing Thoughts
The RAG layer in an investment Agent is not just a document store. It is closer to a market-event knowledge base.
The foundations are:
- organize everything around the symbol
- preserve structure by source type
- make retrieval freshness-aware and importance-aware
That is what makes downstream stock analysis much more useful.