RAG Development Part 1 — Document Ingestion and Data Cleaning Pipeline Design
RAG quality starts with data, not the model. This post explains how to choose source documents, clean HTML/PDF/wiki data, attach metadata, and build a production-ready ingestion pipeline.
Series Roadmap
This is Part 1 of a deeper RAG development series.
- Part 1: Document ingestion and cleaning
- Part 2: Chunking and embeddings
- Part 3: Retrieval, hybrid search, and reranking
- Part 4: Answer generation and citations
- Part 5: Evaluation and operations
If you want strong RAG quality, start with the data pipeline, not with the vector database.
Why Data Comes First
Many teams start by choosing embeddings and vector stores. In practice, source quality shapes retrieval quality far more directly.
Common failure patterns:
- Old and new documents mixed together
- Menus, footers, and banners embedded with the body
- Titles disconnected from the main content
- Duplicate versions of the same content indexed separately
- Missing source and update metadata
No retrieval model can fully rescue bad source hygiene.
Define the Scope of Documents First
More documents are not automatically better.
Useful starting buckets:
Official Documentation
- Product guides
- API references
- Runbooks
- FAQs
Internal Operational Knowledge
- Incident response docs
- On-call runbooks
- Deployment checklists
- Architecture decisions
Unstructured Content
- PDFs
- Meeting notes
- Slide decks
- Email summaries
For early versions, official documentation is usually the safest place to start.
Collection Strategies by Source Type
Web Docs
Static documentation sites are often collected through HTML extraction.
The key is extracting the real body, not the surrounding layout.
```python
from bs4 import BeautifulSoup

def extract_main_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop layout chrome before extracting text
    for tag in soup.select("nav, footer, header, script, style, aside"):
        tag.decompose()
    # Prefer a dedicated content container; fall back to the whole page
    main = soup.select_one("main, article, .content, .docs-content")
    return main.get_text("\n", strip=True) if main else soup.get_text("\n", strip=True)
```
Wikis and Knowledge Platforms
APIs are preferable whenever they are available.
Useful metadata to save:
- Document ID
- Source URL
- Author
- Last updated time
- Space or category
PDFs
PDF ingestion is usually noisy:
- Weird line breaks
- Broken tables
- Repeated headers and footers
- Page numbers mixed into body text
PDF text should almost never be embedded without cleaning.
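One cheap cleaning pass, sketched below, drops lines that repeat across most pages (headers and footers tend to) and bare page numbers. This assumes you already have per-page text from whatever extractor you use; the function name and the 60% repetition threshold are illustrative choices, not a standard.

```python
import re
from collections import Counter

def clean_pdf_pages(pages: list[str]) -> str:
    """Remove lines repeated on most pages (headers/footers) and bare page numbers."""
    line_counts: Counter[str] = Counter()
    for page in pages:
        # Count each distinct line once per page
        for line in {l.strip() for l in page.splitlines()}:
            line_counts[line] += 1
    # A line seen on ~60% of pages is treated as boilerplate (threshold is a tunable guess)
    threshold = max(2, int(len(pages) * 0.6))
    cleaned_pages = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            stripped = line.strip()
            if not stripped:
                continue
            if re.fullmatch(r"(page\s*)?\d{1,4}", stripped, re.IGNORECASE):
                continue  # bare page number
            if line_counts[stripped] >= threshold:
                continue  # repeated header/footer
            kept.append(stripped)
        cleaned_pages.append("\n".join(kept))
    return "\n\n".join(cleaned_pages)
```

The frequency-based approach is deliberately conservative: it never removes a line that appears on only one page, so unique body text survives even when the heuristic misfires.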
What the Cleaning Stage Should Do
Remove Repeated Noise
Typical noise includes:
- Global navigation
- Footer text
- Copyright notices
- Repeated banners
- Page numbers
Preserve Structure
Keep headings, subheadings, lists, tables, and code blocks whenever possible.
RAG depends on meaningful structure, not just raw text quantity.
Deduplicate
Duplicates often come from:
- Versioned pages
- Print views
- Alternate URLs
- Multi-language overlaps
Duplicate-heavy indexes often distort retrieval rankings.
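A minimal dedup pass can hash normalized text, so that print views and alternate URLs that differ only in whitespace or casing collapse to one record. This is a sketch assuming documents are dicts with a `text` key; near-duplicate detection (e.g. MinHash) is a separate, heavier step.

```python
import hashlib
import re

def normalize_for_dedup(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences don't hide duplicates
    return re.sub(r"\s+", " ", text).strip().lower()

def deduplicate(docs: list[dict]) -> list[dict]:
    """Keep the first document seen for each normalized-content hash."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(
            normalize_for_dedup(doc["text"]).encode("utf-8")
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Keeping the first occurrence means ingestion order matters; sorting candidates by freshness or source authority before deduplicating is a reasonable refinement.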
Metadata Design Is More Important Than Many Teams Expect
A useful metadata record might look like:
```json
{
  "doc_id": "api-auth-001",
  "source_url": "https://docs.example.com/api/auth",
  "title": "Authentication API Guide",
  "category": "api",
  "product": "console",
  "language": "ko",
  "updated_at": "2026-04-17T12:00:00Z",
  "visibility": "internal",
  "owner": "platform-team"
}
```
This enables:
- Category filters
- Freshness-aware retrieval
- Permission-based filtering
- Citation links
- Operational quality analysis
Which Documents Should Be Excluded?
Good candidates for exclusion:
- Extremely short documents with no real context
- Outdated policy documents
- Temporary notes with unclear ownership
- Duplicate FAQ copies
- Sensitive documents without access control
A smaller set of trusted documents beats sheer volume.
Versioning and Change Detection
If documents change, the RAG system must change with them.
Useful strategies:
- Compare updated_at timestamps
- Compute content hashes
- Run incremental re-indexing
- Mark deleted docs explicitly
```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```
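Those hashes enable a simple diff between the last indexed state and a fresh crawl. The sketch below repeats the content_hash helper so it runs standalone; the dict shapes (doc_id to hash, doc_id to text) are assumptions about your store, not a fixed interface.

```python
import hashlib

def content_hash(text: str) -> str:
    # Same helper as above, repeated so this snippet is self-contained
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def detect_changes(stored: dict[str, str], fetched: dict[str, str]) -> dict[str, list[str]]:
    """stored maps doc_id -> last indexed hash; fetched maps doc_id -> current text."""
    current = {doc_id: content_hash(text) for doc_id, text in fetched.items()}
    return {
        "added":   [d for d in current if d not in stored],
        "changed": [d for d in current if d in stored and stored[d] != current[d]],
        "deleted": [d for d in stored if d not in current],
    }
```

The "deleted" bucket is what lets you mark removed docs explicitly instead of leaving stale chunks in the index.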
A Practical Pipeline Shape
Source Fetcher
-> Raw Document Store
-> Cleaning / Normalization
-> Metadata Enrichment
-> Deduplication
-> Chunking Queue
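One lightweight way to realize that shape is to model each stage as a plain function over a list of documents, so any stage can be rerun from stored raw input. Everything below is an illustrative sketch; the stage names are hypothetical placeholders for the helpers discussed earlier.

```python
from typing import Callable

Doc = dict
Stage = Callable[[list[Doc]], list[Doc]]

def run_pipeline(docs: list[Doc], stages: list[Stage]) -> list[Doc]:
    # Raw input stays untouched; each stage returns a new list, so reruns are cheap
    for stage in stages:
        docs = stage(docs)
    return docs

# Illustrative stages (names are hypothetical)
def clean_stage(docs: list[Doc]) -> list[Doc]:
    return [{**d, "text": d["text"].strip()} for d in docs]

def enrich_stage(docs: list[Doc]) -> list[Doc]:
    return [{**d, "language": d.get("language", "en")} for d in docs]

result = run_pipeline([{"text": "  hello  "}], [clean_stage, enrich_stage])
```

Because stages share one interface, swapping the cleaning logic or inserting a dedup step is a one-line change to the stage list.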
Separating raw storage from cleaned output makes reprocessing much easier later.
Common Operational Issues
- Docs change but re-indexing is delayed
- Sensitive internal docs are accidentally ingested
- HTML layout changes break extraction
These are not edge cases. They happen regularly in production.
Closing Thoughts
The first real step in RAG development is not retrieval. It is document ingestion and cleaning.
If this layer is solid, every later stage becomes easier. If it is weak, every later optimization inherits that weakness.