TestForge Blog

RAG Development Part 1 — Document Ingestion and Data Cleaning Pipeline Design

RAG quality starts with data, not the model. This post explains how to choose source documents, clean HTML/PDF/wiki data, attach metadata, and build a production-ready ingestion pipeline.

TestForge Team

Series Roadmap

This is Part 1 of a deeper RAG development series.

  • Part 1: Document ingestion and cleaning
  • Part 2: Chunking and embeddings
  • Part 3: Retrieval, hybrid search, and reranking
  • Part 4: Answer generation and citations
  • Part 5: Evaluation and operations

If you want strong RAG quality, start with the data pipeline, not with the vector database.

Why Data Comes First

Many teams start by choosing embeddings and vector stores. In practice, source quality shapes retrieval quality far more directly.

Common failure patterns:

  • Old and new documents mixed together
  • Menus, footers, and banners embedded with the body
  • Titles disconnected from the main content
  • Duplicate versions of the same content indexed separately
  • Missing source and update metadata

No retrieval model can fully rescue bad source hygiene.

Define the Scope of Documents First

More documents are not automatically better.

Useful starting buckets:

Official Documentation

  • Product guides
  • API references
  • Runbooks
  • FAQs

Internal Operational Knowledge

  • Incident response docs
  • On-call runbooks
  • Deployment checklists
  • Architecture decisions

Unstructured Content

  • PDFs
  • Meeting notes
  • Slide decks
  • Email summaries

For early versions, official documentation is usually the safest place to start.

Collection Strategies by Source Type

Web Docs

Static documentation sites are often collected through HTML extraction.

The key is extracting the real body, not the surrounding layout.

from bs4 import BeautifulSoup

def extract_main_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop layout and non-content elements before extracting text.
    for tag in soup.select("nav, footer, header, script, style, aside"):
        tag.decompose()
    # Prefer a dedicated content container; fall back to the whole page.
    main = soup.select_one("main, article, .content, .docs-content")
    return main.get_text("\n", strip=True) if main else soup.get_text("\n", strip=True)

Wikis and Knowledge Platforms

APIs are preferable whenever they are available.

Useful metadata to save:

  • Document ID
  • Source URL
  • Author
  • Last updated time
  • Space or category
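One way to keep these fields together is a small record type per fetched page. This is a sketch; the field names are illustrative and not tied to any particular wiki API:

```python
from dataclasses import dataclass, asdict

@dataclass
class WikiDoc:
    """Metadata worth persisting alongside each fetched wiki page."""
    doc_id: str
    source_url: str
    author: str
    updated_at: str  # ISO 8601 timestamp reported by the wiki
    space: str       # space or category the page belongs to
    body: str

doc = WikiDoc(
    doc_id="KB-1042",
    source_url="https://wiki.example.com/KB-1042",
    author="jkim",
    updated_at="2026-04-01T09:30:00Z",
    space="platform",
    body="...",
)
```

Storing the record as a dict (via `asdict`) keeps it easy to serialize next to the raw body.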

PDFs

PDF ingestion is usually noisy:

  • Weird line breaks
  • Broken tables
  • Repeated headers and footers
  • Page numbers mixed into body text

PDF text should almost never be embedded without cleaning.
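As a starting point, some of this noise can be handled with plain heuristics over the already-extracted page texts. The sketch below is stdlib-only and its thresholds are guesses to tune: it drops lines that repeat across most pages (likely running headers/footers) and bare page numbers:

```python
import re
from collections import Counter

def clean_pdf_text(pages: list[str]) -> str:
    """Heuristic cleanup for text extracted page-by-page from a PDF."""
    # Lines that repeat on many pages are likely running headers/footers.
    line_counts = Counter(
        line.strip() for page in pages for line in page.splitlines() if line.strip()
    )
    repeated = {line for line, n in line_counts.items() if n >= max(2, len(pages) // 2)}

    cleaned_pages = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            s = line.strip()
            if not s or s in repeated:
                continue
            # Drop bare page numbers like "12" or "Page 12 / 30".
            if re.fullmatch(r"(page\s*)?\d+(\s*/\s*\d+)?", s, re.IGNORECASE):
                continue
            kept.append(s)
        cleaned_pages.append("\n".join(kept))
    return "\n\n".join(cleaned_pages)
```

Heuristics like these are no substitute for a proper extraction library, but they catch the most common artifacts cheaply.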

What the Cleaning Stage Should Do

Remove Repeated Noise

Typical noise includes:

  • Global navigation
  • Footer text
  • Copyright notices
  • Repeated banners
  • Page numbers

Preserve Structure

Keep headings, subheadings, lists, tables, and code blocks whenever possible.

RAG depends on meaningful structure, not just raw text quantity.

Deduplicate

Duplicates often come from:

  • Versioned pages
  • Print views
  • Alternate URLs
  • Multi-language overlaps

Duplicate-heavy indexes often distort retrieval rankings.
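A minimal dedup pass can hash a normalized copy of each body and keep the first document per digest. This is a sketch; real pipelines often also normalize URLs and handle near-duplicates:

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial variants hash identically.
    return " ".join(text.lower().split())

def dedupe(docs: list[dict]) -> list[dict]:
    """Keep the first document for each distinct normalized body."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["body"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```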

Metadata Design Is More Important Than Many Teams Expect

A useful metadata record might look like:

{
  "doc_id": "api-auth-001",
  "source_url": "https://docs.example.com/api/auth",
  "title": "Authentication API Guide",
  "category": "api",
  "product": "console",
  "language": "ko",
  "updated_at": "2026-04-17T12:00:00Z",
  "visibility": "internal",
  "owner": "platform-team"
}

This enables:

  • Category filters
  • Freshness-aware retrieval
  • Permission-based filtering
  • Citation links
  • Operational quality analysis
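For example, freshness- and permission-aware filtering over records like the sample above might look like this (field names follow the sample record; the age threshold is an arbitrary choice):

```python
from datetime import datetime, timedelta, timezone

def eligible(doc: dict, user_is_internal: bool, now: datetime,
             max_age_days: int = 365) -> bool:
    """Drop candidates the user may not see or that are too stale."""
    if doc["visibility"] == "internal" and not user_is_internal:
        return False
    updated = datetime.fromisoformat(doc["updated_at"].replace("Z", "+00:00"))
    return now - updated <= timedelta(days=max_age_days)
```

Running this filter before retrieval keeps sensitive and stale documents out of the candidate set instead of relying on the ranker to bury them.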

Which Documents Should Be Excluded?

Good candidates for exclusion:

  • Extremely short documents with no real context
  • Outdated policy documents
  • Temporary notes with unclear ownership
  • Duplicate FAQ copies
  • Sensitive documents without access control

Trusted information beats large volume.

Versioning and Change Detection

If documents change, the RAG system must change with them.

Useful strategies:

  • Compare updated_at
  • Compute content hashes
  • Run incremental re-indexing
  • Mark deleted docs explicitly

A content hash makes change detection independent of timestamps:

import hashlib

def content_hash(text: str) -> str:
    # Stable fingerprint of the cleaned document body.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
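Combining hashes with the strategies above, one possible change-detection step compares the hash map saved from the previous run against a fresh crawl (a sketch; the doc_id-to-hash maps are an assumed storage shape):

```python
def diff_index(stored: dict[str, str], fresh: dict[str, str]):
    """Compare doc_id -> content-hash maps from the last run and the current crawl."""
    changed = [d for d in fresh if d in stored and stored[d] != fresh[d]]
    added = [d for d in fresh if d not in stored]
    deleted = [d for d in stored if d not in fresh]  # mark explicitly, don't just drop
    return changed, added, deleted
```

Only the `changed` and `added` sets need re-embedding, which keeps incremental re-indexing cheap.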

A Practical Pipeline Shape

Source Fetcher
 -> Raw Document Store
 -> Cleaning / Normalization
 -> Metadata Enrichment
 -> Deduplication
 -> Chunking Queue

Separating raw storage from cleaned output makes reprocessing much easier later.
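The stages above can be wired together as plain callables; this is a deliberately minimal sketch, with each stage's implementation left open:

```python
def run_pipeline(fetch, clean, enrich, is_duplicate, emit):
    """Wire the stages together; each argument is one stage as a callable."""
    for raw in fetch():
        doc = enrich(clean(raw))   # normalization, then metadata enrichment
        if not is_duplicate(doc):
            emit(doc)              # hand off to the chunking queue
```

Because each stage is an independent callable, swapping the cleaner or re-running from raw storage does not touch the rest of the pipeline.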

Common Operational Issues

  • Docs change but re-indexing is delayed
  • Sensitive internal docs are accidentally ingested
  • HTML layout changes break extraction

These are not edge cases. They happen regularly in production.

Closing Thoughts

The first real step in RAG development is not retrieval. It is document ingestion and cleaning.

If this layer is solid, every later stage becomes easier. If it is weak, every later optimization inherits that weakness.