TestForge Blog

RAG Development Part 1 — Document Ingestion and Data Cleaning Pipeline Design

RAG quality starts with data, not the model. This post explains how to choose source documents, clean HTML/PDF/wiki data, attach metadata, and build a production-ready ingestion pipeline.

TestForge Team

Series Roadmap

This is Part 1 of a deeper RAG development series.

  • Part 1: Document ingestion and cleaning
  • Part 2: Chunking and embeddings
  • Part 3: Retrieval, hybrid search, and reranking
  • Part 4: Answer generation and citations
  • Part 5: Evaluation and operations

If you want strong RAG quality, start with the data pipeline, not with the vector database.

Why Data Comes First

Many teams start by choosing embeddings and vector stores. In practice, source quality shapes retrieval quality far more directly.

Common failure patterns:

  • Old and new documents mixed together
  • Menus, footers, and banners embedded with the body
  • Titles disconnected from the main content
  • Duplicate versions of the same content indexed separately
  • Missing source and update metadata

No retrieval model can fully rescue bad source hygiene.

Define the Scope of Documents First

More documents are not automatically better.

Useful starting buckets:

Official Documentation

  • Product guides
  • API references
  • Runbooks
  • FAQs

Internal Operational Knowledge

  • Incident response docs
  • On-call runbooks
  • Deployment checklists
  • Architecture decisions

Unstructured Content

  • PDFs
  • Meeting notes
  • Slide decks
  • Email summaries

For early versions, official documentation is usually the safest place to start.

Collection Strategies by Source Type

Web Docs

Static documentation sites are often collected through HTML extraction.

The key is extracting the real body, not the surrounding layout.

from bs4 import BeautifulSoup

def extract_main_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop layout and non-content elements before extracting text.
    for tag in soup.select("nav, footer, header, script, style, aside"):
        tag.decompose()
    # Prefer a dedicated content container; fall back to the whole page.
    main = soup.select_one("main, article, .content, .docs-content")
    return main.get_text("\n", strip=True) if main else soup.get_text("\n", strip=True)

Wikis and Knowledge Platforms

APIs are preferable whenever they are available.

Useful metadata to save:

  • Document ID
  • Source URL
  • Author
  • Last updated time
  • Space or category
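One way to keep these fields together is a small record type per fetched page. This is a sketch; the field names are illustrative and not tied to any particular wiki API:

```python
from dataclasses import dataclass, asdict

@dataclass
class WikiDoc:
    """Metadata worth persisting alongside each fetched wiki page."""
    doc_id: str
    source_url: str
    author: str
    updated_at: str  # ISO 8601 timestamp reported by the wiki
    space: str       # space or category the page belongs to
    body: str

doc = WikiDoc(
    doc_id="KB-1042",
    source_url="https://wiki.example.com/KB-1042",
    author="jkim",
    updated_at="2026-04-01T09:30:00Z",
    space="platform",
    body="...",
)
```

Storing the record as a dict (via `asdict`) keeps it easy to serialize next to the raw body.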

PDFs

PDF ingestion is usually noisy:

  • Weird line breaks
  • Broken tables
  • Repeated headers and footers
  • Page numbers mixed into body text

PDF text should almost never be embedded without cleaning.
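As a starting point, some of this noise can be handled with plain heuristics over the already-extracted page texts. The sketch below is stdlib-only and its thresholds are guesses to tune: it drops lines that repeat across most pages (likely running headers/footers) and bare page numbers:

```python
import re
from collections import Counter

def clean_pdf_text(pages: list[str]) -> str:
    """Heuristic cleanup for text extracted page-by-page from a PDF."""
    # Lines that repeat on many pages are likely running headers/footers.
    line_counts = Counter(
        line.strip() for page in pages for line in page.splitlines() if line.strip()
    )
    repeated = {line for line, n in line_counts.items() if n >= max(2, len(pages) // 2)}

    cleaned_pages = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            s = line.strip()
            if not s or s in repeated:
                continue
            # Drop bare page numbers like "12" or "Page 12 / 30".
            if re.fullmatch(r"(page\s*)?\d+(\s*/\s*\d+)?", s, re.IGNORECASE):
                continue
            kept.append(s)
        cleaned_pages.append("\n".join(kept))
    return "\n\n".join(cleaned_pages)
```

Heuristics like these are no substitute for a proper extraction library, but they catch the most common artifacts cheaply.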

What the Cleaning Stage Should Do

Remove Repeated Noise

Typical noise includes:

  • Global navigation
  • Footer text
  • Copyright notices
  • Repeated banners
  • Page numbers

Preserve Structure

Keep headings, subheadings, lists, tables, and code blocks whenever possible.

RAG depends on meaningful structure, not just raw text quantity.

Deduplicate

Duplicates often come from:

  • Versioned pages
  • Print views
  • Alternate URLs
  • Multi-language overlaps

Duplicate-heavy indexes often distort retrieval rankings.
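A minimal dedup pass can hash a normalized copy of each body and keep the first document per digest. This is a sketch; real pipelines often also normalize URLs and handle near-duplicates:

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial variants hash identically.
    return " ".join(text.lower().split())

def dedupe(docs: list[dict]) -> list[dict]:
    """Keep the first document for each distinct normalized body."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["body"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```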

Metadata Design Is More Important Than Many Teams Expect

A useful metadata record might look like:

{
  "doc_id": "api-auth-001",
  "source_url": "https://docs.example.com/api/auth",
  "title": "Authentication API Guide",
  "category": "api",
  "product": "console",
  "language": "ko",
  "updated_at": "2026-04-17T12:00:00Z",
  "visibility": "internal",
  "owner": "platform-team"
}

This enables:

  • Category filters
  • Freshness-aware retrieval
  • Permission-based filtering
  • Citation links
  • Operational quality analysis
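For example, freshness- and permission-aware filtering over records like the sample above might look like this (field names follow the sample record; the age threshold is an arbitrary choice):

```python
from datetime import datetime, timedelta, timezone

def eligible(doc: dict, user_is_internal: bool, now: datetime,
             max_age_days: int = 365) -> bool:
    """Drop candidates the user may not see or that are too stale."""
    if doc["visibility"] == "internal" and not user_is_internal:
        return False
    updated = datetime.fromisoformat(doc["updated_at"].replace("Z", "+00:00"))
    return now - updated <= timedelta(days=max_age_days)
```

Running this filter before retrieval keeps sensitive and stale documents out of the candidate set instead of relying on the ranker to bury them.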

Which Documents Should Be Excluded?

Good candidates for exclusion:

  • Extremely short documents with no real context
  • Outdated policy documents
  • Temporary notes with unclear ownership
  • Duplicate FAQ copies
  • Sensitive documents without access control

Trusted information beats large volume.

Versioning and Change Detection

If documents change, the RAG system must change with them.

Useful strategies:

  • Compare updated_at
  • Compute content hashes
  • Run incremental re-indexing
  • Mark deleted docs explicitly

A content hash makes change detection independent of timestamps:

import hashlib

def content_hash(text: str) -> str:
    # Stable fingerprint of the cleaned document body.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
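Combining hashes with the strategies above, one possible change-detection step compares the hash map saved from the previous run against a fresh crawl (a sketch; the doc_id-to-hash maps are an assumed storage shape):

```python
def diff_index(stored: dict[str, str], fresh: dict[str, str]):
    """Compare doc_id -> content-hash maps from the last run and the current crawl."""
    changed = [d for d in fresh if d in stored and stored[d] != fresh[d]]
    added = [d for d in fresh if d not in stored]
    deleted = [d for d in stored if d not in fresh]  # mark explicitly, don't just drop
    return changed, added, deleted
```

Only the `changed` and `added` sets need re-embedding, which keeps incremental re-indexing cheap.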

A Practical Pipeline Shape

Source Fetcher
 -> Raw Document Store
 -> Cleaning / Normalization
 -> Metadata Enrichment
 -> Deduplication
 -> Chunking Queue

Separating raw storage from cleaned output makes reprocessing much easier later.
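The stages above can be wired together as plain callables; this is a deliberately minimal sketch, with each stage's implementation left open:

```python
def run_pipeline(fetch, clean, enrich, is_duplicate, emit):
    """Wire the stages together; each argument is one stage as a callable."""
    for raw in fetch():
        doc = enrich(clean(raw))   # normalization, then metadata enrichment
        if not is_duplicate(doc):
            emit(doc)              # hand off to the chunking queue
```

Because each stage is an independent callable, swapping the cleaner or re-running from raw storage does not touch the rest of the pipeline.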

Common Operational Issues

  • Docs change but re-indexing is delayed
  • Sensitive internal docs are accidentally ingested
  • HTML layout changes break extraction

These are not edge cases. They happen regularly in production.

Closing Thoughts

The first real step in RAG development is not retrieval. It is document ingestion and cleaning.

If this layer is solid, every later stage becomes easier. If it is weak, every later optimization inherits that weakness.