TestForge Blog

RAG Development Part 2 — Chunking and Embedding Strategy for Better Retrieval

Chunking and embeddings define the floor of retrieval quality. This post covers chunk size, overlap, heading preservation, code block handling, embedding model selection, and indexing strategy.

TestForge Team

Why Chunking Matters

When RAG quality is disappointing, many teams immediately blame the model. In practice, chunking is often the bigger factor.

Bad chunking causes:

  • Important context split across chunk boundaries
  • Retrieval that finds partial but unusable evidence
  • Oversized chunks with too much noise
  • Repetitive chunks dominating the top results

Chunking is not text splitting. It is building searchable semantic units.

How to Think About Chunk Size

There is no universal perfect number, but good starting ranges exist:

  • General docs: 300 to 700 tokens
  • FAQ / policy snippets: 150 to 300 tokens
  • Long operational guides: 500 to 900 tokens
  • Code-heavy docs: keep explanations and code together

The right chunk is the smallest unit that can still answer a meaningful question.
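A simple budget check makes these ranges concrete. The sketch below approximates token count by whitespace words, which is only a rough proxy; a production pipeline should count with the embedding model's own tokenizer. The 300-to-700 default mirrors the general-docs range above.

```python
def approx_tokens(text: str) -> int:
    # Rough proxy: whitespace-separated words. Real systems should use
    # the embedding model's actual tokenizer for accurate counts.
    return len(text.split())

def within_budget(text: str, min_tokens: int = 300, max_tokens: int = 700) -> bool:
    # Check a chunk against the general-docs starting range.
    n = approx_tokens(text)
    return min_tokens <= n <= max_tokens
```

A chunk that fails this check is a candidate for merging (too small) or further splitting (too large), not an automatic error.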

Chunking Strategies to Avoid

Fixed Character Splits

def bad_chunk(text: str, size: int = 1000):
    # Splits at fixed character offsets, ignoring sentence and heading boundaries.
    return [text[i:i+size] for i in range(0, len(text), size)]

This often cuts headings away from bodies and breaks sentences in awkward places.

Ignoring Document Structure

Technical docs often rely on sections like “Installation”, “Authentication”, or “Troubleshooting”. If those boundaries disappear, retrieval quality usually drops.

A Better Chunking Sequence

In practice, a good sequence is:

  1. Parse structure
  2. Split by sections
  3. Split only oversized sections further
  4. Add overlap
  5. Inherit headings and metadata

Example:

def build_chunk(title: str, section_title: str, body: str) -> str:
    return f"# {title}\n## {section_title}\n{body}"

That way, each chunk still carries enough context on its own.
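The full sequence can be sketched end to end. This is a minimal illustration, not a production parser: it assumes markdown-style `## ` section headings and uses a word count as a stand-in for real token counting.

```python
import re

def split_sections(doc: str):
    # Steps 1-2: parse "## " headings and split the document by section.
    sections, title, buf = [], "Introduction", []
    for line in doc.splitlines():
        m = re.match(r"##\s+(.*)", line)
        if m:
            if buf:
                sections.append((title, "\n".join(buf).strip()))
            title, buf = m.group(1), []
        else:
            buf.append(line)
    if buf:
        sections.append((title, "\n".join(buf).strip()))
    return sections

def chunk_document(doc_title: str, doc: str, max_words: int = 120):
    # Step 3: split only sections that exceed the budget.
    # Step 5: prepend the inherited headings to every chunk.
    chunks = []
    for section_title, body in split_sections(doc):
        words = body.split()
        parts = [" ".join(words[i:i + max_words])
                 for i in range(0, len(words), max_words)] or [""]
        for part in parts:
            chunks.append(f"# {doc_title}\n## {section_title}\n{part}")
    return chunks
```

Short sections stay whole; only oversized ones are split further, and every resulting chunk still names its document and section.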

Why Overlap Helps

Overlap reduces the chance that an important statement falls exactly on a chunk boundary.

Example:

Chunk A
- access token expiration returns 401
- refresh token can be used if still valid

Chunk B
- refresh token can be used if still valid
- new access token is returned after refresh

Too much overlap, however, increases duplication and ranking bias.
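A sliding window over sentences captures this idea. In this sketch, each chunk repeats the last `overlap` sentences of the previous one, so a statement that falls on a boundary still appears whole in at least one chunk; the sentence list itself is assumed to come from an upstream splitter.

```python
def chunk_with_overlap(sentences: list, size: int = 4, overlap: int = 1) -> list:
    # Slide a window of `size` sentences forward by `size - overlap`,
    # so consecutive chunks share `overlap` sentences.
    step = size - overlap
    chunks = []
    for i in range(0, len(sentences), step):
        chunks.append(sentences[i:i + size])
        if i + size >= len(sentences):
            break
    return chunks
```

Raising `overlap` toward `size` is exactly the duplication trap noted above: near-identical chunks start crowding the top ranks.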

Code Blocks and Tables Need Special Care

Code Blocks

  • Do not split code in the middle
  • Keep explanations with examples
  • Preserve function or config boundaries

Tables

  • Preserve header names
  • Avoid flattening them into meaningless text
  • Sometimes store an additional sentence-form summary

In technical RAG systems, these structures are often critical evidence.
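For code blocks specifically, the rule "do not split code in the middle" can be enforced at split time. The sketch below splits markdown on blank lines but treats ``` fenced blocks as atomic and glues each one to the paragraph before it; table handling would need a similar pass.

```python
def split_keep_code(markdown: str) -> list:
    # Split on blank lines, but never inside a ``` fence, and attach
    # each fenced block to the preceding paragraph (its explanation).
    units, buf, in_code = [], [], False
    for line in markdown.splitlines():
        if line.startswith("```"):
            in_code = not in_code
            buf.append(line)
            continue
        if line.strip() == "" and not in_code:
            if buf:
                units.append("\n".join(buf))
                buf = []
        else:
            buf.append(line)
    if buf:
        units.append("\n".join(buf))
    # Merge any unit that is a bare code block into the unit before it.
    merged = []
    for u in units:
        if u.startswith("```") and merged:
            merged[-1] += "\n" + u
        else:
            merged.append(u)
    return merged
```

The payoff is that a retrieved chunk containing an example also contains the prose that says what the example does.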

Choosing an Embedding Model

Useful evaluation criteria:

  • Language quality
  • Domain fit
  • Cost
  • Query latency
  • Vector size
  • Re-indexing overhead

What matters most is a repeatable evaluation process, not simply choosing the largest model.

Store Chunk Metadata Too

Chunk-level metadata supports filtering and citations.

{
  "doc_id": "auth-guide",
  "chunk_id": "auth-guide-12",
  "title": "Authentication API Guide",
  "section": "Token Refresh",
  "language": "ko",
  "updated_at": "2026-04-17T12:00:00Z"
}

Without section-level metadata, it is much harder to explain later why a chunk was retrieved or to cite where an answer came from.
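One concrete use of this metadata is pre-filtering before vector search. A minimal sketch, assuming each record carries a `metadata` dict shaped like the JSON above:

```python
def filter_chunks(chunks: list, **conditions) -> list:
    # Keep only chunks whose metadata matches every given key/value,
    # e.g. restrict search to one language or one document.
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in conditions.items())]
```

Most vector databases expose an equivalent metadata filter natively; the point is that the fields have to be stored at chunk time for any of this to work.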

Indexing Strategy

A typical vector record includes:

  • id
  • text
  • embedding
  • metadata

Operationally, it is also helpful to store:

  • doc_id
  • chunk_id
  • content_hash
  • embedding_version

That makes re-indexing and rollback much easier.
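The operational fields earn their keep during re-indexing. A sketch of the check, where `EMBEDDING_VERSION` is a hypothetical model identifier you would set to whatever model and version you actually deploy:

```python
import hashlib

EMBEDDING_VERSION = "text-embed-v1"  # hypothetical model/version tag

def content_hash(text: str) -> str:
    # Stable fingerprint of the chunk text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reembedding(record: dict, new_text: str) -> bool:
    # Re-embed only when the text changed or the embedding model changed.
    return (record.get("content_hash") != content_hash(new_text)
            or record.get("embedding_version") != EMBEDDING_VERSION)
```

Unchanged chunks are skipped on re-index, and bumping `EMBEDDING_VERSION` forces a clean rebuild without guessing which records are stale.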

Document-Type-Specific Chunking

API Docs

  • Endpoint-level chunks
  • Keep request/response examples nearby
  • Preserve error code sections

Incident Guides

  • Symptoms
  • Root causes
  • Diagnosis steps
  • Recovery procedures

FAQ

  • One question + one answer per unit

Code Docs

  • Keep code examples with the explanation
  • Avoid making import lines standalone chunks

Common Mistakes

  • Chunks too small
  • Chunks too large
  • Missing headings in chunk text
  • No embedding version management

These are some of the most common causes of unstable retrieval.

How to Evaluate Chunking Choices

Do not pick a chunking strategy by intuition.

Test variants such as:

  • 300 tokens / overlap 30
  • 500 tokens / overlap 50
  • Section-based chunking
  • Section-based chunking with heading augmentation

Compare top-3 or top-5 retrieval quality against real questions.
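That comparison reduces to a single metric per variant. The sketch below computes recall@k from ranked results, assuming you have labeled which chunk answers each real question:

```python
def recall_at_k(results: dict, relevant: dict, k: int = 5) -> float:
    # results:  question -> ranked chunk ids from one chunking variant
    # relevant: question -> the chunk id labeled as answering it
    hits = sum(1 for q, gold in relevant.items()
               if gold in results.get(q, [])[:k])
    return hits / len(relevant)
```

Run the same labeled question set against each variant (300/30, 500/50, section-based, and so on) and keep the one with the best top-3 or top-5 score.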

Closing Thoughts

Chunking and embeddings set the baseline for retrieval quality.

A strong baseline usually means:

  • Preserving structure
  • Keeping chunk size balanced
  • Carrying headings and metadata forward
  • Comparing strategies with an evaluation set

That is what makes later retrieval and generation work much better.