RAG Development Part 2 — Chunking and Embedding Strategy for Better Retrieval

Why Chunking Matters

When RAG quality is disappointing, many teams immediately blame the model. In practice, chunking is often the bigger factor.

Bad chunking causes:

Important context split across chunk boundaries
Retrieval that finds partial but unusable evidence
Oversized chunks with too much noise
Repetitive chunks dominating the top results

Chunking is not just splitting text. It is building semantic units that can be searched, and understood, on their own.

How to Think About Chunk Size

There is no universal perfect number, but good starting ranges exist:

General docs: 300 to 700 tokens
FAQ / policy snippets: 150 to 300 tokens
Long operational guides: 500 to 900 tokens
Code-heavy docs: keep explanations and code together

The right chunk is the smallest unit that can still answer a meaningful question.
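As a rough way to apply these starting ranges, the sketch below flags chunks that fall outside the range for their document type. The names `SIZE_RANGES`, `approx_tokens`, and `size_status` are made up for illustration, and the word count is only a stand-in for a real tokenizer, so treat the numbers as approximate.

```python
# Starting ranges from the list above, in tokens. These are starting points, not rules.
SIZE_RANGES = {
    "general": (300, 700),
    "faq": (150, 300),
    "ops_guide": (500, 900),
}


def approx_tokens(text: str) -> int:
    """Rough estimate: whitespace-separated words (an assumption, not a real tokenizer)."""
    return len(text.split())


def size_status(text: str, doc_type: str = "general") -> str:
    """Return 'too_small', 'ok', or 'too_large' for the given document type."""
    low, high = SIZE_RANGES[doc_type]
    n = approx_tokens(text)
    if n < low:
        return "too_small"
    if n > high:
        return "too_large"
    return "ok"
```

A check like this is most useful as a report over the whole index: a long tail of "too_small" chunks usually points at a splitter that is cutting on the wrong boundary.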

Chunking Strategies to Avoid

Fixed Character Splits

def bad_chunk(text: str, size: int = 1000) -> list[str]:
    # Slices every `size` characters with no regard for sentences, headings, or code.
    return [text[i:i+size] for i in range(0, len(text), size)]

This often cuts headings away from bodies and breaks sentences in awkward places.

Ignoring Document Structure

Technical docs often rely on sections like “Installation”, “Authentication”, or “Troubleshooting”. If those boundaries disappear, retrieval quality usually drops.

A Better Chunking Sequence

In practice, a good sequence is:

Parse structure
Split by sections
Split only oversized sections further
Add overlap
Inherit headings and metadata

Example:

def build_chunk(title: str, section_title: str, body: str) -> str:
    # Prepend the document title and section heading so the chunk is self-describing.
    return f"# {title}\n## {section_title}\n{body}"

That way, each chunk still carries enough context on its own.
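The whole sequence can be sketched in a few functions. This is a minimal version under some loud assumptions: the input is Markdown-like text whose sections start with `## ` headings, sizes are counted in whitespace-separated words rather than real tokens, and headings inside code fences are not handled. The function names are illustrative only.

```python
import re


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Steps 1-2: parse '## ' headings and return (section_title, body) pairs."""
    sections = []
    title, lines = "Introduction", []  # default title for text before the first heading
    for line in markdown.splitlines():
        match = re.match(r"^##\s+(.*)", line)
        if match:
            if any(l.strip() for l in lines):
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        sections.append((title, "\n".join(lines).strip()))
    return sections


def split_oversized(body: str, max_words: int, overlap: int) -> list[str]:
    """Steps 3-4: leave small sections whole; window large ones with word overlap."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = body.split()
    if len(words) <= max_words:
        return [body]
    step = max_words - overlap
    # Note: windowing by words flattens line breaks inside oversized sections.
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words) - overlap, step)]


def chunk_document(title: str, markdown: str, max_words: int = 500, overlap: int = 50) -> list[str]:
    """Step 5: every chunk inherits the document title and its section heading."""
    chunks = []
    for section_title, body in split_sections(markdown):
        for piece in split_oversized(body, max_words, overlap):
            chunks.append(f"# {title}\n## {section_title}\n{piece}")
    return chunks
```

The key design choice is that only oversized sections get windowed. Sections that already fit are kept exactly as the author wrote them.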

Why Overlap Helps

Overlap reduces the chance that an important statement falls exactly on a chunk boundary.

Example:

Chunk A
- access token expiration returns 401
- refresh token can be used if still valid

Chunk B
- refresh token can be used if still valid
- new access token is returned after refresh

Too much overlap, however, increases duplication and biases ranking, because near-identical chunks can crowd other relevant results out of the top positions.
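The Chunk A / Chunk B example can be reproduced with a small sliding window over statements. This is only a sketch: `overlapping_windows` is an illustrative helper, and it assumes the text has already been split into statements or sentences.

```python
def overlapping_windows(items: list[str], size: int, overlap: int) -> list[list[str]]:
    """Group items into windows of `size`, each sharing `overlap` items with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    if len(items) <= size:
        return [items[:]] if items else []
    step = size - overlap
    # Stop before a window that would contain only already-covered items.
    return [items[i:i + size] for i in range(0, len(items) - overlap, step)]


statements = [
    "access token expiration returns 401",
    "refresh token can be used if still valid",
    "new access token is returned after refresh",
]
chunks = overlapping_windows(statements, size=2, overlap=1)
# chunks[0] and chunks[1] both contain the refresh-token statement.
```

Because the overlap is an explicit parameter, it is easy to include in the same experiments as chunk size instead of fixing it by habit.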

Code Blocks and Tables Need Special Care

Code Blocks

Do not split code in the middle
Keep explanations with examples
Preserve function or config boundaries
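One way to honor these rules is to treat each fenced code block as an atomic unit and glue it to the paragraph that introduces it. The sketch below assumes Markdown-style triple-backtick fences and counts size in words. `split_blocks` and `pack_blocks` are illustrative names, not part of any library.

```python
FENCE = "`" * 3  # Markdown code fence marker


def split_blocks(markdown: str) -> list[str]:
    """Split text into paragraphs, keeping each fenced code block as one unit."""
    blocks, current, in_code = [], [], False

    def flush() -> None:
        if current:
            blocks.append("\n".join(current))
            current.clear()

    for line in markdown.splitlines():
        is_fence = line.strip().startswith(FENCE)
        if is_fence and not in_code:
            flush()               # the paragraph before the code ends here
            current.append(line)
            in_code = True
        elif is_fence and in_code:
            current.append(line)
            flush()               # closing fence completes the code block
            in_code = False
        elif in_code:
            current.append(line)  # blank lines inside code are preserved
        elif line.strip():
            current.append(line)
        else:
            flush()               # blank line ends a paragraph
    flush()
    return blocks


def pack_blocks(blocks: list[str], max_words: int) -> list[str]:
    """Greedily pack blocks into chunks without ever splitting a code block."""
    units: list[str] = []
    for block in blocks:
        if block.lstrip().startswith(FENCE) and units:
            units[-1] += "\n\n" + block  # keep the example with its explanation
        else:
            units.append(block)
    chunks, current, count = [], [], 0
    for unit in units:
        n = len(unit.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(unit)  # an oversized unit becomes its own chunk, unsplit
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A unit larger than `max_words` is deliberately left whole here. An oversized chunk that still contains working code is usually better evidence than two halves of a function.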

Tables

Preserve header names
Avoid flattening them into meaningless text
Sometimes store an additional sentence-form summary

In technical RAG systems, these structures are often critical evidence.
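For tables, a simple approach is to parse each row together with its header names and store a sentence-form version alongside the original. The sketch below assumes Markdown pipe tables. The helper names and the sample error-code rows in the usage note are made up for illustration.

```python
def parse_table(table: str) -> list[dict[str, str]]:
    """Parse a Markdown pipe table into one dict per row, keyed by header name."""
    lines = [line.strip() for line in table.strip().splitlines() if line.strip()]

    def cells(line: str) -> list[str]:
        return [cell.strip() for cell in line.strip("|").split("|")]

    headers = cells(lines[0])
    rows = []
    for line in lines[1:]:
        values = cells(line)
        if all(set(value) <= set("-: ") for value in values):
            continue  # skip the |---|---| separator row
        rows.append(dict(zip(headers, values)))
    return rows


def row_sentences(rows: list[dict[str, str]]) -> list[str]:
    """Turn each row into a sentence-like string that repeats the header names."""
    return ["; ".join(f"{key}: {value}" for key, value in row.items()) + "." for row in rows]
```

For a hypothetical table with the headers `Code` and `Meaning`, a row such as `| 401 | Access token expired |` becomes "Code: 401; Meaning: Access token expired.", which keeps the header names attached to every value even when the row is retrieved on its own.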

Choosing an Embedding Model

Useful evaluation criteria:

Language quality
Domain fit
Cost
Query latency
Vector size
Re-indexing overhead

What matters most is a repeatable evaluation process, not simply choosing the largest model.
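A repeatable process can be as small as one function that takes any embedding callable and reports the same numbers every time. The sketch below is an assumption about what such a harness might look like: `evaluate_embedder` and `toy_embed` are illustrative, and the toy embedder is a word-count stand-in so the example runs without a model. In practice you would pass in a wrapper around the model under test.

```python
import math
import time
from typing import Callable

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def evaluate_embedder(
    embed: Callable[[str], Vector],
    corpus: dict[str, str],            # chunk_id -> chunk text
    queries: list[tuple[str, str]],    # (question, expected chunk_id)
    k: int = 3,
) -> dict[str, float]:
    """Report recall@k, average query time, and vector size for one embedder."""
    doc_vectors = {chunk_id: embed(text) for chunk_id, text in corpus.items()}
    hits = 0
    started = time.perf_counter()
    for question, expected_id in queries:
        query_vector = embed(question)
        ranked = sorted(
            doc_vectors,
            key=lambda chunk_id: cosine(query_vector, doc_vectors[chunk_id]),
            reverse=True,
        )
        hits += expected_id in ranked[:k]
    elapsed = time.perf_counter() - started
    return {
        "recall_at_k": hits / len(queries),
        "avg_query_seconds": elapsed / len(queries),
        "vector_size": len(next(iter(doc_vectors.values()))),
    }


# Toy stand-in for a real model: counts of a few fixed vocabulary words.
VOCAB = ["token", "refresh", "install", "error", "login"]


def toy_embed(text: str) -> Vector:
    words = text.lower().split()
    return [float(words.count(word)) for word in VOCAB]
```

Running the same corpus and question set through each candidate makes language quality, latency, and vector size directly comparable, which is harder to do from model cards alone.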

Store Chunk Metadata Too

Chunk-level metadata supports filtering and citations.

{
  "doc_id": "auth-guide",
  "chunk_id": "auth-guide-12",
  "title": "Authentication API Guide",
  "section": "Token Refresh",
  "language": "ko",
  "updated_at": "2026-04-17T12:00:00Z"
}

Without section-level metadata, it becomes much harder to cite the source of an answer or to explain why a particular chunk was retrieved.
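To show how this metadata gets used, here is a small sketch of filtering and citation formatting over the record above. The helper names are illustrative. A real vector store would normally apply such filters on its side, with its own query syntax.

```python
from datetime import datetime, timezone

meta = {
    "doc_id": "auth-guide",
    "chunk_id": "auth-guide-12",
    "title": "Authentication API Guide",
    "section": "Token Refresh",
    "language": "ko",
    "updated_at": "2026-04-17T12:00:00Z",
}


def matches(metadata: dict, **filters: str) -> bool:
    """True when every requested field has exactly the requested value."""
    return all(metadata.get(key) == value for key, value in filters.items())


def updated_since(metadata: dict, cutoff: datetime) -> bool:
    """Freshness filter; `cutoff` must be timezone-aware."""
    # Replace the trailing 'Z' so older Python versions can parse the timestamp.
    stamp = datetime.fromisoformat(metadata["updated_at"].replace("Z", "+00:00"))
    return stamp >= cutoff


def citation(metadata: dict) -> str:
    """Human-readable source line to show next to a generated answer."""
    return f'{metadata["title"]} > {metadata["section"]} ({metadata["chunk_id"]})'


is_korean_auth = matches(meta, doc_id="auth-guide", language="ko")
is_fresh = updated_since(meta, datetime(2026, 1, 1, tzinfo=timezone.utc))
source_line = citation(meta)
```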

Indexing Strategy

A typical vector record includes:

id
text
embedding
metadata

Operationally, it is also helpful to store:

doc_id
chunk_id
content_hash
embedding_version

That makes re-indexing and rollback much easier.
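A minimal sketch of such a record, and of the check that decides whether a chunk needs to be embedded again, might look like this. The class and function names are assumptions for illustration, and the version label "v1" is a placeholder for whatever scheme you use to identify the model and preprocessing.

```python
import hashlib
from dataclasses import dataclass, field


def hash_text(text: str) -> str:
    """Stable fingerprint of the chunk text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class VectorRecord:
    doc_id: str
    chunk_id: str               # doubles as the record id in this sketch
    text: str
    embedding: list[float]
    embedding_version: str      # e.g. model name plus preprocessing version
    metadata: dict = field(default_factory=dict)
    content_hash: str = ""

    def __post_init__(self) -> None:
        if not self.content_hash:
            self.content_hash = hash_text(self.text)


def needs_reembedding(record: VectorRecord, new_text: str, current_version: str) -> bool:
    """Re-embed when the text changed or the embedding setup changed."""
    return (
        record.content_hash != hash_text(new_text)
        or record.embedding_version != current_version
    )
```

With the hash in place, a re-index run can skip unchanged chunks. With the version in place, old and new vectors can coexist during a migration, and rollback means switching the version you query.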

Document-Type-Specific Chunking

API Docs

Endpoint-level chunks
Keep request/response examples nearby
Preserve error code sections

Incident Guides

Symptoms
Root causes
Diagnosis steps
Recovery procedures

FAQ

One question + one answer per unit
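As an illustration of the one-question-one-answer rule, the sketch below splits a FAQ into pairs. It assumes a hypothetical plain-text layout where questions start with `Q:` and answers with `A:`. Real FAQ sources vary, so the parsing would need to be adapted.

```python
def split_faq(text: str) -> list[dict[str, str]]:
    """Return one {'question', 'answer'} unit per Q/A pair."""
    pairs: list[dict[str, str]] = []
    question: list[str] = []
    answer: list[str] = []
    target = None  # the list that continuation lines are appended to

    def flush() -> None:
        if question:
            pairs.append({"question": " ".join(question), "answer": " ".join(answer)})

    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Q:"):
            flush()  # the previous pair is complete
            question, answer = [stripped[2:].strip()], []
            target = question
        elif stripped.startswith("A:") and question:
            answer.append(stripped[2:].strip())
            target = answer
        elif stripped and target is not None:
            target.append(stripped)  # continuation of the current question or answer
    flush()
    return pairs
```

Each pair can then be embedded as a single unit, so a match on the question always brings its answer along.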

Code Docs

Keep code examples with the explanation
Avoid making import lines standalone chunks

Common Mistakes

Chunks too small
Chunks too large
Missing headings in chunk text
No embedding version management

These are some of the most common causes of unstable retrieval.

How to Evaluate Chunking Choices

Do not pick a chunking strategy by intuition.

Test variants such as:

300 tokens / overlap 30
500 tokens / overlap 50
Section-based chunking
Section-based chunking with heading augmentation

Compare how often the top-3 or top-5 results contain the needed evidence, using real questions.
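A comparison like this can be scripted so every variant is scored the same way. The sketch below is deliberately simplified. Chunk IDs differ between variants, so relevance is judged by whether a retrieved chunk contains an expected evidence phrase, and a keyword-overlap scorer stands in for the real retriever. Both choices, and all function names, are assumptions for illustration.

```python
def keyword_score(question: str, chunk: str) -> int:
    """Stand-in retriever score: number of shared lowercase words."""
    return len(set(question.lower().split()) & set(chunk.lower().split()))


def hit_rate_at_k(chunks: list[str], eval_set: list[tuple[str, str]], k: int = 3) -> float:
    """Fraction of questions whose top-k chunks contain the expected evidence phrase."""
    hits = 0
    for question, evidence in eval_set:
        top = sorted(chunks, key=lambda chunk: keyword_score(question, chunk), reverse=True)[:k]
        hits += any(evidence.lower() in chunk.lower() for chunk in top)
    return hits / len(eval_set)


def compare_variants(
    variants: dict[str, list[str]],     # variant name -> chunks produced by that strategy
    eval_set: list[tuple[str, str]],    # (real question, phrase the evidence must contain)
    k: int = 3,
) -> dict[str, float]:
    return {name: hit_rate_at_k(chunks, eval_set, k) for name, chunks in variants.items()}
```

Swapping `keyword_score` for a call into the actual retrieval stack keeps the rest of the harness unchanged, so the same question set can be reused whenever the chunking or embedding setup changes.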

Closing Thoughts

Chunking and embeddings set the baseline for retrieval quality.

A strong baseline usually means:

Preserving structure
Keeping chunk size balanced
Carrying headings and metadata forward
Comparing strategies with an evaluation set

That is what makes later retrieval and generation work much better.