How to Build a RAG Pipeline from Scratch
Step-by-step guide to building a RAG pipeline. Covers ingestion, chunking, embeddings, vector search, and LLM generation with real trade-offs.
Retrieval-Augmented Generation (RAG) is how you make LLMs answer questions about your data instead of hallucinating. It's the most common pattern in production AI, and also the most misunderstood.
This guide walks through every step, with the trade-offs nobody talks about.
What RAG Actually Does
Instead of asking an LLM to answer from memory (which leads to hallucinations), RAG retrieves relevant documents first, then feeds them as context to the LLM. The model generates an answer grounded in your actual data.
User question
→ Step 1: Embed the question into a vector
→ Step 2: Search your vector store for similar document chunks
→ Step 3: Feed retrieved chunks + question to an LLM
→ Step 4: LLM generates an answer grounded in the retrieved context
Output: answer + source citations
Simple concept. The devil is in each step.
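The four-step flow above can be sketched in a few lines. Everything here is a hypothetical stand-in: `embed`, `vector_store`, and `llm` are placeholders for whatever embedding model, vector database, and LLM client you wire in.

```python
def answer_question(question, embed, vector_store, llm, k=5):
    """Minimal RAG loop: embed, retrieve, assemble context, generate."""
    query_vec = embed(question)                       # Step 1: embed the question
    chunks = vector_store.search(query_vec, top_k=k)  # Step 2: similarity search
    context = "\n\n".join(c["text"] for c in chunks)  # Step 3: build the context block
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    answer = llm(prompt)                              # Step 4: grounded generation
    return {"answer": answer, "sources": [c["source"] for c in chunks]}
```

Each of the injectable pieces is covered in the steps below; the rest of this guide is about making each one not fail.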
Step 1: Document Ingestion
Before anything, you need your data in a searchable format.
Common sources: PDFs, web pages, Notion docs, Google Docs, Confluence, databases, Slack messages.
Key decision: Extract text or use OCR?
- Text-based PDFs → direct extraction (fast, accurate)
- Scanned documents or images → OCR (slower, error-prone)
- Mixed documents → detect and route accordingly
Watch out for: Tables (most extractors mangle them), headers/footers (pollute search results), and multi-column layouts (break reading order).
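One common routing heuristic, sketched below under the assumption that a scanned PDF's text layer yields little or no text: try direct extraction first, and fall back to OCR when the average characters per page is suspiciously low. `extract_text` and `run_ocr` are hypothetical hooks for your extraction backends.

```python
def route_extraction(pdf_path, extract_text, run_ocr, min_chars_per_page=100):
    """Route a PDF to direct text extraction or OCR.

    `extract_text` and `run_ocr` are placeholders for real backends
    (e.g. a PDF text-layer reader and an OCR engine).
    """
    pages = extract_text(pdf_path)  # list of per-page text strings
    avg_chars = sum(len(p) for p in pages) / max(len(pages), 1)
    if avg_chars >= min_chars_per_page:
        return pages               # text-based PDF: fast, accurate path
    return run_ocr(pdf_path)       # likely scanned: slower OCR fallback
```

The threshold is a tunable guess, not a standard; mixed documents may need per-page routing rather than per-file.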
Step 2: Chunking
You can't feed entire documents to an LLM because they're too long and most of the content is irrelevant to any given question. Chunking splits documents into smaller, searchable pieces.
Common strategies:
| Strategy | Chunk size | Best for |
|----------|-----------|----------|
| Fixed-size | 256-512 tokens | General purpose, simple |
| Sentence-based | 3-5 sentences | Conversational content |
| Paragraph-based | Natural breaks | Well-structured documents |
| Semantic | Varies | Documents with clear topic shifts |
| Recursive | 512 with 50 overlap | Technical documentation |
The critical trade-off: Smaller chunks = more precise retrieval but less context per chunk. Larger chunks = more context but noisier retrieval. Most production systems land at 512 tokens with 50-token overlap.
Overlap matters. Without overlap, a relevant sentence split across two chunks becomes invisible to search. 10-15% overlap catches these edge cases.
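Fixed-size chunking with overlap is simple enough to show in full. This sketch assumes you've already tokenized the document into a list; the overlap means each chunk repeats the tail of the previous one, so a sentence straddling a boundary still appears whole somewhere.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size chunks with overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk already covers the end of the document
    return chunks
```

With the defaults, a 1,000-token document yields three chunks, and the second chunk starts 50 tokens before the first one ends.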
Step 3: Embeddings
Convert each chunk into a numerical vector that captures its semantic meaning. Similar chunks produce similar vectors.
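"Similar vectors" is almost always measured with cosine similarity: the cosine of the angle between two vectors, ignoring their magnitudes. A minimal implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: 1.0 for
    identical direction, 0.0 for orthogonal (unrelated) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

In practice you'd use NumPy or let the vector store compute this, but the math is exactly this.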
Model choices in 2026:
| Model | Dimensions | Quality | Speed | Cost |
|-------|-----------|---------|-------|------|
| OpenAI text-embedding-3-large | 3072 | Excellent | Fast | $0.13/M tokens |
| OpenAI text-embedding-3-small | 1536 | Good | Fast | $0.02/M tokens |
| Cohere embed-v3 | 1024 | Excellent | Fast | $0.10/M tokens |
| BGE-large (self-hosted) | 1024 | Good | Medium | Free (compute cost) |
| Nomic Embed (self-hosted) | 768 | Good | Fast | Free (compute cost) |
Key decision: Cloud API or self-hosted?
- Cloud APIs are simpler but cost scales with volume
- Self-hosted is free per-query but requires GPU infrastructure
- Hybrid: self-hosted for bulk ingestion, cloud API for real-time queries
Step 4: Vector Storage
Store embeddings in a database optimized for similarity search.
Popular options:
| Store | Type | Best for |
|-------|------|----------|
| pgvector (PostgreSQL) | Extension | Teams already on Postgres (no new infra) |
| Pinecone | Managed | Teams that want zero ops |
| Weaviate | Self-hosted/managed | Advanced filtering + hybrid search |
| Qdrant | Self-hosted/managed | High performance, Rust-based |
| ChromaDB | Embedded | Prototyping and small datasets |
Our recommendation: pgvector. If you're already running PostgreSQL (and most teams are), adding pgvector means no new infrastructure, no new vendor, and your vectors live next to your relational data. For most production workloads under 10M vectors, pgvector with HNSW indexing performs well.
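Conceptually, a vector store is just "top-k by similarity." The toy in-memory version below makes that concrete; it is a teaching sketch, not production code. pgvector performs the same comparison, but an HNSW index lets it skip the exhaustive scan this brute-force version does.

```python
import math

class ToyVectorStore:
    """In-memory stand-in for a vector store: exact top-k search
    by cosine similarity over every stored row (brute force)."""

    def __init__(self):
        self.rows = []  # list of (doc_id, vector, text)

    def add(self, doc_id, vector, text):
        self.rows.append((doc_id, vector, text))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))

    def search(self, query, top_k=5):
        scored = [(self._cosine(query, vec), doc_id, text)
                  for doc_id, vec, text in self.rows]
        scored.sort(reverse=True)  # highest similarity first
        return [(doc_id, text, score) for score, doc_id, text in scored[:top_k]]
```

Swapping this for pgvector, Pinecone, or Qdrant changes the scaling story, not the interface.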
Step 5: Retrieval (The Hard Part)
This is where most RAG pipelines fail. Pure vector search has blind spots.
The problem with vector-only search:
- Semantic search finds conceptually similar content but misses exact terms
- "What is the API rate limit?" might retrieve docs about "usage quotas" but miss the page that literally says "rate limit: 100 req/s"
The solution: Hybrid search. Combine vector similarity (semantic) with BM25 keyword matching (lexical). This catches both conceptual matches and exact term matches.
Score = α × cosine_similarity + (1 - α) × BM25_score
Where α typically ranges from 0.5 to 0.7 depending on your data.
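One wrinkle the formula hides: cosine similarity is bounded while BM25 scores are not, so the BM25 side needs normalizing before the blend is meaningful. A minimal sketch, assuming you already have per-document scores from both retrievers:

```python
def hybrid_scores(cosine_scores, bm25_scores, alpha=0.6):
    """Blend semantic and lexical scores per document.

    BM25 scores are unbounded, so min-max normalize them into [0, 1]
    before applying: score = alpha * cosine + (1 - alpha) * bm25_norm.
    """
    lo, hi = min(bm25_scores.values()), max(bm25_scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
    return {
        doc: alpha * cosine_scores[doc]
             + (1 - alpha) * (bm25_scores[doc] - lo) / span
        for doc in cosine_scores
    }
```

With this blend, a document that scores poorly on semantic similarity can still win on an exact keyword match, which is exactly the "rate limit: 100 req/s" case above.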
Retrieval count matters. Retrieve too few chunks (k=2) and you miss relevant context. Retrieve too many (k=20) and you dilute the signal with noise. k=5 to k=8 works well for most use cases. Some systems use a two-stage approach: retrieve k=20 with vector search, then rerank to top 5 using a cross-encoder.
Step 6: LLM Generation
Feed the retrieved chunks as context to an LLM with a carefully constructed prompt.
Prompt structure:
System: You are a helpful assistant. Answer questions based ONLY
on the provided context. If the context doesn't contain the answer,
say "I don't have enough information to answer that."
Context:
{retrieved_chunks}
User: {original_question}
Key decisions:
- Model selection. GPT-4 for complex reasoning, Llama-3/Mistral for speed and cost. Route based on query complexity.
- Temperature. Use 0 or 0.1 for factual Q&A. Higher temperatures increase creativity but also hallucination risk.
- Citation. Ask the model to cite which chunks it used. This makes answers verifiable and builds user trust.
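Putting the prompt structure and the citation decision together, here is one way to assemble the messages. The chunk numbering scheme (`[1]`, `[2]`, ...) is an assumption of this sketch, not a requirement; any stable labeling the model can echo back works.

```python
def build_rag_prompt(chunks, question):
    """Assemble system/user messages for grounded generation.
    Chunks are numbered so the model can cite them as [1], [2], ..."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    system = (
        "You are a helpful assistant. Answer questions based ONLY on the "
        "provided context. If the context doesn't contain the answer, say "
        '"I don\'t have enough information to answer that." '
        "Cite the context chunks you used, e.g. [1]."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

The returned list matches the chat-message shape most LLM APIs accept; pair it with temperature 0 for factual Q&A.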
Common RAG Failures (and Fixes)
| Failure | Cause | Fix |
|---------|-------|-----|
| Hallucination despite RAG | LLM ignores context | Stronger system prompt + lower temperature |
| Irrelevant retrieval | Chunks too large or too small | Adjust chunk size + add overlap |
| Missing exact matches | Vector-only search | Add BM25 hybrid search |
| Slow responses | Large context windows | Reduce k, use reranking |
| Outdated answers | Stale embeddings | Incremental re-indexing pipeline |
Building RAG in Sinapsis AI
Sinapsis AI has the entire RAG pipeline built in:
- Document ingestion with automatic format detection (PDF, web, docs)
- Smart chunking with configurable strategies and overlap
- Hybrid search combining pgvector cosine similarity with BM25 keyword matching
- Per-step cost and latency tracking so you know exactly where time and money go
- User analytics showing which questions get useful answers and which don't
Instead of stitching together 5 different libraries and managing the infrastructure yourself, you get a production-ready RAG pipeline with observability from day one.
The best RAG pipeline is the one you can actually monitor and improve.