Large Language Models like GPT-4, Claude, and Llama are remarkably capable, but they share a fundamental limitation: their knowledge is frozen at the time of training. Ask a model about your company's internal policies, last quarter's sales data, or a recently published research paper, and it will either hallucinate an answer or politely decline. This is where Retrieval-Augmented Generation — commonly known as RAG — changes the game.

At StrikingWeb, we have been building RAG-based applications for clients across industries, from intelligent customer support systems to internal knowledge management platforms. This guide distills what we have learned about designing, implementing, and optimizing RAG architectures in production.

What Is RAG and Why Does It Matter?

Retrieval-Augmented Generation is an architecture pattern that enhances LLM responses by retrieving relevant information from external knowledge sources before generating an answer. Instead of relying solely on the model's pre-trained knowledge, RAG injects context from your own data — documents, databases, APIs, or any structured or unstructured source — directly into the prompt.

The concept was introduced in a 2020 research paper by Facebook AI Research (now Meta AI), and by 2023 it had become the de facto standard for building practical AI applications. The reasons are straightforward:

- Freshness: the knowledge base can be updated at any time without retraining the model.
- Grounding: answers are anchored in retrieved documents, which sharply reduces hallucination.
- Traceability: responses can cite the sources they drew from, making answers auditable.
- Cost: updating an index is far cheaper than fine-tuning or retraining a model on new data.

The RAG Architecture: How It Works

A RAG system has two primary phases: the indexing phase (offline) and the retrieval-generation phase (online).

Phase 1: Indexing (Offline)

Before your RAG system can retrieve relevant information, you need to prepare your knowledge base. This involves several steps:

Document Loading: Ingest your source documents — PDFs, web pages, databases, Confluence wikis, Notion pages, or any other data source. Tools like LangChain and LlamaIndex provide connectors for dozens of data sources.

Chunking: Split documents into smaller, semantically meaningful chunks. This is one of the most consequential decisions in RAG design. Chunks that are too large dilute the relevance signal; chunks that are too small lose context. Common strategies include:

- Fixed-size chunking with a sliding overlap, the simplest baseline
- Recursive splitting that respects paragraph, sentence, and heading boundaries
- Structure-aware chunking by section, clause, or conversation turn, matched to the document type
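As a concrete baseline, fixed-size chunking with overlap can be sketched in a few lines (the chunk size and overlap values here are illustrative, not recommendations):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlapping windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap ensures that a sentence cut at a chunk boundary still appears intact in the neighboring chunk, at the cost of some index redundancy.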

Embedding: Convert each chunk into a numerical vector (embedding) using an embedding model. Popular choices include OpenAI's text-embedding-ada-002, Cohere's Embed, and open-source models like BGE and E5. The embedding captures the semantic meaning of the text, allowing mathematical comparison of similarity.
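The "mathematical comparison of similarity" is usually cosine similarity between embedding vectors. A minimal sketch, using toy three-dimensional vectors in place of the hundreds of dimensions a real embedding model produces:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- invented values purely to illustrate the comparison
refund_policy = [0.9, 0.1, 0.0]
refund_question = [0.8, 0.2, 0.1]
cafeteria_menu = [0.0, 0.1, 0.9]
```

A query about refunds scores far higher against the refund policy than against the cafeteria menu, which is exactly the signal retrieval relies on.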

Storage: Store these embeddings in a vector database optimized for similarity search. This is where your knowledge base lives.

Phase 2: Retrieval and Generation (Online)

When a user asks a question, the following happens in real time:

1. Query embedding: the question is converted into a vector using the same embedding model used during indexing.
2. Retrieval: the vector database returns the top-k chunks most similar to the query.
3. Augmentation: the retrieved chunks are inserted into the prompt as context, alongside the user's question.
4. Generation: the LLM produces an answer grounded in that context.
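The online phase can be sketched end to end with a toy in-memory index. The tiny fixed vocabulary and the prompt template are illustrative assumptions; a real system would use a proper embedding model, a vector database, and an LLM call where the prompt is returned here:

```python
import math

# Toy vocabulary -- a stand-in for a real embedding model's learned representation
VOCAB = ["employees", "days", "leave", "cafeteria", "opens", "requests", "hr"]

def embed(text: str) -> list[float]:
    """Map text to a unit vector of vocabulary term counts (illustrative only)."""
    tokens = [w.strip(".,?!") for w in text.lower().split()]
    vec = [float(tokens.count(term)) for term in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query: str, index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: -sum(a * b for a, b in zip(q, item[1])))
    return [text for text, _ in ranked[:k]]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Inject the retrieved chunks into the prompt; the LLM call itself is omitted."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```

Everything downstream of `retrieve` is ordinary prompt engineering: the model never sees the whole knowledge base, only the handful of chunks the retriever selected.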

Choosing a Vector Database

The vector database is the backbone of any RAG system. We have evaluated several options, and each has its strengths:

Pinecone: A fully managed vector database that excels at simplicity and scalability. It is our go-to choice for client projects where the team does not want to manage infrastructure. Pinecone handles indexing, scaling, and high availability automatically.

Weaviate: An open-source vector database with built-in hybrid search (combining vector similarity with keyword matching). Weaviate's module system lets you integrate embedding models directly into the database, reducing architectural complexity.

Qdrant: Another strong open-source option, written in Rust, that offers excellent performance and a clean API. Qdrant supports filtering alongside vector search, which is essential when you need to scope results by metadata (e.g., only search documents from a specific department).

pgvector: If you are already running PostgreSQL, pgvector adds vector similarity search as an extension. It is the simplest path for teams that want to avoid introducing a new database into their stack. Performance is adequate for moderate-scale applications (up to a few million vectors).

ChromaDB: A lightweight, developer-friendly option that is excellent for prototyping and smaller applications. ChromaDB runs in-process and requires minimal setup.

Optimizing RAG Quality

Getting a basic RAG system working is straightforward. Getting it to produce consistently high-quality results requires careful optimization across several dimensions.

Chunking Strategy Matters More Than You Think

We have seen RAG quality improve by 30 to 40 percent just by switching from naive fixed-size chunking to a strategy that respects document structure. For technical documentation, we use recursive splitting that preserves heading hierarchies. For conversational data like support tickets, we chunk by conversation turns. For legal documents, we chunk by clause or section.
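A minimal sketch of structure-aware splitting for markdown-style documentation, keeping each heading together with its body text (production splitters, such as those in LangChain, handle nesting and size limits far more robustly):

```python
def chunk_by_headings(document: str) -> list[str]:
    """Split a markdown document into one chunk per heading section."""
    chunks, current = [], []
    for line in document.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```

Because each chunk starts with its heading, the heading text itself contributes to the embedding, which is often a strong relevance signal.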

Hybrid Search

Pure vector similarity search sometimes misses exact keyword matches that are critical. For example, searching for a specific product SKU or error code works poorly with embeddings alone. Hybrid search combines semantic vector search with traditional BM25 keyword search, and the results are weighted and merged. In our testing, hybrid search consistently outperforms pure vector search for business applications where exact terms matter.
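One common way to merge the two result sets is a weighted sum of normalized scores (reciprocal rank fusion is another). A sketch, assuming both retrievers return a score per document id; the weighting and the example scores are illustrative:

```python
def hybrid_scores(vector_scores: dict[str, float],
                  keyword_scores: dict[str, float],
                  alpha: float = 0.5) -> list[tuple[str, float]]:
    """Merge vector and keyword scores per document id.

    alpha weights the semantic side: 1.0 = pure vector, 0.0 = pure keyword.
    Both score sets are min-max normalized first so their scales are comparable.
    """
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    vec, kw = normalize(vector_scores), normalize(keyword_scores)
    merged = {doc: alpha * vec.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
              for doc in set(vec) | set(kw)}
    return sorted(merged.items(), key=lambda item: -item[1])
```

With a keyword-leaning alpha, a document containing an exact SKU match can outrank semantically similar but literally wrong results, which is precisely the failure mode hybrid search addresses.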

Reranking

After the initial retrieval step returns the top 20 or so results, a reranking model (like Cohere Rerank or a cross-encoder model) can re-score those results for more precise relevance. Reranking adds latency (typically 100 to 200 milliseconds) but meaningfully improves the quality of the final context provided to the LLM.
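The reranking step itself is a simple re-sort; the intelligence lives in the scoring function. In the sketch below, `term_overlap` is a deliberately crude stand-in for a real cross-encoder such as Cohere Rerank, which would score each (query, passage) pair jointly:

```python
def rerank(query: str, candidates: list[str], score_fn, top_n: int = 5) -> list[str]:
    """Re-score retrieved candidates with score_fn(query, passage), keep the best top_n."""
    return sorted(candidates, key=lambda passage: -score_fn(query, passage))[:top_n]

def term_overlap(query: str, passage: str) -> float:
    # Toy stand-in for a cross-encoder: fraction of query terms found in the passage.
    terms = set(query.lower().split())
    hits = sum(1 for term in terms if term in passage.lower())
    return hits / len(terms)
```

The two-stage shape is the key design choice: a cheap retriever narrows millions of chunks to twenty or so, and the expensive, more accurate scorer only ever sees those twenty.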

Metadata Filtering

Not all chunks are created equal. Adding metadata to your chunks — source document, date, author, department, document type — and filtering at retrieval time dramatically improves relevance. If a user asks about the 2023 leave policy, filtering by document type and date before performing similarity search avoids retrieving outdated 2021 policies.
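In practice the filter is applied inside the vector database, but the logic is simple enough to sketch in isolation (the `Chunk` type and metadata keys here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    metadata: dict

def filter_chunks(chunks: list[Chunk], **criteria) -> list[Chunk]:
    """Keep only chunks whose metadata matches every given key/value pair."""
    return [c for c in chunks
            if all(c.metadata.get(key) == value for key, value in criteria.items())]
```

Running similarity search only over the filtered subset is what prevents the 2021 policy from ever competing with the 2023 one.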

"The difference between a good RAG system and a great one is not the LLM you choose — it is the quality of your retrieval pipeline."

Common Pitfalls and How to Avoid Them

In our experience, a handful of failure modes account for most underperforming RAG systems:

- Chunking without regard for document structure, which dilutes the relevance signal
- Relying on vector search alone when queries contain exact identifiers such as SKUs or error codes; hybrid search fixes this
- Skipping metadata, which makes it impossible to scope retrieval or exclude outdated documents
- Overstuffing the prompt with retrieved context, which buries the passage that actually matters
- Evaluating only the final answer instead of measuring retrieval quality separately, which makes failures hard to diagnose

A Practical Tech Stack for RAG

Based on our production experience, here is the stack we recommend for most RAG applications:

- Orchestration: LangChain or LlamaIndex for document loading, chunking, and pipeline glue
- Embeddings: OpenAI's text-embedding-ada-002 for hosted setups, or an open-source model like BGE when self-hosting
- Vector database: Pinecone when you want a managed service; Qdrant or pgvector when self-hosting
- Reranking: Cohere Rerank or an open-source cross-encoder
- Hybrid search: enabled wherever the chosen database supports it

Looking Ahead

RAG is evolving rapidly. Agentic RAG systems that can decide when and what to retrieve, multi-modal RAG that handles images and tables alongside text, and graph-based RAG that leverages knowledge graphs for more structured reasoning are all emerging patterns we are actively exploring.

If you are looking to build an AI application that leverages your organization's proprietary knowledge, RAG is almost certainly the right starting point. Our AI team at StrikingWeb has built RAG systems across industries and would be glad to help you design and implement one that delivers real business value.
