Local RAG System

Overview

Retrieval-Augmented Generation (RAG) has become the dominant architectural pattern for grounding Large Language Models (LLMs) in proprietary or domain-specific knowledge. Instead of relying solely on model weights, RAG combines semantic retrieval with generation, improving accuracy, reducing hallucinations, and enabling explainability through citations.

In this article, I’ll walk through the design and implementation of a fully local RAG system using:

  • FastAPI for the API layer
  • PostgreSQL + pgvector as the vector store
  • LangChain for orchestration
  • Ollama + Mistral for local LLM inference
  • Ollama embeddings for vectorization

The system runs entirely offline, requires no external APIs, and is suitable for air-gapped or privacy-sensitive environments.

Architecture

At a high level, the system consists of five stages:

  1. Ingestion – Load documents from local files or URLs
  2. Chunking – Split content into semantically meaningful text chunks
  3. Vectorization – Convert chunks into embeddings
  4. Retrieval – Perform semantic search using pgvector
  5. Generation – Generate grounded answers using a local LLM

Documents
   ↓
Chunking
   ↓
Embeddings (Ollama)
   ↓
PostgreSQL (pgvector)
   ↓
Top-K Semantic Retrieval
   ↓
LLM (Mistral via Ollama)

Why Fully Local?

Many RAG examples rely on cloud LLMs and hosted vector databases. While convenient, that approach introduces:

  • Data privacy concerns
  • Ongoing inference costs
  • Latency variability
  • Vendor lock-in

This project demonstrates that production-grade RAG pipelines can run locally, with predictable performance and full data ownership.

Technology Stack

API Layer – FastAPI

FastAPI provides:

  • Clear request/response schemas
  • Async support
  • Automatic OpenAPI documentation

Endpoints:

  • POST /ask – RAG-powered question answering
  • POST /agent – Tool-based agent execution
  • GET /health – Service health check
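
Below is a minimal sketch of how these routes might be wired together. The request/response models and the answer_question stub are illustrative assumptions on my part, not the repository's exact code, and the /agent route is omitted for brevity.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Local RAG API")

class AskRequest(BaseModel):
    question: str

class AskResponse(BaseModel):
    answer: str
    citations: list[str]

def answer_question(question: str) -> AskResponse:
    # Placeholder for the retrieval + generation pipeline described below
    raise NotImplementedError

@app.post("/ask", response_model=AskResponse)
async def ask(req: AskRequest) -> AskResponse:
    return answer_question(req.question)

@app.get("/health")
async def health() -> dict:
    # Simple liveness check for the service
    return {"status": "ok"}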

Vector Store – PostgreSQL + pgvector

Instead of a managed vector database, we use pgvector, which allows vector similarity search directly inside PostgreSQL.

Key design choices:

  • Cosine similarity
  • IVFFLAT index for scalable approximate search
  • Fixed vector dimension (768) matching the embedding model

CREATE TABLE embeddings (
  chunk_id TEXT PRIMARY KEY,
  embedding vector(768) NOT NULL
);

CREATE INDEX idx_embeddings_ivfflat
ON embeddings USING ivfflat (embedding vector_cosine_ops);

This keeps operational complexity low while still supporting efficient semantic search.
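
As a sketch of the write path, embeddings can be stored using psycopg2 and pgvector's textual vector literal. The connection string, placeholder data, and upsert behaviour below are my assumptions, not the project's exact ingestion code.

import psycopg2

def to_vector_literal(vec: list[float]) -> str:
    # pgvector accepts the textual form "[v1,v2,...]"
    return "[" + ",".join(str(v) for v in vec) + "]"

# Placeholder connection details and data
conn = psycopg2.connect("dbname=rag user=rag host=localhost")
chunk_vectors = []  # pairs of (chunk_id, embedding) produced by the embedding step

with conn, conn.cursor() as cur:
    for chunk_id, vec in chunk_vectors:
        cur.execute(
            "INSERT INTO embeddings (chunk_id, embedding) "
            "VALUES (%s, %s::vector) "
            "ON CONFLICT (chunk_id) DO UPDATE SET embedding = EXCLUDED.embedding",
            (chunk_id, to_vector_literal(vec)),
        )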

Embeddings – Ollama

Embeddings are generated locally using Ollama’s embedding models (e.g. nomic-embed-text).

Benefits:

  • No Python ML stack (no torch, CUDA, etc.)
  • Consistent API with generation models
  • Easy model swapping

from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434"
)
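
Continuing the snippet above (and assuming nomic-embed-text has been pulled into Ollama), documents and queries are embedded with the standard LangChain methods; nomic-embed-text produces 768-dimensional vectors, matching the table definition earlier. The chunk texts below are placeholders.

chunk_texts = ["example chunk one", "example chunk two"]  # placeholder chunks
chunk_embeddings = embeddings.embed_documents(chunk_texts)
query_embedding = embeddings.embed_query("How does authentication work?")

print(len(query_embedding))  # 768 for nomic-embed-text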

LLM – Mistral via Ollama

For generation, the system uses Mistral served locally through Ollama.

from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="mistral",
    base_url="http://localhost:11434",
    temperature=0.2,
)

This allows:

  • Consistent, low-latency responses (the low temperature keeps outputs stable)
  • Full offline operation
  • Easy experimentation with other models

RAG Pipeline Implementation

Chunking Strategy

We use recursive character-based chunking with overlap to preserve semantic continuity.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=900,
    chunk_overlap=150
)

This balances retrieval accuracy with token efficiency.
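
Usage is a single call on the splitter defined above; the input text here is a placeholder for whatever the ingestion stage loaded.

raw_text = "..."  # placeholder: document text loaded during ingestion
chunks = splitter.split_text(raw_text)  # list of overlapping text chunks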

Retrieval

At query time:

  1. The question is embedded
  2. A top-K cosine similarity search is executed in Postgres
  3. Matching chunks are assembled into a context window

Each retrieved chunk is tracked for citation and explainability.
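
A sketch of this flow is shown below, reusing the embeddings object, conn connection, and to_vector_literal helper from the earlier snippets; the exact SQL and the default K value are illustrative.

def retrieve_top_k(question: str, k: int = 4) -> list[tuple[str, float]]:
    # 1. Embed the question
    query_vec = to_vector_literal(embeddings.embed_query(question))
    # 2. Top-K cosine search in Postgres (<=> is pgvector's cosine distance operator)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT chunk_id, 1 - (embedding <=> %s::vector) AS similarity "
            "FROM embeddings "
            "ORDER BY embedding <=> %s::vector "
            "LIMIT %s",
            (query_vec, query_vec, k),
        )
        return cur.fetchall()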

Prompting & Grounding

The system prompt enforces grounding:

“Answer using ONLY the provided context. If the context is insufficient, say so.”

This dramatically reduces hallucinations and ensures traceable outputs.
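
One way to express this instruction with LangChain's prompt templates, chained to the ChatOllama instance defined earlier, is sketched below; the variable names and placeholder inputs are illustrative.

from langchain_core.prompts import ChatPromptTemplate

grounded_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Answer using ONLY the provided context. "
     "If the context is insufficient, say so."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

# Pipe the grounding prompt into the local Mistral model
rag_chain = grounded_prompt | llm
answer = rag_chain.invoke({
    "context": "<retrieved chunks joined here>",
    "question": "<user question>",
})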

Agent Support

In addition to standard RAG, the system includes a lightweight agent abstraction using LangChain tools. This demonstrates:

  • Tool calling
  • Multi-step reasoning
  • Separation between retrieval and orchestration logic

This is particularly useful for workflows that extend beyond Q&A (e.g. summarization, validation, enrichment).
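
As a hedged sketch of what that tool layer can look like, the snippet below uses LangChain's @tool decorator and tool binding; the search_documents tool and its wiring are illustrative, not the repository's exact agent implementation.

from langchain_core.tools import tool

@tool
def search_documents(query: str) -> str:
    """Semantic search over the local document store."""
    hits = retrieve_top_k(query)  # retrieval function sketched above
    return "\n".join(chunk_id for chunk_id, _ in hits)

# Let the local model decide when to call the tool
llm_with_tools = llm.bind_tools([search_documents])
response = llm_with_tools.invoke("Summarise how authentication is configured")
print(response.tool_calls)  # tool invocations requested by the model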

Testing the System

Example query:

curl http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question":"How does authentication work in the platform?"}'

The response includes:

  • A synthesized answer
  • Supporting citations
  • Confidence scores from vector similarity

Lessons Learned

Key takeaways from building this system:

  • Local-first RAG is practical with modern tooling
  • pgvector is sufficient for many real-world workloads
  • Ollama removes most ML operational friction
  • LangChain’s modularization requires careful dependency management
  • Fixed embedding dimensions must match DB schema exactly

Future Enhancements

Potential next steps:

  • MMR or hybrid (BM25 + vector) retrieval
  • Cross-encoder reranking
  • Streaming token responses
  • Evaluation harness (precision@k, faithfulness)
  • Multi-tenant document isolation

Conclusion

This project demonstrates a production-aligned, privacy-preserving RAG architecture using open-source tools. It shows how modern LLM systems can be built with:

  • Clear separation of concerns
  • Strong grounding guarantees
  • Minimal infrastructure overhead
  • Full local control

For engineers evaluating RAG architectures, this approach provides a solid baseline that scales from laptop demos to controlled production environments.

Sample Repository

https://github.com/romaan/rag-showcase