Local RAG System

Overview

Retrieval-Augmented Generation (RAG) has become the dominant architectural pattern for grounding Large Language Models (LLMs) in proprietary or domain-specific knowledge. Instead of relying solely on model weights, RAG combines semantic retrieval with generation, improving accuracy, reducing hallucinations, and enabling explainability through citations.

In this article, I’ll walk through the design and implementation of a fully local RAG system using:

  • FastAPI for the API layer
  • PostgreSQL + pgvector as the vector store
  • LangChain for orchestration
  • Ollama + Mistral for local LLM inference
  • Ollama embeddings for vectorization

The system runs entirely offline, requires no external APIs, and is suitable for air-gapped or privacy-sensitive environments.

Architecture

At a high level, the system consists of five stages:

  1. Ingestion – Load documents from local files or URLs
  2. Chunking – Split content into semantically meaningful text chunks
  3. Vectorization – Convert chunks into embeddings
  4. Retrieval – Perform semantic search using pgvector
  5. Generation – Generate grounded answers using a local LLM

Documents
   ↓
Chunking
   ↓
Embeddings (Ollama)
   ↓
PostgreSQL (pgvector)
   ↓
Top-K Semantic Retrieval
   ↓
LLM (Mistral via Ollama)

Why Fully Local?

Many RAG examples rely on cloud LLMs and hosted vector databases. While convenient, that approach introduces:

  • Data privacy concerns
  • Ongoing inference costs
  • Latency variability
  • Vendor lock-in

This project demonstrates that production-grade RAG pipelines can run locally, with predictable performance and full data ownership.

Technology Stack

API Layer – FastAPI

FastAPI provides:

  • Clear request/response schemas
  • Async support
  • Automatic OpenAPI documentation

Endpoints:

  • POST /ask – RAG-powered question answering
  • POST /agent – Tool-based agent execution
  • GET /health – Service health check
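
Below is a minimal sketch of how these routes might be wired together. The request/response models and the answer_question stub are illustrative assumptions on my part, not the repository's exact code, and the /agent route is omitted for brevity.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Local RAG API")

class AskRequest(BaseModel):
    question: str

class AskResponse(BaseModel):
    answer: str
    citations: list[str]

def answer_question(question: str) -> AskResponse:
    # Placeholder for the retrieval + generation pipeline described below
    raise NotImplementedError

@app.post("/ask", response_model=AskResponse)
async def ask(req: AskRequest) -> AskResponse:
    return answer_question(req.question)

@app.get("/health")
async def health() -> dict:
    # Simple liveness check for the service
    return {"status": "ok"}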

Vector Store – PostgreSQL + pgvector

Instead of a managed vector database, we use pgvector, which allows vector similarity search directly inside PostgreSQL.

Key design choices:

  • Cosine similarity
  • IVFFLAT index for scalable approximate search
  • Fixed vector dimension (768) matching the embedding model

CREATE TABLE embeddings (
  chunk_id TEXT PRIMARY KEY,
  embedding vector(768) NOT NULL
);

CREATE INDEX idx_embeddings_ivfflat
ON embeddings USING ivfflat (embedding vector_cosine_ops);

This keeps operational complexity low while still supporting efficient semantic search.
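
As a sketch of the write path, embeddings can be stored using psycopg2 and pgvector's textual vector literal. The connection string, placeholder data, and upsert behaviour below are my assumptions, not the project's exact ingestion code.

import psycopg2

def to_vector_literal(vec: list[float]) -> str:
    # pgvector accepts the textual form "[v1,v2,...]"
    return "[" + ",".join(str(v) for v in vec) + "]"

# Placeholder connection details and data
conn = psycopg2.connect("dbname=rag user=rag host=localhost")
chunk_vectors = []  # pairs of (chunk_id, embedding) produced by the embedding step

with conn, conn.cursor() as cur:
    for chunk_id, vec in chunk_vectors:
        cur.execute(
            "INSERT INTO embeddings (chunk_id, embedding) "
            "VALUES (%s, %s::vector) "
            "ON CONFLICT (chunk_id) DO UPDATE SET embedding = EXCLUDED.embedding",
            (chunk_id, to_vector_literal(vec)),
        )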

Embeddings – Ollama

Embeddings are generated locally using Ollama’s embedding models (e.g. nomic-embed-text).

Benefits:

  • No Python ML stack (no torch, CUDA, etc.)
  • Consistent API with generation models
  • Easy model swapping

from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434"
)
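
Continuing the snippet above (and assuming nomic-embed-text has been pulled into Ollama), documents and queries are embedded with the standard LangChain methods; nomic-embed-text produces 768-dimensional vectors, matching the table definition earlier. The chunk texts below are placeholders.

chunk_texts = ["example chunk one", "example chunk two"]  # placeholder chunks
chunk_embeddings = embeddings.embed_documents(chunk_texts)
query_embedding = embeddings.embed_query("How does authentication work?")

print(len(query_embedding))  # 768 for nomic-embed-text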

LLM – Mistral via Ollama

For generation, the system uses Mistral served locally through Ollama.

from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="mistral",
    base_url="http://localhost:11434",
    temperature=0.2,
)

This allows:

  • Consistent, low-latency responses (the low temperature keeps outputs stable)
  • Full offline operation
  • Easy experimentation with other models

RAG Pipeline Implementation

Chunking Strategy

We use recursive character-based chunking with overlap to preserve semantic continuity.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=900,
    chunk_overlap=150
)

This balances retrieval accuracy with token efficiency.
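
Usage is a single call on the splitter defined above; the input text here is a placeholder for whatever the ingestion stage loaded.

raw_text = "..."  # placeholder: document text loaded during ingestion
chunks = splitter.split_text(raw_text)  # list of overlapping text chunks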

Retrieval

At query time:

  1. The question is embedded
  2. A top-K cosine similarity search is executed in Postgres
  3. Matching chunks are assembled into a context window

Each retrieved chunk is tracked for citation and explainability.
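
A sketch of this flow is shown below, reusing the embeddings object, conn connection, and to_vector_literal helper from the earlier snippets; the exact SQL and the default K value are illustrative.

def retrieve_top_k(question: str, k: int = 4) -> list[tuple[str, float]]:
    # 1. Embed the question
    query_vec = to_vector_literal(embeddings.embed_query(question))
    # 2. Top-K cosine search in Postgres (<=> is pgvector's cosine distance operator)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT chunk_id, 1 - (embedding <=> %s::vector) AS similarity "
            "FROM embeddings "
            "ORDER BY embedding <=> %s::vector "
            "LIMIT %s",
            (query_vec, query_vec, k),
        )
        return cur.fetchall()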

Prompting & Grounding

The system prompt enforces grounding:

“Answer using ONLY the provided context. If the context is insufficient, say so.”

This dramatically reduces hallucinations and ensures traceable outputs.
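
One way to express this instruction with LangChain's prompt templates, chained to the ChatOllama instance defined earlier, is sketched below; the variable names and placeholder inputs are illustrative.

from langchain_core.prompts import ChatPromptTemplate

grounded_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Answer using ONLY the provided context. "
     "If the context is insufficient, say so."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

# Pipe the grounding prompt into the local Mistral model
rag_chain = grounded_prompt | llm
answer = rag_chain.invoke({
    "context": "<retrieved chunks joined here>",
    "question": "<user question>",
})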

Agent Support

In addition to standard RAG, the system includes a lightweight agent abstraction using LangChain tools. This demonstrates:

  • Tool calling
  • Multi-step reasoning
  • Separation between retrieval and orchestration logic

This is particularly useful for workflows that extend beyond Q&A (e.g. summarization, validation, enrichment).
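
As a hedged sketch of what that tool layer can look like, the snippet below uses LangChain's @tool decorator and tool binding; the search_documents tool and its wiring are illustrative, not the repository's exact agent implementation.

from langchain_core.tools import tool

@tool
def search_documents(query: str) -> str:
    """Semantic search over the local document store."""
    hits = retrieve_top_k(query)  # retrieval function sketched above
    return "\n".join(chunk_id for chunk_id, _ in hits)

# Let the local model decide when to call the tool
llm_with_tools = llm.bind_tools([search_documents])
response = llm_with_tools.invoke("Summarise how authentication is configured")
print(response.tool_calls)  # tool invocations requested by the model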

Testing the System

Example query:

curl http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question":"How does authentication work in the platform?"}'

The response includes:

  • A synthesized answer
  • Supporting citations
  • Confidence scores from vector similarity

Lessons Learned

Key takeaways from building this system:

  • Local-first RAG is practical with modern tooling
  • pgvector is sufficient for many real-world workloads
  • Ollama removes most ML operational friction
  • LangChain’s modularization requires careful dependency management
  • Fixed embedding dimensions must match DB schema exactly

Future Enhancements

Potential next steps:

  • MMR or hybrid (BM25 + vector) retrieval
  • Cross-encoder reranking
  • Streaming token responses
  • Evaluation harness (precision@k, faithfulness)
  • Multi-tenant document isolation

Conclusion

This project demonstrates a production-aligned, privacy-preserving RAG architecture using open-source tools. It shows how modern LLM systems can be built with:

  • Clear separation of concerns
  • Strong grounding guarantees
  • Minimal infrastructure overhead
  • Full local control

For engineers evaluating RAG architectures, this approach provides a solid baseline that scales from laptop demos to controlled production environments.

Sample Repository

https://github.com/romaan/rag-showcase