Overview
Retrieval-Augmented Generation (RAG) has become the dominant architectural pattern for grounding Large Language Models (LLMs) in proprietary or domain-specific knowledge. Instead of relying solely on model weights, RAG combines semantic retrieval with generation, improving accuracy, reducing hallucinations, and enabling explainability through citations.
In this article, I’ll walk through the design and implementation of a fully local RAG system using:
- FastAPI for the API layer
- PostgreSQL + pgvector as the vector store
- LangChain for orchestration
- Ollama + Mistral for local LLM inference
- Ollama embeddings for vectorization
The system runs entirely offline, requires no external APIs, and is suitable for air-gapped or privacy-sensitive environments.
Architecture
At a high level, the system consists of five stages:
- Ingestion – Load documents from local files or URLs
- Chunking – Split content into semantically meaningful text chunks
- Vectorization – Convert chunks into embeddings
- Retrieval – Perform semantic search using pgvector
- Generation – Generate grounded answers using a local LLM
Documents
↓
Chunking
↓
Embeddings (Ollama)
↓
PostgreSQL (pgvector)
↓
Top-K Semantic Retrieval
↓
LLM (Mistral via Ollama)
Why Fully Local?
Many RAG examples rely on cloud LLMs and hosted vector databases. While convenient, that approach introduces:
- Data privacy concerns
- Ongoing inference costs
- Latency variability
- Vendor lock-in
This project demonstrates that production-grade RAG pipelines can run locally, with predictable performance and full data ownership.
Technology Stack
API Layer – FastAPI
FastAPI provides:
- Clear request/response schemas
- Async support
- Automatic OpenAPI documentation
Endpoints:
- POST /ask – RAG-powered question answering
- POST /agent – Tool-based agent execution
- GET /health – Service health check
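To make the API surface concrete, here is a minimal sketch of how the request/response schemas and the /ask and /health routes could be wired up in FastAPI. The answer_question helper is a placeholder for the RAG pipeline described later, not the project's actual implementation.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    question: str

class AskResponse(BaseModel):
    answer: str
    citations: list[str] = []

def answer_question(question: str) -> tuple[str, list[str]]:
    """Placeholder for the RAG chain sketched later (retrieve + generate)."""
    raise NotImplementedError

@app.post("/ask", response_model=AskResponse)
async def ask(req: AskRequest) -> AskResponse:
    # Delegate to the RAG pipeline: embed the question, retrieve, generate.
    answer, citations = answer_question(req.question)
    return AskResponse(answer=answer, citations=citations)

@app.get("/health")
async def health() -> dict:
    return {"status": "ok"}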
Vector Store – PostgreSQL + pgvector
Instead of a managed vector database, we use pgvector, which allows vector similarity search directly inside PostgreSQL.
Key design choices:
- Cosine similarity
- IVFFLAT index for scalable approximate search
- Fixed vector dimension (768) matching the embedding model
CREATE TABLE embeddings (
chunk_id TEXT PRIMARY KEY,
embedding vector(768) NOT NULL
);
CREATE INDEX idx_embeddings_ivfflat
ON embeddings USING ivfflat (embedding vector_cosine_ops);
This keeps operational complexity low while still supporting efficient semantic search.
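As a rough sketch of the ingestion side, chunk embeddings can be written into this table with a plain psycopg connection and pgvector's text literal format. The connection string, chunk data, and helper function below are illustrative, and the embedding model is the Ollama one introduced in the next section.

import psycopg
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434",
)

def to_pgvector(vec: list[float]) -> str:
    # pgvector accepts a "[x1,x2,...]" text literal, cast below to vector(768)
    return "[" + ",".join(f"{x:.8f}" for x in vec) + "]"

# Illustrative chunk data; in the real pipeline these come from the chunking step
chunks = {
    "doc-1#0": "First chunk of text ...",
    "doc-1#1": "Second chunk of text ...",
}

with psycopg.connect("dbname=rag user=rag") as conn:
    vectors = embeddings.embed_documents(list(chunks.values()))
    for chunk_id, vec in zip(chunks.keys(), vectors):
        conn.execute(
            "INSERT INTO embeddings (chunk_id, embedding) VALUES (%s, %s::vector) "
            "ON CONFLICT (chunk_id) DO UPDATE SET embedding = EXCLUDED.embedding",
            (chunk_id, to_pgvector(vec)),
        )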
Embeddings – Ollama
Embeddings are generated locally using Ollama’s embedding models (e.g. nomic-embed-text).
Benefits:
- No Python ML stack (no torch, CUDA, etc.)
- Consistent API with generation models
- Easy model swapping
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(
model="nomic-embed-text",
base_url="http://localhost:11434"
)
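A quick sanity check is to embed a test string and confirm that the vector length matches the vector(768) column defined earlier:

# The embedding length must match the vector(768) column exactly.
vector = embeddings.embed_query("How does authentication work?")
print(len(vector))  # expected: 768 for nomic-embed-text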
LLM – Mistral via Ollama
For generation, the system uses Mistral served locally through Ollama.
from langchain_ollama import ChatOllama
llm = ChatOllama(
model="mistral",
base_url="http://localhost:11434",
temperature=0.2,
)
This allows:
- Consistent, low-latency responses (low temperature)
- Full offline operation
- Easy experimentation with other models
RAG Pipeline Implementation
Chunking Strategy
We use recursive character-based chunking with overlap to preserve semantic continuity.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=900,
    chunk_overlap=150,
)
This balances retrieval accuracy with token efficiency.
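In practice, the splitter is applied to loaded documents; the loader and file path below are purely illustrative:

# Illustrative only: load a local file and split it into overlapping chunks.
from langchain_community.document_loaders import TextLoader

docs = TextLoader("docs/example.md").load()   # hypothetical path
chunks = splitter.split_documents(docs)       # splitter configured above
print(f"{len(chunks)} chunks produced")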
Retrieval
At query time:
- The question is embedded
- A top-K cosine similarity search is executed in Postgres
- Matching chunks are assembled into a context window
Each retrieved chunk is tracked for citation and explainability.
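Put together, the query-time step can be sketched as follows, reusing the embeddings instance, connection, and to_pgvector helper from the storage sketch above. pgvector's <=> operator returns cosine distance, so 1 - distance gives the similarity score:

# Sketch of query-time retrieval: embed the question, then run a top-K
# cosine search with pgvector's <=> (cosine distance) operator.
def retrieve(question: str, conn, top_k: int = 4) -> list[tuple[str, float]]:
    qvec = to_pgvector(embeddings.embed_query(question))
    rows = conn.execute(
        "SELECT chunk_id, 1 - (embedding <=> %s::vector) AS similarity "
        "FROM embeddings ORDER BY embedding <=> %s::vector LIMIT %s",
        (qvec, qvec, top_k),
    ).fetchall()
    return rows  # [(chunk_id, similarity), ...] used to assemble the context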
Prompting & Grounding
The system prompt enforces grounding:
“Answer using ONLY the provided context. If the context is insufficient, say so.”
This dramatically reduces hallucinations and ensures traceable outputs.
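One way to express this with LangChain's prompt templates is sketched below; the exact wording and variable names are illustrative, and llm is the ChatOllama instance defined earlier:

# Illustrative grounded prompt: the model is told to use only the retrieved
# context and to admit when that context is insufficient.
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Answer using ONLY the provided context. "
     "If the context is insufficient, say so.\n\nContext:\n{context}"),
    ("human", "{question}"),
])

chain = prompt | llm
reply = chain.invoke({"context": "...retrieved chunks...", "question": "..."})
print(reply.content)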
Agent Support
In addition to standard RAG, the system includes a lightweight agent abstraction using LangChain tools. This demonstrates:
- Tool calling
- Multi-step reasoning
- Separation between retrieval and orchestration logic
This is particularly useful for workflows that extend beyond Q&A (e.g. summarization, validation, enrichment).
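As an illustrative sketch (not the project's actual tools), a retrieval tool can be exposed through LangChain's tool decorator and bound to the model with bind_tools, assuming the served model supports tool calling:

# Illustrative tool definition: retrieval is exposed as a callable tool so
# the model can decide when to search the document store.
from langchain_core.tools import tool

@tool
def search_docs(query: str) -> str:
    """Search the local document store and return the most relevant chunks."""
    hits = retrieve(query, conn)  # retrieval helper sketched earlier
    # Returns chunk ids for brevity; a real tool would return the chunk text.
    return "\n\n".join(chunk_id for chunk_id, _score in hits)

llm_with_tools = llm.bind_tools([search_docs])
result = llm_with_tools.invoke("What does the platform say about authentication?")
print(result.tool_calls)  # tool invocations chosen by the model, if any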
Testing the System
Example query:
curl http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"question":"How does authentication work in the platform?"}'
The response includes:
- A synthesized answer
- Supporting citations
- Similarity scores from the vector search
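Structurally, the JSON response looks something like the example below; the field names and values are illustrative, and the actual shape is defined by the API's response models:

{
  "answer": "…synthesized answer…",
  "citations": ["doc-1#0", "doc-1#3"],
  "scores": [0.87, 0.81]
}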
Lessons Learned
Key takeaways from building this system:
- Local-first RAG is practical with modern tooling
- pgvector is sufficient for many real-world workloads
- Ollama removes most ML operational friction
- LangChain’s modularization requires careful dependency management
- Fixed embedding dimensions must match DB schema exactly
Future Enhancements
Potential next steps:
- MMR or hybrid (BM25 + vector) retrieval
- Cross-encoder reranking
- Streaming token responses
- Evaluation harness (precision@k, faithfulness)
- Multi-tenant document isolation
Conclusion
This project demonstrates a production-aligned, privacy-preserving RAG architecture using open-source tools. It shows how modern LLM systems can be built with:
- Clear separation of concerns
- Strong grounding guarantees
- Minimal infrastructure overhead
- Full local control
For engineers evaluating RAG architectures, this approach provides a solid baseline that scales from laptop demos to controlled production environments.

