RAG with Open-Source Models: Ollama + LangChain on Your Own Server

Every time you send a document to ChatGPT or Claude, that document leaves your network and travels to a data centre in the United States. For most content, that's fine. For HR appraisals, legal contracts, BFSI customer data, or government project files, it's a problem — practically, legally, and competitively. Retrieval-Augmented Generation does not require an OpenAI API key. This guide builds a fully private document Q&A system using Ollama, LangChain, and sentence-transformers: no API keys, no tokens billed per query, no data leaving your server.

Why Open-Source RAG Matters for Indian Businesses

India's Digital Personal Data Protection Act 2023 introduced obligations around cross-border data transfers. Sending confidential documents to a US-hosted API may require additional safeguards — and for organisations in banking, insurance, or government contracting, those safeguards may not exist in practice. Building RAG on your own infrastructure eliminates the transfer question entirely.

Cost is the second reason. Running a million tokens per day through GPT-4o costs approximately $15 at current pricing. On a locally hosted Llama 3.1 8B model, the marginal cost per query after hardware is effectively ₹0, with electricity as the only running expense. For a business running internal document search across 10 team members daily, GPT-4o adds up to ₹40,000–₹50,000 per month. Against that bill, a one-time GPU server investment pays for itself in two to three months.
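
The payback claim can be checked with a few lines of arithmetic. The sketch below is illustrative only: the exchange rate is an assumption, and the hardware and electricity figures are the on-premises estimates given later in this guide.

```python
def monthly_api_cost_inr(usd_per_day: float, inr_per_usd: float, days: int = 30) -> float:
    """Monthly hosted-API spend converted to rupees."""
    return usd_per_day * days * inr_per_usd


def payback_months(hardware_inr: float, api_inr_per_month: float,
                   running_inr_per_month: float) -> float:
    """Months until the hardware cost is recovered by the avoided API spend."""
    monthly_saving = api_inr_per_month - running_inr_per_month
    if monthly_saving <= 0:
        raise ValueError("local running costs exceed the API bill; no payback")
    return hardware_inr / monthly_saving


INR_PER_USD = 84.0  # assumption: substitute the current exchange rate

api_cost = monthly_api_cost_inr(usd_per_day=15, inr_per_usd=INR_PER_USD)
best_case = payback_months(80_000, api_cost, 1_500)    # cheaper GPU, lower power bill
worst_case = payback_months(120_000, api_cost, 2_000)  # dearer GPU, higher power bill

print(f"API spend: ₹{api_cost:,.0f}/month")
print(f"Payback: {best_case:.1f} to {worst_case:.1f} months")
```

At the assumed rate the hosted bill comes to a little under ₹40,000 per month and the payback lands between roughly two and three and a half months, which is in line with the estimate above.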

Kerala government departments exploring AI-assisted document retrieval, BFSI firms handling KYC documents, and IT service companies with confidential client data are natural candidates for this architecture. So are legal firms processing case files and healthcare organisations with patient records — any context where the document content should never leave the building.

Picking the Right Open-Source Model

The open-source LLM landscape has matured rapidly. There are now several models in the 7–8B parameter range that perform well on document Q&A tasks without requiring expensive hardware. Here are the options I've tested in production-like setups:

Model             VRAM Required   Ideal Use Case
Llama 3.1 8B      6 GB            General document Q&A, English, balanced accuracy
Mistral 7B        5 GB            European compliance docs, strong English reasoning
Phi-3 Mini 3.8B   4 GB            CPU-only servers, edge devices, low-budget deployments
Qwen 2.5 7B       5 GB            Mixed English/Hindi/Malayalam documents, Indic content

For most Indian business RAG use cases, Llama 3.1 8B is the right starting point. If your documents include Hindi, Tamil, or Malayalam content, Qwen 2.5 7B handles Indic languages noticeably better. If you're deploying on a machine with no GPU, Phi-3 Mini 3.8B runs acceptably on CPU, though response times stretch to 20–40 seconds per query.
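
The selection guidance above can be written down as a small helper. This is a minimal sketch, not a benchmark-driven rule: the function name is hypothetical, and the returned strings are the tags these models are assumed to carry in the Ollama library.

```python
def pick_ollama_model(has_gpu: bool, needs_indic_languages: bool) -> str:
    """Return an Ollama model tag following the selection guidance above."""
    if not has_gpu:
        return "phi3:mini"      # smallest of the four; tolerable on CPU
    if needs_indic_languages:
        return "qwen2.5:7b"     # stronger on Hindi, Tamil, and Malayalam content
    return "llama3.1:8b"        # default for English document Q&A


print(pick_ollama_model(has_gpu=True, needs_indic_languages=True))
```

Keeping the choice in one function makes it easy to switch models per deployment without touching the rest of the pipeline.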

Setting Up Ollama

Ollama is a tool that manages local LLM downloads, quantization, and serving. It exposes a simple REST API on port 11434, which LangChain talks to through the langchain-ollama package, so the local model takes the place of the OpenAI endpoint in your chain.

Install on Ubuntu (the most common server OS for this use case):

curl -fsSL https://ollama.com/install.sh | sh

On macOS with Apple Silicon:

brew install ollama

Pull the model you want (this downloads the quantized weights — about 4.7 GB for Llama 3.1 8B):

ollama pull llama3.1:8b

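Start the Ollama server if it is not already running. The Linux install script normally sets it up as a background service; with a Homebrew install, start it in a separate terminal before running ollama pull, which needs a live server: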

ollama serve

Verify it's running with a quick curl test:

curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b","prompt":"What is RAG?","stream":false}'

You should see a JSON response with a "response" field within a few seconds. If you're on a GPU server, Ollama automatically detects CUDA or Metal and uses the GPU without any additional configuration.
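
The same check can be made from Python using only the standard library. This is a sketch of the curl test above, assuming the default port and the model pulled earlier; the function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build the same POST request as the curl test, without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(prompt: str, model: str = "llama3.1:8b", timeout: float = 120.0) -> str:
    """Send the request and return the text held in the "response" field."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]
```

With the server running, ask_ollama("What is RAG?") should return the generated text; if the server is down, urllib raises a URLError rather than returning an empty answer.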

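For production deployments, an AWS EC2 g4dn.xlarge (NVIDIA T4 GPU, 16 GB VRAM) runs at approximately ₹4,500 per month on a spot instance. Spot prices fluctuate and AWS can reclaim spot capacity at short notice, so check current rates before budgeting. The T4 handles Llama 3.1 8B at 20–30 tokens per second, fast enough for internal document tools where a short answer of roughly 100–150 tokens in 4–6 seconds is acceptable.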

Building the Document Pipeline

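LangChain provides loaders for virtually every document format. The imports in this guide target the LangChain 0.3 series; LangChain 1.x moved legacy chains such as RetrievalQA into a separate langchain-classic package, so the version pin below keeps the examples working as written. Install the dependencies:

pip install "langchain<1.0" langchain-community langchain-ollama \
            pypdf chromadb sentence-transformers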

Here is a script that loads all PDF and CSV files from a folder and prepares them for embedding:

import os
from langchain_community.document_loaders import PyPDFLoader, CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_documents_from_folder(folder_path: str):
    """Load every PDF and CSV in folder_path into LangChain Document objects."""
    docs = []
    for filename in sorted(os.listdir(folder_path)):
        filepath = os.path.join(folder_path, filename)
        extension = os.path.splitext(filename)[1].lower()  # also matches .PDF, .Csv
        if extension == ".pdf":
            # PyPDFLoader returns one Document per page
            docs.extend(PyPDFLoader(filepath).load())
        elif extension == ".csv":
            # CSVLoader returns one Document per row
            docs.extend(CSVLoader(filepath).load())
    return docs

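splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    # Try paragraph breaks first, then lines, sentences, words, and finally
    # single characters so that no chunk can exceed chunk_size.
    separators=["\n\n", "\n", ". ", " ", ""]
)

raw_docs = load_documents_from_folder("./company_docs")
chunks = splitter.split_documents(raw_docs)
# raw_docs holds one entry per PDF page or CSV row, not one per file
print(f"Loaded {len(raw_docs)} pages/rows → {len(chunks)} chunks")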

The chunk_size of 1000 characters with 200-character overlap is a good default for business documents. Shorter chunks (500 characters) work better for FAQ-style documents where each question-and-answer pair is self-contained. Longer chunks (1500 characters) work better for narrative documents like annual reports where context spans multiple paragraphs.
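
To see what chunk size and overlap do mechanically, here is a deliberately simplified fixed-window splitter. It is a stand-in for illustration, not the algorithm RecursiveCharacterTextSplitter uses (that one also looks for natural separators); the function name is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))  # 2,500 characters of filler
pieces = chunk_with_overlap(sample)
print([len(p) for p in pieces])  # [1000, 1000, 900]
```

Each window begins 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. That repetition is what lets a sentence that straddles a boundary survive intact in at least one chunk.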

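LangChain's WebBaseLoader (which needs the beautifulsoup4 package) fits the same pipeline: pass it a list of URLs and the resulting documents go through the same splitter. This lets you build RAG over intranet pages the server can fetch directly, without downloading anything manually. Pages behind a login, such as SharePoint or a Confluence wiki, need authenticated requests or a dedicated loader instead.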

Vector Embeddings Without OpenAI

Embeddings convert text chunks into vectors that capture semantic meaning. OpenAI's text-embedding-3-small costs $0.02 per million tokens, which is reasonable at low volume but adds up at scale and still sends data externally. The sentence-transformers library provides compact models that run entirely on-device and are accurate enough for most internal document search:

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# For English documents
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# For Hindi, Malayalam, Tamil, or mixed-language documents
# embedding_model = HuggingFaceEmbeddings(
#     model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
# )

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_db"
)
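# Chroma 0.4 and later write to persist_directory automatically,
# so the older vectorstore.persist() call is no longer needed.
print("Embeddings stored in ./chroma_db")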

The multilingual model (paraphrase-multilingual-MiniLM-L12-v2) handles Malayalam, Hindi, and Tamil without any additional configuration. It produces slightly lower accuracy than the English-only model on English text, so choose based on what the majority of your documents contain.
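
One way to decide what "the majority of your documents" contain is to sample some text and measure how much of it is written in an Indic script. The sketch below is an assumption-laden heuristic, not part of sentence-transformers: the function names and the 0.5 threshold are hypothetical, and only the three scripts named above are checked.

```python
ENGLISH_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
MULTILINGUAL_MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

# Unicode blocks for the scripts mentioned above.
INDIC_RANGES = (
    (0x0900, 0x097F),  # Devanagari (Hindi)
    (0x0B80, 0x0BFF),  # Tamil
    (0x0D00, 0x0D7F),  # Malayalam
)


def indic_fraction(text: str) -> float:
    """Share of alphabetic characters that fall in an Indic script block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    indic = sum(
        1 for ch in letters
        if any(low <= ord(ch) <= high for low, high in INDIC_RANGES)
    )
    return indic / len(letters)


def choose_embedding_model(sample_texts: list[str], threshold: float = 0.5) -> str:
    """Pick the multilingual model when most of the sampled text is Indic."""
    combined = " ".join(sample_texts)
    return MULTILINGUAL_MODEL if indic_fraction(combined) >= threshold else ENGLISH_MODEL


print(choose_embedding_model(["Leave encashment policy", "Travel claims"]))
```

Whichever model is chosen must be used for both ingestion and querying, since vectors from different models are not comparable.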

Chroma persists vectors to a local directory (a SQLite file plus index files), so there is zero infrastructure overhead and it runs on the same machine as Ollama. For collections exceeding 100,000 chunks, consider switching to Qdrant running in Docker, a dedicated vector database server that uses HNSW indexing to hold retrieval at roughly sub-50ms latency as the collection grows.
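
Conceptually, retrieval is a nearest-neighbour search over those vectors. The toy example below ranks chunks by cosine similarity with a brute-force scan. It is a sketch of the idea only: the vectors are made up, the function names are hypothetical, and a real vector store uses an index and may default to a different distance metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine_similarity(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 3-dimensional vectors; real all-MiniLM-L6-v2 embeddings have 384 dimensions.
toy_chunks = {
    "leave-policy": [0.9, 0.1, 0.0],
    "travel-claims": [0.1, 0.9, 0.1],
    "it-security": [0.0, 0.2, 0.9],
}
print(top_k([0.8, 0.2, 0.0], toy_chunks))  # ['leave-policy', 'travel-claims']
```

A brute-force scan like this compares the query against every stored vector, which is why approximate indexes such as HNSW matter once the collection grows large.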

The Complete RAG Query Chain

With documents embedded and stored, connecting the retrieval layer to Ollama takes fewer than 60 lines of code:

from langchain_ollama import OllamaLLM
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embedding_model
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

llm = OllamaLLM(model="llama3.1:8b", temperature=0.1)

prompt_template = """You are a helpful assistant answering questions about
internal company documents. Use only the context below to answer.
If the answer is not in the context, say "I don't have that information."
Always cite the source document at the end of your answer.

Context:
{context}

Question: {question}

Answer:"""

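prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

# Prefix each retrieved chunk with its file path so the model can cite it;
# by default only the chunk text is placed in {context}.
document_prompt = PromptTemplate(
    template="Source: {source}\n{page_content}",
    input_variables=["page_content", "source"]
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt, "document_prompt": document_prompt},
    return_source_documents=True
)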

result = qa_chain.invoke({"query": "What is the leave encashment policy?"})
print(result["result"])
print("\nSources:")
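for doc in result["source_documents"]:
    source = doc.metadata.get("source", "unknown")
    page = doc.metadata.get("page")  # PyPDFLoader page numbers start at 0
    page_label = page + 1 if isinstance(page, int) else "?"
    print(f" - {source} (page {page_label})")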

The flow is: user query → embed with the same model used during ingestion → retrieve top-5 semantically similar chunks → inject chunks into the Ollama prompt → return the generated answer plus source citations. The temperature=0.1 setting makes sampling near-deterministic, so the same question over the same chunks yields nearly the same answer each time. It does not guarantee factual output by itself; the grounding comes from the "use only the context" instruction and the quality of the retrieved chunks.
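
The "inject chunks into the prompt" step is plain string assembly. The sketch below imitates what a "stuff" style chain does, assuming a shortened version of the template; the function name and the sample chunk texts are hypothetical and stand in for real retriever output.

```python
PROMPT_TEMPLATE = """Use only the context below to answer.
If the answer is not in the context, say "I don't have that information."

Context:
{context}

Question: {question}

Answer:"""


def stuff_prompt(question: str, retrieved: list[dict]) -> str:
    """Concatenate retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(
        f"[{chunk['source']}] {chunk['text']}" for chunk in retrieved
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Hypothetical chunks standing in for the retriever's top-k results.
retrieved = [
    {"source": "HR-Policy-2025.pdf", "text": "Unused earned leave may be encashed at year end."},
    {"source": "HR-Policy-2025.pdf", "text": "Encashment is capped at 30 days."},
]
final_prompt = stuff_prompt("What is the leave encashment policy?", retrieved)
print(final_prompt)
```

Because every chunk is pasted into a single prompt, k multiplied by the chunk size has to fit inside the model's context window along with the question and instructions.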

Source citations at chunk level (filename + page number) are essential for trust in enterprise contexts. When an HR manager asks about a leave policy and the answer includes "Source: HR-Policy-2025.pdf, page 7", they can verify the answer against the original document rather than taking the model's word for it.
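
Several retrieved chunks often come from the same page, so it helps to de-duplicate citations before showing them. A minimal sketch, assuming metadata dictionaries shaped like those the loaders attach (a "source" path and, for PDFs, a zero-based "page"); the function name and the CSV file name are hypothetical.

```python
def format_citations(metadatas: list[dict]) -> list[str]:
    """Turn chunk metadata into de-duplicated, human-readable citations."""
    seen = set()
    citations = []
    for meta in metadatas:
        source = meta.get("source", "unknown")
        page = meta.get("page")
        # Assumption: "page" is a zero-based index, so add 1 for display.
        label = f"{source}, page {page + 1}" if isinstance(page, int) else source
        if label not in seen:
            seen.add(label)
            citations.append(label)
    return citations


print(format_citations([
    {"source": "HR-Policy-2025.pdf", "page": 6},
    {"source": "HR-Policy-2025.pdf", "page": 6},
    {"source": "salaries.csv", "row": 12},
]))
# ['HR-Policy-2025.pdf, page 7', 'salaries.csv']
```

Citations built from metadata are reliable in a way that citations written by the model are not, because they come from the retriever rather than from generated text.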

Running in Production

For a team of 10–20 users querying internal documents, a simple FastAPI wrapper around the RAG chain is sufficient. Here's the directory structure for a Dockerised deployment:

rag-app/
├── docker-compose.yml
├── ollama/           # Ollama service
├── chroma/           # Persistent vector store
├── api/
│   ├── Dockerfile
│   ├── main.py       # FastAPI app
│   └── requirements.txt

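The docker-compose.yml starts three services: Ollama (port 11434), Chroma (port 8000), and the FastAPI app (port 8080). Running Chroma as its own service means the API connects to it over HTTP (for example with chromadb.HttpClient) rather than opening ./chroma_db directly through persist_directory as the scripts above do. Document ingestion runs as a separate one-off script rather than on every API request.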

For cloud deployment within India, DigitalOcean's GPU droplets (1x NVIDIA H100 80GB) have the headroom for 50–100 concurrent queries, but H100 instances are billed by the hour and typically cost several times more than a T4, so confirm the current rate and Bengaluru availability before budgeting. For a single-organisation internal tool with moderate traffic, an AWS EC2 g4dn.xlarge at ₹4,500/month on spot pricing is more cost-effective.

On-premises deployment on a company server with a consumer GPU (RTX 3080 or 4080) is the most economical option for organisations with a dedicated server room. A one-time hardware cost of ₹80,000–₹1,20,000 covers the GPU; ongoing costs are electricity (roughly ₹1,500–₹2,000/month at Kerala KSEB commercial rates) and maintenance.

Latency expectations: first token appears in 3–8 seconds on an 8B model with GPU, and full responses complete in 10–20 seconds depending on answer length. For internal HR or legal document search, that latency is entirely acceptable. For customer-facing applications where users expect sub-second responses, this architecture requires either a smaller model (Phi-3 Mini) or a more powerful GPU (A100, H100).
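
These latency figures follow from two numbers: the wait for the first token and the generation speed. The sketch below reproduces the range using the 20–30 tokens per second quoted earlier for a T4; the 200-token answer length is an assumption.

```python
def response_seconds(first_token_s: float, answer_tokens: int, tokens_per_s: float) -> float:
    """Total wall-clock time: wait for the first token, then stream the rest."""
    return first_token_s + answer_tokens / tokens_per_s


ANSWER_TOKENS = 200  # assumption: an answer of a few short paragraphs

fastest = response_seconds(first_token_s=3, answer_tokens=ANSWER_TOKENS, tokens_per_s=30)
slowest = response_seconds(first_token_s=8, answer_tokens=ANSWER_TOKENS, tokens_per_s=20)
print(f"{fastest:.1f}s to {slowest:.1f}s")  # 9.7s to 18.0s
```

Streaming the response to the user interface does not shorten the total time, but it makes the wait feel closer to the first-token figure than to the full-response figure.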

Frequently Asked Questions

Can a local RAG system handle Malayalam documents?

Yes, with the right embedding model. Use paraphrase-multilingual-MiniLM-L12-v2 from sentence-transformers — it handles Malayalam, Hindi, and Tamil without any API. Llama 3.1 can read and reason over Malayalam text, but its output quality improves when the system prompt is in English. A practical approach: embed and retrieve in Malayalam, then instruct the model to answer in whichever language the user asked in. For pure Malayalam Q&A, Qwen 2.5 7B performs noticeably better than Llama on Indic content.

How many documents can a local Chroma vector store handle before retrieval slows down?

With all-MiniLM-L6-v2 and Chroma on a standard SSD, retrieval stays under 200ms for up to 100,000 chunks — roughly 500–800 average-length PDFs. Beyond that, query latency climbs. The fix is Qdrant in Docker with HNSW indexing, which scales to millions of vectors without degradation. Retrieval is almost never the bottleneck in a local setup — Ollama inference at 3–8 seconds per query is the limiting factor, so optimising vector search before 100K chunks is premature.
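
To estimate where your own collection sits relative to that 100,000-chunk mark, a rough calculation is enough. The sketch below assumes the 1000/200 splitter settings used earlier, 50 pages per PDF, and about 2,500 characters per page; all three are assumptions to replace with your own measurements, and the function names are hypothetical.

```python
import math


def chunks_per_page(chars_per_page: int, chunk_size: int = 1000, overlap: int = 200) -> int:
    """Approximate chunk count for one page split with a sliding window."""
    if chars_per_page <= 0:
        return 0
    if chars_per_page <= chunk_size:
        return 1
    return 1 + math.ceil((chars_per_page - chunk_size) / (chunk_size - overlap))


def pdfs_within_budget(chunk_budget: int, pages_per_pdf: int, chars_per_page: int) -> int:
    """How many PDFs of a given size fit before the chunk budget is used up."""
    per_pdf = pages_per_pdf * chunks_per_page(chars_per_page)
    if per_pdf <= 0:
        raise ValueError("each PDF must contribute at least one chunk")
    return chunk_budget // per_pdf


# Assumptions: 50 pages per PDF and about 2,500 characters per page.
print(pdfs_within_budget(chunk_budget=100_000, pages_per_pdf=50, chars_per_page=2_500))  # 666
```

Under these assumptions the budget covers a little over 650 PDFs, which falls inside the 500–800 range quoted above; denser or longer documents will reach the limit sooner.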

Does running RAG on a private server make it DPDPA 2023 compliant?

Not on its own, although it removes the largest risk. On-premises deployment is the most legally defensible architecture under India's Digital Personal Data Protection Act 2023, because no personal data crosses a border. Cloud options in AWS Mumbai (ap-south-1) or DigitalOcean Bengaluru also keep data within India, which addresses data-residency concerns. However, DPDPA 2023 imposes consent and purpose-limitation obligations regardless of where data is stored: a private RAG system eliminates cross-border transfer risk but does not replace the need for a proper privacy notice and user consent mechanisms if the documents contain personal data.