RAG (Retrieval-Augmented Generation)
2025-11-30
Summary
Retrieval-Augmented Generation (RAG) addresses core limitations of large language models by giving them access to real, up-to-date, domain-specific information at query time. Instead of relying solely on static training data, which leads to outdated answers, missing context, and hallucinations, RAG retrieves relevant documents from your internal sources (databases, PDFs, Confluence pages, emails, manuals, etc.) and feeds them into the model as grounding context. This lets the LLM produce accurate, traceable, and domain-aware responses without expensive fine-tuning, making RAG a fast, secure, and maintainable way to make LLMs truly useful over real-world enterprise knowledge.
The RAG Pipeline
- Document Ingestion
A RAG pipeline begins by collecting data from various sources—such as PDFs, websites, databases, manuals, or internal documents—and turning them into a consistent, machine-readable format. During this stage, text is extracted, cleaned, and tagged with metadata so it can be processed downstream.
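A minimal sketch of this stage in Python, assuming plain-text files on disk (a real pipeline would add parsers for PDFs, HTML, and so on):

from pathlib import Path

def ingest(source_dir: str) -> list[dict]:
    """Collect plain-text files and tag each with basic metadata."""
    docs = []
    for path in Path(source_dir).rglob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        docs.append({
            "text": " ".join(text.split()),  # collapse whitespace left over from extraction
            "metadata": {"source": str(path)},
        })
    return docs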
- Chunking
Because documents can be long and unfocused, they are split into smaller, semantically meaningful chunks that an LLM can easily work with. This improves retrieval accuracy by ensuring the system returns precise, relevant slices of information rather than entire documents.
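The simplest strategy is fixed-size character windows with overlap, sketched below; production systems usually prefer sentence- or heading-aware splitting:

def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping fixed-size windows (simplest chunking strategy)."""
    step = size - overlap  # overlap preserves context across chunk boundaries
    return [text[i:i + size] for i in range(0, len(text), step)]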
- Embedding
Each chunk is transformed into a numerical vector that captures its meaning using an embedding model. These vectors allow the system to compare the semantic similarity of text, enabling the model to find conceptually related information rather than relying on surface keyword matching.
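A sketch using the sentence-transformers library; the model name here is just an example choice, not a recommendation:

from sentence_transformers import SentenceTransformer

# Example model; any embedding model can fill this role
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts: list[str]):
    # Normalized vectors make dot product equal cosine similarity
    return model.encode(texts, normalize_embeddings=True)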
- Vector Store
The embeddings and their associated metadata are stored in a specialized database designed for fast similarity search. This “semantic index” makes it possible to quickly locate the most relevant chunks when a user asks a question.
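For illustration, a brute-force in-memory stand-in; a real deployment would use a dedicated vector database such as FAISS, Qdrant, Milvus, or pgvector. Payloads are assumed to be dicts carrying the chunk text and its source:

import numpy as np

class VectorStore:
    """Brute-force stand-in for a real vector database."""
    def __init__(self):
        self.vectors, self.payloads = [], []

    def add(self, vector, payload: dict):
        self.vectors.append(np.asarray(vector, dtype=np.float32))
        self.payloads.append(payload)

    def search(self, query_vector, k: int = 4):
        # Dot product equals cosine similarity because vectors are normalized
        scores = np.stack(self.vectors) @ np.asarray(query_vector, dtype=np.float32)
        top = np.argsort(scores)[::-1][:k]
        return [(float(scores[i]), self.payloads[i]) for i in top]

Wiring the earlier sketches together to build the index:

store = VectorStore()
for doc in ingest("docs/"):
    for piece in chunk_text(doc["text"]):
        store.add(embed([piece])[0], {"text": piece, "source": doc["metadata"]["source"]})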
- Retrieval
When a query comes in, the system converts the user’s question into an embedding and searches the vector store for the closest matching chunks. The result is a set of highly relevant passages that will serve as factual grounding for the LLM’s response.
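Continuing the sketch above (reusing model and VectorStore), retrieval is just embed-then-search:

def retrieve(question: str, store: VectorStore, k: int = 4) -> list[str]:
    """Embed the question and return the k most similar chunk texts."""
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    return [payload["text"] for _, payload in store.search(q_vec, k=k)]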
- Context Assembly (Prompt Construction)
The retrieved chunks are combined with the user’s question and model instructions to form a structured prompt. This step ensures that the LLM has access to the right context and is guided to answer based only on the supplied information.
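A minimal prompt builder; the exact wording of the instructions is an assumption and worth iterating on:

def build_prompt(question: str, passages: list[str]) -> str:
    """Combine instructions, numbered passages, and the question into one prompt."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    return (
        "Answer using ONLY the context below; if it is insufficient, say so. "
        "Cite passage numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )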
- Generation
The LLM receives the assembled prompt and produces a response that is grounded in the retrieved data instead of relying solely on its training knowledge. This is where the pipeline delivers a context-aware, up-to-date answer tailored to the user’s query.
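Since the box described below runs vLLM's OpenAI-compatible server, generation can be a plain HTTP call to it. Host, port, and model name here are assumptions that must match the vllm.service configuration:

import requests

def generate(prompt: str) -> str:
    """Query the local vLLM server via its OpenAI-compatible chat endpoint."""
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",  # assumed host/port; match your service
        json={
            "model": "your-model-name",  # placeholder; must match the served model
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.0,  # deterministic output suits grounded QA
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]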
- Post-Processing
After generation, the system may refine the output—such as summarizing it, ensuring it fits a required format, adding citations, or validating JSON structure. This step ensures the final answer meets the needs of the application and adheres to safety or correctness constraints.
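For example, a gate that rejects output which is not well-formed JSON:

import json

def require_valid_json(raw: str) -> dict:
    """Reject answers that are not well-formed JSON before handing them to the app."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"Model output is not valid JSON: {err}") from err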
- Feedback Loop / Evaluation
A mature RAG system includes ongoing evaluation to measure the accuracy, relevance, and reliability of retrieved context and generated answers. This feedback is used to improve document chunking, retrieval strategies, and prompt design over time.
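One simple, concrete metric is retrieval hit rate against a hand-labeled evaluation set. This sketch reuses the earlier pieces and assumes each stored payload carries a source field:

def retrieval_hit_rate(eval_set: list[dict], store: VectorStore, k: int = 4) -> float:
    """Fraction of questions whose known source document shows up in the top k."""
    hits = 0
    for item in eval_set:  # each item: {"question": ..., "source": ...}
        q_vec = model.encode([item["question"]], normalize_embeddings=True)[0]
        found = {p["source"] for _, p in store.search(q_vec, k=k)}
        hits += item["source"] in found
    return hits / len(eval_set)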
Hardware
- GeForce RTX 5090 (32GB VRAM)
- AMD Ryzen 9 CPU
- 96 GB RAM
Software
- Ubuntu 24
- NVIDIA drivers (verify the GPU is visible with nvidia-smi)
System Services
- jupyterlab.service
- vllm.service
Starting/Stopping
systemctl status [service-name]
sudo systemctl start|stop|restart [service-name]
Models
Setup
vLLM Service Setup
Configuration
- Create the system user:
sudo useradd --system --create-home --home-dir /opt/vllm --shell /usr/sbin/nologin vllm
- To log in as this user, add your local user to the vllm group:
sudo usermod -aG vllm $USER
- Become the vLLM user (for setup):
sudo -u vllm -s /bin/bash
- Application directory: /opt/vllm/
- Service unit: /etc/systemd/system/vllm.service
[Unit]
Description=vLLM OpenAI-compatible server
After=network.target
[Service]
Type=simple
User=vllm
Group=vllm
WorkingDirectory=/opt/vllm
ExecStart=/opt/vllm/start_vllm.sh
Restart=on-failure
RestartSec=5
LimitNOFILE=65535
[Install]
WantedBy=multi-user.target
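The unit points at /opt/vllm/start_vllm.sh, which is not shown above. A minimal sketch of what it might contain, assuming a virtualenv at /opt/vllm/venv and a placeholder model name (both assumptions; adjust to the actual install):

#!/usr/bin/env bash
# Hypothetical launcher; venv path and model name are assumptions
exec /opt/vllm/venv/bin/vllm serve your-model-name --host 0.0.0.0 --port 8000

Make the script executable (chmod +x), then run sudo systemctl daemon-reload and sudo systemctl enable --now vllm.service to pick up the unit and start it at boot.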