
RAG (Retrieval-Augmented Generation)

2025-11-30


Summary

Retrieval-Augmented Generation (RAG) addresses the core limitations of large language models by giving them access to real, up-to-date, domain-specific information at query time. Instead of relying solely on static training data, which leads to outdated answers, missing context, and hallucinations, RAG retrieves relevant documents from your internal sources (databases, PDFs, Confluence pages, emails, manuals, etc.) and feeds them to the model as grounding context. This lets the LLM produce accurate, traceable, and domain-aware responses without expensive fine-tuning, making RAG a fast, secure, and maintainable way to put real-world enterprise knowledge to work.

The RAG Pipeline

  1. Document Ingestion

A RAG pipeline begins by collecting data from various sources—such as PDFs, websites, databases, manuals, or internal documents—and turning them into a consistent, machine-readable format. During this stage, text is extracted, cleaned, and tagged with metadata so it can be processed downstream.
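A minimal ingestion sketch in Python, assuming the sources are plain-text or Markdown files on disk; the corpus path and metadata fields are placeholders, and real pipelines would add format-specific extractors for PDFs, HTML, or database exports.

  from pathlib import Path

  def ingest_documents(source_dir: str) -> list[dict]:
      """Read raw files and normalize them into text-plus-metadata records."""
      records = []
      for path in Path(source_dir).rglob("*"):
          if path.suffix.lower() not in {".txt", ".md"}:
              continue  # other formats would need dedicated extractors
          text = path.read_text(encoding="utf-8", errors="ignore")
          records.append({
              "text": " ".join(text.split()),        # collapse whitespace
              "metadata": {
                  "source": str(path),               # provenance for later citations
                  "modified": path.stat().st_mtime,  # freshness signal
              },
          })
      return records

  docs = ingest_documents("/opt/rag/corpus")  # hypothetical corpus location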

  2. Chunking

Because documents can be long and unfocused, they are split into smaller, semantically meaningful chunks that an LLM can easily work with. This improves retrieval accuracy by ensuring the system returns precise, relevant slices of information rather than entire documents.
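A sketch of simple fixed-size chunking with overlap, continuing the ingestion sketch above; the chunk size and overlap values are illustrative, and production systems often split along semantic boundaries (headings, paragraphs, sentences) instead.

  def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
      """Split text into overlapping character windows so facts are not cut in half."""
      pieces = []
      step = chunk_size - overlap
      for start in range(0, len(text), step):
          piece = text[start:start + chunk_size]
          if piece.strip():
              pieces.append(piece)
      return pieces

  # Keep the source path with every chunk so answers can cite where they came from.
  chunks = [
      {"text": piece, "source": doc["metadata"]["source"]}
      for doc in docs
      for piece in chunk_text(doc["text"])
  ]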

  3. Embedding

Each chunk is transformed into a numerical vector that captures its meaning using an embedding model. These vectors allow the system to compare the semantic similarity of text, enabling the model to find conceptually related information rather than relying on surface keyword matching.
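A sketch using the sentence-transformers library, continuing from the chunks above; the model name all-MiniLM-L6-v2 is an illustrative assumption, and whichever embedding model you standardize on would take its place.

  from sentence_transformers import SentenceTransformer

  embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
  vectors = embedder.encode(
      [c["text"] for c in chunks],
      normalize_embeddings=True,  # unit-length vectors make cosine similarity a dot product
  )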

  4. Vector Store

The embeddings and their associated metadata are stored in a specialized database designed for fast similarity search. This “semantic index” makes it possible to quickly locate the most relevant chunks when a user asks a question.
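A minimal in-memory stand-in for a vector store, written with numpy to show the idea; a real deployment would use a dedicated engine (e.g. FAISS, Qdrant, pgvector) for persistence, filtering, and scale.

  import numpy as np

  class VectorStore:
      """Tiny semantic index: stores vectors plus payloads, searches by cosine similarity."""

      def __init__(self):
          self.vectors = []   # list of 1-D numpy arrays
          self.payloads = []  # parallel list of chunk text and metadata

      def add(self, vector, payload):
          self.vectors.append(np.asarray(vector, dtype=np.float32))
          self.payloads.append(payload)

      def search(self, query_vector, k=4):
          matrix = np.vstack(self.vectors)
          query = np.asarray(query_vector, dtype=np.float32)
          scores = matrix @ query  # cosine similarity, since vectors were normalized
          top = np.argsort(scores)[::-1][:k]
          return [(float(scores[i]), self.payloads[i]) for i in top]

  store = VectorStore()
  for chunk, vector in zip(chunks, vectors):  # from the chunking and embedding sketches
      store.add(vector, chunk)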

  5. Retrieval

When a query comes in, the system converts the user’s question into an embedding and searches the vector store for the closest matching chunks. The result is a set of highly relevant passages that will serve as factual grounding for the LLM’s response.
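Query-time retrieval, continuing the sketches above: the question is embedded with the same model used for the chunks, and the closest passages come back as grounding material. The example question is arbitrary.

  question = "How do I restart the vllm.service after changing its configuration?"

  query_vector = embedder.encode(question, normalize_embeddings=True)
  hits = store.search(query_vector, k=4)

  retrieved_passages = [payload["text"] for score, payload in hits]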

  6. Context Assembly (Prompt Construction)

The retrieved chunks are combined with the user’s question and model instructions to form a structured prompt. This step ensures that the LLM has access to the right context and is guided to answer based only on the supplied information.
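A sketch of prompt construction; the instruction wording and delimiters are arbitrary choices, what matters is that the retrieved passages, the user's question, and the "answer only from the context" instruction travel together.

  def build_prompt(question: str, passages: list[str]) -> str:
      context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
      return (
          "Answer the question using only the context below. "
          "If the context is insufficient, say so.\n\n"
          f"Context:\n{context}\n\n"
          f"Question: {question}\n"
          "Answer:"
      )

  prompt = build_prompt(question, retrieved_passages)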

  7. Generation

The LLM receives the assembled prompt and produces a response that is grounded in the retrieved data instead of relying solely on its training knowledge. This is where the pipeline delivers a context-aware, up-to-date answer tailored to the user’s query.
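A sketch of the generation call against the local vLLM OpenAI-compatible server from the Setup section; the port (8000, vLLM's default) and the served model name are assumptions, since both depend on how start_vllm.sh launches the server.

  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

  response = client.chat.completions.create(
      model="llama3-8b-instruct",  # must match the name the vLLM service exposes
      messages=[{"role": "user", "content": prompt}],
      temperature=0.1,             # keep the answer close to the retrieved context
  )
  answer = response.choices[0].message.content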

  8. Post-Processing

After generation, the system may refine the output—such as summarizing it, ensuring it fits a required format, adding citations, or validating JSON structure. This step ensures the final answer meets the needs of the application and adheres to safety or correctness constraints.
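A sketch of one common post-processing step, checking that the model returned well-formed JSON before the answer moves downstream; citation insertion or format enforcement would follow the same pattern.

  import json

  def parse_json_answer(raw: str):
      """Return the parsed JSON value if the model produced valid JSON, else None."""
      try:
          return json.loads(raw)
      except json.JSONDecodeError:
          return None  # caller can re-prompt, retry, or fall back to plain text

  structured = parse_json_answer(answer)
  if structured is None:
      print("Model output was not valid JSON; falling back to raw text.")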

  9. Feedback Loop / Evaluation

A mature RAG system includes ongoing evaluation to measure the accuracy, relevance, and reliability of retrieved context and generated answers. This feedback is used to improve document chunking, retrieval strategies, and prompt design over time.
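A sketch of one simple retrieval metric, hit rate at k, over a hand-labelled evaluation set; the example questions and expected source paths are placeholders, and a fuller evaluation would also score answer faithfulness and relevance.

  # Each item pairs a test question with the source document that should be retrieved.
  eval_set = [
      {"question": "How much VRAM does the GPU have?",
       "expected_source": "/opt/rag/corpus/hardware.md"},
      {"question": "Which service runs the model server?",
       "expected_source": "/opt/rag/corpus/services.md"},
  ]

  def hit_rate_at_k(eval_set, k=4):
      hits = 0
      for item in eval_set:
          qvec = embedder.encode(item["question"], normalize_embeddings=True)
          results = store.search(qvec, k=k)
          sources = {payload["source"] for _, payload in results}
          hits += item["expected_source"] in sources
      return hits / len(eval_set)

  print(f"hit rate@4: {hit_rate_at_k(eval_set):.2f}")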

Hardware

  • GeForce RTX 5090 (32GB VRAM)
  • AMD Ryzen 9 CPU
  • 96 GB RAM

Software

  • Ubuntu 24
  • NVIDIA GPU driver (nvidia-smi)

System Services

  • jupyterlab.service
  • vllm.service

Starting/Stopping

systemctl status [service-name]
sudo systemctl start [service-name]
sudo systemctl stop [service-name]
sudo systemctl restart [service-name]

Models

llama3-8b-instruct

Setup

vLLM Service Setup

Configuration

  • create system user

    sudo useradd --system --create-home --home-dir /opt/vllm --shell /usr/sbin/nologin vllm

    To administer files and run commands as this user (its login shell is disabled):

    1. Add your local user to the vllm group for access to /opt/vllm (log out and back in for the change to take effect):

      sudo usermod -aG vllm $USER

    2. Become the vLLM user (for setup):

      sudo -u vllm -s /bin/bash

  • application directory

    /opt/vllm/
    
  • service

    /etc/systemd/system/vllm.service
    
    [Unit]
    Description=vLLM OpenAI-compatible server
    After=network.target

    [Service]
    Type=simple
    User=vllm
    Group=vllm
    WorkingDirectory=/opt/vllm
    ExecStart=/opt/vllm/start_vllm.sh
    Restart=on-failure
    RestartSec=5

    LimitNOFILE=65535

    [Install]
    WantedBy=multi-user.target
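
Once the unit is enabled and started (sudo systemctl enable --now vllm.service), a quick way to confirm the server is answering is to list the models it serves. This sketch assumes the default vLLM port of 8000; the actual port depends on what start_vllm.sh passes to the server.

  from openai import OpenAI

  # Port 8000 is vLLM's default for its OpenAI-compatible server; adjust to match start_vllm.sh.
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

  for model in client.models.list().data:
      print(model.id)  # should include the served model, e.g. llama3-8b-instruct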