Engineering a Production-Grade RAG Pipeline with Gemini & Qdrant (Design Guide + Code)

Radwan Baba

When building a real Retrieval-Augmented Generation (RAG) pipeline—especially with modern multi-modal models like Gemini—there’s no one-size-fits-all blueprint. Good engineering isn't about copying what others did; it's about asking the right questions to design a system tailored to your use case. This post breaks down those key questions, explains why they matter, and shows how to practically apply them using Gemini and Qdrant.

Our Use Case: Empowering Airport Staff with Intelligent Assistance

To make this guide practical, we will focus on a real-world scenario—an intelligent assistant for airport staff. In high-pressure environments like airports, staff need instant access to accurate information and clear instructions. Our RAG-powered assistant will:

  • Provide accurate, protocol-compliant answers in real-time.
  • Enhance passenger experience with context-aware guidance.
  • Understand passenger context (location, type, mood).
  • Maintain multi-turn conversation memory for seamless interactions.

This guide will take you step-by-step through building a robust, scalable RAG pipeline for this use case.


1. RAG, Fine-Tuning, or CAG?

1.1 Does your domain knowledge change rapidly?

If your field is dynamic, like medicine or law, where guidelines evolve frequently, RAG is ideal. It lets you update the knowledge base without retraining the model. If the domain is relatively static and stable, Fine-Tuning can work well. CAG (Cache-Augmented Generation) is useful in highly repetitive environments with predefined inputs and outputs, like customer support flows.

1.2 Do you need a lightweight, easily updatable model?

Fine-tuned models tend to be large and require retraining for updates. RAG decouples content from the model—just update the documents. CAG offers quick cache hits for known queries but lacks adaptability.

1.3 Do you need traceability for generated answers?

RAG offers transparency by pointing to the exact source document. Fine-Tuning bakes the knowledge into the weights—no traceability. CAG can return exact cached responses but isn’t grounded in document-level traceability.

1.4 Is it critical to avoid misinformation or hallucinations?

Fine-Tuning may introduce or reinforce hallucinated or outdated content. RAG reduces hallucination risk by generating only from retrieved content. CAG avoids hallucinations entirely—but only for known queries. Anything new falls back to the base model or fails silently.

Application (Airport Assistant): Our intelligent assistant uses RAG because airport policies, security protocols, and immigration regulations change frequently. RAG ensures always up-to-date responses without retraining, source-linked answers to maintain trust, and reduced hallucinations by grounding outputs in curated documents. While CAG could serve repeat queries like gate locations, and Fine-Tuning might work for static domains, RAG strikes the right balance of flexibility, transparency, and safety for real-time airport operations.


2. Prompt Engineering vs. Prompt Tuning

2.1 Do you have the infrastructure to maintain and deploy prompt-tuned models?

Prompt tuning requires custom infrastructure and ongoing maintenance, making it a heavier investment. If you lack the infrastructure or technical capacity, prompt engineering (crafting effective input prompts) is a simpler and safer choice.

2.2 Will zero-shot or few-shot prompts provide enough performance?

For many use cases, they will. Zero-shot prompting means the model generates a response without any examples, while few-shot provides limited examples for context. Careful prompt design can achieve high performance without the complexity of prompt tuning.
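To make the distinction concrete, here is a small sketch; the example Q&A pairs are invented purely for illustration and are not part of the assistant's actual prompts.

// Hypothetical zero-shot vs. few-shot prompts for the airport assistant
const zeroShotPrompt = (question) =>
  `You are an airport staff assistant. Answer concisely: ${question}`;

const fewShotPrompt = (question) =>
  [
    "You are an airport staff assistant. Answer concisely.",
    "Q: A passenger lost their boarding pass. What should staff do?",
    "A: Direct them to their airline's check-in desk with photo ID to reprint the pass.",
    "Q: A passenger asks where to find baggage claim.",
    "A: Point them to the baggage claim area shown on the arrivals screens for their flight number.",
    `Q: ${question}`,
    "A:",
  ].join("\n");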

2.3 How often do prompt structures need to change for new scenarios?

If your use cases change frequently, prompt engineering is far more flexible. You can tweak prompts without any retraining.

2.4 Is it critical to maintain consistent output formats (like JSON, bullet points, or summaries)?

Prompt engineering allows you to control the output format with structured instructions, ensuring predictable responses.
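For example, a format instruction can be appended to any prompt; the JSON shape below is purely illustrative.

// Illustrative helper: force a predictable JSON structure via prompt instructions
const withJsonFormat = (basePrompt) =>
  `${basePrompt}\n\nRespond ONLY with valid JSON in this shape:\n` +
  `{"answer": "<short answer>", "steps": ["<step 1>", "<step 2>"], "source": "<document name or 'none'>"}`;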

Application (Airport Assistant): Our assistant relies on prompt engineering because it allows rapid adaptation. New airline policies can be integrated instantly, emergency protocols can be delivered in clear bullet points, and we avoid the complexity of maintaining multiple tuned models.


3. Choosing the Right Vector Database

3.1 What are my performance requirements (latency, throughput)?

If your application needs real-time responses, low latency is critical. Some vector databases are optimized for high-speed retrieval, while others may prioritize scalability over speed.

3.2 Should the database run locally, or can it be cloud-based?

Some databases, like Qdrant and Milvus, offer local deployment, giving you control over data privacy. Others, like Pinecone, are cloud-only, providing managed scaling but relying on an external provider.

3.3 How easily can this database scale as my data grows?

As your knowledge base expands, your vector database must handle more data without losing speed. Look for databases with strong scalability, such as Qdrant, Milvus, and Pinecone.

3.4 Does the database support hybrid search (semantic + keyword)?

Hybrid search combines semantic understanding (meaning-based) with exact keyword matching. This improves the accuracy of responses.
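As a rough sketch of what a hybrid query looks like with Qdrant's Query API (assuming a collection configured with named dense and sparse vectors, which the walkthrough later in this post does not set up):

// Sketch: hybrid (dense + sparse) search with Qdrant's Query API and rank fusion.
// Assumes the collection was created with named vectors "dense" and "sparse".
const hybridSearch = async (client, denseVector, sparseVector) => {
  return client.query("documents", {
    prefetch: [
      { query: denseVector, using: "dense", limit: 20 },
      { query: sparseVector, using: "sparse", limit: 20 }, // { indices: [...], values: [...] }
    ],
    query: { fusion: "rrf" }, // reciprocal rank fusion of both candidate lists
    limit: 5,
    with_payload: true,
  });
};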

3.5 How strong is the ecosystem (SDKs, integrations, community)?

A well-supported database with extensive SDKs, integrations, and active community support ensures faster development and easier troubleshooting.

Application (Airport Assistant): We chose Qdrant for our assistant because it offers fast, local deployment, supports hybrid search for precise responses, scales seamlessly with growing data, and has a strong, supportive community.


4. Designing Your System Architecture

4.1 Do I need synchronous real-time results or batch/offline processing?

Real-time use cases demand fast retrieval and response times, suitable for customer support or live assistance. Batch processing is more cost-effective and scalable for periodic tasks.

4.2 Can retrieval and generation be decoupled for scalability?

Yes, and it is often recommended. By separating these two steps, you can independently scale retrieval (vector search) and generation (LLM responses).

4.3 How do I cache previously seen queries or responses?

Caching saves time and reduces costs by storing answers to frequently asked questions. A fast key-value store like Redis is ideal for this.
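A minimal sketch with ioredis (the key prefix and TTL are assumptions, not part of the pipeline code below):

const Redis = require("ioredis");
const crypto = require("crypto");

const redis = new Redis(process.env.REDIS_URL);
const CACHE_TTL_SECONDS = 60 * 60; // assumed 1-hour freshness window

// Return a cached answer for a query, or compute and store it
const cachedAnswer = async (query, computeAnswer) => {
  const key = "rag:" + crypto.createHash("sha256").update(query.toLowerCase().trim()).digest("hex");
  const hit = await redis.get(key);
  if (hit) return JSON.parse(hit);

  const answer = await computeAnswer(query);
  await redis.set(key, JSON.stringify(answer), "EX", CACHE_TTL_SECONDS);
  return answer;
};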

4.4 Should I separate API layers for retrieval and generation?

Yes, modular APIs allow you to debug, maintain, and scale each layer independently. Retrieval can focus on speed, while generation can focus on text quality.

4.5 How do I handle failures (retrieval returns nothing, LLM crashes)?

Have fallback mechanisms. If retrieval fails, return a default message or retry. If the model crashes, return cached responses or a friendly error message.
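A simple pattern is to wrap each stage in a guard. The helper below is a sketch and not part of the pipeline code later in this post; the retrieve, generate, and getCached callbacks are assumed to be supplied by your own modules.

// Sketch: degrade gracefully when retrieval or generation fails
const answerWithFallbacks = async (question, { retrieve, generate, getCached }) => {
  let context = [];
  try {
    context = await retrieve(question);
  } catch (err) {
    console.error("Retrieval failed, continuing without context:", err);
  }

  if (context.length === 0) {
    return "I couldn't find relevant documentation for that. Please check with a supervisor.";
  }

  try {
    return await generate(question, context);
  } catch (err) {
    console.error("Generation failed, trying cache:", err);
    const cached = await getCached(question);
    return cached || "The assistant is temporarily unavailable. Please try again shortly.";
  }
};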

Application (Airport Assistant): Our architecture is designed for reliability. Qdrant handles fast retrieval, Redis caches frequent queries, and our APIs for retrieval and generation are separate, ensuring scalable and maintainable performance.


5. Data Management & Retrieval

5.1 What’s the format of the data (PDFs, HTML, spreadsheets)?

Different formats require different parsers. You need to extract clean text before chunking or embedding.

5.2 How do I chunk the data (headings, paragraphs, sliding window)?

Chunking affects retrieval quality. Try paragraph-based or sliding windows with overlaps to preserve context.
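For example, a sliding window with overlap keeps neighboring text together; the sizes here are arbitrary, and the walkthrough below uses a simpler character-based splitter.

// Sketch: sliding-window chunking with overlap, measured in characters
const slidingWindowChunks = (text, windowSize = 1500, overlap = 200) => {
  const chunks = [];
  const step = windowSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + windowSize));
    if (start + windowSize >= text.length) break;
  }
  return chunks;
};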

5.3 Do I need multi-hop retrieval (answer from several documents)?

If answers span multiple documents, your retriever should support multi-hop or post-retrieval merging.

5.4 How often does the corpus update—and how will I re-embed it?

Re-embedding is expensive. You can schedule updates or use streaming updates if data changes frequently.

5.5 What’s the quality and trustworthiness of the data?

Garbage in, garbage out. Clean and verify your sources before indexing.

Application (Airport Assistant): Our assistant maintains high data quality by indexing only verified sources (regulations, official SOPs) and updates its knowledge base weekly, with critical updates streamed instantly.


6. LLM Response Control

6.1 Should the model only answer using retrieved content?

This is safer in critical domains. Restricting generation to known content reduces hallucination risk.

6.2 Do I allow fallback to model knowledge if no context is found?

Sometimes yes, sometimes no. It depends on whether you trust the model to "guess."

6.3 How do I instruct the model to ignore irrelevant info?

Use clear system prompts or post-processing filters to reduce noise in outputs.
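One way to phrase such an instruction (the wording is illustrative, not the exact prompt used later in this post):

// Illustrative guardrail instructions prepended to the user prompt
const GROUNDING_RULES = [
  "Answer ONLY from the provided context passages.",
  "If the context does not contain the answer, say you don't know and suggest contacting a supervisor.",
  "Ignore passages that are unrelated to the question, even if they appear in the context.",
].join(" ");

const groundedPrompt = (question, contextPassages) =>
  `${GROUNDING_RULES}\n\nContext:\n${contextPassages}\n\nQuestion: ${question}`;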

6.4 Should responses be extractive, abstractive, or a mix?

Extractive means copying text directly. Abstractive means rewording. A mix is often ideal — depending on your use case.

Application (Airport Assistant): Our assistant prioritizes accurate, context-driven answers. It strictly uses retrieved content for regulatory questions, allows fallback for general queries, and uses a mix of extractive and abstractive responses for clarity.


Step-by-Step Code Walkthrough

We’re building a production-ready RAG pipeline using Node.js, Qdrant for vector storage, and Gemini for language generation and embedding. This system allows us to parse documents (like airport protocols), chunk and embed them, index the vectors in Qdrant, and serve real-time answers by retrieving relevant context and generating tailored responses using Gemini. Here’s the full code workflow:


System Architecture


1. Qdrant Setup & Collection Handling

We use Qdrant to store document embeddings. This setup initializes the client, ensures the collection exists, and provides functions to insert and query vectors.

const { QdrantClient } = require("@qdrant/js-client-rest");
const { v4: uuidv4 } = require("uuid");

// Initialize Qdrant client
const client = new QdrantClient({ url: process.env.QDRANT_URL });

// Ensure collection "documents" exists
const ensureCollectionExists = async () => {
  try {
    const check = await client.collectionExists("documents");
    if (!check.exists) {
      await client.createCollection("documents", {
        vectors: { size: 3072, distance: "Cosine" },
      });
    }
  } catch (error) {
    console.error("Error creating collection:", error);
    throw error;
  }
};

// Insert vectors into Qdrant
const insertVectors = async (documentId, embeddings, chunks) => {
  await ensureCollectionExists();
  const points = chunks.map((chunk, i) => ({
    id: uuidv4(),
    vector: embeddings[i],
    payload: { text: chunk, document_id: documentId },
  }));

  try {
    await client.upsert("documents", { wait: true, points });
  } catch (error) {
    console.error("Insert failed:", error);
    throw error;
  }
};

// Search top 3 results from Qdrant
const searchVectors = async (queryEmbedding) => {
  await ensureCollectionExists();
  try {
    return await client.query("documents", {
      query: queryEmbedding,
      limit: 3,
      with_payload: true,
    });
  } catch (error) {
    console.error("Search error:", error);
    throw error;
  }
};


2. Document Parsing and Chunking

We extract clean text from PDF files and split it into chunks of roughly 1,000–1,500 characters to improve retrieval accuracy and context preservation.

const fs = require("fs");
const pdf = require("pdf-parse");

// Parse and chunk PDF content
const extractTextFromPDF = async (filePath) => {
  try {
    const dataBuffer = fs.readFileSync(filePath);
    const pdfData = await pdf(dataBuffer);
    if (!pdfData.text) throw new Error("Empty PDF");
    return chunkText(pdfData.text);
  } catch (error) {
    console.error("PDF parsing failed:", error);
    throw error;
  }
};

// Chunk the cleaned text into pieces of up to ~1,500 characters, splitting on whitespace
const chunkText = (text) => {
  const clean = text.replace(/(\r\n|\n|\r)+/g, " ").replace(/\s+/g, " ").trim();
  const maxSize = 1500;
  const chunks = clean.match(new RegExp(`.{1,${maxSize}}(\\s|$)`, "g"));
  return chunks || [];
};

3. Embedding Generation with Gemini

Each chunk is embedded with Gemini’s embedding API through the @google/genai Node SDK. These embeddings are used for similarity search in Qdrant.

// The current Google Gen AI SDK for Node ships as "@google/genai";
// embedding and generation calls go through genAI.models.*
const { GoogleGenAI } = require("@google/genai");
const genAI = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Generate embeddings for text chunks
const getEmbeddings = async (chunks) => {
  const embeddings = [];
  for (const chunk of chunks) {
    const response = await genAI.models.embedContent({
      model: "models/gemini-embedding-exp-03-07",
      contents: chunk,
      config: { taskType: "RETRIEVAL_DOCUMENT" },
    });
    // embedContent returns one embedding per input; keep its raw vector values
    embeddings.push(response.embeddings[0].values);
  }
  return embeddings;
};
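Putting the previous three pieces together, a small indexing script might look like this. The file path and document ID are placeholders, and the functions are assumed to be exported from the modules above.

// Sketch: index one PDF end-to-end — parse, chunk, embed, upsert
const indexDocument = async (filePath, documentId) => {
  const chunks = await extractTextFromPDF(filePath);   // parse + chunk
  const embeddings = await getEmbeddings(chunks);      // one vector per chunk
  await insertVectors(documentId, embeddings, chunks); // store in Qdrant
  console.log(`Indexed ${chunks.length} chunks for document ${documentId}`);
};

// indexDocument("./docs/security-protocols.pdf", "security-protocols-v1");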

4. Advanced Prompt Engineering with Gemini

We dynamically craft prompts with user message, context, and prior conversation. Prompts are tailored for either protocol adherence or guest experience enhancement.

// Build Context-Aware Prompt for Gemini
const getPrompt = (msg, type, context, documents, conversationHistory) => {
  let contextInfo = "";

  if (context) {
    const { mood, location, profile } = context;
    if (profile) contextInfo += `Guest profile: ${profile}. `;
    if (location) contextInfo += `Location: ${location}. `;
    if (mood) contextInfo += `Mood: ${mood}. `;
  }

  if (type === "protocol") {
    return `This is an airport scenario. Provide protocol steps for: "${msg}". Context: ${documents} Conversation History: ${conversationHistory} ${contextInfo} Keep it under 300 words.`;
  } else if (type === "experience") {
    return `This is an airport scenario. Help staff respond to: "${msg}". Focus on improving guest experience. Context: ${documents} Conversation History: ${conversationHistory} ${contextInfo} Respond in a friendly manner using Markdown, with clear actions. Keep it under 100 words.`;
  }

  throw new Error("Invalid prompt type");
};

5. Querying Gemini with RAG

This function retrieves relevant content from Qdrant, builds two prompts (protocol and experience), and queries Gemini for both in parallel. Conversation history is read from a messages table via a db module, assumed here to be a mysql2/promise connection pool.

// Query Gemini with Retrieved Context
const queryGemini = async (msg, context, interactionId) => {
  const queryEmbedding = await getEmbeddings([msg]);
  const results = await searchVectors(queryEmbedding[0]);

  const documents = results.points.map((point) => point.payload.text).join("\n\n");
  const conversationHistory = await getLastMessages(interactionId, 12);

  const protocolPrompt = getPrompt(msg, "protocol", context, documents, conversationHistory);
  const experiencePrompt = getPrompt(msg, "experience", context, documents, conversationHistory);

  // Both prompts reuse the Gemini client created in the embedding module
  const [protocolResponse, experienceResponse] = await Promise.all([
    genAI.models.generateContent({
      model: "models/gemini-2.5-pro-preview-05-06",
      contents: [{ role: "user", parts: [{ text: protocolPrompt }] }],
    }),
    genAI.models.generateContent({
      model: "models/gemini-2.5-pro-preview-05-06",
      contents: [{ role: "user", parts: [{ text: experiencePrompt }] }],
    }),
  ]);

  return {
    protocol: protocolResponse.text.trim(),
    experience: experienceResponse.text.trim(),
  };
};

// "db" is assumed to be an existing mysql2/promise connection pool exported by a local module
const db = require("./db");

const getLastMessages = async (interactionId, limit = 6) => {
  // The limit is coerced to an integer before interpolation to keep the query injection-safe
  const sql = `
    SELECT senderType, message
    FROM messages
    WHERE interactionId = ?
    ORDER BY sendTime DESC
    LIMIT ${Number(limit) || 6}
  `;
  const [rows] = await db.execute(sql, [interactionId]);

  // Reverse to chronological order
  return rows.reverse().map((row) => `${row.senderType}: ${row.message}`).join("\n");
};
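Finally, to expose the pipeline as an API, a thin Express layer can sit on top of queryGemini. The route below is a sketch: the endpoint name, request shape, and error handling are assumptions rather than part of the code above.

const express = require("express");
const app = express();
app.use(express.json());

// Sketch: a thin generation endpoint on top of queryGemini.
// The request body { message, context, interactionId } is an assumed shape for this example.
app.post("/api/assist", async (req, res) => {
  const { message, context, interactionId } = req.body;
  if (!message) return res.status(400).json({ error: "message is required" });

  try {
    const answer = await queryGemini(message, context, interactionId);
    res.json(answer); // { protocol, experience }
  } catch (error) {
    console.error("Assist endpoint failed:", error);
    res.status(500).json({ error: "Assistant temporarily unavailable. Please try again." });
  }
});

app.listen(process.env.PORT || 3000);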

Takeaways

  • RAG with Gemini is ideal for dynamic, real-time applications.
  • Prompt engineering keeps responses accurate, fast, and flexible.
  • Clean data and structured retrieval are key to system reliability.
