
Deploy a RAG Pipeline with pgvector

Tags: rag, pgvector, embeddings, postgres, python

Retrieval-augmented generation (RAG) combines a vector database with an LLM to answer questions using your own data. Instead of relying only on the model's training data, the system retrieves relevant documents from a vector store and includes them in the prompt.

This guide covers deploying a RAG pipeline on Railway using Postgres with the pgvector extension for vector storage, an external embedding API for generating vectors, and an external LLM API for generation.

Railway is a CPU-based platform. Both embedding generation and text generation call external APIs (OpenAI, Cohere, etc.) over HTTP. No models run locally.

Architecture overview

The pipeline has three components:

  • Postgres with pgvector stores document chunks and their embedding vectors. Deployed from Railway's pgvector template.
  • API service accepts queries, retrieves relevant documents via similarity search, and calls the LLM with the retrieved context.
  • Ingestion script (or endpoint) chunks documents, generates embeddings via an external API, and stores them in Postgres.

Prerequisites

  • A Railway account
  • An API key from OpenAI (for both embeddings and chat completions) or separate embedding and LLM providers
  • Documents to ingest (text files, markdown, or any text content)

1. Deploy Postgres with pgvector

Railway's standard Postgres image does not include pgvector. Use the pgvector template instead:

  1. Click the button below to deploy Postgres with pgvector:

    Deploy on Railway

  2. After the template deploys, note the DATABASE_URL connection string from the service's Variables tab.

2. Create the vector table

Connect to your pgvector Postgres instance and create a table for document chunks and their embeddings:
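
A minimal schema might look like the following (the `documents` table and its column names are illustrative; adjust them to your data):

```sql
-- Enable the extension (available in the pgvector template image)
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        BIGSERIAL PRIMARY KEY,
    source    TEXT NOT NULL,           -- originating file or URL
    content   TEXT NOT NULL,           -- the chunk text
    embedding vector(1536) NOT NULL    -- one embedding per chunk
);

-- HNSW index for fast approximate nearest-neighbor search (cosine distance)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
```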

The vector(1536) type matches OpenAI's text-embedding-3-small output dimension. Adjust the dimension if you use a different embedding model.

The HNSW index provides fast approximate nearest-neighbor search. For datasets under 100,000 rows, an IVFFlat index is also viable and uses less memory during index creation.
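
If you opt for IVFFlat on a smaller dataset, the index creation might look like this (assuming the `documents` table above; the `lists` value is a tunable, often set near the square root of the expected row count):

```sql
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
```

Note that IVFFlat builds best after the table already contains data, since it clusters existing rows to form the index lists.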

3. Set up the project

Create a project directory with the following files:

requirements.txt
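
The contents of requirements.txt might look like this (psycopg2-binary is one common Postgres driver choice; pin versions as your project requires):

```
fastapi
uvicorn
openai
psycopg2-binary
```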

Install dependencies locally with pip install -r requirements.txt.

4. Build the ingestion pipeline

The ingestion step chunks your documents, generates embeddings, and inserts them into Postgres:
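
A sketch of ingest.py, assuming the `documents` table from step 2 and OpenAI's embedding API (the chunk size, overlap, and batch-in-one-call approach are illustrative choices, not the only way to do it):

```python
import os
import sys

CHUNK_SIZE = 1000  # characters per chunk
OVERLAP = 200      # characters shared between consecutive chunks


def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = OVERLAP) -> list[str]:
    """Split text into overlapping character windows."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so adjacent chunks share context
    return chunks


def main(path: str) -> None:
    # Third-party clients are imported here so the chunking helper
    # stays importable without these packages installed.
    import psycopg2
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, encoding="utf-8") as f:
        chunks = chunk_text(f.read())

    # Embed all chunks in a single batched API call.
    response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    embeddings = [item.embedding for item in response.data]

    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with conn, conn.cursor() as cur:
        for chunk, emb in zip(chunks, embeddings):
            cur.execute(
                "INSERT INTO documents (source, content, embedding) "
                "VALUES (%s, %s, %s::vector)",
                (path, chunk, str(emb)),  # pgvector accepts '[f1, f2, ...]' text input
            )
    conn.close()
    print(f"Ingested {len(chunks)} chunks from {path}")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Character-based chunking keeps the example simple; token-aware chunking tracks the embedding model's limits more precisely if you need it.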

Run locally with python ingest.py my_document.txt, or run it on Railway using the CLI: railway run python ingest.py my_document.txt. For ongoing ingestion at scale, consider the async workers pattern.

5. Build the query endpoint

The query endpoint embeds the user's question, finds similar document chunks, and sends them to the LLM as context:

6. Deploy the API service

  1. Push your code to a GitHub repository.
  2. In your Railway project (the same one with pgvector), click + New > GitHub Repo and select your repository.
  3. Set the start command to: uvicorn app:app --host 0.0.0.0 --port $PORT
  4. Add environment variables:
    • Reference DATABASE_URL from your pgvector service.
    • Set OPENAI_API_KEY to your API key.
  5. Generate a public domain under Settings > Networking.

The API service communicates with Postgres over private networking automatically since both services are in the same project.

Performance considerations

  • Embedding API costs: OpenAI's text-embedding-3-small costs $0.02 per million tokens. Cache embeddings in Postgres to avoid re-embedding the same content.
  • Query latency: The embedding API call adds 100-300ms per query. The pgvector similarity search is fast (single-digit milliseconds for datasets under 1M rows with an HNSW index).
  • Scaling: For high query volume, add horizontal replicas to the API service. All replicas share the same Postgres instance.

Next steps