Deploy a RAG Pipeline with pgvector
Retrieval-augmented generation (RAG) combines a vector database with an LLM to answer questions using your own data. Instead of relying only on the model's training data, the system retrieves relevant documents from a vector store and includes them in the prompt.
This guide covers deploying a RAG pipeline on Railway using Postgres with the pgvector extension for vector storage, an external embedding API for generating vectors, and an external LLM API for generation.
Railway is a CPU-based platform. Both embedding generation and text generation call external APIs (OpenAI, Cohere, etc.) over HTTP. No models run locally.
Architecture overview
The pipeline has three components:
- Postgres with pgvector stores document chunks and their embedding vectors. Deployed from Railway's pgvector template.
- API service accepts queries, retrieves relevant documents via similarity search, and calls the LLM with the retrieved context.
- Ingestion script (or endpoint) chunks documents, generates embeddings via an external API, and stores them in Postgres.
Prerequisites
- A Railway account
- An API key from OpenAI (for both embeddings and chat completions) or separate embedding and LLM providers
- Documents to ingest (text files, markdown, or any text content)
1. Deploy Postgres with pgvector
Railway's standard Postgres image does not include pgvector. Use the pgvector template instead:
- Click the button below to deploy Postgres with pgvector:
- After the template deploys, note the `DATABASE_URL` connection string from the service's Variables tab.
2. Create the vector table
Connect to your pgvector Postgres instance and create a table for document chunks and their embeddings:
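A minimal schema sketch (the `documents` table name and the cosine-distance opclass are assumptions, not fixed by pgvector):

```sql
-- Enable the extension (the pgvector template ships it, but this is idempotent)
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        BIGSERIAL PRIMARY KEY,
    content   TEXT NOT NULL,
    embedding vector(1536)
);

-- Approximate nearest-neighbor index using cosine distance
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
```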
The `vector(1536)` type matches the output dimension of OpenAI's `text-embedding-3-small`. Adjust the dimension if you use a different embedding model.
The HNSW index provides fast approximate nearest-neighbor search. For datasets under 100,000 rows, an IVFFlat index is also viable and uses less memory during index creation.
3. Set up the project
Create a project directory with the following files:
requirements.txt
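A plausible minimal dependency list, assuming the OpenAI Python SDK and psycopg2 for Postgres access (pin versions as needed):

```
fastapi
uvicorn
psycopg2-binary
openai
```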
Install dependencies locally with `pip install -r requirements.txt`.
4. Build the ingestion pipeline
The ingestion step chunks your documents, generates embeddings, and inserts them into Postgres:
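A sketch of `ingest.py` under stated assumptions: fixed-size character chunking (sizes are illustrative), the OpenAI embeddings API, and a `documents` table with a `vector(1536)` column.

```python
import os
import sys

EMBED_MODEL = "text-embedding-3-small"  # 1536-dimensional output


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping fixed-size character chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so adjacent chunks overlap
    return chunks


def main(path: str) -> None:
    # Imported here so chunk_text stays usable without these packages installed.
    import psycopg2
    from openai import OpenAI

    with open(path, encoding="utf-8") as f:
        chunks = chunk_text(f.read())

    # One batched API call embeds every chunk.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.embeddings.create(model=EMBED_MODEL, input=chunks)

    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with conn, conn.cursor() as cur:
        for chunk, item in zip(chunks, resp.data):
            # pgvector accepts a '[x1,x2,...]' text literal for vector columns.
            vec = "[" + ",".join(map(str, item.embedding)) + "]"
            cur.execute(
                "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
                (chunk, vec),
            )
    conn.close()


if __name__ == "__main__":
    main(sys.argv[1])
```

Character-based chunking is the simplest strategy; swap in sentence- or token-aware chunking if your documents have meaningful structure.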
Run locally with `python ingest.py my_document.txt`, or run it on Railway using the CLI: `railway run python ingest.py my_document.txt`. For ongoing ingestion at scale, consider the async workers pattern.
5. Build the query endpoint
The query endpoint embeds the user's question, finds similar document chunks, and sends them to the LLM as context:
6. Deploy the API service
- Push your code to a GitHub repository.
- In your Railway project (the same one with pgvector), click + New > GitHub Repo and select your repository.
- Set the start command to `uvicorn app:app --host 0.0.0.0 --port $PORT`.
- Add environment variables:
  - Reference `DATABASE_URL` from your pgvector service.
  - Set `OPENAI_API_KEY` to your API key.
- Generate a public domain under Settings > Networking.
The API service communicates with Postgres over private networking automatically since both services are in the same project.
Performance considerations
- Embedding API costs: OpenAI's `text-embedding-3-small` costs $0.02 per million tokens. Cache embeddings in Postgres to avoid re-embedding the same content.
- Query latency: The embedding API call adds 100-300ms per query. The pgvector similarity search itself is fast (single-digit milliseconds for datasets under 1M rows with an HNSW index).
- Scaling: For high query volume, add horizontal replicas to the API service. All replicas share the same Postgres instance.
Next steps
- Deploy an AI Chatbot with Streaming Responses: Add a chat UI with streaming on top of your RAG pipeline.
- Deploy an AI Agent with Async Workers: Process large document ingestion jobs asynchronously.
- PostgreSQL on Railway: Connection pooling, backups, and configuration.
- Scaling: Configure horizontal and vertical scaling for your services.