Stream AI Responses to a Frontend with Server-Sent Events
AI applications stream tokens from an LLM API to the browser as they are generated. Server-Sent Events (SSE) is the standard protocol for this because it works over plain HTTP and the browser handles reconnection automatically.
This guide covers how to build and deploy an AI streaming endpoint on Railway, with both server and client implementations.
Architecture
The streaming flow has three parts:
- The frontend sends a prompt to your API service.
- The API service calls an LLM API (OpenAI, Anthropic) with streaming enabled.
- The API forwards each token chunk to the frontend as an SSE event.
The frontend displays tokens as they arrive, giving the user immediate feedback instead of waiting for the full response.
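On the wire, each chunk is one SSE event: a `data:` line followed by a blank line. The JSON payload shape below is an illustrative convention (the SSE spec only defines the framing, not the payload):

```text
data: {"token":"Hel"}

data: {"token":"lo"}

data: [DONE]
```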
Server implementation
Express
Hono
Key details:
- `X-Accel-Buffering: no` prevents reverse proxies from buffering the stream. Without it, the client may receive all tokens at once instead of progressively.
- `req.on('close')` detects when the client disconnects. Aborting the LLM stream avoids paying for tokens the user will never see.
- Bind to `0.0.0.0` so Railway can route traffic to your service.
Client implementation
The browser's `EventSource` API only supports GET requests. Since chat endpoints typically use POST (to send a message body), use `fetch` with a `ReadableStream` instead:
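A sketch of the client side, assuming the server emits `data: {"token": ...}` frames terminated by a `data: [DONE]` sentinel (the payload format is whatever your server chooses to send):

```typescript
// Split a buffered chunk of SSE text into complete `data:` payloads,
// returning any trailing incomplete frame so it can be carried over.
function parseSSEChunk(buffer: string): { payloads: string[]; rest: string } {
  const frames = buffer.split("\n\n");
  const rest = frames.pop() ?? "";
  const payloads = frames
    .filter((f) => f.startsWith("data: "))
    .map((f) => f.slice("data: ".length));
  return { payloads, rest };
}

async function streamChat(prompt: string, onToken: (token: string) => void) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  if (!res.body) throw new Error("No response body");

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    const { payloads, rest } = parseSSEChunk(buffer);
    buffer = rest;
    for (const payload of payloads) {
      if (payload === "[DONE]") return;
      onToken(JSON.parse(payload).token);
    }
  }
}
```

From a UI, call it with a callback that appends each token to state or to a DOM node, e.g. `streamChat(prompt, (t) => { output.textContent += t; })`.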
Using the Vercel AI SDK
The Vercel AI SDK provides higher-level React hooks for streaming. It works with any hosting provider, not just Vercel:
This requires the server endpoint to use the AI SDK's `streamText` response format. See the AI SDK documentation for server-side setup.
Deploy on Railway
SSE streaming works on Railway without special configuration.
- Deploy your API service from GitHub or the CLI.
- Set `ANTHROPIC_API_KEY` (or `OPENAI_API_KEY`) as a service variable.
- Generate a domain for your service.
Ensure your application binds to `0.0.0.0` and reads the port from the `PORT` environment variable.
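A minimal sketch with Node's built-in `http` module; the same two details apply to `app.listen` in Express or `serve` in Hono's Node adapter:

```typescript
import http from "node:http";

// Railway injects PORT at runtime; fall back to 3000 for local development.
const port = Number(process.env.PORT) || 3000;

const server = http.createServer((req, res) => {
  res.end("ok");
});

// Bind to all interfaces so Railway's proxy can reach the service.
server.listen(port, "0.0.0.0");
```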
Railway constraints
Maximum request duration is 15 minutes. Most LLM streaming responses complete in seconds, so this limit rarely applies. For agent workflows that run for minutes, use the async worker pattern instead: the API enqueues the task, returns a job ID, and the client polls for results. See Deploy an AI Agent with Async Workers.
Common pitfalls
Tokens arrive all at once instead of streaming. This happens when something buffers the response. Check for:
- Compression middleware (e.g., `express-compression`) applied to SSE routes. Disable it for streaming endpoints or set `X-Accel-Buffering: no`.
- A CDN or caching layer between the client and Railway. Bypass it for streaming routes.
Client disconnects are not detected. If you do not listen for the close event on the request, the server continues calling the LLM API after the user navigates away. This wastes tokens and API credits.
CORS errors when calling from a separate frontend. If your frontend and API are on different domains, configure CORS headers on the API service. In Express: `app.use(cors({ origin: 'https://your-frontend.railway.app' }))`.
Next steps
- Choose between SSE and WebSockets - When to use SSE vs WebSockets for real-time features.
- Deploy an AI Agent with Async Workers - Handle long-running AI tasks that exceed the 15-minute limit.
- Manage environment variables - Configure API keys and URLs across services.
- Private Networking - Connect frontend and API services internally.