Troubleshooting Slow Deployments and Applications

Q: Is it Railway or my app?

See the answer in the documentation at https://docs.railway.com/deployments/troubleshooting/slow-deployments#is-it-railway-or-my-app

When your deployment takes longer than expected or your application feels slow, it helps to understand what's happening behind the scenes. This guide walks you through Railway's deployment process, how to identify where slowdowns occur, and what you can do about them.

Understanding deployment phases

Every deployment on Railway goes through several distinct phases. Knowing what each phase does helps you pinpoint where a delay occurs.

Phase overview

Phase	What Happens	Typical Duration
Initialization	Railway takes a snapshot of your code	Seconds
Build	Your code is built into a container image	1-10+ minutes
Pre-Deploy	Dependencies are checked and volumes are migrated if needed	Seconds to minutes
Deploy	Container is created and started	30 seconds to 2 minutes
Network	Healthchecks run (if configured)	Up to 5 minutes (configurable)
Post-Deploy	Previous deployment is drained and removed	Seconds

Detailed phase breakdown

Initialization (snapshot code)

Railway captures a snapshot of your source code. This is typically fast unless you have an unusually large repository or many files.

Build phase

The build phase is often the longest part of a deployment. Railway uses Railpack (or a Dockerfile if present) to build your application into a container image.

Common causes of slow builds:

Large dependency trees (many npm packages, Python dependencies, etc.)
No build caching (first build or cache invalidation)
Compiling native extensions
Large assets being processed

Tip: Check the build logs to see which steps are taking the longest.

Pre-deploy

This phase handles:

Waiting for dependencies: If your service depends on another service that's also deploying, Railway waits for that service to be ready.
Volume migration: If you changed your service's region and it has a volume attached, the volume data must be migrated. This can take a significant amount of time, depending on the volume size.

Deploy (creating containers)

This phase involves:

Pulling the container image to the compute node
Creating the container with your configuration
Mounting volumes if configured
Starting your application

Large container images take longer to pull. Railway caches images on compute nodes when possible, but the first deployment to a new node requires a full pull.

Network (healthchecks)

If you have a healthcheck configured, Railway queries your healthcheck endpoint until it receives an HTTP 200 response. The default timeout is 300 seconds (5 minutes).

If your application takes time to:

Initialize database connections
Load large files into memory
Warm up caches

...the healthcheck phase will reflect that startup time.
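
To measure how long your app takes to become healthy before you deploy, you can approximate the polling behavior described above on your own machine. This is a minimal sketch, not Railway's actual implementation: the `wait_until_healthy` helper, the 1-second polling interval, and the injectable `probe` function are assumptions made for illustration. Only the HTTP 200 success condition and the 300-second default timeout come from the text above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_probe(url: str) -> int:
    """Return the HTTP status code for a GET to url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a non-2xx status (e.g. 503 while starting).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: the app is not listening yet.
        return 0


def wait_until_healthy(
    url: str,
    timeout_s: float = 300.0,   # mirrors the default healthcheck timeout
    interval_s: float = 1.0,    # assumed polling interval
    probe: Callable[[str], int] = http_probe,
) -> Optional[float]:
    """Poll url until it returns 200; return elapsed seconds, or None on timeout."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if probe(url) == 200:
            return time.monotonic() - start
        time.sleep(interval_s)
    return None
```

Calling `wait_until_healthy("http://localhost:8080/health")` right after starting your app locally (the port and path here are placeholders) gives a rough figure for how much of the healthcheck phase is your own startup time.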

Post-deploy (drain instances)

Railway stops and removes the previous deployment. By default, old deployments are given 0 seconds to shut down gracefully. You can lengthen this drain window with RAILWAY_DEPLOYMENT_DRAINING_SECONDS.
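
A drain window is only useful if your application reacts to the shutdown request. The sketch below shows one way a worker loop could finish its current job and then exit. It assumes the old container receives SIGTERM before it is removed, which this guide does not state, so verify that against the deployment documentation. The names `run_worker` and `process_one_job` are hypothetical.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record that shutdown was requested; the loop below exits after the current job."""
    shutdown_requested.set()


def install_handlers() -> None:
    """Register the handler. Call this once from the main thread at startup."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(process_one_job, idle_wait_s: float = 0.1) -> int:
    """Run jobs until shutdown is requested; return how many jobs completed.

    process_one_job is a hypothetical callable that returns True if it did
    work and False if there was nothing to do.
    """
    completed = 0
    while not shutdown_requested.is_set():
        if process_one_job():
            completed += 1
        else:
            # Sleep briefly, but wake up immediately if shutdown is requested.
            shutdown_requested.wait(idle_wait_s)
    return completed
```

With the default of 0 seconds there is no time for a handler like this to finish work, so pair it with a RAILWAY_DEPLOYMENT_DRAINING_SECONDS value at least as long as your slowest job.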

Is it Railway or my app?

Before diving into optimization, determine whether the slowness is on Railway's side or within your application. In the vast majority of cases, performance issues originate in the application itself rather than the platform, typically from inefficient queries, resource constraints, or configuration problems.

Check Railway status

Visit status.railway.com to see if there are any ongoing incidents or degraded performance affecting the platform. If there's a platform-wide issue, it will be reported here. If status shows all systems operational, the issue is almost certainly within your application or its dependencies.

Check build logs

Build logs show output from the build phase (installing dependencies, compiling code, creating the container image). The deployment view shows each phase with timing information.

Look for:

Dependency installation steps that take disproportionately long
Cache misses causing full rebuilds
Large assets being processed

Check deployment logs

Deployment logs show your application's stdout/stderr while it's running. These help diagnose runtime issues that occur after your app starts.

Look for:

Database connection errors or timeouts
Slow query warnings
Application exceptions or errors
Healthcheck failures

Check your application metrics

Railway provides metrics for CPU, memory, and network usage. High resource usage can indicate:

Your application is resource-constrained
Inefficient code paths
Memory leaks causing garbage collection pressure

For deeper insights, consider integrating an Application Performance Monitoring (APM) tool like Datadog, New Relic, or open-source alternatives like OpenTelemetry. APM tools provide distributed tracing, helping you identify slow database queries, external API calls, and bottlenecks that Railway's built-in metrics don't capture.

Analyze HTTP logs

Railway captures detailed HTTP request logs for every request to your service. These logs are invaluable for identifying slow endpoints and understanding request patterns. For complete documentation on log features and filtering syntax, see the Logs guide.

Key fields for performance troubleshooting:

Field	Description
`totalDuration`	Total time from request received to response sent (ms)
`upstreamRqDuration`	Time your application took to respond (ms)
`httpStatus`	Response status code
`path`	Request path to identify which endpoints are slow
`responseDetails`	Error details if the request failed
`txBytes` / `rxBytes`	Response and request sizes

Finding slow requests:

Use the log filter syntax to find requests exceeding a duration threshold, such as all requests with a totalDuration above 1000 ms (1 second). You can combine filters, for example on path or httpStatus, to narrow the results.

Understanding the timing fields:

totalDuration includes everything: network time to/from the edge, time in the proxy, and your application's response time
upstreamRqDuration is specifically how long your application took to respond

If totalDuration is high but upstreamRqDuration is low, the latency is in the network path (edge routing, DNS). If upstreamRqDuration is high, the slowness is in your application.
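
The comparison above can be sketched as a small classifier. This assumes you have HTTP log entries available as dictionaries keyed by the field names in the table. The function name, the 1000 ms threshold, and the 50% cutoff used to decide "mostly application time" are illustrative assumptions, not Railway defaults.

```python
from typing import Any, Dict

SLOW_THRESHOLD_MS = 1000  # assumed threshold: 1 second


def classify_slow_request(entry: Dict[str, Any], app_share: float = 0.5) -> str:
    """Label a log entry as 'ok', 'application', or 'network'.

    'application' means most of the total time was spent waiting on the app
    (upstreamRqDuration); 'network' means the app answered quickly and the
    remaining time was spent elsewhere in the request path.
    """
    total = entry["totalDuration"]
    upstream = entry["upstreamRqDuration"]
    if total <= SLOW_THRESHOLD_MS:
        return "ok"
    if upstream >= total * app_share:
        return "application"
    return "network"


sample = [
    {"path": "/api/report", "totalDuration": 1800, "upstreamRqDuration": 1700},
    {"path": "/api/ping", "totalDuration": 1800, "upstreamRqDuration": 40},
    {"path": "/", "totalDuration": 120, "upstreamRqDuration": 100},
]
labels = [classify_slow_request(entry) for entry in sample]
```

For the three sample entries, `labels` comes out as `["application", "network", "ok"]`: the first request is slow inside the app, the second is slow somewhere in the network path, and the third is under the threshold.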

Identifying error patterns:

Filter on httpStatus to find failing requests. Then check responseDetails for specific error information, and upstreamErrors for details about connection failures to your application.
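
Once you have a set of failing requests, grouping them by endpoint and status code usually shows whether one route is responsible. The sketch below assumes log entries are available as dictionaries with the `path` and `httpStatus` fields from the table above. The function name and the default cutoff of 500 are illustrative choices.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def top_failing_endpoints(
    entries: Iterable[Dict[str, Any]],
    min_status: int = 500,
    limit: int = 5,
) -> List[Tuple[Tuple[str, int], int]]:
    """Count failures per (path, status) pair and return the most frequent ones."""
    counts = Counter(
        (entry["path"], entry["httpStatus"])
        for entry in entries
        if entry["httpStatus"] >= min_status
    )
    return counts.most_common(limit)


sample = [
    {"path": "/api/orders", "httpStatus": 502},
    {"path": "/api/orders", "httpStatus": 502},
    {"path": "/api/users", "httpStatus": 500},
    {"path": "/api/users", "httpStatus": 200},
]
worst = top_failing_endpoints(sample)
```

Here `worst` lists `/api/orders` with status 502 first (two failures), followed by `/api/users` with status 500 (one failure). The successful request is ignored.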

Test locally

If your app is slow on Railway but fast locally, consider:

Are you hitting external services with higher latency?
Are you using the correct region for your database?
Is your application configured to use private networking?

Common causes of slow applications

Database queries

Slow database queries are one of the most common causes of application latency.

Symptoms:

API endpoints that used to be fast are now slow
Timeouts on specific operations
High CPU on your database service

Solutions:

Add database indexes for frequently queried columns
Use connection pooling
Review slow query logs
Consider read replicas for read-heavy workloads
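
To see what an index changes, compare the query plan before and after creating one. The sketch below uses Python's built-in SQLite as a stand-in for your production database, so it runs without any setup. The `users` table, the `idx_users_email` index, and the sample data are made up for illustration, and your database's plan output will look different.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's plan description for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

lookup = "SELECT * FROM users WHERE email = ?"
args = ("user500@example.com",)

before = query_plan(conn, lookup, args)   # full table scan: every row is examined
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = query_plan(conn, lookup, args)    # index search on idx_users_email

print("before:", before)
print("after: ", after)
```

The exact wording of the plan varies by SQLite version, but the first plan describes a scan of the whole table and the second names the new index. Most production databases offer an equivalent command for inspecting plans.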

Wrong region configuration

If your application is in one region but your database is in another, every query incurs geographic latency as traffic travels between regions on Railway's network.

Symptoms:

Consistently high latency on all database operations (typically 50-150ms+ per query depending on distance)

Solutions:

Deploy your application in the same region as your database
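
The cost adds up quickly because the penalty applies per query. As a rough worked example, assuming a hypothetical endpoint that runs 12 queries one after another and the 50-150 ms range quoted above:

```python
def cross_region_overhead_ms(sequential_queries: int, per_query_ms: float) -> float:
    """Extra latency added to one request when every query crosses regions."""
    return sequential_queries * per_query_ms


# Hypothetical endpoint issuing 12 sequential queries.
low = cross_region_overhead_ms(12, 50)    # 600 ms at the low end of the range
high = cross_region_overhead_ms(12, 150)  # 1800 ms at the high end
```

This simple model ignores queries that run in parallel, but it shows why an endpoint can gain more than a second of latency without any change to the queries themselves.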

Not using private networking

If services within the same project communicate over the public internet instead of private networking, you add unnecessary latency and incur egress costs. Private networking is for server-to-server communication only. It won't work for requests originating from a user's browser.

Symptoms:

Using public URLs (e.g., your-app.up.railway.app) for inter-service communication
Connection strings using public hostnames
Unexpectedly high network egress charges on your bill

Solutions:

Use *.railway.internal hostnames for service-to-service communication
Update connection strings to use private networking addresses and ports
For frontend applications that need to call backend APIs, use private networking from your server-side code (API routes, SSR) while keeping public URLs for client-side browser requests

Example:
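
A minimal sketch of choosing the right address depending on where the call originates. The environment variable names, the service name `api`, and port `3000` are hypothetical. Substitute your own service's private hostname and the port it listens on.

```python
import os
from typing import Mapping


def backend_base_url(server_side: bool, env: Mapping[str, str] = os.environ) -> str:
    """Pick the backend address for server-side code or for the browser."""
    if server_side:
        # Server-to-server: stay on the private network (hypothetical host and port).
        return env.get("BACKEND_PRIVATE_URL", "http://api.railway.internal:3000")
    # Browser requests cannot reach *.railway.internal, so use the public URL.
    return env.get("BACKEND_PUBLIC_URL", "https://your-app.up.railway.app")
```

Server-side code such as API routes or SSR handlers would call `backend_base_url(server_side=True)`, while any URL that is sent to the browser should come from `backend_base_url(server_side=False)`.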

Resource constraints

Your application may be hitting resource limits, causing throttling or OOM (out of memory) kills.

Symptoms:

Application crashes with exit code 137 (OOM killed)
CPU usage consistently at or near 100%
Slow response times during high load

Solutions:

Check your metrics to see actual resource usage
Adjust resource limits if you're consistently hitting them
Optimize your application's memory and CPU usage
Consider horizontal scaling for stateless workloads
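
Exit code 137 follows the common convention that a process terminated by a signal exits with 128 plus the signal number, and 137 - 128 = 9 is SIGKILL, the signal used when a process is killed for exceeding its memory limit. The helper below decodes exit codes this way. The function name is illustrative, and the signal names it returns assume a Linux or other Unix-like system.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short human-readable description."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with error status {code}"
```

On Linux, `describe_exit_code(137)` reports SIGKILL and `describe_exit_code(143)` reports SIGTERM. A 137 alongside memory usage at the limit in your metrics points to an out-of-memory kill rather than an application bug that raised an error.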

Large container images

Large images take longer to pull, especially on first deployment to a new compute node.

Symptoms:

"Creating containers" phase takes several minutes
Large build output size shown in build logs

Solutions:

Use multi-stage Docker builds to reduce final image size
Use smaller base images (e.g., Alpine variants)
Exclude unnecessary files with .dockerignore
Remove development dependencies from production builds

Slow application startup

If your application takes time to initialize, it affects the healthcheck phase duration.

Symptoms:

Healthcheck takes a long time to pass
Application logs show initialization steps running

Solutions:

Defer non-critical initialization to after the app is ready to serve traffic
Use lazy loading for heavy dependencies
Increase healthcheck timeout if startup time is legitimate
Consider a dedicated healthcheck endpoint that responds before full initialization
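
The last two ideas can be combined: start listening immediately, run slow initialization in the background, and let the healthcheck endpoint answer as soon as the server is up. The sketch below uses only the standard library. The `/health` path, the `warm_up` placeholder, and the fallback port are assumptions for illustration, and the server reads its port from a `PORT` environment variable.

```python
import os
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Tuple

ready = threading.Event()


def warm_up() -> None:
    """Placeholder for slow, non-critical startup work (cache warming, large loads)."""
    time.sleep(0.2)
    ready.set()


def respond(path: str, is_ready: bool) -> Tuple[int, bytes]:
    """Decide the response: /health always passes, other routes wait for warm-up."""
    if path == "/health":
        return 200, b"ok"
    if not is_ready:
        return 503, b"warming up"
    return 200, b"hello"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = respond(self.path, ready.is_set())
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main() -> None:
    """Start warm-up in the background, then serve requests right away."""
    threading.Thread(target=warm_up, daemon=True).start()
    port = int(os.environ.get("PORT", "8080"))
    ThreadingHTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Call `main()` from your entry point to start the server. The trade-off is that traffic can arrive before warm-up finishes, so routes that depend on it must handle that case, here by returning 503. If early 503 responses are unacceptable, keep the healthcheck tied to full readiness and raise the healthcheck timeout instead.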

What plan upgrades actually do

Upgrading your plan raises your resource limits. It does not guarantee better performance. Understanding this distinction is important.

What upgrading provides

Plan	Per-Replica vCPU Limit	Per-Replica Memory Limit
Hobby	8 vCPU	8 GB
Pro	24 vCPU	24 GB
Enterprise	Custom	Custom

Upgrading raises the ceiling on how many resources a single replica can use. Your application only uses what it needs, up to the limit.

When upgrading helps

Upgrading helps when:

Your metrics show you're hitting current resource limits
Your application needs more memory (e.g., processing large datasets)
You need more CPU for compute-intensive tasks
You want to run more replicas (higher replica limits on higher plans)

When upgrading doesn't help

Upgrading won't help when:

Slowness is caused by external services (databases, APIs)
Your application has inefficient code
Network latency is the bottleneck
You're not actually using your current resource allocation

Always check your metrics before upgrading. If your service uses 500 MB of memory and 0.5 vCPU, upgrading from Hobby to Pro won't make it faster.
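
That check can be expressed as a quick headroom calculation. In this sketch, the 80% threshold for "close to the limit" is an assumption chosen for illustration, and the limits passed in are the Hobby values from the table above.

```python
def near_limit(used: float, limit: float, threshold: float = 0.8) -> bool:
    """True if usage has reached the given fraction of the limit."""
    return used >= limit * threshold


def upgrade_may_help(
    cpu_used: float, cpu_limit: float, mem_used_gb: float, mem_limit_gb: float
) -> bool:
    """A higher ceiling only matters if CPU or memory is already near the current one."""
    return near_limit(cpu_used, cpu_limit) or near_limit(mem_used_gb, mem_limit_gb)


# The service from the example: 0.5 vCPU and 0.5 GB on Hobby (8 vCPU, 8 GB).
idle_service = upgrade_may_help(0.5, 8, 0.5, 8)    # False: far below both limits
busy_service = upgrade_may_help(7.5, 8, 2.0, 8)    # True: CPU is near the ceiling
```

Use peak values from your metrics rather than averages, since short bursts that hit the limit are what cause throttling.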

Edge routing and latency

Railway operates edge proxies in multiple regions. For a complete overview of edge infrastructure, see the Edge Networking reference. Understanding how traffic is routed helps diagnose latency issues.

How edge routing works

When a request comes in:

It hits the nearest Railway edge proxy
The edge proxy routes it to your service in the configured region
Your service processes the request and responds

You can see which edge handled a request via the X-Railway-Edge response header.

Checking the edge header

The header value shows the region, e.g., railway/us-west2.
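
A minimal sketch for reading the header programmatically, using only the standard library. The function names are illustrative, and the parsing assumes the `railway/<region>` shape shown in the example value above.

```python
import urllib.request
from typing import Mapping, Optional


def edge_region(headers: Mapping[str, str]) -> Optional[str]:
    """Extract the region from an X-Railway-Edge header, e.g. 'railway/us-west2'."""
    for name, value in headers.items():
        if name.lower() == "x-railway-edge":  # header names are case-insensitive
            return value.split("/", 1)[-1]
    return None


def fetch_edge_region(url: str) -> Optional[str]:
    """Request url and report which edge region handled the response."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return edge_region(dict(response.headers.items()))
```

Running `fetch_edge_region` against your service's public URL from different locations shows which edge each location is routed to. A return value of `None` means the header was not present on the response.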

Why traffic might hit the wrong edge

DNS caching: Your local DNS resolver may have cached an old record
CDN/Proxy interference: Services like Cloudflare route based on their own logic
Geographic routing: Users in certain regions may be routed suboptimally

Optimizing for global users

If you have users worldwide, you can use multi-region replicas to deploy stateless services closer to your users. Railway automatically routes traffic to the nearest region.

Note: Multi-region works well for stateless application servers, but databases typically run in a single region. If your app is deployed globally but your database is in one region, replicas far from the database will still experience latency on database queries. To mitigate this:

Use application-level caching to reduce database round-trips
Consider database read replicas in additional regions for read-heavy workloads
Accept the latency trade-off for writes, which must go to the primary database
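
As an illustration of the first mitigation, the sketch below keeps recently read values in process memory for a fixed time so that repeated reads skip the cross-region round-trip. The `TTLCache` class and its parameters are hypothetical. It is not thread-safe and is local to each replica, so a shared cache may be a better fit for multi-replica services.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Cache values for ttl_s seconds; reload them from the database afterwards."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling loader only when it is missing or stale."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]
        value = loader()  # the expensive call, e.g. a query to a distant database
        self._entries[key] = (now, value)
        return value
```

A call such as `cache.get_or_load(("user", 42), lambda: load_user(42))`, where `load_user` stands for your own query function, reaches the database at most once per TTL window for that key. Choose the TTL based on how stale the data is allowed to be.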

Private networking and edge

Private networking (*.railway.internal) bypasses the edge entirely. Services communicate directly within Railway's infrastructure, which is faster than going through the public internet.

When to contact support

Contact Railway support through Central Station if:

Deployments are consistently slow with no apparent cause
You see 544 Railway Proxy Error responses, which indicate a platform-side issue (as opposed to 502 errors, which indicate application issues)
The status page shows no issues but you're experiencing degraded performance
You need help optimizing your deployment configuration

Tip: When reporting issues, include the X-Railway-Request-Id header from affected requests. This unique identifier helps Railway support trace your request through the infrastructure. You can find it in your HTTP response headers.