A load balancer needs to know which backends are healthy. The mechanism is a health check — a periodic probe that asks “are you OK?” The backend responds; the LB decides. The whole thing sounds simple.

The reality: badly-designed health checks cause more outages than they prevent. Cascading failures. Healthy-but-broken backends serving real traffic. Slow checks under load. This post covers the practical patterns for designing health checks that work.

What Health Checks Do

The LB cycles through backends, periodically checking each:

1. Send a probe (TCP connect, HTTP GET, custom request).
2. Wait for response (or timeout).
3. If success, reset the fail counter; a healthy backend stays healthy.
4. If failure, increment fail counter.
5. After N consecutive failures, mark unhealthy. Stop routing traffic.

When an unhealthy backend recovers:

6. Continue probing.
7. After M consecutive successes, mark healthy again. Resume routing.

The N and M thresholds (often 2-3) prevent flapping from one-off transient issues.
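The probe loop amounts to a small state machine per backend. Here is a minimal sketch, assuming hypothetical names (`createTracker`, `recordProbe`, `fall` for N, `rise` for M) that are not taken from any particular load balancer:

```javascript
// Per-backend health state with consecutive-failure (fall) and
// consecutive-success (rise) thresholds. All names are illustrative.
function createTracker({ fall = 3, rise = 2 } = {}) {
    return { healthy: true, fails: 0, passes: 0, fall, rise }
}

// Feed one probe result (true = success) and return the updated state.
function recordProbe(tracker, success) {
    if (success) {
        tracker.fails = 0
        tracker.passes += 1
        if (!tracker.healthy && tracker.passes >= tracker.rise) {
            tracker.healthy = true // M consecutive successes: resume routing
        }
    } else {
        tracker.passes = 0
        tracker.fails += 1
        if (tracker.healthy && tracker.fails >= tracker.fall) {
            tracker.healthy = false // N consecutive failures: stop routing
        }
    }
    return tracker
}

const t = createTracker({ fall: 3, rise: 2 })
const trace = [false, true, false, false, false, true, true].map(
    (ok) => recordProbe(t, ok).healthy
)
console.log(trace) // [ true, true, true, true, false, false, true ]
```

The single early failure is absorbed. Only the run of three failures takes the backend out, and it takes two clean probes to bring it back.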

Types of Health Checks

TCP check

“Open a TCP connection to port 443. If successful, healthy.”

Pros: Fast; minimal.
Cons: Weakest signal. The TCP listener is up; the application could still be broken.

Useful for L4 LBs and as a basic liveness check.

HTTP check

“GET /health on port 443. If response is 2xx, healthy.”

Pros: Tells you HTTP is working.
Cons: Still might not exercise real code paths.

Standard for L7 LBs.

Custom application check

A dedicated /health endpoint that runs internal checks:

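// Assumes an Express `app` plus `db` and `cache` clients that expose ping().
app.get('/health', async (req, res) => {
    try {
        // Probe each dependency the service cannot run without.
        await db.ping()
        await cache.ping()
        return res.json({ status: 'ok' })
    } catch (e) {
        // 503 tells the LB to stop routing traffic here.
        return res.status(503).json({ status: 'down', error: e.message })
    }
})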

Pros: Tests real dependencies.
Cons: Must be designed carefully — too thorough and it fails under load.

gRPC health check

For gRPC services, use the standard grpc.health.v1.Health service. Tooling is consistent.

Designing a Good Health Check

Cheap to execute

The health check runs on every check interval. If it’s expensive, you’re constantly burning resources just to be checked.

A health check that does a database query, a cache lookup, and an external API call gets expensive at scale: 1000+ backends × 1 check/5s is hundreds of checks per second, each one fanning out to three dependencies.

Make /health cheap. Stick to in-process checks, with no external calls on the request path.
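To put rough numbers on that cost, here is a back-of-the-envelope sketch. The function name and the `lbNodes` parameter are assumptions added for illustration; many load balancers run several nodes that each probe independently, which multiplies the total.

```javascript
// Estimate steady-state probe load. All inputs are illustrative assumptions.
function probeLoad({ backends, intervalSec, depCallsPerCheck, lbNodes = 1 }) {
    const checksPerSec = (backends / intervalSec) * lbNodes
    return {
        checksPerSec,
        dependencyCallsPerSec: checksPerSec * depCallsPerCheck,
    }
}

const load = probeLoad({ backends: 1000, intervalSec: 5, depCallsPerCheck: 3 })
console.log(load) // { checksPerSec: 200, dependencyCallsPerSec: 600 }
```

With a cheap in-process check, `depCallsPerCheck` drops to zero and only the 200 requests per second remain.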

A common mistake: health check goes through the same code paths as real traffic. Under load, real traffic queues; health checks queue too; they time out; the LB marks the backend unhealthy; traffic moves to other backends; they get even more loaded.

Cascading failure.

Better: health check on a separate code path, separate thread pool. Even under heavy real load, the health check responds.
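In Node.js, one way to approximate a separate code path is a dedicated listener that shares nothing with the application's middleware stack. This is a sketch under assumptions: the `state.ready` flag and port 9000 are hypothetical, and because Node runs on a single event loop, this isolates the check from middleware and the main listener's queue but not from code that blocks the event loop.

```javascript
const http = require('node:http')

// Readiness flag maintained by the application (hypothetical).
const state = { ready: true }

// Handler with no middleware, auth, or I/O in front of it.
function healthHandler(req, res) {
    const code = state.ready ? 200 : 503
    res.writeHead(code, { 'Content-Type': 'application/json' })
    res.end(JSON.stringify({ status: state.ready ? 'ok' : 'unavailable' }))
}

// Dedicated listener for probes; the port number is an arbitrary choice.
const healthServer = http.createServer(healthHandler)
// healthServer.listen(9000)
```

Runtimes with real thread pools can go further and reserve a worker for the health endpoint, which is closer to what the paragraph above describes.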

Reflects what matters

A backend can have /health return 200 while actually serving 500s for real users. Health checks should test real things, not just “the process is running.”

But not too much. If /health tests every dependency, any one of them being momentarily unavailable marks the backend down. False positives.

The right level: critical dependencies (the database the backend can’t function without). Optional dependencies (a metrics service) shouldn’t fail the check.
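The critical/optional split can be sketched as a small function that turns per-dependency results into one verdict. The function name and the dependency names (`database`, `cache`, `metrics`) are hypothetical:

```javascript
// Combine per-dependency results into one verdict. Only dependencies listed
// as critical can fail the check; a critical dependency with no result
// counts as failed.
function evaluateHealth(results, critical) {
    const failedCritical = critical.filter((name) => results[name] !== true)
    const failedOptional = Object.keys(results).filter(
        (name) => results[name] !== true && !critical.includes(name)
    )
    return {
        httpStatus: failedCritical.length === 0 ? 200 : 503,
        failedCritical,
        failedOptional, // reported for humans, ignored by the LB
    }
}

const verdict = evaluateHealth(
    { database: true, cache: true, metrics: false },
    ['database']
)
console.log(verdict.httpStatus, verdict.failedOptional) // 200 [ 'metrics' ]
```

The metrics outage still shows up in the response body, so monitoring can see it, but it does not pull the backend out of rotation.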

Returns a meaningful response

A 200 means healthy. A 503 means actively unhealthy. But what about “degraded — could serve but at reduced capability”?

Some LBs support weighted health where partially-healthy backends get less traffic. For simpler setups, binary “healthy/unhealthy” is enough.

Anti-Patterns

Health check as DB connection test

/health does SELECT 1 against the database every check.

Result: the database gets hammered just by the LB checking. Real load on top of that pushes the DB over.

Better: cache the DB liveness check; only refresh every 30 seconds.

Health check that does work

“While I’m here, let me also process queued tasks!”

Result: health check timing varies based on queue depth. Under high load, the check times out; the backend is marked unhealthy.

No timeout

The health check waits indefinitely for a response. If the backend is hung, the probe never fails, so the LB never finds out.

Set a reasonable timeout (2-5 seconds typically).

Single threshold

“After 1 failure, mark unhealthy.” Network blips cause flapping.

Use thresholds (2-3 consecutive failures).

Same threshold up and down

Healthy after 1 pass; unhealthy after 1 fail. Bouncing.

Use hysteresis: e.g., 3 passes to mark healthy, 2 fails to mark unhealthy.

Cascading Failure Mitigation

The classic scenario:

Backend is busy under load.
Health check is slow because of load.
Times out; backend marked unhealthy.
Traffic moves to other backends.
They get even more loaded.
Their health checks slow down.
Marked unhealthy too.
Whole pool collapses.

Mitigations:

Panic mode

If more than X% of backends are marked unhealthy, ignore health checks. Distribute to all backends evenly. The idea: better to send some traffic to slow backends than no traffic to dead infrastructure.

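Envoy implements this as a configurable panic threshold (50% by default). AWS ALB applies the limiting case: if every target in a target group is unhealthy, it fails open and routes to all of them. Other proxies, nginx included, vary in how they handle an all-down pool, so check their documentation.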
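The selection rule itself is short. This is a minimal sketch, assuming a hypothetical `panicThreshold` expressed as a fraction and backend objects that carry a `healthy` flag:

```javascript
// Pick the backends eligible for traffic. If the unhealthy share exceeds
// panicThreshold (a fraction between 0 and 1), ignore health and use all.
function eligibleBackends(backends, panicThreshold = 0.5) {
    if (backends.length === 0) return []
    const healthy = backends.filter((b) => b.healthy)
    const unhealthyShare = 1 - healthy.length / backends.length
    return unhealthyShare > panicThreshold ? backends : healthy
}

const pool = [
    { name: 'a', healthy: true },
    { name: 'b', healthy: false },
    { name: 'c', healthy: false },
    { name: 'd', healthy: false },
]
console.log(eligibleBackends(pool).map((b) => b.name)) // [ 'a', 'b', 'c', 'd' ]
```

With three of four backends marked unhealthy, sending everything to `a` would likely take it down too, so the rule spreads traffic across the whole pool instead.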

Health check on separate path

Don’t share fate with real traffic.

Backend resilience

Backends should degrade gracefully. Better to serve 50% slower than to refuse all requests.

Circuit breakers internally

Inside the backend, circuit-break dependencies so a slow downstream doesn’t tank everything.
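A circuit breaker can be sketched in a few dozen lines. This is an illustrative version, not a substitute for a maintained library; the class name, option names, and the injectable `now` clock are assumptions made for the example.

```javascript
// Minimal circuit breaker. After `maxFailures` consecutive failures the
// circuit opens and calls fail fast. Once `resetMs` has elapsed, calls are
// let through again as a trial; a success closes the circuit.
class CircuitBreaker {
    constructor({ maxFailures = 3, resetMs = 10_000, now = Date.now } = {}) {
        this.maxFailures = maxFailures
        this.resetMs = resetMs
        this.now = now
        this.failures = 0
        this.openedAt = null
    }

    async call(fn) {
        if (this.openedAt !== null && this.now() - this.openedAt < this.resetMs) {
            throw new Error('circuit open') // fail fast, skip the slow dependency
        }
        try {
            const result = await fn()
            this.failures = 0
            this.openedAt = null
            return result
        } catch (err) {
            this.failures += 1
            if (this.failures >= this.maxFailures) this.openedAt = this.now()
            throw err
        }
    }
}

// Usage (hypothetical downstream call):
//   const breaker = new CircuitBreaker({ maxFailures: 5, resetMs: 30_000 })
//   const profile = await breaker.call(() => fetchProfile(userId))
```

While the circuit is open, requests that depend on the slow downstream fail in microseconds instead of holding a connection until a timeout, which keeps the rest of the backend responsive.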

Real Production `/health` Endpoint

A pattern that works:

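// Assumes an Express `app`, and `db.query` / `cache.ping` returning promises.
const HEALTH_CHECK_INTERVAL = 30_000
// Start pessimistic so the backend takes no traffic before the first check.
let cachedHealth = { status: 'starting', checkedAt: Date.now() }

// Native promises have no .timeout(); reject if `promise` takes longer than `ms`.
function withTimeout(promise, ms) {
    let timer
    const timeout = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
    })
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}

async function refreshHealth() {
    try {
        await Promise.all([
            withTimeout(db.query('SELECT 1'), 2000),
            withTimeout(cache.ping(), 1000),
        ])
        cachedHealth = { status: 'ok', checkedAt: Date.now() }
    } catch (e) {
        cachedHealth = { status: 'degraded', error: e.message, checkedAt: Date.now() }
    }
}

refreshHealth() // run once at startup instead of waiting for the first tick
setInterval(refreshHealth, HEALTH_CHECK_INTERVAL)

app.get('/health', (req, res) => {
    // If the background loop has stalled, the cached result can't be trusted.
    if (Date.now() - cachedHealth.checkedAt > HEALTH_CHECK_INTERVAL * 3) {
        return res.status(503).json({ status: 'stale' })
    }
    if (cachedHealth.status === 'ok') return res.status(200).json(cachedHealth)
    return res.status(503).json(cachedHealth)
})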

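The background check runs every 30s. The /health endpoint returns the cached result. The endpoint is fast (no DB query per request), and the result lags reality by about one refresh interval at worst (30s, plus the probe timeout). If the background loop stalls for more than three intervals, the endpoint reports the result as stale and returns 503.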

Multi-Tier Health Checks

For complex applications, two tiers:

/health — Shallow. Used by load balancer. Cheap. Returns 200 if the service is “able to handle traffic.”
/health/deep — Thorough. Used by monitoring. Tests all dependencies. Returns detailed status. Not called by the LB.

LB takes the shallow check; humans/monitoring take the deep check.
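The two tiers can be sketched as framework-agnostic functions that return a status code and a body. The names `shallowHealth` and `deepHealth`, the `state.ready` flag, and the shape of `deps` (a map from dependency name to an async check function) are assumptions for illustration.

```javascript
// Shallow tier: no I/O, only reports whether this process considers itself
// able to take traffic.
function shallowHealth(state) {
    return state.ready
        ? { status: 200, body: { status: 'ok' } }
        : { status: 503, body: { status: 'not ready' } }
}

// Deep tier: runs every dependency check and reports each one separately.
async function deepHealth(deps) {
    const entries = await Promise.all(
        Object.entries(deps).map(async ([name, check]) => {
            const started = Date.now()
            try {
                await check()
                return [name, { ok: true, ms: Date.now() - started }]
            } catch (err) {
                return [name, { ok: false, error: err.message }]
            }
        })
    )
    const allOk = entries.every(([, detail]) => detail.ok)
    return {
        status: allOk ? 200 : 503,
        body: { status: allOk ? 'ok' : 'failing', details: Object.fromEntries(entries) },
    }
}

// Wiring into Express might look like:
//   app.get('/health', (req, res) => {
//       const r = shallowHealth(state); res.status(r.status).json(r.body)
//   })
//   app.get('/health/deep', async (req, res) => {
//       const r = await deepHealth(deps); res.status(r.status).json(r.body)
//   })
```

Because the deep check catches each failure individually, one broken dependency does not hide the status of the others.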

Connection Draining

When a backend is marked unhealthy:

Existing in-flight requests should complete.
New requests stop being routed to it.

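This is called connection draining (AWS calls it deregistration delay). Most LBs let you configure the drain window; 30-60 seconds is a common choice, though defaults can be longer (AWS target groups default to 300 seconds).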

For backends that handle long-lived connections (WebSockets, gRPC streams), draining is more nuanced. The backend should signal “no new streams” to clients so they reconnect to other backends.
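A backend can cooperate with draining by failing its own health check before it stops listening, so the LB has time to route around it. This is a sketch of that common pattern: `createDrainState`, `installGracefulShutdown`, and the 15-second grace period are assumptions, and `server` is anything with a `close(callback)` method, such as a Node `http.Server`.

```javascript
// Tracks whether this process has started draining.
function createDrainState() {
    let draining = false
    return {
        begin() { draining = true },
        // Status code the /health endpoint should return.
        healthStatus() { return draining ? 503 : 200 },
    }
}

// On SIGTERM: fail health checks first, then stop accepting connections.
function installGracefulShutdown(server, drainState, graceMs = 15_000) {
    process.once('SIGTERM', () => {
        drainState.begin() // 1. the LB starts seeing 503 and stops routing here
        setTimeout(() => {
            // 2. stop accepting new connections; requests already in flight
            //    are allowed to finish before the callback fires
            server.close(() => process.exit(0))
        }, graceMs)
    })
}
```

The grace period should cover at least the LB's detection time, roughly the check interval multiplied by the unhealthy threshold. For long-lived streams the same `draining` flag can be used to refuse new streams while existing ones wind down.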

Health Check Frequency

Trade-offs:

More frequent = faster detection of unhealthy backends. More overhead.
Less frequent = less overhead. Slower detection.

Typical settings:

Interval: 5-15 seconds.
Timeout: 2-5 seconds.
Healthy threshold: 2-3 consecutive passes.
Unhealthy threshold: 2-3 consecutive failures.

Faster checks for critical services; slower for stable ones. Cloud LBs have reasonable defaults.
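These settings combine into a detection window: how long a dead backend keeps receiving traffic. Here is a rough estimate, assuming probes start on a fixed interval and a hung backend consumes the full timeout; real load balancers schedule probes in slightly different ways, so treat the result as a ballpark figure.

```javascript
// Rough bounds on detection time, in seconds. Field names are illustrative.
function detectionWindow({ intervalSec, timeoutSec, unhealthyThreshold }) {
    return {
        // Backend dies just before a probe and fails fast each time.
        bestCaseSec: (unhealthyThreshold - 1) * intervalSec,
        // Backend hangs just after passing a probe; the last probe times out.
        worstCaseSec: unhealthyThreshold * intervalSec + timeoutSec,
    }
}

console.log(detectionWindow({ intervalSec: 10, timeoutSec: 5, unhealthyThreshold: 3 }))
// { bestCaseSec: 20, worstCaseSec: 35 }
```

Halving the interval roughly halves the window and doubles the probe traffic, which is the trade-off described above.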

Health Checks for Auto-Scaling

If you’re using auto-scaling alongside LB:

LB health checks decide which existing backends get traffic.
Auto-scaling health checks decide whether to terminate and replace backends.

These are separate. In AWS Auto Scaling groups, for example, the health check type is either EC2 (instance status checks only) or ELB (which also takes the load balancer's verdict into account).

Be careful: an LB-unhealthy backend doesn’t necessarily need replacement if it’s recovering. Auto-scaling thresholds should be longer than LB thresholds.

DNS-Based Failover

For multi-region failover, DNS-based health checks (Route 53, Cloudflare Load Balancing) route traffic between regions:

Primary region’s load balancer is healthy → DNS returns primary IP.
Primary region down → DNS returns secondary IP.

DNS-level failover is slow: clients keep using the cached record until its TTL (typically 60 seconds) expires, and some resolvers hold records longer than that. Anycast-based failover is faster.

See DNS caching explained for the propagation considerations.

Monitoring Health Check Health

A meta-concern: monitor your health checks themselves. If they fail in unexpected ways:

Suddenly more backends marked unhealthy than usual? Investigate.
All backends marked unhealthy at once? Probably a health-check bug or a network issue, not a backend problem.
Backends flipping repeatedly? Threshold tuning issue.

Dashboards showing health check pass rates over time catch these patterns.
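The two numbers worth plotting per backend are the pass rate and how often the result flips. This is a minimal sketch, assuming the probe history is available as an array of booleans (oldest first); the function name is hypothetical.

```javascript
// Summarize a backend's probe history.
function summarizeProbes(history) {
    const passes = history.filter(Boolean).length
    let flips = 0
    for (let i = 1; i < history.length; i++) {
        if (history[i] !== history[i - 1]) flips += 1
    }
    return {
        passRate: history.length ? passes / history.length : null,
        flips, // many flips with a middling pass rate suggests flapping
    }
}

console.log(summarizeProbes([true, true, false, true, false, true, true, true]))
// { passRate: 0.75, flips: 4 }
```

A low pass rate with few flips points at a backend that is simply down. A moderate pass rate with many flips points at thresholds or timeouts that need tuning.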

TL;DR

Health checks decide which backends get traffic.
Make them cheap — they run constantly.
Don’t share fate with real traffic.
Use thresholds and hysteresis to prevent flapping.
Panic mode prevents cascading failures.
Shallow check for LB, deep check for monitoring — two tiers.
Cache background results in the /health endpoint.
Drain connections when marking unhealthy.
Auto-scaling and LB checks are separate concerns with separate thresholds.

Health checks are one of those small infrastructure details that have an outsized impact on reliability. Done right, they’re invisible. Done wrong, they cause the outages they were supposed to prevent. For the broader load balancer context, see load balancer types; for the proxy layer often involved, reverse proxy explained.
