Health checks

Health checks are used to mark (crashed) containers as unhealthy, so that they can be restarted by the container orchestration (Kubernetes, Docker engine…).

The health check tooling in maykin-common covers the HTTP health checks for the Django app and the Celery components, like the worker and beat.

Added in version 0.13.0: Added tooling for Django health checks.

Added in version 0.14.0: Added support for Celery Beat.

There is also reference documentation available.

Quickstart (tl;dr)

Install the extra dependencies:

uv pip install maykin-common[health-checks]

Update your settings accordingly:

from maykin_common.health_checks import default_health_check_apps

INSTALLED_APPS = [
    ...,
    *default_health_check_apps,
    "maykin_common.health_checks.celery",  # optional, add if you use Celery
    ...
]

and your root urls.py:

urlpatterns = [
    ...,
    path("", include("maykin_common.health_checks.urls")),
    ...,
]

and in your celery.py entrypoint:

from maykin_common.health_checks.celery.probes import EventLoopProbe

app = Celery("my-proj")
app.steps["worker"].add(EventLoopProbe)

See the Command line details for how to test the health.

Celery 

If you use Celery in your project, there are health check tools for the Celery components too.

Beat 

We can monitor Beat’s liveness by tracking when was the last time a task was scheduled. Instrumentation is done by adding a Django app and (optionally) specifying the file path to the liveness file:

from pathlib import Path

INSTALLED_APPS = [
    ...,
    "maykin_common.health_checks.celery",
    ...
]

MKN_HEALTH_CHECKS_BEAT_LIVENESS_FILE = Path("/tmp") / "celery_beat_live"

The file specified through MKN_HEALTH_CHECKS_BEAT_LIVENESS_FILE will be touched every time Beat successfully schedules a task to the broker. The health check can then test how long ago the file was last touched. For example, if your Beat schedule runs a task every hour, you could run the health check that expects the file to be modified less than 2 hours ago.

Tip

If your normal schedule has very infrequent tasks (e.g. once per week), you may want to set up a smoke test task that runs more frequently (e.g. every hour).

Caution

If you use the django_celery_beat.schedulers.DatabaseScheduler scheduler, you should be aware that your schedules are editable at runtime through the admin, which may mean your max-age parameter no longer aligns with your actual schedule, leading to erroneously failed health checks. In such cases you might want to consider scheduling an explicit heartbeat task, using the task description to make clear that the task should be enabled and should not be re-scheduled.

Worker 

Monitoring Celery Worker’s health is complicated due to the complex nature of workers and how they can fail. The worker system essentially boots up a whole stack of subsystems that can each fail and contribute to “broken” workers. For details, see the blueprints docs.

Enabling the checks

Enabling the worker health check machinery requires some small changes to your configuration.

First, ensure the settings are configured appropriately:

from pathlib import Path

INSTALLED_APPS = [
    ...,
    "maykin_common.health_checks.celery",
    ...
]

# optional settings, the defaults are listed
MKN_HEALTH_CHECKS_WORKER_EVENT_LOOP_PROBE_FREQUENCY_SECONDS = 60
MKN_HEALTH_CHECKS_WORKER_EVENT_LOOP_LIVENESS_FILE = Path("/tmp") / "celery_worker_event_loop_live"
MKN_HEALTH_CHECKS_WORKER_READINESS_FILE = Path("/tmp") / "celery_worker_ready"

This installs the signal receiver for the worker ready/shutdown, wich affects the presence of MKN_HEALTH_CHECKS_WORKER_READINESS_FILE.

Next, in your Celery entrypoint (where you define app = Celery("my-celery-app")), add the bootstep:

from maykin_common.health_checks.celery.probes import EventLoopProbe

app = Celery("my-project")
app.steps["worker"].add(EventLoopProbe)
app.autodiscover_tasks()

Which sets up the event-loop monitoring and affects the last-modified timestamp of the MKN_HEALTH_CHECKS_WORKER_EVENT_LOOP_LIVENESS_FILE.

Background information

The health-check tooling we ship hooks into some critical phases:

Worker starts
Event loop is started, including the timer <– we hook into the timer to check the event loop
Consumer starts
Consumer establishes broker connection <– we test the connection by pinging
Consumer starts consuming tasks
Consumer is ready to process tasks <– we hook into this signal

Celery’s machinery is set up so that the whole consumer subsystem restarts on connection loss, which should make it recover gracefully without restarting the whole worker process.

What we monitor

Event loop liveness. Periodically, we touch a heartbeat file that shows the event loop is still live and able to orchestrate work. The frequency can be tweaked with the MKN_HEALTH_CHECKS_WORKER_EVENT_LOOP_PROBE_FREQUENCY_SECONDS setting.
Broker connection health, by sending a PING roundtrip from the worker process to itself. This is reliable (especially with the default preforking pool) because the ping control machinery lives in the main process, not in the worker processes that actually execute tasks, so it cannot be blocked by long-running tasks.
Worker readiness - a readiness file is created when the worker is ready to start processing tasks. It is deleted again when the worker shuts down.

What we don’t monitor

Actual tasks execution - while it’s possible in theory to create a worker-specific queue and schedule a task from itself to the worker, this brings a lot of additional uncertainty and complexity:
- the task can be held up by other long-running tasks, because it will execute in a worker process
- the queue name must contain the worker (host) name, which creates many keys in the broker. For brokers like Redis, these keys don’t automatically expire and pollute the system. So, a periodic task and additional machinery are needed to detect stale exchanges and clean them up.
Desired concurrency is available. Some tests showed that Celery seems perfectly capable of restarting crashed processes and maintains the desired number of worker processes. Unless this becomes an observed failure mode, these sort of things will not be added.

Command line 

HTTP health checks 

You can use the maykin-common CLI to probe the health check endpoint(s):

maykin-common health-check --endpoint=/_healthz/livez/

Which will exit with exit code 0 for success responses (HTTP status code between 200 and 399).

Note

Make sure the install the cli extra:

uv pip install maykin-common[cli]

Celery beat health checks 

If you use Celery Beat and install the health checks (see above), you can test the Beat liveness too:

maykin-common beat-health-check --file /tmp/celery_beat_live --max-age 120

Which will exit with exit code 0 if the specified file exists and is last modified within the specified number of seconds. The file path should match the value of the MKN_HEALTH_CHECKS_BEAT_LIVENESS_FILE setting.

Tip

Use a --max-age that’s 2x the interval of your most frequently scheduled task. E.g. if you have a task that runs every minute, pick 120 seconds.

Tip

Use startup probes if possible to give Beat time to load, start and schedule a task for the first time.

Celery worker health checks 

If you use Celery and install the health checks (see above), you can test their health:

maykin-common worker-health-check \
    --no-skip-event-loop-liveness \
    --liveness-file /tmp/celery_worker_event_loop_live \
    --max-age 70 \
    --no-skip-ping \
    --broker redis://localhost:6379/0 \
    --worker-name celery@localhost \
    --ping-timeout 3 \
    --skip-readiness

The example options closely match the defaults.

Tip

Ensure you invoke this command in the worker container itself.

Tip

Use startup probes if possible to give the worker time to load, create the health check files and establish a broker connection.

The command tests different aspects and will exit with a non-zero exit code when any of the tests fails.

Event loop liveness

Default: enabled.

The event loop liveness test checks that the last modified timestamp of --liveness-file is not older than --max-age. If the event loop/timer crashes, then the liveness file will not be touched any more and eventually become older than the provided max age.

By default, the event loop file is touched every minute, so the default max age accounts for some potential time drift.

Ping

Default: enabled.

The ping roundtrip sends a ping from the worker to itself, which travels via the broker connection. Ping failures detect potential broker connection issues which definitely result in the worker not being able to pick up results.

The ping check requires some routing information.

--broker

The address of the broker, matching the CELERY_BROKER setting. Ping needs to connect to the broker to send the control message. In container contexts, you will typically use a service or container name, e.g. redis://my-redis:6379/0/ that uses DNS resolution.

Tip

The localhost default points to the container itself. Make sure to provide this option explicitly.

--worker-name

The worker name is taken from the envvar CELERY_WORKER_NAME if set. In containerized environments, the worker name usually has the shape <queue>@<host>, where the default queue is usually named celery, but projects can define dedicated queues/workers for queues.

The host name is usually taken from the container hostname and matches either the container name on Docker engine or the pod name on Kubernetes.

Example worker names:

celery@sparky
celery@my-project-client-test-celery-1
long-running@my-project-client-test-celery-1
celery@celery-worker-554b9c67f9-c5cv4

Note

Ping requires some additional information to keep the health check lightweight without loading the entire application in memory, as that itself can take multiple seconds and can fail probe timeouts.

Readiness

Default: disabled.

The readiness test checks that the readiness file exists. It is created when the worker signals that it’s ready to start processing tasks, at the end of the startup phase. It is deleted when the worker shuts down.

Absence of the readiness file can indicate that the worker failed to load the application code. On Kubernetes with rolling deployments, you probably want to add this as a readiness probe to prevent old pods from being stopped when the new version is broken.