Health checks
Health checks are used to mark (crashed) containers as unhealthy, so that they can be restarted by the container orchestrator (Kubernetes, Docker engine…).
The health check tooling in maykin-common covers the HTTP health checks for the Django app and the Celery components, like the worker and beat.
Added in version 0.13.0: Added tooling for Django health checks.
Added in version 0.14.0: Added support for Celery Beat.
There is also reference documentation available.
Quickstart (tl;dr)
Install the extra dependencies:
uv pip install maykin-common[health-checks]
Update your settings accordingly:
from maykin_common.health_checks import default_health_check_apps

INSTALLED_APPS = [
    ...,
    *default_health_check_apps,
    "maykin_common.health_checks.celery",  # optional, add if you use Celery
    ...
]
and your root urls.py:
from django.urls import include, path

urlpatterns = [
    ...,
    path("", include("maykin_common.health_checks.urls")),
    ...,
]
and in your celery.py entrypoint:
from celery import Celery
from maykin_common.health_checks.celery.probes import EventLoopProbe

app = Celery("my-proj")
app.steps["worker"].add(EventLoopProbe)
See the Command line section for how to test the health checks.
Celery
If you use Celery in your project, there are health check tools for the Celery components too.
Beat
We can monitor Beat’s liveness by tracking the last time a task was scheduled. Instrumentation is done by adding a Django app and (optionally) specifying the file path to the liveness file:
from pathlib import Path

INSTALLED_APPS = [
    ...,
    "maykin_common.health_checks.celery",
    ...
]

MKN_HEALTH_CHECKS_BEAT_LIVENESS_FILE = Path("/tmp") / "celery_beat_live"
The file specified through MKN_HEALTH_CHECKS_BEAT_LIVENESS_FILE will be touched
every time Beat successfully schedules a task to the broker. The health check can then
test how long ago the file was last touched. For example, if your Beat schedule runs a
task every hour, you could run the health check that expects the file to be modified
less than 2 hours ago.
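The freshness test described above boils down to comparing a file's modification time with the current time. The following is a minimal sketch of that idea using only the standard library; the helper name `is_fresh` is hypothetical and not part of maykin-common.

```python
import time
from pathlib import Path


def is_fresh(path: Path, max_age_seconds: float) -> bool:
    """Return True if ``path`` exists and was modified within ``max_age_seconds``."""
    try:
        last_modified = path.stat().st_mtime
    except FileNotFoundError:
        # Beat has not scheduled anything yet, or the path is wrong.
        return False
    return (time.time() - last_modified) <= max_age_seconds
```

With an hourly schedule, the check from the example would be `is_fresh(Path("/tmp/celery_beat_live"), max_age_seconds=2 * 60 * 60)`.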
Tip
If your normal schedule has very infrequent tasks (e.g. once per week), you may want to set up a smoke test task that runs more frequently (e.g. every hour).
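As an illustration of this tip, an hourly smoke test could be declared in a static Beat schedule. This sketch assumes your Celery app reads its configuration from Django settings with the `CELERY_` namespace, and the task path `my_project.tasks.heartbeat` is hypothetical; adapt both to your project.

```python
from datetime import timedelta

# Assumption: settings are loaded with
# app.config_from_object("django.conf:settings", namespace="CELERY").
CELERY_BEAT_SCHEDULE = {
    "health-check-heartbeat": {
        # Hypothetical no-op task whose only purpose is to keep Beat scheduling.
        "task": "my_project.tasks.heartbeat",
        "schedule": timedelta(hours=1),
    },
}

# Twice the shortest interval leaves headroom for a delayed run.
shortest_interval = min(entry["schedule"] for entry in CELERY_BEAT_SCHEDULE.values())
max_age_seconds = int(2 * shortest_interval.total_seconds())
```

The resulting `max_age_seconds` is the kind of value you would pass as the maximum age when testing the liveness file.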
Caution
If you use the django_celery_beat.schedulers.DatabaseScheduler scheduler,
you should be aware that your schedules are editable at runtime through the admin,
which may mean your max-age parameter no longer aligns with your actual schedule,
leading to erroneously failed health checks. In such cases you might want to consider
scheduling an explicit heartbeat task, using the task description to make clear that
the task should be enabled and should not be re-scheduled.
Worker
Monitoring Celery Worker’s health is complicated due to the complex nature of workers and how they can fail. The worker system essentially boots up a whole stack of subsystems that can each fail and contribute to “broken” workers. For details, see the blueprints docs.
Enabling the checks
Enabling the worker health check machinery requires some small changes to your configuration.
First, ensure the settings are configured appropriately:
from pathlib import Path

INSTALLED_APPS = [
    ...,
    "maykin_common.health_checks.celery",
    ...
]

# optional settings, the defaults are listed
MKN_HEALTH_CHECKS_WORKER_EVENT_LOOP_PROBE_FREQUENCY_SECONDS = 60
MKN_HEALTH_CHECKS_WORKER_EVENT_LOOP_LIVENESS_FILE = Path("/tmp") / "celery_worker_event_loop_live"
MKN_HEALTH_CHECKS_WORKER_READINESS_FILE = Path("/tmp") / "celery_worker_ready"
This installs the signal receiver for the worker ready/shutdown, which affects the
presence of MKN_HEALTH_CHECKS_WORKER_READINESS_FILE.
Next, in your Celery entrypoint (where you define app = Celery("my-celery-app")),
add the bootstep:
from celery import Celery

from maykin_common.health_checks.celery.probes import EventLoopProbe

app = Celery("my-project")
# Register the bootstep that monitors the worker's event loop.
app.steps["worker"].add(EventLoopProbe)
app.autodiscover_tasks()
This sets up the event-loop monitoring, which updates the last-modified timestamp of the
MKN_HEALTH_CHECKS_WORKER_EVENT_LOOP_LIVENESS_FILE.
Background information
The health-check tooling we ship hooks into some critical phases:
Worker starts
Event loop is started, including the timer (we hook into the timer to check the event loop)
Consumer starts
Consumer establishes broker connection (we test the connection by pinging)
Consumer starts consuming tasks
Consumer is ready to process tasks (we hook into this signal)
Celery’s machinery is set up so that the whole consumer subsystem restarts on connection loss, which should make it recover gracefully without restarting the whole worker process.
What we monitor
Event loop liveness. Periodically, we touch a heartbeat file that shows the event loop is still live and able to orchestrate work. The frequency can be tweaked with the MKN_HEALTH_CHECKS_WORKER_EVENT_LOOP_PROBE_FREQUENCY_SECONDS setting.
Broker connection health, by sending a PING roundtrip from the worker process to itself. This is reliable (especially with the default preforking pool) because the ping control machinery lives in the main process, not in the worker processes that actually execute tasks, so it cannot be blocked by long-running tasks.
Worker readiness. A readiness file is created when the worker is ready to start processing tasks. It is deleted again when the worker shuts down.
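To illustrate the event loop liveness idea, the sketch below touches a heartbeat file at a fixed interval. It uses `threading.Timer` from the standard library as a stand-in for the worker's timer, so it does not show how `EventLoopProbe` is implemented; the class name `FileHeartbeat` is hypothetical.

```python
import threading
from pathlib import Path


class FileHeartbeat:
    """Touch ``path`` every ``interval`` seconds until stopped."""

    def __init__(self, path: Path, interval: float) -> None:
        self.path = path
        self.interval = interval
        self._timer = None
        self._stopped = threading.Event()

    def _tick(self) -> None:
        if self._stopped.is_set():
            return
        # Updating the modification time is the actual heartbeat.
        self.path.touch()
        # Schedule the next beat; daemon timers never block interpreter exit.
        self._timer = threading.Timer(self.interval, self._tick)
        self._timer.daemon = True
        self._timer.start()

    def start(self) -> None:
        self._stopped.clear()
        self._tick()  # the first beat happens immediately

    def stop(self) -> None:
        self._stopped.set()
        if self._timer is not None:
            self._timer.cancel()
```

If whatever drives the timer stalls, the file stops being touched and its age grows past the allowed maximum, which is what the liveness check detects.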
What we don’t monitor
Actual task execution - while it’s possible in theory to create a worker-specific queue and have the worker schedule a task to itself, this brings a lot of additional uncertainty and complexity:
the task can be held up by other long-running tasks, because it will execute in a worker process
the queue name must contain the worker (host) name, which creates many keys in the broker. For brokers like Redis, these keys don’t automatically expire and pollute the system. So, a periodic task and additional machinery are needed to detect stale exchanges and clean them up.
Desired concurrency is available. Some tests showed that Celery seems perfectly capable of restarting crashed processes and maintaining the desired number of worker processes. Unless this becomes an observed failure mode, this kind of check will not be added.
Command line
HTTP health checks
You can use the maykin-common CLI to probe the health check endpoint(s):
maykin-common health-check --endpoint=/_healthz/livez/
The command exits with exit code 0 for success responses (HTTP status code between 200
and 399).
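The exit-code rule can be sketched in a few lines of standard-library Python. This only illustrates the documented behaviour and is not the CLI's actual implementation; the function names are hypothetical, and using 1 as the failure code is an assumption.

```python
import urllib.error
import urllib.request


def exit_code_for_status(status_code: int) -> int:
    """Map an HTTP status code to a process exit code (0 means healthy)."""
    return 0 if 200 <= status_code <= 399 else 1


def probe(url: str, timeout: float = 3.0) -> int:
    """Request ``url`` and translate the outcome into an exit code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return exit_code_for_status(response.status)
    except urllib.error.HTTPError as exc:
        # urllib raises for 4xx/5xx responses; the status is still available.
        return exit_code_for_status(exc.code)
    except OSError:
        # Connection refused, DNS failure, timeout and similar errors.
        return 1
```

Note that `urllib` follows redirects by default, so a 3xx status is rarely seen directly in this sketch.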
Note
Make sure to install the cli extra:
uv pip install maykin-common[cli]
Celery beat health checks
If you use Celery Beat and install the health checks (see above), you can test the Beat liveness too:
maykin-common beat-health-check --file /tmp/celery_beat_live --max-age 120
The command exits with exit code 0 if the specified file exists and was last modified
within the specified number of seconds. The file path should match the value of the
MKN_HEALTH_CHECKS_BEAT_LIVENESS_FILE setting.
Tip
Use a --max-age that’s 2x the interval of your most frequently scheduled
task. E.g. if you have a task that runs every minute, pick 120 seconds.
Tip
Use startup probes if possible to give Beat time to load, start and schedule a task for the first time.
Celery worker health checks
If you use Celery and install the health checks (see above), you can test the health of your workers:
maykin-common worker-health-check \
--no-skip-event-loop-liveness \
--liveness-file /tmp/celery_worker_event_loop_live \
--max-age 70 \
--no-skip-ping \
--broker redis://localhost:6379/0 \
--worker-name celery@localhost \
--ping-timeout 3 \
--skip-readiness
The example options closely match the defaults.
Tip
Ensure you invoke this command in the worker container itself.
Tip
Use startup probes if possible to give the worker time to load, create the health check files and establish a broker connection.
The command tests different aspects and will exit with a non-zero exit code when any of the tests fails.
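How the individual tests combine into a single exit code can be sketched as follows. The function and check names are hypothetical, and the exit code 1 is an assumption; the documentation only states that the code is non-zero.

```python
from typing import Callable, Iterable, Mapping


def overall_exit_code(
    checks: Mapping[str, Callable[[], bool]],
    skip: Iterable[str] = (),
) -> int:
    """Run every check that is not skipped; return 0 only if all of them pass."""
    skipped = set(skip)
    failed = [
        name for name, check in checks.items()
        if name not in skipped and not check()
    ]
    return 1 if failed else 0


# Stand-in checks: readiness fails, the other two pass.
checks = {
    "event-loop-liveness": lambda: True,
    "ping": lambda: True,
    "readiness": lambda: False,
}

# With readiness skipped, its failure does not affect the result.
default_result = overall_exit_code(checks, skip={"readiness"})
# With every check enabled, the failing readiness check makes the result non-zero.
strict_result = overall_exit_code(checks)
```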
Event loop liveness
Default: enabled.
The event loop liveness test checks that the last modified timestamp of
--liveness-file is not older than --max-age. If the event loop/timer crashes,
then the liveness file will not be touched any more and eventually become older than
the provided max age.
By default, the event loop file is touched every minute, so the default max age accounts for some potential time drift.
Ping
Default: enabled.
The ping roundtrip sends a ping from the worker to itself, which travels via the broker connection. Ping failures point to broker connection issues, which prevent the worker from picking up tasks.
The ping check requires some routing information.
--broker
The address of the broker, matching the CELERY_BROKER setting. Ping needs to connect to the broker to send the control message. In container contexts, you will typically use a service or container name, e.g. redis://my-redis:6379/0/ that uses DNS resolution.
Tip
The localhost default points to the container itself. Make sure to provide this option explicitly.
--worker-name
The worker name is taken from the envvar CELERY_WORKER_NAME if set. In containerized environments, the worker name usually has the shape <queue>@<host>, where the default queue is usually named celery, but projects can define dedicated queues/workers for queues.
The host name is usually taken from the container hostname and matches either the container name on Docker engine or the pod name on Kubernetes.
Example worker names:
celery@sparky
celery@my-project-client-test-celery-1
long-running@my-project-client-test-celery-1
celery@celery-worker-554b9c67f9-c5cv4
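As a rough illustration of how such names are composed, the sketch below prefers the `CELERY_WORKER_NAME` environment variable and otherwise combines a queue name with the machine's host name. The fallback and the function name `guess_worker_name` are assumptions made for illustration; check the reference documentation for the actual default of `--worker-name`.

```python
import os
import socket


def guess_worker_name(queue: str = "celery") -> str:
    """Return CELERY_WORKER_NAME if set, else ``<queue>@<host>``."""
    explicit = os.environ.get("CELERY_WORKER_NAME")
    if explicit:
        return explicit
    # Inside a container, the host name is normally the container or pod name.
    return f"{queue}@{socket.gethostname()}"
```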
Note
Ping requires some additional information to keep the health check lightweight without loading the entire application in memory, as that itself can take multiple seconds and exceed probe timeouts.
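For context, Celery's own control API offers `app.control.ping(destination=[...], timeout=...)`, which returns a list of single-key dictionaries such as `[{"celery@sparky": {"ok": "pong"}}]`, or an empty list when nobody answered in time. The sketch below only interprets such a reply list and does not contact a broker. The helper name `worker_replied` is hypothetical, and this is not necessarily how the maykin-common command is implemented.

```python
from typing import Any


def worker_replied(replies: list[dict[str, Any]], worker_name: str) -> bool:
    """Return True if ``worker_name`` answered the ping with a pong."""
    for reply in replies:
        answer = reply.get(worker_name)
        if isinstance(answer, dict) and answer.get("ok") == "pong":
            return True
    return False
```

Matching on the exact worker name is why `--worker-name` has to be correct: a pong from a different worker says nothing about the container being probed.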
Readiness
Default: disabled.
The readiness test checks that the readiness file exists. It is created when the worker signals that it’s ready to start processing tasks, at the end of the startup phase. It is deleted when the worker shuts down.
Absence of the readiness file can indicate that the worker failed to load the application code. On Kubernetes with rolling deployments, you probably want to add this as a readiness probe to prevent old pods from being stopped when the new version is broken.
Recommended usage in a readiness probe:
maykin-common worker-health-check \
--skip-event-loop-liveness \
--skip-ping \
--readiness-file /tmp/celery_worker_ready
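The readiness file lifecycle can be pictured with a small sketch: create the file when the worker reports ready, remove it on shutdown, and let the probe test for its presence. The function names are hypothetical; in maykin-common this behaviour is wired up through the signal receiver installed by the Django app.

```python
from pathlib import Path


def mark_ready(readiness_file: Path) -> None:
    """Create the readiness file once the worker can process tasks."""
    readiness_file.touch()


def mark_shutdown(readiness_file: Path) -> None:
    """Remove the readiness file; a missing file is not an error."""
    readiness_file.unlink(missing_ok=True)


def is_ready(readiness_file: Path) -> bool:
    """The probe itself: the worker is ready if the file exists."""
    return readiness_file.is_file()
```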