Infrastructure health checks

Defaults and conventions around django-health-check.

Out of the box, we support health checks that verify:

request-response cycles function
the application can read and write from/to the database
there are no pending database migrations
the (default) cache is functioning

On an individual project basis, you can extend the application with additional checks and implement your own - see the upstream documentation.

Health check endpoints

The defaults included in maykin-common produce the following absolute URLs. The endpoints return an HTTP status code 200 if all is well, and 500 otherwise.

/_healthz/

Reports on all the configured checks: the database connection and permissions, the configured caches, and whether all migrations have been applied. This is the most expensive check.

It is suitable for the Kubernetes startup probe.

/_healthz/livez/

The cheapest check: it does not check any dependencies like the database or caches. Well suited to the HTTP liveness probe (Kubernetes) or the Docker engine health check.

/_healthz/readyz/

A check slightly more expensive than the liveness subset. It checks the database and default cache connections. Suitable for the readiness probe in Kubernetes.

Tip

Use the endpoint without a subset (/_healthz/) for the startup probe, the livez subset for the liveness probe and the Docker engine health check, and the readyz subset for the readiness probe.

Alternatively, you can use the livez endpoint for the readiness probe, and configure a higher failure threshold for the liveness probe to allow the application to recover before restarting it.
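To make the probe-to-endpoint mapping concrete, here is a minimal sketch of a client-side check using only the Python standard library. The helper names (probe_url, is_healthy) and the base URL are hypothetical and not part of maykin-common; only the three paths and the 200/500 convention come from the description above.

```python
import urllib.error
import urllib.request

# Paths as documented above; the probe names are just labels for this sketch.
PROBE_PATHS = {
    "startup": "/_healthz/",
    "liveness": "/_healthz/livez/",
    "readiness": "/_healthz/readyz/",
}


def probe_url(base_url: str, probe: str) -> str:
    """Build the absolute URL for a probe (hypothetical helper)."""
    return base_url.rstrip("/") + PROBE_PATHS[probe]


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # urllib raises for 4xx/5xx, e.g. the 500 of a failing check.
        return False
    except OSError:
        # Connection refused, DNS failure, timeout and similar.
        return False
```

For example, is_healthy(probe_url("http://localhost:8000", "liveness")) would hit the livez subset of a locally running project, assuming it listens on that address.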

All health check endpoints return an overview of the checks that have passed or failed. The response format depends on the Accept header:

text/html: Human-readable overview (default for browsers)
application/json: Machine-readable format (for monitoring tools)

You can force JSON output by appending ?format=json to any health check URL.
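Both ways of requesting JSON can be sketched with the standard library. The helper names below are hypothetical, and the sketch assumes that a failing endpoint (HTTP 500) still returns a JSON body when asked for one. It makes no assumption about the keys inside the report.

```python
import json
import urllib.error
import urllib.request
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_json_format(url: str) -> str:
    """Return the URL with format=json set, replacing any existing format."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "format"
    ]
    query.append(("format", "json"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch_report(url: str, timeout: float = 5.0) -> tuple[int, object]:
    """Request the report via the Accept header; return (status, parsed body)."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, json.loads(response.read())
    except urllib.error.HTTPError as error:
        # Assumption: the error response body is JSON as well.
        return error.code, json.loads(error.read())
```

The query parameter is convenient in a browser or with tools that cannot set headers; the Accept header keeps probe URLs unchanged.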

Django setting defaults

The setting defaults should be imported from maykin_common.health_checks:

from maykin_common.health_checks import (
    default_health_check_apps,
    default_health_check_subsets,
)

INSTALLED_APPS = [
    ...,
    *default_health_check_apps,
    ...
]

HEALTH_CHECK = {
    "SUBSETS": default_health_check_subsets,
}

Define configuration defaults for Django project settings.

maykin_common.health_checks.defaults.default_health_check_apps: Sequence[str] = ['health_check']

The default health check app and plugins to enable.

This set of plugins is configured because they’re 99% guaranteed to be used in every project. Other contrib plugins are omitted because they require more configuration for which we cannot easily provide defaults.

See https://codingjoe.dev/django-health-check/install/ for more details.

Celery

class maykin_common.health_checks.celery.apps.CeleryHealthChecksAppConfig(app_name, app_module)

ready(): Override this method in subclasses to run code when Django starts.

class maykin_common.health_checks.celery.probes.EventLoopProbe(parent, **kwargs)

Checks that the Celery worker event loop is alive.

When the Celery worker starts, it starts the event loop, timer and processing pool. As a “final” step, it starts the Consumer blueprint, which is responsible for establishing the broker connection and actually starting to process/consume messages and tasks.

This event loop probe installs a bootstep when the timer is available, and schedules a periodic callback that touches a liveness file. If the last-modified timestamp of the liveness file is too long ago, we can conclude/assume that the event loop has crashed and the worker should be restarted, as it’s likely that ETA/countdown tasks and tasks in general are not being processed anymore by this worker. This makes no guarantees about actually being able to consume tasks or having a live broker connection though. The timer runs in the main worker process (when using the preforking/multi-processing pool).

Celery itself should re-establish the broker connection on connection loss by restarting the Consumer blueprint, but bugs in Celery itself have been observed in the past. We can implement connectivity checks by pinging the worker from itself, which is set up elsewhere.

See the upstream documentation for details about blueprints and bootstep mechanisms.
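The consuming side of this mechanism, deciding whether the liveness file was touched recently enough, could look like the sketch below. It is not the library's implementation: the function name is hypothetical and the 60 second default for max_age is an assumed value that should exceed the interval of the periodic callback.

```python
import time
from pathlib import Path


def is_event_loop_alive(liveness_file: Path, max_age: float = 60.0) -> bool:
    """Return True if the liveness file was modified within max_age seconds."""
    try:
        last_touched = liveness_file.stat().st_mtime
    except FileNotFoundError:
        # The worker never touched the file (or it was removed).
        return False
    return (time.time() - last_touched) <= max_age
```

A container health check command could call such a function and exit non-zero when it returns False, so that the orchestrator restarts the worker.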

Usage:

>>> app = Celery("my-project")
>>> app.steps["worker"].add(EventLoopProbe)

liveness_file: Path

name = 'maykin_common.health_checks.celery.probes.EventLoopProbe'

requires = {'celery.worker.components:Timer'}

start(parent: Worker)

stop(parent: Worker)

tref: Entry | None = None

maykin_common.health_checks.celery.probes.connect_beat_signals()

maykin_common.health_checks.celery.probes.connect_worker_signals()

maykin_common.health_checks.celery.probes.on_beat_init(*, sender: Service, **kwargs)

maykin_common.health_checks.celery.probes.on_beat_task_published(*, sender: str, routing_key: str, **kwargs)

Update the celery beat liveness every time a task is successfully published.

after_task_publish fires in the process that sent the task, so we must discern between the regular Django app that schedules tasks, and celery beat that also schedules tasks. We do this by tapping into the beat_init signal to mark the process as a beat process, and only touch the liveness file when running in beat.
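The idea of marking the process and only then touching the file can be illustrated without Celery. In this simplified sketch the two functions stand in for the beat_init and after_task_publish handlers; their names and the module-level flag are hypothetical and the real handlers may be structured differently.

```python
from pathlib import Path

# Process-wide marker; stays False in regular Django/worker processes.
_is_beat_process = False


def mark_as_beat() -> None:
    """Stand-in for the beat_init handler: remember we run inside beat."""
    global _is_beat_process
    _is_beat_process = True


def touch_if_beat(liveness_file: Path) -> bool:
    """Stand-in for the after_task_publish handler.

    Touches the liveness file only in a process marked as beat and
    reports whether it did so.
    """
    if not _is_beat_process:
        return False
    liveness_file.touch()
    return True
```

Because the flag lives in module state, a web process that publishes tasks never sets it and therefore never updates the beat liveness file.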

maykin_common.health_checks.celery.probes.on_worker_ready(*, sender: Consumer, **kwargs): Create/touch the readiness file when the worker is ready to accept work.

maykin_common.health_checks.celery.probes.on_worker_shutdown(*, sender: Worker, **kwargs): Delete the readiness file when a worker shuts down.
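The readiness file lifecycle described by these two handlers amounts to create-on-ready and delete-on-shutdown, with a probe that tests for existence. A minimal sketch with hypothetical function names follows; the actual file location is determined by the library and is not assumed here.

```python
from pathlib import Path


def mark_ready(readiness_file: Path) -> None:
    """Create (or touch) the readiness file, as done when the worker is ready."""
    readiness_file.parent.mkdir(parents=True, exist_ok=True)
    readiness_file.touch()


def mark_not_ready(readiness_file: Path) -> None:
    """Delete the readiness file, as done on worker shutdown."""
    readiness_file.unlink(missing_ok=True)


def is_ready(readiness_file: Path) -> bool:
    """Readiness probe: the worker is ready while the file exists."""
    return readiness_file.is_file()
```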

Settings

maykin_common.settings.MKN_HEALTH_CHECKS_BEAT_LIVENESS_FILE