.. _health_checks: ============= Health checks ============= Health checks are used to mark (crashed) containers as unhealthy, so that they can be restarted by the container orchestration (Kubernetes, Docker engine...). The health check tooling in maykin-common covers the HTTP health checks for the Django app and the Celery components, like the worker and beat. .. versionadded:: 0.13.0 Added tooling for Django health checks. .. versionadded:: 0.14.0 Added support for Celery Beat. .. contents:: Jump to :local: :depth: 2 There is also :ref:`reference documentation ` available. Quickstart (tl;dr) ================== Install the extra dependencies: .. code-block:: bash uv pip install maykin-common[health-checks] Update your settings accordingly: .. code-block:: python from maykin_common.health_checks import default_health_check_apps INSTALLED_APPS = [ ..., *default_health_check_apps, "maykin_common.health_checks.celery", # optional, add if you use Celery ... ] and your root ``urls.py``: .. code-block:: python urlpatterns = [ ..., path("", include("maykin_common.health_checks.urls")), ..., ] and in your ``celery.py`` entrypoint: .. code-block:: python from maykin_common.health_checks.celery.probes import EventLoopProbe app = Celery("my-proj") app.steps["worker"].add(EventLoopProbe) See the :ref:`health_checks_cli` details for how to test the health. Celery ====== If you use Celery in your project, there are health check tools for the Celery components too. Beat ---- We can monitor Beat's liveness by tracking when was the last time a task was scheduled. Instrumentation is done by adding a Django app and (optionally) specifying the file path to the liveness file: .. code-block:: python from pathlib import Path INSTALLED_APPS = [ ..., "maykin_common.health_checks.celery", ... ] MKN_HEALTH_CHECKS_BEAT_LIVENESS_FILE = Path("/tmp") / "celery_beat_live" The file specified through ``MKN_HEALTH_CHECKS_BEAT_LIVENESS_FILE`` will be touched every time Beat successfully schedules a task to the broker. The health check can then test how long ago the file was last touched. For example, if your Beat schedule runs a task every hour, you could run the health check that expects the file to be modified less than 2 hours ago. .. tip:: If your normal schedule has very infrequent tasks (e.g. once per week), you may want to set up a smoke test task that runs more frequently (e.g. every hour). .. caution:: If you use the :class:`django_celery_beat.schedulers.DatabaseScheduler` scheduler, you should be aware that your schedules are editable at runtime through the admin, which may mean your ``max-age`` parameter no longer aligns with your actual schedule, leading to erroneously failed health checks. In such cases you might want to consider scheduling an explicit heartbeat task, using the task description to make clear that the task should be enabled and should not be re-scheduled. Worker ------ Monitoring Celery Worker's health is complicated due to the complex nature of workers and how they can fail. The worker system essentially boots up a whole stack of subsystems that can each fail and contribute to "broken" workers. For details, see the `blueprints `_ docs. **Enabling the checks** Enabling the worker health check machinery requires some small changes to your configuration. First, ensure the settings are configured appropriately: .. code-block:: python from pathlib import Path INSTALLED_APPS = [ ..., "maykin_common.health_checks.celery", ... ] # optional settings, the defaults are listed MKN_HEALTH_CHECKS_WORKER_EVENT_LOOP_PROBE_FREQUENCY_SECONDS = 60 MKN_HEALTH_CHECKS_WORKER_EVENT_LOOP_LIVENESS_FILE = Path("/tmp") / "celery_worker_event_loop_live" MKN_HEALTH_CHECKS_WORKER_READINESS_FILE = Path("/tmp") / "celery_worker_ready" This installs the signal receiver for the worker ready/shutdown, wich affects the presence of ``MKN_HEALTH_CHECKS_WORKER_READINESS_FILE``. Next, in your Celery entrypoint (where you define ``app = Celery("my-celery-app")``), add the bootstep: .. code-block:: python :linenos: :emphasize-lines: 1,4 from maykin_common.health_checks.celery.probes import EventLoopProbe app = Celery("my-project") app.steps["worker"].add(EventLoopProbe) app.autodiscover_tasks() Which sets up the event-loop monitoring and affects the last-modified timestamp of the ``MKN_HEALTH_CHECKS_WORKER_EVENT_LOOP_LIVENESS_FILE``. **Background information** The health-check tooling we ship hooks into some critical phases: #. Worker starts #. Event loop is started, including the timer <-- we hook into the timer to check the event loop #. Consumer starts #. Consumer establishes broker connection <-- we test the connection by pinging #. Consumer starts consuming tasks #. Consumer is ready to process tasks <-- we hook into this signal Celery's machinery is set up so that the whole consumer subsystem restarts on connection loss, which should make it recover gracefully without restarting the whole worker process. **What we monitor** * Event loop liveness. Periodically, we touch a heartbeat file that shows the event loop is still live and able to orchestrate work. The frequency can be tweaked with the ``MKN_HEALTH_CHECKS_WORKER_EVENT_LOOP_PROBE_FREQUENCY_SECONDS`` setting. * Broker connection health, by sending a PING roundtrip from the worker process to itself. This is reliable (especially with the default preforking pool) because the ping control machinery lives in the main process, not in the worker processes that actually execute tasks, so it cannot be blocked by long-running tasks. * Worker readiness - a readiness file is created when the worker is ready to start processing tasks. It is deleted again when the worker shuts down. **What we don't monitor** * Actual tasks execution - while it's possible in theory to create a worker-specific queue and schedule a task from itself to the worker, this brings a lot of additional uncertainty and complexity: * the task can be held up by other long-running tasks, because it will execute in a worker process * the queue name must contain the worker (host) name, which creates many keys in the broker. For brokers like Redis, these keys don't automatically expire and pollute the system. So, a periodic task and additional machinery are needed to detect stale exchanges and clean them up. * Desired concurrency is available. Some tests showed that Celery seems perfectly capable of restarting crashed processes and maintains the desired number of worker processes. Unless this becomes an observed failure mode, these sort of things will not be added. .. _health_checks_cli: Command line ============ HTTP health checks ------------------ You can use the ``maykin-common`` CLI to probe the health check endpoint(s): .. code-block:: bash maykin-common health-check --endpoint=/_healthz/livez/ Which will exit with exit code ``0`` for success responses (HTTP status code between 200 and 399). .. note:: Make sure the install the ``cli`` extra: .. code-block:: bash uv pip install maykin-common[cli] Celery beat health checks ------------------------- If you use Celery Beat and install the health checks (see above), you can test the Beat liveness too: .. code-block:: bash maykin-common beat-health-check --file /tmp/celery_beat_live --max-age 120 Which will exit with exit code ``0`` if the specified file exists and is last modified within the specified number of seconds. The file path should match the value of the ``MKN_HEALTH_CHECKS_BEAT_LIVENESS_FILE`` setting. .. tip:: Use a ``--max-age`` that's 2x the interval of your most frequently scheduled task. E.g. if you have a task that runs every minute, pick 120 seconds. .. tip:: Use startup probes if possible to give Beat time to load, start and schedule a task for the first time. Celery worker health checks --------------------------- If you use Celery and install the health checks (see above), you can test their health: .. code-block:: bash maykin-common worker-health-check \ --no-skip-event-loop-liveness \ --liveness-file /tmp/celery_worker_event_loop_live \ --max-age 70 \ --no-skip-ping \ --broker redis://localhost:6379/0 \ --worker-name celery@localhost \ --ping-timeout 3 \ --skip-readiness The example options closely match the defaults. .. tip:: Ensure you invoke this command in the worker container itself. .. tip:: Use startup probes if possible to give the worker time to load, create the health check files and establish a broker connection. The command tests different aspects and will exit with a non-zero exit code when any of the tests fails. Event loop liveness ~~~~~~~~~~~~~~~~~~~ Default: enabled. The event loop liveness test checks that the last modified timestamp of ``--liveness-file`` is not older than ``--max-age``. If the event loop/timer crashes, then the liveness file will not be touched any more and eventually become older than the provided max age. By default, the event loop file is touched every minute, so the default max age accounts for some potential time drift. Ping ~~~~ Default: enabled. The ping roundtrip sends a ping from the worker to itself, which travels via the broker connection. Ping failures detect potential broker connection issues which definitely result in the worker not being able to pick up results. The ping check requires some routing information. ``--broker`` The address of the broker, matching the ``CELERY_BROKER`` setting. Ping needs to connect to the broker to send the control message. In container contexts, you will typically use a service or container name, e.g. ``redis://my-redis:6379/0/`` that uses DNS resolution. .. tip:: The ``localhost`` default points to the container itself. Make sure to provide this option explicitly. ``--worker-name`` The worker name is taken from the envvar ``CELERY_WORKER_NAME`` if set. In containerized environments, the worker name usually has the shape ``@``, where the default queue is usually named ``celery``, but projects can define dedicated queues/workers for queues. The host name is usually taken from the container ``hostname`` and matches either the container name on Docker engine or the pod name on Kubernetes. Example worker names: * ``celery@sparky`` * ``celery@my-project-client-test-celery-1`` * ``long-running@my-project-client-test-celery-1`` * ``celery@celery-worker-554b9c67f9-c5cv4`` .. note:: Ping requires some additional information to keep the health check lightweight without loading the entire application in memory, as that itself can take multiple seconds and can fail probe timeouts. Readiness ~~~~~~~~~ Default: disabled. The readiness test checks that the readiness file exists. It is created when the worker signals that it's ready to start processing tasks, at the end of the startup phase. It is deleted when the worker shuts down. Absence of the readiness file can indicate that the worker failed to load the application code. On Kubernetes with rolling deployments, you probably want to add this as a readiness probe to prevent old pods from being stopped when the new version is broken. Recommended usage in a readiness probe: .. code-block:: bash maykin-common worker-health-check \ --skip-event-loop-liveness \ --skip-ping \ --readiness-file /tmp/celery_worker_ready