.. _otel:

==============
Open Telemetry
==============

.. versionadded:: 0.7.0

   Added tooling for Open Telemetry.

`Open Telemetry`_ (OTEL) is a framework and standard to collect telemetry data from
systems with the goal of making them observable.

.. contents:: Jump to
   :local:
   :depth: 2

Telemetry data covers a number of `signals `_, of which we focus on three kinds:

* **logs** - allow you to reconstruct what happened and why
* **metrics** - collect measurements of interesting system state
* **traces** - correlate units of work and establish their parent/child relationships,
  even across system boundaries

Open Telemetry provides specifications and tooling for each of these signals.

.. note:: Currently, we don't use Open Telemetry tooling for logs, but projects
   typically set up `structlog `_, whose output gets scraped and persisted in some
   monitoring backend. In the future, we will include the helpers for this in
   maykin-common.

Quickstart (tl;dr)
==================

Install the extra dependencies:

.. code-block:: bash

    uv pip install maykin-common[otel]

Call the initialization code:

.. code-block:: python
    :linenos:
    :caption: src/my_awesome_project/setup.py
    :emphasize-lines: 5,11-18

    import os
    import warnings

    from dotenv import load_dotenv  # assumed to come from python-dotenv
    from maykin_common.otel import setup_otel


    def setup_env():
        load_dotenv(...)
        os.environ.setdefault("DJANGO_SETTINGS_MODULE", ...)
        if "OTEL_SERVICE_NAME" not in os.environ:
            warnings.warn(
                "No OTEL_SERVICE_NAME environment variable set, using a default. "
                "You should set a (distinct) value for each component (web, worker...)",
                RuntimeWarning,
                stacklevel=2,
            )
            os.environ.setdefault("OTEL_SERVICE_NAME", "my-awesome-project")

        setup_otel()

        # other initialization...

See the :ref:`otel_best_practices` for more information about the ``OTEL_SERVICE_NAME``
environment variable usage.

.. note:: Seeing this working is not obvious - telemetry sending needs to be enabled
   (it's initially disabled in default-project) *and* you need a sink to send the OTLP
   data to.
   Check default-project or the Maykin docs for reference instructions if your project
   has not been prepared yet.

If you use Celery with the process pool (the default), then your worker invocation must
set ``_OTEL_DEFER_SETUP=True`` in the environment to defer the initialization until the
worker process has forked.

Python Open Telemetry SDK
=========================

:func:`maykin_common.otel.setup_otel` calls the setup functions from the
`python SDK `_.

The toolchain roughly consists of two core packages plus some extensions:

* ``opentelemetry-api`` - for library authors, foundation for the SDK
* ``opentelemetry-sdk`` - the concrete implementations and project-specific integrations

``maykin-common`` uses the SDK package to provide an opinionated, pre-configured,
ready-to-use setup. You should not need to override this.

If/when we define metrics in other modules of maykin-common, those modules may only use
the API exposed by ``opentelemetry-api``. Usage of the ``opentelemetry-sdk`` package is
restricted to the :mod:`maykin_common.otel` module.

There are additional contrib packages with library/framework specific instrumentation,
like the ``opentelemetry-instrumentation-django`` package. This has all been
pre-configured in maykin-common.

The `examples `__ documentation can be a useful reference.

Defining metrics
----------------

By default, the wsgi instrumentation (set up via the Django instrumentation) captures
spans of request/response cycles. It also captures request metrics, like the duration,
annotated with context like the path, method etc.

As an application developer, you can provide a lot of extra value by defining and
tracking application-specific metrics, because you have the context of the application
and know which data/information is interesting.

Defining and using a metric is pretty straightforward:

.. code-block:: python
    :linenos:
    :emphasize-lines: 4,6-9,16

    # in metrics.py
    from opentelemetry import metrics

    meter = metrics.get_meter("my_awesome_project.my_module")

    export_counter = meter.create_counter(
        "exports",
        description="The number of exports triggered by users",
    )

    # in views.py
    from .metrics import export_counter


    def export(request, pk: int):
        export_counter.add(1, {"pk": pk, "user": request.user.username})
        return _create_export(pk=pk)

.. warning:: Resist the temptation to use ``__name__`` for the meter definitions! See
   :external:meth:`opentelemetry.sdk.metrics.MeterProvider.get_meter`.

.. note:: Other packages that we maintain can also opt in to defining and tracking
   metrics in the future.

Tracing
-------

:func:`maykin_common.otel.setup_otel` calls Open Telemetry instrumentors which
automatically add traces for Django requests, Redis and PostgreSQL queries, Celery tasks
and external HTTP requests performed with the ``requests`` library.

It's also possible to add manual spans for any part of the code. Open Telemetry provides
a context manager ``tracer.start_as_current_span`` for this:

.. code-block:: python
    :linenos:

    from opentelemetry import trace

    tracer = trace.get_tracer("my_awesome_project.my_module")


    def do_work():
        print("doing work outside of the span")
        with tracer.start_as_current_span("span-name") as span:
            print("doing some work that the span will track")
        print("doing more work outside of the span")

It's also possible to use ``tracer.start_as_current_span`` as a decorator:

.. code-block:: python
    :linenos:

    from opentelemetry import trace

    tracer = trace.get_tracer("my_awesome_project.my_module")


    @tracer.start_as_current_span("span-name")
    def do_work():
        print("doing some work that the span will track")

The Open Telemetry documentation provides more `examples `_ of how to create spans.

.. _otel_best_practices:

Best practices
==============

**Service name vs. deployment environment**

Don't put the deployment target (prod, acc, test...)
in the service name, as that leads to higher-cardinality labels, which has a negative
impact on storage and query performance. Instead, make sure to properly define the
``ENVIRONMENT`` Django setting, which is also used by our Sentry SDK initialisation.

**Use different service names for different logical units**

The Django application (deployed with uwsgi, for example) is a different logical unit
than the Celery worker processing background tasks. In fact, even different task queues
(e.g. high/low prio) are different units, and deserve their own easy-to-identify service
name.

.. tip:: Define ``OTEL_SERVICE_NAME`` as an environment variable in the entrypoint shell
   scripts like ``bin/docker_start.sh`` and ``bin/celery_worker.sh``:

   .. code-block:: bash
       :caption: bin/celery_worker.sh

       QUEUE=${CELERY_WORKER_QUEUE:=celery}
       WORKER_NAME=${CELERY_WORKER_NAME:="${QUEUE}"@%n}

       # Set defaults for OTEL and export them so the worker process sees them
       : "${OTEL_SERVICE_NAME:=my-project-worker-${QUEUE}}"
       export OTEL_SERVICE_NAME

Suggested names to encourage consistency:

* ``my-project`` - the Django project that responds to HTTP requests
* ``my-project-worker-celery``, ``my-project-worker-highprio`` - each (dedicated) Celery
  worker queue. If you have different queues set up, each one is typically its own
  service
* ``my-project-flower`` - the Celery monitoring service
* ``my-project-scheduler`` - the Celery beat task scheduler

**Extract resource attributes for containers**

Usually our applications are deployed in one of two ways:

* on Kubernetes
* on (virtual) servers with Docker engine

For the Docker engine case, we can extract additional resource attributes by setting
``_OTEL_ENABLE_CONTAINER_RESOURCE_DETECTOR=true``. Don't do this on Kubernetes, as it may
lead to conflicting information.

On Kubernetes, the recommendation is to enable the k8sattributeprocessor_ when deploying
the :ref:`Collector <otel_architecture_collector>`.

**Authentication**

The Collector may be protected with an API key or username/password.
In that case, you can pass additional headers via the standardized environment variable:

.. code-block:: bash

    OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic "

Architecture
============

The essence is simple: instrumented *services* produce telemetry data that gets
*exported* to a telemetry *receiver* which ensures the data gets *persisted*.
Visualisation and monitoring tooling queries the telemetry data, making the service
observable and providing (automated) alerting options.

We have made some decisions at the library level that correspond to the following
diagram:

.. code-block:: none

                                                        +----------------+
                                                        | metrics time   |
                                                        | series storage | >---+
                                                        +----------------+     |
    +-----------+  telemetry                             ^                     |   pull/query
    | Service A |-------------+                         /                      |
    +-----------+             |                        /                       |
                              |   +----------------+                           |
                              +-> |                |     +---------------+     |   +------------+
                                  | OTel Collector |---> | spans storage | >---+---| Dashboards |
                              +-> |                |     +---------------+     |   +------------+
                              |   +----------------+                           |
    +-----------+  telemetry  |                      \                         |
    | Service B |-------------+                       \                        |
    +-----------+                                      v                       |
                                                      +--------------+         |
                                                      | logs storage | >-------+
                                                      +--------------+

Services
--------

The services are the applications producing telemetry data. They can be different
projects that each depend on ``maykin_common[otel]``, but they can also be different
aspects of the same project - see the :ref:`otel_best_practices` about different service
names.

:func:`maykin_common.otel.setup_otel` sets up the application so that the produced
telemetry data gets exported using the OTLP_ protocol. Telemetry gets pushed over
``gRPC`` or ``http/protobuf`` to an endpoint that can receive OTLP data.

.. _otel_architecture_collector:

Open Telemetry receiver
-----------------------

The receivers are applications deployed/running somewhere that can accept telemetry data
in the OTLP format. They receive the telemetry from the services.

`Open Telemetry Collector`_ is vendor-agnostic software that can receive, process and
export telemetry data.
It has no storage of its own, but instead exports the telemetry data according to
configuration parameters.

The collector is not a hard requirement - many storage backends support ingesting OTLP
data directly, but having a centralised collector is very convenient and simplifies the
service configuration.

Storage
-------

The storage backends are applications that can receive and persist the telemetry data.
Typically, you can configure retention periods, and they use databases optimized for the
nature of the telemetry data. They're usually also the applications that expose a query
interface for the visualization tooling.

Different vendors typically compete with each other at this level. Some well-known
examples are:

* Prometheus, InfluxDB, Datadog, Splunk for time-series data (typically metrics)
* Loki, Signoz, Logtail, Datadog, Splunk for logs
* Jaeger, Elastic APM, Tempo, Datadog, Splunk for distributed traces

Commercial offerings typically provide an all-in-one solution for all types of
telemetry.

Dashboards/visualisation/alerting
---------------------------------

Software like Grafana and Kibana specializes in querying and displaying observability
data. You can usually define dashboards with visualisations to explore the data that was
ingested.

This is typically done by defining queries (in ``promql`` for Prometheus, ``logql`` for
Loki etc.) which filter on labels of telemetry data (e.g. show only metrics from
production and exclude test/acceptance environments) and may even combine different
metrics, ultimately leading to easy-to-understand graphs that show what the state of the
system is/was.

Troubleshooting
===============

Combining all this with pre-forking application servers like uwsgi and gunicorn is a
challenge. Some issues were encountered and the code has been adapted for use with
``uwsgi``, but we can't guarantee that all uwsgi configuration options will work out of
the box.

* ``--py-call-uwsgi-fork-hooks`` has been observed causing segfaults, even though this
  is recommended/required by the Sentry SDK (the Sentry SDK only uses it for its
  telemetry features, so *we think* it can be ignored)
* ``--lazy-apps`` has been observed to result in the OTel setup not being executed. It's
  possible that the ``@postfork`` hook is mutually exclusive with ``--lazy-apps``.
* Calling an instrumentor (``SomeInstrumentor().instrument()``) in the postfork hook can
  lead to no metrics being collected at all, which *looks* like an exporter problem.

.. _Open Telemetry: https://opentelemetry.io/
.. _OTLP: https://opentelemetry.io/docs/specs/otlp/
.. _Open Telemetry Collector: https://opentelemetry.io/docs/collector/
.. _k8sattributeprocessor: https://opentelemetry.io/docs/platforms/kubernetes/collector/components/#kubernetes-attributes-processor
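
When telemetry does not show up, a quick first check is which of the environment
variables mentioned in this document are actually set in the affected process. The
helper below is a hypothetical sketch and not part of maykin-common: the names
``OTEL_ENV_VARS``, ``SENSITIVE_ENV_VARS`` and ``summarize_otel_env`` are made up for
illustration. It masks ``OTEL_EXPORTER_OTLP_HEADERS`` because that variable may carry
credentials.

.. code-block:: python

    import os
    from collections.abc import Mapping

    # Environment variables mentioned in this document. This helper is
    # hypothetical and does not exist in maykin-common.
    OTEL_ENV_VARS = (
        "OTEL_SERVICE_NAME",
        "OTEL_EXPORTER_OTLP_HEADERS",
        "_OTEL_DEFER_SETUP",
        "_OTEL_ENABLE_CONTAINER_RESOURCE_DETECTOR",
    )

    # Variables that may carry credentials and should not end up in logs.
    SENSITIVE_ENV_VARS = frozenset({"OTEL_EXPORTER_OTLP_HEADERS"})


    def summarize_otel_env(environ: Mapping[str, str] = os.environ) -> dict[str, str]:
        """Report which OTel-related variables are set, masking sensitive values."""
        summary = {}
        for name in OTEL_ENV_VARS:
            if name not in environ:
                summary[name] = "<unset>"
            elif name in SENSITIVE_ENV_VARS:
                summary[name] = "<set, hidden>"
            else:
                summary[name] = environ[name]
        return summary

Calling ``summarize_otel_env()`` without arguments inspects ``os.environ``; passing a
plain ``dict`` makes it easy to try out in a shell or a test.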