Open Telemetry

Added in version 0.7.0: Added tooling for open telemetry.

Open Telemetry (OTEL) is a framework and standard to collect telemetry data from systems with the goal of making them observable.

Telemetry data covers a number of signals, of which we focus on three kinds:

  • logs - allow you to reconstruct what and why something happened

  • metrics - collect measurements from interesting state of the system

  • traces - correlate units of work and establish their parent/child relationship, even across system boundaries

Open Telemetry defines conventions and tooling for all of these signals.

Note

Currently, we don’t use Open Telemetry tooling for logs. Projects typically set up structlog instead, and its output gets scraped and persisted in some monitoring backend. In the future, we will include the helpers for this in maykin-common.

Quickstart (tl;dr)

Install the extra dependencies:

uv pip install maykin-common[otel]

Call the initialization code:

src/my_awesome_project/setup.py
import os
import warnings

from dotenv import load_dotenv
from maykin_common.otel import setup_otel


def setup_env():
    load_dotenv(...)

    os.environ.setdefault("DJANGO_SETTINGS_MODULE", ...)
    # Warn loudly when falling back to the default service name.
    if "OTEL_SERVICE_NAME" not in os.environ:
        warnings.warn(
            "No OTEL_SERVICE_NAME environment variable set, using a default. "
            "You should set a (distinct) value for each component (web, worker...)",
            RuntimeWarning,
            stacklevel=2,
        )
        os.environ.setdefault("OTEL_SERVICE_NAME", "my-awesome-project")

    setup_otel()

    # other initialization...

See the Best practices for more information about the OTEL_SERVICE_NAME environment variable usage.

Note

It is not obvious whether this is working: telemetry sending needs to be enabled (it’s initially disabled in default-project) and you need a sink to send the OTLP data to. Check default-project or the Maykin docs for reference instructions if your project has not been prepared yet.

If you use Celery with the process pool (the default), then your worker invocation must set _OTEL_DEFER_SETUP=True in the environment to defer the initialization until the worker process has forked.

Python Open Telemetry SDK

maykin_common.otel.setup_otel() calls the setup functions from the Python SDK. The toolchain roughly consists of two core packages plus some extensions:

  • opentelemetry-api - for library authors, foundation for the SDK

  • opentelemetry-sdk - the concrete implementations and project-specific integrations

maykin-common uses the SDK package to provide an opinionated, pre-configured, ready-to-use setup. You should not need to override this.

If/when we define metrics in other modules of maykin-common, those modules may only use the API exposed by opentelemetry-api. Usage of the opentelemetry-sdk package is restricted to the maykin_common.otel module.

There are additional contrib packages with library/framework specific instrumentation, like the opentelemetry-instrumentation-django package. This has all been pre-configured in maykin-common.

The examples documentation can be interesting.

Defining metrics

By default, the WSGI instrumentation (set up via the Django instrumentation) captures spans of request/response cycles. It also captures request metrics, like the duration, annotated with context like the path, method etc.

Application developers can provide a lot of extra value by defining and tracking application-specific metrics, because they have the context of the application and know which data/information is interesting.

Defining and using a metric is pretty straightforward:

# in metrics.py
from opentelemetry import metrics

meter = metrics.get_meter("my_awesome_project.my_module")

export_counter = meter.create_counter(
    "exports",
    description="The number of exports triggered by users",
)


# in views.py
from .metrics import export_counter

def export(request, pk: int):
    export_counter.add(1, {"pk": pk, "user": request.user.username})
    return _create_export(pk=pk)

Warning

Resist the temptation to use __name__ for the meter definitions! See opentelemetry.sdk.metrics.MeterProvider.get_meter()

Note

Other packages that we maintain can also opt-in to defining and tracking metrics in the future.

Tracing

maykin_common.otel.setup_otel() calls opentelemetry instrumentors which automatically add traces for Django requests, Redis and PostgreSQL queries, Celery tasks and external HTTP requests performed with the requests library.

It’s also possible to add manual spans for any part of the code. Open Telemetry provides the context manager tracer.start_as_current_span for this:

from opentelemetry import trace

tracer = trace.get_tracer("my_awesome_project.my_module")


def do_work():
    print("doing work outside of the span")

    with tracer.start_as_current_span("span-name") as span:
        print("doing some work, that span will track")

    print("doing more work outside of the span")

It’s also possible to use tracer.start_as_current_span as a decorator:

from opentelemetry import trace

tracer = trace.get_tracer("my_awesome_project.my_module")


@tracer.start_as_current_span("span-name")
def do_work():
    print("doing some work, that span will track")

The Open Telemetry documentation provides more examples of how to create spans.

Best practices

Service name vs. deployment environment

Don’t put the deployment target (prod, acc, test…) in the service name, as that leads to higher-cardinality labels, which hurts storage and query performance. Instead, make sure to properly define the ENVIRONMENT Django setting, which is also used by our Sentry SDK initialisation.
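To make the cardinality argument concrete, the following sketch counts the distinct service names a backend would have to index. The component names follow the naming suggestions in this document; the environment list and the helper function are illustrative assumptions:

```python
from itertools import product

# The environment list is an assumption for illustration.
COMPONENTS = ["my-project", "my-project-worker-celery", "my-project-scheduler"]
ENVIRONMENTS = ["prod", "acc", "test"]


def distinct_service_names(embed_environment: bool) -> set[str]:
    """Return every distinct value the service name label can take."""
    if embed_environment:
        # One service name per (component, environment) combination.
        return {f"{c}-{e}" for c, e in product(COMPONENTS, ENVIRONMENTS)}
    # The environment lives in its own attribute, so names stay per component.
    return set(COMPONENTS)


print(len(distinct_service_names(embed_environment=True)))   # 9
print(len(distinct_service_names(embed_environment=False)))  # 3
```

The service name label grows with every environment that is baked into it, while a separate environment attribute keeps both labels small.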

Use different service names for different logical units

The Django application (deployed with uwsgi, for example) is a different logical unit than the celery worker processing background tasks. In fact, even different task queues (e.g. high/low prio) are different units, and deserve their own easy-to-identify service name.

Tip

Define OTEL_SERVICE_NAME as an environment variable in the entrypoint shell scripts like bin/docker_start.sh and bin/celery_worker.sh:

bin/celery_worker.sh
QUEUE=${CELERY_WORKER_QUEUE:=celery}
WORKER_NAME=${CELERY_WORKER_NAME:="${QUEUE}"@%n}

# Set defaults for OTEL
: "${OTEL_SERVICE_NAME:=my-project-worker-"${QUEUE}"}"

Suggested names to encourage consistency:

  • my-project - the django project that responds to HTTP requests

  • my-project-worker-celery, my-project-worker-highprio - each (dedicated) celery worker queue. If you have different queues set up, each one is typically its own service

  • my-project-flower - the celery monitoring service

  • my-project-scheduler - the celery beat task scheduler

Extract resource attributes for containers

Usually our applications are deployed in one of two ways:

  • on Kubernetes

  • on (virtual) servers with Docker engine

For the Docker engine case, we can extract additional resource attributes by setting _OTEL_ENABLE_CONTAINER_RESOURCE_DETECTOR=true. Don’t do this on Kubernetes, as it may lead to conflicting information.

On Kubernetes, the recommendation is to enable the k8sattributeprocessor when deploying the Collector.

Authentication

The Collector may be API key or username/password protected. In that case, you can pass additional headers via the standardized environment variable:

OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <base64-username:password>"
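The placeholder stands for the Base64 encoding of username:password, as used by HTTP Basic authentication. A minimal sketch for generating the value, where the helper name and the credentials are made up for illustration:

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build a value for OTEL_EXPORTER_OTLP_HEADERS using HTTP Basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Authorization=Basic {token}"


# Example credentials only; never hard-code real secrets.
print(basic_auth_header("otel", "s3cret"))
```

The printed string can be used as the value of the environment variable in the deployment configuration.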

Architecture

The essence is simple: instrumented services produce telemetry data that gets exported to a telemetry receiver, which ensures the data gets persisted. Visualisation and monitoring tooling queries the telemetry data, which makes the service observable and provides (automated) alerting options.

We have made some decisions at the library level that correspond to the following diagram:

                                                    +----------------+
                                                    | metrics time   |
                                                    | series storage | >---+
                                                    +----------------+     |
+-----------+   telemetry                          ^                       | pull/query
| Service A |-------------+                       /                        |
+-----------+             |                      /                         |
                          |   +----------------+                           |
                          +-> |                |     +---------------+     |   +------------+
                              | OTel Collector |---> | spans storage | >---+---| Dashboards |
                          +-> |                |     +---------------+     |   +------------+
                          |   +----------------+                           |
+-----------+   telemetry |                     \                          |
| Service B |-------------+                      \                         |
+-----------+                                     v                        |
                                                  +--------------+         |
                                                  | logs storage | >-------+
                                                  +--------------+

Services

The services are the applications producing telemetry data. They can be different projects that each depend on maykin_common[otel], but they can also be different aspects of the same project - see the Best practices about different service names.

maykin_common.otel.setup_otel() sets up the application so that the produced telemetry data gets exported using OTLP (the Open Telemetry Protocol). Telemetry gets pushed over gRPC or http/protobuf to an endpoint that can receive OTLP data.
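The Open Telemetry specification standardizes the environment variables OTEL_EXPORTER_OTLP_PROTOCOL and OTEL_EXPORTER_OTLP_ENDPOINT for selecting the transport and the target. The sketch below shows how they combine with the specification's default ports. The resolve_otlp_target helper is hypothetical, and the fallback to gRPC is an assumption, so check your project for the real default:

```python
import os

# Default local endpoints from the Open Telemetry exporter specification.
DEFAULT_ENDPOINTS = {
    "grpc": "http://localhost:4317",
    "http/protobuf": "http://localhost:4318",
}


def resolve_otlp_target(environ=os.environ) -> tuple[str, str]:
    """Return the (protocol, endpoint) pair implied by the environment."""
    # Assumption: fall back to gRPC when no protocol is configured.
    protocol = environ.get("OTEL_EXPORTER_OTLP_PROTOCOL", "grpc")
    endpoint = environ.get(
        "OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_ENDPOINTS[protocol]
    )
    return protocol, endpoint
```

Passing a plain dict instead of os.environ makes it easy to check a planned deployment configuration before rolling it out.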

Open Telemetry receiver

The receivers are applications deployed/running somewhere that can accept telemetry data in the OTLP format. They receive the telemetry from the services.

The Open Telemetry Collector is vendor-agnostic software that can receive, process and export telemetry data. It does not have storage of its own, but instead exports the telemetry data according to configuration parameters.

The collector is not a hard requirement - many storage backends support ingesting OTLP data directly, but having a centralised collector is very convenient and simplifies the service configuration.

Storage

The storage backends are applications that can receive and persist the telemetry data.

Typically, you can configure retention periods, and they use databases optimized for the nature of the telemetry data. They’re usually also the applications that expose a query interface for the visualization tooling.

Different vendors typically compete with each other at this level. Some well-known examples are:

  • Prometheus, InfluxDB, Datadog, Splunk for time-series data (typically metrics)

  • Loki, Signoz, Logtail, Datadog, Splunk for logs

  • Jaeger, Elastic APM, Tempo, Datadog, Splunk for distributed traces

Commercial offerings typically provide an all-in-one solution for all types of telemetry.

Dashboards/visualisation/alerting

Software like Grafana and Kibana specialize in querying and displaying observability data. Typically you can define dashboards with visualisations to explore the data that was ingested.

This is typically done by defining queries (in PromQL for Prometheus, LogQL for Loki etc.) which filter on labels of telemetry data (e.g. show only metrics from production and exclude test/acceptance environments) and may even combine different metrics, ultimately leading to easy-to-understand graphs that show what the state of the system is/was.
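Conceptually, such a query selects samples by label and then aggregates them. The following plain-Python sketch mimics that idea on made-up data. It does not reflect how any particular backend or query language is implemented:

```python
from statistics import mean

# Made-up request durations in milliseconds, each tagged with labels.
SAMPLES = [
    {"labels": {"service": "my-project", "environment": "prod"}, "value": 200},
    {"labels": {"service": "my-project", "environment": "prod"}, "value": 400},
    {"labels": {"service": "my-project", "environment": "test"}, "value": 3000},
]


def select(samples, **matchers):
    """Keep only samples whose labels equal every given matcher."""
    return [
        sample
        for sample in samples
        if all(sample["labels"].get(k) == v for k, v in matchers.items())
    ]


# Only production samples contribute to the aggregate.
prod_samples = select(SAMPLES, environment="prod")
average_ms = mean(sample["value"] for sample in prod_samples)
```

Here the slow test environment sample is excluded, so the average reflects production traffic only.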

Troubleshooting

Combining all this with pre-forking application servers like uwsgi and gunicorn is a challenge. Some issues were encountered and the code has been adapted for use with uwsgi, but we can’t guarantee that all uwsgi configuration options will work out of the box.

  • --py-call-uwsgi-fork-hooks has been observed to cause segfaults, even though this option is recommended/required by the Sentry SDK (the SDK only uses it for its telemetry features, so we think the recommendation can be ignored)

  • --lazy-apps has been observed to result in the OTel setup not being executed. It’s possible that the @postfork hook is mutually exclusive with --lazy-apps.

  • Calling an instrumentor (SomeInstrumentor().instrument()) in the postfork hook can lead to no metrics being collected at all, which looks as if it’s an exporter problem.