Open Telemetry
Added in version 0.7.0: Added tooling for open telemetry.
Open Telemetry (OTEL) is a framework and standard to collect telemetry data from systems with the goal of making them observable.
Telemetry data covers a number of signals, of which we focus on three kinds:
logs - allow you to reconstruct what and why something happened
metrics - collect measurements from interesting state of the system
traces - correlate units of work and establish their parent/child relationship, even across system boundaries
Open Telemetry specifies how each of these signals is modelled, collected and exported.
Note
Currently, we don’t use Open Telemetry tooling for logs, but projects typically set up structlog, whose output gets scraped and persisted in some monitoring backend. In the future, we will include the helpers for this in maykin-common.
Quickstart (tl;dr)
Install the extra dependencies:
uv pip install maykin-common[otel]
Call the initialization code:
import os
import warnings

from dotenv import load_dotenv  # provided by the python-dotenv package
from maykin_common.otel import setup_otel


def setup_env():
    load_dotenv(...)

    os.environ.setdefault("DJANGO_SETTINGS_MODULE", ...)
    # Warn when no service name is configured, then fall back to a default.
    if "OTEL_SERVICE_NAME" not in os.environ:
        warnings.warn(
            "No OTEL_SERVICE_NAME environment variable set, using a default. "
            "You should set a (distinct) value for each component (web, worker...)",
            RuntimeWarning,
            stacklevel=2,
        )
        os.environ.setdefault("OTEL_SERVICE_NAME", "my-awesome-project")

    setup_otel()

    # other initialization...
See the Best practices for more information about the OTEL_SERVICE_NAME
environment variable usage.
Note
Seeing this working is not obvious - telemetry sending needs to be enabled (it’s initially disabled in default-project) and you need a sink to send the OTLP data to. Check default-project or the Maykin docs for reference instructions if your project has not been prepared yet.
If you use Celery with the process pool (the default), then your worker invocation must set
_OTEL_DEFER_SETUP=True in the environment to defer the initialization until the
worker process has forked.
Python Open Telemetry SDK
maykin_common.otel.setup_otel() calls the setup functions from the
Python SDK. The toolchain consists
roughly of two core packages plus some extensions:
opentelemetry-api - for library authors, foundation for the SDK
opentelemetry-sdk - the concrete implementations and project-specific integrations
maykin-common uses the SDK package to provide an opinionated, pre-configured,
ready-to-use setup. You should not need to override this.
If/when we define metrics in other modules of maykin-common, you can only use the API
exposed from opentelemetry-api. Usage of the opentelemetry-sdk package is
restricted to the maykin_common.otel module.
There are additional contrib packages with library/framework specific instrumentation,
like the opentelemetry-instrumentation-django package. This has all been
pre-configured in maykin-common.
The examples in the Python SDK documentation are a useful reference.
Defining metrics
By default, the wsgi instrumentation (set up via the Django instrumentation) captures spans of request/response cycles. It also captures request metrics, like the duration, annotated with context like the path, method etc.
Application developers can provide a lot of extra value by defining and tracking their application-specific metrics, because they have the context of the application and know which data/information is interesting.
Defining and using a metric is pretty straightforward:
# in metrics.py
from opentelemetry import metrics

meter = metrics.get_meter("my_awesome_project.my_module")

export_counter = meter.create_counter(
    "exports",
    description="The number of exports triggered by users",
)


# in views.py
from .metrics import export_counter


def export(request, pk: int):
    export_counter.add(1, {"pk": pk, "user": request.user.username})
    # _create_export is the project's own export logic (not shown)
    return _create_export(pk=pk)
Warning
Resist the temptation to use __name__ for the meter definitions! See
opentelemetry.sdk.metrics.MeterProvider.get_meter()
Note
Other packages that we maintain can also opt in to defining and tracking metrics in the future.
Tracing
maykin_common.otel.setup_otel() calls opentelemetry instrumentors which automatically
add traces for Django requests, Redis and PostgreSQL queries, Celery tasks and external HTTP
requests performed with the requests library.
It’s also possible to add manual spans for any part of the code.
Open Telemetry provides the tracer.start_as_current_span context manager for this:
from opentelemetry import trace

tracer = trace.get_tracer("my_awesome_project.my_module")


def do_work():
    print("doing work outside of the span")

    with tracer.start_as_current_span("span-name") as span:
        print("doing some work that the span will track")

    print("doing more work outside of the span")
It’s also possible to use tracer.start_as_current_span as a decorator:
from opentelemetry import trace

tracer = trace.get_tracer("my_awesome_project.my_module")


@tracer.start_as_current_span("span-name")
def do_work():
    print("doing some work that the span will track")
The Open Telemetry documentation provides more examples of how to create spans.
Best practices
Service name vs. deployment environment
Don’t put the deployment target (prod, acc, test…) in the service name, as that leads
to higher-cardinality labels, which has a negative impact on storage and query
performance. Instead, make sure to properly define the ENVIRONMENT Django setting,
which is also used by our Sentry SDK initialisation.
Use different service names for different logical units
The Django application (deployed with uwsgi, for example) is a different logical unit than the celery worker processing background tasks. In fact, even different task queues (e.g. high/low prio) are different units, and deserve their own easy-to-identify service name.
Tip
Define OTEL_SERVICE_NAME as an environment variable in the entrypoint shell
scripts like bin/docker_start.sh and bin/celery_worker.sh:
QUEUE=${CELERY_WORKER_QUEUE:=celery}
WORKER_NAME=${CELERY_WORKER_NAME:="${QUEUE}"@%n}
# Set defaults for OTEL
: "${OTEL_SERVICE_NAME:=my-project-worker-"${QUEUE}"}"
Suggested names to encourage consistency:
my-project - the Django project that responds to HTTP requests
my-project-worker-celery, my-project-worker-highprio - each (dedicated) Celery worker queue. If you have different queues set up, each one is typically its own service
my-project-flower - the Celery monitoring service
my-project-scheduler - the Celery beat task scheduler
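For entrypoints written in Python rather than shell, the same defaulting logic can be sketched as below. The helper name default_service_name is hypothetical and not part of maykin-common; it only mirrors the naming scheme suggested above for Celery workers:

```python
import os
from typing import Mapping


def default_service_name(
    project: str, environ: Mapping[str, str] = os.environ
) -> str:
    """Return the worker service name; an explicit OTEL_SERVICE_NAME wins."""
    explicit = environ.get("OTEL_SERVICE_NAME")
    if explicit:
        return explicit
    # Like ${CELERY_WORKER_QUEUE:=celery}: fall back when unset or empty.
    queue = environ.get("CELERY_WORKER_QUEUE") or "celery"
    return f"{project}-worker-{queue}"


print(default_service_name("my-project", {"CELERY_WORKER_QUEUE": "highprio"}))
# prints: my-project-worker-highprio
```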
Extract resource attributes for containers
Usually our applications are deployed in one of two ways:
on Kubernetes
on (virtual) servers with Docker engine
For the Docker engine case, we can extract additional resource attributes by setting
_OTEL_ENABLE_CONTAINER_RESOURCE_DETECTOR=true. Don’t do this on Kubernetes, as it
may lead to conflicting information.
On Kubernetes, the recommendation is to enable the k8sattributeprocessor when deploying the Collector.
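If the same image is deployed in both ways, the choice can be made at startup. The sketch below assumes that the KUBERNETES_SERVICE_HOST variable, which Kubernetes injects into every pod, is an acceptable signal for running on Kubernetes. The helper name container_detector_env is hypothetical and not part of maykin-common:

```python
import os
from typing import Mapping


def container_detector_env(
    environ: Mapping[str, str] = os.environ,
) -> dict[str, str]:
    """Return the extra environment to set for the container detector."""
    # Assumption: this variable is only present inside Kubernetes pods.
    if "KUBERNETES_SERVICE_HOST" in environ:
        # On Kubernetes, leave resource detection to the Collector.
        return {}
    return {"_OTEL_ENABLE_CONTAINER_RESOURCE_DETECTOR": "true"}


# The returned mapping must be applied before the OTel setup runs, e.g.:
# os.environ.update(container_detector_env())
```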
Authentication
The Collector may be API key or username/password protected. In that case, you can pass additional headers via the standardized environment variable:
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <base64-username:password>"
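The base64 part of a Basic authentication header can be derived with the standard library. This is a minimal sketch; the function name otlp_basic_auth_header and the credentials are made up for illustration:

```python
import base64


def otlp_basic_auth_header(username: str, password: str) -> str:
    """Build a value for OTEL_EXPORTER_OTLP_HEADERS using HTTP Basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Authorization=Basic {token}"


print(otlp_basic_auth_header("user", "pass"))
# prints: Authorization=Basic dXNlcjpwYXNz
```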
Architecture
The essence is simple: instrumented services produce telemetry data that gets exported to a telemetry receiver, which ensures the data gets persisted. Visualisation and monitoring tooling queries the telemetry data, which makes the services observable and provides (automated) alerting options.
We have made some decisions at the library level that correspond to the following diagram:
                                                    +----------------+
                                                    | metrics time   |
                                                    | series storage | >---+
                                                    +----------------+     |
+-----------+  telemetry                              ^                    | pull/query
| Service A |-------------+                         /                      |
+-----------+             |                       /                        |
                          |   +----------------+                           |
                          +-> |                |     +---------------+     |   +------------+
                              | OTel Collector |---> | spans storage | >---+---| Dashboards |
                          +-> |                |     +---------------+     |   +------------+
                          |   +----------------+                           |
+-----------+  telemetry  |                       \                        |
| Service B |-------------+                         \                      |
+-----------+                                         v                    |
                                                  +--------------+         |
                                                  | logs storage | >-------+
                                                  +--------------+
Services
The services are the applications producing telemetry data. They can be different
projects that each depend on maykin_common[otel], but they can also be different
aspects of the same project - see the Best practices about different service
names.
maykin_common.otel.setup_otel() sets up the application so that the produced
telemetry data gets exported using OTLP, the Open Telemetry Protocol. Telemetry gets
pushed over gRPC or http/protobuf to an endpoint that can receive OTLP data.
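The transport and endpoint are normally selected through the standard Open Telemetry environment variables. The sketch below assumes that setup_otel() leaves these to the SDK's usual OTEL_EXPORTER_OTLP_* variables and that the receiver listens on the Collector's default ports (4317 for gRPC, 4318 for HTTP). The helper name otlp_exporter_env and the host name are hypothetical:

```python
# Default OTLP receiver ports of the Open Telemetry Collector.
DEFAULT_PORTS = {"grpc": 4317, "http/protobuf": 4318}


def otlp_exporter_env(collector_host: str, protocol: str = "grpc") -> dict[str, str]:
    """Return the environment variables that point a service at a receiver."""
    if protocol not in DEFAULT_PORTS:
        raise ValueError(f"Unsupported OTLP protocol: {protocol!r}")
    port = DEFAULT_PORTS[protocol]
    return {
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{collector_host}:{port}",
    }


print(otlp_exporter_env("otel-collector.example", "http/protobuf"))
```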
Open Telemetry receiver
The receivers are applications deployed/running somewhere that can accept telemetry data in the OTLP format. They receive the telemetry from the services.
Open Telemetry Collector is vendor-agnostic software that can receive, process and export telemetry data. It does not have storage of its own, but instead exports the telemetry data according to configuration parameters.
The collector is not a hard requirement - many storage backends support ingesting OTLP data directly, but having a centralised collector is very convenient and simplifies the service configuration.
Storage
The storage backends are applications that can receive and persist the telemetry data.
Typically, you can configure retention periods, and they use databases optimized for the nature of the telemetry data. They’re usually also the applications that expose a query interface for the visualization tooling.
Different vendors typically compete with each other at this level. Some well-known examples are:
Prometheus, InfluxDB, Datadog, Splunk for time-series data (typically metrics)
Loki, Signoz, Logtail, Datadog, Splunk for logs
Jaeger, Elastic APM, Tempo, Datadog, Splunk for distributed traces
Commercial offerings typically provide an all-in-one solution for all types of telemetry.
Dashboards/visualisation/alerting
Software like Grafana and Kibana specialize in querying and displaying observability data. Typically you can define dashboards with visualisations to explore the data that was ingested.
This is typically done by defining queries (in PromQL for Prometheus, LogQL for
Loki etc.) which filter on labels of telemetry data (e.g. show only metrics from
production and exclude test/acceptance environments) and may even combine different
metrics, ultimately leading to easy-to-understand graphs that show what the state of
the system is/was.
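Conceptually, such a query selects samples by label and aggregates what remains. The sketch below imitates that on an in-memory list. It is not how a storage backend is implemented, and the sample data, label names and the sum_by_service helper are invented for illustration:

```python
from collections import defaultdict

# Each sample is a (labels, value) pair, e.g. a request count.
samples = [
    ({"service": "my-project", "environment": "production"}, 120),
    ({"service": "my-project", "environment": "test"}, 7),
    ({"service": "my-project-worker-celery", "environment": "production"}, 30),
    ({"service": "my-project", "environment": "production"}, 80),
]


def sum_by_service(samples, **label_filters):
    """Sum the values of all samples matching the filters, per service."""
    totals = defaultdict(int)
    for labels, value in samples:
        if all(labels.get(key) == wanted for key, wanted in label_filters.items()):
            totals[labels["service"]] += value
    return dict(totals)


print(sum_by_service(samples, environment="production"))
# prints: {'my-project': 200, 'my-project-worker-celery': 30}
```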
Troubleshooting
Combining all this with pre-forking application servers like uwsgi and gunicorn is a
challenge. Some issues were encountered and the code has been adapted for use with
uwsgi, but we can’t guarantee that all uwsgi configuration options will work out
of the box.
--py-call-uwsgi-fork-hooks has been observed to cause segfaults, even though this is recommended/required by the Sentry SDK (which only uses it for its telemetry features, so we think it can be ignored)
--lazy-apps has been observed to result in the OTel setup not being executed. It’s possible that the @postfork hook is mutually exclusive with --lazy-apps.
Calling an instrumentor (SomeInstrumentor().instrument()) in the postfork hook can lead to no metrics being collected at all, which looks like an exporter problem.