
Monitoring the OpenTelemetry Demo with Dynatrace Dashboards

Dynatrace Dashboards provide a clear view of the health of the OpenTelemetry Demo application by utilizing data from the OpenTelemetry collector. With these dashboards, you can monitor your application's usage and performance and identify potential issues like increasing failure rates. Learn how to use the Dynatrace Query Language (DQL) to investigate and pinpoint bottlenecks within your application's distributed traces.

Set up the Demo

To run this demo yourself, you'll need the following:

  • A Dynatrace tenant. If you don't have one, you can use a trial account.
  • A Dynatrace API token with the following permissions:
    • Ingest OpenTelemetry traces (openTelemetryTrace.ingest)
    • Ingest metrics (metrics.ingest)
    • Ingest logs (logs.ingest)

To set up the token, see Dynatrace API - Tokens and authentication in Dynatrace documentation.

  • A Kubernetes cluster (we recommend using kind; see the example after this list)
  • Helm, to install the demo on your Kubernetes cluster.
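
If you don't already have a cluster available, a minimal setup with kind could look like the following sketch (the cluster name otel-demo is arbitrary; any conformant cluster works):

# Create a local Kubernetes cluster for the demo
kind create cluster --name otel-demo

# Verify that kubectl points at the new cluster
kubectl cluster-info --context kind-otel-demo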

Once your Kubernetes cluster is up and running, the first step is to create a secret containing the Dynatrace API token. This will be used by the OpenTelemetry collector to send data to your Dynatrace tenant. The secret can be created using the following command:

API_TOKEN=""
DT_ENDPOINT=https://.dynatrace.com/api/v2/otlp

kubectl create secret generic dynatrace --from-literal=API_TOKEN=${API_TOKEN} --from-literal=DT_ENDPOINT=${DT_ENDPOINT}
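
To double-check that both keys ended up in the secret before moving on, you can inspect it:

# Show the secret's keys and data sizes (values remain hidden)
kubectl describe secret dynatrace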

After successfully creating the secret, the OpenTelemetry demo application can be installed using Helm. First, download the Helm values file from the Dynatrace snippets repo on GitHub.

This file configures the collector to send data to Dynatrace using the API token in the secret you created earlier. Then, use the following commands to install the Demo application on your cluster:

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install my-otel-demo open-telemetry/opentelemetry-demo --values otel-demo-helm-values.yaml

Once the helm install command completes, the application will eventually be up and running, and the OpenTelemetry collector will begin sending data to your Dynatrace tenant.
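
To confirm that the rollout has finished, you can watch the pods of the release (my-otel-demo, installed into the current namespace by the commands above):

# Watch the demo pods until they are all Running and Ready
kubectl get pods -w

# Show the status of the Helm release
helm status my-otel-demo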

Install the dashboard

In your Dynatrace tenant, navigate to Dashboards.

On the Dashboards page, you can import a dashboard configuration from a JSON file using the Upload button. To install the OpenTelemetry Demo application dashboard, upload the JSON file, which can be downloaded here.

Once the dashboard is imported, you'll see several charts representing the application's overall health.

The Service Level Monitoring section contains the following charts:

  • Top Spans: An overview of the most frequent spans ingested into Dynatrace.
  • Response Time Per Service: An overview of the response times for each service within the demo application.
  • Error Rate per Span: An overview of how many errored spans are generated per service.
  • Failed Spans over Time: A time series of how each service's error rate increases/decreases over time.
  • P95 Response time over Time: A time series of how each service's response time develops.
  • Errored Spans with Logs: A table that joins errored spans with related log entries.

These charts give you a quick overview of overall application health and let you identify any services that are currently not behaving as expected. In combination with the time series charts, this helps you determine the point in time at which a service started to cause problems.

In addition to service-level monitoring, certain services within the OpenTelemetry demo application expose process-level metrics, such as CPU and memory consumption, number of threads, or heap size for services written in different languages.

Note that the developers of the respective services need to make these metrics available, for example by exposing a Prometheus endpoint that the OpenTelemetry collector can scrape and forward to your Dynatrace tenant. Once the data is available in Dynatrace, DQL makes it easy to retrieve and visualize it on a dashboard.

Troubleshoot problems using the dashboard

Now, let's see how the dashboard can help you spot problems and find their root cause. For this purpose, we'll use the built-in failure scenarios of the OpenTelemetry demo. To enable one, update the my-otel-demo-flagd-config ConfigMap, which contains the application's feature flags. Among the flags defined there is the productCatalogFailure flag, whose defaultVariant you need to change from off to on, as shown below. After a couple of minutes, the effects of this change will be noticeable in the service-level metrics as the failed spans start to increase.
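
A quick way to flip the flag is to edit the ConfigMap in place. The ConfigMap and flag names below come from the demo's default Helm release; the exact JSON structure of the flag definition may vary between demo versions:

# Open the feature-flag ConfigMap in your default editor
kubectl edit configmap my-otel-demo-flagd-config

# In the definition of the productCatalogFailure flag, change
#   "defaultVariant": "off"
# to
#   "defaultVariant": "on"
# then save and exit; the change takes a minute or two to propagate to flagd.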

Also, in the Errored Spans with Logs table, you'll notice many entries that seem to be related to the retrieval of products, as indicated in the related log messages. Since all requests from the load generator go through the frontend service, most logs related to failed spans originate there. To pinpoint exactly where those requests are failing, use the trace.id field that is included in each table entry. Select a value in this column to open the related distributed trace in the Dynatrace web UI.

Within the Distributed traces view, you get an overview of which services are involved in the errored trace and which of the child spans of the trace caused errors.

Here, notice that the error seems to be caused by the product service, particularly instances of the GetProduct call. Select the failed span to go to a detailed overview of the failed GetProduct request, including all attributes attached to the span, as well as a status description.

Here, you see that the status message indicates that the failures are related to the feature flag we changed earlier. However, not all GetProduct spans are failing; only some are. Therefore, we need to investigate further by adding a specialized tile to our dashboard to evaluate whether the product ID impacts the error rate. For this, we use the following DQL query, which fetches all spans generated by the product service with the name oteldemo.ProductCatalogService/GetProduct and summarizes the number of errored spans by product ID.
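
The query from the original dashboard tile isn't reproduced here, but based on the description above it could look roughly like this sketch. The attribute names used for filtering and grouping (in particular the failure indicator request.is_failed and the product-ID attribute app.product.id) are assumptions and may need to be adapted to how the span data arrives in your environment:

// Fetch GetProduct spans and count failed ones per product ID
fetch spans
| filter span.name == "oteldemo.ProductCatalogService/GetProduct"
// assumption: failed spans are flagged via request.is_failed
| filter request.is_failed == true
// assumption: the demo exposes the product ID as app.product.id
| summarize failedSpans = count(), by: { app.product.id }
| sort failedSpans desc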

This query confirms the suspicion that a particular product is at fault: all the errors seem to be caused by requests for a specific product ID, possibly pointing to a faulty entry in the product database.

Of course, this example is somewhat easy to troubleshoot as it's based on a built-in failure scenario. Still, it should give you an impression of how DQL enables you to investigate problems by analyzing how specific attributes attached to spans might affect the outcome of requests sent to a faulty service.

Conclusion

In this article, you've seen how to leverage the flexibility of Dynatrace Dashboards to visualize data coming from the OpenTelemetry collector and get an overview of your application's health. While service level metrics, such as response time, error rate, and throughput, are available as soon as your application exports traces via the OpenTelemetry collector, there is also the potential to obtain additional process level metrics (for example, CPU and memory consumption), provided that the services within your application also send those to the OpenTelemetry collector.

You've also seen how to use the distributed traces view in the Dynatrace web UI and the power of DQL to pinpoint the root cause of unexpected problems.

Ready to leverage the flexibility of Dynatrace Dashboards to visualize data coming from the OpenTelemetry collector and get an overview of your application's health?