Datadog Inc.

09/05/2024 | News release | Distributed by Public on 09/05/2024 09:14

Monitor Oracle Cloud Infrastructure with Datadog

Oracle Cloud Infrastructure (OCI) provides cloud infrastructure and platform services designed to support a broad spectrum of cloud strategies and workloads. OCI offers enterprise customers scale-up architectures, ultra-low-latency networks, and more to help them migrate legacy workloads to the cloud, while supporting cloud-native applications through an expansive network of cloud partners and services.

Datadog's OCI integration enables you to gain full visibility into your OCI environment. Using our out-of-the-box (OOTB) dashboards, you can visualize a high-level overview of your infrastructure and applications, as well as gain granular insights into over 20 major OCI services you depend on, such as Oracle Database, OCI Compute, and Service Gateway.

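In this blog post, we'll discuss how to use our integration to gain full visibility into your OCI environment, track OCI GPU health and performance to optimize AI workloads, and monitor Oracle databases with system metrics.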

Gain full visibility into your Oracle Cloud Infrastructure environment

Datadog's OCI integration enables you to monitor your entire OCI stack within a single platform alongside other third-party technologies within your environment. After installing the integration and configuring OCI for metrics collection, Datadog will begin collecting metrics from your OCI services in minutes and populate our OOTB dashboards to assist your investigations. For instance, our overview dashboard delivers a top-down view into your OCI environment so you can gain quick insights into metrics such as the total bytes traveling in and out of your network, average database execution time, and GPU performance. These metrics can serve as overall performance indicators or highlight glaring issues.

If your enterprise organization is looking to shift on-prem Oracle infrastructure to the cloud, you might be wondering how to maintain visibility into your workloads during your migration and create frictionless monitoring workflows for the future. Datadog enables you to monitor on-prem Oracle applications, middleware, and databases alongside newly adopted OCI cloud services within a single platform. Using Datadog Host Map (included in Datadog Infrastructure Monitoring), you can visualize the health and resource utilization of your entire infrastructure. You can easily filter for self-managed (or Oracle-managed) on-prem hosts or OCI cloud compute instances to monitor the different parts of your infrastructure.

Similarly, if you're running a shared Oracle RAC database on-prem but planning to migrate workloads to OCI databases to reduce management costs and complexity, you can monitor both on-prem Oracle databases and OCI database services such as Autonomous Database within Datadog. Our OCI integration helps you visualize business-critical metrics for these database services, while Datadog Database Monitoring (DBM) gives you query-level visibility into your managed Oracle databases to help you troubleshoot long-running queries.

This unified view also applies to multi-cloud strategies. For example, your organization may primarily rely on Azure to host cloud applications, but these applications rely on Oracle Autonomous Database for automated data processing. Using our Azure integrations for container applications and web applications, you can easily monitor your Azure-based application performance while troubleshooting your OCI database in DBM.

Track OCI GPU health and performance to optimize AI workloads

With the rapid development of AI and LLM technology, organizations invested in these product areas are using OCI superclusters to deploy and scale machine learning workloads. These workloads can be very expensive. To ensure efficient resource usage and control growing cloud spend, you'll need to monitor the GPU performance of your OCI Compute instances. Datadog's OCI integration provides an in-depth look into OCI Compute metrics and the health of the underlying GPU infrastructure.

The first metric you'll want to track is GPU utilization. If you observe low utilization when running your workloads, you're likely overspending and can afford to decrease the number of GPUs on your instances without affecting performance. Conversely, high utilization can result in throttling and service slowdowns, which may require additional GPUs to resolve.
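As a rough illustration of the right-sizing reasoning above, the sketch below classifies a window of GPU utilization samples. The function name and the 30% and 90% thresholds are hypothetical values chosen for this example; they are not defaults from Datadog or OCI and should be tuned for your own workloads.

```python
from statistics import mean

# Hypothetical thresholds for illustration only; tune them for your workloads.
LOW_UTILIZATION_PCT = 30.0
HIGH_UTILIZATION_PCT = 90.0


def classify_gpu_utilization(samples, low=LOW_UTILIZATION_PCT, high=HIGH_UTILIZATION_PCT):
    """Classify a window of GPU utilization samples (percent, 0-100).

    Returns "underutilized" when the average is below `low` (a hint that
    GPUs could be removed), "saturated" when it is above `high` (a hint
    that more GPUs may be needed), and "healthy" otherwise.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    average = mean(samples)
    if average < low:
        return "underutilized"
    if average > high:
        return "saturated"
    return "healthy"
```

In practice, the samples would come from the GPU utilization metric collected for your OCI Compute instances, averaged over a window long enough to smooth out short bursts.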

GPU power draw and GPU temperature are also good indicators of general performance. You'll want to ensure that these metrics are consistently above certain thresholds even when your instances are idle. Low power draw and temperatures can create instability and may also foreshadow future throttling. On the other hand, if your temperatures are too high, you'll likely reach a bottleneck before utilization spikes. This may result from high time complexity within your workloads. Using the Datadog Continuous Profiler, you can pinpoint resource-intensive methods and lines of code that need to be simplified.

If you're monitoring GPU performance to maximize the performance of your large language models (LLMs), you can correlate your OCI GPU metrics with operational performance metrics in Datadog LLM Observability. LLM Observability collects metrics such as error rate, call response time, and the average tokens per call, as well as end-to-end traces that detail each task executed before your model generates its final response. If you encounter low power draw or high GPU utilization, you can pivot to LLM Observability to investigate whether these issues are impacting your LLM applications.
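To make the idea of correlating GPU metrics with LLM performance concrete, here is a minimal sketch that computes the Pearson correlation between two equally sampled series, for example GPU utilization and call response time. The function and the sample values are hypothetical and only illustrate the arithmetic; this is not how Datadog computes correlations internally.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length and at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(variance_x * variance_y)


# Hypothetical samples taken at the same timestamps.
gpu_utilization_pct = [55.0, 70.0, 85.0, 95.0, 98.0]
response_time_ms = [120.0, 150.0, 210.0, 400.0, 520.0]
coefficient = pearson_correlation(gpu_utilization_pct, response_time_ms)
```

A coefficient close to 1 would suggest that response time rises with GPU utilization, which is a cue to inspect the corresponding traces rather than proof of causation.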

Monitor Oracle databases with system metrics

Datadog's OCI integration delivers OOTB system metrics for Oracle Base Database, RAC, and Autonomous Database. Using either our overview dashboard or service-specific dashboards, you can monitor metrics such as total remaining storage, execution time, and CPU utilization to determine whether your database is in good health. If your storage or CPU utilization is approaching maximum capacity, performance issues are likely to follow. These symptoms are often a result of traffic spikes or slow queries consuming a large amount of CPU, which you can then investigate in Datadog DBM.
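The capacity check described above can be sketched as a small function that flags a database sample when CPU or storage usage nears its limit. The `DatabaseSample` type, its field names, and the 85% and 90% limits are assumptions made for this example rather than values defined by Datadog or OCI.

```python
from dataclasses import dataclass


@dataclass
class DatabaseSample:
    """Hypothetical snapshot of database system metrics."""
    cpu_utilization_pct: float    # 0-100
    storage_used_gb: float
    storage_allocated_gb: float


def capacity_warnings(sample, cpu_limit_pct=85.0, storage_limit_pct=90.0):
    """Return the resources ("cpu", "storage") that are near capacity."""
    if sample.storage_allocated_gb <= 0:
        raise ValueError("storage_allocated_gb must be positive")
    storage_pct = 100.0 * sample.storage_used_gb / sample.storage_allocated_gb
    warnings = []
    if sample.cpu_utilization_pct >= cpu_limit_pct:
        warnings.append("cpu")
    if storage_pct >= storage_limit_pct:
        warnings.append("storage")
    return warnings


# Example: 92% CPU and 450 GB used of 500 GB allocated (90% of storage).
busy_database = DatabaseSample(92.0, 450.0, 500.0)
flagged = capacity_warnings(busy_database)
```

A non-empty result is a prompt to look at query-level data for the same time window, since the check itself says nothing about the cause.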

DBM gives you visibility into your normalized queries so you can determine which types of queries are affecting your database performance. Metrics such as the average number of wait groups can indicate that your database has insufficient cores to handle incoming workloads and would require you to scale up the size of your database instances. By selecting a query statement you'd like to investigate, you can view a detailed summary that includes additional query metrics, a history of its explain plans, and its users and hosts. To learn more about monitoring Oracle-managed databases, check out our dedicated blog post.

Start monitoring your Oracle environment with Datadog

Datadog's OCI integration enables you to monitor your OCI environment side by side with on-prem infrastructure and services from multiple cloud providers. You can view the full list of OOTB OCI services and metrics in our documentation. Additional OCI GPU metrics such as throughput, frame buffer, and row remap failures are available through our NVIDIA DCGM Exporter integration.

If you don't already have a Datadog account, sign up for a free 14-day trial today.