In modern, distributed applications, shared standards for performance and reliability are key to maintaining a healthy production environment and providing a dependable user experience. But establishing and maintaining these standards at scale can be a challenge: when you have hundreds or thousands of services overseen by a wide range of teams, there are no one-size-fits-all solutions. How do you determine effective best practices in such a complex environment? And how do you track whether or not services are consistently meeting your benchmarks throughout the ongoing development of your application? Monitoring is key, of course. But when you have hundreds or thousands of services, how do you ensure that each of them is effectively monitored in the first place? And what about before a service has been built? How do you enforce best practices for development, observability, and so on from the beginning of the software development life cycle?
With Scorecards, a feature of the Datadog Service Catalog, organizations can gauge everything from the performance and observability to the documentation and security of their services, guided by industry standards as well as custom rules, and provide actionable feedback to service owners on an ongoing basis.
In this post, we'll explore how Scorecards have helped SREs at Datadog provide robust guidelines to our service owners at scale, throughout the software development life cycle, by defining best practices in collaboration with a wide range of teams.
Before implementing Scorecards, Datadog relied on an entirely manual production-readiness review (PRR) process. This involved a member of our SRE team sitting down with each of our service owners to walk through a lengthy list of checks (related to everything from instrumentation and data security to standards for API and database usage) as they prepared for launch. Service owners had to make adjustments based on any checks that were not met, and SREs had to keep track of their work and comprehensively review the service again before it could be deployed.
This wasn't a scalable process. As Datadog grew and our services multiplied, so did our list of checks for production-readiness. Meanwhile, the SREs in charge of this process needed increasingly broad knowledge of many different specialized aspects of our platform. What's more, each of our services had to pass our PRR checks just once, and once it had, there was significant potential for it to stray from the standards enforced by PRR in the course of ongoing development. Another complicating factor was the evolution of these standards themselves: while many of our PRR checks covered perennial best practices (for instrumentation, security, and documentation, for example), others, such as those tied to the implementation of specific frameworks, could change.
With that in mind, we wanted to go beyond one-off reviews and augment the monitoring already implemented by service owners by continuously evaluating our services' adherence to best practices throughout the software development life cycle. That's where Scorecards came in.
Scorecards stemmed from the Datadog Service Catalog, which consolidates knowledge of an organization's services by providing information on their performance, reliability, and ownership in a central location. Scorecards were designed to enable any and all qualified stakeholders to provide guidelines for services and give actionable feedback to service owners.
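Each entry in the Service Catalog starts from a service definition that records this kind of metadata. As a rough illustration, the sketch below registers a definition programmatically; the endpoint path and field names are based on Datadog's public Service Definition API and its published v2.2 schema (worth verifying against the current docs), and the service name, team, contacts, and links are purely hypothetical.

```python
# A minimal sketch of registering a service definition in the Service Catalog.
# Endpoint path and field names follow the public Service Definition API and
# the published v2.2 schema; the service, team, contacts, and links are
# hypothetical placeholders.
import os

import requests

DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}

definition = {
    "schema-version": "v2.2",
    "dd-service": "shopping-cart",   # the service's name in the catalog
    "team": "checkout",              # owning team, used to route feedback
    "tier": "1",
    "lifecycle": "production",
    "contacts": [
        {"type": "slack", "name": "Checkout team", "contact": "https://example.slack.com/archives/C0123456"}
    ],
    "links": [
        {"name": "Runbook", "type": "runbook", "url": "https://example.com/runbooks/shopping-cart"}
    ],
}

resp = requests.post(
    f"https://api.{DD_SITE}/api/v2/services/definitions",
    headers=HEADERS,
    json=definition,
    timeout=10,
)
resp.raise_for_status()
print("Registered service:", definition["dd-service"])
```

Ownership metadata like the team and contact fields is what later lets Scorecard feedback and reports reach the right owners.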
The ability to define best practices in a collaborative and distributed way is key to scalability. At Datadog, we manage more than 8,000 discrete internal services. Given the scale and complexity of our systems, we inevitably have a lot of internal specialization. As a result, while our internal implementation of Scorecards is overseen by our SRE team, it has been a highly distributed effort: rules for different aspects of our services have been set in consultation with a wide range of stakeholders.
To roll out Scorecards internally, we first set out to identify which teams could benefit most from them, and how. We began by identifying the types of personas that would benefit most from the ability to provide guidelines for our services:
We also considered service owners' priorities at a high level:
And management's priorities for our services:
After identifying the types of stakeholders that would be involved and considering their priorities at a high level, we set to work identifying which specific teams at Datadog were best positioned to set policy around each facet of our PRR process. These teams became our initial rule providers, defining standards for our services within their respective areas of expertise. SREs spent time speaking with these rule providers (teams like Security and Infrastructure) to understand the standards important to them and establish the benchmarks services would need to meet in order to be considered production-ready.
Broadly speaking, our rules cover topics such as security, deployment practices, observability, chaos engineering, and documentation:
Delegating the definition of these rules to the teams that specialize in these areas has helped us ensure that the standards we use to evaluate our services are relevant, applicable, and aligned with industry standards. Enabling these teams to define custom rules via the Scorecards API lets them disseminate clear, actionable guidance directly to service owners on an ongoing basis.
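To make this concrete, here is a rough sketch of what registering a custom rule through the Scorecards API can look like. The endpoint path and payload fields follow the public Scorecards API reference (check the current docs before relying on them), and the scorecard name, rule name, and description are illustrative examples rather than rules we actually run.

```python
# A minimal sketch of defining a custom Scorecard rule over the Scorecards API.
# Endpoint path and payload fields follow the public API reference; the
# scorecard name, rule name, and description are illustrative only.
import os

import requests

DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}

rule = {
    "data": {
        "type": "scorecard_rule",
        "attributes": {
            "scorecard_name": "Observability",    # which scorecard this rule belongs to
            "name": "Service has an active SLO",  # short, action-oriented rule name
            "description": (
                "Every production service should define at least one SLO so that "
                "reliability targets are visible to on-call engineers. "
                "See the internal SLO guide for setup steps."
            ),
            "enabled": True,
        },
    }
}

resp = requests.post(
    f"https://api.{DD_SITE}/api/v2/scorecard/rules",
    headers=HEADERS,
    json=rule,
    timeout=10,
)
resp.raise_for_status()
print("Created rule:", resp.json())  # the returned rule ID is what evaluations refer to
```

Note the emphasis on a descriptive explanation: as discussed below, spelling out the "why" of each rule is central to adoption.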
To generate results, we use a central evaluation engine, owned by our SRE team, that continuously evaluates every rule against our services. This helps us ensure consistency and prevent redundant checks at scale. Surfacing these results alongside other key service information in the Service Catalog has helped us fit Scorecards into our teams' existing workflows.
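The sketch below shows the general shape of that loop: an evaluation job runs a check against a set of services and reports pass/fail outcomes, with remarks, in one batched call. The payload shape follows the public Scorecards API reference; the rule ID, the service names, and the check_slo_exists() helper are placeholders standing in for our actual evaluation engine, which is not shown here.

```python
# A minimal sketch of an evaluation job reporting pass/fail outcomes, with
# remarks, in one batched call. The payload shape follows the public Scorecards
# API reference; the rule ID, service names, and check_slo_exists() helper are
# placeholders, not our actual evaluation engine.
import os

import requests

DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}

RULE_ID = "rule-id-returned-at-creation"  # hypothetical ID of the SLO rule above


def check_slo_exists(service: str) -> bool:
    """Stand-in for the real check an evaluation engine would run."""
    return service != "legacy-batch-worker"


results = []
for service in ["shopping-cart", "checkout", "legacy-batch-worker"]:
    passed = check_slo_exists(service)
    results.append({
        "rule_id": RULE_ID,
        "service_name": service,
        "state": "pass" if passed else "fail",
        # The remarks field is where actionable next steps go.
        "remarks": "" if passed else "No SLO found; see the internal SLO guide to add one.",
    })

resp = requests.post(
    f"https://api.{DD_SITE}/api/v2/scorecard/outcomes/batch",
    headers=HEADERS,
    json={"data": {"type": "batched-outcome", "attributes": {"results": results}}},
    timeout=10,
)
resp.raise_for_status()
print(f"Reported {len(results)} outcomes")
```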
We also generate Scorecard reports that are sent to team Slack channels. These reports provide ongoing updates on how services are measuring up to expected standards, summarizing the highest- and lowest-scoring rules, services, and teams. Reports can be scoped to a specific team's services or cover every service defined in the Service Catalog.
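For a sense of what such a digest contains, here is a hypothetical sketch that computes a simple summary and posts it to a Slack channel through an incoming webhook. This is not how the built-in reports are generated (those are configured within Datadog); the scores and the webhook URL are invented for illustration only.

```python
# A hypothetical sketch of the kind of digest a Scorecard report contains,
# posted to a team Slack channel via an incoming webhook. The scores and the
# webhook URL are invented; Datadog's built-in reports are configured
# in-product rather than assembled by hand like this.
import requests

# Per-service rule results, e.g. aggregated from Scorecard outcomes.
scores = {
    "shopping-cart": {"passing": 14, "total": 15},
    "checkout": {"passing": 12, "total": 15},
    "legacy-batch-worker": {"passing": 6, "total": 15},
}

# Rank services by the fraction of rules they pass.
ranked = sorted(scores.items(), key=lambda kv: kv[1]["passing"] / kv[1]["total"])
lowest, highest = ranked[0], ranked[-1]

message = (
    "Weekly Scorecard report\n"
    f"Highest-scoring service: {highest[0]} "
    f"({highest[1]['passing']}/{highest[1]['total']} rules passing)\n"
    f"Lowest-scoring service: {lowest[0]} "
    f"({lowest[1]['passing']}/{lowest[1]['total']} rules passing)"
)

requests.post(
    "https://hooks.slack.com/services/T000/B000/XXXX",  # placeholder webhook URL
    json={"text": message},
    timeout=10,
)
```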
By using Scorecards to evaluate our services throughout the software development life cycle, we've been able to:
We've introduced Scorecards progressively in order to avoid overwhelming service owners and allow time for them to give us their feedback. As such, our rollout of Scorecards is ongoing: as a large organization, it will take us some time to reach all of our teams.
We know that adoption depends on a culture shift: we're not providing value if we're just adding to cognitive load. To promote adoption of Scorecards, we're careful to communicate the "why" of each rule and its outcomes. While ideally the value of our rules and the reasons for their passing or failing outcomes are self-evident, service owners have competing priorities, and it is incumbent on our SRE team to ensure that these details are clearly stated. Including detailed descriptions for our Scorecard rules is key to this.
Clarifying the reasoning behind each outcome and indicating actionable next steps in the remarks field of each rule is also key.
In our rule descriptions, we are particularly careful to emphasize the time savings that satisfying each rule can provide. For example, a rule recommending the adoption of a standard framework can spare service owners from having to satisfy an entire set of other rules on their own.
Some of our rules, such as those that improve security, are a base-level requirement for all of our services. But as we've already noted, when you have thousands of independent services, there can be no one-size-fits-all solutions. As such, we work with service owners to refine our rules and manage exceptions on an ongoing basis. Like any growing organization, we manage an evolving catalog of services, and this evolution necessitates evolving standards and processes for performance, reliability, observability, and security.
As our rollout of Scorecards continues, we're keeping an eye on a few metrics:
So far, all of these metrics are pointing in the right direction, and stakeholders across Datadog have attested to the effectiveness of Scorecards in communicating centralized, up-to-date guidance at scale as our systems evolve. Incorporating this guidance directly in the Service Catalog, which is already integral to our service owners' development workflows, has helped strengthen communication between (and save time for) many of our teams.
At Datadog, Scorecards have played a critical role in helping us establish and maintain important standards for security, reliability, and performance at scale. With the involvement of a growing number of teams, we are continuously evaluating our thousands of independent services according to both industry-defined best practices and fine-tuned custom rules for internal usage. This has helped us eliminate knowledge silos and provide our teams with robust, up-to-date guidelines throughout the software development life cycle. It has also helped SREs and company leadership assess our services' compliance with expected standards and best practices without relying on manually compiled reports. Meanwhile, as our rollout continues, SREs have been working closely with service owners and the Service Catalog team to help drive improvements to Scorecards for our customers.
You can learn more elsewhere on our blog about Scorecards and how teams throughout your organization can customize their own rules, and check out our documentation to get started. And if you're new to Datadog, you can sign up for a 14-day free trial.