Contextual Vulnerability Management With Security Risk As Debt

Operationalizing a program that efficiently drives down meaningful security vulnerabilities is one of the most common, and often unsuccessful, challenges security organizations face. The scale of the problem can be immense, internal resistance to disruption is high, and the tools meant to alleviate this burden often simply shift the work to other (non-security) teams. The rigid, inflexible SLA-based approach often adopted by security organizations produces plenty of "security output as a proxy for progress."

However, meaningful risk reduction requires security organizations to take a flexible, contextualized approach. At DigitalOcean, we redesigned our vulnerability management program in 2022 to incorporate a concept of "security debt" and have seen success in driving meaningful risk reduction. Beyond the security program itself, other business units have adopted our approach and taken up this reporting model for their own metrics.

We're not the first to attempt this kind of approach. In 2021, Carta published an article about measuring and reporting security risk as credit card-like debt. We also drew a lot of inspiration from Twilio Segment's approach and recommend their presentations on the subject: "Democratizing Vulnerability Management" and "Embracing Risk Responsibly." The foundation of our program was built on these peer resources. We are publishing this article to similarly share what we believe is a better model for vulnerability management with the broader information security community.

We've written about how security practices that shift toil work from security teams onto product or engineering teams impede an organization's velocity to deliver for its customers more than they improve its security posture. We spoke at OWASP AppSec Global in 2023 about the need for security organizations to take an enablement approach to security programs. In that talk, we admitted to a revelation: we don't believe in SLAs for security vulnerabilities.

Security teams often focus on individual vulnerability tickets: this one has this SLA, and that one has that SLA. It doesn't matter how many tickets the dev team has overall, what the system they are responsible for does, or any other contextual factor. These security teams often rely on tactical, obstructionist security outputs instead of strategic, holistic security outcomes. It is not uncommon for many of these issues to land in the business's risk register for failing to be remediated within their SLA, a state often considered acceptable as long as an executive signs off on the risk register line item each year. Outputs as a proxy for progress. We like the reasoning behind Accepted Insecure Time (AIT) over the term "SLA," but we're still left with a general process that companies have been using for years, one that is not producing the desired outcomes and leaves every stakeholder involved wanting something better.

At DigitalOcean, we never took the risk register/exceptions deferral path. At one point in our history, we held all teams accountable for planning and fixing all reported security issues. We have a lot of security-conscious developers at DigitalOcean, which is great! However, this meant that when we reported a security issue, even if it was a lower-severity problem, developers would often jump to work out a fix.

While this is the type of behavior we'd like to encourage for high-severity issues, lower-severity issues were creating high levels of roadmap disruption, without a real justification. When there were higher severity issues that deserved immediate attention, we were sometimes met with frustration from product owners and engineering managers for yet another disruption. We were perceived to be crying wolf more often than justified, which hurt our ability to rally support around actual emergencies. Some lower-severity issues that were not immediately acted upon fell into the Jira ticket abyss, and we didn't have a good way to keep track of those outstanding issues and follow up with the appropriate teams.

Our old vulnerability management process could be described as follows:

As security triaged new vulnerabilities, they would apply contextual insight into how this vulnerability impacted DigitalOcean. Given our somewhat unique posture as a cloud provider, some vulnerabilities treated by the wider industry as lower severity were a big deal for us, while other issues considered critical had little to no impact on our platform. Once this context was appended to a ticket, security would reach out to the appropriate application team and inform them of the vulnerability and of security's SLA.

Then, the waiting game began. Many issues would be acted upon and completed within the first few days. However, others would fall through the cracks. Someone on the security team would be responsible for following up with the app team, tapping them on the metaphorical shoulder and asking, "Are we done yet? Is it fixed yet?"

Eventually, the security engineer's attention would be redirected to new, incoming issues or higher-priority tasks. When they next followed up with the app team, they might learn that the team had fixed the issues several weeks or months earlier and had simply not notified security. We could now close the vulnerability, but it meant our metrics over the prior period did not accurately reflect the true risk posture of the organization.

When we first began exploring a new vulnerability management model, we latched on to the idea of mirroring engineering platform health and thought of reporting on the Mean Time To Remediate (MTTR) for security issues. However, we quickly concluded this would not work for our goals. MTTR was not an actionable metric; it reflected historical behavior from lines of business. If a bad score was reported, there might not be many new issues left to work on, so the business would have no way to "fix" their score.

Alternatively, if we calculated MTTR over the last X months of behavior, a line of business that moved to fix a bunch of issues might not see an improvement to their score until the "bad" months had cycled out of the calculation. Finally, there was the question of vulnerability severity: if a line of business was presented with one High and multiple Low-severity problems, which should they focus on? The MTTR calculation might push a team to fix all the Lows first, since that improves their mean resolution time the most, whereas security would likely prefer they focus on the High issue first.
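
To illustrate with made-up numbers: imagine a team carrying one High that has been open 200 days and ten Lows that have each been open 10 days. Closing the ten Lows first immediately produces a reported MTTR of 10 days while the High keeps aging unaddressed; closing the High first drags the mean toward 200 days, even though that is exactly the ordering security wants. The metric rewards the wrong behavior.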

We decided to eschew SLA-based vulnerability management and set out to design a new program. We wanted to provide internal stakeholders with actionable, forward-looking data that let them self-serve the security work required of them, while giving roadmap owners the autonomy to determine when that work could fit into their teams' roadmaps.

We had three primary goals:

  1. Remediate important security issues faster than under an SLA-based approach

  2. Reduce the amount of timeline disruption this work creates for engineering teams

  3. Make engineering and product leaders, rather than the security team, the decision-makers for security work

We defined a new operating philosophy for the vulnerability management program: to enable the business to safely operate while resolving meaningful risks. This philosophy defined the following principles for our program:

  • Enable: Help the business accomplish its objectives; don't create roadblocks.

    • Critical security issues still demand that other work halt immediately so attention can be brought to their resolution. However, many issues can be integrated more smoothly into a team's roadmap within a few sprints without harming the organization's risk posture.
  • Meaningful: Drive impactful outcomes.

    • We will not track metrics such as "number of tickets opened or closed over a period of time" or other process outputs that serve as a proxy for measuring actual progress against our risk posture.
  • Safety: Fix actual issues in the organization.

    • Issues must be fixed at the same or a faster cadence than under the prior, SLA-based approach.
    • We must ensure the products the business delivers are safe for our customers, ourselves, and for the wider Internet community.

We ended up constructing a program that measures the "security debt" of different lines of business in the organization. Instead of individual ticket SLAs, tickets are assigned weights according to their severities, and the amount of time those tickets remain incomplete is calculated as a measure of security risk: an amount of debt those lines of business are carrying.

Our security debt approach has resulted in product owners across the business proactively self-servicing vulnerability remediation, without needing to coordinate with the security team. The metric is actionable and forward-looking, focused on the current state rather than historical values. Security is around to help answer questions about vulnerabilities or validate remediations as a partner to the developers working on the issue. However, security is no longer responsible for chasing individual teams or tickets. Product owners no longer need to seek approval from security for their remediation plans, unless we see a significant deviation from agreed-upon thresholds and feel we need to apply pressure.

How does this work?

The foundation of this idea is the concept of debt accrual for security issues. Twilio Segment defined a function to calculate an "error budget" for their program, inspired by SRE principles. We changed the names of the fields a bit but ultimately kept the same function.
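
As a rough sketch of that function's shape (the field and function names below are illustrative, not the exact names either program uses internally):

    def issue_debt(severity_weight, days_open, recommended_days):
        # Debt accrues for every day an issue stays open past its
        # recommended remediation window, and it can never go negative.
        return severity_weight * max(0, days_open - recommended_days)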

This debt metric equation translates to "you gain debt for every day the issue is unresolved past an expected remediation timeframe." The floor at zero signifies that debt can never be negative. We converted our prior SLA timelines to "remediation recommendations." Any team that fixed their issues within that timeframe did not incur any debt for the issue; that window could be considered our Accepted Insecure Time. Once an issue exceeds the timeframe for its severity level, it begins to accrue debt.

This means that debt accrues the fastest for the highest severity issues open the longest in our environment. This naturally draws roadmap owners' attention to working on the most important (as defined by security) issues, as those have the most meaningful impact on the debt metric. Teams who actively pay down their security issues within those timeframes will not see debt accumulate, rewarding this behavior without mandating it.
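
To make that concrete with purely illustrative numbers (not our actual weights or windows): suppose High issues carry a weight of 5 with a 30-day recommendation and Low issues a weight of 1 with a 90-day recommendation. A High open for 40 days accrues 5 × (40 − 30) = 50 debt, while a Low open for 120 days accrues only 1 × (120 − 90) = 30, so the single High dominates the team's total even though the Low has been open three times as long.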

We then worked with each line of business to establish an appropriate debt threshold, or error budget, for that business area. Instead of a "one size fits all" approach, we factored business criticality and importance into how much security debt is appropriate for each area of the business to carry. We then rolled that data up to present each team's debt against their line of business's debt threshold, and we primarily report on what percentage of the business is adhering to their respective thresholds.
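
As a minimal sketch of how that roll-up might be computed, with hypothetical data shapes and names (our internal tooling differs, but the arithmetic is the same idea):

    from collections import defaultdict

    def threshold_adherence(issues, lob_thresholds):
        # issues: iterable of dicts with keys "owner", "line_of_business",
        #         "severity_weight", "days_open", and "recommended_days"
        # lob_thresholds: agreed debt threshold per line of business
        debt_by_owner = defaultdict(float)
        lob_by_owner = {}
        for issue in issues:
            # Same per-issue debt formula as the earlier sketch.
            debt = issue["severity_weight"] * max(0, issue["days_open"] - issue["recommended_days"])
            debt_by_owner[issue["owner"]] += debt
            lob_by_owner[issue["owner"]] = issue["line_of_business"]
        if not debt_by_owner:
            return 1.0  # no open issues at all: everyone is within threshold
        within = sum(
            1 for owner, total in debt_by_owner.items()
            if total <= lob_thresholds[lob_by_owner[owner]]
        )
        # Top-line metric: fraction of roadmap owners at or under their threshold.
        # (Owners with no open issues would also count as adherent; omitted for brevity.)
        return within / len(debt_by_owner)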

As an example, this chart with simulated data is what a line of business might see. Each line of business has its own bar chart on our top-level dashboard, and each roadmap-owning group inside that line of business has its respective bar on the chart. This line of business's debt threshold is clearly defined and is contextual to this part of the business and the sensitivity of the products delivered by this area. Each roadmap owner can assess their team against their line of business's threshold.

This chart informs an engineering leader that one of their teams has a lot of debt and needs to prioritize security work now, while their other two teams have some security work to plan but are comfortably below the threshold and can work those into upcoming sprints.

There are several lines of business across our organization, and each is made up of a number of "roadmap owners": teams overseeing our individual products, internal services, or other components. Our top-line metric is overall adherence to the customized debt thresholds for each line of business, or, put another way, how many of the roadmap owners are adequately managing their security debt.

Meanwhile, engineering and project managers can drill into any bar chart to see what specific security items contribute to their score. As part of the data displayed, they can view how each security issue contributes to their security debt, alongside the severity assigned to each issue.

Higher-severity issues accumulate debt much more rapidly than low-severity issues, so this data focuses teams' attention on the highest-severity, longest-standing issues in the environment. A team can look at the data below and clearly see they should prioritize the second ticket on the list, as its debt is higher than the first's. What previously required one or more video calls with members of the security team is now fully self-service through the dashboard we provide to the organization.
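
A simplified version of that drill-down might look something like this, using the same illustrative parameters as above (a High weight of 5 with a 30-day window, a hypothetical Medium weight of 2 with a 60-day window, and invented ticket IDs):

    Ticket     Severity   Days open   Debt
    SEC-1042   Medium     75          30
    SEC-0988   High       45          75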

We set our initial service level objective (SLO) on debt threshold adherence to 75% when rolling out the program. After two weeks, we hit 77% adherence and raised the SLO to 80%. The goal is to bring the company to mid-90% adherence, as the debt threshold already captures the nuanced contention between security and product priorities. The threshold is the agreed-upon compromise between the line of business and the security org regarding how much flexibility exists in scheduling security work. If a product acceleration, critical initiative, security incident, or other factor changes the top-level calculus about that flexibility, we can raise or lower the debt threshold for the respective part of the organization.

This grants us nuanced political levers and surfaces discussion topics for our partners across the organization:

  • What factors should be considered when determining appropriate debt thresholds for different lines of business or systems?

  • How does team training and expertise in security impact the acceptable level of security debt?

  • How should recent operational challenges influence our approach to managing security debt?

For example, if a system has recently hit a period of frequent availability issues, does that suggest we should be less confident in that line of business's ability to manage security debt, and should therefore lower our debt threshold? Or should we raise it, to acknowledge the increased need to balance operational fixes with security fixes?

There is no one right answer to these questions. Still, this program allows us to engage our stakeholders across the business on these topics productively, rather than implementing esoteric mandates from on high in the security tower.

The business embraced the idea of security debt in a way it never embraced opaque SLAs lacking priority context. This was partly due to focused change management efforts, but also because business leaders intuitively understood balancing debt alongside growth. That familiarity provided a foundation of shared terminology to help drive attention toward the security issues we wanted resolved.

"The concept of security debt allows me to have a clear and action-oriented conversation with the executive leadership team and Board of Directors around our security posture. Everyone in the room innately understands the concept of debt, debt accrual, and resolving debt."

- Tyler Healy, Chief Information Security Officer (CISO)

However, a crucial element of this program rollout's success was our emphasis on influencing business and product leaders to adopt the new approach rather than driving a new mandate. We began designing the new process in Q3 2022 and spent the next two quarters driving conversations with each team we anticipated would own a debt score. We didn't just jump into the new idea. Instead, we started with a listening tour to understand existing pain points and provide a venue for those leaders to be heard. We asked key stakeholders across the product and engineering organizations: "What friction do you have with vulnerability management today? What would you like to accomplish?"

After actively listening to that feedback, we pitched the new idea. This allowed us to draw the contrast with the previous approach and let business leaders feel it intuitively. We gathered their feedback on the new approach, asking questions such as: "Does this approach solve the challenges you described? What concerns would you have with the new approach?"

We sought this feedback from teams with high and low debt, ensuring this system made sense to all and drove attention to the most meaningful work for security.

"Security debt inside DigitalOcean is a solid operational program to have inside any technical organization. It calls attention to important areas needing focused improvement inside products or services, while also leaving room via agency within technical teams to prioritize when, where, and how much improvements get made. It is a good balance of opportunity and accountability."

- Nick Silkey, Senior Director of Engineering

This change management and feedback loop was critically important to the success of launching the program, not only for the owning teams, but for us in security as we matured our program design. The questions asked of us prompted ideas we had not considered and allowed us to deliver a more complete program. Security is a collaborative effort, and this was a fantastic example of that across DigitalOcean.

As this new approach rolled out, we had some very vocal champions across the organization. Critically, these were stakeholders outside of security advocating for this approach. As we ran the program across a few months and people began interacting with the live data, we started seeing and hearing the most flattering things: other programs wanting to use the same approach. Soon, we were fielding questions about modifying the Security Debt approach for architectural or availability issues, even perhaps a "Product Debt" of highly requested customer features.

We view this as the greatest sign of success for this program: not only is it achieving the goals we set for our new vulnerability management program, and not only are engineering leaders proactively driving down security issues, but others across the business are looking for ways to migrate their reporting to follow our approach. The business recently changed the way uptime incidents are handled with the introduction of an Availability Debt score.

"The Security Debt system has been praised for its utility, as it allows teams the flexibility to prioritize work without imposing unrealistic pressure. It recognizes that technical debt is not a simple numerical threshold but a complex backlog of work that accumulates over time. The Security Debt system provides accountability without feeling imposing.

We view Security Debt as a solid foundation for Availability Debt. By addressing two forms of technical debt within a unified framework, we reduce mental overhead for everyone, making it easier for teams to adapt."

- Jes Olson, Engineering Manager

Ralph Waldo Emerson told us, "It's not the destination, it's the journey." We're publishing this piece two years after launching the program, now that it has been well received and integrated across DigitalOcean. We want to share it with the community because we believe this is a powerful way to communicate, measure, and empower engineering and product teams to take ownership of security outcomes while still driving accountability.

But we're not finished. We're constantly monitoring and improving the program, looking for additional efficiencies in how we operate. As those improvements land, we'll continue to share more data and lessons learned.