Splunk Inc.

07/04/2024 | News release | Distributed by Public on 07/04/2024 17:43

What Is Service Continuity Management

If you held a competition to identify a term that describes the state of the world, VUCA would probably emerge as one of the leading contenders. The acronym - short for volatility, uncertainty, complexity, and ambiguity - is a great description of how unpredictable life is, and how disruption can overturn the stability we have become used to.

Natural or manmade disasters can occur at any time, resulting in damage, loss, or impairment that prevents organizations from meeting their objectives and satisfying the needs of their stakeholders. Building resilience against such disruption is an essential competence that can both:

  • Preserve business value.
  • Ensure customers continue to enjoy services uninterrupted.

Enter: the service continuity management practice. As outlined in the ITIL® 4 framework, the practice of service continuity management helps to ensure a service provider's readiness to respond to all kinds of disruptive events that may impact core activities - and your credibility.

In this article, we will look at the concepts, processes, and measures that should be well understood by any student of service continuity.

What is service continuity management?

A framework for building organizational resilience, service continuity management helps an organization to:

  • Respond to disruption.
  • Ensure the availability and performance of services are maintained at sufficient levels.

Primarily a proactive measure, service continuity management is designed to prepare and organize the people, infrastructure, systems and resources required to predict and counter the negative effects resulting from a disaster.

(Related reading: business continuity vs. business resilience & how Splunk delivers business continuity, so you can go from disruption to resilience in no time.)

Key concepts in service continuity management

Disruptions come in many shapes and forms. An earthquake in Japan bringing down mobile communication services. A COVID-19 outbreak infecting air traffic control staff leading to cancellation of flights at London Gatwick Airport. An outage on the FAA's NOTAM system resulting in thousands of US flights being canceled or delayed.

No matter the source of a disruption or its magnitude, users and other stakeholders expect a service provider to continue providing services at acceptable predefined levels. Time is of the essence when it comes to recovery and resumption of operations, so the service provider is expected to put in place mechanisms to ensure the enterprise is ready to swiftly respond to any incident or disaster once it occurs.

Processes

Service continuity supports the overall business continuity management from the perspective of operational risks. The ISO 22301 standard for business continuity management systems outlines two main processes that serve as the basis for planning for service continuity:

  • Business Impact Analysis. Through a business impact analysis (BIA), the enterprise identifies activities that support the provision of products and services, and assesses the impacts over time of not performing these activities. Based on the analysis, prioritized timeframes for resuming these activities at specified minimum acceptable levels are determined. Finally, dependencies and supporting resources for these activities are identified.
  • Risk Assessment. Here, the enterprise identifies risks that could disrupt products and services and assesses them in terms of likelihood and impact. The risks evaluated as having the highest impact are prioritized, and mitigation strategies and plans are then agreed to reduce them to an acceptable level.
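
As a rough illustration of the risk assessment step, the sketch below rates each risk as likelihood multiplied by impact and sorts the register so the highest-rated risks are treated first. The 1-to-5 scales, the multiplication rule, the `Risk` class, and the example entries are all assumptions made for illustration; ISO 22301 does not prescribe a particular scoring scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)

    @property
    def score(self) -> int:
        # Simple likelihood x impact rating; other schemes are equally valid.
        return self.likelihood * self.impact


def prioritize(risks: list[Risk]) -> list[Risk]:
    """Return risks ordered from highest to lowest score (ties broken by impact)."""
    return sorted(risks, key=lambda r: (r.score, r.impact), reverse=True)


# Hypothetical risk register entries.
register = [
    Risk("Data center power failure", likelihood=2, impact=5),
    Risk("Ransomware attack", likelihood=3, impact=5),
    Risk("Key supplier outage", likelihood=3, impact=3),
]

for risk in prioritize(register):
    print(f"{risk.score:>2}  {risk.name}")
```

Breaking ties by impact keeps the ordering close to the impact-first prioritization described above.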

Continuity requirements

The information from these two processes informs the service continuity requirements, which are usually expressed as target timelines. These include:

Recovery Time Objective (RTO): The maximum period of time following a service disruption that can elapse before the lack of business functionality severely impacts the organization. This is the maximum agreed time within which a product or an activity must be resumed, or resources must be recovered.

Maximum Acceptable Outage (MAO): The time it would take for adverse impacts, which might arise as a result of not providing a product/service or performing an activity, to become unacceptable. The MAO is longer than the RTO by an amount which accounts for the organizational risk appetite.

Recovery Point Objective (RPO): The point to which the information that is used by an activity must be restored in order to enable the activity to operate effectively upon resumption. It is expressed as the length of time before the disruption for which loss of information is acceptable.
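
To make the three targets concrete, here is a minimal sketch that compares a single incident (or recovery test) against them. The `ContinuityTargets` class, the `evaluate_recovery` function, and all timestamps and target values are hypothetical and chosen only to show how the definitions relate to each other.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ContinuityTargets:
    rto: timedelta  # Recovery Time Objective
    rpo: timedelta  # Recovery Point Objective
    mao: timedelta  # Maximum Acceptable Outage

    def __post_init__(self) -> None:
        # The definitions above describe the MAO as longer than the RTO.
        if self.rto >= self.mao:
            raise ValueError("RTO must be shorter than MAO")


def evaluate_recovery(
    targets: ContinuityTargets,
    last_good_copy: datetime,
    disrupted_at: datetime,
    restored_at: datetime,
) -> dict[str, bool]:
    """Compare one incident (or test exercise) against the targets."""
    downtime = restored_at - disrupted_at
    data_loss_window = disrupted_at - last_good_copy
    return {
        "rto_met": downtime <= targets.rto,
        "rpo_met": data_loss_window <= targets.rpo,
        "within_mao": downtime <= targets.mao,
    }


# Hypothetical targets and timestamps for a single application.
targets = ContinuityTargets(
    rto=timedelta(hours=4), rpo=timedelta(minutes=15), mao=timedelta(hours=8)
)
result = evaluate_recovery(
    targets,
    last_good_copy=datetime(2024, 4, 7, 8, 50),
    disrupted_at=datetime(2024, 4, 7, 9, 0),
    restored_at=datetime(2024, 4, 7, 14, 0),
)
print(result)
```

With these made-up values the service is down for five hours, so the RTO is missed while the outage stays within the MAO, and ten minutes of information are lost, which is inside the RPO.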

(Figure: Service continuity requirements.)

Service continuity strategies

The continuity requirements inform the service continuity strategies.

Business stakeholders would prefer that their IT systems have the lowest possible RTO and RPO (e.g. under 10 seconds), but they should be well informed that faster recovery with low data loss requires additional resources and configurations. For example: maintaining a disaster recovery site with real-time replication of all information in a primary site or cloud can run into millions of dollars, depending on the continuity requirements.

Therefore, set your continuity targets on an application-by-application basis, since the targets chosen for each application correlate directly with operational complexity and implementation cost. (For instance, cloud providers such as AWS provide guidance on setting resilience policies, including RTO/RPO targets per application.)

Service continuity strategies should take both proactive and reactive postures that ensure that the enterprise's service delivery mechanisms are adequately protected, and mitigation mechanisms can respond to and manage impacts of disruptive events.

A strategy must be supported by at least one solution which includes approaches, arrangements, methods, procedures, treatments, and actions to be carried out to implement the strategy.

Service continuity example strategies

Examples of continuity strategies outlined within the Business Continuity Institute's Good Practice Guidelines include:

  • Diverse site: Deliver services in two or more geographically dispersed sites that are active full time. This strategy delivers a high degree of resilience (RTO in minutes or hours) but is also costly.
  • Replication: Replicate service delivery in a geographically dispersed site that is dormant. This strategy is less costly than the previous one and supports an RTO that is greater than a few hours and less than a day.
  • Standby Facilities: This involves use of a facility that is shut down but can be activated when a disruption occurs. The supported RTO is a day or more because of the effort needed to make the facility operational.
  • Subcontracting Work: This involves outsourcing service delivery to a third party, with arrangements made in advance. RTO can be less than a day if the required service delivery resources are kept near the third party's operational locations and can be quickly activated.
  • Post-incident Acquisition: This involves having a list of third parties who can be engaged after a disruption to provide facilities. This supports services whose RTO is measured in days or weeks. (Related reading: third-party risk management.)
  • Insurance: This strategy provides compensation for loss of service delivery assets following a disruption but does not cover the full cost that the enterprise incurs, including reputational loss, revenue drop, or regulatory penalties. Depending on impact, the RTO can be weeks or months.
  • Do Nothing: The organization takes a wait-and-see approach, reacting only after a disruption has taken place. Here, RTO is measured in months.
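
One way to use this list is to filter it by the RTO a service actually requires. The sketch below does that with a lookup table; the numeric cutoffs are assumed placeholders loosely based on the ranges above (the guidelines give ranges, not exact figures), and `FASTEST_SUPPORTED_RTO` and `candidate_strategies` are hypothetical names.

```python
from datetime import timedelta

# Assumed shortest RTO each strategy can support, fastest first.
# The figures are placeholders loosely based on the ranges listed above.
FASTEST_SUPPORTED_RTO = {
    "Diverse site": timedelta(minutes=15),
    "Replication": timedelta(hours=4),
    "Subcontracting work": timedelta(hours=12),
    "Standby facilities": timedelta(days=1),
    "Post-incident acquisition": timedelta(days=3),
    "Insurance": timedelta(weeks=2),
    "Do nothing": timedelta(days=30),
}


def candidate_strategies(required_rto: timedelta) -> list[str]:
    """Return the strategies whose fastest supported RTO meets the requirement."""
    return [
        name
        for name, fastest in FASTEST_SUPPORTED_RTO.items()
        if fastest <= required_rto
    ]


print(candidate_strategies(timedelta(hours=8)))
print(candidate_strategies(timedelta(days=2)))
```

The function only narrows the field to strategies that are fast enough; choosing among the candidates still depends on cost and the other factors discussed above.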

Service continuity plan

Once the enterprise has decided on the preferred service continuity strategies, the relevant operational teams document the service continuity plan. This plan contains the detailed guidance to:

  1. Respond to a disruption.
  2. Resume, recover and restore service delivery in line with continuity requirements.

The continuity plan facilitates timely warning and communication to relevant stakeholders, and provides the information required to effectively respond to a disruption. The ISO 22301 standard states that the contents of the plan should be specific, flexible, focused, effective in minimizing impact, and have clear assignment of roles and responsibilities.

According to the EU Agency for Cybersecurity (ENISA), there are four stages that are covered in an IT service continuity plan:

  • Stage 1 Initial response. This mainly involves damage assessment and invocation of the appropriate incident management teams.
  • Stage 2 Service recovery. Here, the incident management teams work to recover the IT service to an initial minimum acceptable level.
  • Stage 3 Service delivery in abnormal circumstances. This involves implementing temporary measures to provide a limited sustainable service before normal service resumes.
  • Stage 4 Normal service resumption. Finally, the service is returned to the usual state it was in before the disruption.
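
For teams that track recovery progress in tooling, the four stages can be modeled as an ordered enumeration. This is only a sketch: the `RecoveryStage` and `next_stage` names are hypothetical, and the strictly linear progression is a simplifying assumption (a real incident may revisit an earlier stage).

```python
from enum import IntEnum
from typing import Optional


class RecoveryStage(IntEnum):
    INITIAL_RESPONSE = 1
    SERVICE_RECOVERY = 2
    ABNORMAL_SERVICE_DELIVERY = 3
    NORMAL_SERVICE_RESUMPTION = 4


def next_stage(current: RecoveryStage) -> Optional[RecoveryStage]:
    """Return the stage that follows `current`, or None after the final stage."""
    if current is RecoveryStage.NORMAL_SERVICE_RESUMPTION:
        return None
    return RecoveryStage(current + 1)


# Walk through the stages in order, as a runbook tracker might.
stage: Optional[RecoveryStage] = RecoveryStage.INITIAL_RESPONSE
while stage is not None:
    print(f"Stage {stage.value}: {stage.name.replace('_', ' ').capitalize()}")
    stage = next_stage(stage)
```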

Best practices for service continuity plans

Information to include. The service continuity plan typically contains continuity requirements, IT architecture, roles and responsibilities, invocation and damage assessment procedures, communication approach, escalation matrices, recovery and fail-back procedures, test plans, contact details, dependencies, resources, and reporting requirements.

Regular review cycles. The service continuity plan should be tested and reviewed at least annually to ensure that it remains relevant in supporting the organization's continuity objectives.

Role-based training. Employees, contractors, and any other stakeholders who are directly involved in the delivery of services should be trained on the continuity plans based on their role-specific competence requirements.

Maintaining continual service delivery

Implementing and maintaining service continuity plans is a significant strategic investment for any enterprise that wants to demonstrate to its stakeholders that it is resilient and can be trusted to continue delivering services in the face of devastation. Solutions to mitigate unacceptable risks and single points of failure should be carefully chosen to ensure they meet the service continuity requirements while remaining cost-effective and practical, without introducing unnecessary complexity within the IT environment.

Service continuity management is not an easy undertaking and requires continued support across all management levels within the enterprise.