Cultural Conservatism Meets Technological Revolution

The path to digitalisation, however rewarding the goal, has proven difficult for many enterprises. One thing that is certain, though, is that visibility into application performance is vital not only to an effective technology strategy but to business success.

The case study that follows outlines a series of events experienced by many companies and sets down some guideposts to help you understand the kind of digital service your customers are getting, respond rapidly to outages and brownouts, and, as a result, minimise the impact on revenue and brand.

Business users at a European insurance company were experiencing performance degradation across a number of platforms. In general, keyboard-to-eyeball response 'seemed to be slow', and reaction to complaints from internal users and customers alike took a long time. Part of the issue lay in legacy processes and procedures that failed to account for the pace of business and the levels of expectation characteristic of the modern digital economy. On Sundays, for example, critical applications were regularly shut down to implement planned changes while, at the same time, unplanned changes disrupted weekday activities all too frequently.

Interestingly, and unusually, this conservative approach to process and procedure was not matched by a conservative approach to the underlying technology. Cloud migration was beginning to take place, while newer applications were increasingly modular and built from ephemeral components, anticipating the opportunities for flexibility and continuous change that the cloud would eventually offer. Of course, this mismatch only made things more problematic from the perspective of the business and even began to impact release cycles at the technology level.

In a word, no one was happy with the way things were going.

Finding the Root Cause in a Booming, Buzzing Confusion of Microservices

One of the key issues that development teams faced was rapidly and accurately pinpointing the root causes of incidents and performance problems.

While root cause analysis had always been challenging, historically it was not an impossible ask, thanks to two factors:

  • First, application topologies were relatively simple and stable, and cleanly segregated from the underlying infrastructure. That simplicity, stability, and segregation meant that, if the source of a problem lay within the application, determining the root cause was straightforward, often requiring little more than visual inspection of a topology diagram. Things became a bit more interesting if elements of the infrastructure were to blame, but even there it was usually a question of spotting some very obvious anomalies in relatively small collections of metrics.
  • Second, internally, business decision makers generally accepted that solving performance and availability issues simply took time, which eased the pressure on IT Ops and Development practitioners. In the fullness of time, they would give a good accounting of what went wrong and would have implemented a solid fix.

The aggressive refactoring of legacy code and the crafting of new code in the form of microservices upset the apple cart, however. The newer applications had no permanent topology to speak of, and individual microservices could flit in and out of existence at a second's notice. Meanwhile, the boundary between application and infrastructure blurred as it became impossible to say when a microservice was 'acting like an app' and when it was 'acting like a server'. Furthermore, business decision makers, like the rest of society, became accustomed to 'Google time' and increasingly questioned the leisurely pace of response provided by corporate IT. In the end, it was decided that the need to diagnose microservice-based performance problems in a systematic way had become pressing.

300

Of course, developers in particular were not blind to the issue. Each development team had indeed tried to address performance problems on its own. Unfortunately, there were approximately 300 separate teams and, hence, 300 fragmented views of what was taking place in the digital environment, united only by the general adoption of open source Grafana/Prometheus technology as a foundation. Needless to say, the fragmentation did not help when it came to determining the source of misbehaviour in an end-to-end business transaction that crossed the boundaries of many teams' domains of responsibility.
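
To make the fragmentation concrete, below is a minimal sketch of the kind of service-local instrumentation each team might have maintained, assuming Python and the prometheus_client library; the service and metric names are hypothetical. Each team exposed metrics like these for its own service only, so every Grafana dashboard showed one narrow slice of the environment.

```python
# Illustrative sketch: service-local Prometheus instrumentation of the kind
# each of the ~300 teams maintained. The service name ("claims-service") and
# endpoints are hypothetical; only this one service's view is exposed.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "claims_service_requests_total",
    "Requests handled by the claims service",
    ["endpoint", "status"],
)
LATENCY = Histogram(
    "claims_service_request_latency_seconds",
    "Request latency as seen from inside the claims service only",
    ["endpoint"],
)

def handle_request(endpoint: str) -> None:
    """Simulate one request and record service-local metrics."""
    start = time.monotonic()
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
    status = "200" if random.random() > 0.05 else "500"
    LATENCY.labels(endpoint=endpoint).observe(time.monotonic() - start)
    REQUESTS.labels(endpoint=endpoint, status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes /metrics on this port
    while True:
        handle_request("/quote")
```

Multiply this by roughly 300 services and the limitation becomes clear: service-local counters and histograms say nothing about where a slow end-to-end transaction actually spent its time.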

Why You Shouldn't Take It Literally…

The insurance company went to the market and, besides considering a completely open source solution, evaluated the offerings of a number of vendors, including Splunk, that are designated 'Leaders' in the Gartner® Magic Quadrant™ for Application Performance Monitoring (APM) and Observability. The evaluation process itself was initially constrained by some of the insurance company's preconceptions. Since, as mentioned above, the infrastructure had historically been the source of the 'hard' root cause problems, vendors were asked to demonstrate their respective capabilities in the infrastructure monitoring case. And since infrastructure analyses had always been conducted through the examination of metrics, it was the metrics ingestion, analysis, and presentation components of the tooling that drew all of the attention. Splunk's competitors simply followed through on the company's requests: their Proofs of Concept (POCs) were more or less all about how their solutions could solve infrastructure root cause issues using metrics. Splunk, however, took a different path.

Team Splunk recognised that the potential customer had misdiagnosed its problems and, hence, put forward a set of requirements that was incomplete at best. Regional Sales Managers (RSMs), Sales Engineers (SEs), Advisors, and Strategists, working together, convinced business and technology buyers alike that the issues pertained to the entire stack, not just the infrastructure layer, and that, in order to deal with the full-stack situation, developers and IT operations practitioners would need to make use of metrics and logs and, especially, traces to significantly cut Mean Time to Detect (MTTD) and speed the determination of root cause. In other words, whilst the competition took the evaluation process as a given and followed the potential customer's dictates in a slavish manner, Splunk reshaped the evaluation process to fit both the customer's actual needs and Splunk's strengths. In the end, after an initial winnowing left Splunk and one other player in consideration, Splunk's Observability platform was selected.
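
Traces are what tie a cross-team transaction together. As a rough illustration of the idea, not of Splunk's product internals, here is a minimal OpenTelemetry tracing sketch in Python; the service and span names are hypothetical, and a console exporter stands in for a real backend (Splunk Observability Cloud ingests OpenTelemetry data, but endpoint configuration is omitted here).

```python
# Minimal OpenTelemetry tracing sketch (Python SDK). Service and span names
# are hypothetical; a real deployment would export spans to a collector or
# backend rather than to the console.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "policy-frontend"})
)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def fetch_rating() -> float:
    # Child span: in production this would be a call into another team's
    # microservice, linked to the parent by propagated trace context.
    with tracer.start_as_current_span("rating-service.calculate") as span:
        span.set_attribute("policy.type", "auto")
        return 0.87

with tracer.start_as_current_span("quote-request") as root:
    root.set_attribute("http.route", "/quote")
    rating = fetch_rating()
    root.set_attribute("quote.rating", rating)
```

Because the parent and child spans share one trace ID, the whole quote request can be followed across service (and team) boundaries; that end-to-end view, rather than 300 disjoint metric dashboards, is what shortens detection and root cause determination.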

New Realms and Beyond: More Change on the Way

The initial implementation was successful, but the insurance company now recognises that selecting the right technology is only the first step. An evolution in process and procedure needs to take place to match a technology evolution that, even as this is written, is picking up pace yet again. It is now accepted that the company's technology future is in the cloud and that little will be left behind on premises. The microservice style is now the house style and, if anything, the degrees of modularisation and ephemerality will only increase as platforms move in the serverless direction. In other words, there is a lot more change on the way, but when it comes to diagnosing the aberrant behaviour of microservices, Splunk's technology will be a constant for some time to come.

This European company's adoption of a more revolutionary attitude to digital transformation was not the result of unique circumstances. Any company with a focus on digitalisation and a strategic approach to data can get there with an assist from the Splunk technology portfolio. Unfortunately, until now, the reach of Splunk's observability functionality into the German marketplace has been somewhat limited by the lack of a realm in German geography that satisfies regulatory requirements. Hence, Splunk was not able to support many German companies as they coped with the accelerating pace of change in the digital environment, even when Splunk was already the strategic choice for on-premises IT operations log management and security information and event management. Now, however, with the new realm, our German customers will be in a great position to build out a unified, AI-enabled approach to development, operational, and security data management.

Want more details on the Splunk realms and our solutions? Don't hesitate to get in touch.