
Building an AI Assistant in Splunk Observability Cloud

Splunk Observability Cloud is a full-stack observability solution, combining purpose-built systems for application, infrastructure and end-user monitoring, pulled together by a common data model, in a unified interface. This provides essential end-to-end visibility across complex tech stacks and various data types, such as metrics, events, logs, and traces (MELT), as well as end-user sessions, database queries, stack traces and more. However, the sheer volume and variety of data can make pinpointing and resolving issues a daunting task, often relying heavily on individual expertise and familiarity with tools. This is where the AI Assistant comes into play, providing a conversational interface to surface insights and streamline investigation and exploration across the entire environment.

At its core, the AI Assistant allows users to interact with Observability data and compose workflows (e.g., troubleshooting, exploration) using natural language. This addresses a wide range of observability activities and use cases: inspecting and analyzing the health of a Kubernetes cluster, identifying sources of latency in a complex service topology, finding span attributes associated with errors, pinpointing root causes or surfacing patterns among logs, and so on.

Figure 1. The user interface of the AI Assistant in Observability Cloud

Splunk has a rich history of applying advances in language modeling to enhance offerings across observability and security. Recently, we announced fundamental enhancements to the Splunk AI Assistant for SPL (Search Processing Language), which provides a natural language interface for constructing and understanding SPL, the language for expressing queries in the Splunk platform. The AI Assistant in Observability Cloud (the "AI Assistant") represents our continued investment in this area. (The AI Assistant is currently available to select private preview participants upon Splunk's prior approval.)

The ability of large language models (LLMs) to produce impressive answers and analysis in the observability domain inspired us to bring the background knowledge and reasoning capabilities of modern LLMs into our products. This blog post will discuss our high-level technical approach, some of the challenges we faced in adapting the approach to our domain, and some general ideas on where things might be headed.

The Agent Framework

Our approach follows the agent paradigm, wherein a generally capable LLM is augmented with access to various tools. The main conversational thread is governed by an orchestration agent, powered by an LLM with the key capabilities of understanding a user's intent, planning, calling the right tools, and reasoning over tool responses. In the context of Splunk Observability Cloud, this orchestrator can understand a user's request; formulate a plan (a sequence of tool invocations); route requests to the specific microservices provided by Splunk Application Performance Monitoring (APM), Splunk Infrastructure Monitoring, Splunk Log Observer Connect, and so on; and synthesize the tool responses in order to answer the initial request. The LLM is provided with a list of tools with descriptions and signatures. For a complicated task, it can chain multiple tools together to obtain a final answer. For example, if a user asks for "the root cause of the high error rate in payment service", the orchestrator needs to understand that the user's intent is to troubleshoot a particular service. Through planning, the LLM determines that it first needs to search service names to find one like "payment", call APM APIs for an error breakdown, and then extract information from the breakdown to identify a possible root cause.
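
To make this control flow concrete, here is a minimal sketch of such an orchestration loop in Python. The tool registry, message format, and call_llm() helper are illustrative assumptions rather than the actual implementation:

# Hypothetical sketch of the orchestration loop: at each step the LLM either
# requests a tool call or returns a final answer. The tool registry, message
# format, and call_llm() helper are illustrative, not Splunk's implementation.
import json

def find_services(keyword, time_range):
    # Stub standing in for a call to an APM service-search API.
    return [{"service": "paymentservice", "environment": "prod"}]

TOOLS = {
    "find_services": find_services,
    # ...additional tools: error breakdown, service topology, log search, ...
}

def run_orchestrator(user_query, call_llm, max_steps=10):
    messages = [
        {"role": "system", "content": "You are a helpful assistant for Splunk Observability Cloud."},
        {"role": "user", "content": user_query},
    ]
    for _ in range(max_steps):
        reply = call_llm(messages, TOOLS)        # plan: decide on a tool call or an answer
        if reply.get("tool_call") is None:
            return reply["content"]              # synthesize: final answer for the user
        name = reply["tool_call"]["name"]
        args = reply["tool_call"]["arguments"]   # e.g., {"keyword": "payment", "time_range": ["-1h", "now"]}
        result = TOOLS[name](**args)             # route: invoke the selected microservice tool
        messages.append({"role": "tool", "name": name, "content": json.dumps(result)})
    return "Unable to complete the request within the step budget."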

The System Prompt for the Orchestrator

Designing efficient and safe system prompts is crucial for guiding the orchestration agent to understand the context and constraints of a user's query. The system prompt includes the following components (a sketch of how they might be assembled follows the list):

  • context instructions for observability to provide background and general guidelines for the agent to serve as a "helpful assistant" for Splunk Observability Cloud;
  • safeguarding instructions to restrict the agent to only answer questions about observability and avoid inappropriate conversation; and
  • specific task instructions, guidelines for the agent to perform certain tasks in the observability domain, such as handling metric time series and tackling APM tasks.
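
As a rough illustration, the three prompt sections above could be assembled along the following lines; the wording of each section is hypothetical and not Splunk's actual prompt:

# Hypothetical sketch of assembling the orchestrator's system prompt from its
# three parts; the wording of each part is illustrative only.
CONTEXT_INSTRUCTIONS = (
    "You are a helpful assistant for Splunk Observability Cloud. "
    "You help users explore metrics, traces, logs, and infrastructure."
)
SAFEGUARDS = (
    "Only answer questions related to observability. "
    "Politely decline requests that are off-topic or inappropriate."
)
TASK_INSTRUCTIONS = (
    "When analyzing metric time series, prefer generating a SignalFlow program. "
    "For APM tasks, resolve the environment and service names before other calls."
)

def build_system_prompt():
    # Concatenate the sections in a fixed order so behavior is reproducible.
    return "\n\n".join([CONTEXT_INSTRUCTIONS, SAFEGUARDS, TASK_INSTRUCTIONS])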

Short-term Memory

The orchestrator is equipped with a short-term memory to retain contextual information, enabling it to make appropriate decisions. This memory encompasses the user's current query, the current conversation between the user and the agent, system prompts, and tool descriptions. This memory is short-term as it pertains to the current conversation only. The memory capacity is determined primarily by the context length of the orchestrator's LLM.
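
As an illustration, such a context-bounded memory might be maintained as sketched below, dropping the oldest turns first while always retaining the system prompt and tool descriptions; the token-counting helper is a crude stand-in for a real tokenizer:

# Hypothetical sketch of a short-term memory bounded by the LLM's context length.
# approx_tokens() is a rough stand-in for a real tokenizer.
def approx_tokens(text):
    return max(1, len(text) // 4)  # crude heuristic: ~4 characters per token

def trim_memory(system_messages, turns, max_context_tokens):
    """Keep the system prompt/tool descriptions and as many recent turns as fit."""
    budget = max_context_tokens - sum(approx_tokens(m["content"]) for m in system_messages)
    kept = []
    for turn in reversed(turns):               # newest turns are the most relevant
        cost = approx_tokens(turn["content"])
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return system_messages + list(reversed(kept))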

Tools

Having provided the agent with a general purpose and guidelines, we then conceive of various observability data and platform elements as tools. In this preview release, the agent's tools cover the following areas:

  • Application Performance Monitoring tools, including retrieving service and environment names, obtaining the upstream and downstream dependencies for a given service (graph-based service topology), getting the breakdown of service errors and latency by tag, and retrieving trace samples, trace errors and spans.
  • Infrastructure Monitoring tools, retrieving related metadata on infrastructure instances (e.g., EC2, K8s, RDS, Redshift), and returning navigator dashboards.
  • Incident (Alert) tools, including searching all and specific alerts in an organization.
  • Log tools to perform keyword search and extract log patterns via SPL.
  • Dashboard tools to search related dashboards, where each dashboard is a collection of charts that help monitor the health of various entities.
  • SignalFlow (i.e., metrics analytics; see section SignalFlow Generation Specialist below) tools, including a tool for searching metrics and associated metadata, and a specialized agent for generating and executing SignalFlow programs to analyze metric time series data.

Each tool comes with a carefully designed description and parameters. The orchestrator is responsible for tool selection and extracting parameters from the context (e.g., "the past hour" is mapped to a time range object ["-1h", "now"]).
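
For illustration, a tool description of this kind might be exposed to the orchestrator as a schema along the following lines; the tool name, fields, and parameter types are hypothetical:

# Hypothetical example of a tool description handed to the orchestrator.
# The tool name, parameters, and field layout are illustrative only.
ERROR_BREAKDOWN_TOOL = {
    "name": "get_service_error_breakdown",
    "description": "Return the breakdown of errors for a service by span tag.",
    "parameters": {
        "service": {"type": "string", "description": "Exact APM service name."},
        "environment": {"type": "string", "description": "Deployment environment."},
        "time_range": {
            "type": "array",
            "description": "Start and end of the window, e.g. ['-1h', 'now'] for 'the past hour'.",
        },
    },
}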

A Hybrid Agent Approach

Some of the tools encapsulate multi-step workflows and are themselves backed by LLMs; we call these specialized agents sub-LLMs, since they typically compute on a subset of the context available to the orchestration agent. These specialists focus on handling specific tasks, such as SignalFlow generation, chart creation, root cause analysis, and so on. They retain the ability to invoke other tools in the same manner as typical function calls.

Figure 2. The architecture of the AI Assistant in Observability Cloud

This hybrid strategy yields several advantages:

  • Offers a flexible, microservices-like approach that allows for rapid experimentation on specialists when required, and easier development and validation of different components.
  • Encourages us to think in interfaces, yielding simplicity and clarity in the implementation.
  • Provides flexibility to mix-and-match different LLMs (e.g., different GPT versions, fine-tuned GPTs, fine-tuned OSS) for both orchestration and sub-LLMs, thereby enabling different tradeoffs among quality, performance, and cost.
  • Enables longer conversations, by keeping some tokens out of the main conversational thread.
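
To make the specialist-as-tool pattern described above concrete, here is a hypothetical sketch of a sub-LLM wrapped as an ordinary tool; the prompt wording and the call_llm() helper are assumptions:

# Hypothetical sketch of a specialist sub-LLM exposed to the orchestrator as an
# ordinary tool; call_llm() is a stand-in for invoking the specialist's model.
def make_signalflow_tool(call_llm):
    def signalflow_tool(task_description):
        # The specialist sees only its task description, not the full conversation,
        # which keeps those tokens out of the main conversational thread.
        prompt = (
            "Generate a SignalFlow program for the following request. "
            "Return only the program text.\n\nRequest: " + task_description
        )
        return call_llm(prompt)
    return signalflow_tool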

SignalFlow Generation Specialist

SignalFlow is the metrics analytics engine at the heart of Splunk Observability Cloud. It is a Python-like language that allows users to transform and analyze incoming streaming data, and write custom charts and detectors. Although SignalFlow is a powerful computational tool, like SPL it has a steep learning curve. We designed a specialized sub-LLM for SignalFlow generation that can generate programs from the user's natural language queries and task descriptions. For example, if a user asks for "the average cpu utilization", the agent will generate a SignalFlow program like:

data('cpu.utilization').mean().publish()

We utilized lessons learned from developing the AI Assistant for SPL, and found that chain-of-thought prompting and retrieval augmented generation greatly enhance the sub-LLM's ability to generate correct programs of moderate complexity, comparable to intermediate SignalFlow users. For the science and engineering details, please refer to our companion blog, "Generative AI for Metrics in Observability."
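
To give a flavor of how retrieval augmented generation can feed the specialist's prompt, here is a hypothetical sketch; the example store, the retrieval heuristic, and the prompt wording are assumptions, and the retrieved programs are illustrative snippets rather than verified production examples:

# Hypothetical sketch of retrieval-augmented prompting for SignalFlow generation.
# EXAMPLES and retrieve_examples() stand in for a real example store and retriever.
EXAMPLES = [
    ("average cpu utilization", "data('cpu.utilization').mean().publish()"),
    ("maximum memory utilization", "data('memory.utilization').max().publish()"),
]

def retrieve_examples(query, k=2):
    # A real system would use embedding similarity; keyword overlap keeps the sketch simple.
    overlap = lambda ex: len(set(query.lower().split()) & set(ex[0].split()))
    return sorted(EXAMPLES, key=overlap, reverse=True)[:k]

def build_signalflow_prompt(query):
    shots = "\n\n".join(f"Request: {q}\nProgram: {p}" for q, p in retrieve_examples(query))
    return (
        "Think step by step about the metric, the aggregation, and the time window, "
        "then write the SignalFlow program.\n\n" + shots + f"\n\nRequest: {query}\nProgram:"
    )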

Challenges and Solutions

Handling Data Types

In addition to task decomposition, tool selection, and query generation, we needed to understand LLM capabilities in processing various data types (metrics, events, logs, and traces). Our general experience is as follows.

  • LLMs can make mistakes in analysis of time series data, so where possible we push computation into the analytics system. For example, it is preferable for a SignalFlow program to apply .mean(over='5m') than to ask an LLM to compute the average.
  • With minimal prompting/examples, LLMs have basic graph processing capabilities (in our case, reasoning across service dependencies).
  • Asking an LLM to process a large amount of log data can be slow and produce low-quality results, so where possible we arrange for it to operate on higher-level constructs, such as log patterns, instead. When the input is large enough, it may not even fit into the LLM's context window, introducing the additional challenge of processing the data in pieces (a sketch of this chunked approach follows the list).
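
Here is a minimal sketch of the chunked, map-reduce style of processing mentioned in the last point; the chunk size and the call_llm() helper are illustrative assumptions:

# Hypothetical sketch of handling logs that exceed the LLM context window:
# summarize fixed-size chunks, then summarize the summaries.
def chunk_lines(lines, max_chars=8000):
    chunk, size = [], 0
    for line in lines:
        if size + len(line) > max_chars and chunk:
            yield "\n".join(chunk)
            chunk, size = [], 0
        chunk.append(line)
        size += len(line)
    if chunk:
        yield "\n".join(chunk)

def summarize_logs(lines, call_llm):
    partials = [
        call_llm("Summarize the notable errors and patterns in these logs:\n" + c)
        for c in chunk_lines(lines)
    ]
    return call_llm("Combine these partial summaries into one summary:\n" + "\n".join(partials))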

Complicated Toolchain

As more tools are incorporated into an agent, the difficulty of selecting the right tool increases. This leads to higher error rates on complicated tasks that involve chaining multiple tools to arrive at a final answer. For example, a typical troubleshooting journey for a service incident requires a multi-step workflow, such as:

environment name -> service name -> service topology -> service errors -> logs search

Ideally, the agent should be able to follow this workflow and call the functions in the correct sequence. However, on some occasions the agent may fail to do so, for example by querying the service topology immediately without first resolving the service and environment names.

Workflow-based Optimization

For complicated tasks, we optimize the orchestration agent by instructing it to follow some typical workflows. This optimization includes three steps:

  • Identifying challenging workflow examples. Given a set of task queries, we run the agent in multiple trials, and calculate the variance in tool use trajectories. The queries with high variance trajectories are selected, as the agent is not confident about such tasks.
  • Distilling workflow instructions from the examples. We manually annotate the correct tool use trajectories, and distill the patterns as workflow descriptions.
  • Instructing the agent to follow typical workflows via the system prompt. Finally, this workflow description is added to the system prompt to instruct the agent to use correct tool trajectories for similar use cases.

With such workflow-based optimization, we can improve performance on tasks that require complicated tool use, as the workflow instructions provide extra domain knowledge for the agent to address these tasks.
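
As a rough illustration of the first step, high-variance queries might be flagged as sketched below; the diversity measure and threshold are stand-ins for the actual variance calculation:

# Hypothetical sketch of step 1: flag queries whose tool-use trajectories vary across trials.
from collections import Counter

def trajectory_diversity(trajectories):
    """Fraction of trials deviating from the most common trajectory (0 = fully consistent)."""
    counts = Counter(tuple(t) for t in trajectories)
    return 1.0 - counts.most_common(1)[0][1] / len(trajectories)

def find_challenging_queries(queries, run_agent, trials=5, threshold=0.4):
    challenging = []
    for query in queries:
        trajs = [run_agent(query) for _ in range(trials)]  # each run returns a list of tool names
        if trajectory_diversity(trajs) >= threshold:
            challenging.append(query)
    return challenging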

Query Expansion

In many cases of tool use, the agent needs to extract the right search terms from user queries to search for certain information in systems that are typically keyword-based. For example, for the question of "What is the average disk usage?", the agent should retrieve the metric "disk.utilization". Ideal search terms would be "disk" and "utilization", but the agent usually extracts "disk" and "usage" as search terms, so that "disk.utilization" may not be a top hit.

We alleviate this issue by expanding the queries with the knowledge of LLMs. Specifically, we include a list of synonyms for commonly used terms in the observability domain, and expand the search terms using the synonym list. For the above example, the possible search terms can be expanded to "disk", "usage", and "utilization", increasing the recall of the search tool. The result is that keyword-based search systems behave more like semantic search systems.
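
A minimal sketch of this kind of synonym-based expansion, with an illustrative synonym list rather than the curated domain list, might look like:

# Hypothetical sketch of synonym-based query expansion for keyword search.
# The synonym table is illustrative; the real list is curated for the observability domain.
SYNONYMS = {
    "usage": ["utilization", "consumption"],
    "disk": ["filesystem"],
    "cpu": ["processor"],
}

def expand_terms(terms):
    expanded = set()
    for term in terms:
        expanded.add(term)
        expanded.update(SYNONYMS.get(term, []))
    return sorted(expanded)

# expand_terms(["disk", "usage"]) -> ["consumption", "disk", "filesystem", "usage", "utilization"]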

Evaluation

There are two primary challenges for our agent evaluation:

  • Ground truth data. It takes non-trivial effort to collect ground truth data across a variety of realistic user scenarios. For a given query, we need to collect ground truth data for both the tool use trajectories and final responses. There might be multiple trajectories that arrive at the same correct answer.
  • Metric definition. Meaningful metrics are needed to evaluate both the trajectories of tool use and final responses. Standard metrics (e.g., accuracy for classification) are not directly applicable.

We developed a trajectory-based approach for both data collection and metric definition. The main idea is that, for a given query, we first run the agent for multiple trials, then aggregate identical tool-use trajectories and collect the final responses. During the ground-truth collection phase, we gather all trajectories and responses from the multiple trials and manually select the correct ones. For the evaluation phase, we developed two metrics: trajectory match for tool use, and embedding-based similarity for final responses. The following figure shows how we match the trajectory: when the order of tool use fully matches the ground truth, the run is regarded as correct. To measure the final response, we use the cosine similarity between the embeddings of the agent's response and the ground truth response.

Figure 3. Overview of the trajectory-based evaluation method
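
For illustration, the two metrics could be computed along the following lines; embed() is a stand-in for an actual sentence-embedding model:

# Hypothetical sketch of the two evaluation metrics. embed() stands in for a real
# sentence-embedding model; the exact-match rule mirrors the trajectory check above.
import math

def trajectory_match(predicted, ground_truths):
    """Exact ordered match against any accepted ground-truth trajectory."""
    return any(list(predicted) == list(gt) for gt in ground_truths)

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def response_score(response, ground_truth_response, embed):
    return cosine_similarity(embed(response), embed(ground_truth_response))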

We assembled a test set of questions representative of those our users are likely to ask, and we verified the assistant's answers by identifying the correct answer in the Splunk Observability Cloud product. Over many iterations of prompt engineering on system prompts, tool descriptions, and workflows, we observed consistent improvement, and the AI Assistant meets its targeted business requirements with state-of-the-art performance.

Conclusion

Next Steps

Broadly speaking, we plan to explore how to extend our AI Assistant in multiple ways:

  • Depth of skills. We plan to investigate further refinements to the SignalFlow model, to add more complex capabilities with logs data along the lines of natural language to SPL, and to do more with trace data.
  • Agent skill expansion. In addition to adding depth to the existing capabilities, we are looking into incorporating other standard observability data sources (real user sessions, results of synthetic tests) to allow the AI Assistant to conduct more thorough investigations.
  • Workflow-based retrieval and fine-tuning. Workflow-based optimization will be an iterative development process: we need to continuously identify the challenging examples and update the agent with workflow instructions. As we have more coverage on these cases, the agent will become more capable. Furthermore, we are looking into using curated examples to fine-tune the agentic models with more robust results and less context, thus reducing token cost.

In short, we improve existing skills, we develop new skills, and we find new ways of putting skills together.

Final Thoughts

As a whole, we find developing LLM applications, especially compared to traditional software development, to be exhilarating: carefully crafted additions to the system prompt can enable what are essentially new features (e.g., novel SignalFlow constructions yielding new insights, APM investigations that might otherwise require expert product understanding), and implementing a tool can unlock new workflows (e.g., navigation between APM and metrics data). The experience has also been frustrating at times: apparently trivial modifications to the system could have surprisingly large effects on certain test scenarios, and certain stubborn hallucinations required hard-coded patches. In addition to standard software testing practices, we developed evaluation practices (eventually part of our pipelines) with an eye toward helping us make product decisions.

Deploying the AI Assistant internally at Splunk has yielded a wealth of insights regarding user expectations and has helped to focus our research and development efforts. Several of our internal engineering teams are already using the AI Assistant in a range of use cases, from helping new team members better understand their systems, to in-depth analysis of issues in production. We are eager to see how users continue to interact with our AI Assistant and what use cases they hope to address, to further expand the AI Assistant's skills and workflows.

Co-authors and Contributors:

Joseph Ross is a Senior Principal Applied Scientist at Splunk working on applications of AI to problems in observability. He holds a PhD in mathematics from Columbia University.

Om Rajyaguru is an Applied Scientist at Splunk working primarily on designing, fine-tuning, and evaluating multi-agent LLM systems, along with time series clustering problems. He received his B.S. in Applied Mathematics and Statistics in June 2022, with research focused on multimodal learning and low-rank approximation methods for deep neural networks.

Liang Gou is a Director of AI at Splunk working on GenAI initiatives focused on observability and enterprise applications. He received his Ph.D. in Information Science from Penn State University.

Kristal Curtis is a Principal Software Engineer at Splunk working on a mix of engineering and AI science projects, all with the goal of integrating AI into our products so they are easier to use and provide more powerful insights about users' data and systems. Prior to joining Splunk, Kristal received her Ph.D. in Computer Science from UC Berkeley, where she studied with David Patterson and Armando Fox in the RAD & AMP Labs.

Akshay Mallipeddi is a Senior Applied Scientist at Splunk. His principal focus is augmenting the AI Assistant in Observability Cloud by formulating strategies to improve data integration, which is critical for the large language models. He is also involved in fine-tuning large language models. He received his M.S. in Computer Science from Stony Brook University, New York.

Harsh Vashishta is a Senior Applied Scientist at Splunk working on the AI Assistant in Observability Cloud. He received his M.S. in Computer Science from the University of Maryland, Baltimore County.

Christopher Lekas is a Principal Software Engineer at Splunk and quality owner for the AI Assistant in Observability Cloud. He holds a B.A. in computer science and economics from Swarthmore College.

Amin Moshgabadi is a Senior Principal Software Engineer at Splunk. He holds a B.A.S. from Simon Fraser University.

Akila Balasubramanian is a Principal Software Engineer and the technical owner of the AI Assistant in Splunk Observability Cloud. She is passionate about building products that help monitor the health and performance of applications at scale. She is a huge believer in dogfooding products and closely collaborating with customers to get direct, candid feedback. She enjoys leveraging her analytical and creative skills to solve problems and values quality above all else. She holds a master's degree in Computer Science from the University of Illinois.

Sahinaz Safari is a Director of Product Management and the head of AI in Observability. She has a long track record of building and scaling innovative products based on cutting-edge technologies in the Observability domain. Sahinaz has an M.S. in Electrical Engineering from Stanford University and an MBA from UC Berkeley.