10/29/2024 | Press release | Distributed by Public on 10/29/2024 15:37
Imagine you're working on an AI product that can summarize customer success phone calls for training purposes. Your company's product leverages large language models (LLMs) to summarize, synthesize, triage, and generate relevant outputs. You're aware that LLMs can hallucinate, output harmful or biased text, or be manipulated through prompt injection attacks. As a responsible employee, you want to run more robust tests than routine acceptance testing, such as using more data and expanding the risk surface. What do you need to run those tests? Practitioners trying to design, implement, and execute these tests are trying to answer this question. This blog post provides our high-level answer, describing four components we recommend when designing, implementing, and executing a test:
The Responsible AI & Tech team at Salesforce has performed several internal red teaming activities enhancing the efficiency and safety of our AI products. Read more below for a deeper understanding of each component.
The bedrock for any test of an AI system is "high quality" data. But what does it mean for data to be high quality? We focus on three aspects of high quality data that represent some of the bigger hurdles facing organizations today: Use Case Specific Data, Data Storage for Reproducibility, and Data Maintenance.
High quality data is contextual, whether you are doing a broad adversarial test for a model or a deeper test for a product. Here are some tips for creating use case specific data:
Each of these have their pros and cons in terms of cost, time, and efficacy, but having at least one mechanism for generating data is important to ensure you can test.
Now that the data has been procured, storing it is the next step to having high quality data. There are a couple of key components when storing your data:
In an ideal world, your team would set up a proper database to store your data. However, if people are more comfortable with spreadsheets, start with that and evolve the data storage strategy over time. The most important piece is to have a clear data storage strategy in place and then evolve it into something sustainable for the enterprise.
3. Data Maintenance
Lastly, high quality data requires maintenance. This is a shift in mindset: instead of thinking of data as a static object that an enterprise collects, think of data as a resource to continue building on. Common issues include data that has been stored for a long time, requiring you to collect new data for similar purposes, or having something adjacent to the exact data you need. There are myriad other examples that are highly specific to each use case, but reconciling the fact that data needs to be kept up to date is crucial for the effectiveness of testing.
When in pre-production, most organizations will have a user interface to do testing. But that doesn't scale well if you want to use hundreds or more data points. This is why having code to programmatically access products is necessary. This might sound basic, but unless your product was built to have programmatic access, this would be very hard. We recommend:
In a future blog post, we will discuss how building such a package then affords the creation of automated red teaming and what components should be built for that.
Once model outputs are received, they need to be evaluated to determine how well the system performed. To report on this with confidence, you'll need to create crisp definitions that can be used as a calibration point for all stakeholders. For example, for toxicity testing, you must define what toxicity is for your use case. Determining how specific you want to be in your conceptualization early on will save you a lot of refactoring. This work should be done in tandem with product and engineering teams to create the best fit for your use case.
Some best practices for designing, maintaining, and implementing taxonomies and standards are as follows:
Once a taxonomy has been established and all stakeholders are in agreement, testing becomes straightforward. A well-crafted taxonomy feeds forward into automated processes, providing rigor and structure to automatic labeling and prompt generation. Standards and definitions, especially in domains with high subjectivity (like ethics), allow for more effective programmatization of these higher-level concepts into testing infrastructure.
You gathered your data. A codebase has been established to easily automate tests. The organization and data have been aligned to a taxonomy. The last layer of infrastructure is how you communicate and interact with product teams. This can range from integrating an RAI expert into the team to advising teams on how they should design and implement tests.
At Salesforce, product teams fill out a form that collects details such as how the product will function, what ethical safeguards already exist, and so on. The answers to those questions get triaged by our team of Responsible AI & Tech product managers. If they determine that the product needs to be reviewed, they produce a Product Ethics Review, which classifies the various potential risks and downstream harms. Depending on the nature of the product or the model, the risks identified can be narrow and deep or diverse and broad.
The Testing, Evaluation, and Alignment team within our Responsible AI & Tech team then design tests with the product team around the identified risks. A test plan is generated to manage stakeholder expectations, as well as to scope out any technical work. During the execution of the test, labeling guidelines and mental health guardrails may be developed to facilitate the labeling of harmful outputs. Results are analyzed, bugs are reported, and a report is written for leadership. Once mitigations are implemented, follow-up tests are done as well to ensure risks have been reduced.
Any time an organization tests, it will need data to execute the test, a way to programmatically access the product, taxonomies for alignment, and a process for communication and execution. While the goal is to have this infrastructure set each time a test is executed, sometimes we have to make ad hoc data, and sometimes our test plans are not written with the most detail. But we use this as our North Star, something to aspire to every time we do a test. Because at the end of the day, if we can execute high quality tests quickly, that will reduce the amount of time we need to deliver our results, and increase the amount of time product teams need to mitigate them, ensuring products can be shipped safely.