Oracle Corporation

June 27, 2024 | Press release

From inference to RAG: Choosing CPUs for efficient generative AI ...

In previous blog posts, we demonstrated that CPUs are a globally available and cost-performant option for running sub-15B LLM inference workloads. By improving how Ampere Arm64 CPUs process model weights, managing threads more efficiently, and building new compute kernels, Ampere Computing and Oracle Cloud Infrastructure (OCI) delivered up to a 152% performance gain over the current upstream llama.cpp open source implementation, which translates to 158 tokens per second with 32 concurrent requests.

In this blog post, we further explore the value of CPUs in supporting retrieval-augmented generation (RAG) with vector embeddings. We have published containers and examples in GitHub and Docker Hub repositories. The following sections cover the scenarios enabled and how you can build production-ready applications efficiently.

RAG with vector embeddings on CPUs

RAG is a popular way to customize LLM responses with your own data. Most companies start with a pretrained large language model (LLM) and then augment its responses with data that's relevant to the business, without having to fine-tune the entire model. Given its flexibility and ease of implementation, RAG has become a common technique that AI engineers use to keep model responses relevant in a fast-evolving data environment. To perform RAG, a company first needs to define its vector embedding strategy. Embeddings are a way of representing data: input data, such as PDFs and text files, is converted into a numerical format that preserves semantic relationships. These vectors are stored in a vector database, such as Oracle Database 23ai or Chroma DB, and generating them requires a model specialized in embeddings. Embedding models are often small (less than 1 GB), making CPUs a cost- and energy-efficient infrastructure choice. For the most popular embedding models, see the Hugging Face leaderboard.
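To make the embedding step concrete, here is a minimal sketch of vectorizing a few text chunks with a small CPU-friendly embedding model and storing them in Chroma DB. The chunk contents, collection name, and storage path are illustrative assumptions, not part of the published examples.

```python
# Minimal sketch: embed text chunks on CPU and store them in Chroma DB.
import chromadb
from sentence_transformers import SentenceTransformer

# Small (<100 MB) embedding model; runs comfortably on CPU.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chunks = [
    "OCI Ampere A1 shapes are Arm-based compute instances.",
    "RAG retrieves relevant chunks before calling the LLM.",
]
embeddings = model.encode(chunks)  # shape: (num_chunks, 384)

# Persist the vectors locally; a managed vector database works the same way conceptually.
client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection("docs")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings.tolist(),
)

# Retrieve the chunk most similar to a question.
question_vector = model.encode(["What is RAG?"]).tolist()
result = collection.query(query_embeddings=question_vector, n_results=1)
print(result["documents"])
```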

As part of the Ampere AI team's June software release, full support for LangChain- and LlamaIndex-based RAG examples using Chroma DB as the vector database was released. The following workflow illustration shows an example.
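As a rough outline of that workflow, the sketch below wires the same pieces together with LangChain: load and chunk a document, embed it on CPU into Chroma DB, then answer questions with a llama.cpp-served model. File paths, the GGUF model name, and import paths (which vary across LangChain versions) are assumptions for illustration, not the published example itself.

```python
# Hedged LangChain RAG sketch: CPU embeddings + Chroma DB + llama.cpp generation.
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load and chunk the custom data (path is illustrative).
docs = TextLoader("my_notes.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

# 2. Vectorize the chunks on CPU and store them in Chroma DB.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectordb = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_store")

# 3. Augment a llama.cpp-served model with the retrieved context (model file is an assumption).
llm = LlamaCpp(model_path="./llama-2-7b.Q4_K_M.gguf", n_ctx=4096)
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectordb.as_retriever(search_kwargs={"k": 3}))
print(qa.invoke("What do my notes say about quarterly revenue?"))
```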

Performance results: Running embeddings model on CPUs

We ran extensive benchmarking using optimized Ampere A1 CPUs. Using 80 A1 cores, we were able to process 1 GB of text data into embeddings in less than seven minutes, which translates to approximately $1.40 to vectorize 10 GB of data. With right-sized compute and the worldwide availability of Ampere A1, you can deploy an OCI Compute instance in the region of your choice and dynamically scale CPU capacity based on your embedding needs. The following measurements use all-MiniLM-L6-v2, the most downloaded open source embedding model on Hugging Face, with a batch size of 1024 on Ampere A1; a rough time-and-cost estimator sketch follows the table:

CPU cores               4      8      16     32     64     80
Throughput (MB/min)     7      14.5   29     58     116    145
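The helper below shows how to turn a throughput figure from the table into a time and cost estimate for your own dataset. The hourly rate is a placeholder parameter, not official OCI pricing; substitute the current Ampere A1 rate for your chosen shape.

```python
# Back-of-the-envelope sizing for an embedding job, based on the table above.
def embedding_job_estimate(data_gb: float, throughput_mb_per_min: float, hourly_rate_usd: float):
    """Return (minutes, dollars) to vectorize data_gb at the given throughput and instance rate."""
    minutes = data_gb * 1024 / throughput_mb_per_min
    cost = minutes / 60 * hourly_rate_usd
    return minutes, cost

minutes, cost = embedding_job_estimate(
    data_gb=10,
    throughput_mb_per_min=145,   # 80 A1 cores, from the table above
    hourly_rate_usd=1.2,         # placeholder instance rate; check current OCI pricing
)
print(f"~{minutes:.0f} minutes, ~${cost:.2f}")
```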

Testimonials

As part of our ongoing commitment to innovation and unlocking the value of CPU in the field of generative AI, OCI and Ampere Computing are actively working to expand scenario support with our customers and ecosystem partners:

"At Lampi, we are deeply committed to efficiency and sustainability in AI deployments in production at scale. Our hybrid approach using CPUs over GPUs for many business processes makes our collaboration with OCI and Ampere Computing valuable. Indeed, while GPUs' performances are impressive for live inference and execution speed, we are convinced that many AI tasks can be allocated to more energy-efficient CPUs, considering that many business workflows do not require instant inference and can be efficiently managed with AI in an asynchronous manner, including RAG applications. To illustrate this, on our platform running with Ampere A1 on OCI, a reasoning AI agent was able to conduct a comprehensive market analysis, involving multiple queries through a RAG pipeline, in just five minutes and 31 seconds. Even on GPUs, agentic workflows generally lengthen the time required to get an answer, considering the number of inferences from the agent performing multi-searches. Similarly, supervised AI agents running on CPUs could be envisaged to autonomously perform redundant weekly tasks based on RAG, such as monthly portfolio analysis, performance review, transcription categorization, customer feedback analysis, and market research. As we stand on the brink of a new era in AI and computing, we think that our collaboration serves as a blueprint for the AI industry and companies to allocate computing power more efficiently, demonstrating that it's possible to achieve technological excellence without compromising on environmental integrity. " - . Guillaume Couturas, CEO of Lampi. To learn more about Lampi's offering, visit Lampi website.

"Wallaroo can natively enable deployment and management of embedding models available in the HuggingFace board, directly on Ampere, without any hardware or infrastructure manipulation. Additionally, Wallaroo's inference workload automation capabilities and integrations toolkit provide support to most vector databases today, enriching vector databases with more context and embeddings, on demand and on a schedule, as new data becomes available. As a result, data scientists and AI engineers building and deploying LLMs in production can ensure a tight feedback loop and continuously improve the quality of LLMs running in production with low operational burden. In the next month, we will publish a full RAG LLM solution that includes a live inference endpoint to power real-time chatbot use cases and context database enrichment using vector databases. We plan to take advantage of our full llama-cpp integration on Ampere and Oracle's 23ai to ensure Wallaroo's compatibility with the data and AI ecosystem in OCI and highlight the operational efficiency we can deliver with the integration." - Younes Amar, VP of product at Wallaroo, To learn more about Wallaroo offering, visit Wallaroo website.

These testimonials underscore our shared commitment with Ampere to help democratize generative AI and make this technology available to all in a cost-efficient manner.

Getting started

To get started with Ampere on OCI, existing customers can launch the custom OS image from Oracle Cloud Marketplace, with both Oracle Linux and Ubuntu supported. The image is bundled with applications, including a chat UI, to help you deploy and validate OSS LLMs like Llama 3 8B on an Ampere instance. For embedding workloads on A1, follow the end-to-end example of how you can use LangChain or LlamaIndex to vectorize your custom text-based dataset and use it to augment results from the Llama 2 7B LLM model. You can also find benchmark scripts published in the GitHub repository.
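For reference, a minimal LlamaIndex variant of the same vectorize-and-retrieve flow might look like the sketch below. It assumes llama-index 0.10+ with the Hugging Face embeddings extra installed; the "./my_docs" directory and the query string are illustrative, and the generation step against Llama 2 7B is omitted.

```python
# Hedged LlamaIndex sketch: build a vector index from local documents using a CPU embedding model.
from llama_index.core import SimpleDirectoryReader, Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Use a small CPU-friendly embedding model instead of the default remote one.
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

documents = SimpleDirectoryReader("./my_docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Retrieve the chunks that would be injected into the Llama prompt.
retriever = index.as_retriever(similarity_top_k=3)
for hit in retriever.retrieve("What does the dataset say about Q2 revenue?"):
    print(hit.node.get_content()[:200])
```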

We're offering OCI credits for up to three months of an Ampere A1 flex shape (64 cores and 360 GB of memory) to assist with validation of AI workloads on Ampere A1, with credits ending before December 31, 2024. Work with your sales rep to obtain the credits, or sign up if you're new to Oracle Cloud Infrastructure.

For more information, see the following resources: