10/25/2024 | News release | Distributed by Public on 10/25/2024 09:57
As large language models (LLMs) continue to advance in complexity and resource consumption, the demand for powerful, efficient and scalable hardware platforms has never been greater. Intel® Gaudi® 2 AI accelerator nodes have emerged as a strong contender in this space, promising high performance at scale. Recently, in collaboration with our longtime technology ally Intel, we tested the impact of running LLMs on Intel® Xeon® CPUs with the Intel® AMX AI accelerator (https://www.presidio.com/blogs/llms-on-intel-xeon-cpus-with-intel-amx/), and the data showed it to be a cost-effective alternative for specific applications like chatbots and productivity tools.
Now, Presidio and Intel (https://www.presidio.com/partners/intel/) have teamed with Denvr Dataworks to explore the capabilities and impact of utilizing the Intel Gaudi 2 AI accelerator in LLM inference and Retrieval-Augmented Generation (RAG) pipeline creation while running on Denvr Cloud. We conducted a series of benchmarks using the Mistral Large 2 model to evaluate the performance of Intel Gaudi 2 technology under different levels of load.
The results were impressive, particularly in terms of reliably high performance under load, scalability, and simplified setup. In addition, testing showed several advantages from Denvr Cloud. Running large-scale LLM inferencing and RAG pipelines on the Intel Gaudi 2 AI accelerator addresses the need for cost-effective scaling, improving both throughput and latency for AI (https://www.presidio.com/solutions/ai/) workloads. Intel Gaudi 2 technology provides optimized energy efficiency and supports high-concurrency inferencing, which is crucial for handling large models and real-time responses. It integrates seamlessly with popular frameworks, making deployment of generative AI (GenAI, https://www.presidio.com/blogs/generative-ai-versus-traditional-ai-in-business-applications/) applications easier and more efficient.
We'd like to share insights into why we chose to test Intel Gaudi 2 technology on Denvr Cloud, break down our key findings and explain why Intel Gaudi 2 AI accelerators offer a compelling option for customers working with LLMs and RAG on Denvr Cloud. Lastly, we'll provide detailed insight into how the test was conducted and the specific results that were achieved.
Why Test Intel Gaudi 2 AI Accelerator and Denvr Cloud?
Second-generation Intel Gaudi AI accelerators are co-designed by Intel and Habana Labs, an Intel company, specifically for deep learning workloads and AI inference, aiming to deliver optimal performance and scalability for AI tasks. Each accelerator features Habana Processing Units (HPUs) with RoCE networking for low-latency communication and high-bandwidth memory, making it a solid option for large-scale model deployments. Intel Gaudi 2 technology is supported by the Intel® SynapseAI software stack, which integrates seamlessly with popular frameworks like PyTorch, streamlining development and deployment for AI teams.
The primary appeal of using Intel Gaudi 2 AI accelerator nodes lies in their ability to handle large-scale model deployments with ease, making them an attractive option for enterprises and research institutions alike. This made Intel Gaudi 2 technology an ideal candidate for testing with one of the most demanding models: Mistral Large 2.
Denvr Dataworks offers a variety of cloud services designed to simplify and accelerate AI development, providing scalable access to advanced AI accelerators like the Intel Gaudi 2 product. Their flexible consumption model enables AI and machine learning (ML) workloads without the need for significant upfront capital investment.
Denvr Cloud's advanced AI infrastructure - with Intel Gaudi 2 technology, supercomputer-level networking, and flexible cloud services optimized for AI workloads - includes on-demand and reserved compute. It supports key features like AI training, inference, and RAG-as-a-Service, alongside automation for deploying AI frameworks and foundational models. Denvr Cloud also provides tools for AI research and data processing with no access fees for data transfer, making it an attractive option for AI development and deployment.
Benchmarking: Mistral Large 2
We evaluated Mistral Large 2 for text generation on Intel Gaudi 2 nodes under varying batch sizes, loads and levels of concurrency. Our objective was to measure Intel Gaudi 2 technology performance, focusing on key metrics like throughput (tokens/second), time to first token (TTFT), and memory usage.
The benchmarks covered a range of concurrent requests - 500, 1,000, 4,000, and 10,000 - to assess how well the Intel Gaudi 2 nodes handled increasing load.
The maximum input token size varied with each prompt, ranging from 1,024 to 4,096 tokens; the average output size was 1,024 tokens. We expect our customers to be able to reproduce similar results with a comparable volume of knowledge documents, chunk sizes, batch sizes, text prompts, and concurrency levels.
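The headline metrics we report below (aggregate throughput in tokens/second and time to first token) can be derived from per-request traces. The following is a minimal sketch of that aggregation, not our actual benchmark harness; the class and field names are illustrative assumptions:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RequestTrace:
    start: float         # time the request was issued (seconds)
    first_token: float   # time the first output token arrived (seconds)
    end: float           # time the last output token arrived (seconds)
    output_tokens: int   # number of tokens generated

def summarize(traces):
    """Aggregate raw per-request traces into the benchmark's headline metrics."""
    # Wall-clock window spanning the whole run, from the earliest
    # request start to the latest completion.
    wall_clock = max(t.end for t in traces) - min(t.start for t in traces)
    total_tokens = sum(t.output_tokens for t in traces)
    return {
        "throughput_tok_per_s": total_tokens / wall_clock,
        "avg_ttft_ms": mean((t.first_token - t.start) * 1000 for t in traces),
    }
```

Per-card figures follow by dividing aggregate throughput by the number of accelerator cards serving the run.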
Our goal was to determine how well Intel Gaudi 2 technology optimizes an Inference-as-a-Service use case, an essential application for businesses looking to enhance their LLMs, especially for AI-driven tasks like RAG pipeline creation.
Performance Metrics and Results: Throughput, Latency and Memory Usage
Our benchmarks yielded the following results, summarized in Table 1 in the Appendix. Figure 4 below charts throughput (tokens/second), throughput per card, and time to first token at the different concurrency levels.
Results by Load Level:
Figure 4. Test results graph.
Testing Reveals Key Advantages of Intel Gaudi 2 AI Accelerator
Based on our testing, several features of Intel Gaudi 2 nodes stood out:
Conclusion
Our test results show that the Intel Gaudi 2 AI accelerator is a strong option for developers deploying large-scale AI models. Its high performance, scalability, and ease of use make it particularly compelling for LLM inference and RAG pipelines, especially for those targeting efficient deep learning workloads.
Throughout our benchmarks, Intel Gaudi 2 technology handled high concurrent requests with impressive scalability, demonstrating its ability to efficiently manage AI-driven tasks like RAG pipeline creation. Even under heavier loads, it maintained reliable performance, with reasonable latency and effective memory usage.
Presidio regularly conducts tests like this to discover how new solutions from Intel and other ecosystem allies can deliver ever-greater results to solve customer challenges. We encourage AI developers to explore the capabilities of Intel Gaudi technology firsthand. With its scalable architecture, optimized software stack, and availability on Denvr Cloud, it provides a unique platform for testing deep learning workloads and LLM inference at scale.
Contact Presidio (https://www.presidio.com/contact-us/) to learn more and start experimenting with the Intel Gaudi 2 AI accelerator on Denvr Cloud to see how it can accelerate your own LLM models and AI workloads.
Appendix: Details from the Testing and Results
Test Architecture:
The test architecture (Figure 1) begins with the Presidio Knowledge Base as the foundational data source, which feeds into the RAG testing pipeline, processing over 1,000 diverse inputs. The pipeline generates 10 unique personas to simulate various user interactions and scenarios, creating a robust framework for testing. These personas then produce more than 10,000 RAG prompts, rigorously exercising the system's capabilities.
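The persona fan-out described above amounts to crossing every knowledge-base input with every persona. A minimal sketch of that step follows; the function name and prompt template are hypothetical, not Presidio's actual pipeline code:

```python
from itertools import product

def build_rag_prompts(inputs, personas):
    # Cross every knowledge-base input with every persona
    # to fan out the full set of test prompts.
    return [
        f"As {persona}, answer using this context: {question}"
        for question, persona in product(inputs, personas)
    ]

# 1,000 inputs x 10 personas -> 10,000 RAG prompts
prompts = build_rag_prompts(
    [f"input-{i}" for i in range(1000)],
    [f"persona-{j}" for j in range(10)],
)
```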
At the core of the setup is the LLM middleware application, interfacing with powerful models like Mistral Large 2. This model is sharded across multiple nodes for inference. The middleware was designed to optimize the model performance on Intel Gaudi 2 AI accelerators on the Denvr Cloud platform, focusing on both hardware efficiency and advanced data management.
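One typical responsibility of inference middleware like this is grouping incoming prompts into accelerator-friendly batches before dispatch. The sketch below illustrates the idea under an assumed fixed batch-size limit; it is not the actual middleware API:

```python
def batch_requests(prompts, max_batch_size=32):
    """Group incoming prompts into fixed-size batches for the inference backend."""
    # Slice the queue into consecutive chunks of at most max_batch_size;
    # the final batch may be smaller.
    return [
        prompts[i:i + max_batch_size]
        for i in range(0, len(prompts), max_batch_size)
    ]
```

Larger batches raise per-card utilization and throughput at the cost of added queueing latency, which is the trade-off visible in the concurrency results below.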
Our key performance metrics included throughput (tokens/second), time to first token (TTFT), and allocated memory usage, as these directly reflect hardware performance. Model-specific metrics such as contextual recall, precision, relevancy, answer veracity, and security were secondary; the primary focus was how well Intel Gaudi 2 technology supports LLM performance.
Figure 1. Intel Gaudi 2 AI accelerator LLM and RAG test architecture.
Intel Gaudi 2 Node Cluster on Denvr Cloud
The Intel Gaudi 2 node cluster on Denvr Cloud (Figure 2) consists of four Intel Gaudi 2 nodes, each connected via a high-throughput, low-latency RoCE v2 Ethernet network, ensuring efficient AI traffic communication between nodes. These nodes are linked to high-performance storage, providing fast access to data for AI workloads. The architecture is optimized for Machine Learning (ML) and AI inference tasks, leveraging Intel Gaudi 2 technology's compute power, RoCE low-latency networking, and robust storage.
Figure 2. Intel Gaudi 2 node cluster on Denvr Cloud network configuration.
Each Intel Gaudi 2 node (Figure 3) is built on a dual-socket Intel® Xeon® Platinum 8380 processor-based server with 80 cores and 1TB RAM. The node includes eight Intel Gaudi 2 AI processor cards. Each card delivers up to 432 TFLOPS of peak performance for BF16, and 865 TFLOPS of peak performance for FP8 (Reference). Each Gaudi processor card has 96GB of HBM2e memory, providing 2.45 TB/s of bandwidth. The node is connected via 24x 100Gbps RoCE v2 Ethernet ports for high-speed networking and is deployed in a rack-mounted configuration. The Denvr Cloud infrastructure supports this setup, offering both direct-attached NVMe and RoCE networking to ensure seamless data communication between nodes and networked storage.
Figure 3. Intel Gaudi node with 8 Intel Gaudi 2 AI processor cards.
Software Components
For LLM inference on Intel Gaudi 2, several key software components work together to optimize performance.
Together, these tools create an optimized environment for running LLM inference workloads on Intel Gaudi 2 technology.
Review of Results by Load Level
Table 1. Test Results at a Glance.
Requests | Tokens/sec (Per Card) | Time to First Token (TTFT) | Throughput (Tokens/sec) | Memory Usage per Card |
500 | 350-400 | ~240ms | 3,069.06 | Allocated: 37.22 GB, Max: 45.24 GB, Available: 94.62 GB |
1,000 | 650-700 | ~400ms | 5,592.99 | Allocated: 40.07 GB, Max: 64.5 GB, Available: 94.62 GB |
4,000 | 900-1,000 | ~1,200ms | 6,692.65 | Allocated: 54.2 GB, Max: 94.56 GB, Available: 94.62 GB |
10,000 | 900-1,000 | ~3,400ms | 6,966.64 | Allocated: 82.49 GB, Max: 94.61 GB, Available: 94.62 GB |