Oracle Corporation

09/30/2024 | Press release

Empower generative AI inference performance using NVIDIA NIM on OCI

Generative AI (GenAI) inference refers to the process of using a trained generative artificial intelligence model to generate outputs or predictions based on new input data. During this inference phase, the model applies what it has learned during training to produce results, such as text, images, or other types of data. Choosing the right inference engine is crucial for processing input data and generating accurate predictions and real-time responses efficiently. Effective inference enables models to handle large volumes of requests quickly and at scale, which is essential for applications requiring real-time responses or high throughput, such as search engines or interactive AI assistants. NVIDIA offers a prebuilt inference engine packaged as a container that you can deploy in cloud environments with minimal setup, reducing operational overhead while improving performance.
How to choose the right inference engine
You must consider several factors when choosing the right inference engine for a GenAI workload, including efficiency, scalability, business value, deployment complexity, ongoing maintenance and management, and the need for timely technical support. However, enterprises primarily focus on two key aspects: total cost of ownership (TCO), measured by cost per token, and end-user experience, assessed by time to first token (TTFT) and inter-token latency (ITL). To address this complex trade-off, we ran performance benchmarks comparing a leading open source inference engine against NVIDIA NIM microservices.
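
To make the cost metric concrete, here is a minimal sketch of how cost per token can be derived from instance pricing and measured throughput; the hourly rate and token counts are made-up figures for illustration only, not OCI or NVIDIA pricing.

    # Illustrative arithmetic only: the rate and token counts below are hypothetical.
    gpu_hours = 1.0              # wall-clock hours the GPU instance ran
    hourly_rate_usd = 10.00      # assumed hourly price of the instance
    total_tokens = 2_500_000     # input + output tokens served during that hour

    cost_per_token = (gpu_hours * hourly_rate_usd) / total_tokens
    print(f"Cost per 1K tokens: ${cost_per_token * 1000:.4f}")

Lower cost per token directly lowers TCO for the same workload; TTFT and ITL are measured from streamed responses, as shown in the benchmark sketch later in this post.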
Introducing NVIDIA NIM
Part of NVIDIA AI Enterprise, NVIDIA Inference Microservices (NIM) is a set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI model inferencing across the cloud, data centers, and workstations. Supporting a wide range of AI models, including open source community and NVIDIA AI Foundation models, it enables seamless, scalable AI inferencing, on premises or in the cloud, through industry-standard APIs.
NVIDIA NIM is designed to bridge the gap between the complex world of GenAI development and the operational needs of enterprise environments. This design allows for a significant increase in the number of enterprise application developers who can contribute to AI transformations within their companies, with potential for even greater growth in the future. NVIDIA NIM includes prebuilt containers, industry-standard APIs, support for custom models, domain-specific code, and an optimized inference engine.
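
As a sketch of what those industry-standard APIs look like in practice, the example below queries a NIM container through its OpenAI-compatible chat completions endpoint. It assumes a container is already running locally on port 8000; the image name, port, and model identifier are illustrative and may differ in your deployment.

    # Minimal sketch: call a locally running NIM container through its
    # OpenAI-compatible chat completions endpoint.
    # Assumes the container was started along these lines (image name is illustrative):
    #   docker run --gpus all -p 8000:8000 nvcr.io/nim/meta/llama3-8b-instruct
    import requests

    payload = {
        "model": "meta/llama3-8b-instruct",  # model name exposed by the container
        "messages": [{"role": "user", "content": "Summarize what an inference engine does."}],
        "max_tokens": 128,
    }
    resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60)
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])

Because the endpoint follows the OpenAI API convention, existing application code written against that convention can typically be pointed at a NIM deployment by changing only the base URL and model name.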
Why NVIDIA NIM on OCI?
Oracle Cloud Infrastructure (OCI) is the first public cloud of its class built from the ground up to excel as a platform for every application. Customers are increasingly choosing OCI for their cloud workloads because of its numerous advantages. OCI provides NVIDIA GPU-based bare metal and virtual machine (VM) Compute shapes optimized for running NVIDIA NIM, delivering exceptional performance and ease of management compared to other cloud providers. Oracle also offers world-class database solutions, which many enterprises use for a wide range of applications while securely storing their data in OCI storage. With flexible network connectivity across OCI database products, customers can seamlessly use this data to train or fine-tune LLMs or connect it directly for retrieval-augmented generation (RAG) workflows.
Why NIM over an open source inference engine?
We compared NVIDIA NIM and an open source inference engine across environment setup, container deployment, LLM inference performance, and support. Based on these aspects, we concluded that NIM offers a superior developer experience compared to the open source inference engine. The following section shows a sample performance comparison between NVIDIA NIM and the open source inference engine.
Performance comparison
We evaluated NVIDIA NIM and an open source inference engine, both running on OCI Compute hosts equipped with NVIDIA H100 GPUs, and conducted performance benchmarks. For the evaluation, we used the large language models (LLMs) Llama 3 8B for small-to-medium LLM workloads and Llama 3 70B for large LLM workloads. Our comparison focused on throughput in tokens per second, inter-token latency, and time to first token across different batch sizes, numbers of input tokens, and numbers of output tokens.
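
For reference, the sketch below shows one way TTFT, ITL, and throughput can be collected from any OpenAI-compatible endpoint (NIM or an open source server) using streamed responses. The base URL and model name are placeholders, and this is not the exact harness used for the benchmarks described here; it approximates token timing by treating each streamed chunk as one token.

    # Hypothetical measurement sketch against an OpenAI-compatible streaming endpoint.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder URL

    start = time.perf_counter()
    arrivals = []  # arrival time of each streamed chunk, relative to request start
    stream = client.chat.completions.create(
        model="meta/llama3-8b-instruct",  # placeholder model name
        messages=[{"role": "user", "content": "Write a short note about GPUs."}],
        max_tokens=64,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            arrivals.append(time.perf_counter() - start)

    ttft = arrivals[0]                                              # time to first token
    itl = (arrivals[-1] - arrivals[0]) / max(len(arrivals) - 1, 1)  # average inter-token latency
    tps = len(arrivals) / arrivals[-1]                              # output tokens per second
    print(f"TTFT {ttft*1000:.0f} ms | ITL {itl*1000:.1f} ms | {tps:.1f} tokens/s")

Running many such requests concurrently at different batch sizes and input/output lengths yields the kind of comparison summarized below.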
As a result, we found that the inference performance of NVIDIA NIM is 1.5 to 3.7 times better than that of the open source inference engine across various scenarios. We also observed a significant performance improvement with NIM, more than two times, compared to the open source inference engine when using a higher-concurrency batch size, which is common in mid-to-large enterprises. This performance boost is particularly evident in text generation and translation use cases. NIM also consistently outperforms the open source inference engine on larger models with more parameters.
The following graph shows a sample performance comparison between NVIDIA NIM and the open source inference engine, plotting total tokens per second (TPS) across different input and output token sizes.
Conclusion
NVIDIA NIM provides a set of easy-to-use microservices designed to accelerate the deployment of generative AI models on OCI. NVIDIA NIM brings the power of state-of-the-art LLMs to enterprise applications, providing unmatched natural language processing (NLP) and understanding capabilities. Enterprises can run NIM containers on Oracle Cloud Infrastructure bare metal Compute instances to take advantage of NVIDIA's performance edge over open source offerings. Reach out to Oracle or NVIDIA to learn more about NVIDIA NIM on OCI.