Oracle Corporation

26/06/2024 | Press release

Deploying DBRX LLMs on Oracle Cloud Infrastructure with NVIDIA NIM and Delta Sharing

The recent Databricks Data + AI Summit included a presentation on deploying DBRX large language models (LLMs) on Oracle Cloud Infrastructure (OCI) using NVIDIA Inference Microservices (NIM) and Delta Sharing. This blog post summarizes the key points and highlights from the presentation, focusing on the architecture, benefits, and deployment process.

Let's start with the end architecture, where we aggregate data from disparate data sources into OCI for use in a classic retrieval-augmented generation (RAG) workflow. We're using the DBRX LLM deployed with NVIDIA NIM on NVIDIA L40S GPUs on OCI. We have also validated the workflow on NVIDIA H100 GPUs on OCI.

Aggregating data with RAG

This diagram outlines a workflow involving data processing, embedding, querying, and inference within OCI services. It highlights how the various components work together to process data, create embeddings, store and query vector data, and perform inference on NVIDIA GPUs within OCI. The workflow moves through the following stages:

  1. Data sources: The workflow starts by aggregating data from disparate sources that can be accessed over various network transports, such as the internet, OCI FastConnect, OCI-Azure Interconnect, and IPSec VPN. It then uses the OCI backbone with intra- or interregional VCN peering. In this workflow, data is shared using the Databricks Delta Sharing protocol, which supports batch or streaming ingestion. Delta Sharing has advantages over legacy data replication services, which we discuss later in this article.
  2. Data processing: The ingested data is processed by dense Compute instances. This step involves creating chunk embeddings from the data. We're using OCI dense Compute shapes for this function.
  3. Oracle vector database: The processed embeddings and chunks are ingested into a vector database. We use Oracle Database 23ai, which provides built-in storage and search for vector data. Queries are then answered against these stored embeddings.
  4. Embedding model: An embedding model creates chunk embeddings and query embeddings that facilitate efficient querying and retrieval. We can use the API-based Cohere Embedding Model from the OCI Generative AI service. In this case, we're using open source Meta AI Llama Embeddings deployed as an NVIDIA NIM.
  5. Inference cluster: The NVIDIA GPU-enabled Compute cluster performs inference and model serving through NVIDIA NIM. This module uses the open source Databricks DBRX LLM and serves the following functions:
  • Returning chunks based on the query
  • Creating context from chunks
  • Supplying context from chunks to the inference models

This NIM runs on clusters that you can equip with different GPUs, such as NVIDIA L40S, A100, H100, and H200, so that customers can match inference performance to their needs.

  6. Return inference results: The final stage of the workflow is the output. The inference results are returned, completing the workflow. A minimal code sketch of this query path follows.
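To make this query path concrete, the following is a minimal Python sketch of stages 3 through 6. It assumes that the embedding model and DBRX are each served as a NIM container exposing an OpenAI-compatible HTTP API, and that chunk embeddings live in a hypothetical doc_chunks table in Oracle Database 23ai; the endpoint URLs, model names, credentials, and schema are illustrative placeholders rather than the exact configuration from the presentation.

```python
import array

import oracledb   # python-oracledb driver for Oracle Database 23ai
import requests

# Hypothetical in-VCN endpoints for the two NIM containers in the inference cluster.
EMBED_NIM_URL = "http://nim-embed.internal:8000/v1/embeddings"
DBRX_NIM_URL = "http://nim-dbrx.internal:8000/v1/chat/completions"


def embed_query(question: str) -> list[float]:
    """Create a query embedding with the embedding NIM (OpenAI-compatible API)."""
    resp = requests.post(EMBED_NIM_URL, json={
        "model": "llama-embedding",   # placeholder model name
        "input": [question],
    })
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]


def retrieve_chunks(conn: oracledb.Connection, query_vec: list[float], k: int = 4) -> list[str]:
    """Return the k nearest chunks from a hypothetical doc_chunks table in Oracle 23ai."""
    sql = """
        SELECT chunk_text
        FROM doc_chunks
        ORDER BY VECTOR_DISTANCE(embedding, :qv, COSINE)
        FETCH FIRST :k ROWS ONLY
    """
    with conn.cursor() as cur:
        cur.execute(sql, qv=array.array("f", query_vec), k=k)
        return [row[0] for row in cur]


def answer(question: str) -> str:
    """Embed the question, retrieve context from the vector database, and ask DBRX."""
    conn = oracledb.connect(user="rag_app", password="<secret>", dsn="vectordb_high")
    context = "\n\n".join(retrieve_chunks(conn, embed_query(question)))
    resp = requests.post(DBRX_NIM_URL, json={
        "model": "databricks/dbrx-instruct",   # placeholder model name
        "messages": [
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
        "max_tokens": 512,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```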

Oracle AI

Let's take a detour and discuss the complete Oracle AI Stack.

Oracle's AI stack is designed to integrate seamlessly with various applications, from Fusion applications and NetSuite to third-party applications and industry-specific solutions, embedding generative and classic AI capabilities on top of Oracle's AI infrastructure.

Oracle embeds AI capabilities in the following applications:

  • Fusion applications
  • Fusion Analytics
  • NetSuite
  • Industry applications
  • Third-party applications: These applications integrate embedded generative AI and classic AI capabilities, indicating a broad range of software solutions leveraging AI technology.

The following AI services provide generative and classic AI capabilities:

  • Generative AI: Newly introduced Generative AI capabilities
  • GenAI Agents: Newly introduced agents leveraging GenAI
  • Digital Assistant: AI-driven virtual assistants
  • Speech: AI services focused on speech recognition and processing
  • Language: AI services for natural language processing
  • Vision: AI services for image and video analysis
  • Document Understanding: AI for document processing and comprehension

The following services offer machine learning (ML) and GenAI features for data platforms:

  • Oracle Database Vector Search: Newly introduced vector search capabilities in Oracle Database
  • Autonomous Database Select AI: Newly introduced AI capabilities in Oracle's autonomous database
  • MySQL HeatWave Store and GenAI: Integrates GenAI capabilities with MySQL HeatWave Store
  • Data Science: Platforms and tools for data science workflows
  • ML in Oracle Database: Machine learning capabilities embedded within Oracle Database
  • MySQL HeatWave AutoML: Automated machine learning in MySQL HeatWave
  • Data Labeling: Tools and services for labeling data for machine learning

The following services provide AI infrastructure:

  • GPU Compute: Various options, including bare metal, virtual machines (VMs), and Kubernetes clusters
  • Storage: Includes block, object, file storage, and high-performance computing (HPC) file systems
  • Superclusters: Cluster networking supporting up to 64K GPUs using the RoCEv2 protocol (RDMA over Converged Ethernet)

Applications connect to AI services and use the capabilities they provide. AI services are built on top of the ML and GenAI features of the data platforms, and the data platforms in turn rest on database-centric applications and the AI infrastructure, highlighting the foundational role of high-performance compute and storage in supporting advanced AI and ML functionality.

NVIDIA Inference Microservices

NVIDIA NIM is a modular, containerized service optimized for deploying and scaling AI inference workloads. It includes the following benefits:

  • Scalability: Easily scale services to handle varying workloads
  • Flexibility: Deploy and update services independently
  • Efficiency: Optimize resource usage and reduce latency

NIM enhances developer productivity and infrastructure efficiency, allowing enterprises to maximize their investments. For example, running Meta Llama 3-8B in NIM produces up to three times more generative AI tokens on accelerated infrastructure than without NIM.
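To illustrate the containerized deployment model, here is a minimal sketch that starts a Llama 3 8B NIM container on a GPU host using the Docker SDK for Python. The image tag, host port, and cache path are assumptions that depend on your NGC entitlement and the specific NIM you pull.

```python
import docker  # Docker SDK for Python (docker-py)

client = docker.from_env()

# Start the NIM container with GPU access; it serves an OpenAI-compatible API on port 8000.
container = client.containers.run(
    "nvcr.io/nim/meta/llama3-8b-instruct:latest",   # assumed NIM image tag
    detach=True,
    environment={"NGC_API_KEY": "<your NGC API key>"},
    ports={"8000/tcp": 8000},
    volumes={"/opt/nim-cache": {"bind": "/opt/nim/.cache", "mode": "rw"}},  # model cache on the host
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],  # expose all GPUs
)
print(container.short_id, container.status)
```

The same pattern scales out under Kubernetes, where each NIM runs as an independently deployable and scalable service.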

Oracle Cloud Infrastructure

OCI is a Generation 2 cloud that offers a differentiated approach with high innovation, flexibility, and the lowest total cost of ownership (TCO) among hyperscale cloud providers. It supports robust AI and ML innovation through a full-stack AI strategy across infrastructure, platform, and software as a service (IaaS, PaaS, and SaaS).

OCI includes the following key highlights:

  • Bare metal NVIDIA GPU training and inference IaaS
  • High bandwidth (RoCEv2): 3200 Gbps per node
  • NVMe storage: 61.4 TB per node, leading to superior performance
  • Cluster size: 1-8,000 nodes
  • Number of GPUs in a cluster: 8-64,000 NVIDIA GPUs

DBRX: A cutting-edge LLM

DBRX is a transformer-based, decoder-only LLM with a fine-grained mixture-of-experts (MoE) architecture, featuring 132 billion total parameters, 36 billion of which are active on any input. It outperforms other open MoE models, such as Mixtral and Grok-1, thanks to its larger number of smaller experts, which improves model quality.

DBRX uses advanced techniques, such as rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA). It was pretrained on 12 trillion tokens of text and code data, using the GPT-4 tokenizer for optimal performance.
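For readers who want to experiment with the model directly, the open weights are published as databricks/dbrx-instruct on Hugging Face (license acceptance required). The following is a minimal loading sketch with Hugging Face Transformers; note that the full 132-billion-parameter model needs several high-memory GPUs, so this is for exploration rather than the production serving path, which in this architecture is NVIDIA NIM.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "databricks/dbrx-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" shards the MoE layers across all visible GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer("What is a mixture-of-experts model?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```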

Delta Sharing: Secure data collaboration

Delta Sharing is an open protocol for secure and seamless data sharing across organizations and platforms, working with Delta Lake for reliability and performance. It is built around the following core concepts:

  • Providers: Entities sharing the data
  • Shares: Logical groupings of tables shared from a Delta Lake
  • Recipients: Individuals accessing the shared data

Key advantages of Delta Sharing include live data sharing, no data copying, platform agnosticism, and secure governance.
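On the recipient side, reading shared data requires only the open source delta-sharing Python client and a profile file issued by the provider. The following is a minimal sketch; the profile file name and the share, schema, and table names are placeholders.

```python
import delta_sharing

# Credentials file supplied by the data provider (Databricks in this workflow).
profile = "config.share"

# Discover the tables the provider has shared with this recipient.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table straight into pandas. No copy or replication job runs
# ahead of time; the client reads the live Delta Lake table.
table_url = f"{profile}#my_share.my_schema.documents"
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```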

Deployment process on OCI

The deployment of DBRX LLM on OCI involves the following steps, with a provisioning sketch after the list:

  1. Initial setup
    • Provisioning high-performance Compute instances
    • Configuring scalable storage solutions
    • Setting up secure networking
  2. Integration with NVIDIA NIM
    • Containerization using Docker and Kubernetes
    • Deploying and scaling microservices for various inference tasks
  3. Technical details
    • Scalability through auto-scaling and load balancing
    • Efficiency and performance optimization

Key benefits

Deploying DBRX LLM on OCI with NVIDIA NIM offers the following benefits:

  • Scalability: Seamless scalability to handle varying workloads
  • Efficiency: High-performance, real-time inference services
  • Security: Robust platform with comprehensive security features

The OCI North America Cloud Engineering AI Solutions team has automated the entire deployment using APIs and Terraform.

Conclusion

This presentation underscored the synergy between OCI and NVIDIA's AI capabilities, highlighting how this powerful combination facilitates efficient handling of large-scale data and advanced AI workloads. The deployment process on OCI ensures scalability, efficiency, and security, making it an ideal choice for enterprises looking to utilize cutting-edge AI solutions.

For more information and to try out NVIDIA NIM on Oracle Cloud Infrastructure, visit NVIDIA AI and Oracle Cloud Infrastructure.