Huawei Technologies Co. Ltd.

Finding the Right AI-Ready Data Infrastructure for Cloud and Internet

We've already talked about finding the right AI-ready data infrastructure for intelligent computing centers. Now, let's check out what infrastructure fits cloud and Internet scenarios.

Cloud and Internet: Is your infrastructure holding you back?

In today's digital age, the importance of AI-ready data infrastructure for businesses cannot be overstated. For cloud and Internet companies, effective data infrastructure lays the foundation for smooth, efficient, and scalable AI applications. These companies are at the forefront of AI research and applications thanks to their vast resources and technological prowess, and they often deploy large-scale AI clusters with tens of thousands of GPUs to meet massive compute demands and train large AI models like large language models (LLMs). This poses unprecedented challenges to their data infrastructure.

The three major challenges that hinder AI workloads

AI workloads present significant challenges for cloud and Internet companies, especially when it comes to scaling up operations to meet the demands of ultra-large AI clusters. These challenges can directly impact the efficiency and cost-effectiveness of AI initiatives.

1. Inefficient training

One of the primary hurdles you might face is inefficient training within ultra-large AI clusters. For optimal LLM training, datasets need to be evenly distributed across GPUs to ensure high-bandwidth communication. Your data infrastructure needs global load balancing, end-to-end NVMe connections, and dynamic adaptable data layout (DADL) capabilities to achieve ultra-high throughput and quickly load checkpoint data.
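
To make "evenly distributed" concrete, here is a minimal sketch (hypothetical code, not Huawei's implementation) that round-robins sample indices across GPU ranks so that no shard is more than one sample larger than another, keeping per-GPU I/O balanced:

```python
# Illustrative sketch: round-robin sharding of a dataset across GPU ranks,
# so no single rank becomes an I/O straggler during training.
from typing import List

def shard_indices(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Assign sample indices to one rank; shard sizes differ by at most 1."""
    return list(range(rank, num_samples, world_size))

# Example: 10 samples across 4 GPUs -> shards of sizes 3, 3, 2, 2.
for r in range(4):
    print(f"rank {r}: {shard_indices(10, 4, r)}")
```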

2. Shaky operations

Ensuring stable operations during the training of an ultra-large AI cluster is another critical challenge. These clusters comprise a huge number of servers and GPUs, and the failure of even a single component may slow algorithm convergence and, in turn, delay product time to market.
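
One common way to contain the blast radius of such failures is periodic checkpointing, so a crash only costs the work done since the last save. Below is an illustrative, framework-agnostic sketch; the checkpoint path, interval, and training loop are hypothetical stand-ins:

```python
# Minimal sketch of periodic checkpointing (illustrative, framework-agnostic).
# A failed run resumes from the last saved step instead of restarting from zero.
import json
import os

CKPT = "checkpoint.json"   # hypothetical checkpoint path
SAVE_EVERY = 100           # checkpoint interval, in training steps

def train(total_steps: int) -> None:
    # Resume from the last checkpoint if one exists.
    step = 0
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            step = json.load(f)["step"]
    while step < total_steps:
        step += 1          # stand-in for one real optimizer step
        if step % SAVE_EVERY == 0:
            # Write to a temp file, then rename: os.replace is atomic,
            # so a crash mid-save never leaves a torn checkpoint behind.
            tmp = CKPT + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"step": step}, f)
            os.replace(tmp, CKPT)

train(1000)
```

The faster the storage can absorb these checkpoint writes, the shorter the save interval can be, and the less work a failure costs.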

3. Expensive scaling

Scaling AI operations to accommodate growing demands is often costly. The need for high bandwidth and high-performance computing resources can demand significant financial investment. Moreover, the operating costs associated with cabinet space, energy consumption, and system maintenance add to the financial strain, making scaling a challenging endeavor for many companies.

Essential features that make AI workloads a breeze

Tackling these challenges and future-proofing your AI initiatives requires AI-ready data infrastructure with a few key features.

1. High performance: Dozens to hundreds of GB/s read/write bandwidth delivered by each device

Truly AI-ready infrastructure must deliver high read and write bandwidth to improve the computing power utilization of clusters. This can mean leveraging technologies like global load balancing to evenly distribute read/write requests across controllers and disks, as well as end-to-end (E2E) NVMe connections. Compared with SCSI, NVMe reduces host network stack overheads by 40%, offers a direct and shorter path between CPUs and SSDs, and requires only two interactions instead of four. These technologies ensure that data flows smoothly and efficiently, reducing bottlenecks and enhancing the overall speed of data processing and AI model training.
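
As a rough illustration of the load-balancing idea, the sketch below dispatches each I/O request to the controller with the least outstanding bytes. The Balancer class and its bookkeeping are hypothetical simplifications, not an actual storage API:

```python
# Hypothetical sketch of global load balancing: route each I/O request to
# the controller with the least outstanding work, rather than pinning
# volumes to a fixed controller. Simplified: completed requests are never
# credited back, which a real system would of course do.
import heapq

class Balancer:
    def __init__(self, num_controllers: int) -> None:
        # Min-heap of (outstanding_bytes, controller_id) pairs.
        self.heap = [(0, c) for c in range(num_controllers)]
        heapq.heapify(self.heap)

    def dispatch(self, request_bytes: int) -> int:
        """Send the request to the least-loaded controller; return its ID."""
        load, ctrl = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (load + request_bytes, ctrl))
        return ctrl

lb = Balancer(4)
for size in [8192, 1048576, 4096, 65536, 4096]:
    print(f"{size:>8} B -> controller {lb.dispatch(size)}")
```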

2. Rock-solid reliability: 99.99% single-node reliability and 99.999% cluster reliability

Reliability is non-negotiable. Single-node reliability of 99.99% and cluster reliability of 99.999% ensure that your AI systems are always operational, minimizing downtime and maintaining continuous data availability, both of which are critical for real-time AI applications and decision-making processes.
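
For a sense of scale, those availability figures translate into yearly downtime budgets as follows (simple arithmetic, not a vendor measurement):

```python
# Convert availability percentages into allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for name, availability in [("99.99%  (single node)", 0.9999),
                           ("99.999% (cluster)", 0.99999)]:
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{name}: ~{downtime:.1f} minutes of downtime per year")
# -> roughly 52.6 minutes vs. 5.3 minutes per year.
```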

3. Wide compatibility and ultra-low TCO

Wide compatibility is essential if you want your data infrastructure to integrate with a variety of AI tools and platforms, and it significantly reduces the total cost of ownership (TCO). An AI-ready data infrastructure that is compatible with mainstream AI computing platforms (like CUDA and MindSpore) and parallel file systems (like Lustre, GPFS, and BeeGFS) not only simplifies operations but also ensures that you can leverage the best tools available without being hindered by compatibility issues.

[Figure: AI cluster reference architecture for cloud and Internet scenarios]

Conclusion

Huawei is an industry leader with over 20 years of extensive investment in data infrastructure. It offers a broad range of products, solutions, and case studies to help you create a reliable, high-performance, and cost-effective data foundation for your AI applications. Learn about our award-winning OceanStor Data Storage and how to unleash the full potential of your data.

Disclaimer: Any views and/or opinions expressed in this post by individual authors or contributors are their personal views and/or opinions and do not necessarily reflect the views and/or opinions of Huawei Technologies.
