Imagination Technologies Group Ltd.


Why Low-Level Libraries are the Key to AI Success

Let's start with… the end.

The old saying, "you get out what you put in", is possibly the easiest way to summarise the sentiment of the next few paragraphs as we introduce Imagination's new OpenCL compute libraries. If you have no time to read further, just take away this message: we've been able to squeeze a lot more compute and AI performance out of the GPU because we have put a lot of work into the careful design of these new software libraries, so that our customers don't have to.

For some customers this is everything they need from an out-of-the-box experience to get the job done. For other customers, particularly those developing their own custom libraries and kernels, Imagination's compute libraries, along with the supporting collateral and tools, are the perfect starting point for reaching their development and performance goals.

The end.

And for those who felt that ended too soon…

Imagination has been building GPUs with OpenCL support for compute applications for many years. We work with numerous businesses who have their own NPUs but need a GPGPU (general-purpose GPU) to give them a level of programmability that NPUs typically don't provide. We're also seeing a general realisation in the market that flexibility is essential, especially on the "functional to performant to optimal" developer journey for one's own compute algorithms. Our previous blog, "The Myth of Custom Accelerators", discusses the benefits of general-purpose acceleration over domain-specific acceleration and emphasises that developer enablement comes down to providing the right software for the job.

What is the right software?

Math and NN libraries are broadly understood to be essential building blocks for the efficient execution of AI applications and other compute-intensive workloads on a programmable platform. In-cabin driver monitoring applications, lidar, radar, vision pre/post-processing algorithms, and even the key processing elements of the transformer blocks in foundation models such as LLMs all rely upon underlying optimised libraries.

The demand for these essential building blocks has produced a plethora of open-source projects (clBLAS, vkFFT, xnnpack to name a few) that any developer can now quickly access and use within their applications to get them functional.
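To make that "functional" starting point concrete, below is a minimal host-side sketch using clBLAS, one of the open-source projects named above, to run a single-precision matrix multiply (SGEMM) on the GPU. It assumes an OpenCL context and command queue already exist, and error checking is omitted for brevity.

    /* Minimal sketch: getting "functional" with clBLAS.
       Computes C = A*B for row-major matrices; error checks omitted. */
    #include <CL/cl.h>
    #include <clBLAS.h>

    void sgemm_functional(cl_context ctx, cl_command_queue queue,
                          const float *A, const float *B, float *C,
                          size_t M, size_t N, size_t K)
    {
        cl_int err;

        /* Stage the matrices in device buffers. */
        cl_mem dA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   M * K * sizeof(float), (void *)A, &err);
        cl_mem dB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   K * N * sizeof(float), (void *)B, &err);
        cl_mem dC = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                   M * N * sizeof(float), (void *)C, &err);

        clblasSetup();

        /* One call and the workload is functional. */
        clblasSgemm(clblasRowMajor, clblasNoTrans, clblasNoTrans,
                    M, N, K,
                    1.0f, dA, 0, K,
                          dB, 0, N,
                    0.0f, dC, 0, N,
                    1, &queue, 0, NULL, NULL);

        /* Read the result back and release resources. */
        clEnqueueReadBuffer(queue, dC, CL_TRUE, 0, M * N * sizeof(float), C,
                            0, NULL, NULL);
        clblasTeardown();
        clReleaseMemObject(dA);
        clReleaseMemObject(dB);
        clReleaseMemObject(dC);
    }

A few lines of glue and the job runs. But the kernel behind that one library call may know little about the particular GPU it lands on, and that is where the next problem begins.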

However, the initial enthusiasm for getting the job done is often followed by disappointment as the developer realises that the performance of open-source libraries can be far below expectations given the available hardware TFLOPS/TOPS. Disappointment soon turns into lingering frustration as the user faces the reality that deep knowledge of the hardware micro-architecture and developer tools is required to do much about it.
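To put numbers on that disappointment, a back-of-envelope check is useful: a matrix multiply performs roughly 2 x M x N x K floating-point operations, so dividing by the measured wall time gives the achieved TFLOPS to set against the datasheet figure. The sketch below is generic arithmetic, not a measurement of any particular GPU, and peak_tflops is a hypothetical input for illustration.

    /* Back-of-envelope efficiency check for a timed M x N x K GEMM.
       peak_tflops is a hypothetical datasheet number for illustration. */
    #include <stdio.h>

    void report_efficiency(size_t M, size_t N, size_t K,
                           double seconds, double peak_tflops)
    {
        double ops = 2.0 * (double)M * (double)N * (double)K;
        double achieved = ops / seconds / 1e12;   /* TFLOPS */
        printf("achieved %.2f TFLOPS = %.1f%% of %.2f TFLOPS peak\n",
               achieved, 100.0 * achieved / peak_tflops, peak_tflops);
    }

When the percentage that comes back is far below what the silicon promises, the cause usually lies in how the kernels map onto the micro-architecture, which is exactly the frustration described above.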

Is this a new problem?

I spent several years in the early part of my career in the weeds of heavily optimised DSP code for audio and video algorithms, and you'll hear similar stories from anyone who has served time in edge computing: tales of long nights consuming pizza whilst battling algorithms, compilers and hardware debuggers to get the required performance. And despite all the technological progress that has since been made with new parallel programming languages and smart compilation techniques, when it comes down to it, not much has changed. There remains a perennial demand for maximum performance which can still only be met with hand-engineered, optimised algorithms and low-level libraries and kernels.

Without such performance libraries, the more recently popularised term "accelerated computing" falls short of its promise of accelerating computing tasks to the full potential of the underlying hardware. In other words, without investing in the software, customers never really unlock the potential of the hardware.

Is this a new problem? Clearly not! The challenge of getting optimal performance out of any system has always been hard and remains so. Solving the problem requires a broad range of expertise:

  1. A sound understanding of the algorithms and the algorithmic implementation choices (problems are often multi-dimensional, and there are many decompositions to choose from; see the sketch after this list).
  2. A deep knowledge of the hardware micro-architecture and the options available to exploit it for the algorithmic choices above.
  3. A knowledge of the flexibility and capabilities of the coding language and the associated compiler "smarts".
  4. Application of the above knowledge over a sustained period to develop good coverage for the many possibilities that users may need.
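As an illustration of points 1 and 2, take the most studied kernel of all: matrix multiply. A naive decomposition re-reads every input element from global memory many times; tiling the work through on-chip local memory is one alternative decomposition that trades barriers and local storage for global bandwidth. The OpenCL C sketch below is generic, with an illustrative tile size; the right tile shape, vector width and unrolling are exactly the micro-architecture-dependent choices a tuned library makes for you. It assumes M, N and K are multiples of TILE.

    /* Generic tiled SGEMM in OpenCL C: each work-group stages TILE x TILE
       blocks of A and B into local memory so every global load is reused
       TILE times. Launch with global size (N, M), local size (TILE, TILE). */
    #define TILE 16

    __kernel void sgemm_tiled(const int M, const int N, const int K,
                              __global const float *A,   /* M x K, row-major */
                              __global const float *B,   /* K x N, row-major */
                              __global float *C)         /* M x N, row-major */
    {
        __local float Asub[TILE][TILE];
        __local float Bsub[TILE][TILE];

        const int row = get_global_id(1);   /* row of C */
        const int col = get_global_id(0);   /* column of C */
        const int lr  = get_local_id(1);
        const int lc  = get_local_id(0);

        float acc = 0.0f;
        for (int t = 0; t < K; t += TILE) {
            /* Cooperatively load one tile of A and one tile of B. */
            Asub[lr][lc] = A[row * K + (t + lc)];
            Bsub[lr][lc] = B[(t + lr) * N + col];
            barrier(CLK_LOCAL_MEM_FENCE);

            for (int k = 0; k < TILE; ++k)
                acc += Asub[lr][k] * Bsub[k][lc];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C[row * N + col] = acc;
    }

Even this classic transformation is only the first of many candidate decompositions, and which one wins depends on the hardware underneath.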

It is a rare programmer who can bring all these ingredients together, and then sprinkle on the bit of magic dust that only the most experienced coding-pixies have acquired, to achieve very good or even optimal performance from the hardware.

So, to make the promise of accelerated computing achievable for everyone, Imagination has applied its expertise to the problem. Our engineers are, after all, best placed to create optimised libraries for our own hardware.

What is Imagination's solution?

The flexible micro-architecture of Imagination GPUs offers a great many opportunities for the smart mapping and parallelisation of workloads to maximise the utilisation of the compute engines and the memory hierarchy bandwidths (internal and external to the GPU).
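A small, portable taste of that mapping problem: rather than hard-coding a launch shape, standard OpenCL lets the host ask the runtime how a compiled kernel prefers to be launched on the device at hand. The sketch below uses only standard OpenCL queries; a tuned library goes much further, but this is the kind of question it is answering.

    /* Query launch hints for a compiled kernel on a given device,
       using standard OpenCL work-group info queries. */
    #include <CL/cl.h>
    #include <stdio.h>

    void print_launch_hints(cl_kernel kernel, cl_device_id device)
    {
        size_t preferred_multiple = 0, max_wg_size = 0;

        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(preferred_multiple),
                                 &preferred_multiple, NULL);
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(max_wg_size), &max_wg_size, NULL);

        printf("launch in multiples of %zu work-items, up to %zu per group\n",
               preferred_multiple, max_wg_size);
    }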

Getting this right for optimal performance has entailed a cross-functional team of experts at Imagination addressing all the above points as well as applying our deep understanding of dynamic elements at play in the runtime system, such as the runtime OpenCL compiler and dynamic scheduling of the hardware. The learnings from this activity have then fed back into improving hardware and compiler design, and the virtuous cycle will continue through our compute and AI roadmap.
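For readers less familiar with one of those dynamic elements, the runtime OpenCL compiler: OpenCL kernels are normally compiled on the target system at run time, which makes build options one more tuning knob. The sketch below uses only standard OpenCL calls; -cl-fast-relaxed-math is a standard build option whose benefit (or harm) is workload- and device-specific.

    /* Build kernel source with the runtime OpenCL compiler.
       Build options are a per-device tuning knob. */
    #include <CL/cl.h>

    cl_program build_kernels(cl_context ctx, cl_device_id dev, const char *src)
    {
        cl_int err;
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);

        /* Compiled now, on the device actually in the system. */
        err = clBuildProgram(prog, 1, &dev, "-cl-fast-relaxed-math", NULL, NULL);

        if (err != CL_SUCCESS) {
            /* On failure, fetch the runtime compiler's log. */
            char log[4096];
            clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG,
                                  sizeof(log), log, NULL);
        }
        return prog;
    }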

At the time of launch, Imagination's OpenCL compute libraries are often achieving a 3x to 4x performance uplift over what customers tell us they get with open-source solutions.

In the newly launched Imagination DXS GPU, these libraries couple with compute-focused hardware improvements, such as an extra SPU (Scalable Processing Unit) and an additional FP16 pipeline, to generate a 10x or more performance uplift for many compute workloads over what was possible with our previous generation of automotive GPU.
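For context on that FP16 pipeline: OpenCL C exposes half-precision arithmetic through the standard cl_khr_fp16 extension, as in the generic sketch below. Whether a given kernel actually routes through the DXS's additional FP16 hardware is a device and compiler decision; this shows only the portable language mechanics.

    /* Generic half-precision kernel via the standard cl_khr_fp16 extension:
       y = alpha * x + y on 16-bit floats. */
    #pragma OPENCL EXTENSION cl_khr_fp16 : enable

    __kernel void haxpy(const half alpha,
                        __global const half *x,
                        __global half *y)
    {
        size_t i = get_global_id(0);
        /* Half the storage per value versus float; precision to match. */
        y[i] = alpha * x[i] + y[i];
    }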

What else?

Our goal with these foundational libraries has been to focus on what is core to us as an IP company: getting the best out of our silicon and enabling users to do the same. But beyond this, what are we doing?

There are two key areas we are now focused on:

  1. Providing reference compute and AI toolkits that allow customers to leverage the compute libraries in real-world use cases.
  2. Building out our ecosystem of domain-expert partners who can help our customers accelerate their go-to-market goals with solutions and services.

Our recent announcements with MulticoreWare and PerfXLab are just two examples of innovative partners showcasing AI solutions built on top of our compute libraries and leveraging our reference toolkits.

"PerfXLab Technologies developsheterogeneous computing software stacksandinfrastructuresolutions for companies looking to accelerate AI. We are using Imagination's computesoftware solutions to run various AI applications, including ourLLM inference engine,PerfXLM, on Imagination GPUs and have so far achieved performance gains of up to 100% compared to the CPU, with minimal time spent porting."

Zhang Xianyi, Chief Executive Officer at PerfXLab

How do I get my hands on Imagination's Compute Software?

Imagination's compute software solutions empower all stakeholders in the developer journey with a fit-for-purpose "functional to performant to optimal" workflow. They launched alongside the Imagination DXS GPU and are available to customers of this and future GPU IP from Imagination. If you're interested in finding out more about Imagination's roadmap for AI at the edge, contact the Imagination team.