Arm Unveils Lumex Compute Subsystem For Powerful, Efficient On-Device AI
Arm’s new Lumex compute subsystem brings a big boost in on-device AI, graphics, and general compute performance to flagship smartphones
As the smartphone market has matured, the workloads that consumers expect from their tiny in-pocket mobile computers have increased drastically. Fortunately, chip designers continue to build faster processors that perform well across varied workloads without completely tanking battery life. Tonight, Arm introduced its Lumex Compute Subsystem (CSS) platform, which drives big improvements not only in general CPU workloads, but in on-device artificial intelligence and gaming tasks, too.

Arm Lumex CPU Enhancements
The big news is that every part of the Lumex CSS has been purpose-built to make on-device AI better. The Lumex CPU cores implement Scalable Matrix Extension v2 (SME2) instructions, which are built for the matrix operations that modern AI models need. While we believe that AI-specific neural coprocessors will remain a vital part of any mobile SoC, adding these instructions essentially turns the CPU cluster into an AI coprocessor in its own right. Arm says these new accelerated instructions will let its licensees bring AI devices to market faster, with performance more akin to that of their desktop brethren.
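To give a sense of what SME2 accelerates, here’s a minimal scalar sketch of an outer-product-accumulate, the tile-level building block (performed by SME-family instructions such as FMOPA) behind the matrix multiplies at the heart of modern AI models. The tile size and function name here are illustrative, not Arm’s API:

```cpp
#include <array>
#include <cstddef>

// Illustrative tile size; real SME tile dimensions depend on the hardware's
// vector length, and the hardware handles a whole tile per instruction.
constexpr std::size_t kTile = 4;

// ZA += x * y^T: one outer product accumulated into a tile. Large matrix
// multiplies decompose into many of these, which is why a single wide
// instruction for this pattern pays off for AI inference.
void outer_product_accumulate(std::array<std::array<float, kTile>, kTile>& za,
                              const std::array<float, kTile>& x,
                              const std::array<float, kTile>& y) {
    for (std::size_t i = 0; i < kTile; ++i) {
        for (std::size_t j = 0; j < kTile; ++j) {
            za[i][j] += x[i] * y[j];
        }
    }
}
```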
Arm says Lumex CSS-based CPU architectures should be available up and down a customer’s product stack, from flagships down to low-power, efficiency-focused devices. The various designs can power anything from a PC to wearables with the smallest form factors. To handle all of that, Lumex CSS designs include four different core types.
At the top is C1-Ultra, the highest-performing core design, with a 25% single-thread performance increase over the previous generation. These are what you might think of as “prime” or “performance” cores, with the highest clock rates, the best performance, and the highest power usage. They’re suitable for large model inference, AI-fueled photography features, and generative AI content.
Below that is C1-Premium, which Arm says packs near-C1-Ultra performance into a 35% smaller area. Most likely that space savings will come at the cost of some peak clock speed, and therefore somewhat lower performance. These cores will be the primary CPU cores in sub-flagship mobile devices, as well as multitasking cores for things like voice assistants and background tasks on flagships.
The efficiency core design is C1-Pro. These cores offer a 16% increase in sustained performance over previous-generation designs, meaning they won’t back down from peak speeds as quickly when boosted. They’re the cores that Arm says device makers will want to offload video playback and streaming inference tasks to, and they’ll likely be found in just about any Lumex-powered design.
Lastly, C1-Nano is the most power-efficient design. These cores reduce power consumption by upwards of 26% and use less area than C1-Pro. More often than not, C1-Nano cores will be found in wearables like watches, smart rings, and so on.
KleidiAI Makes AI More Mobile-Friendly
To go along with the new SME2 instructions discussed above, Arm also announced KleidiAI integration for all major AI frameworks. Arm says that apps built on PyTorch’s ExecuTorch, Google’s LiteRT, Alibaba’s MNN, and Microsoft’s ONNX Runtime will all see increased performance without any code changes.
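To illustrate the “no code changes” claim, here’s a hedged sketch of an ordinary inference path using the TensorFlow Lite C++ API (LiteRT’s former branding). Nothing in it references SME2; kernel selection happens inside the runtime, which is how existing apps pick up the speedup. The model filename is a placeholder:

```cpp
#include <memory>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
    // "model.tflite" is a placeholder path, not an Arm-provided model.
    auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
    if (!model) return 1;

    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);
    if (!interpreter || interpreter->AllocateTensors() != kTfLiteOk) return 1;

    float* input = interpreter->typed_input_tensor<float>(0);
    input[0] = 1.0f;  // fill with real input data in practice

    // The runtime dispatches to the fastest kernels it finds for this CPU,
    // e.g. KleidiAI micro-kernels on SME2-capable cores; the app code is
    // identical either way.
    if (interpreter->Invoke() != kTfLiteOk) return 1;

    float* output = interpreter->typed_output_tensor<float>(0);
    (void)output;  // consume the results here
    return 0;
}
```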
Lumex brings new portability to cross-platform apps. For example, key Google apps like Gmail, YouTube, and Google Photos are already able to take advantage of SME2 performance improvements. Because SME2 will exist across all Lumex-based Arm platforms, apps built on any of the frameworks mentioned above will see the same gains on Windows on Arm and other platforms. Alipay has also shown off on-device LLMs running with SME2. Arm says thousands of AI-enabled Android applications won’t need even a single code change to use SME2.
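That said, native code that wants to pick a code path (or simply log capabilities) can check for the extension at runtime. Here’s a minimal sketch assuming a Linux or Android AArch64 build with headers new enough to define the SME hwcap bits:

```cpp
#include <cstdio>
#include <sys/auxv.h>
#if defined(__aarch64__)
#include <asm/hwcap.h>  // defines HWCAP2_SME / HWCAP2_SME2 on newer kernels
#endif

int main() {
#if defined(__aarch64__) && defined(HWCAP2_SME2)
    // The kernel reports CPU features through the auxiliary vector.
    unsigned long hwcap2 = getauxval(AT_HWCAP2);
    int has_sme  = (hwcap2 & HWCAP2_SME) != 0;   // base Scalable Matrix Extension
    int has_sme2 = (hwcap2 & HWCAP2_SME2) != 0;  // SME2, as implemented by Lumex C1 cores
    std::printf("SME: %d, SME2: %d\n", has_sme, has_sme2);
#else
    std::printf("Not an AArch64 target, or hwcap headers predate SME2.\n");
#endif
    return 0;
}
```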
Overall, Arm says that a cluster of C1-family CPUs increases performance by up to 5x in AI tasks. Meanwhile, “efficient AI” experiences running on efficiency cores that are gentler on the battery see approximately a 3x uplift over the last generation. Both of those figures are based on a pair of C1-Ultra performance cores flanked by six C1-Pro efficiency cores.
Arm says that KleidiAI and SME2 will increase performance on existing AI platforms. Samsung, MediaTek, and Apple are all called out for improving the responsiveness and efficiency of on-device AI apps. Automated translations and summaries will apparently benefit from the technology, too.

Lumex Brings Gaming Performance to Mali GPUs
Arm says that its new Mali G1-Ultra GPU will enable console-class graphics on smartphones. The new Ray Tracing Unit v2 (RTUv2) doubles performance in advanced lighting, shadows, and reflections compared to the last generation of Arm GPUs, the Immortalis-G925. Real-time ray-traced graphics haven’t taken over mobile gaming the way they have on desktop and console platforms, but it seems to be only a matter of time before mobile gamers demand them.
Beyond ray tracing, Arm also says that Mali G1-Ultra will deliver a 20% increase in graphics benchmarks compared to the last generation, specifically calling out Arena Breakout, Fortnite, Genshin Impact, and Honkai: Star Rail. These titles will also see increased performance and power efficiency on sub-flagship and efficiency-focused devices equipped with Mali G1-Premium and G1-Pro GPUs.
Arm says that graphics benchmarks with the new 14-core G1-Ultra GPU will show a 20% performance increase while simultaneously using 9% less energy per frame compared to the Immortalis-G925. AI inference on the GPU also increases to the tune of 20%. All of that is in addition to the doubled ray tracing performance.

Arm Lumex Enhances AI Across the Platform
Arm is making big claims with its Lumex announcement. The company says that AI performance will increase by up to 5x across Lumex devices compared with the last generation, including 4.7x lower latency for speech-based workloads and 2.8x faster audio generation. Those are some pretty big numbers.
Whether they hold up remains to be seen, however, as devices leveraging the new designs have yet to be announced. And in the US, most Android devices use chips designed by Qualcomm, which has been going toe-to-toe with Arm in court. As devices start to come to market, you can count on HotHardware to be right there, ready and able to report on the latest, so stay tuned.