What’s Really Holding Back AI System Scaling?

A breakthrough in data movement enables 100x the node density and scale of present-day systems, delivering the high-density switching that next-generation AI demands.

Here at Baya Systems, we’re in the business of data movement. Compute is great, and NVIDIA, Google, Amazon and others are doing truly remarkable work on that front, but what we really love is digging into the communication side of things. What’s great about AI systems, at least from our perspective, is that they have an almost insatiable appetite for scaling, with nearly unreasonable demands on data-movement KPIs to match. Sure, your latest-generation GPU is multiple times faster than the previous one, but how many of them can you stitch together to behave like one giant GPU for the latest AI model with hundreds of billions of parameters? Baya Systems can improve AI system scalability by solving exactly these data movement problems.

NVIDIA’s impressive run
Let’s first take a moment to really appreciate what NVIDIA has achieved. They saw the importance of data-parallel compute on GPUs, invested in CUDA, and survived long enough as a company for AI’s big moment to finally arrive and pay off decades of patience. They still operate on that long-term vision, looking at system-level performance and innovating across the software/hardware stack to deliver gains that are just incredible. NVIDIA Chief Scientist Bill Dally gave a great talk at Hot Chips in 2023 [1] breaking down the 1000x improvement in single-GPU performance achieved over the preceding ten years, drawing from a range of factors such as numerical representation (FP32 to Int8), specialized matrix instructions (DP4A, HMMA, IMMA), and process scaling (28nm to 5nm).

NVIDIA’s AI scaling success has relied heavily on a willingness to tackle AI system bottlenecks wherever they appear. NVLink was first developed because PCIe was neither specialized nor performant enough for GPU-to-GPU scaling. As the demand for GPU scaling grew, NVSwitch was introduced to build a data movement fabric with higher bandwidth and lower latency for inter-GPU traffic. Switch fabrics are nothing new, of course; networking companies have been building switches and connecting them in various topologies for decades. What’s interesting about NVSwitch, and about GPU scaling for AI systems in general, is that flat, low-latency data movement is a requirement for all-to-all communication patterns, something traditional networking systems do not demand nearly as strongly.
A bit of networking history
Let’s dig into this concept of flat, low latency and see how it has played out in the history of networking systems. For regular internet traffic, Cisco and other network vendors have long built crossbar-based switches that provide great all-to-all connectivity with a non-blocking architecture. An important characteristic of a non-blocking architecture is that every port on the switch can talk to a counterpart port (without oversubscription) and still get 100% bandwidth on all ports, with zero latency growth regardless of traffic load. Now, if you want to stitch together more ports than a single switch supports, networking theory offers options like leaf-spine, fat tree, and Clos topologies that preserve the performance guarantees of a non-blocking architecture. There are a couple of catches, though. One is that these networks require longer packet paths, and thus higher minimum latency. The other is that they’re just expensive: you need a lot of switches to support the rich connectivity required to hierarchically construct non-blocking switching topologies.

What datacenters do in practice is rely on the probabilistic nature of internet traffic to build multi-switch topologies that don’t provide full guarantees for all-to-all traffic but deliver great bandwidth and latency KPIs for most network usage scenarios. This has worked out for them, and what’s interesting is that a multi-switch (or multi-ASIC) network topology has a lot of parallels to the sort of mesh fabric that we at Baya Systems build on-die within an SoC package. The primary traffic generators on SoCs are cores, and most core traffic is probabilistic, driven by the combination of which applications are running and how they happen to miss across multiple levels of caching. Network traffic and SoC traffic are therefore both probabilistic, so the appropriate solutions for both the multi-switch and single-SoC problems tend to be cost-focused architectures that provide enough network capacity to hit KPIs with high probability, without paying for the expensive, hard guarantees that a fully non-blocking topology of crossbar-based switches could provide.
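To put rough numbers on that cost trade-off, here is a small, purely illustrative Python sketch of ours. The 64-port radix, the idealized half-down/half-up leaf wiring, and the hop counts are assumptions for the sake of the example, not measurements of any real product.

```python
# Illustrative back-of-the-envelope comparison (not vendor data): how many
# switches does it take to stay non-blocking once you outgrow one crossbar?

def single_crossbar(end_ports, radix):
    """A single crossbar is non-blocking by construction, but cannot grow
    past its own radix."""
    return {"switches": 1, "min_switch_hops": 1} if end_ports <= radix else None

def two_tier_leaf_spine(end_ports, radix):
    """Idealized non-blocking two-tier leaf/spine built from identical switches.

    Assumption: each leaf dedicates half its radix to end ports and half to
    uplinks (one uplink to every spine), the classic folded-Clos recipe.
    """
    down_per_leaf = radix // 2
    leaves = -(-end_ports // down_per_leaf)   # ceiling division
    spines = down_per_leaf                    # one spine per leaf uplink
    return {
        "switches": leaves + spines,
        "inter_switch_links": leaves * spines,
        "min_switch_hops": 3,                 # leaf -> spine -> leaf
    }

RADIX = 64  # hypothetical switch radix
for ports in (64, 256, 1024):
    print(ports, single_crossbar(ports, RADIX), two_tier_leaf_spine(ports, RADIX))
```

Even in this idealized form, crossing the single-switch boundary means dozens of switches and a tripled minimum hop count, which is exactly the latency and cost penalty described above.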
AI systems are just more demanding than traditional networking
So, this history is all great, and it’s only natural to think that we’d want to scale AI systems similarly. The problem is that AI systems behave very differently from typical networking workloads. AI systems are more like HPC systems: their workloads are much more restricted, and thus their traffic patterns are much more predictable, and much more demanding. It turns out that the traffic patterns of GPU-based AI systems are not at all like those of traditional non-AI SoCs; they’re not going to get a ton of local cache hits, and they are going to generate enormous amounts of all-to-all traffic. Well, crossbars support full all-to-all traffic patterns with performance guarantees, so you might just want to build a giant crossbar and be done with it. As it turns out, that’s what NVIDIA has done with NVSwitch: latency stays low even at 100% bandwidth load, because there’s a crossbar implementation under the hood. That only works up to a point, because crossbars are notoriously hard to scale. And if there’s one thing AI systems want, it’s scale.

ISPs solved their scaling problems by simply using multi-tiered network topologies, but that doesn’t work so well for AI systems. For many GPUs to look like a single giant GPU, each GPU needs to be able to access other GPUs’ memory with the same (or similar) low, deterministic latency as accessing its own memory. NVIDIA has done what it can under these constraints by parallelizing NVSwitches in a single layer of switching shared concurrently among multiple GPUs: a single GPU uses multiple parallel NVSwitches to communicate with other GPUs, by making use of the multiple NVLink ports at its disposal. NVIDIA has also embedded AllReduce-enhancing compute within NVSwitches to increase the effectiveness of each byte transferred across this high-cost, low-latency single-layer topology. But AI models are greedy. The scaling is just never enough, and if the fundamental crossbar scaling limitations were somehow lifted, we suspect AI systems would immediately take advantage of that to scale further. Now, InfiniBand is useful for parallelizing nodes and scaling out AI training, but as the OpenAI paper on AI scaling [2] and others have demonstrated, scaling up within a node is more impactful than scaling out across nodes for improving AI model training.
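A quick sketch of the arithmetic helps explain both pressures. The numbers below are illustrative assumptions of ours (one fabric port per GPU and a made-up 64 MiB shard per peer), not measurements of NVSwitch or any other product; they simply show that crossbar silicon cost grows roughly quadratically with port count while all-to-all traffic injected per GPU grows linearly with node size.

```python
# Illustrative sketch: why "one giant crossbar" gets hard to scale, and why
# AI nodes keep asking for it anyway. All figures are assumptions, not vendor data.

def crosspoints(ports):
    # A P-port crossbar needs every input to reach every output:
    # on the order of P x P crosspoints, i.e. quadratic growth.
    return ports * ports

def all_to_all_bytes_per_gpu(num_gpus, shard_bytes):
    # In an all-to-all exchange each GPU sends a distinct shard to every
    # other GPU, so the traffic it injects grows linearly with node size.
    return (num_gpus - 1) * shard_bytes

SHARD = 64 * 2**20  # hypothetical 64 MiB shard sent to each peer
for gpus in (8, 64, 256, 1024):  # assuming one fabric port per GPU
    print(f"{gpus:5d} GPUs: {crosspoints(gpus):>9,} crosspoints, "
          f"{all_to_all_bytes_per_gpu(gpus, SHARD) / 2**30:6.1f} GiB injected per GPU")
```

The quadratic term is what eventually caps a flat crossbar’s radix; the linear term is why AI models keep asking for more of it.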
The emergence of Ultra Accelerator Link, Ultra Ethernet and an open ecosystem

Even as NVIDIA has been happily gobbling up the lion’s share of the AI market so far, other companies, big and small, are jumping into the fray. An open ecosystem helps all of these contenders, and the recent emergence of UALink as an NVLink competitor and Ultra Ethernet as an InfiniBand competitor is a critical milestone in developing that ecosystem. With this technology, big companies and startups alike can throw their hats into the ring, and may the best solution win. We at Baya Systems are excited to partner with these contenders to bring their product dreams to realization. We can drastically reduce the time and cost of developing new AI switching products via our AI-optimized fabric, codenamed Chiffon-AI, which is correct-by-construction and physical-design friendly. Our method of defining and constructing hardware via software is built with the overarching goal of empowering product architects from the ideation and design exploration phases, through development and validation, all the way to tape-out.

Now, if only there were a way to build a UALink switch, or an NVLink switch for that matter, with better underlying technology than crossbars. Wouldn’t that unlock AI node scaling beyond what has been possible so far, and give AI systems and AI models another boost to keep scaling? We believe we’ve made just such a breakthrough at Baya Systems with our NeuraScale product.

References

[1] Bill Dally (NVIDIA). “HC2023-K2: Hardware for Deep Learning” [Video]. Hot Chips 2023, August 29, 2023. YouTube. https://www.youtube.com/watch?v=rsxCZAE8QNA

[2] Kaplan, J., et al. (OpenAI). “Scaling Laws for Neural Language Models.” 2020. https://arxiv.org/abs/2001.08361
