Unleashing Heterogeneous Compute: Lessons from Real-World System Design
by Brian Carlson
Introduction
At Andes RISC-V CON Silicon Valley, held in San Jose, California, in April 2025, Imagination Technologies and Baya Systems delivered a compelling presentation titled “Unleashing Heterogeneous Compute: Lessons from Real-World System Design.” This session, part of the developer track, showcased a joint demonstration leveraging Baya Systems’ WeaverPro™ CacheStudio software to model and analyze system behavior across CPU, GPU, and hybrid configurations. Presented by Pallavi Sharma, director of product management at Imagination Technologies, and Dr. Eric Norige, chief software architect at Baya Systems, the session offered practical insights into optimizing heterogeneous compute architectures, with a focus on data movement, synchronization, and memory reuse.
This blog summarizes the key points from their presentation.
We invite you to download the presentation here.
The Challenge of Heterogeneous Compute
The presentation opened with a fundamental principle: “If you can’t feed it, you can’t use it.” In modern System-on-Chip (SoC) designs, an abundance of compute resources—more cores, GPUs, and acceleration blocks—does not automatically translate to better performance. The bottleneck lies in moving data efficiently, synchronizing execution, and managing memory. The speakers highlighted that true heterogeneity is not just about integrating multiple compute engines but about ensuring they cooperate architecturally. Key factors influencing performance include the following (a rough numeric sketch after the list illustrates the feeding problem):
- Interaction with shared memory
- Scheduling decisions affecting latency
- Cache coherence policies, fabric layout, and cache sizing
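To make the “feed it” principle concrete, a roofline-style back-of-the-envelope estimate helps: attainable throughput is capped by the smaller of peak compute and memory bandwidth times arithmetic intensity. This is our illustration, not material from the presentation, and the peak-compute and bandwidth figures below are hypothetical.

```python
# Roofline-style sketch: a compute engine can only run as fast as it is fed.
# Attainable throughput = min(peak compute, memory bandwidth x FLOPs per byte).
# All figures are hypothetical, chosen purely for illustration.

PEAK_GFLOPS = 2000.0    # hypothetical peak compute of an accelerator
BANDWIDTH_GBS = 100.0   # hypothetical sustained memory bandwidth (GB/s)

def attainable_gflops(flops_per_byte: float) -> float:
    """Throughput achievable by a kernel with the given arithmetic intensity."""
    return min(PEAK_GFLOPS, BANDWIDTH_GBS * flops_per_byte)

for intensity in (0.5, 2.0, 8.0, 32.0):
    print(f"{intensity:5.1f} FLOP/byte -> {attainable_gflops(intensity):7.1f} GFLOP/s")
```

In this sketch, any kernel below 20 FLOP/byte is bandwidth-bound: adding more compute units would simply leave them idle, which is exactly the cooperation problem the speakers described.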
Cache Analysis as a Diagnostic Tool
The analysis revealed how different cache sizes and configurations affect system performance, particularly in terms of contention, reuse, and locality mismatch. For instance, the GPU L2 cache analysis showed hit rates varying significantly across L1 and L2 size combinations; one data point showed a 45% hit rate with a 16 KB L1 and a 64 KB L2. These findings underscored the importance of tailoring cache configurations to the demands of specific workloads to maximize throughput and minimize bottlenecks.
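As a minimal sketch of how such a size sweep can be modeled (WeaverPro CacheStudio’s actual methodology is far more detailed; the fully associative LRU model and synthetic trace below are our simplifying assumptions), a small two-level cache simulator is enough to reproduce the shape of this kind of analysis:

```python
from collections import OrderedDict
import random

LINE = 64  # cache line size in bytes

class LruCache:
    """Fully associative LRU cache (a deliberate simplification of real hardware)."""
    def __init__(self, size_bytes: int):
        self.capacity = size_bytes // LINE   # number of cache lines
        self.tags = OrderedDict()
        self.hits = 0
        self.accesses = 0

    def access(self, addr: int) -> bool:
        """Return True on hit; insert and evict the LRU line on miss."""
        tag = addr // LINE
        self.accesses += 1
        if tag in self.tags:
            self.tags.move_to_end(tag)       # mark as most recently used
            self.hits += 1
            return True
        self.tags[tag] = None
        if len(self.tags) > self.capacity:
            self.tags.popitem(last=False)    # evict least recently used line
        return False

def simulate(trace, l1_kb, l2_kb):
    l1, l2 = LruCache(l1_kb * 1024), LruCache(l2_kb * 1024)
    for addr in trace:
        if not l1.access(addr):              # only L1 misses reach L2
            l2.access(addr)
    return l1.hits / l1.accesses, l2.hits / max(l2.accesses, 1)

# Synthetic GPU-like trace: a reused 32 KB hot region interleaved with streaming.
random.seed(0)
hot = [random.randrange(32 * 1024) for _ in range(20000)]
stream = list(range(0, 4 * 1024 * 1024, LINE))
trace = hot + stream + hot

for l1_kb, l2_kb in [(16, 64), (16, 256), (64, 256)]:
    h1, h2 = simulate(trace, l1_kb, l2_kb)
    print(f"L1={l1_kb:3d} KB, L2={l2_kb:3d} KB -> L1 hit {h1:.0%}, L2 hit {h2:.0%}")
```

Sweeping (L1, L2) size pairs over a real workload trace, as the presenters did with CacheStudio, turns hit-rate curves like these into concrete sizing decisions.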
Key Insights and Design Learnings
- Optimizing Data Movement: Efficient data transfer between compute engines and memory is critical to prevent bottlenecks.
- Synchronization and Scheduling: Proper scheduling reduces latency and ensures smooth coordination between CPUs and GPUs.
- Cache Coherence and Fabric Layout: Well-designed coherence policies and interconnect fabrics are essential for maintaining performance in shared memory systems.
- Memory Reuse: Intelligent memory management enhances efficiency by reducing redundant data fetches (see the sketch after this list).
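To illustrate the memory-reuse point (our back-of-the-envelope example, not data from the presentation; the matrix and tile sizes are arbitrary), compare the off-chip traffic of a naive matrix-multiply traversal with a tiled one that keeps blocks resident on-chip:

```python
# Memory-traffic estimate for an n x n matrix multiply: tiling lets each
# block of A and B be fetched once and reused on-chip instead of being
# re-streamed from memory. Sizes below are arbitrary illustrations.

def naive_loads(n: int) -> int:
    # Computing each C[i][j] re-reads a row of A and a column of B:
    return 2 * n**3

def tiled_loads(n: int, t: int) -> int:
    # (n/t)^3 tile-pair products, each loading two t x t tiles once:
    return 2 * (n // t) ** 3 * t * t     # equals 2 * n**3 / t

n, t = 1024, 32
print(f"naive: {naive_loads(n):,} element loads")
print(f"tiled: {tiled_loads(n, t):,} element loads "
      f"({naive_loads(n) // tiled_loads(n, t)}x fewer)")
```

Off-chip traffic falls by roughly a factor of the tile dimension, which is the kind of redundant-fetch reduction the speakers attributed to intelligent memory reuse.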
Closing Thoughts
The presentation concluded with a reiteration of the guiding principle: “If you can’t feed it, you can’t use it.” Cache modeling was presented not as the end goal but as a tool to inform architectural decisions. By profiling systems early, designers can address fundamental issues in data flow and resource coordination, leading to more efficient and scalable heterogeneous architectures. The session underscored the importance of a holistic approach to system design, where cooperation between compute engines is prioritized over merely increasing their number.