Unleashing Heterogeneous Compute: Lessons from Real-World System Design

by Brian Carlson

Introduction

At Andes RISC-V CON Silicon Valley, held in San Jose, California, in April 2025, Imagination Technologies and Baya Systems delivered a compelling presentation titled “Unleashing Heterogeneous Compute: Lessons from Real-World System Design.” This session, part of the developer track, showcased a joint demonstration leveraging Baya Systems’ WeaverPro™ CacheStudio software to model and analyze system behavior across CPU, GPU, and hybrid configurations. Presented by Pallavi Sharma, director of product management at Imagination Technologies, and Dr. Eric Norige, chief software architect at Baya Systems, the session offered practical insights into optimizing heterogeneous compute architectures, with a focus on data movement, synchronization, and memory reuse.

This blog post summarizes the key points from their presentation.

We invite you to download the presentation here.

The Challenge of Heterogeneous Compute

The presentation opened with a fundamental principle: “If you can’t feed it, you can’t use it.” In modern System-on-Chip (SoC) designs, the abundance of compute resources—more cores, GPUs, and acceleration blocks—does not automatically translate to better performance. The bottleneck lies in efficiently moving data, synchronizing execution, and managing memory. The speakers highlighted that true heterogeneity is not just about integrating multiple compute engines but about ensuring that they cooperate architecturally. Key factors influencing performance include:

  • Interaction with shared memory 
  • Scheduling decisions affecting latency 
  • Cache coherence policies, fabric layout, and cache sizing 

These elements collectively determine how well a system can handle data-intensive workloads, a critical consideration for heterogeneous platforms.
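
A back-of-envelope roofline estimate makes the “feed it” principle concrete: an engine’s achievable throughput is capped by the lesser of its peak compute rate and what the memory system can deliver at the kernel’s arithmetic intensity. The Python sketch below is purely illustrative; the peak-throughput and bandwidth figures are assumptions, not numbers from the presentation.

    # Roofline-style bound: achievable throughput is limited by
    # min(peak compute, memory bandwidth x arithmetic intensity).
    # All figures here are illustrative assumptions.
    PEAK_GFLOPS = 2000.0   # assumed peak compute of an accelerator, GFLOP/s
    MEM_BW_GBS = 100.0     # assumed sustained memory bandwidth, GB/s

    def achievable_gflops(flops_per_byte):
        """Upper bound for a kernel that performs flops_per_byte
        FLOPs for every byte it moves to or from memory."""
        return min(PEAK_GFLOPS, MEM_BW_GBS * flops_per_byte)

    for ai in (0.5, 2.0, 8.0, 32.0):
        print(f"intensity {ai:4.1f} FLOP/B -> {achievable_gflops(ai):6.1f} GFLOP/s")

At low arithmetic intensity the engine is starved by the memory system no matter how much compute is integrated, which is exactly the regime the speakers warned about.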

Cache Analysis as a Diagnostic Tool

The core of the presentation was a series of controlled profiling experiments using CacheStudio, a tool developed by Baya Systems. These experiments analyzed cache behavior across the L1, L2, and L3 cache levels in CPU-only, GPU-only, and mixed CPU+GPU workloads. The goal was to uncover behavioral patterns—such as stalls, scalability limits, and saturation points—rather than merely benchmarking absolute performance. By varying cache sizes (e.g., L1 at 16 KB and 32 KB; L2 at 64 KB, 128 KB, and 256 KB), the team isolated the impact of cache configuration on hit rates and memory load.
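
CacheStudio’s modeling flow itself is proprietary, but the shape of such an experiment can be sketched with a toy simulator: replay an address trace through a set-associative cache at several sizes and compare hit rates. Everything below (the LRU cache model, the line size, and the synthetic trace) is an illustrative assumption, not Baya Systems’ tool or API.

    from collections import OrderedDict

    LINE = 64  # assumed cache-line size in bytes

    def hit_rate(trace, size_bytes, ways=4):
        """Replay an address trace through a toy set-associative
        LRU cache and return the fraction of accesses that hit."""
        sets = size_bytes // (LINE * ways)
        cache = [OrderedDict() for _ in range(sets)]
        hits = 0
        for addr in trace:
            tag, idx = addr // LINE // sets, (addr // LINE) % sets
            s = cache[idx]
            if tag in s:
                hits += 1
                s.move_to_end(tag)          # refresh LRU position
            else:
                if len(s) >= ways:
                    s.popitem(last=False)   # evict least recently used
                s[tag] = True
        return hits / len(trace)

    # Synthetic trace: a repeated sequential sweep over a 48 KB working set.
    trace = [(i * LINE) % (48 * 1024) for i in range(200_000)]
    for kb in (16, 32, 64):                 # sweep cache sizes as in the talk
        print(f"{kb:3d} KB -> hit rate {hit_rate(trace, kb * 1024):.2%}")

The knee where the working set first fits in cache is exactly the kind of behavioral pattern, as opposed to an absolute benchmark score, that experiments like these are designed to expose.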

The analysis revealed how different cache sizes and configurations affect system performance, particularly in terms of contention, reuse, and locality mismatch. For instance, the GPU cache analysis at L2 showed hit rates varying significantly with the combination of L1 and L2 sizes; one data point showed a hit rate of 45% with a 16 KB L1 and a 64 KB L2. These findings underscored the importance of tailoring cache configurations to specific workload demands to optimize throughput and minimize bottlenecks.
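
A hit rate like that 45% figure matters because it feeds directly into average memory access time (AMAT). The short calculation below applies the standard AMAT formula at the L2 level; the hit rate comes from the talk, but the cycle counts are assumed for illustration.

    # AMAT at the L2 level: hit time plus miss rate times miss penalty.
    # The 45% hit rate is from the talk; the latencies are assumptions.
    L2_HIT_CYCLES = 12    # assumed L2 hit latency, cycles
    MISS_PENALTY = 120    # assumed cost of falling through to L3/DRAM

    for rate in (0.45, 0.70, 0.90):
        amat = L2_HIT_CYCLES + (1.0 - rate) * MISS_PENALTY
        print(f"L2 hit rate {rate:.0%} -> {amat:5.1f} cycles per access")

Under these assumptions, raising the L2 hit rate from 45% to 90% cuts the average cost per access from 78 to 24 cycles, which is why tailoring cache configurations to the workload pays off.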

Key Insights and Design Learnings

The presentation emphasized that cache analysis serves as a proxy for understanding broader system-level dynamics. It makes “invisible” issues—such as contention, poor data locality, and interconnect inefficiencies—visible to system architects. The key takeaway was the need for early system-level profiling to shape architecture design, rather than focusing solely on cache sizing or adding more compute engines. The speakers stressed that performance in heterogeneous systems stems from how compute engines interact, not just their raw computational power. Specific design learnings included:

  • Optimizing Data Movement: Efficient data transfer between compute engines and memory is critical to prevent bottlenecks. 
  • Synchronization and Scheduling: Proper scheduling reduces latency and ensures smooth coordination between CPUs and GPUs. 
  • Cache Coherence and Fabric Layout: Well-designed coherence policies and interconnect fabrics are essential for maintaining performance in shared memory systems. 
  • Memory Reuse: Intelligent memory management enhances efficiency by reducing redundant data fetches (see the sketch after this list). 

These insights are broadly applicable to any data-intensive heterogeneous platform, from edge devices to high-performance computing systems.
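
As a concrete illustration of the memory-reuse point, the sketch below shows a blocked (tiled) matrix multiply. Tiling is a generic textbook technique, offered here as an illustration rather than anything demonstrated in the session: it keeps a small working set resident in cache so each fetched block is reused many times before eviction.

    import numpy as np

    def matmul_tiled(A, B, tile=64):
        """Blocked matrix multiply: each (tile x tile) block of A and B
        stays hot in cache for a whole block-update of C, reducing
        redundant fetches versus a naive full-row traversal."""
        n = A.shape[0]
        C = np.zeros((n, n), dtype=A.dtype)
        for i0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                for k0 in range(0, n, tile):
                    # One small tile of each operand is reused here.
                    C[i0:i0+tile, j0:j0+tile] += (
                        A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                    )
        return C

    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(matmul_tiled(A, B), A @ B)  # sanity check

The same blocking idea generalizes to the system level: scheduling work so that data produced by one engine is consumed by another while it is still resident in a shared cache avoids a round trip to DRAM.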

Closing Thoughts

The presentation concluded with a reiteration of the guiding principle: “If you can’t feed it, you can’t use it.” Cache modeling was presented not as the end goal but as a tool to inform architectural decisions. By profiling systems early, designers can address fundamental issues in data flow and resource coordination, leading to more efficient and scalable heterogeneous architectures. The session underscored the importance of a holistic approach to system design, where cooperation between compute engines is prioritized over merely increasing their number. 

Conclusion

The joint presentation by Imagination Technologies and Baya Systems at Andes RISC-V CON 2025 provided valuable insights into the challenges and solutions of heterogeneous compute architectures. By leveraging Baya Systems’ WeaverPro CacheStudio software for detailed cache analysis, the session highlighted the critical role of data movement, synchronization, and memory management in achieving high performance. The findings serve as a guide for system architects working on next-generation SoCs, emphasizing architectural cooperation over mere integration. This collaboration between Imagination Technologies, Baya Systems, and Andes Technology underscores the importance of industry partnerships in advancing RISC-V-based solutions for diverse applications.

