How to optimize Raspberry Pi code using its GPU | Pete Warden

Appreciation

Importance

Date Added

2.23.26

TLDR

To get data from DRAM onto the Pi’s QPUs, going through the TMU (Texture and Memory Lookup Unit) is faster than going through the VPM (Vertex Pipe Memory). Also includes various notes about VPM usage for GEMM (matrix multiply).

2 Cents

Tags

The TMU is a shared hardware unit within each QPU slice that provides an asynchronous, FIFO-based path for QPUs to read data from main memory.
The alternative for memory access is the VPM, a 4 KB globally shared buffer that is filled and drained via DMA.
- VPM DMA can achieve higher raw throughput for large contiguous transfers (up to ~690 MB/s read and ~1120 MB/s write, per GitHub), but it requires mutex locks and kills parallelism, because all QPUs stall while one of them is doing a DMA transfer.
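
To put the quoted VPM DMA figures in perspective, here is a rough back-of-the-envelope sketch in Python. It is not taken from the article: the 256 x 256 float32 matrix, the 1 MB = 10^6 bytes convention, the QPU count of 12, and the helper names `transfer_time_ms` and `stalled_qpu_time_ms` are all assumptions made for illustration. It also assumes the peak throughput is actually reached, which real transfers may not manage.

```python
def transfer_time_ms(num_bytes: int, throughput_mb_s: float) -> float:
    """Milliseconds needed to move num_bytes at throughput_mb_s.

    Assumes 1 MB = 1e6 bytes and that the peak rate is sustained.
    """
    return num_bytes / (throughput_mb_s * 1e6) * 1e3


def stalled_qpu_time_ms(transfer_ms: float, num_qpus: int) -> float:
    """Total QPU time spent waiting if every QPU stalls for the whole transfer."""
    return transfer_ms * num_qpus


# Hypothetical workload: one 256 x 256 matrix of 32-bit floats.
MATRIX_BYTES = 256 * 256 * 4

# Peak VPM DMA rates quoted in the notes above.
VPM_READ_MB_S = 690.0
VPM_WRITE_MB_S = 1120.0

# Assumed QPU count; adjust for the actual hardware.
NUM_QPUS = 12

read_ms = transfer_time_ms(MATRIX_BYTES, VPM_READ_MB_S)
write_ms = transfer_time_ms(MATRIX_BYTES, VPM_WRITE_MB_S)
stall_ms = stalled_qpu_time_ms(read_ms, NUM_QPUS)

print(f"read:  {read_ms:.3f} ms")
print(f"write: {write_ms:.3f} ms")
print(f"QPU time lost while reading: {stall_ms:.3f} ms")
```

The last figure is the point of the exercise: even when the DMA itself is fast, the cost is multiplied by the number of QPUs that sit idle during it, which is the trade-off the TMU's asynchronous FIFO path avoids.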