The wonderful world of the VideoCore IV (Raspberry Pi GPU) | CS 240LX
#Introduction and overview
The GPU consists of 12 QPUs (quad processor units). The QPUs are technically 4-way SIMD processors (hence quad), but the actual execution is effectively 16-way virtual SIMD by use of multiplexing, and each of the QPU registers is a 16-wide vector register. Thus, we have 12 QPUs, each of which issues 16-wide vector instructions, providing up to 192 parallel execution lanes with SIMD (single instruction multiple data) within QPUs by virtue of the vector instructions, and SPMD (single program multiple data) across QPUs via the QPU scheduler.
- The GPU has 12 QPUs (the basic processing unit), and each QPU is 4-way SIMD, meaning its ALU has 4 lanes, so you get 4 independent operations per cycle.
- A lane handles one 32-bit operation: 2 operands in, 1 result out.
- There are 12 QPUs, NOT 16 (+ other useful info in this forum post).
- To be more precise regarding "its ALU": each QPU has two 4-lane ALUs (add + mul) that can fire simultaneously in a single cycle, giving up to 8 operations per cycle.
- Through multiplexing (running the same instruction 4 times over 4 cycles on different data), the QPU appears 16-wide to the programmer.
- Each instruction on a single QPU is applied to 16 32-bit values (registers, immediates), so each register address likewise holds 16 32-bit values.
- 4 QPUs are grouped into a slice, sharing an instruction cache and special function hardware. The GPU has 3 slices = 12 QPUs.
- Peak theoretical throughput: 12 QPUs × 8 flops/cycle × 250 MHz = 24 GFLOPS.
The GPU has two memory buffers: the VPM (Vertex Pipe Memory) and the TMU (Texture and Memory lookup Unit). Since we are focusing on general-purpose shaders, we will only use the VPM (although there are existing UNIX implementations that are slightly faster using the TMU; it would be great if someone could accelerate one of the VPM implementations using the TMU). If you're familiar with CUDA, the VPM is somewhat analogous to shared memory: multiple QPUs share one VPM, which can hold ~4KB of data.
- The GPU has two memory buffers for moving data between DRAM and QPU registers: VPM and TMU.
- The VPM is a 4KB block of SRAM shared across all 12 QPUs. It's laid out as a 2D array: 16 columns × 64 rows × 4 bytes = 4096 bytes.
- Data flows: QPU registers ↔ VPM ↔ DRAM. These are two independent paths that can happen in parallel: VPM ↔ registers is a direct QPU access, while VPM ↔ DRAM is handled by a DMA engine in the background.
- VPM exists because DRAM latency is unpredictable and the QPU pipeline is too simple to handle stalls. VPM provides a small, predictable-latency staging area.
#VC4 structure and registers
- Each QPU has two register files (A and B), each with 32 registers (`ra0`-`ra31`, `rb0`-`rb31`). Each register is 16 × 32-bit wide (one value per SIMD lane), giving ~4KB of private storage per QPU.
- Only one read per register file per cycle: you can't read from two registers in the same file in the same instruction (e.g. `ra1` and `ra2` together is illegal).
- Each QPU also has 6 accumulators (`r0`-`r5`), also 16-wide. These are special fast registers that sit even closer to the ALUs.
- ⭐ Importantly, if you write to a register file (e.g. `ra5`) in one instruction, you can't read it back in the next instruction because of pipeline delay. Accumulators have no such delay, so you can write then read immediately.
  - i.e., use accumulators for intermediate values in tight loops, and register files for longer-term storage.
- Note that `r4` is read-only (used by the TMU and SFU for returning results), so in practice you have 5 usable accumulators.
- Pipeline delay: if `instr 2` reads a register that `instr 1` wrote, the read happens before the write and you get the OLD VALUE! Accumulators bypass this via a forwarding path from Exec directly back to Read.

  ```
  Cycle:      1      2      3      4      5
  Instr 1:  [Read] [Exec] [....] [Write]
  Instr 2:         [Read] [Exec] [....] [Write]
                     ↑ Write hasn't happened yet!
  ```

  > The QPU pipeline is constructed such that register files have an entire pipeline cycle to perform a read operation. As the QPU pipeline length from register file read to write-back is greater than four cycles, one cannot write data to a physical QPU register file in one instruction and then read that same data for use in the next instruction (no forwarding paths are provided). QPU code is expected to make heavy use of the six accumulator registers, which do not suffer the restriction that data written by an instruction cannot be read in the next instruction.
- VPM: a 4KB SRAM block (16 columns × 64 rows × 4 bytes) shared across all 12 QPUs. Effectively a staging buffer between registers and DRAM, with two independent paths: QPU ↔ VPM (direct access) and VPM ↔ DRAM (DMA, runs in the background). Both can operate simultaneously.
- Special registers are accessed like normal registers but are actually hardware interfaces:
  - `unif` is a FIFO queue of parameters passed from the CPU side. Each read pops the next value and broadcasts it across all 16 SIMD lanes (hence "uniform").
  - `elem_num` holds `(0, 1, 2, ..., 15)`, giving each SIMD lane its own index; a "useful constant."
  - `vr_setup`/`vw_setup`, `vr_addr`/`vw_addr`, `vr_wait`/`vw_wait`, and `vpm` control VPM and DMA operations (configure, set address, wait for completion, read/write data). Like the memory-mapped control registers we used for the GPIO interfaces.
- So the memory hierarchy (fastest/smallest → slowest/biggest) is: QPU accumulators, QPU register files, VPM, TMU caches (L1 per slice, 128KB shared L2; managed automatically, and I'm not sure we have control over it), main DRAM.
#vc4asm Assembler for the QPU
- vc4asm is an assembler: it takes human-readable QPU assembly (`.qasm` files) and converts it into the 64-bit binary instructions the QPU hardware executes.
  - While the instructions are 64-bit, note that each gets split into two 32-bit words in our instructions array.
- It also ships with `vc4.qinc`, a file of standard macros (i.e., reusable templates like `vpm_setup(4, 1, h32(0))`) for common hardware configuration. Every `.qasm` file in this lab includes it at the top with `.include "../share/vc4inc/vc4.qinc"`.
- Compilation flow: the QPU can only execute instructions from GPU-visible shared memory (allocated via the mailbox), not from normal CPU memory where your compiled C code lives. So you `memcpy` the assembled binary instructions into that shared region before telling the QPU to start executing. Specifically:
  - Write `kernel.qasm`
  - Run `vc4asm -c kernel.c -h kernel.h kernel.qasm`
  - This produces a C array of the binary instructions (`kernel.c`) and a header declaring its existence (`kernel.h`)
  - Your CPU-side `.c` file `#include`s that header, then `memcpy`s the array into GPU-visible shared memory so the QPU can execute it. (??? TO BE UPDATED: isn't the DRAM fully accessible to both CPU and GPU on the Pi Zero?)
- Assembler directives (`.include`, `.set`, `.macro`, etc.) are instructions to the assembler, not to the QPU. They control how the assembler processes your file (e.g. pasting in includes, defining constants, creating macros).
- Assembler instructions are pretty straightforward, but some items are worth noting:
  - Dual-issue packing means multiple logical instructions can be placed on one line as long as their combination still fits in one opcode. Delimit with `;`. E.g., `mov ra11, rb11; mov rb11, ra11; ldtmu0` = three things packed into one instruction.
  - Condition codes let you execute an instruction only in the SIMD lanes where a condition is true. First set flags with `.setf`, then use `.ifz`/`.ifn`/etc. on later instructions. This is how you do "if" logic within the 16-wide SIMD, since all lanes execute the same instruction.
  - Branches (`brr`, `bra`) jump to a label. Watch out: the 3 instructions after a branch (its delay slots) still execute before the branch takes effect. Use `nop`s unless you can do something useful there.
  - Fun fact: `mov` is not a real QPU instruction. Writing `mov ra5, r1` actually emits something like `or ra5, r1, r1` (a value OR'd with itself is just itself).
#Part 0A: deadbeef
The “hello world” program here is to have the QPU write 0xDEADBEEF into the output buffer, DMA it out to DRAM, and then have the CPU check whether 0xDEADBEEF is sitting in the output buffer. Deadbeef just refers to a recognizable debug value.
The `struct GPU` is a programmer-defined layout for a single shared-memory block that keeps everything the GPU program needs (buffers, code, uniforms, and launch metadata) organized in one contiguous allocation so both the CPU and GPU can access it:

```c
struct GPU
{
    uint32_t input[SOME_SIZE];
    uint32_t output[SOME_SIZE];
    // other program buffers as needed
    uint32_t code[CODE_SIZE];
    uint32_t unif[NUM_QPUS][NUM_UNIFS];
    uint32_t unif_ptr[NUM_QPUS];
    uint32_t mail[2];
    uint32_t handle;
};
```
⭐ TLDR of deadbeef part:
- `mem_alloc` a block for the CPU and GPU to access
- Fill `output[]` with `0xFFFFFFFF` so we can confirm it changes
- `memcpy` the assembled QPU binary into `code[]`
- Set uniforms: for deadbeef, just the GPU address of `output[]` so the QPU knows where to write the output
- Set `unif_ptr` to tell the scheduler where each QPU's uniforms start
- Set `mail` to the launch command: give (1) the GPU address of the code and (2) the GPU address of `unif_ptr`
  - As covered in our mailbox lab
- Send `mail` through the mailbox to kick off execution
- Check `output[]`: if it now contains `0xDEADBEEF` instead of `0xFFFFFFFF`, the whole pipeline worked
Details:
- `mem_alloc` and `mem_lock` are mailbox utility functions, as defined in `mailbox.c`.
  - `handle = mem_alloc(sizeof(struct GPU), 4096, GPU_MEM_FLG);` asks the GPU firmware to reserve a chunk of memory and gives you back a handle (like a ticket number). We stash it away for unlocking later with `ptr->handle = handle;`.
  - `vc = mem_lock(handle);` takes that handle, pins the memory in place, and gives you the actual GPU bus address where it lives (because in theory the GPU memory manager could relocate unlocked allocations).
- On the line `*gpu = ptr;`: after `gpu_prepare` returns, `gpu` in `notmain` points to the CPU-addressable `struct GPU`, and everything is initialized and ready to go.
On `mail[1]`, `unif_ptr`, `unif_ptr[0]`, and `unif[0][0]`, as this was quite confusing:
- `gpu_fft_base_exec_direct` (which bypasses the mailbox, see the quote below) needs an array of per-QPU pointers (our `unif_ptr`), where each element is the GPU address of that QPU's uniforms.

  > There's technically a mailbox call to execute code on the GPU, but we couldn't figure out how to get that to work (please let us know if you do!). Instead, to get the code to run we had to directly write to the GPU control registers as you would any of the other hardware peripherals we've worked on.
- Hence:
  - `unif_ptr[0]` is the GPU address of `unif[0]` (it points to the uniform array for QPU 0)
  - `unif[0][0]` is the actual uniform value (in this case, the GPU address of the output array). Why? Remember the uniform register is special and treated as a FIFO: each time the kernel does `mov rX, unif`, it pops the next 32-bit value off the queue.
  - `mail[1]` is the GPU address of `unif_ptr` (the base of the per-QPU pointer table)
And that's it! A complete hello-world program on the GPU, and we're only ... about 2,700 words into the README.