
The wonderful world of the VideoCore IV (Raspberry Pi GPU) | CS 240LX 

Appreciation
8
Importance
8
Date Added
2.9.26
TLDR
This lab introduces the VideoCore IV (Raspberry Pi GPU) through “hello-world”, 2D indexing, parallel addition, and Mandelbrot examples.
2 Cents
Great introduction (goldmine of info) but difficult for me to read because of lack of experience. Hopefully my notes below help others; they basically contain (1) answers to all the questions I had, (2) summary notes, and (3) fun facts :)
Tags

#Introduction and overview

The GPU consists of 12 QPUs (quad processing units). The QPUs are technically 4-way SIMD processors (hence quad), but the actual execution is effectively 16-way virtual SIMD by use of multiplexing, and each QPU register is a 16-wide vector register. Thus, we have 12 QPUs, each of which issues 16-wide vector instructions, providing up to 192 parallel execution lanes with SIMD (single instruction multiple data) within QPUs by virtue of the vector instructions, and SPMD (single program multiple data) across QPUs via the QPU scheduler.

  • The GPU has 12 QPUs (the basic processing unit) and each QPU is 4-way SIMD, meaning its ALU has 4 lanes, so you get 4 independent operations per cycle

    • A lane handles one 32-bit operation (2 operands in, 1 result out)
    • There are 12 QPUs, NOT 16 (+ other useful info in this forum post)
  • To be more precise regarding “its ALU”: each QPU has two 4-lane ALUs (add + mul) that can fire in the same cycle, giving up to 8 operations per cycle.

  • Through multiplexing (running the same instruction 4 times over 4 cycles on different data), the QPU appears 16-wide to the programmer.

    • (each instruction on a single QPU is applied to 16 32-bit values (from registers or immediates), so each register address likewise refers to 16 32-bit values)
  • 4 QPUs are grouped into a slice, sharing an instruction cache and special function hardware. The GPU has 3 slices = 12 QPUs.

  • Peak theoretical throughput: 12 QPUs × 8 flops/cycle × 250 MHz = 24 GFLOPS.


The GPU has two memory buffers: the VPM (Vertex Pipe Memory) and the TMU (Texture and Memory lookup Unit). Since we are focusing on general-purpose shaders, we will only use the VPM (although there are existing UNIX implementations that are slightly faster using the TMU; it would be great if someone could accelerate one of the VPM implementations using the TMU). If you're familiar with CUDA, the VPM is somewhat analogous to shared memory - multiple QPUs share one VPM, which can hold ~4KB of data.

  • The GPU has two memory buffers for moving data between DRAM and QPU registers: VPM and TMU.
  • The VPM is a 4KB block of SRAM shared across all 12 QPUs. It's laid out as a 2D array: 16 columns × 64 rows × 4 bytes = 4096 bytes.
  • Data flows: QPU registers ↔ VPM ↔ DRAM. These are two independent paths that can happen in parallel: VPM ↔ registers is a direct QPU access, while VPM ↔ DRAM is handled by a DMA engine in the background.
  • VPM exists because DRAM latency is unpredictable and the QPU pipeline is too simple to handle stalls. VPM provides a small, predictable-latency staging area.

#VC4 structure and registers

  • Each QPU has two register files (A and B), each with 32 registers (ra0-ra31, rb0-rb31). Each register is 16 × 32-bit wide (one value per SIMD lane). (~4KB of private storage per QPU).

    • Only one read per register file per cycle: You can't read from two registers in the same file in the same instruction (e.g. ra1 and ra2 together is illegal).
  • Each QPU also has 6 accumulators (r0-r5), also 16-wide. These are special fast registers that sit even closer to the ALUs.

    • ⭐ Importantly, if you write to a register file (ra5) in one instruction, you can't read it back in the next instruction because of pipeline delay. Accumulators have no such delay, so you can write then read immediately.

      • i.e., use accumulators for intermediate values in tight loops, and register files for longer-term storage.

      • Note that r4 is read-only (used by TMU and SFU for returning results), so in practice you have 5 usable accumulators.

      • Pipeline delay: if instr 2 reads a register that instr 1 wrote, the Read happens before the Write and you get the OLD VALUE! Accumulators bypass this via a forwarding path from Exec directly back to Read.

        Cycle:        1     2     3     4     5
        Instr 1:    [Read][Exec][...][Write]
        Instr 2:          [Read][Exec][...][Write]
                            ↑              
                    Write hasn't happened yet!
        

        The QPU pipeline is constructed such that register files have an entire pipeline cycle to perform a read operation. As the QPU pipeline length from register file read to write-back is greater than four cycles, one cannot write data to a physical QPU register file in one instruction and then read that same data for use in the next instruction (no forwarding paths are provided). QPU code is expected to make heavy use of the six accumulator registers, which do not suffer the restriction that data written by an instruction cannot be read in the next instruction.

  • VPM: a 4KB SRAM block (16 columns × 64 rows × 4 bytes) shared across all 12 QPUs. Effectively a staging buffer between registers and DRAM, with two independent paths: QPU ↔ VPM (direct access) and VPM ↔ DRAM (DMA, runs in background). Both can operate simultaneously.

  • Special registers are accessed like normal registers but are actually hardware interfaces:

    • unif is a FIFO queue of parameters passed from the CPU side. Each read pops the next value and broadcasts it across all 16 SIMD lanes (hence "uniform").
    • elem_num holds (0, 1, 2, ..., 15), giving each SIMD lane its own index; a “useful constant.”
    • vr_setup/vw_setup, vr_addr/vw_addr, vr_wait/vw_wait, vpm control VPM and DMA operations (configure, set address, wait for completion, read/write data). Like the memory-mapped control registers we used for the GPIO interfaces.
  • So the memory hierarchy (fastest/smallest → slowest/biggest) is: QPU accumulators, QPU register files, VPM, TMU caches (L1 per slice, 128KB shared L2; managed automatically and I’m not sure we have control over it), main DRAM.

#vc4asm Assembler for the QPU 

  • vc4asm is an assembler: it takes human-readable QPU assembly (.qasm files) and converts them into the 64-bit binary instructions the QPU hardware executes.
    • While the instructions are 64-bit, note that each gets split into two 32-bit words in our instructions array.
  • It also ships with vc4.qinc, a file of standard macros (i.e., reusable templates like vpm_setup(4, 1, h32(0))) for common hardware configuration. Every .qasm file in this lab includes it at the top with .include "../share/vc4inc/vc4.qinc".
  • Compilation flow: the QPU can only execute instructions from GPU-visible shared memory (allocated via the mailbox), not from normal CPU memory where your compiled C code lives. So you memcpy the assembled binary instructions into that shared region before telling the QPU to start executing. Specifically:
    1. Write kernel.qasm
    2. Run vc4asm -c kernel.c -h kernel.h kernel.qasm
    3. This produces a C array of the binary instructions (kernel.c) and a header declaring its existence (kernel.h)
    4. Your CPU-side .c file #includes that header, then memcpys the array into GPU-visible shared memory so the QPU can execute it. (??? TO BE UPDATED: isn’t the DRAM fully accessible to both CPU and GPU in pi 0?)
  • Assembler directives  (.include, .set, .macro, etc.) are instructions to the assembler, not to the QPU. Controls how the assembler processes your file (e.g. pasting in includes, defining constants, creating macros).
  • Assembler instructions  are pretty straightforward, but some items worth noting:
    • Dual-issue packing means multiple logical instructions can be placed in one line as long as their combination still fits into one opcode. Delimit by ;. e.g., mov ra11, rb11; mov rb11, ra11; ldtmu0 = three things packed into one instruction.
    • Condition codes let you execute an instruction only in the SIMD lanes where a condition is true. First set flags with .setf, then use .ifz/.ifn/etc. on later instructions. This is how you do "if" logic within the 16-wide SIMD as all lanes are executing same instruction.
    • Branches (brr, bra) jump to a label. Gotcha: branches have 3 delay slots, i.e. the 3 instructions after a branch still execute before the branch takes effect. Fill them with nops unless you can do something useful there.
    • Fun fact: mov is not a real QPU instruction. Writing mov ra5, r1 actually emits something like or ra5, r1, r1 (a value OR'd with itself is just itself).

#Part 0A: deadbeef

The “hello world” program here is to have the QPU write 0xDEADBEEF into the output buffer, DMA it out to DRAM, and then have the CPU check if 0xDEADBEEF is sitting in the output buffer. Deadbeef just refers to a recognizable debug value.

The struct GPU is a programmer-defined layout for a single shared memory block that keeps everything the GPU program needs (buffers, code, uniforms, and launch metadata) organized in one contiguous allocation so both the CPU and GPU can access it:

struct GPU
{
  uint32_t input[SOME_SIZE];
  uint32_t output[SOME_SIZE];
  ///Other program-buffers as needed
  uint32_t code[CODE_SIZE];
  uint32_t unif[NUM_QPUS][NUM_UNIFS];
  uint32_t unif_ptr[NUM_QPUS];
  uint32_t mail[2];
  uint32_t handle;
};

TLDR of deadbeef part:

  1. mem_alloc a block for the CPU and GPU to access
  2. Fill output[] with 0xFFFFFFFF so we can confirm it changes
  3. memcpy the assembled QPU binary into code[]
  4. Set uniforms: for deadbeef, just the GPU address of output[] so the QPU knows where to write the output to
  5. Set unif_ptr to tell the scheduler where each QPU's uniforms start
  6. Set mail to the launch command: give (1) GPU address of code and (2) GPU address of unif_ptr
  7. Send mail through the mailbox to kick off execution
  8. Check output[] and if it now contains 0xDEADBEEF instead of 0xFFFFFFFF, the whole pipeline worked

Details:

  1. mem_alloc and mem_lock are mailbox utility functions, as defined in mailbox.c.

    • handle = mem_alloc(sizeof(struct GPU), 4096, GPU_MEM_FLG); asks the GPU firmware to reserve a chunk of memory and gives you back a handle (like a ticket number). We stash it away for unlocking with ptr->handle = handle;.
    • vc = mem_lock(handle); takes that handle, pins the memory in place, and gives you the actual GPU bus address where it lives (because in theory the GPU mem manager could relocate unlocked allocations).
  2. On the line *gpu = ptr;: after gpu_prepare returns, gpu in notmain points to the CPU-addressable struct GPU, and everything is initialized and ready to go.

On mail[1], unif_ptr, unif_ptr[0], and unif_ptr[0][0] as this was quite confusing:

  • gpu_fft_base_exec_direct (which bypasses the mailbox, see quote below) needs unif, an array of per-QPU pointers, where each element is the GPU address of that QPU's uniforms.

    There's technically a mailbox call to execute code on the GPU, but we couldn't figure out how to get that to work (please let us know if you do!). Instead, to get the code to run we had to directly write to the GPU control registers as you would any of the other hardware peripherals we've worked on.

  • Hence: unif_ptr[0] is the GPU address of unif[0] (points to the uniform array for QPU 0)

  • unif[0][0] is the actual uniform value (in this case, GPU address of output array). Why? Remember the uniform is special and treated as a FIFO: each time the kernel does mov rX, unif, it pops the next 32-bit value off the queue.

  • mail[1] is the GPU address of unif (the base of the whole 2D unif table)

And that's it! A complete hello-world program on the GPU, and we're only ... about 2,700 words into the README.