Cycle Counting, Memory Stalls, Prefetch and Other Pitfalls

When writing a video game emulator, the first and most foundational step is to implement the system’s CPU. At first glance, this seems to primarily involve implementing the instruction set. After all, the instruction set is what actually specifies what the system does at a basic level. However, just as important, is another feature of the CPU, known as cycles.

Cycles and pipelines and stalls, oh my! ⬨

Below the instruction execution is something known as the clock, which drives the timing of the processor. When seeing how many Hz a processor has, this number specifies how many times per second the clock oscillates, and each oscillation is one clock cycle. The clock is extremely important for all aspects of the system, but it’s not always obvious how. On more modern systems, cycles get abstracted away into a general “how fast” a system is, and with some emulators, keeping track of cycles isn’t that important. However, for older game consoles (certainly everything from the very beginning to the fourth generation of consoles and seventh generation handhelds, but my knowledge vaguely ends at that point), cycle counting is a fundamental aspect. These consoles have no operating system to speak of (with the exception of seventh generation handhelds other than the DS, and some assorted outliers), and everything runs directly on bare metal. While operating systems tend to abstract away fundamental parts of the hardware by confining games to user space, which can be preempted by the kernel at any time, the only thing that can preempt bare metal code is an interrupt. As a system gets more complex, factors like the operating system introduce entropy that makes execution become increasingly non-deterministic, so keeping close count of the cycles becomes less important, and enables high-level emulation to optimize routines that aren’t part of the game, but part of the system itself. But on older systems, everything must be emulated in approximately the right time, or else things fall apart. The way this is done is called cycle counting.

On the simplest of systems, each instruction may occur in exactly one cycle. This process involves fetching an instruction to execute from program memory (historically, this memory is read-only, hence the term ROM, and may be mapped separately from writable memory via various memory models, but this is beyond the scope of this article), decoding the instruction in hardware so it knows which part of the processor to use, and finally executing the instruction itself. Each stage of this process contributes to something known as the pipeline, and on a simple system, these stages are often merged in the emulation model. However, as systems get more complex, some instructions may take more than one cycle. For example, loading values from memory may take a while trying to load the physical memory, or jumping to a different region of the program may take time while the pipeline gets flushed and loaded with new instructions. This is known as a stall or bubble in the pipeline. Having a stall in the pipeline means that the whole processor wastes one or more cycles, causing execution to slow down temporarily. Modern processors have accommodations for these stalls by having more complicated pipelines that can minimize the impact of a stall. But classic video game consoles don’t have anything remotely resembling a complex pipeline.

Cycle accuracy ⬨

While keeping track of how many cycles a processor has executed itself may not seem that important, except for perhaps keeping track of how fast a processor is, there are many more things in a game system than just the CPU. Every other component in the system is typically driven by the same clock, meaning that each of the other compontents, like the video processor and the sound processor, have to be properly synchronized with the CPU by counting cycles. Otherwise, the concept of a frame might become decoupled from the CPU, which causes all sorts of issues. Keeping track of cycles is absolutely central to getting an emulator to be timed properly.

Depending on the system and how a game is programmed, there might be a strong necessity to get proper timing between the CPU and other subsystems. The concept of cycle accuracy comes into play, which is a term that gets thrown out a lot in video game emulation. But what does it mean? Well, at the strictest level, it means that everything that happens in the same cycle on hardware is properly counted and timed in the emulator. Doing this is often prohibitively expensive and a hugely unnecessary level of accuracy. But for perfect emulation of some systems, it may be a necessity.

Take an example where each scanline of the game takes 120 cycles. One pixel is drawn each cycle, and when it reaches the end of the scanline, it gathers information on what will be drawn in the next scanline. Now let’s say instructions all take one cycle unless they need to load from memory. If we need to move sprites around on the screen, we need to make sure to do it in those 120 cycles, and the software might be written in such a way that it knows exactly how long this takes. After all, the whole system is timed predictably. If the instructions take too long, sprites won’t get moved for the next scanline. If the instructions take not long enough, sprites will get moved on the wrong scanline. A naïve implementation may assume that memory accesses are free, and then suddenly this game has sprites in the wrong place. For a system like this, cycle accuracy is critically important. However, in practice, timing on systems is not quite so tight, and there are more complexities to the cycle counting for both the CPU and the graphics processing.

Game Boy Advance cycles ⬨

The Game Boy Advance contains an ARM7TDMI, a relatively old CPU from 1998 that runs the antiquated ARMv4T architecture. (ARM decided that having the processor version be of the form “ARMnumber” and the architecture version be “ARMvnumber”, was a good idea, and has confused precisely every single person who has tried to understand it at first.) ARMv4 is a simple architecture compared to the more modern ARMv7A or ARMv8 (AArch64) that runs on most modern devices like phones and tablets. However, they’re historically related. The ARMv4T, in particular, has two execution modes, called ARM mode, which uses fixed-length 32-bit instructions, and Thumb mode, which uses fixed-length 16-bit instructions. This is different from other classic game consoles, which have variable length encodings as most CISC systems have, but ARM is a RISC system, which tends to imply fixed-length instructions. Additionally, unlike most other classic video game systems, the GBA has 32-bit registers, and 16 general-purpose registers. Older consoles tend to have 8-bit or 16-bit registers, which is where the terms 8-bit and 16-bit for these systems derive, and not very many registers at all.

As described in a previous post, the ARM7TDMI has a three stage pipeline. Like the example used in this article, the stages are fetch, decode, execute. First, an instruction is loaded from memory, then it’s decoded, then executed. However, it’s not all done at once. When one instruction is executing, the next instruction is decoding, and the instruction after that is being fetched. The ARM7TDMI also has four different types of cycles, on which the CPU clock may stall for a different amount of time: S cycles, the most common type, refer to sequential memory access. Basically, if you access memory at one location and then the location afterwards, this second access is sequential, so the memory bus can fetch it more quickly. Next are N cycles, which refer to non-sequential memory accesses. N cycles occur when a memory address is fetched that has nothing to do with the previous instruction. Third are I cycles, which are internal cycles. These occur when the CPU does a complicated operation, such as multiplication, that it can’t complete in a single cycle. Finally are C cycles, or coprocessor cycles, which occur when communicating with coprocessors in the system. However, the GBA has no ARM-specified coprocessors, and all instructions that try to interact with coprocessors trigger an error state. Thus, the only important cycles to the GBA are S, N and I.

Per I cycle, the processor will always only take one clock cycle. S and N cycles, however, may stall the CPU pipeline while waiting for memory accesses. If memory accesses are free, each S or N cycle takes only one clock cycle. However, many types of memory access on the GBA are not free, so the CPU will quite often stall.

How long each stall is depends on which region of memory is being accessed. The GBA refers to these stalls as “wait states”. The GBA has nine distinct regions of memory:

The BIOS (region 0), which contains basic interrupt handling code and a handful of utility functions for doing common operations, such as decompressing game data or waiting for the end of a frame. This region has zero wait states, although it can only be read from while executing code from within the region.
The External Working RAM (region 2), which is 256kiB and is used for storing memory that doesn’t need to be handled frequently. There is a 16-bit bus to access this memory, so loading 32-bit values from it requires an additional S cycle above the first N or S cycle to load the upper 16-bits of the value. (E)WRAM has 2 wait states, regardless of if it’s N or S, so the first access is always 3 cycles, as is the second, so loading a 32-bit value will always take 6 cycles. (It is possible to “overclock” the WRAM, but no games do this, and it can crash the system.)
The Internal Working RAM (region 3) is on the same chip as the ARM7TDMI itself, and has no wait states whatsoever. It’s useful for storing 32kiB of active memory, including the execution stack and interrupt handlers.
Memory-mapped I/O registers (region 4) contains 16-bit registers that are used for configuring the system itself. It also has no wait states, but is only useful for configuring the system, not storing anything or executing code.
Palette memory (region 5) contains the color table used for graphics on the screen. There are no wait states here.
Video RAM (region 6) contains all of the graphics that will be drawn to the screen. It’s primarily accessible with 16-bit read/writes and unexpected things happen with 8-bit read/writes. VRAM has no wait states, unless the graphics subsystem is trying to use VRAM, in which case the processor will stall for one cycle. Note that this alone adds a level of complexity making timing not easy to manage. This memory is very fast for executing code if you don’t want to put it in IWRAM. There’s no real reason to do this, but it’s possible to do.
Object Attribute Memory (region 7) contains configuration values for the sprites (referred to as OBJs in the GBA). There are zero wait states here, unless the graphics system is trying to access the OAM at that time, at which point the processor will stall for one cycle.
Game Pak ROM (regions 8–13) contains up to 32MiB of memory that resides on the cartridge. The cartridge bus is also 16-bits, but N cycles and S cycles have different timing on this bus—they are configurable. N cycles can be configured to have either 2, 3, 4, or 8 wait states in these regions. S cycles actually depend on which specific address is used. Region 8/9 has either 1 or 2 wait states, region 10/11 has either 1 or 4, and region 12/13 has either 1 or 8. Regions 10-13 are mostly unused, however, so most of what’s important is region 8/9.
Game Pak RAM (region 14) contains rewritable memory on the cartridge. The bus is only 8 bits wide, and wait states are configurable to be either 2, 3, 4, or 8 cycles. There are no special S cycles on this bus.

The interactions between the S, N and I cycles with ARMv4T instructions is well documented in Martin Korth’s GBATEK document, if you keep in mind that ARM is 32-bit (hence two fetches on the cartridge bus) and Thumb is 16-bit (hence one fetch). However, one additionally configurable setting that affects timing is the extremely poorly documented Game Pak prefetch.

Game Pak prefetch ⬨

Accessing ROM is slow. Accessing ROM can be very slow. To help alleviate this, Nintendo added a feature that will, when the cartridge bus is idle, pull the next instruction from the cartridge automatically, reducing the number of wait states that the processor needs to stall to load the next instruction. This prefetch unit is situated between the memory bus and the cartridge bus, allowing it to operate independently of the memory bus. As such, it can load from the cartridge bus even while the memory bus is busy accessing other memory. However, since every instruction requires accessing the ROM to get the next instruction, this means that most of the time, the cartridge bus is active. Thus, only a select class of instructions can trigger the prefetch unit.

As previously mentioned, there are three types of cycles that are relevant to the GBA, S, N, and I, and two of these types of cycles are affected by the region of memory that the bus tries to access. Some regions of memory are slow, like WRAM and ROM. Some are fast, like IWRAM and VRAM. But regardless of if the memory is fast or not, if code is executing in a region of memory, it will be continuously accessing memory in a row. S cycles are good for this: if you read one instruction, and no other memory, then you can read the next instruction quickly. Many instructions, however, interrupt this process. Imagine there’s a cursor that points to the current block of memory to be read. Moving it one step ahead is like pressing the arrow key. It’s already there, under your cursor, so seeking is fast. This is what S cycles are like. But if you have to move it somewhere else, that’s an N cycle, and then, notably, moving it back to read the next instruction is also an N cycle. All types of cycles take a base of 1 cycle, plus however many wait states are needed to read out that region of memory. What prefetch does is, while executing memory from ROM, if the cursor isn’t pointing at the ROM, it will keep reading sequentially from the ROM, until it’s moved back. Since it’s already been read, the wait states on both N and S cycles disappear, making it irrelevant that the cursor was moved.

The timing is very important here. Prefetch can only operate during I cycles, or N or S cycles which are not being used to access the ROM. Thus, if memory is being read from anywhere else on the system, one wait state disappears per cycle used for reading that memory. Once one set of wait states for a ROM load has gone by, the 16-bit value from that address is stored in the prefetch buffer, which can contain up to 8 values. Thus, during any I cycle, or during an N or S cycle that doesn’t touch ROM, prefetch can save time loading nearby ROM values.

The following diagram represents what happens without prefetch with the instruction sequence str r1, [r0] / nop, which loads the memory at the address in register r0, and stores it into r1, then does no operation for one instruction. In this diagram, we assume that r0 is not an address in ROM, and an address that has 0 wait states. Also, N cycles in ROM are assumed to be delayed with 3 wait states, and S cycles with 1.

This diagram requires a bit of explanation. The first line represents the clock. Each rising edge indicates the end of one clock cycle and the start of the next. The second and third lines are the memory bus: the second indicates the value that is currently on the bus, and the third line is the memory that is being requested. The fourth line shows that, in this diagram, the prefetch unit is disabled. In a subsequent diagram, it shows when the unit is in operation. The next line shows W the current operation of the cartridge bus. When it’s low, the bus is idle. When it’s high, it means the bus is waiting to get data back, and the data is shown when it is available on the bus. The sixth line shows which piece of data the CPU is currently processing. When it’s low, it means the CPU has stalled. The final line shows which instruction is currently being executed. Lastly, the type of each cycle is indicated. S cycles are blue, while N cycles are orange.

As can be seen, before the str (“store”) starts, the memory at the program counter (or pc) is loaded into the CPU. The str starts and requests the memory at r0. Since this memory is fast, it takes zero wait states and the operation completes in one cycle. However, it still moves the memory cursor and is an N cycle. Next, it tries to load the instruction for the end of the pipeline. Since it moved the cursor again, this is an N cycle and takes 3 wait states for the memory to reach the bus. On the final cycle, it loads the memory into the CPU and the next instruction can start. nop (“no operation”) does nothing, and loads the next instruction. This is an S cycle, so it must wait one wait state for the instruction to load, then it can continue. In total, these two instruction take 7 cycles to finish.

The next diagram shows what happens if prefetch is enabled on the same set of instructions. Nothing else has changed.

Here, the N cycle from loading r0 is still involved, but since it’s an N cycle that does not require accessing ROM, the prefetch unit can run. It triggers, via an S cycle, a load from the ROM. Since this is an S cycle, only one wait state is involved, so the load is finished the next cycle, which happens to be exactly when the CPU requests the next location in the ROM. Thus, the load, from the CPU’s perspective, only takes one cycle, and the wait states have disappeared. This saves three cycles, reducing the time of these two instructions to four cycles. Had the str taken an additional cycle, the wait state from the nop waiting for the next instruction would have disappeared, as well, reducing the nop to one cycle (although the previous instruction would have taken at least three).

All in all, cycle counting on the Game Boy Advance can get very complex, and, to date, no emulator has ever come close to emulating prefetch accurately. In the interest of furthering the field, I’ve begun work on a test suite for the Game Boy Advance. For now, it only contains timing tests, but it remains under heavy development until it’s a thorough test suite for all of the Game Boy Advance’s hardware. This test suite is available on GitHub, and a thread containing a version history as well as builds is available on the forums. The examples in this article can be found in the test suite as strh r3, [sp] / nop with configurations Thumb/ROM .NS and Thumb/ROM PNS. mGBA has made huge advancements on full emulation of prefetch as a result of this test suite, but there is much more than only the very basic interaction is illustrated here. Larger interactions get significantly more complex, and have led to issues, including performance regressions, in mGBA. The first version of mGBA to feature advanced prefetch emulation will be 0.3.0, and preliminary support is available in the nightlies.