Emulation Accuracy, Speed, and Optimization
A commonly discussed topic in emulation is the “accuracy” of an emulator. The term means how close the emulation is to the behavior of the original hardware. However, there is a significant amount of complexity hidden behind this simple term. There is no simple metric for what makes an emulator “accurate”. Accuracy in emulators is thought to mean it’s slower, but has fewer bugs. Less accurate emulators are often said to be faster and “good enough” for most games. While there is a kernel of truth to these claims, there is far more to the reality of the matter.
One of the most prevalent terms used to describe emulation accuracy is “cycle-accuracy”. The term has a specific meaning, but it is often misunderstood and over-broadly applied. Cycle accuracy, loosely, means that every single aspect of the emulated system occurs at the correct time relative to everything else. For many systems with tight timing and more direct access to the hardware, especially older systems, cycle accuracy is a key aspect of highly accurate emulation.
The word “cycle” in this term refers to the fundamental unit of timing in digital logic: the clock cycle. The commonly discussed MHz and GHz quantities describing systems and processors refer to the frequency of the clock on that system. Thus, for a system that has a 16MHz processor, such as the Game Boy Advance, this means that the processor runs 16 million cycles per second. Another important piece of hardware, the bus, sometimes has a different clock rate. A bus is an interconnect that transports data between various components of the system, such as between the CPU and main memory. While the bus often runs at the same speed as the CPU, such as it does in the GBA, this is not a guarantee: in the Nintendo DS, there are two processors, the ARM7 (the same CPU as the GBA) and the ARM9, which run at different clock speeds (approximately 33MHz and 67MHz respectively), however the bus runs at the same 33MHz as the ARM7. As a result, the ARM9 may have to wait on the slower bus to get data from memory that it could get more quickly if the bus were the same speed as the ARM9’s clock rate.
Cycle-count accuracy is a similar concept, but instead of every piece of hardware being emulated on the correct cycle relative to other components that act concurrently, each component acts atomically and takes the correct amount of time, but may not overlap properly with the timing of other hardware. As such, cycle-count accuracy may sound strictly inferior to cycle accuracy, and from a perfect hardware accuracy perspective, it is the case. However, cycle-count accuracy is much a much easier style of emulation to design, implement, and maintain. It is a common misconception that mGBA is now or will become cycle accurate, but to do so would require a major rewrite of some of the foundational elements of mGBA. When implemented well, cycle-count accuracy will produce very similar, and often identical, results.
The primary problem that cycle accuracy solves is correctly emulating different pieces of hardware performing actions on the same cycle. This may sound easy like an easy task: perform all of the individual steps that occur on a given cycle, in order, go to the next cycle, and repeat. This can become quite slow, and introduces the major difference in performance between cycle accuracy and cycle-count designs. Hardware performs all of these steps independently on any given cycle. Software is not able to perform these steps in parallel and must swap between operations. This is computationally expensive and thus slow. Let’s take a look at an example.
Assume for a moment that you have a simple CPU architecture where one operation on the CPU always takes 2 cycles, and the GPU reads a value from memory to a pixel every 1 cycle. In theory, this is emulated one cycle at a time. Perform the first half of the CPU’s operation, then do one GPU pixel draw. Next, do the second half of the CPU’s operation, and do another pixel draw. But there are many hidden complexities beneath the theory. You have to know what part of a CPU instruction happens on any given cycle, which is often implementation dependent and poorly defined. You also have to be able to store the half-finished operation to return to it later. That entails additional complexity over an atomic design, resulting in a performance hit. In an atomic design, operations are not split per cycle. Instead, each operation runs until completion, then further operations can be performed. However, if you get the order of things wrong, there can be visible consequences. One example is when memory accesses can interfere with the GPU operations, incorrect timing can result in incorrect graphics.
Despite these issues, atomic operations are the basis of cycle-count accuracy. When dealing with different pieces of hardware, cycle-count accuracy has the individual operations take the correct number of cycles, and has operations able appear as if they’re occurring in the past. Combined, these create a good, though not perfect, approximation of cycle accuracy. In a cycle-count accurate model, CPU instructions cannot be pre-empted or interrupted, and other pieces of hardware operations are performed in between CPU instructions. However, by being able to schedule these hardware bits to appear in the past, they can all still take the right amount of time. The primary downside of this approach is that concurrent operations which would interact on the original hardware cannot be properly interlaced. However, depending on the age and complexity of the system, such interactions may be exceedingly rare. While older systems are full of edge cases and complex interactions, newer systems are generally more carefully designed and contain protections that prevent such interactions from occurring at all. This makes newer systems a much more attractive target for cycle-count accuracy. It’s faster and almost always “good enough”.
But then, what qualifies as “good enough”? It’s a subjective and contentious subject. If a game is playable without significant, obvious bugs, most players may say it’s good enough. However, for speedrunners and TASing, any lack of accuracy is a problem. For a very long time ZSNES, while wildly inaccurate, was considered by a large swath of the SNES emulation community to be “good enough”. Even with low accuracy, many games can run near perfectly. There were always the edge cases where ZSNES fell apart entirely, but most popular games were emulated well enough that increased accuracy was not a priority. For many people, this was “good enough”. For some people, it still is. But it was far from perfect, and that sat poorly with some people.
This lead to the creation of what is the most well-known example of a cycle-accurate emulator: higan (formerly bsnes). It is legendarily accurate, but also infamously slow. This is in part due to the fact that it is cycle accurate. It uses co-routines to switch between portions of the emulation as needed, which has a lot of overhead. Because higan is the best-known example of a cycle-accurate emulator, it has led to the misconception that cycle accuracy is necessarily extremely slow. However, much of higan’s performance issues are because the emulation is not optimized for speed. This was an intentional decision on byuu’s part to make sure that the code is ultimately readable and understandable, as byuu maintains a strict code as documentation policy. Beyond being a means to play SNES games, byuu treats higan as a preservation project and documentation on the behavior of the SNES itself. It is possible to make a highly optimized and accurate SNES emulator that would be significantly faster than higan without sacrificing much accuracy; however, no one has done this. It’s quite a daunting task, after all.
When I started working on mGBA, my goals for it were for it to be both more accurate than VisualBoyAdvance, and also faster. While these two goals are often in opposition, there is much more to accuracy than just timing. Sometimes it comes down to issues such as incorrect graphics rendering, or invalid memory operations not being emulated properly. I found that VisualBoyAdvance had many areas that lacked great optimization, and many of the accuracy improvements required did not impact speed. With the increased speed, I also had overhead I could use to make accuracy improvements that did impact speed.
Unlike the GBA emulation, the Game Boy emulation in mGBA (also known as mGB), is designed with cycle accuracy in mind. Instruction emulation is divided into tasks which occur on individual clock cycles and in between these operations other hardware can be emulated. However, through many optimizations, such as batching operations instead of running them one at a time (and splitting up a batch as needed when there are concurrent interactions), mGB is quite fast. It is far faster than an unoptimized cycle-accurate implementation would be, without sacrificing accuracy.
A notable example of where cycle-count accuracy is an impediment is in the upcoming DS emulation’s GPU command FIFO. When writing commands to the GPU, the DS constructs a list of commands that have not yet been processed. New commands get appended into this list. However, the FIFO has a maximum size. When it fills up, writes block until the FIFO has enough space for the new commands. On real hardware, if the FIFO is full, the memory bus stalls, causing the ARM CPU to temporarily block in the middle of this instruction. The FIFO is then read by the GPU independently of the memory bus, and finally the memory bus can continue. Since CPU operations in medusa are treated as atomic, stalling the memory bus and processing the FIFO in the GPU in the middle of an instruction is not possible. Instead, the way medusa handles writing to a full FIFO is by caching the value to be written and telling the CPU that it cannot execute new instructions until the FIFO has space and the cached value is flushed into the FIFO. However, the ARM CPU has a set of instructions that allow you to write more than one value into memory at a time. This means that it is possible to write more than one command into the FIFO in a single instruction and the memory bus may stall in between writes: e.g., if you write 3 commands into the FIFO, but it can only fit 1, it will stall before the third one even tries to write. This poses a large problem for medusa. The current approach, which actually violates cycle-count accuracy, is to process enough of the FIFO immediately to write the new commands. This is one specific edge case, but it is incorrect behavior and will need to be addressed.
Going even further away from cycle-accuracy is the concept of high-level emulation, or HLE. Many video game systems, especially since the late ’90s, have programmable components that are not part of the emulated software itself: the components are part of the system itself and do not have major differences between games. Some examples are DSPs (as on the GameCube), system software (as on the PSP and 3DS), and microcode-programmable devices such as the RSP on the Nintendo 64. While these components can be emulated directly (referred to as low-level emulation or LLE), it is actually possible to get away with not emulating them step by step, instruction by instruction. By writing a custom implementation of the component that has the same visible effect to the hardware, but is run as native code, instead of emulated step by step, it is possible to significantly speed up the operation. One of the downsides is that synchronization with different hardware components and proper timing becomes nearly impossible. Furthermore, HLE implementations require significant amounts of research not just on the hardware itself, but also on the microcode that runs on that hardware. Early HLE implementations are often riddled with bugs that wouldn’t be present in LLE, and can be very difficult to debug. LLE implementations, however, require copies of the code to be emulated, which is not always easy to dump, and cannot be distributed with the emulator itself due to copyrights.
Tradeoffs between accuracy and speed are a difficult proposition. For older systems, accuracy is almost always preferable as they are already quite fast to emulate and usually contain tighter timing restrictions. With modern systems, HLE is practically required to emulate the systems at all. It’s a difficult balance, and there are advantages to both sides. In general, accuracy will have fewer issues in the emulated software and be more valuable for preservation, but for speed and emulating recent platforms, accuracy is not always a necessity.