Mechanical Sympathy: Coding for CPU Performance
omnivore computer-architecture good-read!
Highlights
Modern processors employ sophisticated mechanisms for executing instructions efficiently. Writing code that takes advantage of these mechanisms for delivering maximum performance is what I call “hardware-aware coding”. ⤴️
Instruction Pipelining ⤴️
A single instruction during its execution doesn’t use all the processor resources.
For example, if the instruction is accessing memory then the execution units (adders, multipliers) remain unused. ⤴️
Instruction processing is pipelined in the processors to enable them to process multiple instructions simultaneously. The pipeline consists of several stages and as one instruction moves from the first stage to the second, it makes space for a new instruction to be introduced into the pipeline. ⤴️
simpler architectures use a five-stage pipeline, while high-performance architectures (such as x64) have much deeper pipelines consisting of 15-20 stages. ⤴️
Fetch: An instruction is fetched from memory. ⤴️
Decode: The instruction is decoded to identify the operation (e.g. add, multiply, divide) and its operands. In CISC style processors (e.g. X86), the instruction is also broken down into one or more simpler instructions, called microinstructions (μops). ⤴️
Execute: Decoded μops are picked up and executed (assuming all the required operands are available). ⤴️
Memory access: Any memory reads or writes required by the instruction are performed. ⤴️
Write back: The final instruction result is written back to the destination. ⤴️
while an instruction is being executed, others might be getting fetched, and decoded ⤴️
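As a purely conceptual sketch of this overlap (the stage annotations in the comments are illustrative, not measured), three independent additions like these can occupy different pipeline stages in the same cycle:

```c
/* Three independent additions: while the first is in the Execute stage,
 * the next can be in Decode and the third in Fetch. The per-cycle stage
 * assignments are hypothetical and depend on the microarchitecture. */
static int pipeline_demo(int x, int y, int z) {
    int a = x + 1;   /* e.g. Execute */
    int b = y + 2;   /* e.g. Decode  */
    int c = z + 3;   /* e.g. Fetch   */
    return a + b + c;
}
```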
Superscalar Instruction Execution ⤴️
modern processors have also been equipped with multiple execution resources. For example, they have multiple ALUs for arithmetic computations, and multiple load/store units for memory operations. But, utilizing these requires that the CPU can issue multiple instructions in parallel ⤴️
Modern X64 processors can issue up to 4 new instructions each cycle and with their deep pipelines, they can have hundreds of instructions in different stages of completion. Under a steady state, such a processor can deliver a throughput of up to 4 instructions/cycle (peak instruction throughput). ⤴️
Out-of-Order Execution ⤴️
instruction level parallelism (ILP) requires that the processor is able to find enough independent units of work which can be executed in parallel in the same cycle. ⤴️
processor looks at a larger window of instructions to identify independent instructions that can be executed in parallel.
This naturally means that the processor executes instructions out of their natural program order. This is called out-of-order execution, and it enables processors to find and execute multiple instructions in parallel. Even though the processor executes instructions out of order, their results are committed to the process state in their original order, which ensures that the program logic remains correct. ⤴️
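A minimal sketch of what the out-of-order engine can exploit (the function and variable names are hypothetical): the multiply chain below is serialized by its data dependencies, while the unrelated addition can execute as soon as its operands are ready, regardless of where it appears in program order.

```c
/* The loads and multiplies form a dependent chain; the independent
 * addition has no dependency on it, so the CPU is free to execute it
 * earlier than its position in the program suggests. */
static long ooo_demo(const long *p, long x, long y) {
    long chain = p[0];        /* load                       */
    chain *= p[1];            /* depends on the load        */
    chain *= p[2];            /* depends on the multiply    */

    long independent = x + y; /* independent of the chain   */
    return chain + independent;
}
```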
idea of mechanical sympathy. By organizing our code in a way which works in synergy with the way the hardware works, we can get maximum performance possible. ⤴️
In order to fully utilize the superscalar capabilities of the hardware, there have to be enough independent instructions in the code. But often, there are situations where the code is written such that we have a sequence of instructions where the next instruction depends on the previous one’s result. ⤴️
optimization to fix such a bottleneck is called “loop unrolling”. It basically means executing multiple steps of the loop in a single iteration. ⤴️
Usually, we don’t have to do this kind of optimization ourselves; compilers these days can do it for us. However, when you notice below-par instruction throughput, you may want to check the compiler’s output to see what is going on. ⤴️
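A hedged sketch of loop unrolling with independent accumulators (the function names are invented; for brevity it assumes n is a multiple of 4, and an optimizing compiler may already emit similar code):

```c
#include <stddef.h>

/* Serial reduction: every addition depends on the previous one,
 * so the dependency chain caps instruction throughput. */
static float sum_simple(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by four with independent accumulators, so several
 * additions can be in flight in the same cycle. A real version
 * would also handle the remainder when n % 4 != 0. */
static float sum_unrolled(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i + 3 < n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

With floating-point data, compilers often won’t reorder the additions on their own because that changes rounding, which is one case where manual unrolling with separate accumulators can still pay off.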
observation that at certain hours of the day, certain items are usually in demand, and keeping their ingredients at hand helps prepare those items faster. The technical name of this property is temporal locality, which means an item which is required now is very likely to be required in the near future as well. ⤴️
limited space, so when they run out of room, the restaurant needs to throw away some of the ingredients to make space for the ones which are required right now but not present in the cache. This is called the cache eviction policy. ⤴️
Apart from the local shelves and fridges, the restaurant also builds a bigger cache in the basement from where any ingredient can be fetched in 2 minutes. And whatever is not present even in the basement gets delivered from the market. This gives rise to a hierarchy of ingredient caches, each with larger capacity and longer access times. ⤴️
the CPU needs the operand values of an instruction to be available on the chip in order to execute it. ⤴️
All the program data resides in the main memory, and it takes 100-200 cycles to read any data from there. In the above example, if both the values need to be fetched from main memory, the simple add operation will take 200-400 cycles which is extremely slow. For comparison, the CPU can do the addition within a single cycle if the operands are in its registers, so this is 200x slower at the very least. ⤴️
processors come with a hierarchy of cache memories. These days we usually have three levels of caches in the processors, called L1, L2 and L3 caches, each larger, slower, and farther from the CPU than the previous level. ⤴️
access latency of the L1 cache is usually 3-4 cycles which is significantly faster than main memory but it is of very small capacity, 32-64 kB on modern X64 hardware. ⤴️
processors also keep the most recently used data in the L1 cache, anticipating temporal locality. ⤴️
As these caches are small, they also employ an eviction policy to make space for new data. Most processors implement some approximate variant of the least recently used (LRU) eviction policy, which means evicting cache lines that have not been accessed for a while. ⤴️
how to profile all of this? where can i see inefficiencies that i can then optimise? esp for gpus?
Spatial Locality of Data ⤴️
the memory bus can only be asked for data at a single location in main memory at a time. This means that accessing values at different addresses requires multiple trips and costs hundreds of cycles. ⤴️
caches are optimised for spatial locality of data. They assume that related data is stored contiguously or nearby in memory, and instead of bringing just the value at the requested address into the cache, the cache brings in an entire block of data. The block size depends on the memory bus capacity and is usually 64 bytes on present-day hardware. ⤴️
CPU caches are also organized to cache these entire blocks rather than caching the exact value that the application requested. Within the cache, these blocks are called cache lines and these are also 64 bytes in size (on X64 hardware at least). ⤴️
When an application requests an 8-byte long value stored at memory address 130, the hardware doesn’t retrieve only those specific bytes. Instead, it fetches the entire 64-byte block starting at address 128 (the nearest 64-byte aligned address) and loads this block into one of the available cache lines. Subsequently, if the program attempts to access any value within the address range of 128-191, that data will already reside in the cache, eliminating the need for additional time-consuming trips to main memory. ⤴️
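The aligned block address in that example can be computed with simple bit masking; a minimal sketch assuming 64-byte cache lines:

```c
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64  /* assumed line size, as in the example above */

int main(void) {
    uintptr_t addr = 130;
    /* Clear the low 6 bits to find the 64-byte aligned block start. */
    uintptr_t line_start = addr & ~(uintptr_t)(CACHE_LINE - 1); /* 128 */
    uintptr_t line_end   = line_start + CACHE_LINE - 1;         /* 191 */
    printf("address %lu lives in the cache line spanning %lu-%lu\n",
           (unsigned long)addr, (unsigned long)line_start,
           (unsigned long)line_end);
    return 0;
}
```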
Mechanical Sympathy with the Caches
These caches in the CPU improve the system performance only when they are used effectively. They are designed anticipating temporal and spatial locality, but if access patterns are more random, they become ineffective.
Taking advantage of the caches requires careful consideration of the data structure layout and memory access patterns. ⤴️
in the case of data structures, sometimes we need to optimize the layout of our data structures to take advantage of the spatial locality. The fields which are usually accessed together should be kept together in the object definition so that when one of those fields is accessed, the remaining fields are also cached as part of the same cache line. ⤴️
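A hypothetical illustration of such a layout (the struct and field names are invented): fields read together on the hot path sit next to each other, so one cache-line fill serves all of them, while rarely used fields are pushed to the end.

```c
struct order {
    /* hot: accessed together on every request, likely to share a line */
    double price;
    int    quantity;
    int    status;

    /* cold: only touched for reporting or debugging */
    long   created_at;
    char   note[128];
};
```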
related optimization about data structure layout is keeping the read-only and read-write fields in separate cache lines. Whenever a field is modified, the entire cache line containing other fields and values becomes dirty. If some of the other processor cores have also cached the same cache line to access the read-only fields, their cache line becomes invalid. The next time these cores try to access this cache line, they will have to sync the latest value using cache coherency protocols, which adds a delay to the cache access process. ⤴️
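A minimal sketch of this separation in C11 (the struct and field names are hypothetical, and 64 bytes is the assumed cache-line size): the frequently written counter is given its own line so writes to it do not invalidate the line holding the read-mostly fields cached by other cores.

```c
struct stats {
    /* read-mostly: can be cached by many cores without invalidations */
    int max_connections;
    int timeout_ms;

    /* read-write: aligned to start on its own 64-byte cache line */
    _Alignas(64) long request_count;
};
```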
Another situation where you can apply your understanding of cache mechanics is in optimizing your data access patterns - the specific ways your code retrieves and manipulates data. ⤴️
Developers often overlook their data access patterns when designing code. A typical scenario involves reading a specific value, which triggers the loading of an entire 64-byte cache line. However, instead of immediately utilizing the neighboring data now readily available in cache, the program executes unrelated operations. By the time the code requires additional values from that same cache line, it may have already been evicted, forcing another costly memory fetch. ⤴️
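A classic illustration of access-pattern effects, sketched here with an invented matrix-summing example: traversing a 2D array along rows uses each fetched cache line fully, while traversing it along columns touches one element per line before jumping far away.

```c
#include <stddef.h>

#define N 1024

/* Row-major traversal: consecutive iterations touch adjacent addresses,
 * so every 64-byte cache line is fully consumed before moving on. */
static long sum_rows(const int m[N][N]) {
    long s = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal of the same array: each access lands
 * N * sizeof(int) bytes away from the previous one, so most of every
 * fetched cache line goes unused and may be evicted before reuse. */
static long sum_cols(const int m[N][N]) {
    long s = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```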
in some cases optimizing compilers can detect such poor data access patterns and they can generate code that is cache friendly. ⤴️
modern processors have cache prefetching which means that the hardware detects the access patterns and starts to prefetch the cache lines before they are accessed to prevent cache misses. ⤴️
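Where the hardware prefetcher is not enough, GCC and Clang also expose a software prefetch hint. A hedged sketch follows; the prefetch distance of 16 elements is an arbitrary illustrative choice, and for a simple linear scan like this the hardware prefetcher usually makes the hint unnecessary.

```c
#include <stddef.h>

/* Hint that an element 16 iterations ahead will be needed soon,
 * using the GCC/Clang __builtin_prefetch intrinsic. */
static long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 3 /* keep in cache */);
        s += a[i];
    }
    return s;
}
```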
speculation comes with costs. Whenever the prediction is wrong, whatever they had in the order pipeline needs to be thrown away, and they have to start from scratch to prepare the right item. This adds extra delay. ⤴️
speculative instruction execution. While the hardware can execute instructions very fast, fetching new instructions from memory takes time. To keep the execution resources in the processor busy, the instructions must be supplied at a fast rate. ⤴️
program structure is not always linear. All non-trivial programs consist of branches in the form of if/else blocks, switch cases, loops, and function calls. ⤴️
avoid this performance degradation, the CPU implements branch predictors to predict the target address of branches. ⤴️
whenever the branch predictor is wrong, there is a penalty on performance as well. When the processor finally evaluates the branch condition and finds that the branch predictor made a misprediction, it has to flush the pipeline because it was processing the wrong set of instructions. Once the pipeline is flushed, the processor starts to execute the right set of instructions. ⤴️
Modern processors have sophisticated branch predictors which can learn very complex branching patterns and offer accuracy of up to ~95%. But they still depend on predictable branching patterns in code. ⤴️
Under the hood, the branch predictor tries to associate the jump result (jump taken/not taken) with the address of the branch instruction. It starts with a default guess (e.g. branch not taken) and then updates itself based on the actual branch condition result. So if the branch is usually taken, the branch predictor eventually learns to predict that.
However, if the branching condition is random, the predictor will have a hard time learning it. ⤴️
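A small sketch of the difference (the function names are invented): when the sign of the inputs is random, the branchy version mispredicts frequently, while the branchless version turns the comparison into data and leaves nothing to predict.

```c
#include <stddef.h>

/* Branchy version: if a[i] > 0 is effectively a coin flip, the branch
 * predictor cannot learn a pattern and the pipeline is flushed often. */
static long count_positive_branchy(const int *a, size_t n) {
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0)
            count++;
    }
    return count;
}

/* Branchless version: the comparison result (0 or 1) is consumed as
 * data, so there is no conditional jump to mispredict. Sorting the
 * input first is another common way to make the original branch
 * predictable. */
static long count_positive_branchless(const int *a, size_t n) {
    long count = 0;
    for (size_t i = 0; i < n; i++)
        count += (a[i] > 0);
    return count;
}
```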
Instruction pipelining and superscalar execution enable multiple instructions to be processed simultaneously, but only when we provide independent operations ⤴️
Memory hierarchy and caching dramatically reduce latency, but only when our data structures and access patterns exhibit good locality ⤴️
Speculative execution keeps the pipeline full through branches, but becomes a performance liability with unpredictable branching patterns ⤴️
The most effective optimization approach is to:
- Write clean, maintainable code first
- Profile to identify performance bottlenecks
- Apply mechanical sympathy principles specifically to those hot spots ⤴️