University of Minnesota
Machine Architecture and Organization (sec 010)

CSci 2021 Lab 0xc: Profiling low-level optimizations

Introduction

This week's lab follows up on last week's introduction of profiling tools and uses them to explore two kinds of optimizations related to low-level hardware features that are often important in software: locality for cache efficiency and avoiding branch misprediction. In each of these two areas, we give you different code that implements the same functionality. You can look at the C and assembly code to predict which version will run faster, and then use Valgrind-based profiling to confirm your understanding of why. To start, copy all of the relevant lab files with this command:

cp /web/classes/Spring-2020/csci2021/labs/0xc/{kcachegrind,matrix-sum.c,stringSrc.h,count-vowels.c,count-vowels-noif.c} .

Locality for caching

Our examples of cache efficiency use a program named matrix-sum, which you can compile in the usual way:

gcc -Wall -g -O2 matrix-sum.c -o matrix-sum

This is similar but not quite the same as an example we used in lecture. The two functions sum_rows and sum_cols both add up all of the entries in a matrix, with sum_rows traversing by rows and sum_cols traversing by columns. However, unlike the example in lecture, which was a true multidimensional array, this program uses a two-level array, i.e. an array of row pointers.

The program uses quite a large array, which takes time both to initialize and to sum up. The program takes two command-line arguments to control which of the sum routines to run, the first for sum_rows and the second for sum_cols. So, for instance, the command matrix-sum 0 0 runs just the initialization, matrix-sum 1 0 does the initialization and sum_rows, and matrix-sum 0 1 does the initialization and sum_cols.
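
To make the structure concrete, here is a minimal sketch of a two-level array traversed both ways. The function names, variable names, and the int element type are our assumptions, not necessarily the exact contents of matrix-sum.c:

    /* Sketch only: matrix-sum.c may use different names and types. */
    long sum_rows(int **rows, int n) {
        long total = 0;
        for (int i = 0; i < n; i++)        /* one row at a time */
            for (int j = 0; j < n; j++)
                total += rows[i][j];       /* consecutive addresses */
        return total;
    }

    long sum_cols(int **rows, int n) {
        long total = 0;
        for (int j = 0; j < n; j++)        /* one column at a time */
            for (int i = 0; i < n; i++)
                total += rows[i][j];       /* jumps a whole row per step */
        return total;
    }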

  • To start, look over the source code in matrix-sum.c to find the structure of its inner loops. There are three inner loops that each access every element of the matrix: one for initializing the matrix, one in sum_rows, and one in sum_cols. These are the parts of the program that will be important for performance.
  • Predict the memory operations and their cache performance in the three inner loops. For a single iteration of each of the inner loops, how many memory accesses, either loads or stores, will be performed? Remember that multi-level arrays need pointer accesses in addition to data accesses; but since we're compiling with -O2, the compiler has performed loop-invariant code motion to move computations out of loops when they are the same on every iteration. What will be the L1 miss rate of these accesses? You should assume that the cache uses 64-byte blocks and that the L1 cache is too small to hold an entire row of the matrix. (A worked version of this calculation for the initialization loop appears after this list.)
  • Based on your analysis in the last part, which of sum_rows and sum_cols will be faster? Test your analysis by running time ./matrix-sum 0 1 and time ./matrix-sum 1 0.
  • Next, test your more detailed predictions about cache behavior using the summary statistics from Valgrind's cache simulation. Start with the following command:

    valgrind --tool=callgrind --dump-instr=yes --cache-sim=yes --D1=32768,8,64 --LL=8388608,16,64 --callgrind-out-file=valgrind.out ./matrix-sum 0 0
    

    To explain in more detail what that command is doing: --tool=callgrind selects Callgrind, Valgrind's main profiling tool. --dump-instr=yes requests instruction-level information, and --cache-sim=yes requests cache simulation. The --D1 and --LL options describe the L1 data cache and last-level cache that should be simulated: both have 64-byte blocks, and the L1 data cache is 32KB and 8-way set-associative, while the last-level cache is 8MB and 16-way set-associative. (Valgrind might print a warning about the L3 cache on your system, but you can ignore it because it's overridden by the --LL option.) The option --callgrind-out-file=valgrind.out just tells Valgrind to always use the same output file name, which keeps things a little simpler than having multiple files with different numeric suffixes. The last part is the command to run the matrix-sum program in the mode where it just initializes the matrix and doesn't actually sum anything.

    The last lines in Valgrind's output give statistics on the total numbers of various events it has simulated. The four lines you should concentrate on are the ones labeled I refs, which counts the total number of instructions executed; D refs, which counts the total number of data accesses; D1 misses, the number of data accesses that miss in the L1 cache; and D1 miss rate, the percentage of accesses that miss. The latter three statistics are further subdivided into rd (reads, a.k.a. loads) and wr (writes, a.k.a. stores). The 100 million elements of the matrix dominate all the other aspects of the program's running time, so you should interpret the large numbers as small multiples or fractions of 100 million, and ignore all the numbers under a million.

    Looking at the statistics, you should be able to answer the following questions about the initialization inner loop:

    1. How many instructions are executed in one iteration?
    2. How many loads and/or stores are executed in one iteration?
    3. What is the L1 miss rate of the inner-loop memory access(es)?

    After you're confident you understand these statistics for the initialization code, repeat the experiment with the options to run each of the sum routines, and answer the same questions. The initialization code still runs in those modes, so subtract the initialization-only counts to look at the sum routines on their own.

  • To see more details about where the operations are coming from in the program, you can use the command ./kcachegrind valgrind.out to bring up the same GUI you used last week. Try showing Source Code in the upper-right pane and Machine Code in the lower-right pane, and switch among Instruction Fetch, Data Read/Write Access, and L1 Data Read/Write Miss as the statistics to view. The numbers shown are the percentage of all of those events in the execution that came from a particular line of code. In particular, look at the two memory accesses in the sum_cols inner loop. They come from the same source-code line but two different instructions. Why do they have different miss rates?
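
As promised above, here is a worked version of the access-counting argument for the initialization inner loop. It assumes 4-byte int elements and 64-byte cache blocks; check matrix-sum.c for the actual element type.

    One iteration of the initialization inner loop (with -O2, the row
    pointer is loop-invariant and is hoisted out of the inner loop):
        1 store of one 4-byte element
    A 64-byte block holds 64 / 4 = 16 elements, and the stores walk
    through memory sequentially, so only every 16th store touches a
    new block:
        predicted L1 miss rate = 1/16 = 6.25%

The same style of reasoning, applied to the element loads in the two sum routines and to the row-pointer loads that cannot be hoisted out of the sum_cols inner loop, gives predictions you can compare against the simulated statistics.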

Avoiding branch misprediction

Another hardware feature that can be important for performance is branch prediction. To illustrate this we've written a program count-vowels that counts how many of the letters in a string are lower-case vowels a, e, i, o, or u. The basic implementation in count-vowels.c shows one approach to this task that uses a structure of comparisons organized like a binary search tree. In the inner loop of this program, there is one branch from the null-terminator check, and then some subset of the comparison branches are taken depending on the character value. If we assume we're running this on a long string, which is when you care most about performance, the branch checking for a null character will be easily predictable, because the character is almost never null. However, the comparison branches are designed to be evenly balanced between true and false, and there is no simple pattern to the letters in text, so the comparison branches cannot be predicted well. We'll see that this ends up hurting performance.
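
For concreteness, here is a sketch in the spirit of the comparison-tree approach; the real count-vowels.c may organize its comparisons differently:

    /* Sketch only: the exact comparisons in count-vowels.c may differ. */
    long count_vowels(const char *s) {
        long count = 0;
        for (const char *p = s; *p != '\0'; p++) {  /* predictable branch */
            char c = *p;
            if (c <= 'i') {                         /* split the candidates */
                if (c == 'a' || c == 'e' || c == 'i')
                    count++;                        /* data-dependent branches */
            } else if (c == 'o' || c == 'u') {
                count++;
            }
        }
        return count;
    }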

If you have the time, you might want to stop here to think about how you might rewrite the inner loop of count_vowels to show the compiler how to recognize vowels without branches. Essentially you want a structure that can be compiled into setCC or cmovCC instructions instead, and that probably means not using if statements. One answer is in the file count-vowels-noif.c, and we'll assume we're working with it in what follows, but if you've written your own implementation you can try it in the same way.
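
As a point of comparison, here is one branchless formulation; it is not necessarily identical to count-vowels-noif.c. Each equality test evaluates to 0 or 1, so the compiler can use setCC instructions and addition instead of conditional jumps:

    /* One branchless possibility; count-vowels-noif.c may differ. */
    long count_vowels(const char *s) {
        long count = 0;
        for (const char *p = s; *p != '\0'; p++) {
            char c = *p;
            count += (c == 'a') + (c == 'e') + (c == 'i')
                   + (c == 'o') + (c == 'u');
        }
        return count;
    }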

It turns out that this is a case where manual code changes interact in an interesting way with the compiler's optimizations, so let's actually compile three versions of this program. First we'll compile the original code with -O2. Then we'll compile the version without if statements at both -O0 (no optimization) and -O2. We'll see that at -O0 the compiler produces a straightforward translation of our code, while at -O2 it replaces it with something different and even better. Use the following three compilation commands:

gcc -Wall -g -O2 count-vowels.c -o count-vowels-O2
gcc -Wall -g -O0 count-vowels-noif.c -o count-vowels-noif-O0
gcc -Wall -g -O2 count-vowels-noif.c -o count-vowels-noif-O2

The performance of the real programs may vary between different machines, but in our testing (which you can check with time), the noif-O0 version was slightly faster than the baseline O2 version, despite having no optimization, and noif-O2 was by far the fastest.

To confirm in more detail how the branch predictor performs, we can use Valgrind's simulation of it too, using a command like:

valgrind --tool=callgrind --dump-instr=yes --branch-sim=yes --callgrind-out-file=valgrind.out ./count-vowels-O2

This is similar to the command we used for cache simulation, but we've removed the cache simulation options and replaced them with --branch-sim=yes.

The key statistic we're interested in is printed on the last line, the misprediction rate for conditional branches. Valgrind also simulates the misprediction rate for indirect jumps (labeled ind), but for this program they're too rare to matter. The baseline version of the program has a 19.7% misprediction rate in Valgrind's simulation, which is not very good. Remember that one of the branches in the inner loop (the null check) is almost 100% predictable, and it's hard for a predictor to do much worse than 50% on a conditional branch (just like it's hard to score much below 50% on a true-false test). So if the inner loop had just one comparison that was completely unpredictable, the misprediction rate would average out to about (0% + 50%) / 2 = 25%. 19.7% is a bit lower than that, but not by much.

The noif-O0 version fulfills the narrow goal of avoiding branch misprediction very well. The vowel counting is achieved without branches, and the null check is very predictable, so the misprediction rate rounds to 0.0%. (The total number of branches, about 145 million, is almost exactly the number of characters processed.) However, the total number of instructions is more than twice that of the baseline, which has a countervailing negative effect on performance.

The noif-O2 version has the smallest number of instructions and less than half the mispredictions of the baseline, which is enough to give good performance even though there are still some mispredictions. There are two branches per character in the inner loop: the usual predictable one, and another that is mispredicted about 30% of the time. You can investigate what's going on by looking at the compiler's assembly output:

gcc -S -O2 count-vowels-noif.c -o count-vowels-noif-O2.s

Here's what the inner loop of count_vowels looks like for one compiler version:

        movl    $1065233, %edx
.L4:
        subl    $97, %ecx
        cmpb    $20, %cl
        ja      .L3
        movq    %rdx, %rax
        shrq    %cl, %rax
        andl    $1, %eax
        addl    %eax, %r8d
.L3:
        movzbl  1(%rdi), %ecx
        addq    $1, %rdi
        testb   %cl, %cl
        jne     .L4

Can you figure out what this code is doing? You might notice that 97 is the ASCII code for lowercase a, and that 1065233 is 100000100000100010001 in binary. Why is there a shift involved? The unpredictable branch is an unsigned comparison checking whether something is greater than 20; what is it for?
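
If you get stuck, one way to check your reading is to translate the loop body back into C. Here is our rendering, with variable names of our own choosing:

    /* Our C rendering of the loop body above. */
    unsigned idx = (unsigned char)(c - 'a');  /* subl $97: 'a' maps to 0 */
    if (idx <= 20)                            /* cmpb $20 / ja: skip if out of range */
        count += (0x104111u >> idx) & 1;      /* 0x104111 == 1065233 */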