University of Minnesota
Machine Architecture and Organization (sec 010)

CSci 2021 Lab 0xB: Profiling Tools

Introduction

In this lab we will experiment with profiling tools. Profiling tools report how a program executes and measure which parts of the program take the most time, giving great insight into where we might need to optimize our code. In this lab we will use two profiling tools: gprof and kcachegrind. We have written some examples of inefficient code for you to examine and profile. Your job is to run these programs through gprof and kcachegrind to detect where they get hung up. To start, copy all of the relevant lab files with this command:

cp /web/classes/Fall-2018/csci2021-010/labs/0xB/{string.c,matrix.c,kcachegrind} .

gprof

Let's start by examining the file string.c. The program takes two command-line arguments: an integer followed by a single character. Read through the code to get an idea of what it is doing, and try to predict which functions might be the most taxing. Next, let's run it under gprof and find out. To use gprof, you first need to compile the code with the -pg flag.

gcc -pg string.c

This flag instruments the program so that it records profiling data as it runs. Next you need to actually run your code. Since this program takes arguments, make sure to supply them, and choose a rather large number to ensure a long enough runtime for gprof to measure. Try running the program with several different input sizes to see how the execution profile changes as the size increases. Here's an example run:

./a.out 10000 j

Now, if you run an ls command, you should notice a new file called gmon.out. This file was produced by the a.out run and contains the recorded profiling data. Now we can finally examine the program with gprof.

gprof a.out gmon.out

This will print the gprof results to the terminal. If you would prefer the results to be written to a file, instead run:

gprof a.out gmon.out > analysis.txt

There is a lot to look at here. The main things to observe are the two charts: the first is the flat profile, and the second is the call graph. Both are relatively self-explanatory, and each chart is followed by an explanation of its fields. You should notice that one function seems to be eating up a lot of time. How many times is it called? Look back at the code and see if you can figure out why this is happening and whether there is a way to improve it. If time allows, try rewriting the function to make it more efficient.

kcachegrind

Next, let's try out kcachegrind on the second example, matrix.c. Once again, start by examining the source code and making some predictions. The next step is to compile matrix.c. You don't need the -pg flag this time, so a simple gcc matrix.c will do the trick. If you want to name the executable, use gcc matrix.c -o matrix; of course, 'matrix' can be replaced with any name you like. Once you have an executable, run it through valgrind using the callgrind tool (this might take a while to finish; replace ./a.out with ./matrix if you named your executable):

valgrind --tool=callgrind --cache-sim=yes ./a.out

This will create another file titled callgrind.out.[some number]. This is the file you will use with kcachegrind. To open up kcachegrind use:

./kcachegrind callgrind.out.[some number]

After it loads, you should be greeted with an extensive profiling window. There is a lot to look at, and we encourage exploration, but to get you started we'll point out some of the main features.

The first thing you might want to look at is the call graph, found in a tab at the bottom of the screen. It shows a graph of all of the function calls. You might notice that sum_cols and sum_rows have the same percentage. That is because the statistic currently being examined is the number of instructions. To change this, look at the top of the window: there is a dropdown menu currently set to "Instruction Fetch". Click on it and select "Cycle Estimation". This provides some new insight; right away you should notice that in the call graph, sum_cols and sum_rows now have very different percentages.

To understand what this percentage is based on, click on the "Types" tab in the middle display. At the bottom you will see the formula used to calculate the cycle estimation: a combination of instruction fetches, level-1 cache misses, and last-level cache misses. With that in mind, think about why the cache system might affect the performance of these two functions differently. If you aren't sure, just ask a TA. Once again, spend some time exploring all the tabs and tools in the kcachegrind window.