Development of Secure Software Systems

CSci 4271 Lab 5

The first half of today's lab is a series of 7 shorter problems that form an obstacle course of challenges related to buffer and integer overflow. The 7 problems are embedded into a single binary in a way that might be reminiscent of the 2021 Bomb Lab, if you did that activity. Then the second half walks though carrying out a format string attack. This almost certainly too much to finish in 50 minutes, so you might want to consider either picking one of the topic areas to focus on in lab and saving the other to look at later.

Overflow obstacle course

Each stage of the obstacle course is a C function that reads a line from the input file into a buffer that might be too small. To make the memory layout more predictable, the vulnerable buffer is always in a structure, and the amount of the overflow is limited to at most the size of the structure. (This also means you won't be able to overwrite a return address with your overflow.) Another restriction is that the data used in the overflow is limited to printable ASCII characters, the ones with byte values 0x20 through 0x7e. But you can control the exact amount of the overflow based on how long you make the corresponding line in the input file, and the overflow doesn't include any terminator character (no \0 terminator, and no newline). Note that this feature of the input only being printable is relatively unusual in realistic situations: we limited this exercise to printable characters to give you more constraints, but as you saw in Lab 3, for instance, practical attacks often involve binary data.

Speaking of ASCII characters, you will probably find it helpful for the challenges to go back and forth between decimal or hexadecimal byte values and the printable characters they represent. Here's a table:

Dec Hex    Dec Hex    Dec Hex  Dec Hex  Dec Hex  Dec Hex   Dec Hex   Dec Hex  
  0 00 NUL  16 10 DLE  32 20    48 30 0  64 40 @  80 50 P   96 60 `  112 70 p
  1 01 SOH  17 11 DC1  33 21 !  49 31 1  65 41 A  81 51 Q   97 61 a  113 71 q
  2 02 STX  18 12 DC2  34 22 "  50 32 2  66 42 B  82 52 R   98 62 b  114 72 r
  3 03 ETX  19 13 DC3  35 23 #  51 33 3  67 43 C  83 53 S   99 63 c  115 73 s
  4 04 EOT  20 14 DC4  36 24 $  52 34 4  68 44 D  84 54 T  100 64 d  116 74 t
  5 05 ENQ  21 15 NAK  37 25 %  53 35 5  69 45 E  85 55 U  101 65 e  117 75 u
  6 06 ACK  22 16 SYN  38 26 &  54 36 6  70 46 F  86 56 V  102 66 f  118 76 v
  7 07 BEL  23 17 ETB  39 27 '  55 37 7  71 47 G  87 57 W  103 67 g  119 77 w
  8 08 BS   24 18 CAN  40 28 (  56 38 8  72 48 H  88 58 X  104 68 h  120 78 x
  9 09 HT   25 19 EM   41 29 )  57 39 9  73 49 I  89 59 Y  105 69 i  121 79 y
 10 0A LF   26 1A SUB  42 2A *  58 3A :  74 4A J  90 5A Z  106 6A j  122 7A z
 11 0B VT   27 1B ESC  43 2B +  59 3B ;  75 4B K  91 5B [  107 6B k  123 7B {
 12 0C FF   28 1C FS   44 2C ,  60 3C <  76 4C L  92 5C \  108 6C l  124 7C |
 13 0D CR   29 1D GS   45 2D -  61 3D =  77 4D M  93 5D ]  109 6D m  125 7D }
 14 0E SO   30 1E RS   46 2E .  62 3E >  78 4E N  94 5E ^  110 6E n  126 7E ~
 15 0F SI   31 1F US   47 2F /  63 3F ?  79 4F O  95 5F _  111 6F o  127 7F DEL

There are also many ASCII tables and tools for conversion that you can find online: a previous semester's TAs recommends this one. Another way to see both the hexadecimal and text character interpretations of your attack inputs is to look at your attack input file in a hex editor like ghex.

There are seven stages, so your final input file will consist of seven lines, one for each stage. As a starting point, we've given you a template file that has 7 lines that are similar to the input you'll need to supply, but which don't cause any of the attacks to succeed. Your job is to modify each line in the file into an input that will cause the corresponding stage of the obstacle course to succeed. We've tried to arrange the stages in order of increasing difficulty, though you don't have to do them in order. Stages 1-5 print some extra information when they fail (and stage 1 even when it succeeds), which you may find helpful to understand what's going on. A debugger shouldn't be necessary, but you can use one if you'd like. The final stage 7 is a bit special in that you can easily make the program crash or behave weirdly with a failed attack, and you probably will want to use GDB or another disassembler for it.

This lab also puts more stress on a conceptual challenge that was also present in last week's lab, namely thinking about how different interpretations of data in files and in memory interact. The basic structure of both files and memory for the purpose of this lab is a sequence of bytes, each of which is 8 bits that can represent a number between 0 and 255, a two-hex-digit number in hexadecimal, or one ASCII text character. When those bytes are written into memory, they might be interpreted in other ways, like a 4-byte int or an 8-byte pointer (because x86-64 is little-endian, these interpretations might feel backwards from what you'd first expect). Because the attacks in this lab are limited to printable characters, it is possible to write your attack input just using a plain text editor. But the other perspectives on how the bytes in your attack are interpreted are important for understanding how the attacks work. So you may find it helpful to look at your attack input file with the hd command, edit it with a hex editor, and/or explore different interpretations for the overwritten data by using the x (examine) command in GDB.

Remember that the purpose of the lab is to help you understand the underlying concepts, not just to solve the puzzles, so if you get an attack to succeed by random guessing, take a moment to try to understand why the attack works before going on to the next one.

You can start by copying the program binary, source code, and solution template into your current directory:

cp /web/classes/Spring-2024/csci4271/labs/05/overflow-course{,.c} .
cp /web/classes/Spring-2024/csci4271/labs/05/template.in my-soln.in

You can run the program as follows:

./overflow-course my-soln.in

You can start work on your first attack by looking for the function stage1 in the source code.

(Format) Background and setup

The other side of today's lab will go in depth on another kind of vulnerability that had come up in lecture and the attack techniques for it, a format string injection. Similarly as we've done to simplify other control-flow hijacking examples, you won't need shellcode: instead your goal is to transfer control to the attack_function function. We also recommend that you spend some time (maybe 10-15 minutes) at the beginning of the lab practicing auditing the code to find the information you need to attack the vulnerability. But we'll give out suggestions along these lines on a separate linked page. You could also spend longer on looking for the vulnerability if you feel that's what you need the most practice with, but if you do that, we'd recommend coming back to the attack techniques later.

The basic way that a format string vulnerability that allows an attacker to use the %n format specifier works is to provide what we call a write-what-where attack primitive. In other words, the vulnerability gives the attacker the capability to write a value of the attacker's choosing, to a location of the attacker's choosing, sometimes with some other restrictions. You may recall that the intended benign purpose of the %n feature of printf family functions is to allow a program to keep track of how many characters printf has written up to a given point, returning the value via a pointer. When this feature is used for an attack, the value that printf interprets as the argument corresponding to the %n format specifier is the "where" of the primitive (the location that will be written to), and the "what" of the primitive, the value that will be written, is the number of characters printf has written so far.

For this vulnerability, the where and the what both have pretty flexible control for an attacker, but they are limited in the size of their values. Because of this limitation, it will not work to try to overwrite a return address on the stack, which might otherwise be your first thought of a location it would be useful for an attacker to replace to take control of the program. Instead we need to rewrite some other data value that the program will use as a control-flow target. If there was a function pointer in the program, it might be a candidate. But what we recommend you use today is instead a value that the program uses for getting to a function in a shared library.

Start by copying the program source code to your working directory, with a command like:

cp /web/classes/Spring-2024/csci4271/labs/05/bclpr.c .

Because this is a simulation of a program that would be installed in a system-wide location if you controlled the full computer, we need to do a bit more to simulate "installing" it so that it will run correctly. As a location that you will have access to but will be unique even on a shared computer, we recommend that you compile the program to use a location similar to /tmp/bclpr-goldy007, but where goldy007 is replaced with your UMN ID. /tmp is a directory for temporary files that everyone has access to, while using your UMN ID makes it unique. You'll need to make this change both on line 24 of the source that defines the INSTALL_PREFIX macro and in the commands below that create directories the program will use. Note though that our attack is only going to use addresses in the main program, so you can just depend on the -no-pie option and you don't have to do anything to disable ASLR.

gcc -no-pie -z execstack -g -Wall -Wno-format-security -fno-stack-protector bclpr.c -o bclpr
mkdir -p /tmp/bclpr-goldy007/spool/lp0
mkdir -p /tmp/bclpr-goldy007/printouts/lp0

(Format) Auditing suggestions

There are two ingredients that are needed to constitute a format-string vulnerability, and then one other choice you need to make based on the program to set up your attack. Here we'll talk about what you're looking for, and our proposals on a separate page to check your answers of if you want to jump ahead.

A printf-family function with a non-constant format string. The most common way to use printf is for the first argument to be a constant string containing format specifiers starting with a percent sign. But if this string is constant, it can't be affected by an attacker. Only if the format-string argument is a non-constant (usually a variable) can the call be vulnerable.
A way for the attacker the control the contents of the format string. A non-constant format string is potentially dangerous, but it will generally only be of use to an attacker if the attacker can supply their own value for the format string. So for instance if the format string argument is a variable, look back at how that value can be computed to see if it could be under an attacker's control. To make a working attack you'll need to exercise that control.
Another attacker-controlled value that will be interpreted as a later argument to the printf-family function. To match up with a %n the attacker puts in a format string, they also need to control another value to be interpreted as the argument to the printf-family function. If the program is already passing other arguments to printf, they might be controllable. Failing that, having a lot of format specifiers in the format string will cause printf to read beyond its internal array on the stack to interpret other values on the stack as arguments. So look for attacker-controllable-values is the stack frame of the calling function, its caller, and so on. On x86-64, the argument corresponding to %n needs to be an 8-byte value that contains several null bytes, so it usually won't work for this value to be part of a null-terminated string.

Our proposed answers to the above three questions are outlined on a a separate page.

More format attack techniques

You can confirm that you are successfully injecting a format string by putting format specifiers in the string and seeing what output results. In particular, try a format string that contains many copies of the format specifier %016lx to print various parameters from the stack as fixed-size 64-bit hex values.

Before getting to the attack proper, let's take a look at the function-pointer-like shared library mechanism that we will be modifying in our attack. For each function from a shared library that is called by the main program, the compiler creates as small intermediate function called a "PLT stub" (PLT stands for Procedure Linkage Table). The value that this stub function uses to find the actual implementation of the function in the shared library is an entry in the GOT (Global Offset Table). The entries in the GOT used by PLT stubs are sometimes more specifically called the .got.plt section. For the case of this attack, we need to change the entry that corresponds to the location of a function that will be called after the execution of the printf-family function that we control. The library specific function we suggest you use for this purpose is fclose, which you can see is called on line 503, soon after the vulnerable fprintf on line 500.

You can see the GOT entry used, and observe how it works, by looking at the execution of the relevant PLT stub in GDB. Here's a transcript you can try, where you can see that the name of the PLT stub for fclose is 'fclose@plt' (the single quotes are not part of the name proper, but are needed for GDB because normal C function names can't contain @.

% echo 'Hello, world!' >hello.txt
% gdb --args ./bclpr hello.txt
(gdb) disass 'fclose@plt'
Dump of assembler code for function fclose@plt:
   0x0000000000401390 <+0>:	endbr64 
   0x0000000000401394 <+4>:	bnd jmp *0x3cc5(%rip) # 0x405060 
   0x000000000040139b <+11>:	nopl   0x0(%rax,%rax,1)
(gdb) p *(void **)0x405060
$1 = (void *) 0x4010c0
(gdb) watch *(void **)0x405060
Hardware watchpoint 2: *(void **)0x405060
(gdb) run
Starting program: bclpr hello.txt

Hardware watchpoint 1: *(void **)0x405060

Old value = (void *) 0x4010c0
New value = (void *) 0x7ffff7e29e00 <_IO_new_fclose>
0x00007ffff7fd5fef in _dl_fixup (l=0x7ffff7ffe2e0, reloc_arg=) at ./elf/dl-runtime.c:163
163	./elf/dl-runtime.c: No such file or directory.
(gdb) c
Continuing.
Deleting spoolfile from /tmp/bclpr-goldy007/spool/lp0/hello.txt
[Inferior 1 (process 590385) exited normally]
(gdb) quit

The key information needed for this aspect of the attack is the address of the relevant GOT entry, which you can see from the disassembly is 0x405060. The initial value of this pointer is a placeholder that triggers the dynamic linker to determine the real address the first time the function is executed. That code (specifically the function _dl_fixup where the watchpoint was triggered) is responsible for updating the GOT entry to point to the location of the implementation of fclose in the C library, which is 0x7ffff7e29e00 on this execution.

The "where" in a format-string write-what-where attack needs to be a value on the stack, in a location where fprintf would look for an argument corresponding to a format specifier, which is under the control of the attacker. This is the thing that we will want to set to the address of the GOT entry.

Next, the "what" of our format-string attack needs to be the address of attack_function, and that needs to be the number of bytes that fprintf writes before it gets to the %ln format specifier we've added. You can get the address of the function inside GDB or with the program nm. Note that it will also be useful to know this value in decimal.

% nm bclpr | fgrep attack_function
00000000004028d4 T attack_function
% printf '%d\n' 0x4028d4
4204756

How can we make fprintf output more than four million characters before getting to the crux of our attack? A format string that was itself that long might be too long to pass on the command line. More conveniently we can use the same printf feature we already used in %016lx: a leading zero and then a number in the format specifier causes the length of the output to be padding with leading 0s up to that number of 0s if it is not already that long. 16 is a logical number to use for a 64-bit hex value, but there is no upper limit on this padding size, so it can easily be in the millions.

The final slightly tricky aspect of this attack is making the amount of data output by printf come out to exactly the right amount, since it depends on all the things in the format string up to the point of the %ln. If you followed our suggestion of using %016lx for printing the earlier stack values, than each of them will be 16 bytes long, but don't forget to also count any spaces, newlines, or other separating characters you print. You could also try copying the data out of the log file to count its size. If you get the size wrong, the attack won't work, but you can use the same GDB watchpoint we illustrated earlier to see what value the GOT entry is being overwritten with, and adjust is accordingly.

Another note: the basic printf format specifier for writing the number of bytes output is %n, but for this attack it is important that you use the variant %ln instead. The l causes printf to interpret the pointer as a long * instead of an int *, which is important because we want to overwrite all 8 bytes of the pointer. We saw that the normal value of the location of fclose has the high-bits non-zero.

There is also one more printf feature you may find it useful to experiment with in building your attack, which can make the attacking format string shorter. The version of the printf functions on Unix systems allows you to give format specifiers out of order. If you put a decimal number and then a dollar sign right after the percent sign of the format specifier, that format specifier will take its value from the position corresponding to the number (counting from 1), rather than what would normally come next. For instance while %d will print the next argument as a decimal int, %10$d will print as a decimal int the value that would normally have corresponded to the 10th format specifier had the dollar sign feature not been used.