C++ Tutorial - Taste of Assembly - 2020
Here is the simplest C code:
// main.c
int main() {
    return 0;
}
After we do:
$ gcc main.c
we get a.out (Assembler Output), and if we open it with an editor, what we see is something like this:
^?ELF^A^A^A^@^@^@^@^@^@^@^@^@^B^@^C^@^A^@^@^@~@~B^D^H4^@^@^@ ... ^@^@^@^D^@^@^@/lib/ld-linux.so.2^@^@^D^@^@^@^P^@^@^@^A^@^@^@ GNU^@^@^@^@^@^B^@^@^@^F^@^@^@ ...... ....
The name a.out comes from an older output file format. But as the editor shows, on most Linux systems the file is actually written in the ELF (Executable and Linkable Format) format.
Actually, executable code is created by the compiler and the linker, but it is the linker that puts things into the final binary format. On Linux, this format is typically ELF; on Windows it is PE (Portable Executable), a modified version of the Unix COFF file format used for executables, object code, and DLLs; and on Mac OS X it is Mach-O.
The file contains metadata, plus multiple sections of code and data that will be loaded into memory. The raw a.out cannot be read easily, but if we use objdump, we can see what a.out looks like once the machine code for the main() function has been disassembled.
$ objdump -D a.out | grep -A20 main.:
08048354 <main>:
 8048354:  8d 4c 24 04      lea    0x4(%esp),%ecx
 8048358:  83 e4 f0         and    $0xfffffff0,%esp
 804835b:  ff 71 fc         pushl  0xfffffffc(%ecx)
 804835e:  55               push   %ebp
 804835f:  89 e5            mov    %esp,%ebp
 8048361:  51               push   %ecx
 8048362:  b8 00 00 00 00   mov    $0x0,%eax
 8048367:  59               pop    %ecx
 8048368:  5d               pop    %ebp
 8048369:  8d 61 fc         lea    0xfffffffc(%ecx),%esp
 804836c:  c3               ret
 804836d:  90               nop
 804836e:  90               nop
 804836f:  90               nop
08048370 <__libc_csu_fini>:
 8048370:  55               push   %ebp
 8048371:  89 e5            mov    %esp,%ebp
 8048373:  5d               pop    %ebp
 8048374:  c3               ret
The binary a.out's instructions are written in machine language, the language the CPU understands. Compilers are designed to translate C code into machine language for a variety of processor architectures. In our case, the processor belongs to the x86 family (an Intel Xeon):
$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping        : 6
cpu MHz         : 1995.080
cache size      : 6144 KB
....
Each architecture has a different machine language, so the compiler acts as a middle man, translating C code into machine language for the target architecture.
Let's look in more detail at our compiled executable binary for the x86 architecture.
The objdump output is piped into grep so that only the 20 lines after the match on main.: are shown. In the output, each byte is represented in hexadecimal.
The hexadecimal numbers on the far left, starting with 0x08048354, are memory addresses. We'll look at each machine instruction later, but for now, let's look at the register info using GDB:
$ gcc -g main.c
$ gdb -q ./a.out
Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) break main
Breakpoint 1 at 0x8048362: file main.c, line 3.
(gdb) run
Starting program: /home/a.out

Breakpoint 1, main () at main.c:3
3               return 0;
(gdb) info registers
eax            0xbfec4204       -1075035644
ecx            0xbfec4180       -1075035776
edx            0x1      1
ebx            0x4a571ff4       1247223796
esp            0xbfec4164       0xbfec4164
ebp            0xbfec4168       0xbfec4168
esi            0x4a430ca0       1245908128
edi            0x0      0
eip            0x8048362        0x8048362
eflags         0x282    [ SF IF ]
cs             0x73     115
ss             0x7b     123
ds             0x7b     123
es             0x7b     123
fs             0x0      0
gs             0x33     51
(gdb)
The bits of the machine language instructions must be put into memory, and the layout of that memory is shown in the next section.
The picture above is the memory diagram, in which memory is divided into several regions.
Local variables declared as part of a procedure or function appear in the top part of the diagram, which represents the stack storage, or automatic storage.
The stack is also the place where function parameters are stored, along with the return address of the calling function; likewise, the stack keeps register contents and the return address when an interrupt service routine is called.
Modern architectures typically assign lower memory addresses to the heap than to the stack.
We have other regions in the diagram:
The data region stores global/static variables. When we start a C++ program, memory is set aside for our code (code storage or text storage), and the compiled machine code resides in this code segment.
In C, statically-allocated variables without an explicit initializer are initialized to zero or a null pointer. Implementations of C typically represent zero values and null pointer values using a bit pattern consisting solely of zero-valued bits. Hence, the bss (block started by symbol/block storage segment) section typically includes all uninitialized variables declared at the file level (outside of any function) as well as uninitialized local variables declared with the static keyword.
The rest of the computer's memory is available for other uses: it is the free store, also called the heap. Dynamically allocated memory, created with the new operator or malloc(), appears on the bottom of the diagram, which represents the region of memory called the heap.
Heap memory, also known as dynamic memory, is an alternative to local stack memory. Local memory is automatic: it is allocated when a function is called and deallocated when the function exits. Heap memory is different in every way. The programmer explicitly requests the allocation of a memory block of a particular size, and the block continues to be allocated until the programmer explicitly requests that it be deallocated. Nothing happens automatically. So the programmer has much greater control of memory, but also greater responsibility, since the memory must now be actively managed.
The advantages of heap memory:
- Lifetime:
Because the programmer now controls exactly when memory is allocated and deallocated, it is possible to build a data structure in memory and return that data structure to the caller. This was never possible with local memory, which was automatically deallocated when the function exited.
- Size:
The size of allocated memory can be controlled in more detail. For example, a string buffer can be allocated at run-time which is exactly the right size to hold a particular string. With local memory, the code is more likely to declare a buffer of size 1000 and hope for the best.
The disadvantages of heap memory:
- More Work:
Heap allocation needs to be arranged explicitly in the code, which is simply more work.
- More Bugs:
Because allocation is now done explicitly in the code, realistically it will occasionally be done incorrectly, leading to memory bugs. Local memory is constrained, but at least it's never wrong.
However, there are many problems that can only be solved with heap memory, so that's the way it has to be. In languages with garbage collectors, such as Lisp and Java, the above disadvantages are mostly eliminated: the garbage collector takes over most of the responsibility.
The heap is a large area of memory available for use by the program. The program can request blocks of memory for its use within the heap. In order to allocate a block of some size, the program makes an explicit request by calling the heap allocation function. The allocation function reserves a block of memory of the requested size in the heap and returns a pointer to it.
Suppose a program makes three allocation requests to allocate memory to hold three separate images in the heap, each of which takes 1024 bytes of memory.
Each allocation request reserves a contiguous area of the requested size in the heap and returns a pointer to that new block to the program. Since each block is always referred to by a pointer, the block always plays the role of an object the pointer points to, and the program always manipulates its heap blocks through pointers. Heap block pointers are sometimes known as base address pointers since, by convention, they point to the base (lowest address byte) of the block.
In this example, the three blocks have been allocated contiguously starting at the bottom of the heap, and each block is 1024 bytes in size as requested. In reality, the heap manager can allocate the blocks wherever it wants in the heap, so long as the blocks do not overlap and they are at least the requested size. At any particular moment, some areas in the heap have been allocated to the program, and so are in use. Other areas have yet to be committed, and so are free and available to satisfy allocation requests. The heap manager has its own private data structures to record which areas of the heap are committed to which purpose at any moment. The heap manager satisfies each allocation request from the pool of free memory and updates its private data structures to record which areas of the heap are in use.
When the program is finished using a block of memory, it makes an explicit deallocation request to indicate to the heap manager that the program is now finished with that block.
The heap manager updates its private data structures to show that the area of memory occupied by the block is free again and so may be reused to satisfy future allocation requests.
After the deallocation, the pointer continues to point to the now deallocated block. The program must not access the deallocated object. The pointer is there, but it must not be used. Sometimes the code will set the pointer to NULL immediately after the deallocation to make explicit the fact that it is no longer valid.
- The heap is an area of memory available to allocate blocks of memory for the program.
- There is some heap manager library code which manages the heap for the program. The programmer makes requests to the heap manager, which in turn manages the internals of the heap. In C, the heap is managed by the ANSI library functions malloc(), free(), and realloc().
- The heap manager uses its own private data structures to keep track of which blocks in the heap are free and which blocks are currently in use by the program and how large those blocks are. Initially, all of the heap is free.
- The heap may be of a fixed size, or it may appear to be of a fixed but extremely large size backed by virtual memory. In either case, it is possible for the heap to get full if all of its memory has been allocated and so it cannot satisfy an allocation request. The allocation function will communicate this runtime condition in some way to the program - usually by returning a NULL pointer or raising a language specific runtime exception.
- The allocation function requests a block in the heap of a particular size.
The heap manager selects an area of memory to use to satisfy the request,
marks that area as in use in its private data structures, and returns a
pointer to the heap block.
The caller is now free to use that memory by dereferencing the pointer. The block is guaranteed to be reserved for the sole use of the caller - the heap will not hand out that same area of memory to some other caller. The block does not move around inside the heap - its location and size are fixed once it is allocated.
Generally, when a block is allocated, its contents are random. The new owner is responsible for setting the memory to something meaningful. Sometimes there is a variation on the memory allocation function which sets the block to all zeros (calloc() in C).
- The deallocation function is the opposite of the allocation function. The program makes a single deallocation call to return a block of memory to the heap free area for later reuse. Each block should only be deallocated once. The deallocation function takes as its argument a pointer to a heap block previously furnished by the allocation function. The pointer must be exactly the same pointer returned earlier by the allocation function, not just any pointer into the block. After the deallocation, the program must treat the pointer as bad and not access the deallocated object.
Let's look at the following example and see how the local variables are set in the stack area:
void f1() {
    int a;
    short b[4];
    double c;
    f2();
    f3();
}
When we call f1() function, space needs to be set aside. The following picture shows activation records/stack frame:
When the function f1() is called, the stack pointer will be decremented by 20 bytes which is the size of the variables of f1().
void f2() {
    int x;
    char *y;
    char *z[2];
    f3();
}

void f3() {
    double m[3];
    int n;
}
The following diagram shows the movement of stack pointer, sp: f1()->f1.f2()->f2.f3()->f2.f3() exits->f1.f2() exits.
After f1.f2() returns, when we call f1.f3(), the variables of f3() simply overwrite the area where the variables of f2() were.
This section introduces assembly, not the real one but the conceptual assembly just to learn how the stack is manipulated.
So, in this section, we're not going to use the real names of registers such as the Accumulator (EAX), Counter (ECX), Data (EDX), Base (EBX), Stack Pointer (ESP), Base Pointer (EBP), Source Index (ESI), and Destination Index (EDI). But we may occasionally use the Stack Pointer (ESP) or Instruction Pointer (EIP) registers when necessary.
All the information in the code segment corresponds to the assembly code that was compiled from our C/C++.
We're going to assume our processor has a register set of 32 registers, each 4 bytes wide, with very fast access. The registers are electronically connected to the RAM: every register can load from and flush to RAM.
The ALU does basic arithmetic (addition, subtraction, multiplication, division, shifting, and masking). In the picture, we're assuming the ALU is connected only to the registers, not to RAM directly. That means all meaningful mathematical calculations have to be done in registers: we have to take something from memory and load it into a register in order to add something to it. The hardware can then optimize just the loads and stores between registers and RAM. So any mathematical operation such as i++ or i+1 means loading the variables, wherever they live (heap or stack), into registers, doing the math, putting the result into some other register, and flushing it back out to where it belongs in memory.
Suppose we have:
i = 17;
j = 20;
j += i;
The statement j += i compiles into assembly code. Assembly code is the recipe that knows how to load i and j into the register set, do the addition, and flush the result back out to the same space that j occupies. What probably happens is that i (17) is loaded into R1 and j (20) into R2. Because R1 and R2 are connected to the ALU, which does the addition for us, we can add 17 to 20 and store the result in R3; once the result is synthesized in R3, we can flush it back out to j. Note that the whole process is composed of 4 steps:
- load i into register
- load j into register
- do the addition
- store the result into j
Let's write the code in mock assembly for the following code:
int i; int j; i = 10; j = i + 7; j++;
This compiles into the following mock assembly:
// i = 10;
M[R1+4] = 10;    // store operation

// j = i + 7;
R2 = M[R1+4];    // load operation
R3 = R2 + 7;     // ALU operation
M[R1] = R3;      // store operation

// j++;
R2 = M[R1];
R2 = R2 + 1;
M[R1] = R2;
R1 is a special dedicated register which stores the base address of the variables, and M stands for the RAM. The value 10 is stored at a 4-byte offset from the base address as M[R1+4] and later loaded into R2. After the addition of 7 in the ALU, the result is placed in register R3 and flushed back out to j's slot at the base address, M[R1].
Now let's do an example that does not deal only with 4-byte quantities.
int i;
short s1;
short s2;
So, the activation record looks like this:
i = 200;      =>  M[R1+4] = 200;
s1 = i;       =>  R2 = M[R1+4];
                  M[R1+2] = .2 R2;   // .2 means a two-byte store
s2 = s1 + 1;  =>  R2 = .2 M[R1+2];   // two-byte load of s1
                  R3 = R2 + 1;
                  M[R1] = .2 R3;