Breaking

Friday, May 21, 2021

A binary file is an executable file that contains machine code, which is ultimately nothing more than a sequence of 0s and 1s. In this series of posts, we are going to analyze binary files. Every programmer should know how a binary is created. Of course, we do not open an editor, type 1s and 0s, and try to execute the result. We need a compiler such as gcc, clang, or msvc. The compiler creates a binary file from source files written in a compiled language such as C or C++. Before we delve deeper into the binary, let's recall how it is created and how it is loaded into memory for execution.

Compilation Process

The compilation process has four stages: preprocessing, compilation, assembly, and linking. The preprocessing phase expands all #define and #include directives in the source file. Once it is complete, what remains is pure code that is ready to be compiled. The compilation phase takes the preprocessed code and translates it into assembly language. (Most compilers also perform heavy optimization in this phase, typically configurable as an optimization level through command-line switches such as -O0 through -O3 in gcc.) In the assembly phase, we finally get to generate some real machine code! The input of the assembly phase is the set of assembly language files generated in the compilation phase, and the output is a set of object files. We can tell the compiler to stop after a particular phase by providing the command-line switches -E, -S, and -c, which produce the preprocessed, compiled (assembly), and assembled (object) output, respectively. The fourth stage, linking, is covered further down.
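To make the first three phases concrete, here is a small C file that could be fed through them one at a time. It is only a sketch: the file name phases.c and the names BUFFER_SIZE, SQUARE, and scaled_square are made up for illustration, and the commands in the comment rely on the gcc switches -E, -S, and -c.

```c
/*
 * Hypothetical file name: phases.c
 *
 *   gcc -E phases.c    stop after preprocessing (result goes to standard output)
 *   gcc -S phases.c    stop after compilation, writes assembly to phases.s
 *   gcc -c phases.c    stop after assembly, writes the object file phases.o
 */
#include <assert.h>

/* Illustrative macros; the preprocessor replaces every use with its expansion. */
#define BUFFER_SIZE 16
#define SQUARE(x) ((x) * (x))

/* Checked at compile time, after BUFFER_SIZE has been expanded to 16. */
static_assert(BUFFER_SIZE > 0, "BUFFER_SIZE must be positive");

/* After preprocessing, the return statement reads: return ((n) * (n)) + 16; */
int scaled_square(int n)
{
    return SQUARE(n) + BUFFER_SIZE;
}
```

In the output of gcc -E, the #define lines are gone, the contents of assert.h are pasted in at the top, and the macros in scaled_square are replaced by their expansions.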

If you run the file utility on an object file, its output contains the term relocatable, which tells you that you’re dealing with an object file and not with an executable. Object files are compiled independently from each other, so the assembler has no way of knowing the memory addresses of other object files when assembling an object file. That’s why object files need to be relocatable: they can then be linked together in any order to make a complete binary executable. If object files were not relocatable, this would not be possible.
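As a small illustration of why an object file cannot contain final addresses, consider a function that calls code defined outside its own source file. This is a sketch under some assumptions: the file name reloc.c and the function name_length are hypothetical, and the commands in the comment assume a Linux system with the file and readelf utilities installed.

```c
/*
 * Hypothetical file name: reloc.c
 *
 *   gcc -c reloc.c      produce the object file reloc.o
 *   file reloc.o        the description should include the word "relocatable"
 *   readelf -r reloc.o  list the relocation entries left for the linker
 */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * strlen is defined elsewhere (in the C library), so while this file is
 * being assembled its address is unknown. The call site is typically
 * recorded as a relocation entry that is filled in at a later stage.
 */
size_t name_length(const char *name)
{
    assert(name != NULL);
    return strlen(name);
}
```

The same applies to any function or variable that lives in another object file of the same program: the reference stays symbolic until the pieces are put together.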


Most compilers, including gcc, automatically call the linker at the end of the compilation process. The linker’s job is to take all the object files belonging to a program and merge them into a single coherent executable, typically intended to be loaded at a particular memory address. Now that the arrangement of all modules in the executable is known, the linker can also resolve most symbolic references. References to libraries may or may not be completely resolved, depending on the type of library.  


Static libraries are merged into the binary executable, allowing any references to them to be resolved entirely. There are also dynamic (shared) libraries, which are shared in memory among all programs that run on a system. In other words, rather than copying the library into every binary that uses it, dynamic libraries are loaded into memory only once, and any binary that wants to use the library needs to use this shared copy.
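The difference between the two kinds of library shows up in how they are built, not in the source code. The sketch below uses one tiny function that could be packaged either way. The names mathlib.c, libmathlib.a, libmathlib.so, and clamp_to_range are hypothetical, and the commands in the comment assume gcc and ar on a Linux system.

```c
/*
 * Hypothetical file name: mathlib.c
 *
 * As a static library (copied into every executable that uses it):
 *   gcc -c mathlib.c -o mathlib.o
 *   ar rcs libmathlib.a mathlib.o
 *
 * As a dynamic (shared) library (loaded at run time and shared):
 *   gcc -shared -fPIC mathlib.c -o libmathlib.so
 */
#include <assert.h>

/* Limits value to the inclusive range [low, high]. */
int clamp_to_range(int value, int low, int high)
{
    assert(low <= high);
    if (value < low)
        return low;
    if (value > high)
        return high;
    return value;
}
```

A program that links against libmathlib.a carries its own copy of clamp_to_range, while a program that links against libmathlib.so only records that it needs the library and the symbol.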

During the linking phase, the addresses at which dynamic libraries will reside are not yet known, so references to them cannot be resolved. Instead, the linker leaves symbolic references to these libraries even in the final executable, and these references are not resolved until the binary is actually loaded into memory to execute.



When you decide to run a binary, the operating system starts by setting up a new process for the program to run in, including a virtual address space. Subsequently, the operating system maps an interpreter into the process’s virtual memory. This is a user-space program that knows how to load the binary and perform the necessary relocations. On Linux, the interpreter is typically a shared library called ld-linux.so. On Windows, the interpreter functionality is implemented as part of ntdll.dll. After loading the interpreter, the kernel transfers control to it, and the interpreter begins its work in user space.


The interpreter loads the binary into its virtual address space (the same space in which the interpreter is loaded). It then parses the binary to find out (among other things) which dynamic libraries the binary uses. The interpreter maps these into the virtual address space (using mmap or an equivalent function) and then performs any necessary last-minute relocations in the binary’s code sections to fill in the correct addresses for references to the dynamic libraries. In reality, the process of resolving references to functions in dynamic libraries is often deferred until later. In other words, instead of resolving these references immediately at load time, the interpreter resolves references only when they are invoked for the first time. This is known as lazy binding. After relocation is complete, the interpreter looks up the entry point of the binary and transfers control to it, beginning normal execution of the binary.
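The final step, looking up the entry point, can be illustrated on Linux, where executables use the ELF format and the entry point is stored in the file header. The sketch below is not how the interpreter itself is implemented. It assumes a Linux system that provides elf.h, it handles only 64-bit files whose byte order matches the host, and read_entry_point is a hypothetical helper name.

```c
#include <assert.h>
#include <elf.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Reads the entry point recorded in the header of a 64-bit ELF file.
 * Returns 0 on success, or -1 if the file cannot be read or is not
 * a 64-bit ELF file.
 */
int read_entry_point(const char *path, uint64_t *entry)
{
    assert(path != NULL && entry != NULL);

    Elf64_Ehdr header;
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    size_t got = fread(&header, 1, sizeof header, fp);
    fclose(fp);
    if (got != sizeof header)
        return -1;

    /* Every ELF file starts with the magic bytes 0x7f 'E' 'L' 'F'. */
    if (memcmp(header.e_ident, ELFMAG, SELFMAG) != 0)
        return -1;
    if (header.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;

    /* e_entry holds the address at which execution of the binary begins. */
    *entry = header.e_entry;
    return 0;
}
```

Calling read_entry_point("/proc/self/exe", &entry) on Linux reads the header of the running program itself. For position-independent executables, the stored value is typically relative to the address at which the binary ends up being loaded.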


In the next part, we will see what exactly we find inside the executable file.
