diff --git a/docs/glossary.md b/docs/glossary.md
new file mode 100644
index 00000000..86ecfb4c
--- /dev/null
+++ b/docs/glossary.md
@@ -0,0 +1,158 @@
+The concept of linking is very simple: a compiler compiles a piece of
+source code into an object file (a file containing machine code), and
+a linker combines object files into a single executable or a shared
+library file. However, the actual implementation of a linker for
+modern systems is much more complicated because the hardware, operating
+system, compiler and linker all have many more features.
+
+In this file, I'll explain, in glossary format, assorted topics that
+you need to understand to read the mold code.
+
+- DSO
+
+  A .so file. Short for Dynamic Shared Object. Often also called a
+  shared library, a dynamic library or a shared object.
+
+  A DSO contains common functions and data that are used by multiple
+  executables and/or other DSOs. At runtime, a DSO is loaded into a
+  contiguous region of the virtual address space.
+
+- Object file
+
+  A .o file. An object file contains machine code and data, but it
+  cannot be executed because it's not self-contained. For example,
+  if you compile a C source file containing a call to `printf`,
+  the actual function code of `printf` is not included in the resulting
+  object file. You include `stdio.h`, but that teaches the compiler
+  only about `printf`'s type, and the compiler still doesn't know what
+  `printf` actually does. Therefore, it cannot emit code for `printf`.
+
+  You need to link an object file with other object files or a shared
+  library to make it executable.
+
+- Virtual address space
+
+  A pointer has a value like 0x803020, which is the address of the
+  pointee. But that doesn't mean that the pointee resides at the
+  physical memory address 0x803020 on the computer. Modern CPUs
+  contain a so-called Memory Management Unit (MMU), and all accesses to
+  memory are first translated by the MMU to physical addresses.
+  The address before translation is called the "virtual address".
+  Unless you are doing kernel programming, all addresses you
+  handle are virtual addresses.
+
+  The OS kernel controls the MMU so that each process owns its own
+  entire virtual address space. So, even if two processes use the same
+  virtual address, they don't conflict. They are mapped to different
+  physical addresses.
+
+  The existence of the MMU has several implications for the linker.
+  First, we can link the main executable to a specific address. On process
+  startup, there's no code or data in the virtual address space, so
+  the mapping of the main executable always succeeds. However, that's
+  not true of DSOs because they are loaded after the main executable
+  and possibly other DSOs. Therefore, shared libraries must be linked
+  in such a way that they can be loaded at any address in the virtual
+  address space.
+
+- Relocation
+
+  A piece of information that tells the linker how to link object
+  files or dynamic objects.
+
+  Object files can refer to functions or data in other object files. For
+  example, if you compile a function which calls a non-local function
+  `foo`, the resulting code contains something like this:
+
+  ```
+    26:   e8 00 00 00 00      callq  2b
+                      27: R_X86_64_PLT32    foo-0x4
+  ```
+
+  The above `callq` is the instruction to call a function at the
+  machine code level. Its opcode is `0xe8` in x86-64, so the
+  instruction begins with `0xe8`. The following four bytes are the
+  displacement; that is, the address of the branch target relative to
+  the end of this `callq` instruction. Notice that the displacement
+  is 0. The compiler couldn't fill in the displacement because it has
+  no idea where `foo` will be at runtime. So, the compiler writes 0
+  as a placeholder and instead emits a relocation `R_X86_64_PLT32`
+  with `foo` as its associated symbol.
+  The linker reads this relocation, computes the offset between this
+  call instruction and function `foo`, and overwrites the placeholder
+  value 0 with the actual displacement.
+
+  There are many different types of relocations. For example, if you
+  want to fix up not with a displacement but with the absolute address
+  of a symbol, you need to use `R_X86_64_64` instead.
+
+- Static library
+
+  A .a file. Often also called an archive file or just an archive.
+
+  A static library is a container just like tar or zip. Actually,
+  there's no technical reason not to use tar or (uncompressed) zip,
+  but traditionally the .a file format is used by the linker.
+
+  A static library contains object files and can be passed to the
+  linker along with other object files and/or archives.
+
+  A linker pulls an object file out of an archive only if it is needed
+  to resolve undefined symbols. In other words, object files in an
+  archive are not linked by default; they only serve to supply
+  missing definitions. This is ideal for a library because you don't
+  want to link library functions unless you are actually using them.
+
+  In contrast to archive members, object files given directly to a
+  linker are always linked into the output.
+
+  To maximize the benefit of archive files, a library often used as a
+  static library is broken down into small files, one for each
+  function (for example, look at
+  https://git.musl-libc.org/cgit/musl/tree/src/stdio). By doing this,
+  you import only the functions you use.
+
+  A static library is created by `ar`, whose command line arguments are
+  similar to those of `tar`. A static library contains a symbol table
+  that offers a quick way to find the object file defining a symbol,
+  but mold does not use the static library's symbol table. mold
+  doesn't need a symbol table to exist in an archive, and if one
+  exists, mold just ignores it.
+
+  See also: DSO (dynamic library)
+
+- Symbol
+
+  A symbol is a label assigned to a specific location in an input file
+  or an output file. For example, if you define function `foo` and
+  compile it, the resulting object file contains a symbol `foo`
+  pointing to the beginning of the machine code for `foo`.
+
+  Usually, a symbol name is a function or a variable name. If an
+  object is anonymous (such as the one for a string literal), the
+  compiler generates a unique symbol, which often starts with `.` to
+  avoid conflict with user-defined symbols.
+
+  For C++, a symbol name is a complex "mangled" name. We need to mangle
+  identifiers because a simple name such as `foo` cannot uniquely
+  identify a function or a piece of data in C++; for example, `foo` may
+  be in a namespace or defined as a static member of some class. If
+  `foo` is an overloaded function, we need to distinguish the different
+  `foo`s by their types. Therefore, a C++ compiler mangles an identifier
+  by encoding namespace names, type information and such so that
+  different things get different names.
+
+  For example, a function `int foo(int)` in a namespace `bar` is
+  mangled as `_ZN3bar3fooEi`.
+
+  A symbol can be either defined or undefined. A defined symbol points
+  to some location in a file which may contain the function's machine
+  code or the variable's initial value. An undefined symbol does not
+  point anywhere. It needs to be merged with a defined symbol with
+  the same name at link-time. This merging process is called "name
+  resolution".
+
+  For example, if your program is using `printf`, it usually contains
+  `printf` as an undefined symbol. You need to link it with `libc.a`
+  or `libc.so`, which contain a defined symbol of `printf`, to make a
+  complete program.
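
The name resolution described in the Symbol entry, combined with the rule from the Static library entry that archive members are linked only when needed, can be illustrated with a small Python sketch. This is only a toy model under simplifying assumptions and not how mold is implemented: `ObjectFile`, `resolve`, and all file names are hypothetical, and real linkers also deal with weak symbols, duplicate definitions, DSOs and much more.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectFile:
    """Hypothetical stand-in for a .o file: only its symbol names."""
    name: str
    defined: set = field(default_factory=set)
    undefined: set = field(default_factory=set)


def resolve(objects, archive_members):
    """Return (names of linked files, symbols that remain undefined).

    Files in `objects` are always linked. A file in `archive_members`
    is linked only if it defines a symbol that is still undefined.
    """
    linked = list(objects)
    pending = list(archive_members)
    while True:
        defined = set().union(*(o.defined for o in linked))
        unresolved = set().union(*(o.undefined for o in linked)) - defined
        # Pick the first archive member that supplies a missing definition.
        needed = next((m for m in pending if m.defined & unresolved), None)
        if needed is None:
            return [o.name for o in linked], unresolved
        pending.remove(needed)
        # The member's own undefined symbols are handled in the next round.
        linked.append(needed)


# Hypothetical inputs: two object files and a three-member archive.
main = ObjectFile("main.o", defined={"main"}, undefined={"printf", "helper"})
helper = ObjectFile("helper.o", defined={"helper"})
libc = [
    ObjectFile("printf.o", defined={"printf"}, undefined={"vfprintf"}),
    ObjectFile("vfprintf.o", defined={"vfprintf"}),
    ObjectFile("scanf.o", defined={"scanf"}),
]

linked, unresolved = resolve([main, helper], libc)
print(linked)      # ['main.o', 'helper.o', 'printf.o', 'vfprintf.o']
print(unresolved)  # set()
```

In this example `scanf.o` is never pulled in because nothing refers to `scanf`, while `vfprintf.o` is pulled in only because `printf.o` needs it. If a symbol had no definition anywhere, it would be returned in the second value, which roughly corresponds to an "undefined symbol" link error.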