The basic concept of linking is simple: a compiler compiles a piece of source code into an object file (a file containing machine code), and a linker combines object files into a single executable or shared library file. However, an actual linker for a modern system is much more complicated, because the hardware, operating system, compiler and linker all have many more features than this simple picture suggests.

In this file, I'll explain, in a glossary format, assorted topics that you need to understand to read the mold code.

DSO

A .so file. Short for Dynamic Shared Object. Also often called a shared library, a dynamic library or a shared object.

A DSO contains common functions and data that are used by multiple executables and/or other DSOs. At runtime, a DSO is loaded into a contiguous region of the virtual address space.

Object file

A .o file. An object file contains machine code and data, but it cannot be executed because it's not self-contained. For example, if you compile a C source file containing a call to printf, the actual function code of printf is not included in the resulting object file. You include stdio.h, but that teaches the compiler only about printf's type, and the compiler still doesn't know what printf actually does. Therefore, it cannot emit code for printf.

You need to link an object file with other object files or shared libraries to make it executable.

Virtual address space

A pointer has a value like 0x803020, which is the address of the pointee. But that doesn't mean that the pointee resides at physical memory address 0x803020 on the computer. Modern CPUs contain a so-called Memory Management Unit (MMU), and every memory access is first translated by the MMU to a physical address. The address before translation is called the "virtual address". Unless you are doing kernel programming, all addresses you handle are virtual addresses.

The OS kernel controls the MMU so that each process owns an entire virtual address space. So, even if two processes use the same virtual address, they don't conflict. The same virtual address is mapped to a different physical address in each process.
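As a rough illustration of the translation described above, here is a minimal sketch of a per-process lookup. The 4 KiB page size, the single-level dictionary page table and the frame numbers are all assumptions made for the example; real MMUs use multi-level tables maintained by the kernel.

```python
PAGE_SIZE = 4096  # assumed page size; real systems support several sizes


def translate(page_table, vaddr):
    """Translate a virtual address to a physical one.

    page_table is a hypothetical per-process dict that maps a virtual
    page number to a physical frame number.
    """
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise LookupError(f"page fault at {vaddr:#x}")
    return page_table[vpn] * PAGE_SIZE + offset


# Two hypothetical processes map the same virtual page to different frames.
process_a = {0x803: 0x1A2}
process_b = {0x803: 0x7F0}

vaddr = 0x803020
phys_a = translate(process_a, vaddr)
phys_b = translate(process_b, vaddr)
print(hex(phys_a), hex(phys_b))
```

Both processes use the pointer value 0x803020, yet each access ends up in a different physical frame, which is why the two never conflict.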

The existence of the MMU has several implications for the linker. First, we can link the main executable to a specific address. On process startup, there's no code or data in the virtual address space, so the mapping of the main executable always succeeds. However, that's not true for DSOs, because they are loaded after the main executable and possibly other DSOs. Therefore, shared libraries must be linked in such a way that they can be loaded at any address in the virtual address space.

Relocation

A piece of information that tells the linker how to link object files or dynamic objects.

Object files can refer to functions or data in other object files. For example, if you compile a function which calls a non-local function foo, the resulting code contains something like this:

  26:   e8 00 00 00 00          callq  2b <bar+0xb>
                        27: R_X86_64_PLT32      foo-0x4

The above callq is the instruction to call a function at the machine code level. Its opcode is 0xe8 in x86-64, so the instruction begins with 0xe8. The following four bytes are the displacement; that is, the address of the branch target relative to the end of this callq instruction. Notice that the displacement is 0. The compiler couldn't fill in the displacement because it has no idea where foo will be at runtime. So, the compiler writes 0 as a placeholder and emits a relocation R_X86_64_PLT32 with foo as its associated symbol. The linker reads this relocation, computes the offset between this call instruction and function foo, and overwrites the placeholder value 0 with the actual displacement.

There are many different types of relocations. For example, if you want to fix up a location not with a displacement but with the absolute address of a symbol, you need to use R_X86_64_64 instead.
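The fix-up of the callq instruction described above can be sketched as follows. This is only an illustration, assuming that foo is resolved directly rather than through a PLT entry, in which case the patched value is S + A - P (symbol address plus addend minus the address of the patched field). The load addresses TEXT_ADDR and FOO_ADDR and the helper apply_pc32 are hypothetical.

```python
import struct


def apply_pc32(section, section_addr, r_offset, sym_addr, addend):
    """Patch a 32-bit PC-relative field in place and return the value written."""
    place = section_addr + r_offset        # P: address of the 4-byte field
    value = sym_addr + addend - place      # S + A - P
    # "<i" is a little-endian signed 32-bit integer; struct.error is raised
    # if the displacement does not fit in 32 bits.
    struct.pack_into("<i", section, r_offset, value)
    return value


# Hypothetical section contents: the callq from the listing at offset 0x26.
text = bytearray(0x30)
text[0x26:0x2B] = bytes.fromhex("e800000000")

TEXT_ADDR = 0x401000   # assumed address of this section in the output
FOO_ADDR = 0x401100    # assumed final address of foo

disp = apply_pc32(text, TEXT_ADDR, 0x27, FOO_ADDR, -4)
print(hex(disp), text[0x26:0x2B].hex())
```

The addend of -4 accounts for the fact that the displacement is measured from the end of the instruction, which is four bytes past the start of the patched field.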

Static library

A .a file. Also often called an archive file or just an archive.

A static library is a container just like tar or zip. Actually, there's no technical reason not to use tar or (uncompressed) zip, but traditionally the .a file format is used by the linker.

A static library contains object files and can be passed to the linker along with other object files and/or archives.

A linker pulls object files out of an archive only if they are needed to resolve undefined symbols. In other words, object files in an archive are not linked by default; they are used as a complement to supply missing definitions. This is ideal for a library because you don't want to link library functions unless you are actually using them.

Unlike the members of an archive file, object files given directly to a linker are always linked into the output.
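The difference between the two kinds of input can be sketched with a toy model. This is a simplified illustration, not mold's actual algorithm: the ObjectFile type, the link function and the file names are hypothetical, and the loop simply repeats until no remaining archive member defines a still-undefined symbol.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectFile:
    name: str
    defined: set = field(default_factory=set)
    undefined: set = field(default_factory=set)


def link(objects, archive_members):
    """Return (names of linked files, symbols that remain undefined)."""
    linked, defined, undefined = [], set(), set()

    def add(obj):
        linked.append(obj.name)
        defined.update(obj.defined)
        undefined.update(obj.undefined)

    # Object files given directly are always linked.
    for obj in objects:
        add(obj)

    # Archive members are pulled in only when they define a missing symbol.
    pending = list(archive_members)
    progress = True
    while progress:
        progress = False
        for member in list(pending):
            if member.defined & (undefined - defined):
                add(member)
                pending.remove(member)
                progress = True
    return linked, undefined - defined


main_o = ObjectFile("main.o", {"main"}, {"printf"})
libc_a = [
    ObjectFile("vfprintf.o", {"vfprintf"}),
    ObjectFile("printf.o", {"printf"}, {"vfprintf"}),
    ObjectFile("scanf.o", {"scanf"}, {"vfscanf"}),
]

linked, missing = link([main_o], libc_a)
print(linked, missing)
```

In this run scanf.o is never extracted because nothing refers to scanf, while vfprintf.o is extracted only after printf.o introduces a reference to it.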

To maximize the benefit of archive files, a library that is often used as a static library is typically split into many small source files, one per function (for example, look at https://git.musl-libc.org/cgit/musl/tree/src/stdio). That way, only the functions you actually use are linked into your program.

A static library is created by ar, whose command line arguments are similar to those of tar. A static library contains a symbol table which offers a quick way to look up the object file that defines a given symbol, but mold does not use the static library's symbol table. mold doesn't need a symbol table to exist in an archive, and if one exists, mold just ignores it.

See also: DSO (dynamic library)

Symbol

A symbol is a label assigned to a specific location in an input file or an output file. For example, if you define function foo and compile it, the resulting object file contains a symbol foo pointing to the beginning of the machine code for foo.

Usually, a symbol name is a function or a variable name. If an object is anonymous (such as the one for a string literal), the compiler generates a unique symbol, which often starts with . to avoid conflicts with user-defined symbols.

For C++, a symbol name is a complex "mangled" name. We need to mangle identifiers because a simple name such as foo cannot uniquely identify a function or a piece of data in C++: for example, foo may be in a namespace or defined as a static member of some class. If foo is an overloaded function, we need to distinguish the different foos by their types. Therefore, a C++ compiler mangles an identifier by adding namespace names, type information and such, so that different things get different names.

For example, a function int foo(int) in a namespace bar is mangled as _ZN3bar3fooEi.
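To make the encoding less opaque, here is a minimal sketch of how such a name is put together under the Itanium C++ ABI scheme used by GCC and Clang. It covers only plain functions with built-in parameter types in ordinary namespaces; templates, pointers, classes, the std namespace and substitutions are deliberately left out, and the mangle helper is hypothetical.

```python
# One-letter codes for a few built-in types in the Itanium C++ ABI.
BUILTIN_CODES = {
    "void": "v", "bool": "b", "char": "c", "int": "i",
    "long": "l", "float": "f", "double": "d",
}


def mangle(qualified_name, param_types):
    """Mangle a free function name such as "bar::foo" (simplified sketch)."""
    parts = qualified_name.split("::")
    # Each name component is encoded as <length><name>.
    encoded = "".join(f"{len(p)}{p}" for p in parts)
    # Nested (namespace-qualified) names are wrapped in N ... E.
    if len(parts) > 1:
        encoded = f"N{encoded}E"
    # An empty parameter list is encoded as a single "v".
    params = "".join(BUILTIN_CODES[t] for t in param_types) or "v"
    return f"_Z{encoded}{params}"


print(mangle("bar::foo", ["int"]))
print(mangle("foo", ["int"]))
print(mangle("foo", ["double"]))
```

The last two calls show why overloads no longer collide: foo(int) and foo(double) produce different symbol names.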

A symbol can be either defined or undefined. A defined symbol points to some location in a file, which may contain a function's machine code or a variable's initial value. An undefined symbol does not point anywhere. It needs to be merged with a defined symbol of the same name at link-time. This merging process is called "name resolution".

For example, if your program uses printf, it usually contains printf as an undefined symbol. You need to link it with libc.a or libc.so, which contains a definition of printf, to make a complete program.
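Name resolution as described above can be sketched as building one table of definitions and then checking every undefined symbol against it. This is a simplified model with hypothetical names (resolve, the files dictionary); it ignores symbol binding such as weak symbols and treats every duplicate definition as an error.

```python
def resolve(files):
    """Resolve symbols across input files.

    files maps a file name to a pair (defined names, undefined names).
    Returns (symbol -> defining file, set of unresolved symbols).
    """
    definitions = {}
    for fname, (defs, _) in files.items():
        for sym in defs:
            if sym in definitions:
                raise ValueError(
                    f"duplicate symbol: {sym} in {definitions[sym]} and {fname}"
                )
            definitions[sym] = fname

    unresolved = {
        sym
        for _, undefs in files.values()
        for sym in undefs
        if sym not in definitions
    }
    return definitions, unresolved


files = {
    "main.o": ({"main"}, {"printf"}),
    "libc.so": ({"printf"}, set()),
}
definitions, unresolved = resolve(files)
print(definitions, unresolved)
```

Dropping libc.so from the inputs leaves printf in the unresolved set, which corresponds to the undefined symbol error a linker would report.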