1
1
mirror of https://github.com/rui314/mold.git synced 2024-08-17 17:00:30 +03:00

Update a document

This commit is contained in:
Rui Ueyama 2022-02-22 19:26:08 +09:00
parent 385c1c494d
commit fb6967c0cf

View File

@ -1,158 +1,158 @@
The concept of linking is very simple: a compiler compiles a piece of The very concept of linking is simple: a compiler compiles a piece of
source code into an object file (a file containing machine code), and source code into an object file (a file containing machine code), and
a linker combines object files into a single executable or a shared a linker combines object files into a single executable or a shared
library file. However, the actual implementation of the linker for library file. However, the actual implementation of the linker for
modern systems is much more complicated because hardware, operating modern systems is much more complicated because hardware, operating
system, compiler and linker all have many more features. system, compiler and linker all have many more features.
In this file, I'll explain random topics that you need to understand In this file, I'll explain random topics in the glossary format that
to read mold code in the glossary format. you need to understand to read mold code.
- DSO ## DSO
A .so file. Short for Dynamic Shared Object. Often called as a A .so file. Short for Dynamic Shared Object. Often called as a
shared library, a dynamic libray or a shared object as well. shared library, a dynamic libray or a shared object as well.
An DSO contains common functions and data that are used by multiple An DSO contains common functions and data that are used by multiple
executables and/or other DSOs. At runtime, a DSO is loaded to a executables and/or other DSOs. At runtime, a DSO is loaded to a
contiguous region in the virtual address. contiguous region in the virtual address.
- Object file ## Object file
A .o file. An object file contains machine code and data, but it A .o file. An object file contains machine code and data, but it
cannot be executed because it's not self-contained. For example, cannot be executed because it's not self-contained. For example,
if you compile a C source file containing a call of `printf`, if you compile a C source file containing a call of `printf`,
the actual function code of `printf` is not included in the resulting the actual function code of `printf` is not included in the resulting
object file. You include `stdio.h`, but that teaches the compiler object file. You include `stdio.h`, but that teaches the compiler
only about `printf`'s type, and the compiler still don't know what only about `printf`'s type, and the compiler still don't know what
`printf` actually does. Therefore, it cannot emit code for `printf`. `printf` actually does. Therefore, it cannot emit code for `printf`.
You need to link an object file with other object file or a shared You need to link an object file with other object file or a shared
library to make it exectuable. library to make it exectuable.
- Virtual address space ## Virtual address space
A pointer has a value like 0x803020 which is an address of the A pointer has a value like 0x803020 which is an address of the
pointee. But it doesn't mean that the pointee resides at the pointee. But it doesn't mean that the pointee resides at the
physical memory address 0x803020 on the computer. Modern CPUs physical memory address 0x803020 on the computer. Modern CPUs
contains so-called Mmeory Management Unit (MMU), and all access to contains so-called Mmeory Management Unit (MMU), and all access to
the memory are first translated by MMU to the physical address. the memory are first translated by MMU to the physical address.
The address before translation is called the "virtual address". The address before translation is called the "virtual address".
Unless you are doing the kernel programming, all addresses you Unless you are doing the kernel programming, all addresses you
handle are virtual addresses. handle are virtual addresses.
The OS kernel controls the MMU so that each process owns the entire The OS kernel controls the MMU so that each process owns the entire
virtual address space. So, even if two process uses the same virtual virtual address space. So, even if two process uses the same virtual
address, they don't conflict. They are mapped to different physical address, they don't conflict. They are mapped to different physical
addresses. addresses.
The existence of MMU has several implications to the linker. First, The existence of MMU has several implications to the linker. First,
we can link the main executable to a specific address. On process we can link the main executable to a specific address. On process
startup, there's no code or data in the virtual address space, so startup, there's no code or data in the virtual address space, so
the mapping of the main executable always succeed. However, it's not the mapping of the main executable always succeed. However, it's not
true to DSOs because they are loaded after the main executable and true to DSOs because they are loaded after the main executable and
possibly other DSOs. Therefore, shared libraries must be linked in a possibly other DSOs. Therefore, shared libraries must be linked in a
way that they can be loaded to any address in the virtual address way that they can be loaded to any address in the virtual address
space. space.
- Relocation ## Relocation
A piece of information for the linker as to how to link object files A piece of information for the linker as to how to link object files
or a dynamic objects. or a dynamic objects.
Object files can refer functions or data in other object files. For Object files can refer functions or data in other object files. For
example, if you compile a function which calls a non-local function example, if you compile a function which calls a non-local function
`foo`, the resulting code contains something like this: `foo`, the resulting code contains something like this:
``` ```
26: e8 00 00 00 00 callq 2b <bar+0xb> 26: e8 00 00 00 00 callq 2b <bar+0xb>
27: R_X86_64_PLT32 foo-0x4 27: R_X86_64_PLT32 foo-0x4
``` ```
The above `callq` is the instruction to call a function at the The above `callq` is the instruction to call a function at the
machine code level. It's opcode is `0xe8` in x86-64, so the machine code level. It's opcode is `0xe8` in x86-64, so the
instruction begins with `0xe8`. The following four bytes are instruction begins with `0xe8`. The following four bytes are
displacement; that is, the address of the branch target relative to displacement; that is, the address of the branch target relative to
the end of this `callq` instruction. Notice that the displacement is the end of this `callq` instruction. Notice that the displacement is
0. The compiler couldn't fill the displacement because it has no 0. The compiler couldn't fill the displacement because it has no
idea as to where `foo` will be at runtime. So, the compiler write 0 idea as to where `foo` will be at runtime. So, the compiler write 0
as a placeholder and instead write a relocation `R_X86_64_PLT32` as a placeholder and instead write a relocation `R_X86_64_PLT32`
with `foo` as its associated symbol. The linker reads this with `foo` as its associated symbol. The linker reads this
relocation, computes the offsets between this call instruction and relocation, computes the offsets between this call instruction and
function `foo` and overwrite the placeholder value 0 with an actual function `foo` and overwrite the placeholder value 0 with an actual
displacement. displacement.
There are many different types of relocations. For example, if you There are many different types of relocations. For example, if you
want to fix up not with a displacement but with an absolute address want to fix up not with a displacement but with an absolute address
of a symbol, you need to use `R_X86_64_ABS64` instead. of a symbol, you need to use `R_X86_64_ABS64` instead.
- Static library ## Static library
A .a file. Often called as an archive file or just archive as well. A .a file. Often called as an archive file or just archive as well.
A static library is a container just like tar or zip. Actually, A static library is a container just like tar or zip. Actually,
there's no technical reason to not use tar or (uncompressed) zip, there's no technical reason to not use tar or (uncompressed) zip,
but traditionally the .a file format is used by the linker. but traditionally the .a file format is used by the linker.
A static library contains object files and can be passed to the A static library contains object files and can be passed to the
linker along with other object files and/or archives. linker along with other object files and/or archives.
A linker pulls out object files from an archive only if it is needed A linker pulls out object files from an archive only if it is needed
to resolve undefined symbols. In other words, object files in an to resolve undefined symbols. In other words, object files in an
archive are not linked by default and used as a complement to supply archive are not linked by default and used as a complement to supply
missing definitions. This is ideal for a library because you don't missing definitions. This is ideal for a library because you don't
want to link library functions unless you are actually using them. want to link library functions unless you are actually using them.
Contrary to archive files, object files directly given to a linker Contrary to archive files, object files directly given to a linker
are always linked to the output. are always linked to the output.
To maximize the benefit of archive files, a library often used as a To maximize the benefit of archive files, a library often used as a
static library is broken down to small files to separate each static library is broken down to small files to separate each
function individually (for example, look at function individually (for example, look at
https://git.musl-libc.org/cgit/musl/tree/src/stdio). By doing this, https://git.musl-libc.org/cgit/musl/tree/src/stdio). By doing this,
you import only used functions. you import only used functions.
A static file is created by `ar`, whose command line arguments are A static file is created by `ar`, whose command line arguments are
similar to `tar`. A static library contains the symbol table which similar to `tar`. A static library contains the symbol table which
offers a quick way to look up an object file for a defined symbol, offers a quick way to look up an object file for a defined symbol,
but mold does not use the static library's symbol table. mold but mold does not use the static library's symbol table. mold
doesdn't need a symbol table to exist in an archive, and if exists, doesdn't need a symbol table to exist in an archive, and if exists,
mold just ignores it. mold just ignores it.
See also: DSO (dynamic library) See also: DSO (dynamic library)
- Symbol ## Symbol
A symbol is a label assigned to a specific location in an input file A symbol is a label assigned to a specific location in an input file
or an output file. For example, if you define function `foo` and or an output file. For example, if you define function `foo` and
compile it, the resulting object file contains a symbol `foo` compile it, the resulting object file contains a symbol `foo`
pointing to the beginning of the machine code for `foo`. pointing to the beginning of the machine code for `foo`.
Usually, a symbol name is a function or a variable name. If an Usually, a symbol name is a function or a variable name. If an
object is anonymous (such the one for a string literal), a compiler object is anonymous (such the one for a string literal), a compiler
generated a unique symbol, which often starts with `.` to avoid generated a unique symbol, which often starts with `.` to avoid
conflict with user-defined symbols. conflict with user-defined symbols.
For C++, symbol name is a complex "mangled" name. We need to mangle For C++, symbol name is a complex "mangled" name. We need to mangle
identifiers because a simple name such as `foo` cannot be uniquely identifiers because a simple name such as `foo` cannot be uniquely
identify a function or a data in C++, because for example `foo` may identify a function or a data in C++, because for example `foo` may
be in a namespace or defined as a static member in some class. If be in a namespace or defined as a static member in some class. If
`foo` is an overloaded function, we need to distinguish different `foo` is an overloaded function, we need to distinguish different
`foo`s by its type. Therefore, C++ compiler mangles an identifier by `foo`s by its type. Therefore, C++ compiler mangles an identifier by
appending nmaepsace names, type information and such so that appending nmaepsace names, type information and such so that
different things get different names. different things get different names.
For example, a function `int foo(int)` in a namespace `bar` is For example, a function `int foo(int)` in a namespace `bar` is
mangled as `_ZN3bar3fooEi`. mangled as `_ZN3bar3fooEi`.
A symbol can be either defined or undefined. A defined symbol points A symbol can be either defined or undefined. A defined symbol points
to some location in a file which may contain the function's machine to some location in a file which may contain the function's machine
code or the variable's initial value. An undefined symbol does not code or the variable's initial value. An undefined symbol does not
point to anywhere. It needs to be merged with a defined symbol with point to anywhere. It needs to be merged with a defined symbol with
the same name at link-time. This merging process is called "name the same name at link-time. This merging process is called "name
resolution". resolution".
For example, if your program is using `printf`, it usually contains For example, if your program is using `printf`, it usually contains
`printf` as an undefined symbol. You need to link it with `libc.a` `printf` as an undefined symbol. You need to link it with `libc.a`
or `libc.so`, which contain a defined symbol of `printf`, to make a or `libc.so`, which contain a defined symbol of `printf`, to make a
complete program. complete program.