mirror of https://github.com/rui314/mold.git synced 2024-07-15 00:30:25 +03:00

Update a document

Rui Ueyama 2022-02-22 19:26:08 +09:00
parent 385c1c494d
commit fb6967c0cf


@ -1,158 +1,158 @@
The basic concept of linking is simple: a compiler compiles a piece of
source code into an object file (a file containing machine code), and
a linker combines object files into a single executable or a shared
library file. However, an actual linker for modern systems is much more
complicated than this picture suggests, because the hardware, operating
system, compiler and linker all have many more features.

In this file, I'll explain, in glossary format, assorted topics that
you need to understand to read the mold code.

## DSO
A .so file. Short for Dynamic Shared Object. Also often called a
shared library, a dynamic library or a shared object.

A DSO contains common functions and data that are used by multiple
executables and/or other DSOs. At runtime, a DSO is loaded into a
contiguous region of the virtual address space.

## Object file
A .o file. An object file contains machine code and data, but it
cannot be executed because it's not self-contained. For example,
if you compile a C source file containing a call to `printf`,
the actual function code of `printf` is not included in the resulting
object file. You include `stdio.h`, but that teaches the compiler
only about `printf`'s type, and the compiler still doesn't know what
`printf` actually does. Therefore, it cannot emit code for `printf`.

You need to link an object file with other object files or shared
libraries to make it executable.

## Virtual address space
A pointer has a value like 0x803020, which is the address of the
pointee. But that doesn't mean that the pointee resides at the
physical memory address 0x803020 on the computer. Modern CPUs
contain a so-called Memory Management Unit (MMU), and all memory
accesses are first translated by the MMU to physical addresses.
The address before translation is called the "virtual address".
Unless you are doing kernel programming, all addresses you
handle are virtual addresses.

The OS kernel controls the MMU so that each process owns its entire
virtual address space. So, even if two processes use the same virtual
address, they don't conflict. The addresses are mapped to different
physical addresses.

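
The per-process translation described above can be modeled in a few lines. This is only a toy sketch: it assumes 4 KiB pages and a flat, single-level page table, and the names `translate`, `page_table_a` and `page_table_b` as well as the frame numbers are made up for illustration. A real MMU walks multi-level tables set up by the kernel.

```python
PAGE_SIZE = 4096  # assumed page size; real systems may use other sizes


def translate(page_table, vaddr):
    """Translate a virtual address using a toy page table.

    page_table maps a virtual page number to a physical frame number.
    """
    page, offset = divmod(vaddr, PAGE_SIZE)
    frame = page_table[page]  # a missing entry would be a page fault
    return frame * PAGE_SIZE + offset


# Two hypothetical processes map the same virtual page to different frames.
page_table_a = {0x803: 0x12}
page_table_b = {0x803: 0x7F}

phys_a = translate(page_table_a, 0x803020)
phys_b = translate(page_table_b, 0x803020)
```

Both processes use the virtual address 0x803020, yet `phys_a` and `phys_b` differ, which is why the two processes don't conflict.
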
The existence of the MMU has several implications for the linker. First,
we can link the main executable to a specific address. On process
startup, there's no code or data in the virtual address space, so
the mapping of the main executable always succeeds. However, that's not
true for DSOs because they are loaded after the main executable and
possibly other DSOs. Therefore, shared libraries must be linked in
such a way that they can be loaded at any address in the virtual
address space.

## Relocation
A piece of information that tells the linker how to link object files
or dynamic objects.

Object files can refer to functions or data in other object files. For
example, if you compile a function which calls a non-local function
`foo`, the resulting code contains something like this:

```
26: e8 00 00 00 00 callq 2b <bar+0xb>
27: R_X86_64_PLT32 foo-0x4
```

The above `callq` is the instruction to call a function at the
machine code level. Its opcode is `0xe8` in x86-64, so the
instruction begins with `0xe8`. The following four bytes are the
displacement; that is, the address of the branch target relative to
the end of this `callq` instruction. Notice that the displacement is
0. The compiler couldn't fill in the displacement because it has no
idea where `foo` will be at runtime. So, the compiler writes 0
as a placeholder and emits a relocation `R_X86_64_PLT32`
with `foo` as its associated symbol. The linker reads this
relocation, computes the offset between the end of this call
instruction and function `foo`, and overwrites the placeholder value 0
with the actual displacement.

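
The fix-up described above can be sketched as follows. This is a simplified model, not mold's code: it treats the relocation as a plain PC-relative one (value = S + A - P), which assumes `foo` ends up in the same output file and needs no PLT entry. The function name `apply_pc32` and the addresses `TEXT_ADDR` and `FOO_ADDR` are hypothetical values chosen for illustration.

```python
import struct


def apply_pc32(code, offset, section_addr, sym_addr, addend):
    """Patch a 32-bit PC-relative field in place: value = S + A - P."""
    place = section_addr + offset      # P: address of the field being patched
    value = sym_addr + addend - place  # S + A - P
    struct.pack_into("<i", code, offset, value)


# The call instruction from the listing sits at offset 0x26 of the section,
# so its four-byte displacement field starts at offset 0x27.
CALL_OFFSET = 0x26
code = bytearray(CALL_OFFSET) + bytearray([0xE8, 0, 0, 0, 0])

TEXT_ADDR = 0x401000  # hypothetical address where the section is placed
FOO_ADDR = 0x401100   # hypothetical address where foo is placed

apply_pc32(code, CALL_OFFSET + 1, TEXT_ADDR, FOO_ADDR, -4)

displacement = struct.unpack_from("<i", code, CALL_OFFSET + 1)[0]
next_insn = TEXT_ADDR + CALL_OFFSET + 5  # address right after the call
```

The addend of -4 accounts for the fact that the field being patched starts four bytes before the end of the instruction, so `next_insn + displacement` lands exactly on `FOO_ADDR`.
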
There are many different types of relocations. For example, if you
want to fix up not with a displacement but with the absolute address
of a symbol, you need to use `R_X86_64_64` instead.

## Static library
A .a file. Also often called an archive file or just an archive.

A static library is a container just like tar or zip. Actually,
there's no technical reason not to use tar or (uncompressed) zip,
but traditionally the .a file format is used by the linker.

A static library contains object files and can be passed to the
linker along with other object files and/or archives.

A linker pulls an object file out of an archive only if it is needed
to resolve undefined symbols. In other words, object files in an
archive are not linked by default and are used only to supply
missing definitions. This is ideal for a library because you don't
want to link library functions unless you are actually using them.

Unlike object files in an archive, object files given directly to a
linker are always linked into the output.

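
The difference between the two kinds of input can be sketched as a small fixed-point loop. This is a simplified model under stated assumptions, not mold's actual algorithm: the `Obj` type, the `select_inputs` function and the file names are hypothetical, and the command-line ordering rules that some linkers apply are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    name: str
    defined: frozenset
    undefined: frozenset


def select_inputs(objects, archive):
    """Return the names of the files that end up in the link."""
    linked = list(objects)     # objects given directly are always linked
    remaining = list(archive)  # archive members are pulled in on demand
    changed = True
    while changed:
        changed = False
        defined = {s for o in linked for s in o.defined}
        undefined = {s for o in linked for s in o.undefined} - defined
        for member in remaining:
            if member.defined & undefined:  # member supplies a missing symbol
                linked.append(member)
                remaining.remove(member)
                changed = True
                break                       # recompute the undefined set
    return [o.name for o in linked]


main_o = Obj("main.o", frozenset({"main"}), frozenset({"printf"}))
libc_a = [
    Obj("printf.o", frozenset({"printf"}), frozenset({"vfprintf"})),
    Obj("vfprintf.o", frozenset({"vfprintf"}), frozenset()),
    Obj("puts.o", frozenset({"puts"}), frozenset()),
]

result = select_inputs([main_o], libc_a)
```

Here `puts.o` is left out because nothing refers to `puts`, while `vfprintf.o` is pulled in indirectly through `printf.o`.
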
To maximize the benefit of archive files, a library that is often used
as a static library is typically broken down into small files, one per
function (for example, look at
https://git.musl-libc.org/cgit/musl/tree/src/stdio). By doing this,
you import only the functions you use.

A static library is created by `ar`, whose command line arguments are
similar to those of `tar`. A static library usually contains a symbol
table, which offers a quick way to look up the object file that defines
a given symbol, but mold does not use it. mold doesn't need a symbol
table to exist in an archive, and if one exists, mold just ignores it.

See also: DSO (dynamic library)
## Symbol
A symbol is a label assigned to a specific location in an input file
or an output file. For example, if you define function `foo` and
compile it, the resulting object file contains a symbol `foo`
pointing to the beginning of the machine code for `foo`.

Usually, a symbol name is a function or a variable name. If an
object is anonymous (such as the one for a string literal), the compiler
generates a unique symbol name, which often starts with `.` to avoid
conflicts with user-defined symbols.

For C++, a symbol name is a complex "mangled" name. We need to mangle
identifiers because a simple name such as `foo` cannot uniquely
identify a function or a piece of data in C++. For example, `foo` may
be in a namespace or defined as a static member of some class. If
`foo` is an overloaded function, we need to distinguish the different
`foo`s by their types. Therefore, a C++ compiler mangles an identifier
by adding namespace names, type information and such so that
different things get different names.

For example, a function `int foo(int)` in a namespace `bar` is
mangled as `_ZN3bar3fooEi`.
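
To see where the pieces of `_ZN3bar3fooEi` come from, here is a toy mangler that covers only a tiny subset of the Itanium C++ ABI scheme: namespace scopes and a few builtin parameter types. The function name `mangle` and the `TYPE_CODES` table are illustrative; templates, pointers, references and substitutions are deliberately not handled.

```python
# Builtin type codes from the Itanium C++ ABI (small subset).
TYPE_CODES = {"void": "v", "bool": "b", "char": "c", "int": "i",
              "long": "l", "float": "f", "double": "d"}


def mangle(scopes, name, params):
    """Mangle a plain function name; the return type is not encoded."""
    # Each name component is written as <length><name>.
    parts = "".join(f"{len(p)}{p}" for p in [*scopes, name])
    if scopes:
        parts = f"N{parts}E"  # nested names are wrapped in N ... E
    args = "".join(TYPE_CODES[p] for p in params) or "v"  # no params: void
    return f"_Z{parts}{args}"


nested = mangle(["bar"], "foo", ["int"])
plain = mangle([], "foo", ["int"])
```

With this sketch, `nested` is `_ZN3bar3fooEi` and `plain` is `_Z3fooi`, so the two functions named `foo` get different symbol names.
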

A symbol can be either defined or undefined. A defined symbol points
to some location in a file, which may contain the function's machine
code or the variable's initial value. An undefined symbol does not
point anywhere. It needs to be merged with a defined symbol with
the same name at link time. This merging process is called "name
resolution".

For example, if your program is using `printf`, it usually contains
`printf` as an undefined symbol. You need to link it with `libc.a`
or `libc.so`, which contain a defined symbol of `printf`, to make a
complete program.
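
Name resolution can be sketched as building one table of definitions and then checking every undefined reference against it. This is a minimal model, assuming each input file is described only by its sets of defined and undefined names; the `resolve` function, the "first definition wins" rule and the symbol `missing_func` are simplifications made up for this example, not mold's behavior.

```python
def resolve(files):
    """files maps a file name to (defined names, undefined names)."""
    definitions = {}
    for fname, (defined, _) in files.items():
        for sym in defined:
            definitions.setdefault(sym, fname)  # first definition wins here
    unresolved = sorted({
        sym
        for _, undefined in files.values()
        for sym in undefined
        if sym not in definitions
    })
    return definitions, unresolved


files = {
    "main.o": ({"main"}, {"printf", "missing_func"}),
    "libc.so": ({"printf"}, set()),
}

definitions, unresolved = resolve(files)
```

In this example `printf` resolves to `libc.so`, while `missing_func` stays unresolved, which is the situation a linker reports as an undefined symbol error.
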