1
1
mirror of https://github.com/rui314/mold.git synced 2024-12-28 19:04:27 +03:00
mold/README.md
Rui Ueyama 907dd6a341 wip
2021-03-26 22:14:23 +09:00

617 lines
29 KiB
Markdown

# mold: A Modern Linker
![mold image](mold.jpg)
This is a repository of a linker I'm currently developing as a
replacement for existing Unix linkers such as GNU BFD, GNU gold or
LLVM lld.
My goal was to make a linker that is as fast as concatenating input
object files with `cat` command. It may sound like an impossible goal,
but it's not entirely impossible because of the following two reasons:
1. `cat` is a simple single-threaded program which isn't the fastest
one as a file copy command. My linker can use multiple threads to
copy file contents more efficiently to save time to do extra work.
2. Copying file contents is I/O-bounded, and many CPU cores should be
available during file copy. We can use them to do extra work while
copying file contents.
Concretely speaking, I wanted to use the linker to link a Chromium
executable with full debug info (~2 GiB in size) just in 1 second.
LLVM's lld, the fastest open-source linker which I originally created
a few years ago, takes about 12 seconds to link Chromium on my machine.
So the goal is 12x performance bump over lld. Compared to GNU gold,
it's more than 50x.
It looks like mold has achieved the goal. It can link Chromium in 2
seconds with 8-cores/16-threads, and if I enable the preloading
feature (I'll explain it later), the latency of the linker for an
interactive use is less than 900 milliseconds. It is actualy faster
than `cat`.
Note that even though mold can create a runnable Chrome executable,
it is far from complete and not usable for production. mold is still
just a toy linker, and this is still just my pet project.
## Performance visualization
Here is a side-by-side comparison of per-core CPU usage of lld (left)
and mold (right). They are linking the same program, Chromium
executable, which is ~2GB with debug info. As you can see, mold is
much faster than lld. Throughout its execution, mold uses all
available cores, which is artificially capped to 16 cores in this
demo.
![](htop.gif)
## How to build
mold is written in C++20, so you need a very recent version of GCC or
Clang. I'm using Ubuntu 20.04 as a development platform. In that
environment, you can build mold by the following commands.
```
$ sudo apt-get install build-essential libstdc++-10-dev clang-10 libssl-dev zlib1g-dev cmake
$ git clone --recursive https://github.com/rui314/mold.git
$ cd mold
$ make submodules
$ make
```
The last `make` command creates `mold` executable.
## How to use
On Unix, the linker command (which is usually `/usr/bin/ld`) is
invoked indirectly by `cc` (or `gcc` or `clang`), which is typically
in turn indirectly invoked by `make` or some other build system
command. It is sometimes very hard to pass an appropriate command line
option to `cc` to specify an alternative linker.
To deal with the situation, mold has a feature to intercept all
invocations of `/usr/bin/ld` and redirect it to itself. To use the
feature, run `make` (or other build command) as a subprocess of mold
as follows:
```
$ path/to/mold -run make <make-options-if-any>
```
Internally, mold invokes a given command with `LD_PRELOAD` environment
variable set to its companion shared object file. The shared object
file intercepts all function calls to exec-family functions and
`posix_spawn` to replace `argv[0]` with `mold` if it is `/usr/bin/ld`.
Alternatively, you can pass `-fuse-ld=<absolute-path-to-mold-executable>`
to a linker command line. Since GCC doesn't support that option,
I recommend using clang.
mold leaves its identification string in `.comment` section in an output
file. You can print it out to verify that you are actually using mold.
```
$ readelf -p .comment <executable-file>
String dump of section '.comment':
[ 0] GCC: (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0
[ 2b] mold 9a1679b47d9b22012ec7dfbda97c8983956716f7
```
If `mold` is in `.comment`, the file is created by mold.
## Background
- Even though lld has significantly improved the situation, linking is
still one of the slowest steps in a build. It is especially
annoying when I changed one line of code and had to wait for a few
seconds or even more for a linker to complete. It should be
instantaneous. There's a need for a faster linker.
- The number of cores on a PC has increased a lot lately, and this
trend is expected to continue. However, the existing linkers can't
take the advantage of that because they don't scale well for more
cores. I have a 64-core/128-thread machine, so my goal is to create
a linker that uses the CPU nicely. mold should be much faster than
other linkers on 4 or 8-core machines too, though.
- It looks to me that the designs of the existing linkers are somewhat
too similar, and I believe there are a lot of drastically different
designs that haven't been explored yet. Developers generally don't
care about linkers as long as they work correctly, and they don't
even think about creating a new one. So there may be lots of low
hanging fruits there in this area.
## Basic design
- In order to achieve a `cat`-like performance, the most important
thing is to fix the layout of an output file as quickly as possible, so
that we can start copying actual data from input object files to an
output file as soon as possible.
- Copying data from input files to an output file is I/O-bounded, so
there should be room for doing computationally-intensive tasks while
copying data from one file to another.
- We should allow the linker to preload object files from disk and
parse them in memory before a complete set of input object files
is ready. My idea is this: if a user invokes the linker with
`--preload` flag along with other command line flags a few seconds
before the actual linker invocation, then the following actual
linker invocation with the same command line options (except
`--preload` flag) becomes magically faster. Behind the scenes, the
linker starts preloading object files on the first invocation and
becomes a daemon. The second invocation of the linker notifies the
daemon to reload updated object files and then proceed.
- Daemonizing alone wouldn't make the linker magically faster. We need
to split the linker into two in such a way that the latter half of
the process finishes as quickly as possible by speculatively parsing
and preprocessing input files in the first half of the process. The
key factor of success would be to design nice data structures that
allows us to offload as much processing as possible from the second
to the first half.
- One of the most time-consuming stage among linker stages is symbol
resolution. To resolve symbols, we basically have to throw all
symbol strings into a hash table to match undefined symbols with
defined symbols. But this can be done in the daemon using [string
interning](https://en.wikipedia.org/wiki/String_interning).
- Object files may contain a special section called a mergeable string
section. The section contains lots of null-terminated strings, and
the linker is expected to gather all mergeable string sections and
merge their contents. So, if two object files contain the same
string literal, for example, the resulting output will contain a
single merged string. This step is time-consuming, but string
merging can be done in the daemon using string interning.
- Static archives (.a files) contain object files, but the static
archive's string table contains only defined symbols of member
object files and lacks other types of symbols. That makes static
archives unsuitable for speculative parsing. The daemon should
ignore the string table of static archive and directly read all
member object files of all archives to get the whole picture of
all possible input files.
- If there's a relocation that uses a GOT of a symbol, then we have to
create a GOT entry for that symbol. Otherwise, we shouldn't. That
means we need to scan all relocation tables to fix the length and
the contents of a .got section. This is perhaps time-consuming, but
this step is parallelizable.
## Compatibility
- GNU ld, GNU gold and LLVM lld support essentially the same set of
command line options and features. mold doesn't have to be
completely compatible with them. As long as it can be used for
linking large user-land programs, I'm fine with that. It is OK to
leave some command line options unimplemented; if mold is blazingly
fast, other projects would still be happy to adopt it by modifying
their projects' build files.
- mold emits Linux executables and runs only on Linux. I won't avoid
Unix-ism when writing code (e.g. I'll probably use fork(2)).
I don't want to think about portability until mold becomes a thing
that's worth to be ported.
## Linker Script
Linker script is an embedded language for the linker. It is mainly
used to control how input sections are mapped to output sections and
the layout of the output, but it can also do a lot of tricky stuff.
Its feature is useful especially for embedded programming, but it's
also an awfully underdocumented and complex language.
We have to implement a subset of the linker script language anwyay,
because on Linux, /usr/lib/x86_64-linux-gnu/libc.so is (despite its
name) not a shared object file but actually an ASCII file containing
linker script code to load the _actual_ libc.so file. But the feature
set for this purpose is very limited, and it is okay to implement them
to mold.
Besides that, we really don't want to implement the linker script
langauge. But at the same time, we want to satisfy the user needs that
are currently satisfied with the linker script langauge. So, what
should we do? Here is my observation:
- Linker script allows to do a lot of tricky stuff, such as specifying
the exact layout of a file, inserting arbitrary bytes between
sections, etc. But most of them can be done with a post-link binary
editing tool (such as `objcopy`).
- It looks like there are two things that truely cannot be done by a
post-link editing tool: (a) mapping input sections to output
sections, and (b) applying relocations.
From the above observation, I believe we need to provide only the
following features instead of the entire linker script langauge:
- A method to specify how input sections are mapped to output
sections, and
- a method to set addresses to output sections, so that relocations
are applied based on desired adddresses.
I believe everything else can be done with a post-link binary editing
tool.
## Details
- If we aim to the 1 second goal for Chromium, every millisecond
counts. We can't ignore the latency of process exit. If we mmap a
lot of files, \_exit(2) is not instantaneous but takes a few hundred
milliseconds because the kernel has to clean up a lot of
resources. As a workaround, we should organize the linker command as
two processes; the first process forks the second process, and the
second process does the actual work. As soon as the second process
writes a result file to a filesystem, it notifies the first process,
and the first process exits. The second process can take time to
exit, because it is not an interactive process.
- At least on Linux, it looks like the filesystem's performance to
allocate new blocks to a new file is the limiting factor when
creating a new large file and filling its contents using mmap.
If you already have a large file on a filesystem, writing to it is
much faster than creating a new fresh file and writing to it.
Based on this observation, mold should overwrite to an existing
executable file if exists. My quick benchmark showed that I could
save 300 milliseconds when creating a 2 GiB output file.
Linux doesn't allow to open an executable for writing if it is
running (you'll get "text busy" error if you attempt). mold should
fall back to the usual way if it fails to open an output file.
- As an implementation strategy, we do _not_ care about memory leak
because we really can't save that much memory by doing precise
memory management. It is because most objects that are allocated
during an execution of mold are needed until the very end of the
program. I'm sure this is an odd memory management scheme (or the
lack thereof), but this is what LLVM lld does too.
- The output from the linker should be deterministic for the sake of
[build reproducibility](https://en.wikipedia.org/wiki/Reproducible_builds)
and ease of debugging. This might add a little bit of overhead to
the linker, but that shouldn't be too much.
- A .build-id, a unique ID embedded to an output file, is usually
computed by applying a cryptographic hash function (e.g. SHA-1) to
an output file. This is a slow step, but we can speed it up by
splitting a file into small chunks, computing SHA-1 for each chunk,
and then computing SHA-1 of the concatenated SHA-1 hashes
(i.e. constructing a [Markle
Tree](https://en.wikipedia.org/wiki/Merkle_tree) of height 2).
Modern x86 processors have purpose-built instructions for SHA-1 and
can compute SHA-1 pretty quickly at about 2 GiB/s rate. Using 16
cores, a build-id for a 2 GiB executable can be computed in 60 to 70
milliseconds.
- BFD, gold, and lld support section garbage collection. That is, a
linker runs a mark-sweep garbage collection on an input graph, where
sections are vertices and relocations are edges, to discard all
sections that are not reachable from the entry point symbol
(i.e. `_start`) or a few other root sections. In mold, we are using
multiple threads to mark sections concurrently.
- Similarly, BFD, gold an lld support Identical Comdat Folding (ICF)
as a yet another size optimization. ICF merges two or more read-only
sections that happen to have the same contents and relocations.
To do that, we have to find isomorphic subgraphs from larger graphs.
I implemented a new algorithm for mold, which is 5x faster than lld
to do ICF for Chromium (from 5 seconds to 1 second).
- [Intel Threading Building
Blocks](https://github.com/oneapi-src/oneTBB) (TBB) is a good
library for parallel execution and has several concurrent
containers. We are particularly interested in using
`parallel_for_each` and `concurrent_hash_map`.
- TBB provides `tbbmalloc` which works better for multi-threaded
applications than the glib'c malloc, but it looks like
[jemalloc](https://github.com/jemalloc/jemalloc) and
[mimalloc](https://github.com/microsoft/mimalloc) are a little bit
more scalable than `tbbmalloc`.
## Size of the problem
When linking Chrome, a linker reads 3,430,966,844 bytes of data in
total. The data contains the following items:
| Data item | Number
| ------------------------ | ------
| Object files | 30,723
| Public undefined symbols | 1,428,149
| Mergeable strings | 1,579,996
| Comdat groups | 9,914,510
| Regular sections¹ | 10,345,314
| Public defined symbols | 10,512,135
| Symbols | 23,953,607
| Sections | 27,543,225
| Relocations against SHF_ALLOC sections | 39,496,375
| Relocations | 62,024,719
¹ Sections that have to be copied from input object files to an
output file. Sections that contain relocations or symbols are for
example excluded.
## Internals
In this section, I'll explain the internals of mold linker.
### A brief history of Unix and the Unix linker
Conceptually, what a linker does is pretty simple. A compiler compiles
a fragment of a program (a single source file) into a fragment of
machine code and data (an object file, which typically has the .o
extension), and a linker stiches them together into a single
executable or a shared library image.
In reality, modern linkers for Unix-like systems are much more
compilcated than the naive understanding because they have gradually
gained one feature at a time over the 50 years history of Unix, and
they are now something like a bag of lots of miscellaneous features in
which none of the features is more important than the others. It is
very easy to miss the forest for the trees, since for those who don't
know the details of the Unix linker, it is not clear which feature is
essential and which is not.
That being said, one thing is clear that at any point of Unix history,
a Unix linker has a coherent feature set for the Unix of that age. So,
let me entangle the history to see how the operating system, runtime
and linker have gained features that we see today. That should give
you an idea why a particular feature has been added to a linker in the
first place.
1. Original Unix didn't support shared library, and a program was
always loaded to a fixed address. An executable was something like
a memory dump which was just loaded to a particular address by the
kernel. After loading, the kernel started executing the program by
setting the instruction pointer to a particular address.
The most essential feature for any linker is relocation processing.
The original Unix linker of course supported that. Let me explain
what that is.
Individual object files are inevitably incomplete as a program,
because when a compiler created them, it only see a part of an
entire program. For example, if an object file contains a function
call that refers other object file, the `call` instruction in the
object cannot be complete, as the compiler has no idea as to what
is the called function's address. To deal with this, the compiler
emits a placeholder value (typically just zero) instead of a real
address and leave a metadata in an object file saying "fix offset X
of this file with an address of Y". That metadata is called
"relocation". Relocations are typically processed by the linker.
It is easy for a linker to apply relocations for the original Unix
because a program is always loaded to a fixed address. It exactly
knows the addresses of all functions and data when linking a
program.
Static library support, which is still an important feature of Unix
linker, also dates back to this early period of Unix history.
To understand what it is, imagine that you are trying to compile
a program for the early Unix. You don't want to waste time to
compile libc functions every time you compile your program (the
computers of the era was incredibly slow), so you have already
placed each libc function into a separate source file and compiled
them individually. That means, you have object files for each libc
function, e.g., printf.o, scanf.o, atoi.o, write.o, etc.
Given this configuration, all you have to do to link your program
against libc functions is to pick up a right set of libc object
files and give them to the linker along with the object files of your
program. But, keeping the linker command line in sync with the
libc functions you are using in your program is bothersome. You can
be conservative; you can specify all libc object files to the
command line, but that leads to program bloat because the linker
unconditionally link all object files given to it no matter whether
they are used or not. So, a new feature was added to the linker to
fix the problem. That is the static library, which is also called
the archive file.
An archive file is just a bundle of object files, just like zip
file but in an uncompressed form. An achive file typically has the
.a file extension and named after its contents. For example, the
archive file containing all libc objects is named `libc.a`.
If you pass an archive file along with other object files to the
linker, the linker pulls out an object file from the archive _only
when_ it is referenced by other object files. In other words,
unlike object files directly given to a linker, object files
wrapped in an archive are not linked to an output by default.
An archive works as supplements to complete your program.
Even today, you can still find a libc archive file. Run `ar t
/usr/lib/x86_64-linux-gnu/libc.a` on Linux should give you a list
of object files in the libc archive.
2. In '80s, Sun Microsystems, a leading commercial Unix vendor at the
time, added a shared library support to their Unix variant, SunOS.
(This section is incomplete.)
## Concurrency strategy
In this section, I'll explain the high level concurrency strategy of
mold.
In most places, mold adopts data parallelism. That is, we have a huge
number of piece of data of the same kind, and we process each of them
individually using parallel for-loop. For example, after identifying
the exact set of input object files, we need to scan all relocation
tables to determine the sizes of .got and .plt sections. We do that
using a parallel for-loop. The granularity of parallel processing in
this case is the relocation table.
Data parallelism is very efficient and scalable because there's no
need for threads to communicate with each other while working on each
element of data. In addition to that, data parallelism is easy to
understand, as it is just a for-loop in which multiple iterations may
be executed in parallel. We don't use high-level communication or
synchronization mechanisms such as channels, futures, promises,
latches or something like that in mold.
In some cases, we need to share a little bit of data between threads
while executing a parallel for-loop. For example, the loop to scan
relocations turns on "requires GOT" or "requires PLT" flags in a
symbol. Symbol is a shared resource, and writing to them from multiple
threads without synchronization is unsafe. To deal with it, we made
the flag an atomic variable.
The other common pattern you can find in mold which is build on top of
the parallel for-loop is the map-reduce pattern. That is, we run a
parallel for-loop on a large data set to produce a small data set and
process the small data set with a single thread. Let me take a
build-id computation as an example. Build-id is typically computed by
applying a cryptographic hash function such as SHA-1 on a linker's
output file. To compute it, we first consider an output as a sequence
of 1 MiB blocks and compute a SHA-1 hash for each block in parallel.
Then, we concatenate the SHA-1 hashes and compute a SHA-1 hash on the
hashes to get a final build-id.
Finally, we use concurrent hashmap at a few places in mold. Concurrent
hashmap is a hashmap to which multiple threads can safely insert items
in parallel. We use it in the symbol resolution stage, for example.
To resolve symbols, we basically have to throw in all defined symbols
into a hash table, so that we can find a matching defined symbol for
an undefined symbol by name. We do the hash table insertion from a
parallel for-loop which iterates over a list of input files.
Overall, even though mold is highly scalable, it succeeded to avoid
complexties you often find in complex parallel programs. From high
level, mold just serially executes linker's internal passes one by
one. Each pass is parallelized using parallel for-loops.
## Rejected ideas
In this section, I'll explain the alternative designs I currently do
not plan to implement and why I turned them down.
- Placing variable-length sections at end of an output file and start
copying file contents before fixing the output file layout
Idea: Fixing the layout of regular sections seems easy, and if we
place them at beginning of a file, we can start copying their
contents from their input files to an output file. While copying
file contents, we can compute the sizes of variable-length sections
such as .got or .plt and place them at end of the file.
Reason for rejection: I did not choose this design because I doubt
if it could actually shorten link time and I think I don't need it
anyway.
The linker has to de-duplicate comdat sections (i.e. inline
functions that are included into multiple object files), so we
cannot compute the layout of regular sections until we resolve all
symbols and de-duplicate comdats. That takes a few hundred
milliseconds. After that, we can compute the sizes of
variable-length sections in less than 100 milliseconds. It's quite
fast, so it doesn't seem to make much sense to proceed without
fixing the final file layout.
The other reason to reject this idea is because there's good a
chance for this idea to have a negative impact on linker's overall
performance. If we copy file contents before fixing the layout, we
can't apply relocations to them while copying because symbol
addresses are not available yet. If we fix the file layout first, we
can apply relocations while copying, which is effectively zero-cost
due to a very good data locality. On the other hand, if we apply
relocations long after we copy file contents, it's pretty expensive
because section contents are very likely to have been evicted from
CPU cache.
- Incremental linking
Idea: Incremental linking is a technique to patch a previous linker's
output file so that only functions or data that are updated from the
previous build are written to it. It is expected to significantly
reduce the amount of data copied from input files to an output file
and thus speed up linking. GNU BFD and gold linkers support it.
Reason for rejection: I turned it down because it (1) is
complicated, (2) doesn't seem to speed it up that much and (3) has
several practical issues. Let me explain each of them.
First, incremental linking for real C/C++ programs is not as easy as
one might think. Let me take malloc as an example. malloc is usually
defined by libc, but you can implement it in your program, and if
that's the case, the symbol `malloc` will be resolved to your
function instead of the one in libc. If you include a library that
defines malloc (such as libjemalloc or libtbbmallc) before libc,
their malloc will override libc's malloc.
Assume that you are using a nonstandard malloc. What if you remove
your malloc from your code, or remove `-ljemalloc` from your
Makefile? The linker has to include a malloc from libc, which may
include more object files to satisfy its dependencies. Such code
change can affect the entire program rather than just replacing one
function. The same is true to adding malloc to your program. Making
a local change doesn't necessarily result in a local change in the
binary level. It can easily have cascading effects.
Some ELF fancy features make incremental linking even harder to
implement. Take the weak symbol as an example. If you define `atoi`
as a weak symbol in your program, and if you are not using `atoi`
at all in your program, that symbol will be resolved to address
0. But if you start using some libc function that indirectly calls
`atoi`, then `atoi` will be included to your program, and your weak
symbol will be resolved to that function. I don't know how to
efficiently fix up a binary for this case.
This is a hard problem, so existing linkers don't try too hard to
solve it. For example, IIRC, gold falls back to full link if any
function is removed from a previous build. If you want to not annoy
users in the fallback case, you need to make full link fast anyway.
Second, incremental linking itself has an overhead. It has to detect
updated files, patch an existing output file and write additional
data to an output file for future incremental linking. GNU gold, for
instance, takes almost 30 seconds on my machine to do a null
incremental link (i.e. no object files are updated from a previous
build) for chrome. It's just too slow.
Third, there are other practical issues in incremental linking. It's
not reproducible, so your binary isn't going to be the same as other
binaries even if you are compiling the same source tree using the
same compiler toolchain. Or, it is complex and there might be a bug
in it. If something doesn't work correctly, "remove --incremental
from your Makefile and try again" could be a piece of advise, but
that isn't ideal.
So, all in all, incremental linking is tricky. I wanted to make full
link as fast as possible, so that we don't have to think about how
to workaround the slowness of full link.
- Defining a completely new file format and use it
Idea: Sometimes, the ELF file format itself seems to be a limiting
factor of improving linker's performance. We might be able to make a
far better one if we create a new file format.
Reason for rejection: I rejected the idea because it apparently has
a practical issue (backward compatibility issue) and also doesn't
seem to improve performance of linkers that much. As clearly
demonstrated by mold, we can create a fast linker for ELF. I believe
ELF isn't that bad, after all. The semantics of the existing Unix
linkers, such as the name resolution algorithm or the linker script,
have slowed the linkers down, but that's not a problem of the file
format itself.
- Watching object files using inotify(2)
Idea: When mold is running as a daemon for preloading, use
inotify(2) to watch file system updates so that it can reload files
as soon as they are updated.
Reason for rejection: Just like the maximum number of files you can
simultaneously open, the maximum number of files you can watch using
inotify(2) isn't that large. Maybe just a single instance of mold is
fine with inotify(2), but it may fail if you run multiple of it.
The other reason for not doing it is because mold is quite fast
without it anyway. Invoking stat(2) on each file for file update
check takes less than 100 milliseconds for Chrome, and if most of
the input files are not updated, parsing updated files takes almost
no time.