mirror of
https://github.com/rui314/mold.git
synced 2024-11-15 04:10:40 +03:00
630 lines
30 KiB
Markdown
630 lines
30 KiB
Markdown
# mold: A Modern Linker
|
|
|
|
![mold image](docs/mold.jpg)
|
|
|
|
mold is a high-performance drop-in replacement for existing Unix
|
|
linkers. It is several times faster than LLVM lld linker, the (then-)
|
|
fastest open-source linker which I originally created a few years ago.
|
|
Here is a performance comparison of GNU gold, LLVM lld, and mold for
|
|
linking final executables of major large programs.
|
|
|
|
| Program (linker output size) | GNU gold | LLVM lld | mold | mold w/ preloading
|
|
|-------------------------------|----------|----------|-------|-------------------
|
|
| Firefox 87 (1.6 GiB) | 29.2s | 6.16s | 1.69s | 0.79s
|
|
| Chrome 86 (1.9 GiB) | 54.5s | 11.7s | 1.85s | 0.97s
|
|
| Clang 13 (3.1 GiB) | 59.4s | 5.68s | 2.76s | 0.86s
|
|
|
|
(These numbers are measured on an AMD Threadripper 3990X 64-core
|
|
machine with 32 threads enabled. All programs are built with debug
|
|
info enabled.)
|
|
|
|
Let me explain the "w/ preloading" column. mold supports the file
|
|
preloading feature. That is, if you run mold with `-preload` flag
|
|
along with other command-line flags, it becomes a daemon and halts
|
|
after parsing input files. Then, if you invoke mold with the same
|
|
command-line options (except `-preload` flag), it tells the daemon to
|
|
reload only updated files and proceed. With this feature enabled, and
|
|
if most of the input files haven't been updated, mold achieves a
|
|
near-`cp` performance or even exceeds it, as the throughput of file
|
|
copy using the `cp` command is about 2 GiB/s on my machine.
|
|
|
|
So, mold is extremely fast per se and even faster with a bit of cheating.
|
|
|
|
Why is mold so fast? One reason is that it simply uses faster
|
|
algorithms and efficient data structures than other linkers do.
|
|
The other reason is that the new linker is highly parallelized.
|
|
|
|
Here is a side-by-side comparison of per-core CPU usage of lld (left)
|
|
and mold (right). They are linking the same program, Chromium
|
|
executable.
|
|
|
|
![](docs/htop.gif)
|
|
|
|
As you can see, mold uses all available cores throughout its execution
|
|
and finishes quickly. On the other hand, lld failed to use available
|
|
cores most of the time. In this demo, the maximum parallelism is
|
|
artificially capped to 16 so that the bars fit in the GIF.
|
|
|
|
Currently, mold is being developed with Linux/x86-64 as the primary
|
|
target platform. mold can link many user-land programs including large
|
|
ones such as web browsers for that target. It also has preliminary
|
|
Linux/i386 support. Supporting other OSes and ISAs are planned after
|
|
Linux/x86-64 support is complete.
|
|
|
|
## How to build
|
|
|
|
mold is written in C++20, so you need a very recent version of GCC or
|
|
Clang. I'm using Ubuntu 20.04 as a development platform. In that
|
|
environment, you can build mold by the following commands.
|
|
|
|
```
|
|
$ sudo apt-get install build-essential libstdc++-10-dev cmake clang libssl-dev zlib1g-dev libxxhash-dev git
|
|
$ git clone https://github.com/rui314/mold.git
|
|
$ cd mold
|
|
$ git checkout v0.9.6
|
|
$ make
|
|
```
|
|
|
|
The last `make` command creates `mold` executable.
|
|
|
|
If you don't have Ubuntu 20.04, or if for any reason `make` in the
|
|
above commands doesn't work for you, you can use Docker to build it in
|
|
a Docker environment. To do so, just run `./build-static.sh` in this
|
|
directory. The script creates a Ubuntu 20.04 Docker image, installs
|
|
necessary tools and libraries to it, and builds mold as a static binary.
|
|
|
|
`make test` depends on a few more packages. To install, run the following commands:
|
|
|
|
```
|
|
$ sudo dpkg --add-architecture i386
|
|
$ sudo apt update
|
|
$ sudo apt-get install bsdmainutils dwarfdump libc6-dev:i386 lib32gcc-10-dev libstdc++-10-dev-arm64-cross gcc-10-aarch64-linux-gnu g++-10-aarch64-linux-gnu
|
|
```
|
|
|
|
## How to use
|
|
|
|
On Unix, the linker command (which is usually `/usr/bin/ld`) is
|
|
invoked indirectly by `cc` (or `gcc` or `clang`), which is typically
|
|
in turn indirectly invoked by `make` or some other build system command.
|
|
|
|
A classic way to use `mold`:
|
|
- `clang` before 12.0: pass `-fuse-ld=<absolute-path-to-mold-executable>`;
|
|
- clang after 12.0: pass `--ld-path=<absolute-path-to-mold-executable>`;
|
|
- gcc: `--ld-path` patch [has been declined by GCC maintainers](https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573833.html), instead they advise to use a [workaround](https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573823.html): create directory `<dirname>`, then `ln -s <path-to-mold> <dirname>/ld`, and then pass `-B<dirname>` (`-B` tells GCC to look for `ld` in specified location).
|
|
|
|
It is sometimes very hard to pass an appropriate command line option
|
|
to `cc` to specify an alternative linker. To deal with the situation,
|
|
mold has a feature to intercept all invocations of `ld`, `ld.lld` or
|
|
`ld.gold` and redirect it to itself. To use the feature, run `make`
|
|
(or another build command) as a subcommand of mold as follows:
|
|
|
|
```
|
|
$ path/to/mold -run make <make-options-if-any>
|
|
```
|
|
|
|
Here's an example showing how to link Rust code when using the
|
|
cargo package manager:
|
|
|
|
```
|
|
$ path/to/mold -run cargo build
|
|
```
|
|
|
|
Internally, mold invokes a given command with `LD_PRELOAD` environment
|
|
variable set to its companion shared object file. The shared object
|
|
file intercepts all function calls to `exec(3)`-family functions to
|
|
replace `argv[0]` with `mold` if it is `ld`, `ld.gold` or `ld.lld`.
|
|
|
|
mold leaves its identification string in `.comment` section in an output
|
|
file. You can print it out to verify that you are actually using mold.
|
|
|
|
```
|
|
$ readelf -p .comment <executable-file>
|
|
|
|
String dump of section '.comment':
|
|
[ 0] GCC: (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0
|
|
[ 2b] mold 9a1679b47d9b22012ec7dfbda97c8983956716f7
|
|
```
|
|
|
|
If `mold` is in `.comment`, the file is created by mold.
|
|
|
|
# Design and implementation of mold
|
|
|
|
For the rest of this documentation, I'll explain the design and the
|
|
implementation of mold. If you are only interested in using mold, you
|
|
don't need to read the below.
|
|
|
|
## Motivation
|
|
|
|
Here is why I'm writing a new linker:
|
|
|
|
- Even though lld has significantly improved the situation, linking is
|
|
still one of the slowest steps in a build. It is especially
|
|
annoying when I changed one line of code and had to wait for a few
|
|
seconds or even more for a linker to complete. It should be
|
|
instantaneous. There's a need for a faster linker.
|
|
|
|
- The number of cores on a PC has increased a lot lately, and this
|
|
trend is expected to continue. However, the existing linkers can't
|
|
take the advantage of the trend because they don't scale well for more
|
|
cores. I have a 64-core/128-thread machine, so my goal is to create
|
|
a linker that uses the CPU nicely. mold should be much faster than
|
|
other linkers on 4 or 8-core machines too, though.
|
|
|
|
- It looks to me that the designs of the existing linkers are somewhat
|
|
too similar, and I believe there are a lot of drastically different
|
|
designs that haven't been explored yet. Developers generally don't
|
|
care about linkers as long as they work correctly, and they don't
|
|
even think about creating a new one. So there may be lots of low
|
|
hanging fruits there in this area.
|
|
|
|
## Basic design
|
|
|
|
- In order to achieve a `cp`-like performance, the most important
|
|
thing is to fix the layout of an output file as quickly as possible, so
|
|
that we can start copying actual data from input object files to an
|
|
output file as soon as possible.
|
|
|
|
- Copying data from input files to an output file is I/O-bounded, so
|
|
there should be room for doing computationally-intensive tasks while
|
|
copying data from one file to another.
|
|
|
|
- We should allow the linker to preload object files from disk and
|
|
parse them in memory before a complete set of input object files
|
|
is ready. To do so, we need
|
|
to split the linker into two in such a way that the latter half of
|
|
the process finishes as quickly as possible by speculatively parsing
|
|
and preprocessing input files in the first half of the process.
|
|
|
|
- One of the most computationally-intensive stage among linker stages
|
|
is symbol resolution. To resolve symbols, we basically have to throw
|
|
all symbol strings into a hash table to match undefined symbols with
|
|
defined symbols. But this can be done in the preloading stage using
|
|
[string interning](https://en.wikipedia.org/wiki/String_interning).
|
|
|
|
- Object files may contain a special section called a mergeable string
|
|
section. The section contains lots of null-terminated strings, and
|
|
the linker is expected to gather all mergeable string sections and
|
|
merge their contents. So, if two object files contain the same
|
|
string literal, for example, the resulting output will contain a
|
|
single merged string. This step is computationally intensive, but string
|
|
merging can be done in the preloading stage using string interning.
|
|
|
|
- Static archives (.a files) contain object files, but the static
|
|
archive's string table contains only defined symbols of member
|
|
object files and lacks other types of symbols. That makes static
|
|
archives unsuitable for speculative parsing. Therefore, the linker
|
|
should ignore the symbol table of static archive and directly read
|
|
static archive members.
|
|
|
|
- If there's a relocation that uses a GOT of a symbol, then we have to
|
|
create a GOT entry for that symbol. Otherwise, we shouldn't. That
|
|
means we need to scan all relocation tables to fix the length and
|
|
the contents of a .got section. This is computationally intensive,
|
|
but this step is parallelizable.
|
|
|
|
## Compatibility
|
|
|
|
- GNU ld, GNU gold and LLVM lld support essentially the same set of
|
|
command line options and features. mold doesn't have to be
|
|
completely compatible with them. As long as it can be used for
|
|
linking large user-land programs, I'm fine with that. It is OK to
|
|
leave some command line options unimplemented; if mold is blazingly
|
|
fast, other projects would still be happy to adopt it by modifying
|
|
their projects' build files.
|
|
|
|
- mold emits Linux executables and runs only on Linux. I won't avoid
|
|
Unix-ism when writing code. I don't want to think about portability
|
|
until mold becomes a thing that's worth being ported.
|
|
|
|
## Linker Script
|
|
|
|
Linker script is an embedded language for the linker. It is mainly
|
|
used to control how input sections are mapped to output sections and
|
|
the layout of the output, but it can also do a lot of tricky stuff.
|
|
Its feature is useful especially for embedded programming, but it's
|
|
also an awfully underdocumented and complex language.
|
|
|
|
We have to implement a subset of the linker script language anwyay,
|
|
because on Linux, /usr/lib/x86_64-linux-gnu/libc.so is (despite its
|
|
name) not a shared object file but actually an ASCII file containing
|
|
linker script code to load the _actual_ libc.so file. But the feature
|
|
set for this purpose is very limited, and it is okay to implement them
|
|
to mold.
|
|
|
|
Besides that, we really don't want to implement the linker script
|
|
language. But at the same time, we want to satisfy the user needs that
|
|
are currently satisfied with the linker script language. So, what
|
|
should we do? Here is my observation:
|
|
|
|
- Linker script allows doing a lot of tricky stuff, such as specifying
|
|
the exact layout of a file, inserting arbitrary bytes between
|
|
sections, etc. But most of them can be done with a post-link binary
|
|
editing tool (such as `objcopy`).
|
|
|
|
- It looks like there are two things that truly cannot be done by a
|
|
post-link editing tool: (a) mapping input sections to output
|
|
sections, and (b) applying relocations.
|
|
|
|
From the above observation, I believe we need to provide only the
|
|
following features instead of the entire linker script language:
|
|
|
|
- A method to specify how input sections are mapped to output
|
|
sections, and
|
|
|
|
- a method to set addresses to output sections, so that relocations
|
|
are applied based on desired addresses.
|
|
|
|
I believe everything else can be done with a post-link binary editing
|
|
tool.
|
|
|
|
## Details
|
|
|
|
- As we aim to the 1-second goal for Chromium, every millisecond
|
|
counts. We can't ignore the latency of process exit. If we mmap a
|
|
lot of files, \_exit(2) is not instantaneous but takes a few hundred
|
|
milliseconds because the kernel has to clean up a lot of
|
|
resources. As a workaround, we should organize the linker command as
|
|
two processes; the first process forks the second process, and the
|
|
second process does the actual work. As soon as the second process
|
|
writes a result file to a filesystem, it notifies the first process,
|
|
and the first process exits. The second process can take time to
|
|
exit, because it is not an interactive process.
|
|
|
|
- At least on Linux, it looks like the filesystem's performance to
|
|
allocate new blocks to a new file is the limiting factor when
|
|
creating a new large file and filling its contents using mmap.
|
|
If you already have a large file in the buffer cache, writing to it is
|
|
much faster than creating a new fresh file and writing to it.
|
|
Based on this observation, mold overwrites to an existing
|
|
executable file if exists. My quick benchmark showed that I could
|
|
save 300 milliseconds when creating a 2 GiB output file.
|
|
Linux doesn't allow to open an executable for writing if it is
|
|
running (you'll get a "text busy" error if you attempt). mold
|
|
falls back to the usual way if it fails to open an output file.
|
|
|
|
- The output from the linker should be deterministic for the sake of
|
|
[build reproducibility](https://en.wikipedia.org/wiki/Reproducible_builds)
|
|
and ease of debugging. This might add a little bit of overhead to
|
|
the linker, but that shouldn't be too much.
|
|
|
|
- A .build-id, a unique ID embedded to an output file, is usually
|
|
computed by applying a cryptographic hash function (e.g. SHA-1) to
|
|
an output file. This is a slow step, but we can speed it up by
|
|
splitting a file into small chunks, computing SHA-1 for each chunk,
|
|
and then computing SHA-1 of the concatenated SHA-1 hashes
|
|
(i.e. constructing a [Merkle
|
|
Tree](https://en.wikipedia.org/wiki/Merkle_tree) of height 2).
|
|
Modern x86 processors have purpose-built instructions for SHA-1 and
|
|
can compute SHA-1 pretty quickly at about 2 GiB/s. Using 16
|
|
cores, a build-id for a 2 GiB executable can be computed in 60 to 70
|
|
milliseconds.
|
|
|
|
- BFD, gold, and lld support section garbage collection. That is, a
|
|
linker runs a mark-sweep garbage collection on an input graph, where
|
|
sections are vertices and relocations are edges, to discard all
|
|
sections that are not reachable from the entry point symbol
|
|
(i.e. `_start`) or a few other root sections. In mold, we are using
|
|
multiple threads to mark sections concurrently.
|
|
|
|
- Similarly, BFD, gold an lld support Identical Comdat Folding (ICF)
|
|
as yet another size optimization. ICF merges two or more read-only
|
|
sections that happen to have the same contents and relocations.
|
|
To do that, we have to find isomorphic subgraphs from larger graphs.
|
|
I implemented a new algorithm for mold, which is 5x faster than lld
|
|
to do ICF for Chromium (from 5 seconds to 1 second).
|
|
|
|
- [Intel Threading Building
|
|
Blocks](https://github.com/oneapi-src/oneTBB) (TBB) is a good
|
|
library for parallel execution and has several concurrent
|
|
containers. We are particularly interested in using
|
|
`parallel_for_each` and `concurrent_hash_map`.
|
|
|
|
- TBB provides `tbbmalloc` which works better for multi-threaded
|
|
applications than the glib'c malloc, but it looks like
|
|
[jemalloc](https://github.com/jemalloc/jemalloc) and
|
|
[mimalloc](https://github.com/microsoft/mimalloc) are a little bit
|
|
more scalable than `tbbmalloc`.
|
|
|
|
## Size of the problem
|
|
|
|
When linking Chrome, a linker reads 3,430,966,844 bytes of data in
|
|
total. The data contains the following items:
|
|
|
|
| Data item | Number
|
|
| ------------------------ | ------
|
|
| Object files | 30,723
|
|
| Public undefined symbols | 1,428,149
|
|
| Mergeable strings | 1,579,996
|
|
| Comdat groups | 9,914,510
|
|
| Regular sections¹ | 10,345,314
|
|
| Public defined symbols | 10,512,135
|
|
| Symbols | 23,953,607
|
|
| Sections | 27,543,225
|
|
| Relocations against SHF_ALLOC sections | 39,496,375
|
|
| Relocations | 62,024,719
|
|
|
|
¹ Sections that have to be copied from input object files to an
|
|
output file. Sections that contain relocations or symbols are for
|
|
example excluded.
|
|
|
|
## Internals
|
|
|
|
In this section, I'll explain the internals of mold linker.
|
|
|
|
### A brief history of Unix and the Unix linker
|
|
|
|
Conceptually, what a linker does is pretty simple. A compiler compiles
|
|
a fragment of a program (a single source file) into a fragment of
|
|
machine code and data (an object file, which typically has the .o
|
|
extension), and a linker stitches them together into a single
|
|
executable or a shared library image.
|
|
|
|
In reality, modern linkers for Unix-like systems are much more
|
|
complicated than the naive understanding because they have gradually
|
|
gained one feature at a time over the 50 years history of Unix, and
|
|
they are now something like a bag of lots of miscellaneous features in
|
|
which none of the features is more important than the others. It is
|
|
very easy to miss the forest for the trees, since for those who don't
|
|
know the details of the Unix linker, it is not clear which feature is
|
|
essential and which is not.
|
|
|
|
That being said, one thing is clear that at any point of Unix history,
|
|
a Unix linker has a coherent feature set for the Unix of that age. So,
|
|
let me entangle the history to see how the operating system, runtime,
|
|
and linker have gained features that we see today. That should give
|
|
you an idea of why a particular feature has been added to a linker in the
|
|
first place.
|
|
|
|
1. Original Unix didn't support shared libraries, and a program was
|
|
always loaded to a fixed address. An executable was something like
|
|
a memory dump that was just loaded to a particular address by the
|
|
kernel. After loading, the kernel started executing the program by
|
|
setting the instruction pointer to a particular address.
|
|
|
|
The most essential feature for any linker is relocation processing.
|
|
The original Unix linker of course supported that. Let me explain
|
|
what that is.
|
|
|
|
Individual object files are inevitably incomplete as a program,
|
|
because when a compiler created them, it only see a part of an
|
|
entire program. For example, if an object file contains a function
|
|
call that refers to another object file, the `call` instruction in the
|
|
object cannot be complete, as the compiler has no idea as to what
|
|
is the called function's address. To deal with this, the compiler
|
|
emits a placeholder value (typically just zero) instead of a real
|
|
address and leaves metadata in an object file saying "fix offset X
|
|
of this file with an address of Y". That metadata is called
|
|
"relocation". Relocations are typically processed by the linker.
|
|
|
|
It is easy for a linker to apply relocations for the original Unix
|
|
because a program is always loaded to a fixed address. It exactly
|
|
knows the addresses of all functions and data when linking a
|
|
program.
|
|
|
|
Static library support, which is still an important feature of Unix
|
|
linker, also dates back to this early period of Unix history.
|
|
To understand what it is, imagine that you are trying to compile
|
|
a program for the early Unix. You don't want to waste time to
|
|
compile libc functions every time you compile your program (the
|
|
computers of the era were incredibly slow), so you have already
|
|
placed each libc function into a separate source file and compiled
|
|
them individually. That means you have object files for each libc
|
|
function, e.g., printf.o, scanf.o, atoi.o, write.o, etc.
|
|
|
|
Given this configuration, all you have to do to link your program
|
|
against libc functions is to pick up the right set of libc object
|
|
files and give them to the linker along with the object files of your
|
|
program. But, keeping the linker command line in sync with the
|
|
libc functions you are using in your program is bothersome. You can
|
|
be conservative; you can specify all libc object files to the
|
|
command line, but that leads to program bloat because the linker
|
|
unconditionally link all object files given to it no matter whether
|
|
they are used or not. So, a new feature was added to the linker to
|
|
fix the problem. That is the static library, which is also called
|
|
the archive file.
|
|
|
|
An archive file is just a bundle of object files, just like zip
|
|
file but in an uncompressed form. An archive file typically has the
|
|
.a file extension and named after its contents. For example, the
|
|
archive file containing all libc objects is named `libc.a`.
|
|
|
|
If you pass an archive file along with other object files to the
|
|
linker, the linker pulls out an object file from the archive _only
|
|
when_ it is referenced by other object files. In other words,
|
|
unlike object files directly given to a linker, object files
|
|
wrapped in an archive are not linked to the output by default.
|
|
An archive works as a supplement to complete your program.
|
|
|
|
Even today, you can still find a libc archive file. Run `ar t
|
|
/usr/lib/x86_64-linux-gnu/libc.a` on Linux should give you a list
|
|
of object files in the libc archive.
|
|
|
|
2. In the '80s, Sun Microsystems, a leading commercial Unix vendor at the
|
|
time, added shared library support to their Unix variant, SunOS.
|
|
|
|
(This section is incomplete.)
|
|
|
|
## Concurrency strategy
|
|
|
|
In this section, I'll explain the high-level concurrency strategy of
|
|
mold.
|
|
|
|
In most places, mold adopts data parallelism. That is, we have a huge
|
|
number of pieces of data of the same kind, and we process each of them
|
|
individually using parallel for-loop. For example, after identifying
|
|
the exact set of input object files, we need to scan all relocation
|
|
tables to determine the sizes of .got and .plt sections. We do that
|
|
using a parallel for-loop. The granularity of parallel processing in
|
|
this case is the relocation table.
|
|
|
|
Data parallelism is very efficient and scalable because there's no
|
|
need for threads to communicate with each other while working on each
|
|
element of data. In addition to that, data parallelism is easy to
|
|
understand, as it is just a for-loop in which multiple iterations may
|
|
be executed in parallel. We don't use high-level communication or
|
|
synchronization mechanisms such as channels, futures, promises,
|
|
latches or something like that in mold.
|
|
|
|
In some cases, we need to share a little bit of data between threads
|
|
while executing a parallel for-loop. For example, the loop to scan
|
|
relocations turns on "requires GOT" or "requires PLT" flags in a
|
|
symbol. Symbol is a shared resource, and writing to them from multiple
|
|
threads without synchronization is unsafe. To deal with it, we made
|
|
the flag an atomic variable.
|
|
|
|
The other common pattern you can find in mold which is build on top of
|
|
the parallel for-loop is the map-reduce pattern. That is, we run a
|
|
parallel for-loop on a large data set to produce a small data set and
|
|
process the small data set with a single thread. Let me take a
|
|
build-id computation as an example. Build-id is typically computed by
|
|
applying a cryptographic hash function such as SHA-1 on a linker's
|
|
output file. To compute it, we first consider an output as a sequence
|
|
of 1 MiB blocks and compute a SHA-1 hash for each block in parallel.
|
|
Then, we concatenate the SHA-1 hashes and compute a SHA-1 hash on the
|
|
hashes to get a final build-id.
|
|
|
|
Finally, we use concurrent hashmap at a few places in mold. Concurrent
|
|
hashmap is a hashmap to which multiple threads can safely insert items
|
|
in parallel. We use it in the symbol resolution stage, for example.
|
|
To resolve symbols, we basically have to throw in all defined symbols
|
|
into a hash table, so that we can find a matching defined symbol for
|
|
an undefined symbol by name. We do the hash table insertion from a
|
|
parallel for-loop which iterates over a list of input files.
|
|
|
|
Overall, even though mold is highly scalable, it succeeded to avoid
|
|
complexties you often find in complex parallel programs. From high
|
|
level, mold just serially executes the linker's internal passes one by
|
|
one. Each pass is parallelized using parallel for-loops.
|
|
|
|
## Rejected ideas
|
|
|
|
In this section, I'll explain the alternative designs I currently do
|
|
not plan to implement and why I turned them down.
|
|
|
|
- Placing variable-length sections at end of an output file and start
|
|
copying file contents before fixing the output file layout
|
|
|
|
Idea: Fixing the layout of regular sections seems easy, and if we
|
|
place them at beginning of a file, we can start copying their
|
|
contents from their input files to an output file. While copying
|
|
file contents, we can compute the sizes of variable-length sections
|
|
such as .got or .plt and place them at end of the file.
|
|
|
|
Reason for rejection: I did not choose this design because I doubt
|
|
if it could actually shorten link time and I think I don't need it
|
|
anyway.
|
|
|
|
The linker has to de-duplicate comdat sections (i.e. inline
|
|
functions that are included in multiple object files), so we
|
|
cannot compute the layout of regular sections until we resolve all
|
|
symbols and de-duplicate comdats. That takes a few hundred
|
|
milliseconds. After that, we can compute the sizes of
|
|
variable-length sections in less than 100 milliseconds. It's quite
|
|
fast, so it doesn't seem to make much sense to proceed without
|
|
fixing the final file layout.
|
|
|
|
The other reason to reject this idea is because there's good a
|
|
chance for this idea to have a negative impact on linker's overall
|
|
performance. If we copy file contents before fixing the layout, we
|
|
can't apply relocations to them while copying because symbol
|
|
addresses are not available yet. If we fix the file layout first, we
|
|
can apply relocations while copying, which is effectively zero-cost
|
|
due to a very good data locality. On the other hand, if we apply
|
|
relocations long after we copy file contents, it's pretty expensive
|
|
because section contents are very likely to have been evicted from
|
|
CPU cache.
|
|
|
|
- Incremental linking
|
|
|
|
Idea: Incremental linking is a technique to patch a previous linker's
|
|
output file so that only functions or data that are updated from the
|
|
previous build are written to it. It is expected to significantly
|
|
reduce the amount of data copied from input files to an output file
|
|
and thus speed up linking. GNU BFD and gold linkers support it.
|
|
|
|
Reason for rejection: I turned it down because it (1) is
|
|
complicated, (2) doesn't seem to speed it up that much and (3) has
|
|
several practical issues. Let me explain each of them.
|
|
|
|
First, incremental linking for real C/C++ programs is not as easy as
|
|
one might think. Let me take malloc as an example. malloc is usually
|
|
defined by libc, but you can implement it in your program, and if
|
|
that's the case, the symbol `malloc` will be resolved to your
|
|
function instead of the one in libc. If you include a library that
|
|
defines malloc (such as libjemalloc or libtbbmallc) before libc,
|
|
their malloc will override libc's malloc.
|
|
|
|
Assume that you are using a nonstandard malloc. What if you remove
|
|
your malloc from your code, or remove `-ljemalloc` from your
|
|
Makefile? The linker has to include a malloc from libc, which may
|
|
include more object files to satisfy its dependencies. Such code
|
|
change can affect the entire program rather than just replacing one
|
|
function. The same is true for adding malloc to your program. Making
|
|
a local change doesn't necessarily result in a local change in the
|
|
binary level. It can easily have cascading effects.
|
|
|
|
Some ELF fancy features make incremental linking even harder to
|
|
implement. Take the weak symbol as an example. If you define `atoi`
|
|
as a weak symbol in your program, and if you are not using `atoi`
|
|
at all in your program, that symbol will be resolved to address
|
|
0. But if you start using some libc function that indirectly calls
|
|
`atoi`, then `atoi` will be included in your program, and your weak
|
|
symbol will be resolved to that function. I don't know how to
|
|
efficiently fix up a binary for this case.
|
|
|
|
This is a hard problem, so existing linkers don't try too hard to
|
|
solve it. For example, IIRC, gold falls back to full link if any
|
|
function is removed from a previous build. If you want to not annoy
|
|
users in the fallback case, you need to make full link fast anyway.
|
|
|
|
Second, incremental linking itself has an overhead. It has to detect
|
|
updated files, patch an existing output file and write additional
|
|
data to an output file for future incremental linking. GNU gold, for
|
|
instance, takes almost 30 seconds on my machine to do a null
|
|
incremental link (i.e. no object files are updated from a previous
|
|
build) for chrome. It's just too slow.
|
|
|
|
Third, there are other practical issues in incremental linking. It's
|
|
not reproducible, so your binary isn't going to be the same as other
|
|
binaries even if you are compiling the same source tree using the
|
|
same compiler toolchain. Or, it is complex and there might be a bug
|
|
in it. If something doesn't work correctly, "remove --incremental
|
|
from your Makefile and try again" could be a piece of advice, but
|
|
that isn't ideal.
|
|
|
|
So, all in all, incremental linking is tricky. I wanted to make full
|
|
link as fast as possible, so that we don't have to think about how
|
|
to work around the slowness of full link.
|
|
|
|
- Defining a completely new file format and use it
|
|
|
|
Idea: Sometimes, the ELF file format itself seems to be a limiting
|
|
factor in improving the linker's performance. We might be able to make a
|
|
far better one if we create a new file format.
|
|
|
|
Reason for rejection: I rejected the idea because it apparently has
|
|
a practical issue (backward compatibility issue) and also doesn't
|
|
seem to improve the performance of linkers that much. As clearly
|
|
demonstrated by mold, we can create a fast linker for ELF. I believe
|
|
ELF isn't that bad, after all. The semantics of the existing Unix
|
|
linkers, such as the name resolution algorithm or the linker script,
|
|
have slowed the linkers down, but that's not a problem of the file
|
|
format itself.
|
|
|
|
- Watching object files using inotify(2)
|
|
|
|
Idea: When mold is running as a daemon for preloading, use
|
|
inotify(2) to watch file system updates so that it can reload files
|
|
as soon as they are updated.
|
|
|
|
Reason for rejection: Just like the maximum number of files you can
|
|
simultaneously open, the maximum number of files you can watch using
|
|
inotify(2) isn't that large. Maybe just a single instance of mold is
|
|
fine with inotify(2), but it may fail if you run multiple of it.
|
|
|
|
The other reason for not doing it is because mold is quite fast
|
|
without it anyway. Invoking stat(2) on each file for file update
|
|
check takes less than 100 milliseconds for Chrome, and if most of
|
|
the input files are not updated, parsing updated files takes almost
|
|
no time.
|