mirror of
https://github.com/rui314/mold.git
synced 2024-12-28 19:04:27 +03:00
617 lines
29 KiB
Markdown
617 lines
29 KiB
Markdown
# mold: A Modern Linker
|
|
|
|
![mold image](mold.jpg)
|
|
|
|
This is a repository of a linker I'm currently developing as a
|
|
replacement for existing Unix linkers such as GNU BFD, GNU gold or
|
|
LLVM lld.
|
|
|
|
My goal was to make a linker that is as fast as concatenating input
|
|
object files with `cat` command. It may sound like an impossible goal,
|
|
but it's not entirely impossible because of the following two reasons:
|
|
|
|
1. `cat` is a simple single-threaded program which isn't the fastest
|
|
one as a file copy command. My linker can use multiple threads to
|
|
copy file contents more efficiently to save time to do extra work.
|
|
|
|
2. Copying file contents is I/O-bounded, and many CPU cores should be
|
|
available during file copy. We can use them to do extra work while
|
|
copying file contents.
|
|
|
|
Concretely speaking, I wanted to use the linker to link a Chromium
|
|
executable with full debug info (~2 GiB in size) just in 1 second.
|
|
LLVM's lld, the fastest open-source linker which I originally created
|
|
a few years ago, takes about 12 seconds to link Chromium on my machine.
|
|
So the goal is 12x performance bump over lld. Compared to GNU gold,
|
|
it's more than 50x.
|
|
|
|
It looks like mold has achieved the goal. It can link Chromium in 2
|
|
seconds with 8-cores/16-threads, and if I enable the preloading
|
|
feature (I'll explain it later), the latency of the linker for an
|
|
interactive use is less than 900 milliseconds. It is actualy faster
|
|
than `cat`.
|
|
|
|
Note that even though mold can create a runnable Chrome executable,
|
|
it is far from complete and not usable for production. mold is still
|
|
just a toy linker, and this is still just my pet project.
|
|
|
|
## Performance visualization
|
|
|
|
Here is a side-by-side comparison of per-core CPU usage of lld (left)
|
|
and mold (right). They are linking the same program, Chromium
|
|
executable, which is ~2GB with debug info. As you can see, mold is
|
|
much faster than lld. Throughout its execution, mold uses all
|
|
available cores, which is artificially capped to 16 cores in this
|
|
demo.
|
|
|
|
![](htop.gif)
|
|
|
|
## How to build
|
|
|
|
mold is written in C++20, so you need a very recent version of GCC or
|
|
Clang. I'm using Ubuntu 20.04 as a development platform. In that
|
|
environment, you can build mold by the following commands.
|
|
|
|
```
|
|
$ sudo apt-get install build-essential libstdc++-10-dev clang-10 libssl-dev zlib1g-dev cmake
|
|
$ git clone --recursive https://github.com/rui314/mold.git
|
|
$ cd mold
|
|
$ make submodules
|
|
$ make
|
|
```
|
|
|
|
The last `make` command creates `mold` executable.
|
|
|
|
## How to use
|
|
|
|
On Unix, the linker command (which is usually `/usr/bin/ld`) is
|
|
invoked indirectly by `cc` (or `gcc` or `clang`), which is typically
|
|
in turn indirectly invoked by `make` or some other build system
|
|
command. It is sometimes very hard to pass an appropriate command line
|
|
option to `cc` to specify an alternative linker.
|
|
|
|
To deal with the situation, mold has a feature to intercept all
|
|
invocations of `/usr/bin/ld` and redirect it to itself. To use the
|
|
feature, run `make` (or other build command) as a subprocess of mold
|
|
as follows:
|
|
|
|
```
|
|
$ path/to/mold -run make <make-options-if-any>
|
|
```
|
|
|
|
Internally, mold invokes a given command with `LD_PRELOAD` environment
|
|
variable set to its companion shared object file. The shared object
|
|
file intercepts all function calls to exec-family functions and
|
|
`posix_spawn` to replace `argv[0]` with `mold` if it is `/usr/bin/ld`.
|
|
|
|
Alternatively, you can pass `-fuse-ld=<absolute-path-to-mold-executable>`
|
|
to a linker command line. Since GCC doesn't support that option,
|
|
I recommend using clang.
|
|
|
|
mold leaves its identification string in `.comment` section in an output
|
|
file. You can print it out to verify that you are actually using mold.
|
|
|
|
```
|
|
$ readelf -p .comment <executable-file>
|
|
|
|
String dump of section '.comment':
|
|
[ 0] GCC: (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0
|
|
[ 2b] mold 9a1679b47d9b22012ec7dfbda97c8983956716f7
|
|
```
|
|
|
|
If `mold` is in `.comment`, the file is created by mold.
|
|
|
|
## Background
|
|
|
|
- Even though lld has significantly improved the situation, linking is
|
|
still one of the slowest steps in a build. It is especially
|
|
annoying when I changed one line of code and had to wait for a few
|
|
seconds or even more for a linker to complete. It should be
|
|
instantaneous. There's a need for a faster linker.
|
|
|
|
- The number of cores on a PC has increased a lot lately, and this
|
|
trend is expected to continue. However, the existing linkers can't
|
|
take the advantage of that because they don't scale well for more
|
|
cores. I have a 64-core/128-thread machine, so my goal is to create
|
|
a linker that uses the CPU nicely. mold should be much faster than
|
|
other linkers on 4 or 8-core machines too, though.
|
|
|
|
- It looks to me that the designs of the existing linkers are somewhat
|
|
too similar, and I believe there are a lot of drastically different
|
|
designs that haven't been explored yet. Developers generally don't
|
|
care about linkers as long as they work correctly, and they don't
|
|
even think about creating a new one. So there may be lots of low
|
|
hanging fruits there in this area.
|
|
|
|
## Basic design
|
|
|
|
- In order to achieve a `cat`-like performance, the most important
|
|
thing is to fix the layout of an output file as quickly as possible, so
|
|
that we can start copying actual data from input object files to an
|
|
output file as soon as possible.
|
|
|
|
- Copying data from input files to an output file is I/O-bounded, so
|
|
there should be room for doing computationally-intensive tasks while
|
|
copying data from one file to another.
|
|
|
|
- We should allow the linker to preload object files from disk and
|
|
parse them in memory before a complete set of input object files
|
|
is ready. My idea is this: if a user invokes the linker with
|
|
`--preload` flag along with other command line flags a few seconds
|
|
before the actual linker invocation, then the following actual
|
|
linker invocation with the same command line options (except
|
|
`--preload` flag) becomes magically faster. Behind the scenes, the
|
|
linker starts preloading object files on the first invocation and
|
|
becomes a daemon. The second invocation of the linker notifies the
|
|
daemon to reload updated object files and then proceed.
|
|
|
|
- Daemonizing alone wouldn't make the linker magically faster. We need
|
|
to split the linker into two in such a way that the latter half of
|
|
the process finishes as quickly as possible by speculatively parsing
|
|
and preprocessing input files in the first half of the process. The
|
|
key factor of success would be to design nice data structures that
|
|
allows us to offload as much processing as possible from the second
|
|
to the first half.
|
|
|
|
- One of the most time-consuming stage among linker stages is symbol
|
|
resolution. To resolve symbols, we basically have to throw all
|
|
symbol strings into a hash table to match undefined symbols with
|
|
defined symbols. But this can be done in the daemon using [string
|
|
interning](https://en.wikipedia.org/wiki/String_interning).
|
|
|
|
- Object files may contain a special section called a mergeable string
|
|
section. The section contains lots of null-terminated strings, and
|
|
the linker is expected to gather all mergeable string sections and
|
|
merge their contents. So, if two object files contain the same
|
|
string literal, for example, the resulting output will contain a
|
|
single merged string. This step is time-consuming, but string
|
|
merging can be done in the daemon using string interning.
|
|
|
|
- Static archives (.a files) contain object files, but the static
|
|
archive's string table contains only defined symbols of member
|
|
object files and lacks other types of symbols. That makes static
|
|
archives unsuitable for speculative parsing. The daemon should
|
|
ignore the string table of static archive and directly read all
|
|
member object files of all archives to get the whole picture of
|
|
all possible input files.
|
|
|
|
- If there's a relocation that uses a GOT of a symbol, then we have to
|
|
create a GOT entry for that symbol. Otherwise, we shouldn't. That
|
|
means we need to scan all relocation tables to fix the length and
|
|
the contents of a .got section. This is perhaps time-consuming, but
|
|
this step is parallelizable.
|
|
|
|
## Compatibility
|
|
|
|
- GNU ld, GNU gold and LLVM lld support essentially the same set of
|
|
command line options and features. mold doesn't have to be
|
|
completely compatible with them. As long as it can be used for
|
|
linking large user-land programs, I'm fine with that. It is OK to
|
|
leave some command line options unimplemented; if mold is blazingly
|
|
fast, other projects would still be happy to adopt it by modifying
|
|
their projects' build files.
|
|
|
|
- mold emits Linux executables and runs only on Linux. I won't avoid
|
|
Unix-ism when writing code (e.g. I'll probably use fork(2)).
|
|
I don't want to think about portability until mold becomes a thing
|
|
that's worth to be ported.
|
|
|
|
## Linker Script
|
|
|
|
Linker script is an embedded language for the linker. It is mainly
|
|
used to control how input sections are mapped to output sections and
|
|
the layout of the output, but it can also do a lot of tricky stuff.
|
|
Its feature is useful especially for embedded programming, but it's
|
|
also an awfully underdocumented and complex language.
|
|
|
|
We have to implement a subset of the linker script language anwyay,
|
|
because on Linux, /usr/lib/x86_64-linux-gnu/libc.so is (despite its
|
|
name) not a shared object file but actually an ASCII file containing
|
|
linker script code to load the _actual_ libc.so file. But the feature
|
|
set for this purpose is very limited, and it is okay to implement them
|
|
to mold.
|
|
|
|
Besides that, we really don't want to implement the linker script
|
|
langauge. But at the same time, we want to satisfy the user needs that
|
|
are currently satisfied with the linker script langauge. So, what
|
|
should we do? Here is my observation:
|
|
|
|
- Linker script allows to do a lot of tricky stuff, such as specifying
|
|
the exact layout of a file, inserting arbitrary bytes between
|
|
sections, etc. But most of them can be done with a post-link binary
|
|
editing tool (such as `objcopy`).
|
|
|
|
- It looks like there are two things that truely cannot be done by a
|
|
post-link editing tool: (a) mapping input sections to output
|
|
sections, and (b) applying relocations.
|
|
|
|
From the above observation, I believe we need to provide only the
|
|
following features instead of the entire linker script langauge:
|
|
|
|
- A method to specify how input sections are mapped to output
|
|
sections, and
|
|
|
|
- a method to set addresses to output sections, so that relocations
|
|
are applied based on desired adddresses.
|
|
|
|
I believe everything else can be done with a post-link binary editing
|
|
tool.
|
|
|
|
## Details
|
|
|
|
- If we aim to the 1 second goal for Chromium, every millisecond
|
|
counts. We can't ignore the latency of process exit. If we mmap a
|
|
lot of files, \_exit(2) is not instantaneous but takes a few hundred
|
|
milliseconds because the kernel has to clean up a lot of
|
|
resources. As a workaround, we should organize the linker command as
|
|
two processes; the first process forks the second process, and the
|
|
second process does the actual work. As soon as the second process
|
|
writes a result file to a filesystem, it notifies the first process,
|
|
and the first process exits. The second process can take time to
|
|
exit, because it is not an interactive process.
|
|
|
|
- At least on Linux, it looks like the filesystem's performance to
|
|
allocate new blocks to a new file is the limiting factor when
|
|
creating a new large file and filling its contents using mmap.
|
|
If you already have a large file on a filesystem, writing to it is
|
|
much faster than creating a new fresh file and writing to it.
|
|
Based on this observation, mold should overwrite to an existing
|
|
executable file if exists. My quick benchmark showed that I could
|
|
save 300 milliseconds when creating a 2 GiB output file.
|
|
Linux doesn't allow to open an executable for writing if it is
|
|
running (you'll get "text busy" error if you attempt). mold should
|
|
fall back to the usual way if it fails to open an output file.
|
|
|
|
- As an implementation strategy, we do _not_ care about memory leak
|
|
because we really can't save that much memory by doing precise
|
|
memory management. It is because most objects that are allocated
|
|
during an execution of mold are needed until the very end of the
|
|
program. I'm sure this is an odd memory management scheme (or the
|
|
lack thereof), but this is what LLVM lld does too.
|
|
|
|
- The output from the linker should be deterministic for the sake of
|
|
[build reproducibility](https://en.wikipedia.org/wiki/Reproducible_builds)
|
|
and ease of debugging. This might add a little bit of overhead to
|
|
the linker, but that shouldn't be too much.
|
|
|
|
- A .build-id, a unique ID embedded to an output file, is usually
|
|
computed by applying a cryptographic hash function (e.g. SHA-1) to
|
|
an output file. This is a slow step, but we can speed it up by
|
|
splitting a file into small chunks, computing SHA-1 for each chunk,
|
|
and then computing SHA-1 of the concatenated SHA-1 hashes
|
|
(i.e. constructing a [Markle
|
|
Tree](https://en.wikipedia.org/wiki/Merkle_tree) of height 2).
|
|
Modern x86 processors have purpose-built instructions for SHA-1 and
|
|
can compute SHA-1 pretty quickly at about 2 GiB/s rate. Using 16
|
|
cores, a build-id for a 2 GiB executable can be computed in 60 to 70
|
|
milliseconds.
|
|
|
|
- BFD, gold, and lld support section garbage collection. That is, a
|
|
linker runs a mark-sweep garbage collection on an input graph, where
|
|
sections are vertices and relocations are edges, to discard all
|
|
sections that are not reachable from the entry point symbol
|
|
(i.e. `_start`) or a few other root sections. In mold, we are using
|
|
multiple threads to mark sections concurrently.
|
|
|
|
- Similarly, BFD, gold an lld support Identical Comdat Folding (ICF)
|
|
as a yet another size optimization. ICF merges two or more read-only
|
|
sections that happen to have the same contents and relocations.
|
|
To do that, we have to find isomorphic subgraphs from larger graphs.
|
|
I implemented a new algorithm for mold, which is 5x faster than lld
|
|
to do ICF for Chromium (from 5 seconds to 1 second).
|
|
|
|
- [Intel Threading Building
|
|
Blocks](https://github.com/oneapi-src/oneTBB) (TBB) is a good
|
|
library for parallel execution and has several concurrent
|
|
containers. We are particularly interested in using
|
|
`parallel_for_each` and `concurrent_hash_map`.
|
|
|
|
- TBB provides `tbbmalloc` which works better for multi-threaded
|
|
applications than the glib'c malloc, but it looks like
|
|
[jemalloc](https://github.com/jemalloc/jemalloc) and
|
|
[mimalloc](https://github.com/microsoft/mimalloc) are a little bit
|
|
more scalable than `tbbmalloc`.
|
|
|
|
## Size of the problem
|
|
|
|
When linking Chrome, a linker reads 3,430,966,844 bytes of data in
|
|
total. The data contains the following items:
|
|
|
|
| Data item | Number
|
|
| ------------------------ | ------
|
|
| Object files | 30,723
|
|
| Public undefined symbols | 1,428,149
|
|
| Mergeable strings | 1,579,996
|
|
| Comdat groups | 9,914,510
|
|
| Regular sections¹ | 10,345,314
|
|
| Public defined symbols | 10,512,135
|
|
| Symbols | 23,953,607
|
|
| Sections | 27,543,225
|
|
| Relocations against SHF_ALLOC sections | 39,496,375
|
|
| Relocations | 62,024,719
|
|
|
|
¹ Sections that have to be copied from input object files to an
|
|
output file. Sections that contain relocations or symbols are for
|
|
example excluded.
|
|
|
|
## Internals
|
|
|
|
In this section, I'll explain the internals of mold linker.
|
|
|
|
### A brief history of Unix and the Unix linker
|
|
|
|
Conceptually, what a linker does is pretty simple. A compiler compiles
|
|
a fragment of a program (a single source file) into a fragment of
|
|
machine code and data (an object file, which typically has the .o
|
|
extension), and a linker stiches them together into a single
|
|
executable or a shared library image.
|
|
|
|
In reality, modern linkers for Unix-like systems are much more
|
|
compilcated than the naive understanding because they have gradually
|
|
gained one feature at a time over the 50 years history of Unix, and
|
|
they are now something like a bag of lots of miscellaneous features in
|
|
which none of the features is more important than the others. It is
|
|
very easy to miss the forest for the trees, since for those who don't
|
|
know the details of the Unix linker, it is not clear which feature is
|
|
essential and which is not.
|
|
|
|
That being said, one thing is clear that at any point of Unix history,
|
|
a Unix linker has a coherent feature set for the Unix of that age. So,
|
|
let me entangle the history to see how the operating system, runtime
|
|
and linker have gained features that we see today. That should give
|
|
you an idea why a particular feature has been added to a linker in the
|
|
first place.
|
|
|
|
1. Original Unix didn't support shared library, and a program was
|
|
always loaded to a fixed address. An executable was something like
|
|
a memory dump which was just loaded to a particular address by the
|
|
kernel. After loading, the kernel started executing the program by
|
|
setting the instruction pointer to a particular address.
|
|
|
|
The most essential feature for any linker is relocation processing.
|
|
The original Unix linker of course supported that. Let me explain
|
|
what that is.
|
|
|
|
Individual object files are inevitably incomplete as a program,
|
|
because when a compiler created them, it only see a part of an
|
|
entire program. For example, if an object file contains a function
|
|
call that refers other object file, the `call` instruction in the
|
|
object cannot be complete, as the compiler has no idea as to what
|
|
is the called function's address. To deal with this, the compiler
|
|
emits a placeholder value (typically just zero) instead of a real
|
|
address and leave a metadata in an object file saying "fix offset X
|
|
of this file with an address of Y". That metadata is called
|
|
"relocation". Relocations are typically processed by the linker.
|
|
|
|
It is easy for a linker to apply relocations for the original Unix
|
|
because a program is always loaded to a fixed address. It exactly
|
|
knows the addresses of all functions and data when linking a
|
|
program.
|
|
|
|
Static library support, which is still an important feature of Unix
|
|
linker, also dates back to this early period of Unix history.
|
|
To understand what it is, imagine that you are trying to compile
|
|
a program for the early Unix. You don't want to waste time to
|
|
compile libc functions every time you compile your program (the
|
|
computers of the era was incredibly slow), so you have already
|
|
placed each libc function into a separate source file and compiled
|
|
them individually. That means, you have object files for each libc
|
|
function, e.g., printf.o, scanf.o, atoi.o, write.o, etc.
|
|
|
|
Given this configuration, all you have to do to link your program
|
|
against libc functions is to pick up a right set of libc object
|
|
files and give them to the linker along with the object files of your
|
|
program. But, keeping the linker command line in sync with the
|
|
libc functions you are using in your program is bothersome. You can
|
|
be conservative; you can specify all libc object files to the
|
|
command line, but that leads to program bloat because the linker
|
|
unconditionally link all object files given to it no matter whether
|
|
they are used or not. So, a new feature was added to the linker to
|
|
fix the problem. That is the static library, which is also called
|
|
the archive file.
|
|
|
|
An archive file is just a bundle of object files, just like zip
|
|
file but in an uncompressed form. An achive file typically has the
|
|
.a file extension and named after its contents. For example, the
|
|
archive file containing all libc objects is named `libc.a`.
|
|
|
|
If you pass an archive file along with other object files to the
|
|
linker, the linker pulls out an object file from the archive _only
|
|
when_ it is referenced by other object files. In other words,
|
|
unlike object files directly given to a linker, object files
|
|
wrapped in an archive are not linked to an output by default.
|
|
An archive works as supplements to complete your program.
|
|
|
|
Even today, you can still find a libc archive file. Run `ar t
|
|
/usr/lib/x86_64-linux-gnu/libc.a` on Linux should give you a list
|
|
of object files in the libc archive.
|
|
|
|
2. In '80s, Sun Microsystems, a leading commercial Unix vendor at the
|
|
time, added a shared library support to their Unix variant, SunOS.
|
|
|
|
(This section is incomplete.)
|
|
|
|
## Concurrency strategy
|
|
|
|
In this section, I'll explain the high level concurrency strategy of
|
|
mold.
|
|
|
|
In most places, mold adopts data parallelism. That is, we have a huge
|
|
number of piece of data of the same kind, and we process each of them
|
|
individually using parallel for-loop. For example, after identifying
|
|
the exact set of input object files, we need to scan all relocation
|
|
tables to determine the sizes of .got and .plt sections. We do that
|
|
using a parallel for-loop. The granularity of parallel processing in
|
|
this case is the relocation table.
|
|
|
|
Data parallelism is very efficient and scalable because there's no
|
|
need for threads to communicate with each other while working on each
|
|
element of data. In addition to that, data parallelism is easy to
|
|
understand, as it is just a for-loop in which multiple iterations may
|
|
be executed in parallel. We don't use high-level communication or
|
|
synchronization mechanisms such as channels, futures, promises,
|
|
latches or something like that in mold.
|
|
|
|
In some cases, we need to share a little bit of data between threads
|
|
while executing a parallel for-loop. For example, the loop to scan
|
|
relocations turns on "requires GOT" or "requires PLT" flags in a
|
|
symbol. Symbol is a shared resource, and writing to them from multiple
|
|
threads without synchronization is unsafe. To deal with it, we made
|
|
the flag an atomic variable.
|
|
|
|
The other common pattern you can find in mold which is build on top of
|
|
the parallel for-loop is the map-reduce pattern. That is, we run a
|
|
parallel for-loop on a large data set to produce a small data set and
|
|
process the small data set with a single thread. Let me take a
|
|
build-id computation as an example. Build-id is typically computed by
|
|
applying a cryptographic hash function such as SHA-1 on a linker's
|
|
output file. To compute it, we first consider an output as a sequence
|
|
of 1 MiB blocks and compute a SHA-1 hash for each block in parallel.
|
|
Then, we concatenate the SHA-1 hashes and compute a SHA-1 hash on the
|
|
hashes to get a final build-id.
|
|
|
|
Finally, we use concurrent hashmap at a few places in mold. Concurrent
|
|
hashmap is a hashmap to which multiple threads can safely insert items
|
|
in parallel. We use it in the symbol resolution stage, for example.
|
|
To resolve symbols, we basically have to throw in all defined symbols
|
|
into a hash table, so that we can find a matching defined symbol for
|
|
an undefined symbol by name. We do the hash table insertion from a
|
|
parallel for-loop which iterates over a list of input files.
|
|
|
|
Overall, even though mold is highly scalable, it succeeded to avoid
|
|
complexties you often find in complex parallel programs. From high
|
|
level, mold just serially executes linker's internal passes one by
|
|
one. Each pass is parallelized using parallel for-loops.
|
|
|
|
## Rejected ideas
|
|
|
|
In this section, I'll explain the alternative designs I currently do
|
|
not plan to implement and why I turned them down.
|
|
|
|
- Placing variable-length sections at end of an output file and start
|
|
copying file contents before fixing the output file layout
|
|
|
|
Idea: Fixing the layout of regular sections seems easy, and if we
|
|
place them at beginning of a file, we can start copying their
|
|
contents from their input files to an output file. While copying
|
|
file contents, we can compute the sizes of variable-length sections
|
|
such as .got or .plt and place them at end of the file.
|
|
|
|
Reason for rejection: I did not choose this design because I doubt
|
|
if it could actually shorten link time and I think I don't need it
|
|
anyway.
|
|
|
|
The linker has to de-duplicate comdat sections (i.e. inline
|
|
functions that are included into multiple object files), so we
|
|
cannot compute the layout of regular sections until we resolve all
|
|
symbols and de-duplicate comdats. That takes a few hundred
|
|
milliseconds. After that, we can compute the sizes of
|
|
variable-length sections in less than 100 milliseconds. It's quite
|
|
fast, so it doesn't seem to make much sense to proceed without
|
|
fixing the final file layout.
|
|
|
|
The other reason to reject this idea is because there's good a
|
|
chance for this idea to have a negative impact on linker's overall
|
|
performance. If we copy file contents before fixing the layout, we
|
|
can't apply relocations to them while copying because symbol
|
|
addresses are not available yet. If we fix the file layout first, we
|
|
can apply relocations while copying, which is effectively zero-cost
|
|
due to a very good data locality. On the other hand, if we apply
|
|
relocations long after we copy file contents, it's pretty expensive
|
|
because section contents are very likely to have been evicted from
|
|
CPU cache.
|
|
|
|
- Incremental linking
|
|
|
|
Idea: Incremental linking is a technique to patch a previous linker's
|
|
output file so that only functions or data that are updated from the
|
|
previous build are written to it. It is expected to significantly
|
|
reduce the amount of data copied from input files to an output file
|
|
and thus speed up linking. GNU BFD and gold linkers support it.
|
|
|
|
Reason for rejection: I turned it down because it (1) is
|
|
complicated, (2) doesn't seem to speed it up that much and (3) has
|
|
several practical issues. Let me explain each of them.
|
|
|
|
First, incremental linking for real C/C++ programs is not as easy as
|
|
one might think. Let me take malloc as an example. malloc is usually
|
|
defined by libc, but you can implement it in your program, and if
|
|
that's the case, the symbol `malloc` will be resolved to your
|
|
function instead of the one in libc. If you include a library that
|
|
defines malloc (such as libjemalloc or libtbbmallc) before libc,
|
|
their malloc will override libc's malloc.
|
|
|
|
Assume that you are using a nonstandard malloc. What if you remove
|
|
your malloc from your code, or remove `-ljemalloc` from your
|
|
Makefile? The linker has to include a malloc from libc, which may
|
|
include more object files to satisfy its dependencies. Such code
|
|
change can affect the entire program rather than just replacing one
|
|
function. The same is true to adding malloc to your program. Making
|
|
a local change doesn't necessarily result in a local change in the
|
|
binary level. It can easily have cascading effects.
|
|
|
|
Some ELF fancy features make incremental linking even harder to
|
|
implement. Take the weak symbol as an example. If you define `atoi`
|
|
as a weak symbol in your program, and if you are not using `atoi`
|
|
at all in your program, that symbol will be resolved to address
|
|
0. But if you start using some libc function that indirectly calls
|
|
`atoi`, then `atoi` will be included to your program, and your weak
|
|
symbol will be resolved to that function. I don't know how to
|
|
efficiently fix up a binary for this case.
|
|
|
|
This is a hard problem, so existing linkers don't try too hard to
|
|
solve it. For example, IIRC, gold falls back to full link if any
|
|
function is removed from a previous build. If you want to not annoy
|
|
users in the fallback case, you need to make full link fast anyway.
|
|
|
|
Second, incremental linking itself has an overhead. It has to detect
|
|
updated files, patch an existing output file and write additional
|
|
data to an output file for future incremental linking. GNU gold, for
|
|
instance, takes almost 30 seconds on my machine to do a null
|
|
incremental link (i.e. no object files are updated from a previous
|
|
build) for chrome. It's just too slow.
|
|
|
|
Third, there are other practical issues in incremental linking. It's
|
|
not reproducible, so your binary isn't going to be the same as other
|
|
binaries even if you are compiling the same source tree using the
|
|
same compiler toolchain. Or, it is complex and there might be a bug
|
|
in it. If something doesn't work correctly, "remove --incremental
|
|
from your Makefile and try again" could be a piece of advise, but
|
|
that isn't ideal.
|
|
|
|
So, all in all, incremental linking is tricky. I wanted to make full
|
|
link as fast as possible, so that we don't have to think about how
|
|
to workaround the slowness of full link.
|
|
|
|
- Defining a completely new file format and use it
|
|
|
|
Idea: Sometimes, the ELF file format itself seems to be a limiting
|
|
factor of improving linker's performance. We might be able to make a
|
|
far better one if we create a new file format.
|
|
|
|
Reason for rejection: I rejected the idea because it apparently has
|
|
a practical issue (backward compatibility issue) and also doesn't
|
|
seem to improve performance of linkers that much. As clearly
|
|
demonstrated by mold, we can create a fast linker for ELF. I believe
|
|
ELF isn't that bad, after all. The semantics of the existing Unix
|
|
linkers, such as the name resolution algorithm or the linker script,
|
|
have slowed the linkers down, but that's not a problem of the file
|
|
format itself.
|
|
|
|
- Watching object files using inotify(2)
|
|
|
|
Idea: When mold is running as a daemon for preloading, use
|
|
inotify(2) to watch file system updates so that it can reload files
|
|
as soon as they are updated.
|
|
|
|
Reason for rejection: Just like the maximum number of files you can
|
|
simultaneously open, the maximum number of files you can watch using
|
|
inotify(2) isn't that large. Maybe just a single instance of mold is
|
|
fine with inotify(2), but it may fail if you run multiple of it.
|
|
|
|
The other reason for not doing it is because mold is quite fast
|
|
without it anyway. Invoking stat(2) on each file for file update
|
|
check takes less than 100 milliseconds for Chrome, and if most of
|
|
the input files are not updated, parsing updated files takes almost
|
|
no time.
|