2020-10-21 08:55:50 +03:00
|
|
|
# mold: A Modern Linker
|
2020-10-21 06:52:11 +03:00
|
|
|
|
2020-10-21 08:55:50 +03:00
|
|
|
![mold image](mold.jpg)
|
2020-10-21 06:52:11 +03:00
|
|
|
|
2020-10-21 08:55:50 +03:00
|
|
|
This is a repository of a linker I'm currently developing as an
|
|
|
|
independent project for my Masters degree.
|
|
|
|
|
2020-11-02 14:57:08 +03:00
|
|
|
My goal is to make a linker that is as fast as concatenating input
|
|
|
|
object files with `cat` command. It may sound like an impossible goal,
|
|
|
|
but I believe it's not entirely impossible because of the following
|
|
|
|
two reasons:
|
|
|
|
|
|
|
|
1. `cat` is a simple single-threaded program which isn't the fastest
|
|
|
|
one as a file copy command. My linker can use multiple threads to
|
|
|
|
copy file contents more efficiently to save time to do extra work.
|
|
|
|
|
|
|
|
2. Copying file contents is I/O-bounded, and many CPU cores should be
|
|
|
|
available during file copy. We can use them to do extra work while
|
|
|
|
copying file contents.
|
|
|
|
|
|
|
|
Concretely speaking, I want to use the linker to link a Chromium
|
|
|
|
executable (~1.8 GiB in size) just in 1 second. LLVM's lld, the
|
|
|
|
fastest open-source linker which I originally created a few years ago,
|
|
|
|
takes about 12 seconds to link Chromium on my machine. So the goal is
|
|
|
|
12x performance bump over lld. Compared to GNU gold, it's more than
|
|
|
|
50x.
|
|
|
|
|
|
|
|
I don't know if I can ever achieve that, but it's worth a try. I need
|
|
|
|
to create something anyway to earn units to graduate, and I want to
|
|
|
|
(at least try to) create something useful.
|
2020-10-21 08:55:50 +03:00
|
|
|
|
|
|
|
I have quite a few new ideas as to how to achieve that speedup, though
|
|
|
|
they are still just random unproved thoughts which need to be
|
|
|
|
implemented and tested with benchmarks. Here is a brain dump:
|
|
|
|
|
2020-10-22 07:52:34 +03:00
|
|
|
## Background
|
|
|
|
|
|
|
|
- Even though lld has significantly improved the situation, linking is
|
2020-10-26 16:22:06 +03:00
|
|
|
still one of the slowest steps in a build. It is especially
|
2020-10-22 07:52:34 +03:00
|
|
|
annoying when I changed one line of code and had to wait for a few
|
|
|
|
seconds or even more for a linker to complete. It should be
|
|
|
|
instantaneous. There's a need for a faster linker.
|
|
|
|
|
|
|
|
- The number of cores on a PC has increased a lot lately, and this
|
|
|
|
trend is expected to continue. However, the existing linkers can't
|
2020-10-26 16:26:44 +03:00
|
|
|
take the advantage of that because they don't scale well for more
|
2020-10-22 07:52:34 +03:00
|
|
|
cores. I have a 64-core/128-thread machine, so my goal is to create
|
|
|
|
a linker that uses the CPU nicely. mold should be much faster than
|
|
|
|
other linkers on 4 or 8-core machines too, though.
|
|
|
|
|
|
|
|
- It looks to me that the designs of the existing linkers are somewhat
|
|
|
|
similar, and I believe there are a lot of drastically different
|
|
|
|
designs that haven't been explored yet. Develoeprs generally don't
|
|
|
|
care about linkers as long as they work correctly, and they don't
|
|
|
|
even think about creating a new one. So there may be lots of low
|
|
|
|
hanging fruits there in this area.
|
|
|
|
|
|
|
|
## Basic design
|
|
|
|
|
2020-10-21 08:55:50 +03:00
|
|
|
- In order to achieve a `cat`-like performance, the most important
|
2020-10-21 10:51:40 +03:00
|
|
|
thing is to fix the layout of an output file as quickly as possible, so
|
2020-10-21 08:55:50 +03:00
|
|
|
that we can start copying actual data from input object files to an
|
2020-10-26 16:23:44 +03:00
|
|
|
output file as soon as possible.
|
2020-10-21 08:55:50 +03:00
|
|
|
|
|
|
|
- Copying data from input files to an output file is I/O-bounded, so
|
|
|
|
there should be room for doing computationally-intensive tasks while
|
|
|
|
copying data from one file to another.
|
|
|
|
|
|
|
|
- After the first invocation of the linker, the linker should not exit
|
|
|
|
but instead become a daemon to keep parsed input files in memory.
|
2020-10-26 18:27:55 +03:00
|
|
|
Subsequent linker invocations for the same output file make the
|
|
|
|
linker daemon to reload updated input files, and then the daemon
|
|
|
|
calls fork(2) to create a subprocess and let it do the actual linking.
|
2020-10-21 08:55:50 +03:00
|
|
|
|
|
|
|
- Daemonizing alone wouldn't make the linker magically faster. We need
|
|
|
|
to split the linker into two in such a way that the latter half of
|
|
|
|
the process finishes as quickly as possible by speculatively parsing
|
2020-10-21 10:51:40 +03:00
|
|
|
and preprocessing input files in the first half of the process. The
|
2020-10-22 07:52:34 +03:00
|
|
|
key factor of success would be to design nice data structures that
|
|
|
|
allows us to offload as much processing as possible from the second
|
|
|
|
to the first half.
|
2020-10-21 08:55:50 +03:00
|
|
|
|
|
|
|
- One of the most time-consuming stage among linker stages is symbol
|
|
|
|
resolution. To resolve symbols, we basically have to throw all
|
|
|
|
symbol strings into a hash table to match undefined symbols with
|
2020-10-26 16:24:51 +03:00
|
|
|
defined symbols. But this can be done in the daemon using [string
|
2020-10-21 08:55:50 +03:00
|
|
|
interning](https://en.wikipedia.org/wiki/String_interning).
|
|
|
|
|
|
|
|
- Object files may contain a special section called a mergeable string
|
|
|
|
section. The section contains lots of null-terminated strings, and
|
|
|
|
the linker is expected to gather all mergeable string sections and
|
|
|
|
merge their contents. So, if two object files contain the same
|
|
|
|
string literal, for example, the resulting output will contain a
|
|
|
|
single merged string. This step is time-consuming, but string
|
|
|
|
merging can be done in the daemon using string interning.
|
|
|
|
|
|
|
|
- Static archives (.a files) contain object files, but the static
|
|
|
|
archive's string table contains only defined symbols of member
|
|
|
|
object files and lacks other types of symbols. That makes static
|
|
|
|
archives unsuitable for speculative parsing. The daemon should
|
|
|
|
ignore the string table of static archive and directly read all
|
2020-10-21 10:51:40 +03:00
|
|
|
member object files of all archives to get the whole picture of
|
|
|
|
all possible input files.
|
2020-10-21 08:55:50 +03:00
|
|
|
|
|
|
|
- If there's a relocation that uses a GOT of a symbol, then we have to
|
|
|
|
create a GOT entry for that symbol. Otherwise, we shouldn't. That
|
|
|
|
means we need to scan all relocation tables to fix the length and
|
|
|
|
the contents of a .got section. This is perhaps time-consuming, but
|
|
|
|
we can do that while copying data from input files to an output
|
|
|
|
file. After the data copy is done, we can attach a .got section at
|
|
|
|
the end of the output file.
|
|
|
|
|
2020-10-22 07:52:34 +03:00
|
|
|
- Many linkers support incremental linking, but I think that's a hack
|
|
|
|
to work around the slowness of regular linking. I want to focus on
|
|
|
|
making the regular linking extremely fast.
|
|
|
|
|
|
|
|
## Compatibility
|
2020-10-21 08:55:50 +03:00
|
|
|
|
|
|
|
- GNU ld, GNU gold and LLVM lld support essentially the same set of
|
|
|
|
command line options and features. mold doesn't have to be
|
|
|
|
completely compatible with them. As long as it can be used for
|
|
|
|
linking large user-land programs, I'm fine with that. It is OK to
|
|
|
|
leave some command line options unimplemented; if mold is blazingly
|
2020-10-22 07:52:34 +03:00
|
|
|
fast, other projects would still be happy to adopt it by modifying
|
2020-10-21 08:55:50 +03:00
|
|
|
their projects' build files.
|
|
|
|
|
|
|
|
- I don't want to support the linker script language in mold because
|
|
|
|
it's so complicated and inevitably slows down the linker. User-land
|
|
|
|
programs rarely use linker scripts, so it shouldn't be a roadblock
|
|
|
|
for most projects.
|
|
|
|
|
2020-10-22 14:25:35 +03:00
|
|
|
- mold emits Linux executables and runs only on Linux. I won't avoid
|
|
|
|
Unix-ism when writing code (e.g. I'll probably use fork(2)).
|
|
|
|
I don't want to think about portability until mold becomes a thing
|
|
|
|
that's worth to be ported.
|
2020-10-22 07:52:34 +03:00
|
|
|
|
|
|
|
## Details
|
|
|
|
|
2020-11-02 15:05:36 +03:00
|
|
|
- If we aim to the 1 second goal for Chromium, every millisecond
|
2020-10-22 07:52:34 +03:00
|
|
|
counts. We can't ignore the latency of process exit. If we mmap a
|
|
|
|
lot of files, \_exit(2) is not instantaneous but takes a few hundred
|
|
|
|
milliseconds because the kernel has to clean up a lot of
|
|
|
|
resources. As a workaround, we should organize the linker command as
|
|
|
|
two processes; the first process forks the second process, and the
|
|
|
|
second process does the actual work. As soon as the second process
|
|
|
|
writes a result file to a filesystem, it notifies the first process,
|
|
|
|
and the first process exits. The second process can take time to
|
|
|
|
exit, because it is not an interactive process.
|
|
|
|
|
2020-11-02 15:06:44 +03:00
|
|
|
- The output from the linker should be deterministic for the sake of
|
|
|
|
[build reproducibility](https://en.wikipedia.org/wiki/Reproducible_builds)
|
|
|
|
and ease of debugging. This might add a little bit of overhead to
|
|
|
|
the linker, but that shouldn't be too much.
|
|
|
|
|
2020-10-24 16:05:40 +03:00
|
|
|
- A .build-id, a unique ID embedded to an output file, is usually
|
|
|
|
computed by applying a cryptographic hash function (e.g. SHA-1) to
|
|
|
|
an output file. But it adds an extra time for linking because a
|
|
|
|
linker has to compute a SHA-1 checksum after the actual linking is
|
2020-10-26 16:26:44 +03:00
|
|
|
done. We should instead compute a SHA-1 for the tuple of (all input
|
2020-11-02 15:06:44 +03:00
|
|
|
files, command line options, linker version) as a build-id, as it
|
|
|
|
should uniquely identify the output.
|
2020-10-24 16:05:40 +03:00
|
|
|
|
2020-10-21 08:55:50 +03:00
|
|
|
- [Intel Threading Building
|
|
|
|
Blocks](https://github.com/oneapi-src/oneTBB) (TBB) is a good
|
|
|
|
library for parallel execution and has several concurrent
|
|
|
|
containers. We are particularly interested in using
|
|
|
|
`parallel_for_each` and `concurrent_hash_map`.
|
|
|
|
|
2020-10-22 19:14:11 +03:00
|
|
|
## Size of the problem
|
|
|
|
|
|
|
|
When linking Chrome, a linker reads 3,430,966,844 bytes of data in
|
|
|
|
total. The data contains the following items:
|
|
|
|
|
2020-10-23 07:17:21 +03:00
|
|
|
| Data item | Number
|
|
|
|
| ------------------------ | ------
|
|
|
|
| Object files | 30,723
|
2020-10-26 16:26:44 +03:00
|
|
|
| Public undefined symbols | 1,428,149
|
|
|
|
| Mergeable strings | 1,579,996
|
|
|
|
| Comdat groups | 9,914,510
|
|
|
|
| Regular sections (*1) | 10,345,314
|
|
|
|
| Public defined symbols | 10,512,135
|
|
|
|
| Symbols | 23,953,607
|
|
|
|
| Sections | 27,543,225
|
|
|
|
| Relocations against SHF_ALLOC sections | 39,496,375
|
|
|
|
| Relocations | 62,024,719
|
2020-10-23 07:23:12 +03:00
|
|
|
|
|
|
|
(*1) Sections that have to be copied from input object files to an
|
|
|
|
output file. Sections that contain relocations or symbols are for
|
2020-10-24 09:59:38 +03:00
|
|
|
example excluded.
|