mirror of https://github.com/rui314/mold.git synced 2024-12-27 10:23:41 +03:00
# mold: A Modern Linker

![mold image](mold.jpg)

This is a repository of a linker I'm currently developing as an
independent project for my master's degree.

My goal is to make a linker that is as fast as concatenating input
object files with the `cat` command. It may sound like an impossible
goal, but I believe it's not entirely impossible for the following
two reasons:

1. `cat` is a simple single-threaded program, and it isn't the fastest
way to copy files. My linker can use multiple threads to copy file
contents more efficiently, which saves time that can be spent on
extra work.
2. Copying file contents is I/O-bound, so many CPU cores should be
idle during the copy. We can use them to do extra work while
copying file contents.
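
As a rough illustration of the two points above, the sketch below concatenates files with a thread pool. It is written in Python for brevity and is not mold's implementation; `parallel_concat` and its parameters are hypothetical names. The output offset of every input is computed up front, so each copy is independent and writes at its own offset with `os.pwrite`.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_concat(inputs, output, workers=8):
    """Concatenate the files in `inputs` into `output` using worker threads.

    Returns the output offset assigned to each input.
    """
    # Fix the layout first: every input gets its output offset up front.
    sizes = [os.path.getsize(path) for path in inputs]
    offsets = []
    total = 0
    for size in sizes:
        offsets.append(total)
        total += size

    out_fd = os.open(output, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.ftruncate(out_fd, total)

        def copy_one(job):
            path, pos = job
            with open(path, "rb") as src:
                while chunk := src.read(1 << 20):
                    view = memoryview(chunk)
                    while len(view) > 0:
                        # pwrite takes an explicit offset, so threads never
                        # contend over a shared file position.
                        written = os.pwrite(out_fd, view, pos)
                        view = view[written:]
                        pos += written

        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() waits for every copy and re-raises worker exceptions.
            list(pool.map(copy_one, zip(inputs, offsets)))
    finally:
        os.close(out_fd)
    return offsets
```

Calling `parallel_concat(["a.o", "b.o"], "out.bin")` (hypothetical file names) returns the offset at which each input was placed in the output.
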

Concretely speaking, I want to use the linker to link a Chromium
executable (~1.8 GiB in size) in just 1 second. LLVM's lld, the
fastest open-source linker, which I originally created a few years ago,
takes about 12 seconds to link Chromium on my machine. So the goal is
a 12x performance bump over lld. Compared to GNU gold, it's more than
50x.

I don't know if I can ever achieve that, but it's worth a try. I need
to create something anyway to earn units to graduate, and I want to
(at least try to) create something useful.

I have quite a few new ideas as to how to achieve that speedup, though
they are still just random unproven thoughts that need to be
implemented and tested with benchmarks. Here is a brain dump:
## Background
- Even though lld has significantly improved the situation, linking is
still one of the slowest steps in a build. It is especially
annoying when I change one line of code and have to wait a few
seconds or even more for the linker to complete. It should be
instantaneous. There's a need for a faster linker.
- The number of cores on a PC has increased a lot lately, and this
trend is expected to continue. However, the existing linkers can't
take advantage of that because they don't scale well to more
cores. I have a 64-core/128-thread machine, so my goal is to create
a linker that uses the CPU nicely. mold should be much faster than
other linkers on 4 or 8-core machines too, though.
- It looks to me that the designs of the existing linkers are somewhat
similar, and I believe there are a lot of drastically different
designs that haven't been explored yet. Developers generally don't
care about linkers as long as they work correctly, and they don't
even think about creating a new one. So there may be a lot of
low-hanging fruit in this area.
## Basic design
- In order to achieve `cat`-like performance, the most important
thing is to fix the layout of the output file as quickly as possible, so
that we can start copying actual data from input object files to the
output file as soon as possible.
- Copying data from input files to an output file is I/O-bound, so
there should be room for doing computationally intensive tasks while
copying data from one file to another.
- After the first invocation of the linker, the linker should not exit
but instead become a daemon to keep parsed input files in memory.
Subsequent linker invocations for the same output file make the
linker daemon reload updated input files, and then the daemon
calls fork(2) to create a subprocess that does the actual linking.
- Daemonizing alone wouldn't make the linker magically faster. We need
to split the linker into two in such a way that the latter half of
the process finishes as quickly as possible by speculatively parsing
and preprocessing input files in the first half of the process. The
key to success would be to design nice data structures that
allow us to offload as much processing as possible from the second
half to the first.
- One of the most time-consuming linker stages is symbol
resolution. To resolve symbols, we basically have to throw all
symbol strings into a hash table to match undefined symbols with
defined symbols. But this can be done in the daemon using [string
interning](https://en.wikipedia.org/wiki/String_interning).
- Object files may contain a special section called a mergeable string
section. The section contains lots of null-terminated strings, and
the linker is expected to gather all mergeable string sections and
merge their contents. So, if two object files contain the same
string literal, for example, the resulting output will contain a
single merged string. This step is time-consuming, but string
merging can be done in the daemon using string interning.
- Static archives (.a files) contain object files, but a static
archive's symbol table contains only the defined symbols of its member
object files and lacks other types of symbols. That makes static
archives unsuitable for speculative parsing. The daemon should
ignore the symbol table of a static archive and directly read all
member object files of all archives to get the whole picture of
all possible input files.
- If there's a relocation that uses the GOT entry of a symbol, then we have to
create a GOT entry for that symbol. Otherwise, we shouldn't. That
means we need to scan all relocation tables to fix the length and
the contents of the .got section. This is perhaps time-consuming, but
we can do that while copying data from input files to the output
file. After the data copy is done, we can attach the .got section at
the end of the output file.
- Many linkers support incremental linking, but I think that's a hack
to work around the slowness of regular linking. I want to focus on
making regular linking extremely fast.
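
To make the string-interning idea above more concrete, here is a minimal Python sketch, not mold's code. `Interner`, `preprocess`, and `resolve` are hypothetical names, and object files are modeled as plain dictionaries. The point is that the expensive string hashing happens once in a daemon-side step, so the link-time step only compares small integers.

```python
class Interner:
    """Maps each distinct string to a small integer ID, assigned on first sight."""

    def __init__(self):
        self._ids = {}
        self.names = []

    def intern(self, name):
        ident = self._ids.get(name)
        if ident is None:
            ident = len(self.names)
            self._ids[name] = ident
            self.names.append(name)
        return ident


def preprocess(objects, interner):
    """Daemon-side step: replace symbol names with interned IDs ahead of time."""
    return [
        {
            "file": obj["file"],
            "defined": [interner.intern(s) for s in obj["defined"]],
            "undefined": [interner.intern(s) for s in obj["undefined"]],
        }
        for obj in objects
    ]


def resolve(preprocessed, num_symbols):
    """Link-time step: match undefined symbols to definitions using IDs only."""
    definer = [None] * num_symbols
    for obj in preprocessed:
        for ident in obj["defined"]:
            if definer[ident] is None:  # first definition wins in this sketch
                definer[ident] = obj["file"]
    unresolved = sorted(
        {i for obj in preprocessed for i in obj["undefined"] if definer[i] is None}
    )
    return definer, unresolved
```

The same table could deduplicate mergeable strings: two identical string literals intern to the same ID, so only one copy needs to be emitted.
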
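
The GOT point above can also be sketched in a few lines. This is an illustrative Python sketch under stated assumptions: relocations are simplified to `(symbol, uses_got)` pairs, `plan_got` is a hypothetical name, and entries are assumed to be 8 bytes each, as on a 64-bit target. The size of the section is only known after every relocation has been scanned, which is why the scan has to finish before the section can be attached.

```python
GOT_ENTRY_SIZE = 8  # assumption: one 8-byte address per entry on a 64-bit target


def plan_got(relocations):
    """Return ({symbol: offset within .got}, total .got size in bytes).

    `relocations` is an iterable of (symbol_name, uses_got) pairs, a
    simplified stand-in for real relocation records.
    """
    offsets = {}
    for symbol, uses_got in relocations:
        # A symbol gets exactly one entry, no matter how many relocations use it.
        if uses_got and symbol not in offsets:
            offsets[symbol] = len(offsets) * GOT_ENTRY_SIZE
    return offsets, len(offsets) * GOT_ENTRY_SIZE
```

Because the scan only reads relocation tables and does not touch the bytes being copied, it is a natural candidate for running on otherwise idle cores during the copy.
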
## Compatibility
- GNU ld, GNU gold and LLVM lld support essentially the same set of
command line options and features. mold doesn't have to be
completely compatible with them. As long as it can be used for
linking large user-land programs, I'm fine with that. It is OK to
leave some command line options unimplemented; if mold is blazingly
fast, other projects would still be happy to adopt it by modifying
their projects' build files.
- I don't want to support the linker script language in mold because
it's so complicated and inevitably slows down the linker. User-land
programs rarely use linker scripts, so it shouldn't be a roadblock
for most projects.
- mold emits Linux executables and runs only on Linux. I won't avoid
Unix-isms when writing code (e.g. I'll probably use fork(2)).
I don't want to think about portability until mold becomes something
that's worth porting.
## Details
- If we aim for the 1-second goal for Chromium, every millisecond
counts. We can't ignore the latency of process exit. If we mmap a
lot of files, \_exit(2) is not instantaneous but takes a few hundred
milliseconds because the kernel has to clean up a lot of
resources. As a workaround, we should organize the linker command as
two processes; the first process forks the second process, and the
second process does the actual work. As soon as the second process
writes the result file to the filesystem, it notifies the first process,
and the first process exits. The second process can take its time to
exit, because it is not an interactive process.
- The output from the linker should be deterministic for the sake of
[build reproducibility](https://en.wikipedia.org/wiki/Reproducible_builds)
and ease of debugging. This might add a little bit of overhead to
the linker, but that shouldn't be too much.
- A .build-id, a unique ID embedded in an output file, is usually
computed by applying a cryptographic hash function (e.g. SHA-1) to
the output file. But that adds extra time to linking because the
linker has to compute a SHA-1 checksum after the actual linking is
done. We should instead compute a SHA-1 for the tuple of (all input
files, command line options, linker version) as the build-id, as it
should uniquely identify the output.
- [Intel Threading Building
Blocks](https://github.com/oneapi-src/oneTBB) (TBB) is a good
library for parallel execution and has several concurrent
containers. We are particularly interested in using
`parallel_for_each` and `concurrent_hash_map`.
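
The two-process arrangement described in the first item of this list could look roughly like the Python sketch below. It is an illustration only: `run_with_fast_exit`, `do_link`, and `teardown_seconds` are hypothetical names, and using a pipe for the notification is an assumption, since the text does not say which mechanism to use. The front process returns as soon as it reads a one-byte status, without waiting for the worker to finish exiting.

```python
import os
import time


def run_with_fast_exit(do_link, teardown_seconds=0.0):
    """Fork a worker; return in the front process once the worker reports back.

    Returns True if `do_link` completed without raising, False otherwise.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Worker: do the real work, report the result, then exit at its own pace.
        os.close(read_fd)
        status = 1
        try:
            do_link()
            status = 0
        finally:
            os.write(write_fd, bytes([status]))
            os.close(write_fd)
            # Stand-in for the slow teardown of a process with many mappings.
            time.sleep(teardown_seconds)
            os._exit(status)
    # Front process: wait only for the notification, not for the worker's exit.
    os.close(write_fd)
    reply = os.read(read_fd, 1)
    os.close(read_fd)
    return reply == b"\x00"
```

In a real command-line tool the front process would exit right after this call returns, and the worker would be reparented and cleaned up in the background.
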
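
The input-based build-id from the list above can be sketched as follows. This is a minimal Python illustration, not a definitive scheme: `input_based_build_id` is a hypothetical name, and the way the fields are combined (length prefixes and per-file digests) is an assumption made here so that different inputs cannot collapse into the same byte stream.

```python
import hashlib


def _field(data: bytes) -> bytes:
    # Length-prefix each field so that ("ab", "c") and ("a", "bc") hash differently.
    return len(data).to_bytes(8, "little") + data


def input_based_build_id(input_blobs, options, linker_version):
    """Return a hex ID derived from the link inputs rather than from the output."""
    h = hashlib.sha1()
    h.update(_field(linker_version.encode()))
    h.update(len(options).to_bytes(8, "little"))
    for opt in options:
        h.update(_field(opt.encode()))
    for blob in input_blobs:
        # Per-file digests are independent of each other, so they could be
        # computed in parallel or ahead of time.
        h.update(hashlib.sha1(blob).digest())
    return h.hexdigest()
```

The ID depends on the order of the inputs as well as their contents, which matches the fact that input order can change a linker's output.
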
## Size of the problem
When linking Chrome, a linker reads 3,430,966,844 bytes of data in
total. The data contains the following items:

| Data item                              |     Number |
| -------------------------------------- | ---------: |
| Object files                           |     30,723 |
| Public undefined symbols               |  1,428,149 |
| Mergeable strings                      |  1,579,996 |
| Comdat groups                          |  9,914,510 |
| Regular sections (*1)                  | 10,345,314 |
| Public defined symbols                 | 10,512,135 |
| Symbols                                | 23,953,607 |
| Sections                               | 27,543,225 |
| Relocations against SHF_ALLOC sections | 39,496,375 |
| Relocations                            | 62,024,719 |

(*1) Sections that have to be copied from input object files to an
output file. Sections that contain relocations or symbols, for
example, are excluded.
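
To get a feel for what these numbers mean for the 1-second goal, the short Python sketch below turns them into per-item time budgets. The helper names are hypothetical, and the calculation is a back-of-the-envelope estimate that assumes all cores stay perfectly busy, which a real linker will not achieve.

```python
TOTAL_INPUT_BYTES = 3_430_966_844
COUNTS = {
    "object files": 30_723,
    "symbols": 23_953_607,
    "sections": 27_543_225,
    "relocations": 62_024_719,
}
TARGET_SECONDS = 1.0


def budget_ns(count, seconds=TARGET_SECONDS, cores=1):
    """Nanoseconds of core time available per item if every core stays busy."""
    return seconds * 1e9 * cores / count


def required_gib_per_second(total_bytes=TOTAL_INPUT_BYTES, seconds=TARGET_SECONDS):
    """Input read throughput needed to consume all the data within the target."""
    return total_bytes / seconds / 2**30


if __name__ == "__main__":
    print(f"read throughput: {required_gib_per_second():.2f} GiB/s")
    for name, count in COUNTS.items():
        print(f"{name}: {budget_ns(count):.1f} ns each on 1 core, "
              f"{budget_ns(count, cores=64):.1f} ns each on 64 cores")
```

With these inputs, the input data alone has to be consumed at roughly 3.2 GiB/s, and a single core would have only about 16 ns per relocation, which is why spreading the work over many cores matters.
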