mold/README.md

# mold: A Modern Linker

![mold image](mold.jpg)

This is a repository of a linker I'm currently developing as a
replacement for existing Unix linkers such as GNU BFD, GNU gold or
LLVM lld.

My goal was to make a linker that is as fast as concatenating input
object files with `cat` command. It may sound like an impossible goal,
but it's not entirely impossible because of the following two reasons:

1. `cat` is a simple single-threaded program which isn't the fastest
   one as a file copy command. My linker can use multiple threads to
   copy file contents more efficiently to save time to do extra work.

2. Copying file contents is I/O-bounded, and many CPU cores should be
   available during file copy. We can use them to do extra work while
   copying file contents.

Concretely speaking, I wanted to use the linker to link a Chromium
executable with full debug info (~2 GiB in size) just in 1 second.
LLVM's lld, the fastest open-source linker which I originally created
a few years ago, takes about 12 seconds to link Chromium on my machine.
So the goal is 12x performance bump over lld. Compared to GNU gold,
it's more than 50x.

It looks like mold has achieved the goal. It can link Chromium in 2
seconds with 8-cores/16-threads, and if I enable the preloading
feature (I'll explain it later), the latency of the linker for an
interactive use is less than 900 milliseconds. It is actualy faster
than `cat`.

Note that even though mold can create a runnable Chrome executable,
it is far from complete and not usable for production. mold is still
just a toy linker, and this is still just my pet project.

## Background

- Even though lld has significantly improved the situation, linking is
  still one of the slowest steps in a build. It is especially
  annoying when I changed one line of code and had to wait for a few
  seconds or even more for a linker to complete. It should be
  instantaneous. There's a need for a faster linker.

- The number of cores on a PC has increased a lot lately, and this
  trend is expected to continue. However, the existing linkers can't
  take the advantage of that because they don't scale well for more
  cores. I have a 64-core/128-thread machine, so my goal is to create
  a linker that uses the CPU nicely. mold should be much faster than
  other linkers on 4 or 8-core machines too, though.

- It looks to me that the designs of the existing linkers are somewhat
  similar, and I believe there are a lot of drastically different
  designs that haven't been explored yet. Develoeprs generally don't
  care about linkers as long as they work correctly, and they don't
  even think about creating a new one. So there may be lots of low
  hanging fruits there in this area.

## Basic design

- In order to achieve a `cat`-like performance, the most important
  thing is to fix the layout of an output file as quickly as possible, so
  that we can start copying actual data from input object files to an
  output file as soon as possible.

- Copying data from input files to an output file is I/O-bounded, so
  there should be room for doing computationally-intensive tasks while
  copying data from one file to another.

- We should allow the linker to preload object files from disk and
  parse them in memory before a complete set of input object files
  is ready. My idea is this: if a user invokes the linker with
  `--preload` flag along with other command line flags a few seconds
  before the actual linker invocation, then the following actual
  linker invocation with the same command line options (except
  `--preload` flag) becomes magically faster. Behind the scenes, the
  linker starts preloading object files on the first invocation and
  becomes a daemon. The second invocation of the linker notifies the
  daemon to reload updated object files and then proceed.

- Daemonizing alone wouldn't make the linker magically faster. We need
  to split the linker into two in such a way that the latter half of
  the process finishes as quickly as possible by speculatively parsing
  and preprocessing input files in the first half of the process. The
  key factor of success would be to design nice data structures that
  allows us to offload as much processing as possible from the second
  to the first half.

- One of the most time-consuming stage among linker stages is symbol
  resolution. To resolve symbols, we basically have to throw all
  symbol strings into a hash table to match undefined symbols with
  defined symbols. But this can be done in the daemon using [string
  interning](https://en.wikipedia.org/wiki/String_interning).

- Object files may contain a special section called a mergeable string
  section. The section contains lots of null-terminated strings, and
  the linker is expected to gather all mergeable string sections and
  merge their contents. So, if two object files contain the same
  string literal, for example, the resulting output will contain a
  single merged string. This step is time-consuming, but string
  merging can be done in the daemon using string interning.

- Static archives (.a files) contain object files, but the static
  archive's string table contains only defined symbols of member
  object files and lacks other types of symbols. That makes static
  archives unsuitable for speculative parsing. The daemon should
  ignore the string table of static archive and directly read all
  member object files of all archives to get the whole picture of
  all possible input files.

- If there's a relocation that uses a GOT of a symbol, then we have to
  create a GOT entry for that symbol. Otherwise, we shouldn't. That
  means we need to scan all relocation tables to fix the length and
  the contents of a .got section. This is perhaps time-consuming, but
  this step is parallelizable.

## Compatibility

- GNU ld, GNU gold and LLVM lld support essentially the same set of
  command line options and features. mold doesn't have to be
  completely compatible with them. As long as it can be used for
  linking large user-land programs, I'm fine with that. It is OK to
  leave some command line options unimplemented; if mold is blazingly
  fast, other projects would still be happy to adopt it by modifying
  their projects' build files.

- mold emits Linux executables and runs only on Linux. I won't avoid
  Unix-ism when writing code (e.g. I'll probably use fork(2)).
  I don't want to think about portability until mold becomes a thing
  that's worth to be ported.

## Linker Script

Linker script is an embedded language for the linker. It is mainly
used to control how input sections are mapped to output sections and
the layout of the output, but it can also do a lot of tricky stuff.
Its feature is useful especially for embedded programming, but it's
also an awfully underdocumented and complex language.

We have to implement a subset of the linker script language anwyay,
because on Linux, /usr/lib/x86_64-linux-gnu/libc.so is (despite its
name) not a shared object file but actually an ASCII file containing
linker script code to load the _actual_ libc.so file. But the feature
set for this purpose is very limited, and it is okay to implement them
to mold.

Besides that, we really don't want to implement the linker script
langauge. But at the same time, we want to satisfy the user needs that
are currently satisfied with the linker script langauge. So, what
should we do? Here is my observation:

- Linker script allows to do a lot of tricky stuff, such as specifying
  the exact layout of a file, inserting arbitrary bytes between
  sections, etc. But most of them can be done with a post-link binary
  editing tool (such as `objcopy`).

- It looks like there are two things that truely cannot be done by a
  post-link editing tool: (a) mapping input sections to output
  sections, and (b) applying relocations.

From the above observation, I believe we need to provide only the
following features instead of the entire linker script langauge:

- A method to specify how input sections are mapped to output
  sections, and

- a method to set addresses to output sections, so that relocations
  are applied based on desired adddresses.

I believe everything else can be done with a post-link binary editing
tool.

## Details

- If we aim to the 1 second goal for Chromium, every millisecond
  counts. We can't ignore the latency of process exit. If we mmap a
  lot of files, \_exit(2) is not instantaneous but takes a few hundred
  milliseconds because the kernel has to clean up a lot of
  resources. As a workaround, we should organize the linker command as
  two processes; the first process forks the second process, and the
  second process does the actual work. As soon as the second process
  writes a result file to a filesystem, it notifies the first process,
  and the first process exits. The second process can take time to
  exit, because it is not an interactive process.

- At least on Linux, it looks like the filesystem's performance to
  allocate new blocks to a new file is the limiting factor when
  creating a new large file and filling its contents using mmap.
  If you already have a large file on a filesystem, writing to it is
  much faster than creating a new fresh file and writing to it.
  Based on this observation, mold should overwrite to an existing
  executable file if exists. My quick benchmark showed that I could
  save 300 milliseconds when creating a 2 GiB output file.
  Linux doesn't allow to open an executable for writing if it is
  running (you'll get "text busy" error if you attempt). mold should
  fall back to the usual way if it fails to open an output file.

- As an implementation strategy, we do _not_ care about memory leak
  because we really can't save that much memory by doing precise
  memory management. It is because most objects that are allocated
  during an execution of mold are needed until the very end of the
  program. I'm sure this is an odd memory management scheme (or the
  lack thereof), but this is what LLVM lld does too.

- The output from the linker should be deterministic for the sake of
  [build reproducibility](https://en.wikipedia.org/wiki/Reproducible_builds)
  and ease of debugging. This might add a little bit of overhead to
  the linker, but that shouldn't be too much.

- A .build-id, a unique ID embedded to an output file, is usually
  computed by applying a cryptographic hash function (e.g. SHA-1) to
  an output file. This is a slow step, but we can speed it up by
  splitting a file into small chunks, computing SHA-1 for each chunk,
  and then computing SHA-1 of the concatenated SHA-1 hashes
  (i.e. constructing a [Markle
  Tree](https://en.wikipedia.org/wiki/Merkle_tree) of height 2).
  Modern x86 processors have purpose-built instructions for SHA-1 and
  can compute SHA-1 pretty quickly at about 2 GiB/s rate. Using 16
  cores, a build-id for a 2 GiB executable can be computed in 60 to 70
  milliseconds.

- BFD, gold, and lld support section garbage collection. That is, a
  linker runs a mark-sweep garbage collection on an input graph, where
  sections are vertices and relocations are edges, to discard all
  sections that are not reachable from the entry point symbol
  (i.e. `_start`) or a few other root sections. In mold, we are using
  multiple threads to mark sections concurrently.

- Similarly, BFD, gold an lld support Identical Comdat Folding (ICF)
  as a yet another size optimization. ICF merges two or more read-only
  sections that happen to have the same contents and relocations.
  To do that, we have to find isomorphic subgraphs from larger graphs.
  I implemented a new algorithm for mold, which is 5x faster than lld
  to do ICF for Chromium (from 5 seconds to 1 second).

- [Intel Threading Building
  Blocks](https://github.com/oneapi-src/oneTBB) (TBB) is a good
  library for parallel execution and has several concurrent
  containers. We are particularly interested in using
  `parallel_for_each` and `concurrent_hash_map`.

- TBB provides `tbbmalloc` which works better for multi-threaded
  applications than the glib'c malloc, but it looks like
  [jemalloc](https://github.com/jemalloc/jemalloc) and
  [mimalloc](https://github.com/microsoft/mimalloc) are a little bit
  more scalable than `tbbmalloc`.

## Size of the problem

When linking Chrome, a linker reads 3,430,966,844 bytes of data in
total. The data contains the following items:

| Data item                | Number
| ------------------------ | ------
| Object files             | 30,723
| Public undefined symbols | 1,428,149
| Mergeable strings        | 1,579,996
| Comdat groups            | 9,914,510
| Regular sections¹        | 10,345,314
| Public defined symbols   | 10,512,135
| Symbols                  | 23,953,607
| Sections                 | 27,543,225
| Relocations against SHF_ALLOC sections | 39,496,375
| Relocations              | 62,024,719

¹ Sections that have to be copied from input object files to an
output file. Sections that contain relocations or symbols are for
example excluded.

## Internals

In this section, I'll explain the internals of mold linker.

### A brief history of Unix and the Unix linker

Conceptually, what a linker does is pretty simple. A compiler compiles
a fragment of a program (a single source file) into a fragment of
machine code and data (an object file, which typically has the .o
extension), and a linker stiches them together into a single
executable or a shared library image.

In reality, modern linkers for Unix-like systems are much more
compilcated than the naive understanding because they have gradually
gained one feature at a time over the 50 years history of Unix, and
they are now something like a bag of lots of miscellaneous features in
which none of the features is more important than the others. It is
very easy to miss the forest for the trees, since for those who don't
know the details of the Unix linker, it is not clear which feature is
essential and which is not.

That being said, one thing is clear that at any point of Unix history,
a Unix linker has a coherent feature set for the Unix of that age. So,
let me entangle the history to see how the operating system, runtime
and linker have gained features that we see today. That should give
you an idea why a particular feature has been added to a linker in the
first place.

1. Original Unix didn't support shared library, and a program was
   always loaded to a fixed address. An executable was something like
   a memory dump which was just loaded to a particular address by the
   kernel. After loading, the kernel started executing the program by
   setting the instruction pointer to a particular address.

   The most essential feature for any linker is relocation processing.
   The original Unix linker of course supported that. Let me explain
   what that is.

   Individual object files are inevitably incomplete as a program,
   because when a compiler created them, it only see a part of an
   entire program. For example, if an object file contains a function
   call that refers other object file, the `call` instruction in the
   object cannot be complete, as the compiler has no idea as to what
   is the called function's address. To deal with this, the compiler
   emits a placeholder value (typically just zero) instead of a real
   address and leave a metadata in an object file saying "fix offset X
   of this file with an address of Y". That metadata is called
   "relocation". Relocations are typically processed by the linker.

   It is easy for a linker to apply relocations for the original Unix
   because a program is always loaded to a fixed address. It exactly
   knows the addresses of all functions and data when linking a
   program.

   Static library support, which is still an important feature of Unix
   linker, also dates back to this early period of Unix history.
   To understand what it is, imagine that you are trying to compile
   a program for the early Unix. You don't want to waste time to
   compile libc functions every time you compile your program (the
   computers of the era was incredibly slow), so you have already
   placed each libc function into a separate source file and compiled
   them individually. That means, you have object files for each libc
   function, e.g., printf.o, scanf.o, atoi.o, write.o, etc.

   Given this configuration, all you have to do to link your program
   against libc functions is to pick up a right set of libc object
   files and give them to the linker along with the object files of your
   program. But, keeping the linker command line in sync with the
   libc functions you are using in your program is bothersome. You can
   be conservative; you can specify all libc object files to the
   command line, but that leads to program bloat because the linker
   unconditionally link all object files given to it no matter whether
   they are used or not. So, a new feature was added to the linker to
   fix the problem. That is the static library, which is also called
   the archive file.

   An archive file is just a bundle of object files, just like zip
   file but in an uncompressed form. An achive file typically has the
   .a file extension and named after its contents. For example, the
   archive file containing all libc objects is named `libc.a`.

   If you pass an archive file along with other object files to the
   linker, the linker pulls out an object file from the archive _only
   when_ it is referenced by other object files. In other words,
   unlike object files directly given to a linker, object files
   wrapped in an archive are not linked to an output by default.
   An archive works as supplements to complete your program.

   Even today, you can still find a libc archive file. Run `ar t
   /usr/lib/x86_64-linux-gnu/libc.a` on Linux should give you a list
   of object files in the libc archive.

2. In '80s, Sun Microsystems, a leading commercial Unix vendor at the
   time, added a shared library support to their Unix variant, SunOS.

(This section is incomplete.)

## Concurrency strategy

In this section, I'll explain the high level concurrency strategy of
mold.

In most places, mold adopts data parallelism. That is, we have a huge
number of piece of data of the same kind, and we process each of them
individually using parallel for-loop. For example, after identifying
the exact set of input object files, we need to scan all relocation
tables to determine the sizes of .got and .plt sections. We do that
using a parallel for-loop. The granularity of parallel processing in
this case is the relocation table.

Data parallelism is very efficient and scalable because there's no
need for threads to communicate with each other while working on each
element of data. In addition to that, data parallelism is easy to
understand, as it is just a for-loop in which multiple iterations may
be executed in parallel. We don't use high-level communication or
synchronization mechanisms such as channels, futures, promises,
latches or something like that in mold.

In some cases, we need to share a little bit of data between threads
while executing a parallel for-loop. For example, the loop to scan
relocations turns on "requires GOT" or "requires PLT" flags in a
symbol. Symbol is a shared resource, and writing to them from multiple
threads without synchronization is unsafe. To deal with it, we made
the flag an atomic variable.

The other common pattern you can find in mold which is build on top of
the parallel for-loop is the map-reduce pattern. That is, we run a
parallel for-loop on a large data set to produce a small data set and
process the small data set with a single thread. Let me take a
build-id computation as an example. Build-id is typically computed by
applying a cryptographic hash function such as SHA-1 on a linker's
output file. To compute it, we first consider an output as a sequence
of 1 MiB blocks and compute a SHA-1 hash for each block in parallel.
Then, we concatenate the SHA-1 hashes and compute a SHA-1 hash on the
hashes to get a final build-id.

Finally, we use concurrent hashmap at a few places in mold. Concurrent
hashmap is a hashmap to which multiple threads can safely insert items
in parallel. We use it in the symbol resolution stage, for example.
To resolve symbols, we basically have to throw in all defined symbols
into a hash table, so that we can find a matching defined symbol for
an undefined symbol by name. We do the hash table insertion from a
parallel for-loop which iterates over a list of input files.

Overall, even though mold is highly scalable, it succeeded to avoid
complexties you often find in complex parallel programs. From high
level, mold just serially executes linker's internal passes one by
one. Each pass is parallelized using parallel for-loops.

## Rejected ideas

In this section, I'll explain the alternative designs I currently do
not plan to implement and why I turned them down.

- Placing variable-length sections at end of an output file and start
  copying file contents before fixing the output file layout

  Idea: Fixing the layout of regular sections seems easy, and if we
  place them at beginning of a file, we can start copying their
  contents from their input files to an output file. While copying
  file contents, we can compute the sizes of variable-length sections
  such as .got or .plt and place them at end of the file.

  Reason for rejection: I did not choose this design because I doubt
  if it could actually shorten link time and I think I don't need it
  anyway.

  The linker has to de-duplicate comdat sections (i.e. inline
  functions that are included into multiple object files), so we
  cannot compute the layout of regular sections until we resolve all
  symbols and de-duplicate comdats. That takes a few hundred
  milliseconds. After that, we can compute the sizes of
  variable-length sections in less than 100 milliseconds. It's quite
  fast, so it doesn't seem to make much sense to proceed without
  fixing the final file layout.

  The other reason to reject this idea is because there's good a
  chance for this idea to have a negative impact on linker's overall
  performance. If we copy file contents before fixing the layout, we
  can't apply relocations to them while copying because symbol
  addresses are not available yet. If we fix the file layout first, we
  can apply relocations while copying, which is effectively zero-cost
  due to a very good data locality. On the other hand, if we apply
  relocations long after we copy file contents, it's pretty expensive
  because section contents are very likely to have been evicted from
  CPU cache.

- Incremental linking

  Idea: Incremental linking is a technique to patch a previous linker's
  output file so that only functions or data that are updated from the
  previous build are written to it. It is expected to significantly
  reduce the amount of data copied from input files to an output file
  and thus speed up linking. GNU BFD and gold linkers support it.

  Reason for rejection: I turned it down because it (1) is
  complicated, (2) doesn't seem to speed it up that much and (3) has
  several practical issues. Let me explain each of them.

  First, incremental linking for real C/C++ programs is not as easy as
  one might think. Let me take malloc as an example. malloc is usually
  defined by libc, but you can implement it in your program, and if
  that's the case, the symbol `malloc` will be resolved to your
  function instead of the one in libc. If you include a library that
  defines malloc (such as libjemalloc or libtbbmallc) before libc,
  their malloc will override libc's malloc.

  Assume that you are using a nonstandard malloc. What if you remove
  your malloc from your code, or remove `-ljemalloc` from your
  Makefile? The linker has to include a malloc from libc, which may
  include more object files to satisfy its dependencies. Such code
  change can affect the entire program rather than just replacing one
  function. The same is true to adding malloc to your program. Making
  a local change doesn't necessarily result in a local change in the
  binary level.  It can easily have cascading effects.

  Some ELF fancy features make incremental linking even harder to
  implement. Take the weak symbol as an example. If you define `atoi`
  as a weak symbol in your program, and if you are not using `atoi`
  at all in your program, that symbol will be resolved to address
  0. But if you start using some libc function that indirectly calls
  `atoi`, then `atoi` will be included to your program, and your weak
  symbol will be resolved to that function. I don't know how to
  efficiently fix up a binary for this case.

  This is a hard problem, so existing linkers don't try too hard to
  solve it. For example, IIRC, gold falls back to full link if any
  function is removed from a previous build. If you want to not annoy
  users in the fallback case, you need to make full link fast anyway.

  Second, incremental linking itself has an overhead. It has to detect
  updated files, patch an existing output file and write additional
  data to an output file for future incremental linking. GNU gold, for
  instance, takes almost 30 seconds on my machine to do a null
  incremental link (i.e. no object files are updated from a previous
  build) for chrome. It's just too slow.

  Third, there are other practical issues in incremental linking. It's
  not reproducible, so your binary isn't going to be the same as other
  binaries even if you are compiling the same source tree using the
  same compiler toolchain. Or, it is complex and there might be a bug
  in it. If something doesn't work correctly, "remove --incremental
  from your Makefile and try again" could be a piece of advise, but
  that isn't ideal.

  So, all in all, incremental linking is tricky. I wanted to make full
  link as fast as possible, so that we don't have to think about how
  to workaround the slowness of full link.

- Defining a completely new file format and use it

  Idea: Sometimes, the ELF file format itself seems to be a limiting
  factor of improving linker's performance. We might be able to make a
  far better one if we create a new file format.

  Reason for rejection: I rejected the idea because it apparently has
  a practical issue (backward compatibility issue) and also doesn't
  seem to improve performance of linkers that much. As clearly
  demonstrated by mold, we can create a fast linker for ELF. I believe
  ELF isn't that bad, after all. The semantics of the existing Unix
  linkers, such as the name resolution algorithm or the linker script,
  have slowed the linkers down, but that's not a problem of the file
  format itself.

- Watching object files using inotify(2)

  Idea: When mold is running as a daemon for preloading, use
  inotify(2) to watch file system updates so that it can reload files
  as soon as they are updated.

  Reason for rejection: Just like the maximum number of files you can
  simultaneously open, the maximum number of files you can watch using
  inotify(2) isn't that large. Maybe just a single instance of mold is
  fine with inotify(2), but it may fail if you run multiple of it.

  The other reason for not doing it is because mold is quite fast
  without it anyway. Invoking stat(2) on each file for file update
  check takes less than 100 milliseconds for Chrome, and if most of
  the input files are not updated, parsing updated files takes almost
  no time.