To process version scripts, we have to match glob patterns against
symbol strings. Sometimes, we have hundreds or thousands of glob
patterns and have to match them against millions of mangled long
C++ symbol names. This step can be very slow.
In this patch, I implemented the Aho-Corasick algorithm to match glob
patterns to symbol strings as quickly as possible. For the details
of the algorithm, see https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm.
This patch improves mold's performance for programs that use large
version scripts. For example, the link time of libQt6Gui.so.6.3.0 dropped
from 1.10s to 0.05s with this patch.
This patch also changes how symbol versions are applied if two or more
version patterns match a single symbol string. Previously, the last
one in a script file took precedence. Now, the first one takes
precedence. I believe the new behavior is compatible with GNU ld.
Fixes https://github.com/rui314/mold/issues/156
Fixes https://github.com/rui314/mold/issues/287
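As a rough illustration of the matching step, here is a minimal Aho-Corasick
sketch. It is not mold's code: the class name `SubstringMatcher` and its
methods are hypothetical, and it handles only literal substrings (the middle
part of a pattern such as `*foo*`), not full glob syntax. It returns the index
of the earliest-listed pattern that occurs in the symbol, which mirrors the
first-match precedence described above.

```cpp
#include <array>
#include <queue>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sketch: one automaton for all patterns, so each symbol is
// scanned once regardless of how many patterns there are.
class SubstringMatcher {
public:
  explicit SubstringMatcher(const std::vector<std::string> &patterns) {
    nodes.emplace_back(); // root
    for (size_t i = 0; i < patterns.size(); i++) {
      int cur = 0;
      for (unsigned char c : patterns[i]) {
        if (nodes[cur].next[c] < 0) {
          int id = static_cast<int>(nodes.size());
          nodes.emplace_back();
          nodes[cur].next[c] = id;
        }
        cur = nodes[cur].next[c];
      }
      // Keep the smallest index if the same pattern is listed twice.
      nodes[cur].first = earlier(nodes[cur].first, static_cast<int>(i));
    }
    build_links();
  }

  // Returns the lowest index of a pattern occurring in `s`, or -1.
  int find_first(std::string_view s) const {
    int cur = 0;
    int best = nodes[0].first;
    for (unsigned char c : s) {
      cur = nodes[cur].next[c];
      best = earlier(best, nodes[cur].first);
    }
    return best;
  }

private:
  struct Node {
    std::array<int, 256> next;
    int fail = 0;
    int first = -1; // lowest pattern index ending here or at a suffix state
    Node() { next.fill(-1); }
  };

  static int earlier(int a, int b) {
    if (a < 0) return b;
    if (b < 0) return a;
    return a < b ? a : b;
  }

  // Breadth-first pass that fills in failure links and turns the trie
  // into a complete transition table.
  void build_links() {
    std::queue<int> q;
    for (int c = 0; c < 256; c++) {
      int &t = nodes[0].next[c];
      if (t < 0) t = 0;
      else q.push(t);
    }
    while (!q.empty()) {
      int u = q.front();
      q.pop();
      int f = nodes[u].fail;
      nodes[u].first = earlier(nodes[u].first, nodes[f].first);
      for (int c = 0; c < 256; c++) {
        int &t = nodes[u].next[c];
        if (t < 0) {
          t = nodes[f].next[c];
        } else {
          nodes[t].fail = nodes[f].next[c];
          q.push(t);
        }
      }
    }
  }

  std::vector<Node> nodes;
};
```

With patterns `{"Qt", "_ZN", "foo"}`, the symbol `_ZN2Qt3fooEv` contains all
three, and `find_first` reports index 0 because `Qt` is listed first.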
ARM64 branch instructions have only a 26-bit displacement field. ARM64
instructions are always aligned to 4-byte boundaries, so with the two
implicit trailing zero bits considered, they can represent only a 28-bit
signed displacement. That means a branch instruction can jump to a location
only if it is within a ±128 MiB range.
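The range arithmetic can be sketched as follows. This is an illustration only,
assuming the signed 26-bit immediate of the B/BL encoding scaled by 4; the
function name `is_reachable` and the constant name are hypothetical, not taken
from mold.

```cpp
#include <cstdint>

// A signed 26-bit immediate scaled by 4 covers [-2^27, 2^27) bytes,
// i.e. 128 MiB in either direction.
constexpr int64_t kBranchRange = int64_t{1} << 27;

// Returns true if a branch at address `from` can encode a jump to `to`.
bool is_reachable(uint64_t from, uint64_t to) {
  int64_t disp = static_cast<int64_t>(to - from);
  return -kBranchRange <= disp && disp < kBranchRange && disp % 4 == 0;
}
```

A forward jump of 0x7fffffc bytes is the largest that fits; a forward jump of
exactly 0x8000000 bytes (128 MiB) does not and would need a thunk.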
If a branch destination is farther than that, a linker has to emit a machine
code sequence that constructs a full 32-bit address in a register to jump
to the final destination, and redirect the branch to that code sequence.
Such a code sequence is called a "range extension thunk" or just "thunk".
Previously, mold didn't support range extension thunks, so it couldn't link
large programs. Linking would fail with a "relocation out of range" error.
Now, mold can create thunks and therefore link large programs.
Thunk creation is an interesting algorithmic problem. We need to insert
at least one thunk in every 128 MiB chunk, because otherwise branch
instructions wouldn't be able to reach a thunk. Adding an entry to a thunk
slightly enlarges the distance between a branch instruction and its
destination if the thunk lies between them. That could make a branch that
was previously reachable unreachable.
Usually, this problem is solved by an iterative algorithm. With the
iterative algorithm, a linker checks the reachability of all relocations,
creates new thunks if necessary, and repeats until no new thunks are
created.
In this patch, I implemented a different algorithm. It is guaranteed to
run in O(n) time, where n is the number of relocations. This algorithm
might be novel.
The algorithm is also quite fast in practice. It can create thunks in
80 milliseconds on a 16-core Amazon Graviton 2 machine for clang-14,
which has a ~100 MB .text section.
.relr.dyn is a new section that has been implemented in other linkers
recently. That section contains only the RELATIVE-type dynamic
relocations (i.e. base relocations). Compared to the regular
.rela.dyn, .relr.dyn is typically less than 1/10 the size because the
section is compressed.
Since PIEs (position-independent executables) tend to contain lots of
RELATIVE-type relocations and PIEs are now the default on many Linux
distributions for security reasons, .relr.dyn is more effective than
it used to be. It can reduce binary size by a few percent or more.
Note that the runtime support is catching up, so binaries built with
`-pack-dyn-relocs=relr` may not work on your system unless you are
running a very recent version of Linux.
In mold, a relocation refers to either a symbol or a section piece.
A section piece is a segment of a mergeable section such as
.rodata.str.1 or .debug_str.
Previously, we preprocessed all relocations referring to mergeable
sections to find their corresponding section pieces. We did this
because we want to find a relocation target as quickly as possible
for `gc_sections` and `scan_rels`.
However, we didn't actually have to do that for non-alloc sections,
as non-alloc sections are subject to neither `gc_sections` nor
`scan_rels`. So, we can skip relocation preprocessing for non-alloc
sections. This commit implements that optimization.
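The condition can be sketched as a flag test. This is only an illustration:
the helper name `needs_piece_lookup_upfront` is hypothetical and does not
appear in mold, while the SHF_* values are the standard ELF section flags.

```cpp
#include <cstdint>

// Standard ELF section flag values.
constexpr uint64_t SHF_ALLOC = 0x2;
constexpr uint64_t SHF_MERGE = 0x10;

// Hypothetical helper: decides whether relocations in a section with
// `referrer_flags` that point into a section with `target_flags` need
// their section pieces resolved ahead of time.
bool needs_piece_lookup_upfront(uint64_t referrer_flags, uint64_t target_flags) {
  bool target_is_mergeable = (target_flags & SHF_MERGE) != 0;
  bool referrer_is_alloc = (referrer_flags & SHF_ALLOC) != 0;
  // Non-alloc referrers (e.g. .debug_info) are never visited by
  // gc_sections or scan_rels, so the eager lookup would be wasted work.
  return target_is_mergeable && referrer_is_alloc;
}
```

Under this sketch, a relocation from .text into .rodata.str.1 is still
preprocessed, while one from .debug_info into .debug_str is not.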
This looks like an effective optimization for programs that have
large debug info, because debug sections tend to contain a lot of
relocations referring to .debug_str, which is a mergeable section.
Here are a few notable examples.
            Output size   Before   After
  clang-14  2.2 GiB       1.455s   1.396s  (4% faster)
  mongodb   4.8 GiB       2.341s   1.925s  (17% faster)
These numbers were measured on a simulated 16-core 32-thread machine.
Previously, our parallel symbol resolution algorithm was not
deterministic in edge cases. As an example, consider the following two
source files:
  foo.c:
    inline void fn1() { ... }
  bar.c:
    inline void fn1() { ... }
    void fn2() { ... }
Let's say you compile these files and put them into an archive file.
If mold decided to pull out `foo.o` first for `fn1` and then `bar.o`
for `fn2`, then both `foo.o` and `bar.o` would be included in the result.
However, if mold pulled out `bar.o` first, then there would be no chance
for `foo.o` to be pulled out, so only `bar.o` would be included in the
result.
The algorithm implemented in this commit should be deterministic.
We do not override symbols when we mark live objects.
Previously, mold crashed due to an invalid regex pattern exception
when `[...]` is given as a version script pattern.
Fixes https://github.com/rui314/mold/issues/258
Previously, mold put all global symbols into .gnu.hash. Although I
believe it was not an error, it bloated the size of .gnu.hash because
.gnu.hash needs only exported symbols.
https://github.com/rui314/mold/issues/255
This is very hacky but highly practical, so I couldn't resist
implementing it. We should support LTO natively in the future. In the
meantime, this feature should work as a poor man's replacement.
Fixes https://github.com/rui314/mold/issues/242
This is not a linker feature, but in order to learn how Mach-O
executables are constructed, I'll implement a dump feature.
I'll remove the feature once I understand the structure of Mach-O
binaries.