This patch moves AddressSpace (the per-process memory manager) to using
the new atomic "place" APIs in RegionTree as well, just like we did for
MemoryManager in the previous commit.
This required updating quite a few places where VM allocation and
actually committing a Region object to the AddressSpace were separated
by other code.
All you have to do now is call into AddressSpace once and it'll take
care of everything for you.
Instead of first allocating the VM range, and then inserting a region
with that range into the MM region tree, we now do both things in a
single atomic operation:
- RegionTree::place_anywhere(Region&, size, alignment)
- RegionTree::place_specifically(Region&, address, size)
To reduce the number of things we do while locking the region tree,
we also require callers to provide a constructed Region object.
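As a rough illustration of the "find a hole and insert in one step" idea, here is a minimal userspace sketch. SketchRegion and SketchRegionTree are hypothetical stand-ins, not the kernel classes. The sketch uses std::map and std::mutex for brevity, whereas the real tree is intrusive and allocation-free, and it ignores address overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <mutex>

// Hypothetical stand-in for a kernel Region: just a virtual range.
struct SketchRegion {
    uintptr_t base { 0 };
    size_t size { 0 };
};

class SketchRegionTree {
public:
    SketchRegionTree(uintptr_t base, size_t size)
        : m_base(base)
        , m_size(size)
    {
    }

    // Finds a large enough hole and inserts the region under a single lock.
    // `alignment` is assumed to be a power of two.
    bool place_anywhere(SketchRegion& region, size_t size, size_t alignment)
    {
        std::lock_guard lock(m_lock);
        uintptr_t candidate = align_up(m_base, alignment);
        for (auto const& [base, existing] : m_regions) {
            if (candidate + size <= base)
                break; // The hole in front of this region is big enough.
            candidate = align_up(base + existing->size, alignment);
        }
        if (candidate + size > m_base + m_size)
            return false;
        commit(region, candidate, size);
        return true;
    }

    // Inserts the region at exactly `address`, or fails if anything overlaps.
    bool place_specifically(SketchRegion& region, uintptr_t address, size_t size)
    {
        std::lock_guard lock(m_lock);
        if (address < m_base || address + size > m_base + m_size)
            return false;
        auto next = m_regions.lower_bound(address);
        if (next != m_regions.end() && next->first < address + size)
            return false;
        if (next != m_regions.begin()) {
            auto previous = std::prev(next);
            if (previous->first + previous->second->size > address)
                return false;
        }
        commit(region, address, size);
        return true;
    }

private:
    static uintptr_t align_up(uintptr_t value, size_t alignment)
    {
        return (value + alignment - 1) & ~(uintptr_t)(alignment - 1);
    }

    void commit(SketchRegion& region, uintptr_t base, size_t size)
    {
        region.base = base;
        region.size = size;
        m_regions[base] = &region;
    }

    uintptr_t m_base;
    size_t m_size;
    std::mutex m_lock;
    std::map<uintptr_t, SketchRegion*> m_regions; // non-owning, keyed by base
};
```

Because the caller constructs the region up front, the only work done under the lock is the hole search and the insertion.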
This patch ports MemoryManager to RegionTree as well. The biggest
difference between this and the userspace code is that kernel regions
are owned by extant OwnPtr<Region> objects spread around the kernel,
while userspace regions are owned by the AddressSpace itself.
For kernelspace, there are a couple of situations where we need to make
large VM reservations that never get backed by regular VMObjects
(for example the kernel image reservation, or the big kmalloc range.)
Since we can't make a VM reservation without a Region object anymore,
this patch adds a way to create unbacked Region objects that can be
used for this exact purpose. (They have no internal VMObject.)
RegionTree holds an IntrusiveRedBlackTree of Region objects and vends a
set of APIs for allocating memory ranges.
It's used by AddressSpace at the moment, and will be used by MM soon.
This patch stops using VirtualRangeAllocator in AddressSpace and instead
looks for holes in the region tree when allocating VM space.
There are many benefits:
- VirtualRangeAllocator is non-intrusive and would call kmalloc/kfree
when used. This new solution is allocation-free. This was a source
of unpleasant MM/kmalloc deadlocks.
- We consolidate authority on what the address space looks like in a
single place. Previously, we had both the range allocator *and* the
region tree being used to determine if an address was valid.
Now there is only the region tree.
- Deallocation of VM when splitting regions is no longer complicated,
as we don't need to keep two separate trees in sync.
Now that we reclaim the memory range that is created by KASLR before
the start of the kernel image, there's no need to be conservative with
the KASLR offset.
This ensures we don't just waste the memory range between the default
base load address and the actual load address that was shifted by the
KASLR offset.
If we crashed in the middle of mapping in Regions, some of the regions
may not have a page directory yet, and will result in a crash when
Region::remap() is called.
If someone specifically wants contiguous memory in the low-physical-
address-for-DMA range ("super pages"), they can use the
allocate_dma_buffer_pages() helper.
Function-local `static constexpr` variables can be `constexpr`. This
can reduce memory consumption, binary size, and offer additional
compiler optimizations.
These changes result in a stripped x86_64 kernel binary size reduction
of 592 bytes.
As make<T> is infallible, it really should not be used anywhere in the
Kernel. Instead, replace it with fallible `new (nothrow)` calls, whose
failures will eventually be propagated as errors.
When a page fault led to the mapping of a new physical page, we were
updating the page tables for *every* region that shared the same
underlying VMObject.
Let's just not do that, avoiding a bunch of unnecessary page table
updates and TLB invalidations.
Ideally the x86 fault handler would only do x86 specific things and
delegate the rest of the work to MemoryManager. This patch moves some of
the address checks to a more generic place.
This avoids taking and releasing the MM lock just to reject an address
that we can tell, just from looking at it, will never be in the kernel
regions tree.
When the values we're setting are not actually u32s and the size of the
area we're setting is PAGE_SIZE-aligned and a multiple of PAGE_SIZE in
size, there's no point in using fast_u32_fill, as that forces us to use
STOSDs instead of STOSQs.
This allows us to enable Write-Combine on e.g. framebuffers,
significantly improving performance on bare metal.
To keep things simple we currently only use one of up to three bits
(bit 7 in the PTE), which maps to the PA4 entry in the PAT MSR, which
we set to the Write-Combine mode on each CPU at boot time.
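The bit arithmetic behind this can be sketched as follows. The helper names are hypothetical. The encodings are the architectural ones: eight 8-bit PAT entries, Write-Combining encoded as 0x01, and a 4 KiB PTE selecting its entry via PAT (bit 7), PCD (bit 4) and PWT (bit 3). This is not the kernel's actual code.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t pat_type_write_combine = 0x01;
constexpr uint64_t pte_pwt_bit = 1ull << 3;
constexpr uint64_t pte_pcd_bit = 1ull << 4;
constexpr uint64_t pte_pat_bit = 1ull << 7; // PAT bit of a 4 KiB page PTE

// Returns `msr` with PAT entry `index` (0..7) replaced by `type`.
constexpr uint64_t with_pat_entry(uint64_t msr, unsigned index, uint64_t type)
{
    unsigned shift = index * 8;
    return (msr & ~(0xffull << shift)) | (type << shift);
}

// Computes which PAT entry a PTE selects: index = PAT:PCD:PWT.
constexpr unsigned pat_index_for_pte(uint64_t pte)
{
    return ((pte & pte_pat_bit) ? 4u : 0u)
        | ((pte & pte_pcd_bit) ? 2u : 0u)
        | ((pte & pte_pwt_bit) ? 1u : 0u);
}
```

With only bit 7 set in a PTE, the selected index is 4, which is why programming PA4 to Write-Combine is sufficient for this scheme.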
We were already using a non-intrusive RedBlackTree, and since the kernel
regions tree is non-owning, this is a trivial conversion that makes a
bunch of the tree operations infallible (by being allocation-free.) :^)
PageDirectory gets initialized step-by-step in
PageDirectory::try_create_for_userspace(). This initialization may fail
anywhere in this function - for example, we may not be able to
allocate a directory table, in which case
PageDirectory::try_create_for_userspace() will return a null pointer.
We recognize this condition and early-return ENOMEM. However, at this
point, we need to correctly destruct the only partially initialized
PageDirectory. Previously, PageDirectory::~PageDirectory() would assume
that the object it was destructing was always fully initialized. It now
uses the new helper PageDirectory::is_cr3_initialized() to correctly
recognize when the directory table was not yet initialized. This helper
checks if the pointer to the directory table is null. Only if it is not
null does the destructor try to fetch the directory table using
PageDirectory::cr3().
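The destructor pattern described here might look roughly like the sketch below. SketchPageDirectory, SketchDirectoryTable and the simulate_oom flag are all hypothetical, and the sketch uses a plain heap object where the kernel uses a physical page.

```cpp
#include <cassert>
#include <memory>
#include <new>

struct SketchDirectoryTable {
    unsigned long entries[4] {};
};

class SketchPageDirectory {
public:
    // Step-by-step initialization; any step may fail and return null.
    static std::unique_ptr<SketchPageDirectory> try_create(bool simulate_oom)
    {
        std::unique_ptr<SketchPageDirectory> directory(new (std::nothrow) SketchPageDirectory);
        if (!directory)
            return nullptr;
        if (simulate_oom)
            return nullptr; // The destructor now runs on a partially built object.
        directory->m_table = new (std::nothrow) SketchDirectoryTable;
        if (!directory->m_table)
            return nullptr;
        return directory;
    }

    ~SketchPageDirectory()
    {
        // Only touch the table if initialization got far enough to create it.
        if (is_table_initialized())
            delete &table();
    }

    bool is_table_initialized() const { return m_table != nullptr; }

    SketchDirectoryTable& table()
    {
        assert(m_table); // Mirrors an accessor that requires full initialization.
        return *m_table;
    }

private:
    SketchPageDirectory() = default;
    SketchDirectoryTable* m_table { nullptr };
};
```

Without the is_table_initialized() guard, the failure path would hit the assertion in table() while unwinding.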
These infallible resource factory functions were only there to ease the
conversion to the new factory functions. Since all child classes of
VMObject now use the fallible resource factory functions, we don't
need the infallible versions anymore.
This commit moves the allocation of the resources required for
SharedInodeVMObject from its constructors to its factory functions.
We're making this change to expose the fallibility of the allocation.
This commit moves the allocation of the resources required for
PrivateInodeVMObject from its constructors to its factory functions.
We're making this change to expose the fallibility of the allocation.
This commit moves the allocation of the resources required for
InodeVMObject from its constructors to the constructors of its child
classes.
We're making this change to give the child classes the chance to expose
the fallibility of the allocation.
This commit moves the allocation of the resources required for
AnonymousVMObject from its constructors to its factory functions.
We're making this change to expose the fallibility of the allocation.
This commit moves the allocation of the resources required for VMObject
from its constructors to the constructors of its child classes.
We're making this change to give the child classes the chance to expose
the fallibility of the allocation.
The only purpose of the remap() in Region::try_clone() is to ensure
non-writable page table entries for CoW regions. If a region is already
non-writable, there's no need to waste time updating the page tables.
When mapping or unmapping completely inaccessible memory regions,
we don't need to update the page tables at all. This saves a bunch of
time in some situations, most notably during dynamic linking, where we
make a large VM reservation and immediately throw it away. :^)
We were already only tracking kernel regions, this patch just makes it
more clear by having it reflected in the name of the registration
helpers.
We also stop calling them for userspace regions, avoiding some spinlock
action in such cases.
This optimization was added when region lookup was O(n), before we had
the O(log n) RedBlackTree. Let's remove it to simplify the code, as we
have no evidence that it remains valuable.
Previously we would only remove them from the map if they were attached
to an AddressSpace, even though we would always add them to the map on
construction. This results in an assertion failure on destruction if
the page directory was never attached to an AddressSpace (for example,
on an allocation failure of said AddressSpace).
This mostly just moved the problem, as a lot of the callers are not
capable of propagating the errors themselves, but it's a step in the
right direction.
When deleting an entire AddressSpace, we don't need to do TLB flushes
at all (since the entire page directory is going away anyway).
We also don't need to deallocate VM ranges one by one, since the entire
VM range allocator will be deleted anyway.
The purpose of the PageDirectory::m_page_tables map was really just
to act as ref-counting storage for PhysicalPage objects that were
being used for the directory's page tables.
However, this was basically redundant, since we can find the physical
address of each page table from the page directory, and we can find the
PhysicalPage object from MemoryManager::get_physical_page_entry().
So if we just manually ref() and unref() the pages when they go in and
out of the directory, we no longer need PageDirectory::m_page_tables!
Not only does this remove a bunch of kmalloc() traffic, it also solves
a race condition that would occur when lazily adding a new page table
to a directory:
Previously, when MemoryManager::ensure_pte() would call HashMap::set()
to insert the new page table into m_page_tables, if the HashMap had to
grow its internal storage, it would call kmalloc(). If that kmalloc()
would need to perform heap expansion, it would end up calling
ensure_pte() again, which would clobber the page directory mapping used
by the outer invocation of ensure_pte().
The net result of the above bug would be that any invocation of
MemoryManager::ensure_pte() could erroneously return a pointer into
a kernel page table instead of the correct one!
This whole problem goes away when we remove the HashMap, as ensure_pte()
no longer does anything that allocates from the heap.
Not all drivers need the PhysicalPage output parameter while creating
a DMA buffer. This overload will avoid creating a temporary variable
for the caller.
The cacheable parameter to allocate_kernel_region should be explicitly
set to No as this region is used to do physical memory transfers. Even
though most architectures ignore this even if it is set, it is better
to make this explicit.
FixedArray now doesn't expose any infallible constructors anymore.
Rather, it exposes fallible methods. Therefore, it can be used for
OOM-safe code.
This commit also converts the rest of the system to use the new API.
However, as an example, VMObject can't take advantage of this yet,
as we would have to endow VMObject with a fallible static
construction method, which would require a very fundamental change
to VMObject's whole inheritance hierarchy.
So far we only had mmap(2) functionality on the /dev/mem device, but now
we can also do read(2) on it.
The unit test was updated to check that we are doing it safely.
As pointed out by Idan Horowitz, the rest of the method doesn't
assume we have any reserved ranges to allow mmap(2) to work on them, so
the VERIFY is not needed at all.
This was a premature optimization from the early days of SerenityOS.
The eternal heap was a simple bump pointer allocator over a static
byte array. My original idea was to avoid heap fragmentation and improve
data locality, but both ideas were rooted in cargo culting, not data.
We would reserve 4 MiB at boot and only ended up using ~256 KiB, wasting
the rest.
This patch replaces all kmalloc_eternal() usage by regular kmalloc().
Previously, the heap expansion logic could end up calling kmalloc
recursively, which was quite messy and hard to reason about.
This patch redesigns heap expansion so that it's kmalloc-free:
- We make a single large virtual range allocation at startup
- When expanding, we bump allocate VM from that region
- When expanding, we populate page tables directly ourselves,
instead of going via MemoryManager.
This makes heap expansion a great deal simpler. However, do note that it
introduces two new flaws that we'll need to deal with eventually:
- The single virtual range allocation is limited to 64 MiB and once
exhausted, kmalloc() will fail. (Actually, it will PANIC for now..)
- The kmalloc heap can no longer shrink once expanded. Subheaps stay
in place once constructed.
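The bump allocation step from the list above can be sketched like this. SketchExpansionRange is a hypothetical name, the page table population is omitted, and the sketch returns an empty optional where the kernel currently panics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

class SketchExpansionRange {
public:
    SketchExpansionRange(uintptr_t base, size_t size)
        : m_next(base)
        , m_end(base + size)
    {
    }

    // Hands out the next chunk of the reservation, or nothing once exhausted.
    // Nothing is ever returned to the range, so it can only grow.
    std::optional<uintptr_t> bump_allocate(size_t size)
    {
        if (size > m_end - m_next)
            return std::nullopt;
        uintptr_t result = m_next;
        m_next += size;
        return result;
    }

private:
    uintptr_t m_next;
    uintptr_t m_end;
};
```

Both flaws listed above are visible here: there is a hard ceiling at m_end, and there is no operation to give memory back.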
The function to protect ksyms after initialization is only used during
boot of the system, so it can be UNMAP_AFTER_INIT as well.
This requires we switch the order of the init sequence, so we now call
`MM.protect_ksyms_after_init()` before `MM.unmap_text_after_init()`.
As a small cleanup, this also makes `page_round_up` verify its
precondition with `page_round_up_would_wrap` (which callers are expected
to call), rather than having its own logic.
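A minimal sketch of that relationship, assuming a 4 KiB page size and using assert() in place of the kernel's VERIFY(); the bodies here are illustrative, not copied from the kernel:

```cpp
#include <cassert>
#include <cstdint>

constexpr uintptr_t sketch_page_size = 4096;

// True if rounding `address` up to a page boundary would overflow.
constexpr bool page_round_up_would_wrap(uintptr_t address)
{
    return address > UINTPTR_MAX - (sketch_page_size - 1);
}

// Rounds up to the next page boundary; callers must rule out wrapping first.
inline uintptr_t page_round_up(uintptr_t address)
{
    assert(!page_round_up_would_wrap(address));
    return (address + sketch_page_size - 1) & ~(sketch_page_size - 1);
}
```

Keeping the wrap test in one helper means the precondition that callers check and the one the function verifies cannot drift apart.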
Fixes #11297.
This error only ever gets propagated to the userspace if
MAP_FIXED_NOREPLACE is requested, as MAP_FIXED unmaps intersecting
ranges beforehand, and non-fixed mmap() calls will just fall back to
allocating anywhere.
Linux specifies MAP_FIXED_NOREPLACE to return EEXIST when it can't
allocate, so we now match that behavior.
The Prekernel's memory is only accessed until MemoryManager has been
initialized. Keeping it around afterwards is both unnecessary and bad,
as it prevents the userland from using the 0x100000-0x155000 virtual
address range.
Co-authored-by: Idan Horowitz <idan.horowitz@gmail.com>
In order to reduce our reliance on __builtin_{ffs, clz, ctz, popcount},
this commit removes all calls to these functions and replaces them with
the equivalent functions in AK/BuiltinWrappers.h.
We can leave the .ksyms section mapped-but-read-only and then have the
symbols index simply point into it.
Note that we manually insert null-terminators into the symbols section
while parsing it.
This gets rid of ~950 KiB of kmalloc_eternal() at startup. :^)
Since it's possible to determine where the small zones will start to
occur for each PhysicalRegion, we can use arithmetic so that the call
time for both large and small zones is identical.
It's not enough to just find the largest-address-not-above the argument,
we must also check that the found region actually contains the argument.
Regressed in a23edd42b8, thanks to Idan
for pointing this out.
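The two-step lookup can be sketched with std::map standing in for the kernel's red-black tree. SketchRange and find_region_containing are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

struct SketchRange {
    uintptr_t base;
    size_t size;

    bool contains(uintptr_t address) const
    {
        return address >= base && address - base < size;
    }
};

// Regions are keyed by base address.
inline SketchRange const* find_region_containing(std::map<uintptr_t, SketchRange> const& regions, uintptr_t address)
{
    auto it = regions.upper_bound(address);
    if (it == regions.begin())
        return nullptr; // Every region starts above the address.
    --it; // Largest base address that is not above `address`.
    if (!it->second.contains(address))
        return nullptr; // The address falls in the gap after that region.
    return &it->second;
}
```

Dropping the contains() check would make a lookup in the gap between two regions return the lower region, which is the regression described above.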
Most of the time, we will be freeing physical pages within the
full-sized zones. We can do some simple math to find the right zone
immediately instead of looping through the zones, checking each one.
We still do loop through the slack/remainder zones at the end.
There's probably an even nicer way to solve this, but this is already a
nice improvement. :^)
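Roughly, the lookup could be sketched as below. The layout is an assumption made for illustration: a run of equally sized zones starting at the region base, followed by a short list of remainder zones. SketchPhysicalRegion and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

struct SketchPhysicalRegion {
    uintptr_t base;
    size_t large_zone_size;
    size_t large_zone_count;
    // Slack/remainder zones as (base, size) pairs, indexed after the large ones.
    std::vector<std::pair<uintptr_t, size_t>> remainder_zones;

    std::optional<size_t> zone_index_for(uintptr_t paddr) const
    {
        if (paddr < base)
            return std::nullopt;
        size_t offset = paddr - base;
        // Fast path: full-sized zones are found with a single division.
        if (offset < large_zone_size * large_zone_count)
            return offset / large_zone_size;
        // Slow path: loop through the few remainder zones.
        for (size_t i = 0; i < remainder_zones.size(); ++i) {
            auto [zone_base, zone_size] = remainder_zones[i];
            if (paddr >= zone_base && paddr - zone_base < zone_size)
                return large_zone_count + i;
        }
        return std::nullopt;
    }
};
```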
We were already doing this for userspace memory regions (in the
Memory::AddressSpace class), so let's do it for kernel regions as well.
This gives a nice speed-up on test-js and probably basically everything
else as well. :^)
SIGSTKFLT is a signal that signifies a stack fault in an x87 coprocessor.
This signal is not POSIX and is also unused by Linux and the BSDs, so
let's use SIGSEGV so that programs that set up signal handlers for the
common signals can still handle them in Serenity.
To make sure we don't lose changes, shared file mappings will now be
fully synced when they are unmapped, whether explicitly or implicitly
(by the program exiting/crashing/etc.)
This can incur a lot of work, since we don't keep track of dirty pages,
but that's something we can optimize down the road. :^)
This allows userspace to trigger a full (FIXME) flush of a shared file
mapping to disk. We iterate over all the mapped pages in the VMObject
and write them out to the underlying inode, one by one. This is rather
naive, and there's lots of room for improvement.
Note that shared file mappings are currently not possible since mmap()
returns ENOTSUP for PROT_WRITE+MAP_SHARED. That restriction will be
removed in a subsequent commit. :^)
This is a handy helper that copies out the full contents of a physical
page into a caller-provided buffer. It uses quickmapping internally
(and takes the MM lock for the duration.)
This isn't a complete conversion to ErrorOr<void>, but a good chunk.
The end goal here is to propagate buffer allocation failures to the
caller, and allow the use of TRY() with formatting functions.
Instead of signalling allocation failure with a bool return value
(false), we now use ErrorOr<void> and return ENOMEM as appropriate.
This allows us to use TRY() and MUST() with Vector. :^)
We now use AK::Error and AK::ErrorOr<T> in both kernel and userspace!
This was a slightly tedious refactoring that took a long time, so it's
not unlikely that some bugs crept in.
Nevertheless, it does pass basic functionality testing, and it's just
real nice to finally see the same pattern in all contexts. :^)
This small change simplifies the function a bit but also fixes a problem
with it.
Let's take an example to see this:
Let's say we have a reserved range from 0xe0000 to 0xfffff (EBDA), and
we want to map the entire EBDA into a program via the memory device
(/dev/mem). If a program tries to map more than 131072 bytes, the
current logic works: the start address is 0xe0000, which is of course
below the limit, so it passes the first two restrictions. Then, the
third if statement fails if we try to mmap more than the allowed
number of bytes.
However, let's take another scenario, where we try to mmap from
0xf0000 with a length of less than 131072 but more than 65536 bytes.
In that case, we again pass the first two if statements, but the third
one passes too, because it doesn't take into account the offset of the
address from the start of the reserved range (0xe0000). In such a case,
a user can easily mmap 65535 bytes above 0x100000. This might
seem negligible. However, it's still a severe bug that can theoretically
be exploited into an info leak or tampering with important kernel
structures.
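The two scenarios can be replayed with a small sketch. The exact shape of the original three checks is an assumption reconstructed from the description above, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct SketchPhysicalRange {
    uint64_t start;
    uint64_t length;
};

// Assumed shape of the old logic: start and length are validated
// independently of each other.
inline bool is_allowed_buggy(SketchPhysicalRange const& range, uint64_t start, uint64_t length)
{
    if (start < range.start)
        return false;
    if (start >= range.start + range.length)
        return false;
    if (length > range.length)
        return false;
    return true;
}

// The length is compared against what is left of the range after `start`.
inline bool is_allowed_fixed(SketchPhysicalRange const& range, uint64_t start, uint64_t length)
{
    if (start < range.start)
        return false;
    uint64_t range_end = range.start + range.length;
    if (start >= range_end)
        return false;
    return length <= range_end - start;
}
```

For the range 0xe0000..0xfffff (length 0x20000), a request at 0xf0000 of 0x18000 bytes ends at 0x108000. The independent checks accept it, while the offset-aware check rejects it.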
SonarCloud flagged this "Code Smell", where we are accessing these
static methods as if they are instance methods. While it is technically
possible, it is very confusing to read when you realize they are static
functions.
When testing the RTL8168 driver, it seems we can't allocate super pages
anymore. Either we expand the super pages range, or find a solution to
dynamically expand the range (or let drivers utilize other ranges).
This function was checking 1 byte after the provided range, which caused
it to reject valid userspace ranges that happened to end exactly at the
top of the user address space.
This fixes a long-standing issue with mysterious Optional errors in
Coredump::write_regions(). (It happened when trying to add a memory
region at the very top of the address space to a coredump.)
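The off-by-one can be illustrated as follows. The ceiling value and both function bodies are assumptions for the sake of the example, not the kernel's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical ceiling: the first address that is NOT part of userspace.
constexpr uintptr_t sketch_user_range_ceiling = 0xbe000000;

constexpr bool is_user_address(uintptr_t address)
{
    return address < sketch_user_range_ceiling;
}

// Off by one: `base + size` is the byte *after* the range, so a range that
// ends exactly at the ceiling is rejected.
constexpr bool is_user_range_buggy(uintptr_t base, size_t size)
{
    if (base + size < base)
        return false; // wrapped around
    return is_user_address(base) && is_user_address(base + size);
}

// Compares the size against the space left below the ceiling instead.
constexpr bool is_user_range_fixed(uintptr_t base, size_t size)
{
    if (!is_user_address(base))
        return false;
    return size <= sketch_user_range_ceiling - base;
}
```

A one-page range ending exactly at the ceiling is the case that separates the two versions.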
This expands the reach of error propagation greatly throughout the
kernel. Sadly, it also exposes the fact that we're allocating (and
doing other fallible things) in constructors all over the place.
This patch doesn't attempt to address that of course. That's work for
our future selves.
The default template argument is only used in one place, and it
looks like it was probably just an oversight. The rest of the Kernel
code all uses u8 as the type. So let's make that the default and remove
the unused template argument, as there doesn't seem to be a reason to
allow the size to be customizable.
This commit moves the KResult and KResultOr objects to Kernel/API to
signify that they may now be freely used by userspace code at points
where a syscall-related error result is to be expected. It also exposes
KResult and KResultOr to the global namespace to make it nicer to use
for userspace code.
And also try_create<T> => try_make_ref_counted<T>.
A global "create" was a bit much. The new name matches make<T> better,
which we've used for making single-owner objects since forever.
We have seen cases where the map fails, but we return the region
to the caller, causing them to page fault later on when they touch
the region.
The fix is to always observe the return code of map/remap.
The quickmap_page() and unquickmap_page() functions are used to map a
single physical page at a kernel virtual address for temporary access.
These use the per-CPU quickmap buffer in the page tables, and access to
this is guarded by the MM lock. To prevent bugs, quickmap_page() should
not *take* the MM lock, but rather verify that it is already held!
This exposed two situations where we were using quickmap without holding
the MM lock during page fault handling. This patch is forced to fix
these issues (which is great!) :^)
This has several benefits:
1) We no longer just blindly dereference a null pointer in various places
2) We will get nicer runtime error messages if the current process does
turn out to be null in the call location
3) GCC no longer complains about possible nullptr dereferences when
compiling without KUBSAN
The VMObject class now manages its own instance list (it was previously
a member of MemoryManager.) Removal from the list is done safely on the
last unref(), closing a race window in the previous implementation.
Note that VMObject::all_instances() now has its own lock instead of
using the global MM lock.
This makes for nicer handling of errors compared to checking whether a
RefPtr is null. Additionally, this will give way to return different
types of errors in the future.
First off: unregister the region from MemoryManager before unmapping it.
The order of operations here was a bit strange, presumably to avoid a
situation where a fault would happen while unmapping, and the fault
handler would find the MemoryManager region list in an invalid state.
Unregistering it before unmapping sidesteps the whole problem, and
allows us to easily fix another problem: a deadlock could occur due
to inconsistent acquisition order (PageDirectory must come before MM.)
We don't want to be holding the MM lock if it's a user region and we
have to consult the page directory, since that can lead to a deadlock
if we don't already have the page directory lock.
It may happen that CPU A manages to page in from the same inode
while we're just entering the same page fault handler on CPU B.
Handle it gracefully by checking if the data has already been paged in
(instead of VERIFY'ing that it hasn't) and then remap the page if that's
the case.
Taking a reference or a pointer to a value that's not aligned properly
is undefined behavior. While `[[gnu::packed]]` ensures that reads from
and writes to fields of packed structs is a safe operation, the
information about the reduced alignment is lost when creating pointers
to these values.
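One common way to avoid the problem is to copy the field into a properly aligned local before handing out a pointer. The sketch below uses hypothetical names and assumes GCC or Clang for `[[gnu::packed]]`.

```cpp
#include <cassert>
#include <cstdint>

struct [[gnu::packed]] SketchDescriptor {
    uint8_t type;
    uint32_t base; // At offset 1, so misaligned for a uint32_t.
};
static_assert(sizeof(SketchDescriptor) == 5);

// Takes an ordinary pointer and may therefore assume natural alignment.
inline uint32_t load_plus_one(uint32_t const* value)
{
    return *value + 1;
}

inline uint32_t read_base_plus_one(SketchDescriptor const& descriptor)
{
    // Calling load_plus_one(&descriptor.base) would create a misaligned
    // pointer. Reading the field by value is fine, because the compiler
    // knows the struct is packed at this point.
    uint32_t base = descriptor.base;
    return load_plus_one(&base);
}
```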
Weirdly enough, GCC's undefined behavior sanitizer doesn't flag these,
even though the doc of `-Waddress-of-packed-member` says that it usually
leads to UB. In contrast, x86_64 Clang does flag these, which renders
the 64-bit kernel unable to boot.
For now, the `address-of-packed-member` warning will only be enabled in
the kernel, as it is absolutely crucial there because of KUBSAN, but
might get excessively noisy for the userland in the future.
Also note that we can't append to `CMAKE_CXX_FLAGS` like we do for other
flags in the kernel, because flags added via `add_compile_options` come
after these, so the `-Wno-address-of-packed-member` in the root would
cancel it out.
When booting APs, we identity map a region at 0x8000 while doing the
initial bringup sequence. This is the only thing in the kernel that
requires an identity mapping, yet we had a bunch of generic APIs and a
dedicated VirtualRangeAllocator in every PageDirectory for this purpose.
This patch simplifies the situation by moving the identity mapping logic
to the AP boot code and removing the generic APIs.
...and also RangeAllocator => VirtualRangeAllocator.
This clarifies that the ranges we're dealing with are *virtual* memory
ranges and not anything else.