This patch reduces the number of code paths that lead to the allocation
of a Region object. It's quite hard to follow the various ways in which
this can happen, so this is an effort to simplify.
The kernel sampling profiler will walk thread stacks during the timer
tick handler. Since it's not safe to trigger page faults during IRQ's,
we now avoid this by checking the page tables manually before accessing
each stack location.
This patch adds a globally shared zero-filled PhysicalPage that will
be mapped into every slot of every zero-filled AnonymousVMObject until
that page is written to, achieving CoW-like zero-filled pages.
Initial testing show that this doesn't actually achieve any sharing yet
but it seems like a good design regardless, since it may reduce the
number of page faults taken by programs.
If you look at the refcount of MM.shared_zero_page() it will have quite
a high refcount, but that's just because everything maps it everywhere.
If you want to see the "real" refcount, you can build with the
MAP_SHARED_ZERO_PAGE_LAZILY flag, and we'll defer mapping of the shared
zero page until the first NP read fault.
I've left this behavior behind a flag for future testing of this code.
We don't need to have this method anymore. It was a hack that was used
in many components in the system but currently we use better methods to
create virtual memory mappings. To prevent any further use of this
method it's best to just remove it completely.
Also, the APIC code is disabled for now since it doesn't help booting
the system, and is broken since it relies on identity mapping to exist
in the first 1MB. Any call to the APIC code will result in assertion
failed.
In addition to that, the name of the method which is responsible to
create an identity mapping between 1MB to 2MB was changed, to be more
precise about its purpose.
Instead of restoring CR3 to the current process's paging scope when a
ProcessPagingScope goes out of scope, we now restore exactly whatever
the CR3 value was when we created the ProcessPagingScope.
This fixes breakage in situations where a process ends up with nested
ProcessPagingScopes. This was making profiling very fragile, and with
this change it's now possible to profile g++! :^)
This will panic the kernel immediately if these functions are misused
so we can catch it and fix the misuse.
This patch fixes a couple of misuses:
- create_signal_trampolines() writes to a user-accessible page
above the 3GB address mark. We should really get rid of this
page but that's a whole other thing.
- CoW faults need to use copy_from_user rather than copy_to_user
since it's the *source* pointer that points to user memory.
- Inode faults need to use memcpy rather than copy_to_user since
we're copying a kernel stack buffer into a quickmapped page.
This should make the copy_to/from_user() functions slightly less useful
for exploitation. Before this, they were essentially just glorified
memcpy() with SMAP disabled. :^)
Technically the bottom 2MB is still identity-mapped for the kernel and
not made available to userspace at all, but for simplicity's sake we
can just ignore that and make "address < 0xc0000000" the canonical
check for user/kernel.
As suggested by Joshua, this commit adds the 2-clause BSD license as a
comment block to the top of every source file.
For the first pass, I've just added myself for simplicity. I encourage
everyone to add themselves as copyright holders of any file they've
added or modified in some significant way. If I've added myself in
error somewhere, feel free to replace it with the appropriate copyright
holder instead.
Going forward, all new source files should include a license header.
We now use the regular "user" physical pages for on-demand page table
allocations. This was by far the biggest source of super physical page
exhaustion, so that bug should be a thing of the past now. :^)
We still have super pages, but they are barely used. They remain useful
for code that requires memory with a low physical address.
Fixes#1000.
After MemoryManager initialization, we now only leave the lowest 1MB
of memory identity-mapped. The very first (null) page is not present.
All other pages are RW but not X. Supervisor only.
The kernel and its static data structures are no longer identity-mapped
in the bottom 8MB of the address space, but instead move above 3GB.
The first 8MB above 3GB are pseudo-identity-mapped to the bottom 8MB of
the physical address space. But things don't have to stay this way!
Thanks to Jesse who made an earlier attempt at this, it was really easy
to get device drivers working once the page tables were in place! :^)
Fixes#734.
We now can create a cacheable Region, so when map() is called, if a
Region is cacheable then all the virtual memory space being allocated
to it will be marked as not cache disabled.
In addition to that, OS components can create a Region that will be
mapped to a specific physical address by using the appropriate helper
method.
We now validate the full range of userspace memory passed into syscalls
instead of just checking that the first and last byte of the memory are
in process-owned regions.
This fixes an issue where it was possible to avoid rejection of invalid
addresses that sat between two valid ones, simply by passing a valid
address and a size large enough to put the end of the range at another
valid address.
I added a little test utility that tries to provoke EFAULT in various
ways to help verify this. I'm sure we can think of more ways to test
this but it's at least a start. :^)
Thanks to mozjag for pointing out that this code was still lacking!
Incidentally this also makes backtraces work again.
Fixes#989.
We now refuse to boot on machines that don't support PAE since all
of our paging code depends on it.
Also let's only enable SSE and PGE support if the CPU advertises it.
Previously we assumed all hosts would have support for IA32_EFER.NXE.
This is mostly true for newer hardware, but older hardware will crash
and burn if you try to use this feature.
Now we check for support via CPUID.80000001[20].
Introduce one more (CPU) indirection layer in the paging code: the page
directory pointer table (PDPT). Each PageDirectory now has 4 separate
PageDirectoryEntry arrays, governing 1 GB of VM each.
A really neat side-effect of this is that we can now share the physical
page containing the >=3GB kernel-only address space metadata between
all processes, instead of lazily cloning it on page faults.
This will give us access to the NX (No eXecute) bit, allowing us to
prevent execution of memory that's not supposed to be executed.
To enforce this, we create two separate mappings of the same underlying
physical page. A writable mapping for the kernel, and a read-only one
for userspace (the one returned by sys$get_kernel_info_page.)
The kernel is now no longer identity mapped to the bottom 8MiB of
memory, and is now mapped at the higher address of `0xc0000000`.
The lower ~1MiB of memory (from GRUB's mmap), however is still
identity mapped to provide an easy way for the kernel to get
physical pages for things such as DMA etc. These could later be
mapped to the higher address too, as I'm not too sure how to
go about doing this elegantly without a lot of address subtractions.
VM regions can now be marked as stack regions, which is then validated
on syscall, and on page fault.
If a thread is caught with its stack pointer pointing into anything
that's *not* a Region with its stack bit set, we'll crash the whole
process with SIGSTKFLT.
Userspace must now allocate custom stacks by using mmap() with the new
MAP_STACK flag. This mechanism was first introduced in OpenBSD, and now
we have it too, yay! :^)
Now the userspace page allocator will search through physical regions,
and stop the search as it finds an available page.
Also remove an "address of" sign since we don't need that when
counting size of physical regions
After the page fault handler has found the region in which the fault
occurred, do the rest of the work in the region itself.
This patch also makes all fault types consistently crash the process
if a new page is needed but we're all out of pages.
We were always returning the full VM range of the partially-unmapped
Region to the range allocator. This caused us to re-use those addresses
for subsequent VM allocations.
This patch also skips creating a new VMObject in partial munmap().
Instead we just make split regions that point into the same VMObject.
This fixes the mysterious GCC ICE on large C++ programs.
This simplifies the ownership model and makes Region easier to reason
about. Userspace Regions are now primarily kept by Process::m_regions.
Kernel Regions are kept in various OwnPtr<Regions>'s.
Regions now only ever get unmapped when they are destroyed.
This was a workaround to be able to build on case-insensitive file
systems where it might get confused about <string.h> vs <String.h>.
Let's just not support building that way, so String.h can have an
objectively nicer name. :^)