This patch adds Space, a class representing a process's address space.
- Each Process has a Space.
- The Space owns the PageDirectory and all Regions in the Process.
This allows us to reorganize sys$execve() so that it constructs and
populates a new Space fully before committing to it.
Previously, we would construct the new address space while still
running in the old one, and encountering an error meant we had to do
tedious and error-prone rollback.
Those problems are now gone, replaced by what's hopefully a set of much
smaller problems and missing cleanups. :^)
This patch adds sys$msyscall() which is loosely based on an OpenBSD
mechanism for preventing syscalls from non-blessed memory regions.
It works similarly to pledge and unveil: you can call it as many
times as you like, and when you're finished, you call it with a null
pointer, after which it stops accepting new regions.
If a syscall later happens and doesn't originate from one of the
previously blessed regions, the kernel will simply crash the process.
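For illustration, the intended call pattern looks roughly like this
(a sketch only; the `int msyscall(void*)` prototype and the region
pointers are assumptions, not part of this patch):

#include <sys/mman.h> // assumed home of the msyscall() prototype

// Bless each region that's allowed to issue syscalls, then seal.
void bless_syscall_regions(void* program_text, void* libc_text)
{
    msyscall(program_text); // may be called any number of times
    msyscall(libc_text);
    msyscall(nullptr);      // done: no new regions accepted after this
}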
The random address proposals were not checked against the allocation
size, so it became increasingly likely that we'd try to allocate
outside the available space as sizes grew larger.
Now they will be ignored instead of triggering a Kernel assertion
failure.
This is a continuation of: c8e7baf4b8
This patch adds enforcement of two new rules:
- Memory that was previously writable cannot become executable
- Memory that was previously executable cannot become writable
Unfortunately we have to make an exception for text relocations in the
dynamic loader. Since those necessitate writing into a private copy
of library code, we allow programs to transition from RW to RX under
very specific conditions. See the implementation of sys$mprotect()'s
should_make_executable_exception_for_dynamic_loader() for details.
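Here's a hedged userspace sketch of the first rule (the exact errno
is an assumption):

#include <sys/mman.h>

// Once memory has been writable, making it executable is refused.
bool wx_transition_is_refused()
{
    void* p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                   MAP_ANON | MAP_PRIVATE, -1, 0);
    int rc = mprotect(p, 4096, PROT_READ | PROT_EXEC);
    return rc < 0; // expected to fail under the new enforcement
}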
This can be used to request random VM placement instead of the highly
predictable regular mmap(nullptr, ...) VM allocation strategy.
It will soon be used to implement ASLR in the dynamic loader. :^)
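Presumably it will be used along these lines (a sketch; the flag name
MAP_RANDOMIZED is an assumption based on context):

#include <cstddef>
#include <sys/mman.h>

// Ask for randomized placement instead of the predictable default.
void* allocate_randomized(size_t size)
{
    return mmap(nullptr, size, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_PRIVATE | MAP_RANDOMIZED, -1, 0);
}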
We need to make sure other processors can grab the MM lock while we
wait, so release it when we might block. Reading the page from
disk may also block, so release it during that time as well.
This eliminates the window between calling Processor::current and
the member function where a thread could be moved to another
processor. This is generally not as big of a concern as with
Processor::current_thread, but it's also slightly more lightweight.
If we find ourselves with a user-accessible, non-shared Region backed by
a SharedInodeVMObject, that's pretty bad news, so let's just panic the
kernel instead of getting abused.
There might be a better place for this kind of check, so I've added a
FIXME about putting more thought into that.
This was exploitable since the shared flag determines whether inode
permission checks are applied in sys$mprotect().
The bug was pretty hard to spot due to default arguments being used
instead. This patch removes the default arguments to make explicit
at each call site what's being done.
This was done with the help of several scripts; I dump them here to
easily find them later:
awk '/#ifdef/ { print "#cmakedefine01 "$2 }' AK/Debug.h.in
for debug_macro in $(awk '/#ifdef/ { print $2 }' AK/Debug.h.in)
do
find . \( -name '*.cpp' -o -name '*.h' -o -name '*.in' \) -not -path './Toolchain/*' -not -path './Build/*' -exec sed -i -E 's/#ifdef '$debug_macro'/#if '$debug_macro'/' {} \;
done
# Remember to remove WRAPPER_GENERATOR_DEBUG from the list.
awk '/#cmake/ { print "set("$2" ON)" }' AK/Debug.h.in
The kernel ignored the first 8 MiB of RAM while parsing the memory map
because the kmalloc heaps and the super physical pages lived here. Move
all that stuff inside the .bss segment so that those memory regions are
accounted for, otherwise we risk overwriting boot modules placed next
to the kernel.
This adds support for FUTEX_WAKE_OP, FUTEX_WAIT_BITSET, FUTEX_WAKE_BITSET,
FUTEX_REQUEUE, and FUTEX_CMP_REQUEUE, as well as global and private
futexes and absolute/relative timeouts against the appropriate clock. This
also changes the implementation so that kernel resources are only used when
a thread is blocked on a futex.
Global futexes are implemented as offsets in VMObjects, so that different
processes can share a futex against the same VMObject despite potentially
being mapped at different virtual addresses.
Problem:
- Many constructors are defined as `{}` rather than using the
`= default` compiler-provided constructor.
- Some types provide an implicit conversion operator from `nullptr_t`
instead of requiring the caller to default construct. This violates
the C++ Core Guidelines suggestion to declare single-argument
constructors explicit
(https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c46-by-default-declare-single-argument-constructors-explicit).
Solution:
- Change default constructors to use the compiler-provided default
constructor.
- Remove implicit conversion operators from `nullptr_t` and change
usage to enforce type consistency without conversion.
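A minimal sketch of the kind of change involved (hypothetical type):

#include <cstddef>

// Before: hand-written empty constructor, implicit nullptr_t
// conversion.
struct Handle {
    Handle() {}
    Handle(std::nullptr_t) {}
};

// After: compiler-provided default constructor; callers
// default-construct instead of passing nullptr.
struct BetterHandle {
    BetterHandle() = default;
};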
These changes are arbitrarily divided into multiple commits to make it
easier to find potentially introduced bugs with git bisect.
The modifications in this commit were automatically made using the
following command:
find . -name '*.cpp' -exec sed -i -E 's/dbg\(\) << ("[^"{]*");/dbgln\(\1\);/' {} \;
If a TLB flush request is broadcast to other processors and the addresses
to flush are user mode addresses, we can ignore such a request on the
target processor if the page directory currently in use doesn't match
the addresses to be flushed. We still need to broadcast to all processors
in that case because the other processors may switch to that same page
directory at any time.
If we remap pages (e.g. lazy allocation) inside a VMObject that is
shared among more than one region, broadcast it to any other region
that may be mapping the same page.
Lazily committed shared memory was not working in situations where one
process would write to the memory and another would only read from it.
Since the reading process would never cause a write fault in the shared
region, we'd never notice that the writing process had added real
physical pages to the VMObject. This happened because the lazily
committed pages were marked "present" in the page table.
This patch solves the issue by always allocating shared memory up front
and not trying to be clever about it.
Before this change, we would sometimes map a region into the address
space with !is_shared(), and then moments later call set_shared(true).
I found this very confusing while debugging, so this patch makes us pass
the initial shared flag to the Region constructor, ensuring that it's in
the correct state by the time we first map the region.
By designating a committed page pool we can guarantee to have physical
pages available for lazy allocation in mappings. However, when forking
we will overcommit. The assumption is that worst-case it's better for
the fork to die due to insufficient physical memory on COW access than
the parent that created the region. If a fork wants to ensure that all
memory is available (trigger a commit) then it can use madvise.
This also means that fork can now gracefully fail if we don't have
enough physical pages available.
Rather than lazily committing regions by default, we now commit
the entire region unless MAP_NORESERVE is specified.
This solves random crashes in low-memory situations where e.g. the
malloc heap had allocated memory, but touching previously untouched
pages would trigger a crash once no more physical memory was
available.
Use this flag to create large regions without actually committing
the backing memory. madvise() can be used to commit arbitrary areas
of such regions after creating them.
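A hedged sketch of the usage described above (the madvise() flag name
is an assumption; this message only says madvise() can commit areas):

#include <cstddef>
#include <sys/mman.h>

// Reserve a large region without committing backing memory, then
// commit just the first page before touching it.
void* reserve_big_region(size_t size)
{
    void* base = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_ANON | MAP_PRIVATE | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return nullptr;
    madvise(base, 4096, MADV_SET_NONVOLATILE); // flag name assumed
    return base;
}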
This adds the ability for a Region to define volatile/nonvolatile
areas within mapped memory using madvise(). This also means that
memory purging takes into account all views of the PurgeableVMObject
and only purges memory that is not needed by all of them. When calling
madvise() to change an area to nonvolatile memory, return whether
memory from that area was purged. At that time also try to remap
all memory that is requested to be nonvolatile, and if insufficient
pages are available notify the caller of that fact.
This assertion cannot be safely/reliably made in the
~SharedInodeVMObject destructor. The problem is that
Inode::is_shared_vmobject holds a weak reference to the instance
that is being destroyed (ref count 0). Checking the pointer using
WeakPtr::unsafe_ptr will produce nullptr depending on timing in
this case, and WeakPtr::safe_ref will reliably produce a nullptr
as soon as the reference count drops to 0. The only case where
this assertion could succeed is when WeakPtr::unsafe_ptr returned
the pointer because it won the race against revoking it. And
because WeakPtr::safe_ref will always return a nullptr, we cannot
reliably assert this from the ~SharedInodeVMObject destructor.
Fixes #4621
When doing the cast to u64 on the page directory physical address,
the sign bit was being extended. This only becomes an issue when
crossing the 2 GiB boundary. At >= 2 GiB, the physical address
has the sign bit set. For example, 0x80000000.
This set all the reserved bits in the PDPTE, causing a GPF
when loading the PDPT pointer into CR3. The reserved bits are
presumably there to stop you writing out a physical address that
the CPU physically cannot handle, as the size of the reserved bits
is determined by the physical address width of the CPU.
This patch fixes the issue by casting to FlatPtr instead. I believe
the sign extension only happens when casting to a bigger type. I'm
also using
FlatPtr because it's a pointer we're writing into the PDPTE.
sizeof(FlatPtr) will always be the same size as sizeof(void*).
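Here's a standalone illustration of the hazard (not the kernel's
actual code):

#include <cstdint>
#include <cstdio>

int main()
{
    uint32_t physical = 0x80000000u; // an address at the 2 GiB mark
    // Going through a signed 32-bit type sign-extends the top bit:
    uint64_t bad = (uint64_t)(int32_t)physical;  // 0xffffffff80000000
    // Casting the unsigned value zero-extends, as intended:
    uint64_t good = (uint64_t)physical;          // 0x0000000080000000
    printf("bad=%#llx good=%#llx\n",
           (unsigned long long)bad, (unsigned long long)good);
    return 0;
}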
This also now asserts that the physical address in the PDPTE is
within the max physical address the CPU supports. This is better
than getting a GPF, because CPU::handle_crash tries to do the same
operation that caused the GPF in the first place. That would cause
an infinite loop of GPFs until the stack was exhausted, causing a
triple fault.
As far as I know and tested, I believe we can now use the full 32-bit
physical range without crashing.
Fixes #4584. See that issue for the full debugging story.
Anything at or above the 2 GiB mark has the leftmost bit set
(0x8000...), which was falsely interpreted as negative due to
local_offset being signed.
This makes it unsigned by using FlatPtr. To check for underflow as
was intended, let's use Checked instead.
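Roughly along these lines (a sketch using AK::Checked; the
surrounding names are made up for illustration):

#include <AK/Checked.h>
#include <AK/Types.h>

// Compute an offset into a region, detecting underflow instead of
// silently wrapping around.
bool offset_into_region(FlatPtr vaddr, FlatPtr region_base,
                        FlatPtr& out_offset)
{
    Checked<FlatPtr> local_offset = vaddr;
    local_offset -= region_base;
    if (local_offset.has_overflow())
        return false; // would have underflowed
    out_offset = local_offset.value();
    return true;
}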
Fixes #4585
Problem:
- `(void)` simply casts the expression to void. This is understood to
indicate that it is ignored, but this is really a compiler trick to
get the compiler to not generate a warning.
Solution:
- Use the `[[maybe_unused]]` attribute to indicate the value is unused.
Note:
- Functions taking a `(void)` argument list have also been changed to
`()` because this is not needed and shows up in the same grep
command.
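A minimal before/after sketch:

// Before: a cast whose only job is to silence the warning.
void old_style(int rc)
{
    (void)rc;
}

// After: the attribute states the intent directly.
void new_style([[maybe_unused]] int rc)
{
}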
Fix some problems with join blocks where the joining thread block
condition was added twice, which led to a crash when trying to
unblock that condition a second time.
Deferred block condition evaluation by File objects was also not
properly keeping the File object alive, which led to some random
crashes and corruption problems.
Other problems were caused by the fact that the Queued state didn't
handle signals/interruptions consistently. To solve these issues we
remove this state entirely, along with Thread::wait_on and change
the WaitQueue into a BlockCondition instead.
Also, deliver signals even if there isn't going to be a context switch
to another thread.
Fixes #4336 and #4330
This changes the Thread::wait_on function to not enable interrupts
upon leaving, which caused some problems with page fault handlers
and in other situations. It may now be called from critical
sections, with interrupts enabled or disabled, and returns to the
same state.
This also requires some fixes to Lock. To aid debugging, a new
define LOCK_DEBUG is added that enables checking for Lock leaks
upon finalization of a Thread.
This makes most operations thread safe, especially so that they
can safely be used in the Kernel. This includes obtaining a strong
reference from a weak reference, which now requires an explicit
call to WeakPtr::strong_ref(). Another major change is that
Weakable::make_weak_ref() may require the explicit target type.
Previously we used reinterpret_cast in WeakPtr, assuming that it
can be properly converted. But WeakPtr does not necessarily have
the knowledge to be able to do this. Instead, we now ask the class
itself to deliver a WeakPtr to the type that we want.
Also, WeakLink is no longer specific to a target type. The reason
for this is that we want to be able to safely convert e.g. WeakPtr<T>
to WeakPtr<U>, and before this we just reinterpret_cast the internal
WeakLink<T> to WeakLink<U>, which is a bold assumption that it would
actually produce the correct code. Instead, WeakLink now operates
on just a raw pointer and we only make those constructors/operators
available if we can verify that it can be safely cast.
In order to guarantee thread safety, we now use the least significant
bit in the pointer for locking purposes. This also means that only
properly aligned pointers can be used.
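As a usage sketch (names taken from this message; exact signatures
are assumed, and Thread stands in for any Weakable class):

void try_use(WeakPtr<Thread>& weak_thread)
{
    // Obtaining a strong reference is now an explicit, safe step:
    if (auto strong_thread = weak_thread.strong_ref()) {
        // strong_thread keeps the Thread alive in this scope, even
        // if another processor drops the last other reference.
    }
}

// Creating the weak reference may need an explicit target type:
// auto weak_thread = thread->make_weak_ref<Thread>();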
Notice that we ensured that the size is a multiple of the page size
and that there is at least one page there; otherwise, this change
would be invalid.
We create an empty region and then expand it:
// First iteration.
m_user_physical_regions.append(PhysicalRegion::create(addr, addr));
// Following iterations.
region->expand(region->lower(), addr);
So if the memory region only has one page, we would end up with an empty
region. Thus we need to do one more iteration.
There are plenty of places in the kernel that aren't
checking if they actually got their allocation.
This fixes some of them, but definitely not all.
Fixes #3390 and #3391
Also, let's make find_one_free_page() return nullptr
if it doesn't get a free index. This stops the kernel
crashing when out of memory and allows memory purging
to take place again.
Fixes #3487
Since the CPU already does almost all necessary validation steps
for us, we don't really need to attempt to do this. Doing it
ourselves doesn't really work very reliably, because we'd have to
account for other processors modifying virtual memory, and we'd
have to account for e.g. pages not being able to be allocated
due to insufficient resources.
So change the copy_to/from_user (and associated helper functions)
to use the new safe_memcpy, which will return whether it succeeded
or not. The only manual validation step needed (which the CPU
can't perform for us) is making sure the pointers provided by user
mode aren't pointing to kernel mappings.
To make it easier to read/write from/to either kernel or user mode
data add the UserOrKernelBuffer helper class, which will internally
either use copy_from/to_user or directly memcpy, or pass the data
through directly using a temporary buffer on the stack.
Last but not least we need to keep syscall params trivial as we
need to copy them from/to user mode using copy_from/to_user.
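The resulting call-site pattern looks roughly like this (a sketch;
the syscall and its params struct are hypothetical):

int sys_foo(const Syscall::SC_foo_params* user_params)
{
    Syscall::SC_foo_params params; // must stay trivially copyable
    if (!copy_from_user(&params, user_params))
        return -EFAULT; // the copy faulted; no manual checks needed
    // ... act on params ...
    return 0;
}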
I decided to modify MappedROM.h because all other entries in
Forward.h are also classes, and this is visually more pleasing.
Other than that, it just doesn't make any difference which way we resolve
the conflicts.
Rather than trying to find a contiguous set of bits of size 1, just
find one single available bit using a hint.
Also, try to randomize returned physical pages a bit by placing them
into a 256 entry queue rather than making them available immediately.
Then, once the queue is filled, pick a random one, make it available
again and use that slot for the latest page to be returned.
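A standalone sketch of the recycling-queue idea (illustrative only,
not the kernel's actual code):

#include <cstddef>
#include <cstdint>
#include <cstdlib>

static constexpr size_t queue_size = 256;
static uintptr_t recently_freed[queue_size];
static size_t queued = 0;

// Returns 0 while the queue is still filling; afterwards returns a
// randomly chosen previously-freed page and queues the newest one.
uintptr_t defer_free(uintptr_t page)
{
    if (queued < queue_size) {
        recently_freed[queued++] = page;
        return 0;
    }
    size_t i = (size_t)rand() % queue_size;
    uintptr_t released = recently_freed[i];
    recently_freed[i] = page;
    return released;
}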
Sometimes a physical underlying page may be there, but we may be
unable to allocate a page table that may be needed to map it. Bubble
up such mapping errors so that they can be handled more appropriately.
If allocating a page table triggers purging memory, we need to call
quickmap_pd again to make sure the underlying physical page is
remapped to the correct one. This is needed because purging itself
may trigger calls to ensure_pte as well.
Fixes #3370
When cloning a purgeable memory region (which happens on fork),
we need to preserve the "was purged" and "volatile" state of the
original region, or they will always appear as non-volatile and
unpurged regions in the child process.
Fixes #3374.
Add an ExpandableHeap and switch kmalloc to use it, which allows
for the kmalloc heap to grow as needed.
In order to make heap expansion work, we keep around a 1 MiB backup
memory region, because creating a region would require space in the
same heap. This means the heap will grow as soon as the reported
utilization is less than 1 MiB. It will also return memory if an entire
subheap is no longer needed, although that is rarely possible.
MemoryManager cannot use the Singleton class because
MemoryManager::initialize is called before the global constructors
are run. That caused the Singleton to be re-initialized, causing
it to create another MemoryManager instance.
Fixes #3226
Rather than hardcoding where the kmalloc pool should be, place
it at the end of the kernel image instead. This avoids corrupting
global variables or other parts of the kernel as it grows.
Fixes #3257
The SI prefixes "k", "M", "G" mean "10^3", "10^6", "10^9".
The IEC prefixes "Ki", "Mi", "Gi" mean "2^10", "2^20", "2^30".
Let's use the correct name, at least in code.
Only changes the name of the constants, no other behavior change.
This is something I've been meaning to do for a long time, and here we
finally go. This patch moves all sys$foo functions out of Process.cpp
and into files in Kernel/Syscalls/.
It's not exactly one syscall per file (although it could be, but I got
a bit tired of the repetitive work here..)
This makes hacking on individual syscalls a lot less painful since you
don't have to rebuild nearly as much code every time. I'm also hopeful
that this makes it easier to understand individual syscalls. :^)
MemoryManager::quickmap_pd and MemoryManager::quickmap_pt can only
be called by one processor at a time anyway, since anything using
these must have the MM lock held. So, no need to inform the other
CPUs to flush their TLBs, we can just flush our own.
We can now properly initialize all processors without
crashing by sending SMP IPI messages to synchronize memory
between processors.
We now initialize the APs once we have the scheduler running.
This is so that we can process IPI messages from the other
cores.
Also rework interrupt handling a bit so that it's more of a
1:1 mapping. We need to allocate non-sharable interrupts for
IPIs.
This also fixes the occasional hang/crash because all
CPUs now synchronize memory with each other.
When delivering urgent signals to the current thread
we need to check if we should be unblocked, and if not
we need to yield to another process.
We also need to make sure that we suppress context switches
during Process::exec() so that we don't clobber the registers
that it sets up (eip mainly) by a context switch. To be able
to do that we add the concept of a critical section, which is
similar to Process::m_in_irq but different in that it can be
requested at any time. Calls to Scheduler::yield and
Scheduler::donate_to will return instantly without triggering
a context switch, but the processor will then asynchronously
trigger a context switch once the critical section is left.
- If rdseed is not available, fall back to rdrand.
- If rdrand is not available, block for entropy, or use an insecure
PRNG, depending on whether the user wants fast or good randomness.
Add a MappedROM::find_chunk_starting_with() helper since that's a very
common usage pattern in clients of this code.
Also convert MultiProcessorParser from a persistent singleton object
to a temporary object constructed via a failable factory function.
This patch adds a MappedROM abstraction to the Kernel VM subsystem.
It's basically the read-only byte buffer equivalent of a TypedMapping.
We use this in the ACPI and MP table parsers to scan for interesting
stuff in low memory instead of doing a bunch of address arithmetic.
This was supposed to be the foundation for some kind of pre-kernel
environment, but nobody is working on it right now, so let's move
everything back into the kernel and remove all the confusion.
Ultimately we should not panic just because we can't fully commit a VM
region (by populating it with physical pages.)
This patch handles some of the situations where commit() can fail.
This caused us to report one purged page per occurrence of the shared
zero page in a purgeable memory region, despite it being a no-op.
Thanks to Sergey for spotting the bad assertion removal that led to
this being found!
This patch adds PageFaultResponse::OutOfMemory which informs the fault
handler that we were unable to allocate a necessary physical page and
cannot continue.
In response to this, the kernel will crash the current process. Because
we are OOM, we can't symbolicate the crash like we normally would
(since the ELF symbolication code needs to allocate), so we also
communicate to Process::crash() that we're out of memory.
Now we can survive "allocate 300 MB" (only the allocate process dies.)
This is definitely not perfect and can easily end up killing a random
innocent other process who happened to allocate one page at the wrong
time, but it's a *lot* better than panicking on OOM. :^)
This function has a lot of callers that don't bother checking if it
returns successfully or not. We'll need to handle failure in a bunch
of places and then we can remove this assertion.
If we OOM during a CoW fault and fail to allocate a new page for the
writing process, just leave the original VMObject alone so everyone
else can keep using it.
Since a Region is basically a view into a potentially larger VMObject,
it was always necessary to include the Region starting offset when
accessing its underlying physical pages.
Until now, you had to do that manually, but this patch adds a simple
Region::physical_page() for read-only access and a physical_page_slot()
when you want a mutable reference to the RefPtr<PhysicalPage> itself.
A lot of code is simplified by making use of this.
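Hypothetical call sites, going by the description above:

void example(Region& region, size_t page_index)
{
    // Read-only access; the Region's starting offset within its
    // VMObject is applied internally:
    auto page = region.physical_page(page_index);

    // Mutable access to the RefPtr<PhysicalPage> slot itself:
    auto& slot = region.physical_page_slot(page_index);
}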
This memory range was set up using 2MB pages by the code in boot.S.
Because of that, the kernel image protection code didn't work, since it
assumed 4KB pages.
We now switch to 4KB pages during MemoryManager initialization. This
makes the kernel image protection code work correctly again. :^)
PT_POKE writes a single word to the tracee's address space.
Some caveats:
- If the user requests to write to an address in a read-only region, we
temporarily change the page's protections to allow it.
- If the user requests to write to a region that's backed by a
SharedInodeVMObject, we replace the vmobject with a PrivateInodeVMObject.
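A hedged sketch of a tracer using it (Serenity-style ptrace
prototype assumed):

#include <sys/ptrace.h>
#include <sys/types.h>

// Write a single word into the tracee's address space.
int poke_tracee(pid_t tracee, void* address, int value)
{
    return ptrace(PT_POKE, tracee, address, value);
}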
This patch adds the minherit() syscall originally invented by OpenBSD.
Only the MAP_INHERIT_ZERO mode is supported for now. If set on an mmap
region, that region will be zeroed out on fork().
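Presumably used like this (a sketch; OpenBSD-style prototype
assumed):

#include <cstddef>
#include <sys/mman.h>

// Mark a region so a forked child sees zeroes instead of a copy.
void* make_fork_zeroed_region(size_t size)
{
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_ANON | MAP_PRIVATE, -1, 0);
    if (p != MAP_FAILED)
        minherit(p, size, MAP_INHERIT_ZERO);
    return p;
}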
There was a frequently occurring pattern of "map this physical address
into kernel VM, then read from it, then unmap it again".
This new typed_map() encapsulates that logic by giving you back a
typed pointer to the kind of structure you're interested in accessing.
It returns a TypedMapping<T> that can be used mostly like a pointer.
When destroyed, the TypedMapping object will unmap the memory. :^)
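A hypothetical call site in the spirit of the description (the
struct and helper names are made up):

void read_firmware_table(PhysicalAddress paddr)
{
    // typed_map() hands back a TypedMapping<T> usable like a T*:
    auto table = typed_map<SomeFirmwareTable>(paddr);
    auto count = table->entry_count; // reads through the mapping
    // ... use count ...
} // the TypedMapping unmaps the memory here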
This was caught by running all crash tests with "crash -A".
Basically, non-readable pages need to not be mapped *at all* so that
a "page not present" exception is provoked on access.
Unfortunately x86 does not support write-only mappings, so this is
the best we can do.
Fixes #1336.
Also, duplicated data in dbg() and klog() calls was removed.
In addition, leaking virtual addresses into the kernel log is now
prevented, by turning the kprintf() calls that leaked them into
dbg() calls. Other kprintf() calls were replaced with klog().
This patch reduces the number of code paths that lead to the allocation
of a Region object. It's quite hard to follow the various ways in which
this can happen, so this is an effort to simplify.
It's now up to the caller to provide a VMObject when constructing a new
Region object. This will make it easier to handle things going wrong,
like allocation failures, etc.
When forking a process, we now turn all of the private inode-backed
mmap() regions into copy-on-write regions in both the parent and child.
This patch also removes an assertion that becomes irrelevant.
We don't have to log the process name/PID/TID, dbg() automatically adds
that as a prefix to every line.
Also we don't have to do .characters() on Strings passed to dbg() :^)
We now have PrivateInodeVMObject and SharedInodeVMObject, corresponding
to MAP_PRIVATE and MAP_SHARED respectively.
Note that PrivateInodeVMObject is not used yet.
The kernel sampling profiler will walk thread stacks during the timer
tick handler. Since it's not safe to trigger page faults during IRQ's,
we now avoid this by checking the page tables manually before accessing
each stack location.
We're not equipped to deal with page faults during an IRQ handler,
so add an assertion so we can immediately tell what's wrong.
This is why profiling sometimes hangs the system -- walking the stack
of the profiled thread causes a page fault and things fall apart.
Anonymous VM objects should never have null entries in their physical
page list. Instead, "empty" or untouched pages should refer to the
shared zero page.
Fixes #1237.
A 16-bit refcount is just begging for trouble right now.
A 32-bit refcount will be begging for trouble later down the line,
so we'll have to revisit this eventually. :^)
This patch adds a globally shared zero-filled PhysicalPage that will
be mapped into every slot of every zero-filled AnonymousVMObject until
that page is written to, achieving CoW-like zero-filled pages.
Initial testing shows that this doesn't actually achieve any sharing yet
but it seems like a good design regardless, since it may reduce the
number of page faults taken by programs.
If you look at the refcount of MM.shared_zero_page() it will have quite
a high refcount, but that's just because everything maps it everywhere.
If you want to see the "real" refcount, you can build with the
MAP_SHARED_ZERO_PAGE_LAZILY flag, and we'll defer mapping of the shared
zero page until the first NP read fault.
I've left this behavior behind a flag for future testing of this code.
This was only used by HashTable::dump() which I used when doing the
first HashTable implementation. Removing this allows us to also remove
most includes of <AK/kstdio.h>.
It doesn't look healthy to create raw references into an array before
a temporary unlock. In fact, that temporary unlock looks generally
unhealthy, but it's a different problem.
Previously we were only checking that each of the virtual pages in the
specified range were valid.
This made it possible to pass in negative buffer sizes to some syscalls
as long as (address) and (address+size) were on the same page.
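A concrete illustration of the hole, using 32-bit arithmetic:

#include <cstdint>

// With a "negative" size, address + size wraps back onto the very
// same (valid) page, so a page-validity-only check passes:
constexpr uint32_t address = 0x10000800u;
constexpr uint32_t size = (uint32_t)-2048; // 0xfffff800
constexpr uint32_t end = address + size;   // wraps to 0x10000000
static_assert((address & ~0xfffu) == (end & ~0xfffu),
              "both endpoints are on one page despite the bogus size");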
Previously it was not possible for this function to fail. You could
exploit this by triggering the creation of a VMObject whose physical
memory range would wrap around the 32-bit limit.
It was quite easy to map kernel memory into userspace and read/write
whatever you wanted in it.
Test: Kernel/bxvga-mmap-kernel-into-userspace.cpp
When using dbg() in the kernel, the output is automatically prefixed
with [Process(PID:TID)]. This makes it a lot easier to understand which
thread is generating the output.
This patch also cleans up some common logging messages and removes the
now-unnecessary "dbg() << *current << ..." pattern.
We don't need to have this method anymore. It was a hack that was used
in many components in the system but currently we use better methods to
create virtual memory mappings. To prevent any further use of this
method it's best to just remove it completely.
Also, the APIC code is disabled for now since it doesn't help booting
the system, and is broken since it relies on identity mapping to exist
in the first 1MB. Any call to the APIC code will result in an
assertion failure.
In addition to that, the name of the method which is responsible to
create an identity mapping between 1MB to 2MB was changed, to be more
precise about its purpose.
There is no real "read protection" on x86, so we have no choice but to
map write-only pages simply as "present & read/write".
If we get a read page fault in a non-readable region, that's still a
correctness issue, so we crash the process. It's by no means a complete
protection against invalid reads, since it's trivial to fool the kernel
by first causing a write fault in the same region.
uintptr_t is 32-bit or 64-bit depending on the target platform.
This will help us write pointer size agnostic code so that when the day
comes that we want to do a 64-bit port, we'll be in better shape.
Instead of restoring CR3 to the current process's paging scope when a
ProcessPagingScope goes out of scope, we now restore exactly whatever
the CR3 value was when we created the ProcessPagingScope.
This fixes breakage in situations where a process ends up with nested
ProcessPagingScopes. This was making profiling very fragile, and with
this change it's now possible to profile g++! :^)
Previously, when deallocating a range of VM, we would sort and merge
the range list. This was quite slow for large processes.
This patch optimizes VM deallocation in the following ways:
- Use binary search instead of linear scan to find the place to insert
the deallocated range.
- Insert at the right place immediately, removing the need to sort.
- Merge the inserted range with any adjacent range(s) in-line instead
of doing a separate merge pass into a list copy.
- Add Traits<Range> to inform Vector that Range objects are trivial
and can be moved using memmove().
I've also added an assertion that deallocated ranges are actually part
of the RangeAllocator's initial address range.
I've benchmarked this using g++ to compile Kernel/Process.cpp.
With these changes, compilation goes from ~41 sec to ~35 sec.
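A rough standalone sketch of the insert-and-merge idea (std::vector
here for brevity; the kernel uses AK's Vector and its own Range):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Range {
    uintptr_t base { 0 };
    size_t size { 0 };
    uintptr_t end() const { return base + size; }
};

void deallocate(std::vector<Range>& free_list, Range r)
{
    // Binary search for the first free range at or after r.
    auto it = std::lower_bound(free_list.begin(), free_list.end(), r,
        [](const Range& a, const Range& b) { return a.base < b.base; });
    it = free_list.insert(it, r);
    // Merge with the following range if they touch.
    if (std::next(it) != free_list.end()
        && it->end() == std::next(it)->base) {
        it->size += std::next(it)->size;
        free_list.erase(std::next(it));
    }
    // Merge with the preceding range if they touch.
    if (it != free_list.begin() && std::prev(it)->end() == it->base) {
        std::prev(it)->size += it->size;
        free_list.erase(it);
    }
}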
This will panic the kernel immediately if these functions are misused
so we can catch it and fix the misuse.
This patch fixes a couple of misuses:
- create_signal_trampolines() writes to a user-accessible page
above the 3GB address mark. We should really get rid of this
page but that's a whole other thing.
- CoW faults need to use copy_from_user rather than copy_to_user
since it's the *source* pointer that points to user memory.
- Inode faults need to use memcpy rather than copy_to_user since
we're copying a kernel stack buffer into a quickmapped page.
This should make the copy_to/from_user() functions slightly less useful
for exploitation. Before this, they were essentially just glorified
memcpy() with SMAP disabled. :^)
Technically the bottom 2MB is still identity-mapped for the kernel and
not made available to userspace at all, but for simplicity's sake we
can just ignore that and make "address < 0xc0000000" the canonical
check for user/kernel.
It's now an error to sys$mmap() a file as writable if it's currently
mapped executable by anyone else.
It's also an error to sys$execve() a file that's currently mapped
writable by anyone else.
This fixes a race condition vulnerability where one program could make
modifications to an executable while another process was in the kernel,
in the middle of exec'ing the same executable.
Test: Kernel/elf-execve-mmap-race.cpp
As suggested by Joshua, this commit adds the 2-clause BSD license as a
comment block to the top of every source file.
For the first pass, I've just added myself for simplicity. I encourage
everyone to add themselves as copyright holders of any file they've
added or modified in some significant way. If I've added myself in
error somewhere, feel free to replace it with the appropriate copyright
holder instead.
Going forward, all new source files should include a license header.
This is not ASLR, but it does de-trivialize exploiting the ELF loader
which would previously always parse executables at 0x01001000 in every
single exec(). I've taken advantage of this multiple times in my own
toy exploits and it's starting to feel cheesy. :^)
We now use the regular "user" physical pages for on-demand page table
allocations. This was by far the biggest source of super physical page
exhaustion, so that bug should be a thing of the past now. :^)
We still have super pages, but they are barely used. They remain useful
for code that requires memory with a low physical address.
Fixes #1000.
After MemoryManager initialization, we now only leave the lowest 1MB
of memory identity-mapped. The very first (null) page is not present.
All other pages are RW but not X. Supervisor only.
The kernel and its static data structures are no longer identity-mapped
in the bottom 8MB of the address space, but instead move above 3GB.
The first 8MB above 3GB are pseudo-identity-mapped to the bottom 8MB of
the physical address space. But things don't have to stay this way!
Thanks to Jesse who made an earlier attempt at this, it was really easy
to get device drivers working once the page tables were in place! :^)
Fixes #734.
We can now create a cacheable Region, so when map() is called, if a
Region is cacheable then all the virtual memory space being allocated
to it will be marked as not cache disabled.
In addition to that, OS components can create a Region that will be
mapped to a specific physical address by using the appropriate helper
method.
When loading a new executable, we now map the ELF image in kernel-only
memory and parse it there. Then we use copy_to_user() when initializing
writable regions with data from the executable.
Note that the exec() syscall still disables SMAP protection and will
require additional work. This patch only affects kernel-originated
process spawns.
Supervisor Mode Access Prevention (SMAP) is an x86 CPU feature that
prevents the kernel from accessing userspace memory. With SMAP enabled,
trying to read/write a userspace memory address while in the kernel
will now generate a page fault.
Since it's sometimes necessary to read/write userspace memory, there
are two new instructions that quickly switch the protection on/off:
STAC (disables protection) and CLAC (enables protection.)
These are exposed in kernel code via the stac() and clac() helpers.
There's also a SmapDisabler RAII object that can be used to ensure
that you don't forget to re-enable protection before returning to
userspace code.
This patch also adds copy_to_user(), copy_from_user() and memset_user()
which are the "correct" way of doing things. These functions allow us
to briefly disable protection for a specific purpose, and then turn it
back on immediately after it's done. Going forward all kernel code
should be moved to using these and all uses of SmapDisabler are to be
considered FIXME's.
Note that we're not realizing the full potential of this feature since
I've used SmapDisabler quite liberally in this initial bring-up patch.
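At call sites, the two styles look roughly like this (a sketch; the
exact signatures are assumed):

#include <cstddef>

void copy_result_to_userspace(void* user_dest, void const* src,
                              size_t size)
{
    // Preferred: the helper toggles protection only around the copy.
    copy_to_user(user_dest, src, size);

    // Legacy bring-up style, to be treated as a FIXME:
    {
        SmapDisabler disabler; // stac(): user memory is accessible
        // ... direct userspace access here ...
    } // ~SmapDisabler runs clac(): protection is back on
}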
Inode::size() may try to take a lock, so we can't be calling it with
interrupts disabled.
This fixes a kernel hang when trying to execute a binary in a TmpFS.