It was doing a bunch of things it didn't need to do. I think we had
misunderstood the base class as having copied m_lock in its copy
constructor but it's actually default initialized.
We had issues with committed physical pages getting miscounted in some
situations, and instead of figuring out what was going wrong and making
sure all the commits had matching uncommits, this patch makes the
problem go away by adding an RAII class to manage this instead. :^)
MemoryManager::commit_user_physical_pages() now returns an (optional)
CommittedPhysicalPageSet. You can then allocate pages from the page set
by calling take_one() on it. Any unallocated pages are uncommitted upon
destruction of the page set.
Since we only ever add 1 super physical region, theres no reason to
add the extra redirection and allocation that comes with a dynamically
sized Vector.
We try to read twice from the RTC CMOS registers, and if the values are
not the same for 5 attempts, we know there's a malfunction with the
hardware so we declare these values as bogus in the kernel log.
When we try to query the time from the RTC CMOS, we try to check if
the CMOS is updated. If it is updated for a long period of time (as a
result of hardware malfunction), break the loop and return Unix epoch
time.
Previously we would just print "ASSERTION FAILED: false", which was
kinda cryptic and also didn't make it clear whether this was a TODO or
an unreachable condition. Now, we actually print "ASSERTION FAILED:
TODO", making it crystal clear.
Create the disk cache up front, so we can verify it succeeds.
Make the KBuffer allocation fail-able, so we can properly handle
failure when the user asks up to mount a Ext2 filesystem under
OOM conditions.
The IPv4Socket requires a DoubleBuffer for storage of any data it
received on the socket. However it was previously using the default
constructor which can not observe allocation failure. Address this by
plumbing the receive buffer through the various derived classes.
LocalSockets keep a DoubleBuffer for both client and server usage.
This change converts the usage from using the default constructor
which is unable to observe OOM, to the new try_create factory and
plumb the result through the constructor.
We need to expose the ability for DoubleBuffer creation to expose
failure, as DoubleBuffer depends on KBuffer, which also has to be able
to expose failure during OOM.
We will remove the non OOM API once all users have been converted.
The sys$alarm() syscall has logic to cache a m_alarm_timer to avoid
allocating a new timer for every call to alarm. Unfortunately that
logic was broken, and there were conditions in which we could have
a timer allocated, but it was no longer on the timer queue, and we
would attempt to cancel that timer again resulting in an infinite
loop waiting for the timers callback to fire.
To fix this, we need to track if a timer is currently in use or not,
allowing us to avoid attempting to cancel inactive timers.
Luke and Tom did the initial investigation, I just happened to have
time to write a repro and attempt a fix, so I'm adding them as the
as co-authors of this commit.
Co-authored-by: Luke <luke.wilde@live.co.uk>
Co-authored-by: Tom <tomut@yahoo.com>
Problem:
- New `all_of` implementation takes the entire container so the user
does not need to pass explicit begin/end iterators. This is unused
except is in tests.
Solution:
- Make use of the new and more user-friendly version where possible.
Read the appropriate registers for RTL8139, RTL8168 and E1000.
For NE2000 just assume 10mbit full duplex as there is no indicator
for it in the pure NE2000 spec. Mock values for loopback.
Previously there was no way for Serenity to send a packet without an
established socket connection, and there was no way to appropriately
respond to a SYN packet on a non-listening port. This patch will respond
to any non-established socket attempts with the appropraite RST/ACK,
letting the client know to close the connection.
In accordance with RFC 793, if the receiver is in the SYN-SENT state
it should respond to a RST by aborting the connection and immediately
move to the CLOSED state.
Previously the system would ACK all RST/ACKs, and the remote peer would
just respond with more RST packets.
This isn't needed for Process / Thread as they only reference it
by pointer and it's already part of Kernel/Forward.h. So just include
it where the implementation needs to call it.
I was working on some more KASAN changes and realized the system
no longer links when passing -DENABLE_KERNEL_ADDRESS_SANITIZER=ON.
Prekernel will likely never have KASAN support given it's limited
environment, so just suppress it's usage.
When cloning an AnonymousVMObject, committed COW pages are shared
between the parent and child object. Whicever object COW's first will
take the shared committed page, and if the other object ends up doing
a COW as well, it will notice that the page is no longer shared by
two objects and simple remap it as read/write.
When a child is COW'ed again, while still having shared committed
pages with its own parent, the grandchild object will now join in the
sharing pool with its parent and grandparent. This means that the first
2 of 3 objects that COW will draw from the shared committed pages, and
3rd will remap read/write.
Previously, we would "fork" the shared committed pages when cloning,
which could lead to a situation where the grandparent held on to 1 of
the 3 needed shared committed pages. If both the child and grandchild
COW'ed, they wouldn't have enough pages, and since the grandparent
maintained an extra +1 ref count on the page, it wasn't possible to
to remap read/write.
We currently always crash if a user attempts to unmap a range that
does not intersect with an existing region, no matter the size. This
happens because we will never explicitly check to see if the search
for intersecting regions found anything, instead loop over the results,
which might be an empty vector. We then attempt to deallocate the
requested range from the `RangeAllocator` unconditionally, which will
be invalid if the specified range is not managed by the RangeAllocator.
We will assert validating m_total_range.contains(..) the range we are
requesting to deallocate.
This fix to this is straight forward, error out if we weren't able to
find any intersections.
You can get stress-ng to attempt this pattern with the following
arguments, which will attempt to unmap 0x0 through some large offset:
```
stress-ng --vm-segv 1
```
Fixes: #8483
Co-authored-by: Federico Guerinoni <guerinoni.federico@gmail.com>
AnonymousVMObject::set_volatile() assumes that nobody ever calls it on
non-purgeable objects, so let's make sure we don't do that.
Also return EINVAL instead of EPERM for non-anonymous VM objects so the
error codes match.
Previously it was possible to leak the file descriptor if we error out
after allocating the first descriptor. Now we perform both fd
allocations back to back so we can handle the potential error when
processing the second fd allocation.
The way the Process::FileDescriptions::allocate() API works today means
that two callers who allocate back to back without associating a
FileDescription with the allocated FD, will receive the same FD and thus
one will stomp over the other.
Naively tracking which FileDescriptions are allocated and moving onto
the next would introduce other bugs however, as now if you "allocate"
a fd and then return early further down the control flow of the syscall
you would leak that fd.
This change modifies this behavior by tracking which descriptions are
allocated and then having an RAII type to "deallocate" the fd if the
association is not setup the end of it's scope.
To add a new per-CPU data structure, add an ID for it to the
ProcessorSpecificDataID enum.
Then call ProcessorSpecific<T>::initialize() when you are ready to
construct the per-CPU data structure on the current CPU. It can then
be accessed via ProcessorSpecific<T>::get().
This patch replaces the existing hard-coded mechanisms for Scheduler
and MemoryManager per-CPU data structure.
This enables further work on implementing KASLR by adding relocation
support to the pre-kernel and updating the kernel to be less dependent
on specific virtual memory layouts.
This allows us to specify virtual addresses for things the kernel should
access via virtual addresses later on. By doing this we can make the
kernel independent from specific physical addresses.
Previously the kernel relied on a fixed offset between virtual and
physical addresses based on the kernel's load address. This allows us
to specify an independent offset.
Instead of doing a reset via triple-fault, let's just shutdown the QEMU
virtual machine because this is already a QEMU-specific handling code
for Self-Test CI mode.
In preparation for modifying the Kernel IOCTL API to return KResult
instead of int, we need to fix this ioctl to an argument to receive
it's return value, instead of using the actual function return value.
It's easy to forget the responsibility of validating and safely copying
kernel parameters in code that is far away from syscalls. ioctl's are
one such example, and bugs there are just as dangerous as at the root
syscall level.
To avoid this case, utilize the AK::Userspace<T> template in the ioctl
kernel interface so that implementors have no choice but to properly
validate and copy ioctl pointer arguments.
GCC and Clang allow us to inject a call to a function named
__sanitizer_cov_trace_pc on every edge. This function has to be defined
by us. By noting down the caller in that function we can trace the code
we have encountered during execution. Such information is used by
coverage guided fuzzers like AFL and LibFuzzer to determine if a new
input resulted in a new code path. This makes fuzzing much more
effective.
Additionally this adds a basic KCOV implementation. KCOV is an API that
allows user space to request the kernel to start collecting coverage
information for a given user space thread. Furthermore KCOV then exposes
the collected program counters to user space via a BlockDevice which can
be mmaped from user space.
This work is required to add effective support for fuzzing SerenityOS to
the Syzkaller syscall fuzzer. :^) :^)
When we get a COW fault and discover that whoever we were COW'ing
together with has either COW'ed that page on their end (or they have
unmapped/exited) we simplify life for ourselves by clearing the COW
bit and keeping the page we already have. (No need to COW if the page
is not shared!)
The act of doing this does not return a committed page to the pool.
In fact, that committed page we had reserved for this purpose was used
up (allocated) by our COW buddy when they COW'ed the page.
This fixes a kernel panic when running TestLibCMkTemp. :^)
This makes a kernel panic immediately fail the on-target CI job.
Otherwise the failed job looks like a test timeout unless one digs into
the details of the job.
We don't need an entirely separate VMObject subclass to influence the
location of the physical pages.
Instead, we simply allocate enough physically contiguous memory first,
and then pass it to the AnonymousVMObject constructor that takes a span
of physical pages.
VMObject already has an IntrusiveList of all the Regions that map it.
We were keeping a counter in addition to this, and only using it in
a single place to avoid iterating over the list in case it only had
1 entry.
Simplify VMObject by removing this counter and always iterating the
list even if there's only 1 entry. :^)
This was previously used for a single debug logging statement during
memory purging. There are no remaining users of this weak pointer,
so let's get rid of it.
If a purgeable VM object is in the "volatile" state when we're asked
to make a COW clone of it, make life simpler by simply "purging"
the cloned object right away.
This effectively means that a fork()'ed child process will discover
its purgeable+volatile regions to be empty if/when it tries making
them non-volatile.
This patch changes the semantics of purgeable memory.
- AnonymousVMObject now has a "purgeable" flag. It can only be set when
constructing the object. (Previously, all anonymous memory was
effectively purgeable.)
- AnonymousVMObject now has a "volatile" flag. It covers the entire
range of physical pages. (Previously, we tracked ranges of volatile
pages, effectively making it a page-level concept.)
- Non-volatile objects maintain a physical page reservation via the
committed pages mechanism, to ensure full coverage for page faults.
- When an object is made volatile, it relinquishes any unused committed
pages immediately. If later made non-volatile again, we then attempt
to make a new committed pages reservation. If this fails, we return
ENOMEM to userspace.
mmap() now creates purgeable objects if passed the MAP_PURGEABLE option
together with MAP_ANONYMOUS. anon_create() memory is always purgeable.
Right now, NE2000 NICs don't work because the link is down by default
and this will never change. Of all the NE2000 documentation I looked
at I could not find a link status indicator, so just assume the link
is up.
next_packet_page points to a page, but was being compared to a byte
offset rather than a page offset when adjusting the BOUNDARY register
when the ring buffer wraps around.
Fixes#8327.
This bug manifests it self when the caller to sys$pledge() passes valid
promises, but invalid execpromises. The code would apply the promises
and then return an error for the execpromises. This leaves the user in
a confusing state, as the promises were silently applied, but we return
an error suggesting the operation has failed.
Avoid this situation by tweaking the implementation to only apply the
promises / execpromises after all validation has occurred.
GCC-11 added a new option `-fzero-call-used-regs` which causes the
compiler to zero function arguments before return of a function. The
goal being to reduce the possible attack surface by disarming ROP
gadgets that might be potentially useful to attackers, and reducing
the risk of information leaks via stale register data. You can find
the GCC commit below[0].
This is a mitigation I noticed on the Linux KSPP issue tracker[1] and
thought it would be useful mitigation for the SerenityOS Kernel.
The reduction in ROP gadgets is observable using the ropgadget utility:
$ ROPgadget --nosys --nojop --binary Kernel | tail -n1
Unique gadgets found: 42754
$ ROPgadget --nosys --nojop --binary Kernel.RegZeroing | tail -n1
Unique gadgets found: 41238
The size difference for the i686 Kernel binary is negligible:
$ size Kernel Kernel.RegZerogin
text data bss dec hex filename
13253648 7729637 6302360 27285645 1a0588d Kernel
13277504 7729637 6302360 27309501 1a0b5bd Kernel.RegZeroing
We don't have any great workloads to measure regressions in Kernel
performance, but Kees Cook mentioned he measured only around %1
performance regression with this enabled on his Linux kernel build.[2]
References:
[0] d10f3e900b
[1] https://github.com/KSPP/linux/issues/84
[2] https://lore.kernel.org/lkml/20210714220129.844345-1-keescook@chromium.org/
This patch greatly simplifies VMObject locking by doing two things:
1. Giving VMObject an IntrusiveList of all its mapping Region objects.
2. Removing VMObject::m_paging_lock in favor of VMObject::m_lock
Before (1), VMObject::for_each_region() was forced to acquire the
global MM lock (since it worked by walking MemoryManager's list of
all regions and checking for regions that pointed to itself.)
With each VMObject having its own list of Regions, VMObject's own
m_lock is all we need.
Before (2), page fault handlers used a separate mutex for preventing
overlapping work. This design required multiple temporary unlocks
and was generally extremely hard to reason about.
Instead, page fault handlers now use VMObject's own m_lock as well.
Despite what the declaration would have us believe these are not "u8*".
If they were we wouldn't have to use the & operator to get the address
of them and then cast them to "u8*"/FlatPtr afterwards.
Since we're taking from the committed set of pages, there should never
be a reason for this call to fail.
Also add a Badge to disallow taking committed pages from anywhere but
the Region class.
We don't need to have a dedicated API for creating a VMObject with a
single page, the multi-page API option works in all cases.
Also make the API take a Span<NonnullRefPtr<PhysicalPage>> instead of
a NonnullRefPtrVector<PhysicalPage>.
Depending on the values it might be difficult to figure out whether a
value is decimal or hexadecimal. So let's make this more obvious. Also
this allows copying and pasting those numbers into GNOME calculator and
probably also other apps which auto-detect the base.
KBufferBuilder is always allowed to expand if it wants to. This
restriction was added a long time ago when it was unsafe to allocate
VM while generating ProcFS contents.
Use a Mutex instead of a SpinLock to protect the per-FileDescription
generated data cache. This allows processes to go to sleep while
waiting their turn.
Also don't try to be clever by reusing existing cache buffers.
Just allocate KBuffers as needed (and make sure to surface failures.)
KBufferBuilder exists for code that wants to build a KBuffer instead
of a String. KBuffer is backed by anonymous VM, while String is backed
by a kernel heap allocation.
Advisory locks don't actually prevent other processes from writing to
the file, but they do prevent other processes looking to acquire and
advisory lock on the file.
This implementation currently only adds non-blocking locks, which are
all I need for now.
Before we start disabling acquisition of the big process lock for
specific syscalls, make sure to document and assert that all the
lock is held during all syscalls.
There are cases where we want to conditionally take a lock, but still
would like to use an RAII type to make sure we don't leak the lock.
This was previously impossible to do with `MutexLocker` due to it's
design. This commit tweaks the design to allow the object to be
initialized to an "empty" state without a lock associated, so it does
nothing, and then later a lock can be "attached" to the locker.
I realized that the get_lock() API's where also unused, and would no
longer make sense for empty locks, so they were removed.
Move these to MM to simplify the flow of the syscall handler.
While here, also make sure we hold the process space lock for
the duration of the validation to avoid potential issues where
another thread attempts to modify the process space during the
validation. This will allow us to move the validation out of the
big process lock scope in a future change.
Additionally utilize the new no_lock variants of functions to avoid
unnecessary recursive process space spinlock acquisitions.
Currently all syscalls run under the Process:m_big_lock, which is an
obvious bottleneck. Long term we would like to remove the big lock and
replace it with the required fine grained locking.
To facilitate this goal we need a way of gradually decomposing the big
lock into the all of the required fine grained locks. This commit
introduces instrumentation to the syscall table, allowing the big lock
requirement to be toggled on/off per syscall.
Eventually when we are finished, no syscall will required the big lock,
and we'll be able to remove all of this instrumentation.
The entire process is not needed, just require the user to pass in the
Space. Also provide no_lock variant to use when you already have the
VM/Space lock acquired, to avoid unnecessary recursive spinlock
acquisitions.
Depending on the exact layout of the .ksyms section the kernel would
fail to boot because the kernel_load_end variable didn't account for the
section's size.
The non CPU specific code of the kernel shouldn't need to deal with
architecture specific registers, and should instead deal with an
abstract view of the machine. This allows us to remove a variety of
architecture specific ifdefs and helps keep the code slightly more
portable.
We do this by exposing the abstract representation of instruction
pointer, stack pointer, base pointer, return register, etc on the
RegisterState struct.
Allocate all the RX buffers in one big memory region (and same for TX.)
This removes 38 lines from every crash dump (and just seems like a
reasonable idea in general.)
We need to cast physical addresses to PhysicalPtr instead of FlatPtr,
which is currently always 64 bits. However, if one day we were to
support 32 bit non-pae mode then it would also truncate appropriately.
This switches tracking CPU usage to more accurately measure time in
user and kernel land using either the TSC or another time source.
This will also come in handy when implementing a tickless kernel mode.
As threads come and go, we can't simply account for how many time
slices the threads at any given point may have been using. We need to
also account for threads that have since disappeared. This means we
also need to track how many time slices we have expired globally.
However, because this doesn't account for context switches outside of
the system timer tick values may still be under-reported. To solve this
we will need to track more accurate time information on each context
switch.
This also fixes top's cpu usage calculation which was still based on
the number of context switches.
Fixes#6473
This adds a ".profile" extension to perfcore files written by the
Kernel. Also, the process name is now visible in the perfcore filename.
Furthermore, this patch adds error handling for the case where the
filename generated by the Kernel is already taken. In that case, a digit
will be added to the filename (before the extension).
This also adds some more error logging to dump_perfcore().
This implements a simple bootloader that is capable of loading ELF64
kernel images. It does this by using QEMU/GRUB to load the kernel image
from disk and pass it to our bootloader as a Multiboot module.
The bootloader then parses the ELF image and sets it up appropriately.
The kernel's entry point is a C++ function with architecture-native
code.
Co-authored-by: Liav A <liavalb@gmail.com>
When a Thread is being unblocked and we need to re-lock the process
big_lock and re-locking blocks again, then we may end up in
Thread::block again while still servicing the original lock's
Thread::block. So permit recursion as long as it's only the big_lock
that we block on again.
Fixes#8822
We often get queried for the root inode, and it will always be cached
in memory anyway, so let's make Ext2FS::root_inode() fast by keeping
the root inode in a dedicated member variable.
To count the remaining children, we simply need to traverse the
directory and increment a counter. No need for a custom virtual that
all file systems have to implement. :^)
Unless we're accessing mutex-guarded metadata, there's no need to
acquire the inode lock.
The file system ID or inode index of a constructed inode will never
change, for example.
We should never request a regions removal that we don't currently
own. We currently assert this everywhere else by all callers.
Instead lets just push the assert down into the RedBlackTree removal
and assume that we will always successfully remove the region.
This is a much more ergonomic option than getting a
`VERIFY_NOT_REACHED()` failure at run-time. I encountered this issue
with Clang, where sized deallocation is not the default due to ABI
breakage concerns.
Note that we can't simply just not declare these functions, because the
C++ standard states:
> If this function with size parameter is defined, the program shall
> also define the version without the size parameter.
The compiler will use these to allocate objects that have alignment
requirements greater than that of our normal `operator new` (4/8 byte
aligned).
This means we can now use smart pointers for over-aligned types.
Fixes a FIXME.
By default, the compiler will assume that `operator new` returns
pointers that are aligned correctly for every built-in type. This is not
the case in the kernel on x64, since the assumed alignment is 16
(because of long double), but the kmalloc blocks are only
`alignas(void*)`.
Thread::yield_and_release_relock_big_lock releases the big lock, yields
and then relocks the big lock.
Thread::yield_assuming_not_holding_big_lock yields assuming the big
lock is not being held.
When blocking on a Lock other than the big lock and we're holding the
big lock, we need to release the big lock first. This fixes some
deadlocks where a thread blocks while holding the big lock, preventing
other threads from getting the big lock in order to unblock the waiting
thread.
When a Lock blocks (e.g. due to a mode mismatch or because someone
else holds it) the lock mode will be updated to what was requested.
There were also some cases where restoring locks may have not worked
as intended as it may have been held already by the same thread.
Fixes#8787
The kernel doesn't currently boot when using an address other than
0xc0000000 because the page tables aren't set up properly for that
but this at least lets us build the kernel.
The 32-bit boot code jumps to 0xc0000000 + entry address once page
tables are set up. This is unnecessary for 64-bit mode because we'll
do another far jump just moments later.
I botched this in 859e5741ff, the check
was supposed to be with Process::is_kernel_process().
This fixes an issue with zombie processes hanging around forever.
Thanks tomuta for spotting it! :^)
Reimplement directory traversal in terms of read_bytes() instead of
doing direct block access. This lets us avoid taking the inode lock
while iterating over the directory contents.
Once we've finalized all the file system metadata in flush_writes(),
we no longer need to hold the file system lock during the call to
BlockBasedFileSystem::flush_writes().
Ext2FS::get_inode() will remember unknown inode indices that it has
been asked about and put them into the inode cache as null inodes.
flush_writes() was not null-checking these while iterating, which
was a bug I finally managed to hit.
Flushing also seemed like a good time to drop unknown inodes from
the cache, since there's no good reason to hold to them indefinitely.
The file system lock is meant to protect the file system metadata
(super blocks, bitmaps, etc.) Not protect processes from reading
independent parts of the disk at once.
This patch introduces a new lock to protect the *block cache* instead,
which is the real thing that needs synchronization.
Forcing users of a FileDescription to seek before they can read/write
makes it inherently racy. This patch adds variants of read/write that
simply ignore the "current offset" of the description in favor of a
caller-supplied offset.
This data structure is a much better fit for what is essentially a
sorted list of non-overlapping ranges.
Not using Vector means we no longer have to worry about Vector buffers
getting huge. Only nice & small allocations from now on.
This adds a new section .ksyms at the end of the linker map, reserves
5MiB for it (which are after end_of_kernel_image so they get re-used
once MemoryManager is initialized) and then embeds the symbol map into
the kernel binary with objcopy. This also shrinks the .ksyms section to
the real size of the symbol file (around 900KiB at the moment).
By doing this we can make the symbol map available much earlier in the
boot process, i.e. even before VFS is available.
We leak a ref() onto every user process when constructing them,
either via Process::create_user_process(), or via Process::sys$fork().
This ref() is balanced by a corresponding unref() in
Thread::WaitBlockCondition::finalize().
Since kernel processes don't have a leaked ref() on them, this led to
an extra Process::unref() on kernel processes during finalization.
This happened during every boot, with the `init_stage2` process.
Found by turning off kfree() scrubbing. :^)
In case we are about to delete the PID directory, we clear the Process
pointer. If someone still holds a reference to the PID directory (by
opening it), we still need to delete the process, but we can't delete
the directory, so we will keep it alive, but any operation on it will
fail by propogating the error to userspace about that the Process was
deleted and therefore there's no meaning to trying to do operations on
the directory.
Fixes#8576.
The C++ standard says that it's legal to call the `delete` operator with
a null pointer argument, in which case it should be a no-op. I
encountered this issue when running a kernel that's compiled with Clang.
I assume this fact was used for some kind of optimization.
It's possible that another thread might try to exit the process just
about the same time another thread does the same, or a crash happens.
Also, we may not be able to kill all other threads instantly as they
may be blocked in the kernel (though in this case they would get killed
before ever returning back to user mode. So keep track of whether
Process::die was already called and ignore it on subsequent calls.
Fixes#8485
We can have multiple PhysicalRegions (often the case when there is a
huge amount of RAM) so we really shouldn't print a debug message any
time someone tries to allocate from one. They will move on to another
region anyway.
We were incorrectly using sizeof(PhysicalPageEntry) for some address
calculations instead of sizeof(PageTableEntry).
It still worked correctly because they happen to be the same size.
We now keep all the PhysicalZones on one of two intrusive lists within
the PhysicalRegion.
The "usable" list contains all zones that can be allocated from,
and the "full" list contains all zones with no free pages.
Instead of creating a PhysicalRegion and then expanding it over and
over as we traverse the memory map on boot, we now compute the final
size of the contiguous physical range up front, and *then* create a
PhysicalRegion object.
Nobody was using this API to request anythign about `PAGE_SIZE`
alignment, so let's get rid of it for now. We can reimplement it if
we end up needing it.
Also note that it wasn't actually used anywhere.
The previous allocator was very naive and kept the state of all pages
in one big bitmap. When allocating, we had to scan through the bitmap
until we found an unset bit.
This patch introduces a new binary buddy allocator that manages the
physical memory pages.
Each PhysicalRegion is divided into zones (PhysicalZone) of 16MB each.
Any extra pages at the end of physical RAM that don't fit into a 16MB
zone are turned into 15 or fewer 1MB zones.
Each zone starts out with one full-sized block, which is then
recursively subdivided into halves upon allocation, until a block of
the request size can be returned.
There are more opportunities for improvement here: the way zone objects
are allocated and stored is non-optimal. Same goes for the allocation
of buddy block state bitmaps.
Threads that don't make syscalls still need to be killed, and we can
do that at any time we want so long the thread is in user mode and
not somehow blocked (e.g. page fault).
This reverts commit 3c3a1726df.
We cannot blindly kill threads just because they're not executing in a
system call. Being blocked (including in a page fault) needs proper
unblocking and potentially kernel stack cleanup before we can mark a
thread as Dying.
Fixes#8691
This enables the Lock class to block a thread even while the thread is
working on a BlockCondition. A thread can still only be either blocked
by a Lock or a BlockCondition.
This also establishes a linked list of threads that are blocked by a
Lock and unblocking directly unlocks threads and wakes them directly.
It's possible that a timer may have been queued to be executed by
the timer irq handler, but if we're in a critical section on the
same processor and are trying to cancel that timer, we would spin
forever waiting for it to be executed.
This re-arranges the order of how things are initialized so that we
try to initialize process and thread management earlier. This is
neccessary because a lot of the code uses the Lock class, which really
needs to have a running scheduler in place so that we can properly
preempt.
This also enables us to potentially initialize some things in parallel.
Instead of each PhysicalPage knowing whether it comes from the
supervisor pages or from the user pages, we can just check in both
sets when freeing a page.
It's just a handful of pointer range checks, nothing expensive.
There appears to be no reason why the process registration needs
to happen under the space spin lock. As the first thread is not started
yet it should be completely uncontested, but it's still bad practice.
If no other thread is ready to be run we don't need to switch to the
idle thread and wait for the next timer interrupt. We can just give
the thread another timeslice and keep it running.