For Kernel OOM hardening to work correctly, we need to be able to
call a "nothrow" version of operator new. Unfortunately the default
"throwing" version of operator new assumes that the allocation will
never return on failure and will always throw an exception. This isn't
true in the Kernel, as we don't have exceptions. So if we call the
normal/throwing new and kmalloc returns NULL, the generated code will
happily go and dereference that NULL pointer by invoking the constructor
before we have a chance to handle the failure.
To fix this we declare operator new as noexcept in the Kernel headers,
which will allow the caller to actually handle allocation failure.
The delete implementations need to match the prototype of the new which
allocated them, so we need define delete as noexcept as well. GCC then
errors out declaring that you should implement sized delete as well, so
this change provides those stubs in order to compile cleanly.
Finally the new operator definitions have been standardized as being
declared with [[nodiscard]] to avoid potential memory leaks. So lets
declares the kernel versions that way as well.
We had some inconsistencies before:
- Sometimes "The", sometimes "the"
- Sometimes trailing ".", sometimes no trailing "."
I picked the most common one (lowecase "the", trailing ".") and applied
it to all copyright headers.
By using the exact same string everywhere we can ensure nothing gets
missed during a global search (and replace), and that these
inconsistencies are not spread any further (as copyright headers are
commonly copied to new files).
SPDX License Identifiers are a more compact / standardized
way of representing file license information.
See: https://spdx.dev/resources/use/#identifiers
This was done with the `ambr` search and replace tool.
ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
Double kfree() is exceedingly rare in our kernel since we use automatic
memory management and smart pointers for almost all code. However, it
doesn't hurt to do some basic checking that might one day catch bugs.
This patch makes us VERIFY that we don't already consider the first
chunk of a kmalloc() allocation free when kfree()'ing it.
Alot of code is shared between i386/i686/x86 and x86_64
and a lot probably will be used for compatability modes.
So we start by moving the headers into one Directory.
We will probalby be able to move some cpp files aswell.
The system is extremely sensitive to heap allocations during heap
expansion. This was causing frequent OOM panics under various loads.
Work around the issue for now by putting the logging behind
KMALLOC_DEBUG. Ideally dmesgln() & friends would not reqiure any
heap allocations, but we're not there right now.
Fixes#5724.
This may seem like a no-op change, however it shrinks down the Kernel by a bit:
.text -432
.unmap_after_init -60
.data -480
.debug_info -673
.debug_aranges 8
.debug_ranges -232
.debug_line -558
.debug_str -308
.debug_frame -40
With '= default', the compiler can do more inlining, hence the savings.
I intentionally omitted some opportunities for '= default', because they
would increase the Kernel size.
Make more of the kernel compile in 64-bit mode, and make some things
pointer-size-agnostic (by using FlatPtr.)
There's a lot of work to do here before the kernel will even compile.
(...and ASSERT_NOT_REACHED => VERIFY_NOT_REACHED)
Since all of these checks are done in release builds as well,
let's rename them to VERIFY to prevent confusion, as everyone is
used to assertions being compiled out in release.
We can introduce a new ASSERT macro that is specifically for debug
checks, but I'm doing this wholesale conversion first since we've
accumulated thousands of these already, and it's not immediately
obvious which ones are suitable for ASSERT.
There's no real system here, I just added it to various functions
that I don't believe we ever want to call after initialization
has finished.
With these changes, we're able to unmap 60 KiB of kernel text
after init. :^)
If we try to align a number above 0xfffff000 to the next multiple of
the page size (4 KiB), it would wrap around to 0. This is most likely
never what we want, so let's assert if that happens.
Now that we no longer need to support the signal trampolines being
user-accessible inside the kernel memory range, we can get rid of the
"kernel" and "user-accessible" flags on Region and simply use the
address of the region to determine whether it's kernel or user.
This also tightens the page table mapping code, since it can now set
user-accessibility based solely on the virtual address of a page.
The kernel ignored the first 8 MiB of RAM while parsing the memory map
because the kmalloc heaps and the super physical pages lived here. Move
all that stuff inside the .bss segment so that those memory regions are
accounted for, otherwise we risk overwriting boot modules placed next
to the kernel.
This adds the ability for a Region to define volatile/nonvolatile
areas within mapped memory using madvise(). This also means that
memory purging takes into account all views of the PurgeableVMObject
and only purges memory that is not needed by all of them. When calling
madvise() to change an area to nonvolatile memory, return whether
memory from that area was purged. At that time also try to remap
all memory that is requested to be nonvolatile, and if insufficient
pages are available notify the caller of that fact.
If a heap expansion is triggered by allocating from e.g. the
RangeAllocator, which may be holding a spin lock, we cannot
immediately allocate another block of backup memory, which could
require the same locks to be acquired. So, defer allocating the
backup memory
Fixes#4675
When the ExpandableHeap calls the remove_memory function, the
subheap is assumed to be removed and freed entirely. remove_memory
may drop the underlying memory at any time, but it also may cause
further allocation requests. Not removing it from the list before
calling remove_memory could cause a memory allocation in that
subheap while remove_memory is executing. which then causes issues
once the underlying memory is actually freed.
Because allocating/freeing regions may require locks that need to
wait on other processors for completion, this needs to be delayed
until it's safer. Otherwise it is possible to deadlock because we're
holding the global heap lock.
By being a bit too greedy and only allocating how much we need for
the failing allocation, we can end up in an infinite loop trying
to expand the heap further. That's because there are other allocations
(e.g. logging, vmobjects, regions, ...) that happen before we finally
retry the failed allocation request.
Also fix allocating in page size increments, which lead to an assertion
when the heap had to grow more than the 1 MiB backup.
It may be impossible to allocate more backup memory after expanding
the heap if memory is running low. In that case we wouldn't allocate
backup memory until trying to expand the heap again. But we also
wouldn't take advantage of using removed memory as backup, which means
that no backup memory would be available when the heap needs to grow
again, causing subsequent expansion to fail because there is no
backup memory.
The process of expanding memory requires allocations and deallocations
on the heap itself. So, while we're trying to expand the heap, don't
remove memory just because we might briefly not need it. Also prevent
recursive expansion attempts.
Add an ExpandableHeap and switch kmalloc to use it, which allows
for the kmalloc heap to grow as needed.
In order to make heap expansion to work, we keep around a 1 MiB backup
memory region, because creating a region would require space in the
same heap. This means, the heap will grow as soon as the reported
utilization is less than 1 MiB. It will also return memory if an entire
subheap is no longer needed, although that is rarely possible.
Rather than hardcoding where the kmalloc pool should be, place
it at the end of the kernel image instead. This avoids corrupting
global variables or other parts of the kernel as it grows.
Fixes#3257
Rather than hardcoding where the kmalloc pool should be, place
it at the end of the kernel image instead. This avoids corrupting
global variables or other parts of the kernel as it grows.
Fixes#3257
The SI prefixes "k", "M", "G" mean "10^3", "10^6", "10^9".
The IEC prefixes "Ki", "Mi", "Gi" mean "2^10", "2^20", "2^30".
Let's use the correct name, at least in code.
Only changes the name of the constants, no other behavior change.
By having a separate list of constructors for the kernel heap
code, we can properly use constructors without re-running them
after the heap was already initialized. This solves some problems
where values were wiped out because they were overwritten by
running their constructors later in the initialization process.
We can now properly initialize all processors without
crashing by sending SMP IPI messages to synchronize memory
between processors.
We now initialize the APs once we have the scheduler running.
This is so that we can process IPI messages from the other
cores.
Also rework interrupt handling a bit so that it's more of a
1:1 mapping. We need to allocate non-sharable interrupts for
IPIs.
This also fixes the occasional hang/crash because all
CPUs now synchronize memory with each other.
We need to be able to prevent a WaitQueue from being
modified by another CPU. So, add a SpinLock to it.
Because this pushes some other class over the 64 byte
limit, we also need to add another 128-byte bucket to
the slab allocator.
We were getting a little overly memey in some places, so let's scale
things back to business-casual.
Informal language is fine in comments, commits and debug logs,
but let's keep the runtime nice and presentable. :^)
This was supposed to be the foundation for some kind of pre-kernel
environment, but nobody is working on it right now, so let's move
everything back into the kernel and remove all the confusion.
Also, duplicate data in dbg() and klog() calls were removed.
In addition, leakage of virtual address to kernel log is prevented.
This is done by replacing kprintf() calls to dbg() calls with the
leaked data instead.
Also, other kprintf() calls were replaced with klog().
Each allocation header was tracking its index into the chunk bitmap,
but that index can be computed from the allocation address anyway.
Removing this means that each allocation gets 4 more bytes of memory
and this avoids allocating an extra chunk in many cases. :^)
Since we scrub both kmalloc() and kfree() with predictable values, we
can log a helpful message when hitting a crash that looks like it might
be a dereference of such scrubbed data.
Memory validation is used to verify that user syscalls are allowed to
access a given memory range. Ring 0 threads never make syscalls, and
so will never end up in validation anyway.
The reason we were allowing kmalloc memory accesses is because kernel
thread stacks used to be allocated in kmalloc memory. Since that's no
longer the case, we can stop making exceptions for kmalloc in the
validation code.
As suggested by Joshua, this commit adds the 2-clause BSD license as a
comment block to the top of every source file.
For the first pass, I've just added myself for simplicity. I encourage
everyone to add themselves as copyright holders of any file they've
added or modified in some significant way. If I've added myself in
error somewhere, feel free to replace it with the appropriate copyright
holder instead.
Going forward, all new source files should include a license header.
The kernel and its static data structures are no longer identity-mapped
in the bottom 8MB of the address space, but instead move above 3GB.
The first 8MB above 3GB are pseudo-identity-mapped to the bottom 8MB of
the physical address space. But things don't have to stay this way!
Thanks to Jesse who made an earlier attempt at this, it was really easy
to get device drivers working once the page tables were in place! :^)
Fixes#734.
Turns out we can use abi::__cxa_demangle() for this, and all we need to
provide is sprintf(), realloc() and free(), so this patch exposes them.
We now have fully demangled C++ backtraces :^)
The kernel is now no longer identity mapped to the bottom 8MiB of
memory, and is now mapped at the higher address of `0xc0000000`.
The lower ~1MiB of memory (from GRUB's mmap), however is still
identity mapped to provide an easy way for the kernel to get
physical pages for things such as DMA etc. These could later be
mapped to the higher address too, as I'm not too sure how to
go about doing this elegantly without a lot of address subtractions.
Now that the kernel supports startup-time constructors, we were first
doing slab_alloc_init(), and then the constructors ran later on,
zeroing out the freelist pointers.
This meant that all slab allocators thought they were completelty
exhausted and forwarded all requests to kmalloc() instead.
Move the kernel image to the 1 MB physical mark. This prevents it from
colliding with stuff like the VGA memory. This was causing us to end
up with the BIOS screen contents sneaking into kernel memory sometimes.
This patch also bumps the kmalloc heap size from 1 MB to 3 MB. It's not
the perfect permanent solution (obviously) but it should get the OOM
monkey off our backs for a while.
This is obviously not ideal, and it would be better to teach it how to
allocate more pages, etc. But since the physical page allocator itself
currently uses SlabAllocator, it's a little bit tricky :^)
We need these for PhysicalPage objects. Ultimately I'd like to get rid
of these objects entirely, but while we still have to deal with them,
let's at least handle large demand a bit better.
This is a freelist allocator with static size classes that works as a
complement to the generic kmalloc(). It's a lot faster than kmalloc()
since allocation just means popping from the freelist.
It's also significantly more compact when there are a lot of objects
smaller than the minimum kmalloc chunk size (32 bytes.)
This patch enables it for the Region and PhysicalPage classes.
In the PhysicalPage (8 bytes) case, it's a huge improvement since we
no longer waste 75% of the storage allocated.
There are also a number of ways this can be improved, so let's keep
working on it going forward.