Commit Graph

Andreas Kling
62c45850e1 Kernel: Page allocation should not use memset_user() when zeroing
We're not zeroing new pages through a userspace address, so this should
not use memset_user().
2020-01-10 10:57:33 +01:00
Andreas Kling
197e73ee31 Kernel+LibELF: Enable SMAP protection during non-syscall exec()
When loading a new executable, we now map the ELF image in kernel-only
memory and parse it there. Then we use copy_to_user() when initializing
writable regions with data from the executable.

Note that the exec() syscall still disables SMAP protection and will
require additional work. This patch only affects kernel-originated
process spawns.
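
As a sketch of that pattern (the ELF program header fields are standard;
region_base and kernel_elf_image are illustrative names, not the actual
code):

    // Illustrative only: the image is mapped kernel-only, so user-visible
    // memory is touched exclusively through the user-copy helpers.
    static void initialize_writable_region(Elf32_Phdr const& ph,
                                           uint8_t* region_base,
                                           uint8_t const* kernel_elf_image)
    {
        copy_to_user(region_base, kernel_elf_image + ph.p_offset, ph.p_filesz);
        // Zero the BSS tail of the segment through the user mapping as well.
        memset_user(region_base + ph.p_filesz, 0, ph.p_memsz - ph.p_filesz);
    }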
2020-01-10 10:57:06 +01:00
Andreas Kling
8e7420ddf2 Kernel: Harden memory mapping of the kernel image
We now map the kernel's text and rodata segments read+execute.
We also make the data and bss segments non-executable.

Thanks to q3k for the idea! :^)
2020-01-06 13:55:39 +01:00
Andreas Kling
9eef39d68a Kernel: Start implementing x86 SMAP support
Supervisor Mode Access Prevention (SMAP) is an x86 CPU feature that
prevents the kernel from accessing userspace memory. With SMAP enabled,
trying to read/write a userspace memory address while in the kernel
will now generate a page fault.

Since it's sometimes necessary to read/write userspace memory, there
are two new instructions that quickly switch the protection on/off:
STAC (disables protection) and CLAC (enables protection.)
These are exposed in kernel code via the stac() and clac() helpers.

There's also a SmapDisabler RAII object that can be used to ensure
that you don't forget to re-enable protection before returning to
userspace code.

This patch also adds copy_to_user(), copy_from_user() and memset_user()
which are the "correct" way of doing things. These functions allow us
to briefly disable protection for a specific purpose, and then turn it
back on immediately after it's done. Going forward, all kernel code
should be moved to using these, and all uses of SmapDisabler are to be
considered FIXMEs.

Note that we're not realizing the full potential of this feature since
I've used SmapDisabler quite liberally in this initial bring-up patch.
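
For illustration, the helpers boil down to something like this (a sketch,
not the kernel's exact definitions):

    // STAC/CLAC toggle EFLAGS.AC, which gates supervisor access to user
    // pages while CR4.SMAP is set.
    inline void stac() { asm volatile("stac" ::: "cc"); } // allow access
    inline void clac() { asm volatile("clac" ::: "cc"); } // forbid access

    // RAII guard: protection is re-enabled even on early returns.
    class SmapDisabler {
    public:
        SmapDisabler() { stac(); }
        ~SmapDisabler() { clac(); }
    };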
2020-01-05 18:14:51 +01:00
Andreas Kling
aba7829724 Kernel: InodeVMObject can't call Inode::size() with interrupts disabled
Inode::size() may try to take a lock, so we can't be calling it with
interrupts disabled.

This fixes a kernel hang when trying to execute a binary in a TmpFS.
2020-01-03 15:40:03 +01:00
Andreas Kling
0f9800ca57 Kernel: Make the loop that marks the bottom 1MB NX a little less busy 2020-01-02 22:02:29 +01:00
Andreas Kling
32ec1e5aed Kernel: Mask kernel addresses in backtraces and profiles
Addresses outside the userspace virtual range will now show up as
0xdeadc0de in backtraces and profiles generated by unprivileged users.
2020-01-02 20:51:31 +01:00
Andreas Kling
3dcec260ed Kernel: Validate the full range of user memory passed to syscalls
We now validate the full range of userspace memory passed into syscalls
instead of just checking that the first and last byte of the memory are
in process-owned regions.

This fixes an issue where it was possible to avoid rejection of invalid
addresses that sat between two valid ones, simply by passing a valid
address and a size large enough to put the end of the range at another
valid address.

I added a little test utility that tries to provoke EFAULT in various
ways to help verify this. I'm sure we can think of more ways to test
this but it's at least a start. :^)

Thanks to mozjag for pointing out that this code was still lacking!

Incidentally this also makes backtraces work again.

Fixes #989.
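
A sketch of what page-granular validation looks like (the per-page helper
is an illustrative name):

    static bool validate_user_range(Process& process, uint32_t base, size_t size)
    {
        if (size == 0)
            return true;
        if (base + size < base) // range wraps around the address space
            return false;
        uint32_t first_page = base & ~(uint32_t)(PAGE_SIZE - 1);
        uint32_t last_page = (base + size - 1) & ~(uint32_t)(PAGE_SIZE - 1);
        for (uint32_t page = first_page;; page += PAGE_SIZE) {
            // Every page must fall inside a process-owned, user-accessible
            // region -- not just the first and last one.
            if (!process.validate_user_page(page))
                return false;
            if (page == last_page)
                return true;
        }
    }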
2020-01-02 02:17:12 +01:00
Andreas Kling
ea1911b561 Kernel: Share code between Region::map() and Region::remap_page()
These were doing mostly the same things, so let's just share the code.
2020-01-01 19:32:55 +01:00
Andreas Kling
5aeaab601e Kernel: Move CPU feature detection to Arch/x86/CPU.{cpp,h}
We now refuse to boot on machines that don't support PAE since all
of our paging code depends on it.

Also let's only enable SSE and PGE support if the CPU advertises it.
2020-01-01 12:57:00 +01:00
Andreas Kling
8602fa5b49 Kernel: Enable x86 SMEP (Supervisor Mode Execution Protection)
This prevents the kernel from jumping to code in userspace memory.
2020-01-01 01:59:52 +01:00
Andreas Kling
c9ec415e2f Kernel: Always reject never-userspace addresses before checking regions
At the moment, addresses below 8MB and above 3GB are never accessible
to userspace, so just reject them without even looking at the current
process's memory regions.
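
In sketch form, using the bounds from this message:

    #include <stdint.h>

    // Fast path: reject before any region lookup.
    static bool is_never_userspace_address(uint32_t vaddr)
    {
        constexpr uint32_t MB = 1024 * 1024;
        return vaddr < 8 * MB || vaddr >= 0xc0000000; // below 8MB, at/above 3GB
    }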
2019-12-31 03:45:54 +01:00
Andreas Kling
66d5ebafa6 Kernel: Let's also not consider kernel regions to be valid user stacks
This one is less obviously exploitable than the previous one, but still
a bug nonetheless.
2019-12-31 00:28:14 +01:00
Andreas Kling
0fc24fe256 Kernel: User pointer validation should reject kernel-only addresses
We were happily allowing syscalls with pointers into kernel-only
regions (virtual address >= 0xc0000000).

This patch fixes that by only considering user regions in the current
process, and also double-checking the Region::is_user_accessible() flag
before approving an access.

Thanks to Fire30 for finding the bug! :^)
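
Sketched out (the region lookup helper is an illustrative name;
is_user_accessible() is the flag named above):

    auto* region = process.region_containing(vaddr); // user regions only
    if (!region)
        return false; // not in any region of the current process
    return region->is_user_accessible(); // double-check the per-region flag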
2019-12-31 00:24:35 +01:00
Andreas Kling
1f31156173 Kernel: Add a mode flag to sys$purge and allow purging clean inodes 2019-12-29 13:16:53 +01:00
Andreas Kling
c74cde918a Kernel+SystemMonitor: Expose amount of per-process clean inode memory
This is memory that's loaded from an inode (file) but not modified in
memory, so still identical to what's on disk. This kind of memory can
be freed and reloaded transparently from disk if needed.
2019-12-29 12:45:58 +01:00
Andreas Kling
0d5e0e4cad Kernel+SystemMonitor: Expose amount of per-process dirty private memory
Dirty private memory is all memory in non-inode-backed mappings that's
process-private, meaning it's not shared with any other process.

This patch exposes that number via SystemMonitor, giving us an idea of
how much memory each process is responsible for all on its own.
2019-12-29 12:28:32 +01:00
Andreas Kling
c1f8291ce4 Kernel: When physical page allocation fails, try to purge something
Instead of panicking right away when we run out of physical pages,
we now try to find a PurgeableVMObject with some volatile pages in it.
If we find one, we purge that entire object and steal one of its pages.

This makes it possible for the kernel to keep going instead of dying.
Very cool. :^)
2019-12-26 11:45:36 +01:00
Conrad Pankoff
17aef7dc99 Kernel: Detect support for no-execute (NX) CPU features
Previously we assumed all hosts would have support for IA32_EFER.NXE.
This is mostly true for newer hardware, but older hardware will crash
and burn if you try to use this feature.

Now we check for support via CPUID.80000001[20].
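
The check itself is a one-liner against CPUID leaf 0x80000001; roughly:

    #include <stdint.h>

    // NX support is advertised in bit 20 of EDX for CPUID leaf 0x80000001.
    static bool cpu_supports_nx()
    {
        uint32_t eax, ebx, ecx, edx;
        asm volatile("cpuid"
                     : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     : "a"(0x80000001));
        return (edx >> 20) & 1;
    }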
2019-12-26 10:05:51 +01:00
Andreas Kling
9e55bcb7da Kernel: Make kernel memory regions be non-executable by default
From now on, you'll have to request executable memory specifically
if you want some.
2019-12-25 22:41:34 +01:00
Andreas Kling
0b7a2e0a5a Kernel: Set NX bit for virtual addresses 0-1MB and 2-8MB
This removes the ability to jump into kmalloc memory, etc.
Only the kernel image itself is allowed to exec, located between 1-2MB.
2019-12-25 22:24:28 +01:00
Andreas Kling
ce5f7f6c07 Kernel: Use the CPU's NX bit to enforce PROT_EXEC on memory mappings
Now that we have PAE support, we can ask the CPU to crash processes for
trying to execute non-executable memory. This is pretty cool! :^)
2019-12-25 13:35:57 +01:00
Andreas Kling
52deb09382 Kernel: Enable PAE (Physical Address Extension)
Introduce one more (CPU) indirection layer in the paging code: the page
directory pointer table (PDPT). Each PageDirectory now has 4 separate
PageDirectoryEntry arrays, governing 1 GB of VM each.

A really neat side-effect of this is that we can now share the physical
page containing the >=3GB kernel-only address space metadata between
all processes, instead of lazily cloning it on page faults.

This will give us access to the NX (No eXecute) bit, allowing us to
prevent execution of memory that's not supposed to be executed.
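
For reference, a 32-bit virtual address decomposes like this under PAE
(the standard layout, sketched):

    #include <stdint.h>

    //   bits 31:30 -> PDPT index      (4 entries, 1 GB each)
    //   bits 29:21 -> page directory  (512 entries, 2 MB each)
    //   bits 20:12 -> page table      (512 entries, 4 KB each)
    //   bits 11:0  -> offset within the page
    constexpr unsigned pdpt_index(uint32_t vaddr) { return (vaddr >> 30) & 0x3; }
    constexpr unsigned pd_index(uint32_t vaddr) { return (vaddr >> 21) & 0x1ff; }
    constexpr unsigned pt_index(uint32_t vaddr) { return (vaddr >> 12) & 0x1ff; }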
2019-12-25 13:35:57 +01:00
Andreas Kling
c087abc48d Kernel: Rename PageDirectory::find_by_pdb() => find_by_cr3()
I caught myself wondering what "pdb" stood for, so let's rename this
to something more obvious.
2019-12-25 02:58:03 +01:00
Andreas Kling
7a0088c4d2 Kernel: Clean up Region access bit setters a little 2019-12-25 02:58:03 +01:00
Andreas Kling
c9a5253ac2 Kernel: Uh, actually *actually* turn on CR4.PGE
I'm not sure how I managed to misread the location of this bit twice.
But I did! Here is finally the correct value, according to Intel:

    "Page Global Enable (bit 7 of CR4)"

Jeez! :^)
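
In other words (a 32-bit sketch):

    // CR4.PGE is bit 7, i.e. mask 0x80.
    asm volatile(
        "mov %%cr4, %%eax\n"
        "orl $0x80, %%eax\n"
        "mov %%eax, %%cr4\n" ::: "eax");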
2019-12-25 02:58:03 +01:00
Andreas Kling
3623e35978 Kernel: Oops, actually enable CR4.PGE (page table global bit)
Turns out we were setting the wrong bit here. Now we will actually keep
kernel memory mappings in the TLB across context switches.
2019-12-24 22:45:27 +01:00
Andreas Kling
ae2d72377d Kernel: Enable the x86 WP bit to catch invalid memory writes in ring 0
Setting this bit will cause the CPU to generate a page fault when
writing to read-only memory, even if we're executing in the kernel.

Seemingly the only change needed to make this work was to have the
inode-backed page fault handler use a temporary mapping for writing
the read-from-disk data into the newly-allocated physical page.
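
Setting the bit is a sketchable one-liner (CR0.WP is bit 16):

    asm volatile(
        "mov %%cr0, %%eax\n"
        "orl $0x00010000, %%eax\n" // CR0.WP
        "mov %%eax, %%cr0\n" ::: "eax");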
2019-12-21 16:21:13 +01:00
Andreas Kling
62c2309336 Kernel: Fix some warnings about passing non-POD to kprintf 2019-12-20 20:19:46 +01:00
Andreas Kling
b6ee8a2c8d Kernel: Rename vmo => vmobject everywhere 2019-12-19 19:15:27 +01:00
Andreas Kling
1d4d6f16b2 Kernel: Add a specific-page variant of Region::commit() 2019-12-18 22:43:32 +01:00
Andreas Kling
0a75a46501 Kernel: Make sure the kernel info page is read-only for userspace
To enforce this, we create two separate mappings of the same underlying
physical page. A writable mapping for the kernel, and a read-only one
for userspace (the one returned by sys$get_kernel_info_page.)
2019-12-15 22:21:28 +01:00
Andreas Kling
931e4b7f5e Kernel+SystemMonitor: Prevent userspace access to process ELF image
Every process keeps its own ELF executable mapped in memory in case we
need to do symbol lookup (for backtraces, etc.)

Until now, it was mapped in a way that made it accessible to the
program, despite the program not having mapped it itself.
I don't really see a need for userspace to have access to this right
now, so let's lock things down a little bit.

This patch makes it inaccessible to userspace and exposes that fact
through /proc/PID/vm (per-region "user_accessible" flag.)
2019-12-15 20:11:57 +01:00
Andreas Kling
05a441afb2 Kernel: Don't turn private read-only regions into shared ones on fork
Even if they are read-only now, they can be mprotect(PROT_WRITE)'d in
the future, so we have to make sure they are CoW mapped.
2019-12-15 16:53:46 +01:00
Andreas Kling
3fbc50a350 Kernel+SystemMonitor: Expose the number of set CoW bits in each Region
This number tells us how many more pages in a given region will trigger
a CoW fault if written to.
2019-12-15 16:53:00 +01:00
Andreas Kling
9ad151c665 Kernel: Improve comment about the system virtual memory map a bit 2019-12-15 16:13:08 +01:00
Andreas Kling
65229a4082 Kernel: Move VMObject::for_each_region() to MemoryManager.h
It can't be in VMObject.h since it depends on MemoryManager.h
2019-12-09 20:06:03 +01:00
Andreas Kling
a22b7f96fc Kernel: Remap all regions referring to a PurgeableVMObject on purge
Otherwise we won't get page faults next time you try to access the
purged memory.
2019-12-09 20:05:04 +01:00
Andreas Kling
dbb644f20c Kernel: Start implementing purgeable memory support
It's now possible to get purgeable memory by using mmap(MAP_PURGEABLE).
Purgeable memory has a "volatile" flag that can be set using madvise():

- madvise(..., MADV_SET_VOLATILE)
- madvise(..., MADV_SET_NONVOLATILE)

When in the "volatile" state, the kernel may take away the underlying
physical memory pages at any time, without notifying the owner.
This gives you a guilt discount when caching very large things. :^)

Setting a purgeable region to non-volatile will return whether or not
the memory has been taken away by the kernel while being volatile.
Basically, if madvise(..., MADV_SET_NONVOLATILE) returns 1, that means
the memory was purged while volatile, and whatever was in that piece
of memory needs to be reconstructed before use.
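
A hypothetical usage example (whether MAP_ANONYMOUS | MAP_PRIVATE must
accompany MAP_PURGEABLE is an assumption here; error handling elided):

    #include <sys/mman.h>

    void* cache = mmap(nullptr, cache_size, PROT_READ | PROT_WRITE,
                       MAP_PURGEABLE | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    madvise(cache, cache_size, MADV_SET_VOLATILE); // kernel may purge at will

    // ... later, before touching the cache again:
    int rc = madvise(cache, cache_size, MADV_SET_NONVOLATILE);
    if (rc == 1) {
        // Contents were purged while volatile; rebuild the cache.
    }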
2019-12-09 19:12:38 +01:00
Andreas Kling
05c65fb4f1 Kernel: Don't CoW non-writable pages
A page fault in a page marked for CoW should not trigger a CoW if the
page is non-writable. I think this makes sense.
2019-12-02 19:20:09 +01:00
Andreas Kling
f41ae755ec Kernel: Crash on memory access in non-readable regions
This patch makes it possible to make memory regions non-readable.
This is enforced using the "present" bit in the page tables.
A process that hits a not-present page fault in a non-readable
region will be crashed.
2019-12-02 19:18:52 +01:00
Andreas Kling
7dc9c90f83 Kernel: Fix bug where mprotect() would ignore setting PROT_WRITE
A typo in Region::set_writable() caused us to update the readable flag
rather than the writable flag.
2019-12-02 18:15:36 +01:00
Andreas Kling
cde0a1eeb5 Kernel: Put some debug spam behind PAGE_FAULT_DEBUG 2019-12-01 16:03:24 +01:00
Andreas Kling
e56daf547c Kernel: Disallow syscalls from writeable memory
Processes will now crash with SIGSEGV if they attempt making a syscall
from PROT_WRITE memory.

This neat idea comes from OpenBSD. :^)
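
Sketched, the check at syscall entry looks something like this (the lookup
helper and crash path are illustrative names):

    // Refuse syscalls whose caller's EIP lies in PROT_WRITE memory.
    auto* calling_region = process.region_containing(VirtualAddress(regs.eip));
    if (!calling_region || calling_region->is_writable())
        process.crash(SIGSEGV); // illustrative crash path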
2019-11-29 16:30:05 +01:00
Andreas Kling
2d1bcce34a Kernel: Fix triple-fault when clicking on SystemServer in SystemMonitor
The fault was happening when retrieving a current backtrace for the
SystemServer process.

To generate a backtrace, we go into the paging scope of the process,
meaning we temporarily switch to using its page directory as our own.

Because kernel VM is allocated on demand, it's possible for a process's
mappings above the 3GB mark to be out-of-date. Normally this just gets
fixed up transparently by the page fault handler (which simply copies
the PDE from the canonical MM.kernel_page_directory() into the current
process.)

However, if the current kernel *stack* is in a piece of memory that
the backtraced process lacks up-to-date PDE's for, we still get a page
fault, but are unable to handle it, since the CPU wants to push to the
stack as part of calling the page fault handler. So we're screwed and
it's a triple-fault.

Fix this by always updating the kernel VM mappings before switching
into a paging scope. In practical terms, this is a 1KB memcpy() that
happens when generating a backtrace, or doing exec().
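
A sketch of the fix (names are illustrative; the sizes match the classic
two-level paging in use at the time):

    // Copy the canonical kernel PDEs (the >= 3GB quarter of the address
    // space, i.e. entries 768..1023) into the target page directory, so the
    // current kernel stack is guaranteed to be mapped once we switch.
    // 256 PDEs x 4 bytes each is the ~1KB memcpy() mentioned above.
    void sync_kernel_mappings_into(PageDirectory& target)
    {
        memcpy(&target.entries()[768],
               &MM.kernel_page_directory().entries()[768],
               256 * sizeof(PageDirectoryEntry));
    }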
2019-11-27 12:40:42 +01:00
Andreas Kling
5b8cf2ee23 Kernel: Make syscall counters and page fault counters per-thread
Now that we show individual threads in SystemMonitor and "top",
it's also very nice to have individual counters for the threads. :^)
2019-11-26 21:37:38 +01:00
Andreas Kling
3dc87be891 Kernel: Mark mmap()-created regions with a special bit
Then only allow regions with that bit to be manipulated via munmap()
and mprotect(). This prevents messing with non-mmap()ed regions in
a process's address space (stacks, shared buffers, ...)
2019-11-24 12:26:21 +01:00
Andreas Kling
9a157b5e81 Revert "Kernel: Move Kernel mapping to 0xc0000000"
This reverts commit bd33c66273.

This broke the network card drivers, since they depended on kmalloc
addresses being identity-mapped.
2019-11-23 17:27:09 +01:00
Jesse Buhagiar
bd33c66273 Kernel: Move Kernel mapping to 0xc0000000
The kernel is now no longer identity mapped to the bottom 8MiB of
memory, and is now mapped at the higher address of `0xc0000000`.

The lower ~1MiB of memory (from GRUB's mmap), however, is still
identity-mapped to give the kernel an easy way to get physical pages
for things such as DMA. These could later be mapped at the higher
address too, though I'm not sure how to do that elegantly without
a lot of address subtractions.
2019-11-22 16:23:23 +01:00
Andreas Kling
794758df3a Kernel: Implement some basic stack pointer validation
VM regions can now be marked as stack regions, which is then validated
on syscall, and on page fault.

If a thread is caught with its stack pointer pointing into anything
that's *not* a Region with its stack bit set, we'll crash the whole
process with SIGSTKFLT.

Userspace must now allocate custom stacks by using mmap() with the new
MAP_STACK flag. This mechanism was first introduced in OpenBSD, and now
we have it too, yay! :^)
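
Hypothetical userspace usage (size and the extra flags are illustrative):

    #include <sys/mman.h>

    void* stack = mmap(nullptr, 64 * 1024, PROT_READ | PROT_WRITE,
                       MAP_STACK | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    // A thread's stack pointer must stay inside a MAP_STACK region, or the
    // whole process is crashed with SIGSTKFLT as described above.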
2019-11-17 12:15:43 +01:00