- Use the receiver's per-CPU entry in the message, instead of the
sender's. (Using the sender's entry wasn't safe for broadcast
messages since the same entry ended up on multiple message queues.)
- Retry the CAS until it *succeeds* instead of *fails*. This closes a
race window, and also ensures a correct return value. The return value
is used by the caller to decide whether to broadcast an IPI.
This was the main reason smp=on was so slow. We had CPUs busy-waiting
until someone else triggered an IPI and moved things along.
- Add a CPU pause hint to the spin loop. :^)
It may happen that CPU A manages to page in from the same inode
while we're just entering the same page fault handler on CPU B.
Handle it gracefully by checking if the data has already been paged in
(instead of VERIFY'ing that it hasn't) and then remap the page if that's
the case.
Due to a boolean mistake in smp_return_to_pool(), we didn't retry
pushing the message onto the freelist after a failed attempt.
This caused the message pool to eventually become completely empty
after enough contentious access attempts.
This patch also adds a pause hint to the CPU in the failed attempt
code path.
The kernel panic handler now parses the kernels boot_mode to decide how
to handle the panic. So the previous logic could end up in an panic loop
until we blew out the kernel stack.
Instead only validate the kernel's boot mode once per boot, after
initializing the kernel command line.
Taking a reference or a pointer to a value that's not aligned properly
is undefined behavior. While `[[gnu::packed]]` ensures that reads from
and writes to fields of packed structs is a safe operation, the
information about the reduced alignment is lost when creating pointers
to these values.
Weirdly enough, GCC's undefined behavior sanitizer doesn't flag these,
even though the doc of `-Waddress-of-packed-member` says that it usually
leads to UB. In contrast, x86_64 Clang does flag these, which renders
the 64-bit kernel unable to boot.
For now, the `address-of-packed-member` warning will only be enabled in
the kernel, as it is absolutely crucial there because of KUBSAN, but
might get excessively noisy for the userland in the future.
Also note that we can't append to `CMAKE_CXX_FLAGS` like we do for other
flags in the kernel, because flags added via `add_compile_options` come
after these, so the `-Wno-address-of-packed-member` in the root would
cancel it out.
Kernels built with Clang seem to be quite allocation-heavy compared to
their GCC counterparts. We would sometimes end up crashing during boot
because the eternal ranges had no free capacity.
The Clang error message reads like this (`-Wdeprecated-array-compare`):
> error: comparison between two arrays is deprecated; to compare
> array addresses, use unary '+' to decay operands to pointers.
The maximum number of virtual consoles is determined at compile time,
so we can pre-allocate that many slots, dodging some heap allocations.
Furthermore, virtual consoles are never destroyed, so it's fine to
simply store a raw pointer to the currently active one.
This commit implements the ISO 9660 filesystem as specified in ECMA 119.
Currently, it only supports the base specification and Joliet or Rock
Ridge support is not present. The filesystem will normalize all
filenames to be lowercase (same as Linux).
The filesystem can be mounted directly from a file. Loop devices are
currently not supported by SerenityOS.
Special thanks to Lubrsi for testing on real hardware and providing
profiling help.
Co-Authored-By: Luke <luke.wilde@live.co.uk>
I had to move the thread list out of the protected base area of Process
so that it could live with its lock (which needs to be mutable).
Ideally it would live in the protected area, so maybe we can figure out
a way to do that later.
When I laid down the foundation for the start of the big process lock
separation, I added asserts to all system call implementations to
validate we hold the big process lock in the locations we think we
should be.
Adding that assert to sys$profiling_enable broke boot time profiling as
we were never holding the lock on boot. Even though it's not technically
required, lets make sure to hold the lock while enabling to appease the
assert.
VirtualConsole::m_lock was only used in a single place: on_tty_write()
That function is already protected by a global lock, so this second
lock served no purpose whatsoever.
Note: TCPSocket::create_client() has a dubious locking process where
the sockets by tuple table is first shared lock to check if the socket
exists and bail out if it does, then unlocks, then exclusively locks to
add the tuple. There could be a race condition where two client
creation requests for the same tuple happen at the same time and both
cleared the shared lock check. When in doubt, lock exclusively the
whole time.
A protected value is a variable with enforced locking semantics. The
value is protected with a Mutex and can only be accessed through a
Locked object that holds a MutexLocker to said Mutex. Therefore, the
value itself cannot be accessed except through the proper locking
mechanism, which enforces correct locking semantics.
The Locked object has knowledge of shared and exclusive lock types and
will only return const-correct references and pointers. This should
help catch incorrect locking usage where a shared lock is acquired but
the user then modifies the locked value.
This is not a perfect solution because dereferencing the Locked object
returns the value, so the caller could defeat the protected value
semantics once it acquires a lock by keeping a pointer or a reference
to the value around. Then again, this is C++ and we can't protect
against malicious users from within the kernel anyways, but we can
raise the threshold above "didn't pay attention".
This is some syntaxic sugar to use a ContendedResource object with
reference counting. This effectively dissociates merely holding a
reference to an object and actually using the object in a way that
requires locking it against concurrent use.
This syscall only reads from the shared m_space field, but that field
is only over written to by Process::attach_resources, before the
process was initialized (aka, before syscalls can happen), by
Process::finalize which is only called after all the process' threads
have exited (aka, syscalls can not happen anymore), and by
Process::do_exec which calls all other syscall-capable threads before
doing so. Space's find_region_containing already holds its own lock,
and as such there's no need to hold the big lock.
This syscall doesn't touch any intra-process shared resources and only
accesses the time via the atomic TimeManagement::now so there's no need
to hold the big lock.
This syscall doesn't touch any intra-process shared resources and only
accesses the time via the atomic TimeManagement::current_time so there's
no need to hold the big lock.
This syscall doesn't touch any intra-process shared resources and
reads the time via the atomic TimeManagement::current_time, so it
doesn't need to hold any lock.
When booting AP's, we identity map a region at 0x8000 while doing the
initial bringup sequence. This is the only thing in the kernel that
requires an identity mapping, yet we had a bunch of generic API's and a
dedicated VirtualRangeAllocator in every PageDirectory for this purpose.
This patch simplifies the situation by moving the identity mapping logic
to the AP boot code and removing the generic API's.
...and also RangeAllocator => VirtualRangeAllocator.
This clarifies that the ranges we're dealing with are *virtual* memory
ranges and not anything else.
Now that all KResult and KResultOr are used consistently throughout the
kernel, it's no longer necessary to return negative error codes.
However, we were still doing that in some places, so let's fix all those
(bugs) by removing the minuses. :^)
We were allocating thread FPU state separately in order to ensure a
16-byte alignment. There should be no need to do that.
This patch makes it a regular value member of Thread instead, dodging
one heap allocation during thread creation.
This patch gets rid of the "valid" bit in PageDirectory since it was
only used to communicate an allocation failure during construction.
We now do all the work in the static factory functions instead of in the
constructor, which allows us to simply return nullptr instead of an
"invalid" PageDirectory.
When constructing an AnonymousVMObject with the AllocateNow allocation
strategy we accidentally allocated the committed pages directly through
MemoryManager instead of taking them from our m_unused_physical_pages
CommittedPhysicalPageSet, which meant they were counted as allocated in
MemoryManager, but were still counted as unallocated in the PageSet,
who would then try to uncommit them on destruction, resulting in a
failed assertion.
To help prevent similar issues in the future a Badge<T> was added to
MM::allocate_committed_user_physical_page to prevent allocation of
commited pages not via a CommittedPhysicalPageSet.
When we hit a COW fault and discover than no other process is sharing
the physical page, we simply remap it r/w and save ourselves the
trouble. When this happens, we can also give back (uncommit) one of our
shared committed COW pages, since we won't be needing it.
We had this optimization before, but I mistakenly removed it in
50472fd69f since I had misunderstood
it to be the reason for a panic.
We currently overcommit for COW when forking a process and cloning its
memory regions. Both the parent and child process share a set of.
committed COW pages.
If there's COW sharing across more than two processeses within a lineage
(e.g parent, child & grandchild), it's possible to exhaust these pages.
When the shared set is emptied, the next COW fault in each process must
detach from the shared set and fall back to on demand allocation.
This patch makes sure that we detach from the shared set once we
discover it to be empty (during COW fault handling). This fixes an issue
where we'd try to allocate from an exhausted shared set while building
GNU binutils inside SerenityOS.
It was doing a bunch of things it didn't need to do. I think we had
misunderstood the base class as having copied m_lock in its copy
constructor but it's actually default initialized.
We had issues with committed physical pages getting miscounted in some
situations, and instead of figuring out what was going wrong and making
sure all the commits had matching uncommits, this patch makes the
problem go away by adding an RAII class to manage this instead. :^)
MemoryManager::commit_user_physical_pages() now returns an (optional)
CommittedPhysicalPageSet. You can then allocate pages from the page set
by calling take_one() on it. Any unallocated pages are uncommitted upon
destruction of the page set.
Since we only ever add 1 super physical region, theres no reason to
add the extra redirection and allocation that comes with a dynamically
sized Vector.
We try to read twice from the RTC CMOS registers, and if the values are
not the same for 5 attempts, we know there's a malfunction with the
hardware so we declare these values as bogus in the kernel log.
When we try to query the time from the RTC CMOS, we try to check if
the CMOS is updated. If it is updated for a long period of time (as a
result of hardware malfunction), break the loop and return Unix epoch
time.
Previously we would just print "ASSERTION FAILED: false", which was
kinda cryptic and also didn't make it clear whether this was a TODO or
an unreachable condition. Now, we actually print "ASSERTION FAILED:
TODO", making it crystal clear.
Create the disk cache up front, so we can verify it succeeds.
Make the KBuffer allocation fail-able, so we can properly handle
failure when the user asks up to mount a Ext2 filesystem under
OOM conditions.
The IPv4Socket requires a DoubleBuffer for storage of any data it
received on the socket. However it was previously using the default
constructor which can not observe allocation failure. Address this by
plumbing the receive buffer through the various derived classes.
LocalSockets keep a DoubleBuffer for both client and server usage.
This change converts the usage from using the default constructor
which is unable to observe OOM, to the new try_create factory and
plumb the result through the constructor.
We need to expose the ability for DoubleBuffer creation to expose
failure, as DoubleBuffer depends on KBuffer, which also has to be able
to expose failure during OOM.
We will remove the non OOM API once all users have been converted.
The sys$alarm() syscall has logic to cache a m_alarm_timer to avoid
allocating a new timer for every call to alarm. Unfortunately that
logic was broken, and there were conditions in which we could have
a timer allocated, but it was no longer on the timer queue, and we
would attempt to cancel that timer again resulting in an infinite
loop waiting for the timers callback to fire.
To fix this, we need to track if a timer is currently in use or not,
allowing us to avoid attempting to cancel inactive timers.
Luke and Tom did the initial investigation, I just happened to have
time to write a repro and attempt a fix, so I'm adding them as the
as co-authors of this commit.
Co-authored-by: Luke <luke.wilde@live.co.uk>
Co-authored-by: Tom <tomut@yahoo.com>
Problem:
- New `all_of` implementation takes the entire container so the user
does not need to pass explicit begin/end iterators. This is unused
except is in tests.
Solution:
- Make use of the new and more user-friendly version where possible.
Read the appropriate registers for RTL8139, RTL8168 and E1000.
For NE2000 just assume 10mbit full duplex as there is no indicator
for it in the pure NE2000 spec. Mock values for loopback.
Previously there was no way for Serenity to send a packet without an
established socket connection, and there was no way to appropriately
respond to a SYN packet on a non-listening port. This patch will respond
to any non-established socket attempts with the appropraite RST/ACK,
letting the client know to close the connection.
In accordance with RFC 793, if the receiver is in the SYN-SENT state
it should respond to a RST by aborting the connection and immediately
move to the CLOSED state.
Previously the system would ACK all RST/ACKs, and the remote peer would
just respond with more RST packets.
This isn't needed for Process / Thread as they only reference it
by pointer and it's already part of Kernel/Forward.h. So just include
it where the implementation needs to call it.
I was working on some more KASAN changes and realized the system
no longer links when passing -DENABLE_KERNEL_ADDRESS_SANITIZER=ON.
Prekernel will likely never have KASAN support given it's limited
environment, so just suppress it's usage.
When cloning an AnonymousVMObject, committed COW pages are shared
between the parent and child object. Whicever object COW's first will
take the shared committed page, and if the other object ends up doing
a COW as well, it will notice that the page is no longer shared by
two objects and simple remap it as read/write.
When a child is COW'ed again, while still having shared committed
pages with its own parent, the grandchild object will now join in the
sharing pool with its parent and grandparent. This means that the first
2 of 3 objects that COW will draw from the shared committed pages, and
3rd will remap read/write.
Previously, we would "fork" the shared committed pages when cloning,
which could lead to a situation where the grandparent held on to 1 of
the 3 needed shared committed pages. If both the child and grandchild
COW'ed, they wouldn't have enough pages, and since the grandparent
maintained an extra +1 ref count on the page, it wasn't possible to
to remap read/write.
We currently always crash if a user attempts to unmap a range that
does not intersect with an existing region, no matter the size. This
happens because we will never explicitly check to see if the search
for intersecting regions found anything, instead loop over the results,
which might be an empty vector. We then attempt to deallocate the
requested range from the `RangeAllocator` unconditionally, which will
be invalid if the specified range is not managed by the RangeAllocator.
We will assert validating m_total_range.contains(..) the range we are
requesting to deallocate.
This fix to this is straight forward, error out if we weren't able to
find any intersections.
You can get stress-ng to attempt this pattern with the following
arguments, which will attempt to unmap 0x0 through some large offset:
```
stress-ng --vm-segv 1
```
Fixes: #8483
Co-authored-by: Federico Guerinoni <guerinoni.federico@gmail.com>
AnonymousVMObject::set_volatile() assumes that nobody ever calls it on
non-purgeable objects, so let's make sure we don't do that.
Also return EINVAL instead of EPERM for non-anonymous VM objects so the
error codes match.
Previously it was possible to leak the file descriptor if we error out
after allocating the first descriptor. Now we perform both fd
allocations back to back so we can handle the potential error when
processing the second fd allocation.
The way the Process::FileDescriptions::allocate() API works today means
that two callers who allocate back to back without associating a
FileDescription with the allocated FD, will receive the same FD and thus
one will stomp over the other.
Naively tracking which FileDescriptions are allocated and moving onto
the next would introduce other bugs however, as now if you "allocate"
a fd and then return early further down the control flow of the syscall
you would leak that fd.
This change modifies this behavior by tracking which descriptions are
allocated and then having an RAII type to "deallocate" the fd if the
association is not setup the end of it's scope.
To add a new per-CPU data structure, add an ID for it to the
ProcessorSpecificDataID enum.
Then call ProcessorSpecific<T>::initialize() when you are ready to
construct the per-CPU data structure on the current CPU. It can then
be accessed via ProcessorSpecific<T>::get().
This patch replaces the existing hard-coded mechanisms for Scheduler
and MemoryManager per-CPU data structure.
This enables further work on implementing KASLR by adding relocation
support to the pre-kernel and updating the kernel to be less dependent
on specific virtual memory layouts.
This allows us to specify virtual addresses for things the kernel should
access via virtual addresses later on. By doing this we can make the
kernel independent from specific physical addresses.
Previously the kernel relied on a fixed offset between virtual and
physical addresses based on the kernel's load address. This allows us
to specify an independent offset.
Instead of doing a reset via triple-fault, let's just shutdown the QEMU
virtual machine because this is already a QEMU-specific handling code
for Self-Test CI mode.
In preparation for modifying the Kernel IOCTL API to return KResult
instead of int, we need to fix this ioctl to an argument to receive
it's return value, instead of using the actual function return value.
It's easy to forget the responsibility of validating and safely copying
kernel parameters in code that is far away from syscalls. ioctl's are
one such example, and bugs there are just as dangerous as at the root
syscall level.
To avoid this case, utilize the AK::Userspace<T> template in the ioctl
kernel interface so that implementors have no choice but to properly
validate and copy ioctl pointer arguments.
GCC and Clang allow us to inject a call to a function named
__sanitizer_cov_trace_pc on every edge. This function has to be defined
by us. By noting down the caller in that function we can trace the code
we have encountered during execution. Such information is used by
coverage guided fuzzers like AFL and LibFuzzer to determine if a new
input resulted in a new code path. This makes fuzzing much more
effective.
Additionally this adds a basic KCOV implementation. KCOV is an API that
allows user space to request the kernel to start collecting coverage
information for a given user space thread. Furthermore KCOV then exposes
the collected program counters to user space via a BlockDevice which can
be mmaped from user space.
This work is required to add effective support for fuzzing SerenityOS to
the Syzkaller syscall fuzzer. :^) :^)
When we get a COW fault and discover that whoever we were COW'ing
together with has either COW'ed that page on their end (or they have
unmapped/exited) we simplify life for ourselves by clearing the COW
bit and keeping the page we already have. (No need to COW if the page
is not shared!)
The act of doing this does not return a committed page to the pool.
In fact, that committed page we had reserved for this purpose was used
up (allocated) by our COW buddy when they COW'ed the page.
This fixes a kernel panic when running TestLibCMkTemp. :^)
This makes a kernel panic immediately fail the on-target CI job.
Otherwise the failed job looks like a test timeout unless one digs into
the details of the job.
We don't need an entirely separate VMObject subclass to influence the
location of the physical pages.
Instead, we simply allocate enough physically contiguous memory first,
and then pass it to the AnonymousVMObject constructor that takes a span
of physical pages.
VMObject already has an IntrusiveList of all the Regions that map it.
We were keeping a counter in addition to this, and only using it in
a single place to avoid iterating over the list in case it only had
1 entry.
Simplify VMObject by removing this counter and always iterating the
list even if there's only 1 entry. :^)
This was previously used for a single debug logging statement during
memory purging. There are no remaining users of this weak pointer,
so let's get rid of it.
If a purgeable VM object is in the "volatile" state when we're asked
to make a COW clone of it, make life simpler by simply "purging"
the cloned object right away.
This effectively means that a fork()'ed child process will discover
its purgeable+volatile regions to be empty if/when it tries making
them non-volatile.
This patch changes the semantics of purgeable memory.
- AnonymousVMObject now has a "purgeable" flag. It can only be set when
constructing the object. (Previously, all anonymous memory was
effectively purgeable.)
- AnonymousVMObject now has a "volatile" flag. It covers the entire
range of physical pages. (Previously, we tracked ranges of volatile
pages, effectively making it a page-level concept.)
- Non-volatile objects maintain a physical page reservation via the
committed pages mechanism, to ensure full coverage for page faults.
- When an object is made volatile, it relinquishes any unused committed
pages immediately. If later made non-volatile again, we then attempt
to make a new committed pages reservation. If this fails, we return
ENOMEM to userspace.
mmap() now creates purgeable objects if passed the MAP_PURGEABLE option
together with MAP_ANONYMOUS. anon_create() memory is always purgeable.
Right now, NE2000 NICs don't work because the link is down by default
and this will never change. Of all the NE2000 documentation I looked
at I could not find a link status indicator, so just assume the link
is up.
next_packet_page points to a page, but was being compared to a byte
offset rather than a page offset when adjusting the BOUNDARY register
when the ring buffer wraps around.
Fixes#8327.