When TCP sockets successfully establish a connection, any SO_ERROR
should be cleared back to success. For example, SO_ERROR gets set to
EINPROGRESS on asynchronous connect calls and should be cleared when
the socket moves to the Established state.
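For illustration, here is the standard POSIX pattern on the userspace
side that depends on this behavior (per the description above, SO_ERROR
reads as EINPROGRESS while the connect is pending and as 0 once the
socket is established):

    #include <poll.h>
    #include <sys/socket.h>

    // Returns 0 once a non-blocking connect() has completed successfully.
    static int async_connect_result(int fd)
    {
        struct pollfd pfd {};
        pfd.fd = fd;
        pfd.events = POLLOUT;
        poll(&pfd, 1, -1); // wait for the in-flight connect to resolve

        int so_error = 0;
        socklen_t len = sizeof(so_error);
        getsockopt(fd, SOL_SOCKET, SO_ERROR, &so_error, &len);
        return so_error; // 0 if the socket reached the Established state
    }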
This allows commands like netstat to reference /proc/net and identify
a connection's owning process. Process information is limited to
superusers and user-owned processes.
Now that we have a significant number of code paths handling OOM, let's
enable kmalloc and friends to actually return nullptr. This way we can
start stressing these paths and validating that they all work as
expected.
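A minimal sketch of the kind of call site this lets us stress (error
propagation style approximate):

    // kmalloc may now return nullptr, so allocation failure must be
    // handled instead of assumed away:
    auto* buffer = kmalloc(allocation_size);
    if (!buffer)
        return ENOMEM; // propagate OOM to the caller instead of crashing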
The implementation uses try_copy_kstring_from_user to allocate a kernel
string, but does not use the length of the resulting string.
The size parameter to the syscall is untrusted, as
try_copy_kstring_from_user will attempt to perform a `safe_strlen(..)`
on the user-mode string and use that value for the allocated length of
the KString instead. The bug is that we print the KString, but with the
untrusted user-mode size argument.
During fuzzing this resulted in us walking off the end of the allocated
KString buffer, printing garbage (or any kernel data!), until we
stumbled into the KSym region and hit a fatal page fault.
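A hedged reconstruction of the bug and the fix (call site simplified,
names approximate):

    auto kstring_or_error = try_copy_kstring_from_user(user_string, user_size);
    if (kstring_or_error.is_error())
        return kstring_or_error.error();
    auto kstring = kstring_or_error.release_value();

    // BUG: prints user_size bytes, but the KString was sized by safe_strlen():
    //     dbgln("{}", StringView { kstring->characters(), user_size });

    // FIX: bound the print by the length the KString actually owns:
    dbgln("{}", StringView { kstring->characters(), kstring->length() });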
This is technically a kernel information disclosure, but (un)fortunately
the disclosure only happens to the Bochs debug port, and/or the serial
port if serial debugging is enabled. As far as I can tell it's not
actually possible for an untrusted attacker to use this to do something
nefarious, as they would need access to the host. If they have host
access then they can already do much worse things :^).
The only two paths for copying strings in the kernel should be going
through the existing Userspace<char const*>, or StringArgument methods.
Let's enforce this by removing the option of using the raw C-string APIs
that were previously available.
Since the InodeIndex encapsulates a 64-bit value, it is correct to
ensure that the Kernel exposes the entire value and that LibC is
aware of it.
This commit requires an entire re-compile because it's essentially a
change in the Kernel ABI, together with a corresponding change in LibC.
The compiler can re-order structure (class) members if necessary, so if
we make Process inherit from ProcFSExposedComponent, then even if the
declaration inherits first from ProcessBase, then from
ProcFSExposedComponent and last from Weakable<Process>, the members of
class ProcFSExposedComponent (including the ref-counted parts) can end
up as the first members of the Process class.
This problem made it impossible to safely use the current toggling
method with the write-protection bit on the ProcessBase members, so
instead of inheriting from it, we make its members the last ones in the
Process class so we can safely locate and modify the corresponding page
write protection bit of these values.
We make sure that the Process class doesn't expand beyond 8192 bytes and
the protected values are always aligned on a page boundary.
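These invariants can be captured with compile-time checks along these
lines (a sketch; the member name m_protected_values is a placeholder):

    static_assert(sizeof(Process) <= 8192,
        "Process must not expand beyond 8192 bytes");
    static_assert(__builtin_offsetof(Process, m_protected_values) % 4096 == 0,
        "Protected members must start on a page boundary");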
If you want to record perf events, just enable profiling. This allows us
to add random perf events to programs without littering the file system
with perfcore files.
Making userspace provide a global string ID was silly, and made the API
extremely difficult to use correctly in a global profiling context.
Instead, simply make the kernel do the string ID allocation for us.
This also allows us to convert the string storage to a Vector in the
kernel (and an array in the JSON profile data.)
This syscall allows userspace to register a keyed string that appears in
a new "strings" JSON object in profile output.
This will be used to add custom strings to profile signposts. :^)
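A hedged usage sketch (the signature is approximated from the
description above, not taken from the actual header):

    #include <string.h>

    // Register the string once; the kernel allocates and returns its ID.
    int string_id = perf_register_string("frame-render", strlen("frame-render"));
    if (string_id >= 0) {
        // The ID can then be attached to signpost events and resolves
        // through the new "strings" object in the profile JSON.
        perf_event(PERF_EVENT_SIGNPOST, string_id, 0);
    }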
We have to disable interrupts before capturing the current Processor*,
or we risk storing the wrong one if we get preempted and resume on a
different CPU.
Caught by the VERIFY in RecursiveSpinLock::unlock()
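The fix, in sketch form:

    // BAD: the captured Processor* may be stale if we migrate CPUs:
    //     auto& processor = Processor::current();
    //     InterruptDisabler disabler;

    // GOOD: pin ourselves to this CPU first, then capture the pointer:
    InterruptDisabler disabler;
    auto& processor = Processor::current();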
Currently, to append a UTF-16 view to a StringBuilder, callers must
first convert the view to UTF-8 and then append the copy. Add a UTF-16
overload so callers do not need to hold an entire copy in memory.
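A sketch of what the overload can look like (simplified; assumes
Utf16View iteration yields code points):

    void StringBuilder::append(Utf16View const& utf16_view)
    {
        // Transcode code point by code point; no intermediate UTF-8
        // copy of the whole view is ever materialized.
        for (u32 code_point : utf16_view)
            append_code_point(code_point);
    }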
This allows tracing the syscalls made by a thread through the kernel's
performance event framework, which is similar in principle to strace.
Currently, this merely logs a stack backtrace to the current thread's
performance event buffer whenever a syscall is made, if profiling is
enabled. Future improvements could include tracing the arguments and
the return value, for example.
Previously, non-blocking read operations on pipes returned EOF instead
of EAGAIN and non-blocking write operations blocked. This fixes
pkgsrc's bmake "eof on job pipe!" error message when running in
non-compatibility mode.
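Sketched against POSIX semantics (names hypothetical; not the exact
kernel code):

    // Non-blocking read on an empty pipe that still has writers:
    if (!description.is_blocking() && buffer.is_empty() && !writers_gone)
        return EAGAIN; // previously: returned 0, i.e. a bogus EOF

    // Non-blocking write to a full pipe:
    if (!description.is_blocking() && buffer.is_full())
        return EAGAIN; // previously: blocked despite O_NONBLOCK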
As pointed out by 8infy, this mechanism is racy:
WRITER:
1. ++update1;
2. write_data();
3. ++update2;
READER:
1. do { auto saved = update1;
2. read_data();
3. } while (saved != update2);
The following interleaving can lead to a bogus/partial read (update1
and update2 both start at N):

R1 (saved = update1 = N)
        W1 (update1 = N+1)
        W2 (write_data() in progress)
R2 (read_data() observes the partial write)
R3 (update2 is still N, so saved == update2 and the loop exits)
        W3 (update2 = N+1)
We close this race by incrementing the second update counter first:
WRITER:
1. ++update2;
2. write_data();
3. ++update1;
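The corrected protocol is essentially a seqlock. A compact, portable
illustration (memory orderings chosen conservatively; a real
implementation must also make the payload accesses tear-free):

    #include <atomic>

    struct Data { /* plain fields */ };

    std::atomic<unsigned> update1 { 0 };
    std::atomic<unsigned> update2 { 0 };
    Data data;

    void writer_update(Data const& new_data)
    {
        update2.fetch_add(1, std::memory_order_acq_rel); // mark write in progress
        data = new_data;
        update1.fetch_add(1, std::memory_order_release); // publish
    }

    Data reader_snapshot()
    {
        Data snapshot;
        unsigned saved;
        do {
            saved = update1.load(std::memory_order_acquire);
            snapshot = data;
        } while (saved != update2.load(std::memory_order_acquire));
        return snapshot;
    }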
This fixes the placeholder stub for SO_ERROR via getsockopt. It
leverages the m_so_error value that each socket maintains. Reading the
SO_ERROR option obtains and then clears this field, which is useful when
checking for errors that occur between socket calls. The status is
returned as an integer value.
Resolves #146
This patch adds a vDSO-like mechanism for exposing the current time as
an array of per-clock-source timestamps.
LibC's clock_gettime() calls sys$map_time_page() to map the kernel's
"time page" into the process address space (at a random address, ofc.)
This is only done on first call, and from then on the timestamps are
fetched from the time page.
This first patch only adds support for CLOCK_REALTIME, but eventually
we should be able to support all clock sources this way and get rid of
sys$clock_gettime() in the kernel entirely. :^)
Accesses are synchronized using two atomic integers that are incremented
at the start and finish of the kernel's time page update cycle.
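Roughly, the userspace fast path looks like this (a hedged sketch; the
struct layout and helper names are hypothetical):

    struct TimePage {
        volatile unsigned update1;
        volatile unsigned update2;
        struct timespec clocks[8]; // one slot per clock source; initially
                                   // only CLOCK_REALTIME is populated
    };

    static TimePage const* s_time_page; // mapped via sys$map_time_page

    int clock_gettime(clockid_t clock_id, struct timespec* ts)
    {
        if (!s_time_page)
            s_time_page = map_time_page(); // hypothetical wrapper, first call only

        // Guarded read using the two update counters described above:
        unsigned saved;
        do {
            saved = s_time_page->update1;
            *ts = s_time_page->clocks[clock_id];
        } while (saved != s_time_page->update2);
        return 0;
    }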
Leave interrupts enabled so that we can still process IRQs. Critical
sections should only prevent preemption by another thread.
Co-authored-by: Tom <tomut@yahoo.com>
By making these functions static we close a window where we could get
preempted after calling Processor::current() and move to another
processor.
Co-authored-by: Tom <tomut@yahoo.com>
This removes Pipe's dependency on the UHCIController by introducing a
controller base class. This will be used to implement other controllers
such as OHCI.
Additionally, there can be multiple instances of a UHCI controller.
For example, multiple UHCI instances can be required for systems with
EHCI controllers. EHCI relies on using multiple of either UHCI or OHCI
controllers to drive USB 1.x devices.
This means UHCIController can no longer be a singleton. Multiple
instances of it can now be created and passed to the device and then to
the pipe.
To handle finding and creating these instances, USBManagement has been
introduced. It has the same pattern as the other management classes
such as NetworkManagement.
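A sketch of the resulting shape (interface details approximate):

    class USBController : public RefCounted<USBController> {
    public:
        virtual ~USBController() = default;
        virtual KResult initialize() = 0;
        // ... transfer interface consumed by Pipe ...
    };

    class UHCIController final : public USBController {
        // No longer a singleton: USBManagement creates one instance per
        // discovered controller and hands it to devices and their pipes.
    };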
Port2 logic was errantly using `portsc1` registers, meaning that
the port wouldn't be reset properly. In effect, this puts devices
connected to Port2 in an undefined state.
Processing SMP messages in non-SMP mode is a waste of time, and now
that we don't rely on the side effects of calling the message processing
function, let's stop calling it entirely. :^)
We were previously relying on a side effect of the critical section in
smp_process_pending_messages(): when exiting that section, it would
process any pending deferred calls.
Instead of relying on that, make the deferred invocations explicit by
calling deferred_call_execute_pending() in exit_trap().
This ensures that deferred calls get processed before entering the
scheduler at the end of exit_trap(). Since thread unblocking happens
via deferred calls, the threads don't have to wait until the next
scheduling opportunity when they could be ready *now*. :^)
This was the main reason Tom's SMP branch ran slowly in non-SMP mode.
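In sketch form (simplified):

    void Processor::exit_trap(TrapFrame& trap)
    {
        // ...
        deferred_call_execute_pending(); // explicit, no longer a side effect
        // ... then enter the scheduler if needed; threads unblocked by the
        // deferred calls are already ready at this point.
    }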
Enter a critical section in Processor::exit_trap so that processing
SMP messages doesn't enable interrupts upon leaving. We need to delay
this until the end where we call into the Scheduler if exiting the
trap results in being outside of a critical section and IRQ handler.
Co-authored-by: Tom <tomut@yahoo.com>
First off: unregister the region from MemoryManager before unmapping it.
The order of operations here was a bit strange, presumably to avoid a
situation where a fault would happen while unmapping, and the fault
handler would find the MemoryManager region list in an invalid state.
Unregistering it before unmapping sidesteps the whole problem, and
allows us to easily fix another problem: a deadlock could occur due
to inconsistent acquisition order (PageDirectory must come before MM.)
We don't want to be holding the MM lock if it's a user region and we
have to consult the page directory, since that can lead to a deadlock
if we don't already have the page directory lock.
- Use the receiver's per-CPU entry in the message, instead of the
sender's. (Using the sender's entry wasn't safe for broadcast
messages since the same entry ended up on multiple message queues.)
- Retry the CAS until it *succeeds* instead of *fails*. This closes a
race window, and also ensures a correct return value. The return value
is used by the caller to decide whether to broadcast an IPI (see the
sketch after this list).
This was the main reason smp=on was so slow. We had CPUs busy-waiting
until someone else triggered an IPI and moved things along.
- Add a CPU pause hint to the spin loop. :^)
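The sketch mentioned above: a lock-free push that retries the CAS until
it succeeds (simplified, using standard atomics):

    #include <atomic>

    struct MessageEntry {
        MessageEntry* next { nullptr };
        /* payload */
    };

    // Returns whether the queue was previously empty, which is what the
    // caller uses to decide whether to broadcast an IPI.
    bool enqueue(std::atomic<MessageEntry*>& head, MessageEntry& entry)
    {
        entry.next = head.load(std::memory_order_relaxed);
        while (!head.compare_exchange_weak(entry.next, &entry,
            std::memory_order_release, std::memory_order_relaxed)) {
            __builtin_ia32_pause(); // CPU pause hint while contending
        }
        return entry.next == nullptr; // true => queue was empty
    }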
It may happen that CPU A manages to page in the same page from an inode
while we're just entering the page fault handler for it on CPU B.
Handle this gracefully by checking if the data has already been paged in
(instead of VERIFY'ing that it hasn't) and remapping the page if that's
the case.
Due to a boolean mistake in smp_return_to_pool(), we didn't retry
pushing the message onto the freelist after a failed attempt.
This caused the message pool to eventually become completely empty
after enough contended access attempts.
This patch also adds a pause hint to the CPU in the failed attempt
code path.
The kernel panic handler now parses the kernel's boot_mode to decide
how to handle the panic, so the previous logic could end up in a panic
loop until we blew out the kernel stack.
Instead only validate the kernel's boot mode once per boot, after
initializing the kernel command line.
Taking a reference or a pointer to a value that's not aligned properly
is undefined behavior. While `[[gnu::packed]]` ensures that reads from
and writes to fields of packed structs are safe operations, the
information about the reduced alignment is lost when creating pointers
to these values.
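For illustration:

    #include <stdint.h>

    struct [[gnu::packed]] Header {
        uint8_t tag;
        uint32_t value; // only 1-byte aligned because of the packing
    };

    Header header {};
    uint32_t v = header.value;   // OK: the compiler emits a safe unaligned read
    uint32_t* p = &header.value; // -Waddress-of-packed-member: the pointer
                                 // loses the reduced-alignment information,
                                 // and dereferencing it may assume 4-byte
                                 // alignment (UB)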
Weirdly enough, GCC's undefined behavior sanitizer doesn't flag these,
even though the doc of `-Waddress-of-packed-member` says that it usually
leads to UB. In contrast, x86_64 Clang does flag these, which renders
the 64-bit kernel unable to boot.
For now, the `address-of-packed-member` warning will only be enabled in
the kernel, as it is absolutely crucial there because of KUBSAN, but
might get excessively noisy for the userland in the future.
Also note that we can't append to `CMAKE_CXX_FLAGS` like we do for other
flags in the kernel, because flags added via `add_compile_options` come
after these, so the `-Wno-address-of-packed-member` in the root would
cancel it out.
Kernels built with Clang seem to be quite allocation-heavy compared to
their GCC counterparts. We would sometimes end up crashing during boot
because the eternal ranges had no free capacity.
The Clang error message reads like this (`-Wdeprecated-array-compare`):
> error: comparison between two arrays is deprecated; to compare
> array addresses, use unary '+' to decay operands to pointers.
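For example:

    char a[4] {};
    char b[4] {};
    if (a == b) { }   // -Wdeprecated-array-compare under Clang
    if (+a == +b) { } // OK: unary '+' decays both operands to pointers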
The maximum number of virtual consoles is determined at compile time,
so we can pre-allocate that many slots, dodging some heap allocations.
Furthermore, virtual consoles are never destroyed, so it's fine to
simply store a raw pointer to the currently active one.
This commit implements the ISO 9660 filesystem as specified in ECMA 119.
Currently, it only supports the base specification; Joliet and Rock
Ridge support is not present. The filesystem normalizes all filenames
to lowercase (same as Linux).
The filesystem can be mounted directly from a file. Loop devices are
currently not supported by SerenityOS.
Special thanks to Lubrsi for testing on real hardware and providing
profiling help.
Co-Authored-By: Luke <luke.wilde@live.co.uk>
I had to move the thread list out of the protected base area of Process
so that it could live with its lock (which needs to be mutable).
Ideally it would live in the protected area, so maybe we can figure out
a way to do that later.
When I laid down the foundation for the start of the big process lock
separation, I added asserts to all system call implementations to
validate we hold the big process lock in the locations we think we
should be.
Adding that assert to sys$profiling_enable broke boot-time profiling,
as we were never holding the lock on boot. Even though it's not
technically required, let's make sure to hold the lock while enabling,
to appease the assert.
VirtualConsole::m_lock was only used in a single place: on_tty_write().
That function is already protected by a global lock, so this second
lock served no purpose whatsoever.
Note: TCPSocket::create_client() has a dubious locking process where
the sockets-by-tuple table is first locked shared to check if the socket
exists (bailing out if it does), then unlocked, then locked exclusively
to add the tuple. There could be a race condition where two client
creation requests for the same tuple happen at the same time and both
pass the shared-lock check. When in doubt, lock exclusively the whole
time.
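A sketch of the safer shape alluded to above (names approximate):

    // Do the lookup and the insertion under a single exclusive lock, so
    // no other thread can insert the same tuple in between.
    MutexLocker locker(sockets_by_tuple_lock()); // exclusive the whole time
    if (sockets_by_tuple().contains(tuple))
        return EADDRINUSE;
    sockets_by_tuple().set(tuple, *client);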
A protected value is a variable with enforced locking semantics. The
value is protected with a Mutex and can only be accessed through a
Locked object that holds a MutexLocker to said Mutex. Therefore, the
value itself cannot be accessed except through the proper locking
mechanism, which enforces correct locking semantics.
The Locked object has knowledge of shared and exclusive lock types and
will only return const-correct references and pointers. This should
help catch incorrect locking usage where a shared lock is acquired but
the user then modifies the locked value.
This is not a perfect solution because dereferencing the Locked object
returns the value, so the caller could defeat the protected value
semantics once it acquires a lock by keeping a pointer or a reference
to the value around. Then again, this is C++ and we can't protect
against malicious users from within the kernel anyways, but we can
raise the threshold above "didn't pay attention".
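A minimal standalone sketch of the idea, using standard library types
in place of the kernel's Mutex/MutexLocker:

    #include <mutex>
    #include <shared_mutex>

    template<typename T>
    class ProtectedValue {
    public:
        // Exclusive access: mutation is allowed.
        class ExclusiveLocked {
        public:
            explicit ExclusiveLocked(ProtectedValue& protected_value)
                : m_value(protected_value.m_value)
                , m_locker(protected_value.m_mutex)
            {
            }
            T* operator->() { return &m_value; }

        private:
            T& m_value;
            std::unique_lock<std::shared_mutex> m_locker;
        };

        // Shared access: only const-correct, read-only views are handed out.
        class SharedLocked {
        public:
            explicit SharedLocked(ProtectedValue const& protected_value)
                : m_value(protected_value.m_value)
                , m_locker(protected_value.m_mutex)
            {
            }
            T const* operator->() const { return &m_value; }

        private:
            T const& m_value;
            std::shared_lock<std::shared_mutex> m_locker;
        };

        ExclusiveLocked lock_exclusive() { return ExclusiveLocked(*this); }
        SharedLocked lock_shared() const { return SharedLocked(*this); }

    private:
        mutable std::shared_mutex m_mutex;
        T m_value {};
    };

Since the value is a private member, the only way to reach it is through
one of the Locked types, which is exactly the enforcement described
above.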
This is some syntactic sugar to use a ContendedResource object with
reference counting. This effectively dissociates merely holding a
reference to an object and actually using the object in a way that
requires locking it against concurrent use.
This syscall only reads from the shared m_space field, but that field
is only written to by Process::attach_resources, before the process is
initialized (i.e. before syscalls can happen); by Process::finalize,
which is only called after all the process's threads have exited (i.e.
syscalls can no longer happen); and by Process::do_exec, which kills all
other syscall-capable threads before doing so.
doing so. Space's find_region_containing already holds its own lock,
and as such there's no need to hold the big lock.
This syscall doesn't touch any intra-process shared resources and only
accesses the time via the atomic TimeManagement::now so there's no need
to hold the big lock.
This syscall doesn't touch any intra-process shared resources and only
accesses the time via the atomic TimeManagement::current_time so there's
no need to hold the big lock.
This syscall doesn't touch any intra-process shared resources and
reads the time via the atomic TimeManagement::current_time, so it
doesn't need to hold any lock.
When booting APs, we identity map a region at 0x8000 while doing the
initial bringup sequence. This is the only thing in the kernel that
requires an identity mapping, yet we had a bunch of generic APIs and a
dedicated VirtualRangeAllocator in every PageDirectory for this purpose.
This patch simplifies the situation by moving the identity mapping logic
into the AP boot code and removing the generic APIs.
...and also RangeAllocator => VirtualRangeAllocator.
This clarifies that the ranges we're dealing with are *virtual* memory
ranges and not anything else.
Now that all KResult and KResultOr are used consistently throughout the
kernel, it's no longer necessary to return negative error codes.
However, we were still doing that in some places, so let's fix all those
(bugs) by removing the minuses. :^)
We were allocating thread FPU state separately in order to ensure a
16-byte alignment. There should be no need to do that.
This patch makes it a regular value member of Thread instead, dodging
one heap allocation during thread creation.
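In sketch form (since C++17, operator new honors extended alignment
requirements, so the containing Thread allocation keeps this aligned):

    class Thread {
        // ...
        // Previously: a separately heap-allocated FPUState, solely to
        // guarantee 16-byte alignment. Now a plain value member:
        alignas(16) FPUState m_fpu_state;
    };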
This patch gets rid of the "valid" bit in PageDirectory since it was
only used to communicate an allocation failure during construction.
We now do all the work in the static factory functions instead of in the
constructor, which allows us to simply return nullptr instead of an
"invalid" PageDirectory.
When constructing an AnonymousVMObject with the AllocateNow allocation
strategy, we accidentally allocated the committed pages directly through
MemoryManager instead of taking them from our m_unused_physical_pages
CommittedPhysicalPageSet. This meant they were counted as allocated in
MemoryManager, but were still counted as unallocated in the PageSet,
which would then try to uncommit them on destruction, resulting in a
failed assertion.
To help prevent similar issues in the future, a Badge<T> was added to
MM::allocate_committed_user_physical_page to prevent allocation of
committed pages except via a CommittedPhysicalPageSet.
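The Badge<T> technique in brief: only T can construct a Badge<T>, so
requiring one as a parameter restricts callers to T (sketch; signature
approximate):

    // Only CommittedPhysicalPageSet can mint a Badge<CommittedPhysicalPageSet>,
    // so only it can reach this allocation path:
    NonnullRefPtr<PhysicalPage> MemoryManager::allocate_committed_user_physical_page(
        Badge<CommittedPhysicalPageSet>, ShouldZeroFill should_zero_fill)
    {
        // ... hand out a page from the committed reservation ...
    }

    // Call site inside CommittedPhysicalPageSet:
    //     MM.allocate_committed_user_physical_page({}, ShouldZeroFill::Yes);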
When we hit a COW fault and discover that no other process is sharing
the physical page, we simply remap it r/w and save ourselves the
trouble. When this happens, we can also give back (uncommit) one of our
shared committed COW pages, since we won't be needing it.
We had this optimization before, but I mistakenly removed it in
50472fd69f since I had misunderstood
it to be the reason for a panic.
We currently overcommit for COW when forking a process and cloning its
memory regions. Both the parent and child process share a set of
committed COW pages.
If there's COW sharing across more than two processes within a lineage
(e.g. parent, child & grandchild), it's possible to exhaust these pages.
When the shared set is emptied, the next COW fault in each process must
detach from the shared set and fall back to on demand allocation.
This patch makes sure that we detach from the shared set once we
discover it to be empty (during COW fault handling). This fixes an issue
where we'd try to allocate from an exhausted shared set while building
GNU binutils inside SerenityOS.
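The fix, in sketch form (member name approximate):

    // During COW fault handling:
    if (m_shared_committed_cow_pages && m_shared_committed_cow_pages->is_empty()) {
        // The shared set is exhausted; detach so this and subsequent
        // faults allocate on demand instead of drawing from an empty set.
        m_shared_committed_cow_pages = nullptr;
    }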