These are formatters that can only be used with debug print
functions, such as dbgln(). Currently this is limited to
Formatter<ErrorOr<T>>. With this you can still debug log ErrorOr
values (good for debugging), but trying to use them in any
String::formatted() call will fail (which prevents .to_string()
errors with the new failable strings being ignored).
You make a formatter debug only by adding a constexpr method like:
static constexpr bool is_debug_only() { return true; }
When calling ioctl on a socket with SIOCGIFHWADDR, return the correct
physical interface type. This value was previously hardcoded to
ARPHRD_ETHER (Ethernet), and now can also return ARPHRD_LOOPBACK for the
loopback adapter.
Before this patch, Core::SessionManagement::parse_path_with_sid() would
figure out the root session ID by sifting through /sys/kernel/processes.
That file can take quite a while to generate (sometimes up to 40ms on my
machine, which is a problem on its own!) and with no caching, many of
our programs were effectively doing this multiple times on startup when
unveiling something in /tmp/session/%sid/
While we should find ways to make generating /sys/kernel/processes fast
again, this patch addresses the specific problem by introducing a new
syscall: sys$get_root_session_id(). This extracts the root session ID
by looking directly at the process table and takes <1ms instead of 40ms.
This cuts WebContent process startup time by ~100ms on my machine. :^)
Resolves issue where a panic would occur if the file system failed to
initialize or mount, due to how the FileSystem was already added to
VFS's list. The newly-created FileSystem destructor would fail as a
result of the object still remaining in the IntrusiveList.
Nobody tests this network card as the person who added it, Jean-Baptiste
Boric (known as boricj) is not an active contributor in the project now.
After a discussion with him on the Discord server, we agreed it's for
the best to remove the driver, as for two reasons:
- The original author (boricj) agreed to do this, stating that he will
not be able to test the driver anymore after his Athlon XP machine is
no longer supported after the removal of the i686 port.
- It was agreed that the NE2000 network card family is far from the
ideal hardware we would want to support, similarly to the RTL8139 that
got removed recently for almost the same reason.
Instead of using a clunky switch-case paradigm, we now have all drivers
being declaring two methods for their adapter class - create and probe.
These methods are linked in each PCIGraphicsDriverInitializer structure,
in a new s_initializers static list of them.
Then, when we probe for a PCI device, we use each probe method and if
there's a match, then the corresponding create method is called.
As a result of this change, it's much more easy to add more drivers and
the initialization code is more readable.
We try our best to ensure a DisplayConnector initialization succeeds,
and this makes the Intel driver to work again, because if we can't
allocate a Region for the whole PCI BAR mapped region, then we will try
to allocate a Region with 16 MiB window size, so it doesn't eat the
entire Kernel-allocated virtual memory space.
Instead of just returning nothing, let's return Error or nothing.
This would help later on with error propagation in case of failure
during this method.
This also makes us more paranoid about failure in this method, so when
initializing a DisplayConnector we safely tear down the internal members
of the object. This applies the same for a StorageDevice object, but its
after_inserting method is much smaller compared to the DisplayConnector
overriden method.
Nobody tests this network card, and the driver has bugs (see the issue
https://github.com/SerenityOS/serenity/issues/10198 for more details),
so it's almost certain that this happened due to code being rotting when
there's simply no testing of it.
Essentially this has been determined to be dead-code so this is the most
important reason to drop this code. Another good reason to do so is
because the RTL8139 only supports Fast Ethernet connections (10/100
Megabits per second), and is considered obsolete even for bare metal
setups.
Instead of using a clunky if-statement paradigm, we now have all drivers
being declaring two methods for their adapter class - create and probe.
These methods are linked in each PCINetworkDriverInitializer structure,
in a new s_initializers static list of them.
Then, when we probe for a PCI device, we use each probe method and if
there's a match, then the corresponding create method is called. After
the adapter instance is created, we call the virtual initialize method
on it, because many drivers actually require a sort of post-construction
initialization sequence to ensure the network adapter can properly
function.
As a result of this change, it's much more easy to add more drivers and
the initialization code is more readable and it's easier to understand
when and where things could fail in the whole initialization sequence.
Instead of allocating those regions in the constructor, which makes it
impossible to fail in case of OOM condition, allocate them in the static
factory method so we could propagate errors in case of failure.
Instead of allocating after the construction point ensure that all Intel
drivers are allocating necessary buffer regions and then pass them to
the constructors.
This could let us fail early in case of OOM, so we don't touch a network
adapter before we ensure we have all the appropriate mappings in place.
We really don't want callers of this function to accidentally change
the jail, or even worse - remove the Process from an attached jail.
To ensure this never happens, we can just declare this method as const
so nobody can mutate it this way.
Use this helper function in various places to replace the old code of
acquiring the SpinlockProtected<RefPtr<Jail>> of a Process to do that
validation.
This seems to work perfectly OK on my ICH7 test machine and also it
works on QEMU, so it is probably OK to restore this.
This will ensure we always get scan code set 1 input, because we enable
scan code set 2 and PS/2 translation on the first (keyboard) port.
The setting of scan code set sequence is removed, as it's buggy and
could lead the controller to fail immediately when doing self-test
afterwards. We will restore it when we understand how to do so safely.
Allow the user to determine a preferred detection path with a new kernel
command line argument. The defualt option is to check i8042 presence
with an ACPI check and if necessary - an "aggressive" test to determine
i8042 existence in the system.
Also, keep the i8042 controller pointer on the stack, so don't assign
m_i8042_controller member pointer if it does not exist.
Only do so after a brief check if we are in a Jail or not. This fixes
SMP, because apparently it is crashing when calling try_generate()
from the SysFSGlobalInformation::refresh_data method, so the fix for
this is to simply not do that inside the Process' Jail spinlock scope,
because otherwise we will simply have a possible flow of taking
multiple conflicting Spinlocks (in the wrong order multiple times), for
the SysFSOverallProcesses generation code:
Process::current().jail(), and then Process::for_each_in_same_jail being
called, we take Process::all_instances(), and Process::current().jail()
again.
Therefore, we should at the very least eliminate the first taking of the
Process::current().jail() spinlock, in the refresh_data method of the
SysFSGlobalInformation class.
A virtual method named device_name() was added to
Kernel::PCI to support logging the PCI::Device name
and address using dmesgln_pci. Previously, PCI::Device
did not store the device name.
All devices inheriting from PCI::Device now use dmesgln_pci where
they previously used dmesgln.
`inline` already assigns vague linkage, so there's no need to
also assign per-TU linkage. Allows the linker to dedup these
functions across TUs (and is almost always just the Right Thing
to do in C++ -- this ain't C).
* Fix bug where last character of a filename or extension would be
truncated (HELLO.TXT -> HELL.TX).
* Fix bug where additional NULL characters would be added to long
filenames that did not completely fill one of the Long Filename Entry
character fields.
There are places in the kernel that would like to have access
to `pgid` credentials in certain circumstances.
I haven't found any use cases for `sid` yet, but `sid` and `pgid` are
both changed with `sys$setpgid`, so it seemed sensical to add it.
In Linux, `man 7 credentials` also mentions both the session id and
process group id, so this isn't unprecedented.
These instances were detected by searching for files that include
AK/Memory.h, but don't match the regex:
\\b(fast_u32_copy|fast_u32_fill|secure_zero|timing_safe_compare)\\b
This regex is pessimistic, so there might be more files that don't
actually use any memory function.
In theory, one might use LibCPP to detect things like this
automatically, but let's do this one step after another.
These instances were detected by searching for files that include
AK/Concepts.h, but don't match the regex:
\\b(AnyString|Arithmetic|ArrayLike|DerivedFrom|Enum|FallibleFunction|Flo
atingPoint|Fundamental|HashCompatible|Indexable|Integral|IterableContain
er|IteratorFunction|IteratorPairWith|OneOf|OneOfIgnoringCV|SameAs|Signed
|SpecializationOf|Unsigned|VoidFunction)\\b
(Without the linebreaks.)
This regex is pessimistic, so there might be more files that don't
actually use any concepts.
In theory, one might use LibCPP to detect things like this
automatically, but let's do this one step after another.
These instances were detected by searching for files that include
AK/StdLibExtras.h, but don't match the regex:
\\b(abs|AK_REPLACED_STD_NAMESPACE|array_size|ceil_div|clamp|exchange|for
ward|is_constant_evaluated|is_power_of_two|max|min|mix|move|_RawPtr|RawP
tr|round_up_to_power_of_two|swap|to_underlying)\\b
(Without the linebreaks.)
This regex is pessimistic, so there might be more files that don't
actually use any "extra stdlib" functions.
In theory, one might use LibCPP to detect things like this
automatically, but let's do this one step after another.
These instances were detected by searching for files that include
AK/Format.h, but don't match the regex:
\\b(CheckedFormatString|critical_dmesgln|dbgln|dbgln_if|dmesgln|FormatBu
ilder|__FormatIfSupported|FormatIfSupported|FormatParser|FormatString|Fo
rmattable|Formatter|__format_value|HasFormatter|max_format_arguments|out
|outln|set_debug_enabled|StandardFormatter|TypeErasedFormatParams|TypeEr
asedParameter|VariadicFormatParams|v_critical_dmesgln|vdbgln|vdmesgln|vf
ormat|vout|warn|warnln|warnln_if)\\b
(Without the linebreaks.)
This regex is pessimistic, so there might be more files that don't
actually use any formatting functions.
Observe that this revealed that Userland/Libraries/LibC/signal.cpp is
missing an include.
In theory, one might use LibCPP to detect things like this
automatically, but let's do this one step after another.
These instances were detected by searching for files that include
Kernel/Debug.h, but don't match the regex:
\\bdbgln_if\(|_DEBUG\\b
This regex is pessimistic, so there might be more files that don't check
for any real *_DEBUG macro. There seem to be no corner cases anyway.
In theory, one might use LibCPP to detect things like this
automatically, but let's do this one step after another.
This step would ideally not have been necessary (increases amount of
refactoring and templates necessary, which in turn increases build
times), but it gives us a couple of nice properties:
- SpinlockProtected inside Singleton (a very common combination) can now
obtain any lock rank just via the template parameter. It was not
previously possible to do this with SingletonInstanceCreator magic.
- SpinlockProtected's lock rank is now mandatory; this is the majority
of cases and allows us to see where we're still missing proper ranks.
- The type already informs us what lock rank a lock has, which aids code
readability and (possibly, if gdb cooperates) lock mismatch debugging.
- The rank of a lock can no longer be dynamic, which is not something we
wanted in the first place (or made use of). Locks randomly changing
their rank sounds like a disaster waiting to happen.
- In some places, we might be able to statically check that locks are
taken in the right order (with the right lock rank checking
implementation) as rank information is fully statically known.
This refactoring even more exposes the fact that Mutex has no lock rank
capabilites, which is not fixed here.
Using policy based design `SinglyLinkedList` and
`SinglyLinkedListWithCount` can be combined into one class which takes
a policy to determine how to keep track of the size of the list. The
default policy is to use list iteration to count the items in the list
each time. The `WithCount` form is a different policy which tracks the
size, but comes with the overhead of storing the count and
incrementing/decrementing on each modification.
This model is extensible to have other forms of counting by
implementing only a new policy instead of implementing a totally new
type.
These instances were detected by searching for files that include
Array.h, but don't match the regex:
\\b(Array(?!\.h>)|iota_array|integer_sequence_generate_array)\\b
These are the three symbols defined by Array.h.
In theory, one might use LibCPP to detect things like this
automatically, but let's do this one step after another.
If a page fault occurs while interrupts are disabled, we were wrongly
enabling interrupts right away in the page fault handler.
Instead, we should only do this if interrupts were enabled when the
page fault occurred.
We were already handling the rmdir("..") case by refusing to remove
directories that were not empty.
This patch removes a FIXME from January 2019 and adds a test. :^)
Dr. POSIX says that we should reject attempts to rmdir() the file named
"." so this patch does exactly that. We also add a test.
This solves a FIXME from January 2019. :^)
This has been done in multiple ways:
- Each time we modeset the resolution via the VirtIOGPU DisplayConnector
we ensure that the framebuffer is updated with the new resolution.
- Each time the cursor is updated we ensure that the framebuffer console
is marked dirty so the IO Work Queue task which is scheduled to check
if it is dirty, will flush the surface.
- We only initialize a framebuffer console after we ensure that at the
very least a DisplayConnector has being set with a known resolution.
- We only call GenericFramebufferConsole::enable() when enabling the
console after the important variables of the console (m_width, m_pitch
and m_height) have been set.
Check if the process we are currently running is in a jail, and if that
is the case, fail early with the EPERM error code.
Also, as Brian noted, we should also disallow attaching to a jail in
case of already running within a setid executable, as this leaves the
user with false thinking of being secure (because you can't exec new
setid binaries), but the current program is still marked setid, which
means that at the very least we gained permissions while we didn't
expect it, so let's block it.
The hand-written assembly does not compile under Clang due to register
size mismatches. Using a loop is slower (~6 instructions on O2 as
opposed to 2 with hand-written assembly), but using the pause
instruction makes this more efficient even under TCG.
For pause we use isb sy which will put the processor to sleep while the
pipeline is being flushed. This instruction is also used by Rust in spin
loops and found to be more efficient, as well as being a rough
equivalent to the x86 pause instruction which we also use here.
For wait_check we use yield, which is a hinted nop that is faster to
execute, and I leave a FIXME for processing SMP messages once we support
SMP.
These two changes probably make spin loops work on aarch64 :^)
This commit changes the init.cpp file to start and initialize the
Scheduler, and actually runs init_stage2. To show that it actually
works, another thread is spawned and executed simultaneously, by context
switching between the two!
This requires two new functions, context_first_init and
restore_context_and_eret. With this code in place, we are now running
the first idle thread! :^)
This changes the stack pointer to the initial_thread stack pointer, and
pushes two pointers onto the stack that point to the initial_thread. The
function then jumps to the ip of the initial_thread, which will be
thread_context_first_enter, and hangs there because that function is not
yet implemented.
This does not handle everything correctly yet, such as setting the
correct state for running userspace applications, however this should be
enough to get kernel scheduling to work.
These are architecture-specific anyway, so they belong in the Arch
directory. This commit also adds ThreadRegisters::set_initial_state to
factor out the logic in Thread.cpp.
I can't think of a reason why copying the Processor class makes sense,
so lets make sure it's not possible to do it by accident by declaring
the copy constructor as deleted.
This removes the x86 specific hlt instruction from the scheduler, and
allows us to run the scheduler code for aarch64 by implementing
Processor::wait_for_interrupt for aarch64.
This file does not contain any architecture specific implementations,
so we can move it to the Kernel base directory. Also update the relevant
include paths.
We are actually storing tpidr_el0, as can be seen in vector_table.S, but
the RegisterState.h incorrectly had tpidr_el1. This will probably save
some annoying debugging later on.
This is necessary to allow transferring frame buffers larger than
~500x500 pixels back to user space. Until the buffer management is
improved this allows us to at least test the existing game ports.
The ARM CPU is set up to trap on unaligned accesses, however the
compiler will still generate them if this flag is not set. We also need
the -Wno-cast-align as there are some files in AK that don't build
without the flag.
This happens to be a sad truth for the VirtIOGPU driver - it lacked any
error propagation measures and generally relied on clunky assumptions
that most operations with the GPU device are infallible, although in
reality much of them could fail, so we do need to handle errors.
To fix this, synchronous GPU commands no longer rely on the wait queue
mechanism anymore, so instead we introduce a timeout-based mechanism,
similar to how other Kernel drivers use a polling based mechanism with
the assumption that hardware could get stuck in an error state and we
could abort gracefully.
Then, we change most of the VirtIOGraphicsAdapter methods to propagate
errors properly to the original callers, to ensure that if a synchronous
GPU command failed, either the Kernel or userspace could do something
meaningful about this situation.
The performance that we achieve from this technique is visually worse
compared to turning off this feature, so let's not use this until we
figure out why it happens.
`OwnPtrWithCustomDeleter` was a decorator which provided the ability
to add a custom deleter to `OwnPtr` by wrapping and taking the deleter
as a run-time argument to the constructor. This solution means that no
additional space is needed for the `OwnPtr` because it doesn't need to
store a pointer to the deleter, but comes at the cost of having an
extra type that stores a pointer for every instance.
This logic is moved directly into `OwnPtr` by adding a template
argument that is defaulted to the default deleter for the type. This
means that the type itself stores the pointer to the deleter instead
of every instance and adds some type safety by encoding the deleter in
the type itself instead of taking a run-time argument.
We now move the ErrorOr returning functions in the constructor to the
try_to_initialize() factory, which allows us to handle the errors and
removes two FIXME's :))
We add this basic functionality to the Kernel so Userspace can request a
particular virtual memory mapping to be immutable. This will be useful
later on in the DynamicLoader code.
The annotation of a particular Kernel Region as immutable implies that
the following restrictions apply, so these features are prohibited:
- Changing the region's protection bits
- Unmapping the region
- Annotating the region with other virtual memory flags
- Applying further memory advises on the region
- Changing the region name
- Re-mapping the region
This syscall will be used later on to ensure we can declare virtual
memory mappings as immutable (which means that the underlying Region is
basically immutable for both future annotations or changing the
protection bits of it).
As is, we never *deallocate* them, so we will run out eventually.
Creating a context, or allocating a context ID, now returns ErrorOr if
there are no available free context IDs.
`number_of_fixmes--;` :^)
This now only requires `size + alignment` bytes while searching for a
free memory location. For the actual allocation, the memory area is
properly trimmed to the required alignment.
This keeps us from tripping strict aliasing, which previously made TCP
connections inoperable when building without `-fsanitize=undefined` or
`-fno-strict-aliasing`.
This patch validates that the size of the auxiliary vector does not
exceed `Process::max_auxiliary_size`. The auxiliary vector is a range
of memory in userspace stack where the kernel can pass information to
the process that will be created via `Process:do_exec`.
The reason the kernel needs to validate its size is that the about to
be created process needs to have remaining space on the stack.
Previously only `argv` and `envp` were taken into account for the
size validation, with this patch, the size of `auxv` is also
checked. All three elements contain values that a user (or an
attacker) can specify.
This patch adds the constant `Process::max_auxiliary_size` which is
defined to be one eight of the user-space stack size. This is the
approach taken by `Process:max_arguments_size` and
`Process::max_environment_size` which are used to check the sizes
of `argv` and `envp`.
Note that this still keeps the old behaviour of putting things in std by
default on serenity so the tools can be happy, but if USING_AK_GLOBALLY
is unset, AK behaves like a good citizen and doesn't try to put things
in the ::std namespace.
std::nothrow_t and its friends get to stay because I'm being told that
compilers assume things about them and I can't yeet them into a
different namespace...for now.
When scanning for network adapters, we give each driver a chance to
claim the PCI device and whoever claims it first gets to keep it.
Before this patch, the driver API returned a LockRefPtr<AdapterType>,
which made it impossible to propagate errors that occurred during
detection and/or initialization.
This patch changes the API so that errors can bubble all the way out
the PCI enumeration in NetworkingManagement::initialize() where we
perform all the network adapter auto-detection on boot.
When we eventually start to support hot-plugging network adapter in the
future, it will be even more important to propagate errors instead of
swallowing them.
Importantly, before this patch, some errors were "handled" by panicking
the kernel. This is no longer the case.
7 FIXMEs were killed in the making of this commit. :^)
We were previously using a 32-bit unsigned integer for this, which
caused us to start truncating region sizes when multiplied with
`PAGE_SIZE` on hardware with a lot of memory.
Some programs explicitly ask for a different initial stack size than
what the OS provides. This is implemented in ELF by having a
PT_GNU_STACK header which has its p_memsz set to the amount that the
program requires. This commit implements this policy by reading the
p_memsz of the header and setting the main thread stack size to that.
ELF::Image::validate_program_headers ensures that the size attribute is
a reasonable value.
This commit makes it possible for a process to downgrade a file lock it
holds from a write (exclusive) lock to a read (shared) lock. For this,
the process must point to the exact range of the flock, and must be the
owner of the lock.
Joining dead threads is allowed for two main reasons:
- Thread join behavior should not be racy when a thread is joined and
exiting at roughly the same time. This is common behavior when threads
are given a signal to end (meaning they are going to exit ASAP) and
then joined.
- POSIX requires that exited threads are joinable (at least, there is no
language in the specification forbidding it).
The behavior is still well-defined; e.g. it doesn't allow a dead
detached thread to be joined or a thread to be joined more than once.
The fact that we used a Vector meant that even if creating a Mount
object succeeded, we were still at a risk that appending to the actual
mounts Vector could fail due to OOM condition. To guard against this,
the mount table is now an IntrusiveList, which always means that when
allocation of a Mount object succeeded, then inserting that object to
the list will succeed, which allows us to fail early in case of OOM
condition.
From now on, we don't allow jailed processes to open all device nodes in
/dev, but only allow jailed processes to open /dev/full, /dev/zero,
/dev/null, and various TTY and PTY devices (and not including virtual
consoles) so we basically restrict applications to what they can do when
they are in jail.
The motivation for this type of restriction is to ensure that even if a
remote code execution occurred, the damage that can be done is very
small.
We also don't restrict reading and writing on device nodes that were
already opened, because that limit seems not useful, especially in the
case where we do want to provide an OpenFileDescription to such device
but nothing further than that.
`SysFSComponentRegistry`, `ProcFSComponentRegistry` and
`attach_null_device` "just work" already; let's include them to match
x86_64 as closely as possible.
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
This patch adds a way to ask the allocator to skip its internal
scrubbing memset operation. Before this change, kcalloc() would scrub
twice: once internally in kmalloc() and then again in kcalloc().
The same mechanism already existed in LibC malloc, and this patch
brings it over to the kernel heap allocator as well.
This solves one FIXME in kcalloc(). :^)
This solves one of the security issues being mentioned in issue #15996.
We simply don't allow creating hardlinks on paths that were not unveiled
as writable to prevent possible bypass on a certain path that was
unveiled as non-writable.
Instead, allow userspace to decide on the coredump directory path. By
default, SystemServer sets it to the /tmp/coredump directory, but users
can now change this by writing a new path to the sysfs node at
/sys/kernel/variables/coredump_directory, and also to read this node to
check where coredumps are currently generated at.
By default, disallow reading of values in that directory. Later on, we
will enable sparingly read access to specific files.
The idea that led to this mechanism was suggested by Jean-Baptiste
Boric (also known as boricj in GitHub), to prevent access to sensitive
information in the SysFS if someone adds a new file in the /sys/kernel
directory.
There's simply no benefit in allowing sandboxed programs to change the
power state of the machine, so disallow writes to the mentioned node to
prevent malicious programs to request that.
Previously we tried to determine if `fd` refers to a non-regular file by
doing a stat() operation on the file.
This didn't work out very well since many File subclasses don't
actually implement stat() but instead fall back to failing with EBADF.
This patch fixes the issue by checking for regular files with
File::is_regular_file() instead.
We used size_t, which is a type that is guarenteed to be large
enough to hold an array index, but uintptr_t is designed to be used
to hold pointer values, which is the case of stack guards.
To accomplish this, we add another VeilState which is called
LockedInherited. The idea is to apply exec unveil data, similar to
execpromises of the pledge syscall, on the current exec'ed program
during the execve sequence. When applying the forced unveil data, the
veil state is set to be locked but the special state of LockedInherited
ensures that if the new program tries to unveil paths, the request will
silently be ignored, so the program will continue running without
receiving an error, but is still can only use the paths that were
unveiled before the exec syscall. This in turn, allows us to use the
unveil syscall with a special utility to sandbox other userland programs
in terms of what is visible to them on the filesystem, and is usable on
both programs that use or don't use the unveil syscall in their code.
Because the ".." entry in a directory is a separate inode, if a
directory is renamed to a new location, then we should update this entry
the point to the new parent directory as well.
Co-authored-by: Liav A <liavalb@gmail.com>
Each GenericInterruptHandler now tracks the number of calls that each
CPU has serviced.
This takes care of a FIXME in the /sys/kernel/interrupts generator.
Also, the lsirq command line tool now displays per-CPU call counts.
This patch fixes some include problems on aarch64. aarch64 is still
currently broken but this will get us back to the underlying problem
of FloatExtractor.
We now disallow jail creation from a process within a jail because there
is simply no valid use case to allow it, and we will probably not enable
this behavior (which is considered a bug) again.
Although there was no "real" security issue with this bug, as a process
would still be denied to join that jail, there's an information reveal
about the amount of jails that are or were present in the system.
Add support for async transfers by using a separate kernel task to poll
a list of active async transfers on a set time interval, and invoke
their user-provided callback function when they are complete. Also add
support for the interrupt class of transfers, building off of this async
functionality.
Our implementation for Jails resembles much of how FreeBSD jails are
working - it's essentially only a matter of using a RefPtr in the
Process class to a Jail object. Then, when we iterate over all processes
in various cases, we could ensure if either the current process is in
jail and therefore should be restricted what is visible in terms of
PID isolation, and also to be able to expose metadata about Jails in
/sys/kernel/jails node (which does not reveal anything to a process
which is in jail).
A lifetime model for the Jail object is currently plain simple - there's
simpy no way to manually delete a Jail object once it was created. Such
feature should be carefully designed to allow safe destruction of a Jail
without the possibility of releasing a process which is in Jail from the
actual jail. Each process which is attached into a Jail cannot leave it
until the end of a Process (i.e. when finalizing a Process). All jails
are kept being referenced in the JailManagement. When a last attached
process is finalized, the Jail is automatically destroyed.
This adds try_* methods to AK::SinglyLinkedList and
AK::SinglyLinkedListWithCount and updates the network stack to use
those to gracefully handle allocation failures.
Refs #6369.
This decreases the number of bytes necessary to capture the variables
for this lambda. The next step will be to remove dynamic allocations
from AK::Function which depends on this change to keep the size of
AK::Function objects reasonable.
This is intended to reflect the POSIX sched_setparam API, which has some
cryptic language
(https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_08_04_01
) that as far as I can tell implies we should prioritize process
scheduling policies over thread scheduling policies. Technically this
means that a process must have its own sets of policies that are
considered first by the scheduler, but it seems unlikely anyone relies
on this behavior in practice. So we just override all thread's policies,
making them (at least before calls to pthread_setschedparam) behave
exactly like specified on the surface.
The priority range was changed several years ago, but the
userland-reported limits were just forgotten :skeleyak:. Move the thread
priority constants into an API header so that userland can use it
properly.
The syscalls are renamed as they no longer reflect the exact POSIX
functionality. They can now handle setting/getting scheduler parameters
for both threads and processes.
The kernel image grew so much that it wasn't possible to jump to the C++
symbol anymore, since this generated a 'relocation truncated' error when
linking.
Let's put the power_state global node into the /sys/kernel directory,
because that directory represents all global nodes and variables being
related to the Kernel. It's also a mutable node, that is more acceptable
being in the mentioned directory due to the fact that all other files in
the /sys/firmware directory are just firmware blobs and are not mutable
at all.
Now that all global nodes are located in the /sys/kernel directory, we
can safely drop the global nodes in /proc, which includes both /proc/net
and /proc/sys directories as well.
This in fact leaves the ProcFS to only have subdirectories for processes
and the "self" symbolic link to reflect the current process being run.
The ProcFS is an utter mess currently, so let's start move things that
are not related to processes-info. To ensure it's done in a sane manner,
we start by duplicating all /proc/ global nodes to the /sys/kernel/
directory, then we will move Userland to use the new directory so the
old directory nodes can be removed from the /proc directory.
If a program needs to execute a dynamic executable program, then it
should unveil /usr/lib/Loader.so by itself and not rely on the Kernel to
allow using this binary without any sense of respect to unveil promises
being made by the running parent program.
Previously we didn't send the SIGPIPE signal to processes when
sendto()/sendmsg()/etc. returned EPIPE. And now we do.
This also adds support for MSG_NOSIGNAL to suppress the signal.
This commit reached that goal of "safely discarding" a filesystem by
doing the following:
1. Stop using the s_file_system_map HashMap as it was an unsafe measure
to access pointers of FileSystems. Instead, make sure to register all
FileSystems at the VFS layer, with an IntrusiveList, to avoid problems
related to OOM conditions.
2. Make sure to cleanly remove the DiskCache object from a BlockBased
filesystem, so the destructor of such object will not need to do that in
the destruction point.
3. For ext2 filesystems, don't cache the root inode at m_inode_cache
HashMap. The reason for this is that when unmounting an ext2 filesystem,
we lookup at the cache to see if there's a reference to a cached inode
and if that's the case, we fail with EBUSY. If we keep the m_root_inode
also being referenced at the m_inode_cache map, we have 2 references to
that object, which will lead to fail with EBUSY. Also, it's much simpler
to always ask for a root inode and get it immediately from m_root_inode,
instead of looking up the cache for that inode.
The idea is to enable mounting FileSystem objects across multiple mounts
in contrast to what happened until now - each mount has its own unique
FileSystem object being attached to it.
Considering a situation of mounting a block device at 2 different mount
points at in system, there were a couple of critical flaws due to how
the previous "design" worked:
1. BlockBasedFileSystem(s) that pointed to the same actual device had a
separate DiskCache object being attached to them. Because both instances
were not synchronized by any means, corruption of the filesystem is most
likely achieveable by a simple cache flush of either of the instances.
2. For superblock-oriented filesystems (such as the ext2 filesystem),
lack of synchronization between both instances can lead to severe
corruption in the superblock, which could render the entire filesystem
unusable.
3. Flags of a specific filesystem implementation (for example, with xfs
on Linux, one can instruct to mount it with the discard option) must be
honored across multiple mounts, to ensure expected behavior against a
particular filesystem.
This patch put the foundations to start fix the issues mentioned above.
However, there are still major issues to solve, so this is only a start.
We now have a seperately allocated structure for the bookkeeping
information in the QueueHead and TransferDescriptor UHCI strucutres.
This way, we can support 64-bit pointers in UHCI, fixing a problem where
32-bit pointers would truncate the upper 32-bits of the (virtual)
address of the descriptor, causing a crash.
Co-authored-by: b14ckcat <b14ckcat@protonmail.com>
This flag doesn't conform to any POSIX standard nor is found in any OS
out there. The idea behind this mount flag is to ensure that only
non-regular files will be placed in a filesystem, which includes device
nodes, symbolic links, directories, FIFOs and sockets. Currently, the
only valid case for using this mount flag is for TmpFS instances, where
we want to mount a TmpFS but disallow any kind of regular file and only
allow other types of files on the filesystem.
Although this code worked quite well, it is considered to be a code
duplication with the TmpFS code which is more tested and works quite
well for a variety of cases. The only valid reason to keep this
filesystem was that it enforces that no regular files will be created at
all in the filesystem. Later on, we will re-introduce this feature in a
sane manner. Therefore, this can be safely removed after SystemServer no
longer uses this filesystem type anymore.
Instead of using absolute paths which is considered an abstraction layer
violation between the kernel and userspace, let's not hardcode the path
to children PID directories but instead we can use relative path links
to them.
Having this function return `nullptr` explicitly triggers the compiler's
inbuilt checker, as it knows the destination is null. Having this as a
static (scoped) variable for now circumvents this problem.
Decompose the current monolithic USBD Pipe interface into several
subclasses, one for each pair of endpoint type & direction. This is to
make it more clear what data and functionality belongs to which Pipe
type, and prevent nonsensical things like trying to execute a control
transfer on a non-control pipe. This is important, because the Pipe
class is the interface by which USB device drivers will interact with
the HCD, so the clearer and more explicit this interface is the better.
Allocate DMA buffer pages for use within the USBD Pipe class, and allow
for the user to specify the size of this buffer, rounding up to the
next page boundary.
This sets up the RPi::Timer to trigger an interurpt every 4ms using one
of the comparators. The actual time is calculated by looking at the main
counter of the RPi::Timer using the Timer::update_time function.
A stub for Scheduler::timer_tick is also added, since the TimeManagement
code now calls the function.
Instead of just having a giant KBuffer that is not resizeable easily, we
use multiple AnonymousVMObjects in one Vector to store them.
The idea is to not have to do giant memcpy or memset each time we need
to allocate or de-allocate memory for TmpFS inodes, but instead, we can
allocate only the desired block range when trying to write to it.
Therefore, it is also possible to have data holes in the inode content
in case of skipping an entire set of one data block or more when writing
to the inode content, thus, making memory usage much more efficient.
To ensure we don't run out of virtual memory range, don't allocate a
Region in advance to each TmpFSInode, but instead try to allocate a
Region on IO operation, and then use that Region to map the VMObjects
in IO loop.
This makes it easier to differentiate between cases where certain
functionality is not implemented vs. cases where a code location
should really be unreachable.
It costs us nothing, and some utilities (such as the known file utility)
rely on the exposed file size (after doing lstat on it), to show
anything useful besides saying the file is "empty".
In a few places we check `!Processor::in_critical()` to validate
that the current processor doesn't hold any kernel spinlocks.
Instead lets provide it a first class name for readability.
I'll also be adding more of these, so I would rather add more
usages of a nice API instead of this implicit/assumed logic.
This change ensures that the scheduler doesn't depend on a platform
specific or arch-specific code when it initializes itself, but rather we
ensure that in compile-time we will generate the appropriate code to
find the correct arch-specific current time methods.
For some odd reason we used to return PhysicalPtr for a page_table_base
result, but when setting it we accepted only a 32 bit value, so we
truncated valid 64 bit addresses into 32 bit addresses by doing that.
With this commit being applied, now PageDirectories can be located
beyond the 4 GiB barrier.
This was found by sin-ack, therefore he should be credited with this fix
appropriately with Co-authored-by sign.
Co-authored-by: sin-ack <sin-ack@users.noreply.github.com>
There is no particular reason why this section should be marked as
`NOBITS` (as it might very well include initialized values), and it
resolves 90% of the mismatches between the input and output sections,
which LLD now warns about when linking.
Doesn't use them in libc headers so that those don't have to pull in
AK/Platform.h.
AK_COMPILER_GCC is set _only_ for gcc, not for clang too. (__GNUC__ is
defined in clang builds as well.) Using AK_COMPILER_GCC simplifies
things some.
AK_COMPILER_CLANG isn't as much of a win, other than that it's
consistent with AK_COMPILER_GCC.
Nobody uses this because the x86 prekernel environment is corrupting the
ramdisk image prior to running the actual kernel. In the future we can
ensure that the prekernel doesn't corrupt the ramdisk if we want to
bring support back. In addition to that, we could just use a RAM based
filesystem to load whatever is needed like in Linux, without the need of
additional filesystem driver.
For the mentioned corruption problem, look at issue #9893.
The BootFramebufferConsole class maps the framebuffer using the
MemoryManager, so to be able to draw the logo, we need to get this
mapped framebuffer. This commit adds a unsafe API for that.
The MemoryManager now works, so we can use the same code as on x86 to
map the framebuffer. Since it uses the MemoryManager, the initialization
of the BootFramebufferConsole has to happen after the MemoryManager is
working.
For the initial page tables we only need to identity map the kernel
image, the rest of the memory will be managed by the MemoryManager. The
linker script is updated to get the kernel image start and end
addresses.
The page table and page directory formats are architecture specific, so
move the headers into the Arch directory. Also move the aarch64 page
table constants from aarch64/MMU.cpp to aarch64/PageDirectory.h.
When an exception happens it is sometimes hard to figure out where
exactly the exception happened, so use the frame pointer of the trap
frame to print a backtrace.
We no longer require to lock the m_inode_lock in the SharedInodeVMObject
code as the methods write_bytes and read_bytes of the Inode class do
this for us now.
According to Dr. POSIX, we should allow to call mmap on inodes even on
ranges that currently don't map to any actual data. Trying to read or
write to those ranges should result in SIGBUS being sent to the thread
that did violating memory access.
To implement this restriction, we simply check if the result of
read_bytes on an Inode returns 0, which means we have nothing valid to
map to the program, hence it should receive a SIGBUS in that case.
This value will be used later on by WindowServer to reject resolutions
that will request a mapping that will overflow the hardware framebuffer
max length.
This class is intended to replace all IOAddress usages in the Kernel
codebase altogether. The idea is to ensure IO can be done in
arch-specific manner that is determined mostly in compile-time, but to
still be able to use most of the Kernel code in non-x86 builds. Specific
devices that rely on x86-specific IO instructions are already placed in
the Arch/x86 directory and are omitted for non-x86 builds.
The reason this works so well is the fact that x86 IO space acts in a
similar fashion to the traditional memory space being available in most
CPU architectures - the x86 IO space is essentially just an array of
bytes like the physical memory address space, but requires x86 IO
instructions to load and store data. Therefore, many devices allow host
software to interact with the hardware registers in both ways, with a
noticeable trend even in the modern x86 hardware to move away from the
old x86 IO space to exclusively using memory-mapped IO.
Therefore, the IOWindow class encapsulates both methods for x86 builds.
The idea is to allow PCI devices to be used in either way in x86 builds,
so when trying to map an IOWindow on a PCI BAR, the Kernel will try to
find the proper method being declared with the PCI BAR flags.
For old PCI hardware on non-x86 builds this might turn into a problem as
we can't use port mapped IO, so the Kernel will gracefully fail with
ENOTSUP error code if that's the case, as there's really nothing we can
do within such case.
For general IO, the read{8,16,32} and write{8,16,32} methods are
available as a convenient API for other places in the Kernel. There are
simply no direct 64-bit IO API methods yet, as it's not needed right now
and is not considered to be Arch-agnostic too - the x86 IO space doesn't
support generating 64 bit cycle on IO bus and instead requires two 2
32-bit accesses. If for whatever reason it appears to be necessary to do
IO in such manner, it could probably be added with some neat tricks to
do so. It is recommended to use Memory::TypedMapping struct if direct 64
bit IO is actually needed.
The APICTimer, HPET and RTC (the RTC timer is in the context of the PC
RTC here) are timers that exist only in x86 platforms, therefore, we
move the handling code and the initialization code to the Arch/x86/Time
directory. Other related code patterns in the TimeManagement singleton
and in the Random.cpp file are guarded with #ifdef to ensure they are
only compiled for x86 builds.
The new VGAIOArbiter class is now responsible to conduct x86-specific
instructions to control VGA hardware from the old ISA ports. This allows
us to ensure the GraphicsManagement code doesn't use x86-specific code,
thus allowing it to be compiled within non-x86 kernel builds.
The BootFramebufferConsole highly depends on using the m_lock spinlock,
therefore setting and changing the cursor state should be done under
that spinlock too to avoid crashing.
Only the Console code in the Graphics directory should be able to call
on these methods. The set_cursor method stays public as VirtualConsole
uses that method to change the cursor position.
This device is supposed to be used in microvm and ISA-PC machine types,
and we assume that if we are able to probe for the QEMU BGA version of
0xB0C5, then we have an existing ISA Bochs VGA adapter to utilize.
To ensure we don't instantiate the driver for non isa-vga devices, we
try to ensure that PCI is disabled because hardware IO test probe failed
so we can be sure that we use this special handling code only in the
QEMU microvm and ISA-PC machine types. Unfortunately, this means that if
for some reason the isa-vga device is attached for the i440FX or Q35
machine types, we simply are not able to drive the device in such setups
at all.
To determine the amount of VRAM being available, we read VBE register at
offset 0xA. That register holds the amount of VRAM divided by 64K, so we
need to multiply the value in our code to use the actual VRAM size value
again.
The isa-vga device requires us to hardcode the framebuffer physical
address to 0xE0000000, and that address is not expected to change in the
future as many other projects rely on the isa-vga framebuffer to be
present at that physical memory address.
We should aim to reliably determine if PCI hardware exists or not, and
we should consider the ACPI MCFG table in that test. Although it is
unusual to see an hardware setup where the PCI host bridge does not
respond to x86 IO instructions, it is expected to happen at least on the
QEMU microvm machine type as the host bridge only responds to memory
mapped IO requests. Therefore, we first test if ACPI is enabled, and we
try to use it to fetch the MCFG table. Later on we could also add FDT
parsing as part of the PCI IO test which would be useful for the QEMU
microvm machine type.
We use a ScopeGuard to ensure we always set a console of some sort if we
exit early from the initialization sequence in the GraphicsManagement
code. We do so to ensure we can boot into text mode console in an ISA-PC
machine type, because earlier we failed with an assertion due to not
setting any console for VirtualConsole to use.
ISA IDE controllers don't support Bus-master DMA as this feature is only
available for PCI IDE controllers. Therefore, don't try to use DMA mode
for such hardware.
The code in init.cpp is specific to the x86 initialization sequence, so
move it to the Arch/x86 directory in the same fashion like the aarch64
pattern.
The PIC and APIC code are specific to x86 platforms, so move them out of
the general Interrupts directory to Arch/x86/common/Interrupts directory
instead.
That code heavily relies on x86-specific instructions, and while other
CPU architectures and platforms can have PCI IDE controllers, currently
we don't support those, so this code is a special case which needs to be
in the Arch/x86 directory.
In the future it could be put back to the original place when we make it
more generic and suitable for other platforms.
The i8042 controller with its attached devices, the PS2 keyboard and
mouse, rely on x86-specific IO instructions to work. Therefore, move
them to the Arch/x86 directory to make it easier to omit the handling
code of these devices.
The ISA IDE controller code makes sense to be compiled in a x86 build as
it relies on access to the x86 IO space. For other architectures, we can
just omit the code as there's no way we can use that code again.
To ensure we can omit the code easily, we move it to the Arch/x86
directory.
The AHCI code doesn't rely on x86 IO at all as it only uses memory
mapped IO so we can simply remove the header.
We also simply don't use x86 IO in the Intel graphics driver, so we can
simply remove the include of the x86 IO header there too.
Everything else was a bunch of stale includes to the x86 IO header and
are actually not necessary, so let's remove them to make it easier to
compile non-x86 Kernel builds.
The VMWare backdoor handling code involves many x86-specific
instructions and therefore should be in the Arch/x86 directory. This
ensures we can easily omit the code in compile-time for non-x86 builds.
It seems more correct to let each platform to define its own sequence of
initialization of the PCI bus, so let's remove the #if flags and just
put the entire Initializer.cpp file in the appropriate code directory.
Only use the Bochs debug output if we compile a x86 build since bochs
debug output relies on x86 specific instructions.
We also remove the CONSOLE_OUT_TO_BOCHS_DEBUG_PORT flag as we always
compile bochs debug output for x86 builds and we always want to include
the bochs debug output capability as it is very handy and doesn't hurt
bare metal hardware or do any other problem besides taking a small
amount of CPU cycles.
The simple PCI::HostBridge class implements access to the PCI
configuration space by using x86 IO instructions. Therefore, it should
be put in the Arch/x86/PCI directory so it can be easily omitted for
non-x86 builds.
kprintf should not really care about the hardware-specific details of
each UART or serial port out there, so instead of using x86 specific
instructions, let's ensure that we will compile only the relevant code
for debug output for a targeted-specific platform.
The RTC and CMOS are currently only supported for x86 platforms and use
specific x86 instructions to produce only certain x86 plaform operations
and results, therefore, we move them to the Arch/x86 specific directory.
Many code patterns and hardware procedures rely on reliable delay in the
microseconds granularity, and since they are using such delays which are
valid cases, but should not rely on x86 specific code, we allow to
determine in compile time the proper platform-specific code to use to
invoke such delays.
We only use the RTC code in the Kernel, so it doesn't make sense to make
the RTC namespace outside of it. In addition to that, we will need later
on to use the RTC in an x86 specific manner and this will help us to use
this code in such fashion.
We move QEMU and VirtualBox shutdown sequences to a separate file, as
well as moving the i8042 reboot code sequence too to another file.
This allows us to abstract specific methods from the power state node
code of the SysFS filesystem, to allow other architectures to put their
methods there too in the future.
Using the IO address space is only relevant for x86 machines, so let's
not compile instructions to access the PCI configuration space when we
don't target x86 platforms.
This remained undetected for a long time as HeaderCheck is disabled by
default. This commit makes the following file compile again:
// file: compile_me.cpp
#include <Kernel/API/POSIX/ucontext.h>
// That's it, this was enough to cause a compilation error.
According to Dr. POSIX, we should allow to call mmap on inodes even on
ranges that currently don't map to any actual data. Trying to read or
write to those ranges should result in SIGBUS being sent to the thread
that did violating memory access.
We make these methods non-virtual because we want to ensure we properly
enforce locking of the m_inode_lock mutex. Also, for write operations,
we want to call prepare_to_write_data before the actual write. The
previous design required us to ensure the callers do that at various
places which lead to hard-to-find bugs. By moving everything to a place
where we call prepare_to_write_data only once, we eliminate a possibilty
of forgeting to call it on some code path in the kernel.
The block list required a bit of work, and now the only method being
declared const to bypass its const-iness is the read_bytes method that
calls a new method called compute_block_list_with_exclusive_locking that
takes care of proper locking before trying to update the block list data
of the ext2 inode.