Using the kernel stack is preferable, especially when the examined
strings should be limited to a reasonable length.
This is a small improvement, because if we don't actually move these
strings then we don't need to own heap allocations for them during the
syscall handler function scope.
In addition, some kernel strings are known to have a bounded length,
such as the hostname string. For these strings we can also use
FixedStringBuffer to store and copy to and from these buffers, without
using any heap allocations at all.
Instead, use the FixedCharBuffer class to ensure we always use a static
buffer storage for these names. This guarantees that once a Process or a
Thread has been created, setting a new name will never fail, since it
only involves copying the string into that static storage.
The limits which are set are 32 characters for processes' names and 64
characters for thread names - this is because threads' names could be
more verbose than processes' names.
This class encapsulates a fixed Array with compile-time size definition
for storing ASCII characters.
There are also new Kernel StdLib functions to copy user data into such
objects so this class will be useful later on.
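As a rough illustration of the idea, here is a minimal sketch in
standard C++. It is not the Kernel's actual class: the name FixedBuffer,
the store() and view() methods, and the choice to reject strings that do
not fit are all assumptions made for this example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical sketch: a string buffer whose capacity is fixed at compile
// time. Storing into it never allocates; it only copies bytes inline.
template<std::size_t Size>
class FixedBuffer {
public:
    // Copies `value` into the inline storage. Returns false and leaves the
    // buffer untouched if the string does not fit (an assumed policy).
    bool store(std::string_view value)
    {
        if (value.size() > Size)
            return false;
        std::copy(value.begin(), value.end(), m_storage.begin());
        m_length = value.size();
        return true;
    }

    // Returns a view of the currently stored characters.
    std::string_view view() const
    {
        assert(m_length <= Size);
        return std::string_view(m_storage.data(), m_length);
    }

    static constexpr std::size_t capacity() { return Size; }

private:
    std::array<char, Size> m_storage {};
    std::size_t m_length { 0 };
};
```

Since the storage lives inside the object, an owner that embeds such a
buffer can accept a new value without any allocation that could fail.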
Previously we could get a raw pointer to a Mount object which might be
invalid when actually dereferencing it.
To ensure this could not happen, we should just use a callback that will
be used immediately after finding the appropriate Mount entry, while
holding the mount table lock.
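The callback pattern could look roughly like the following sketch. It
uses std::mutex and std::function instead of Kernel primitives, and the
names MountTable, Mount, and apply_to_mount() are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified mount entry.
struct Mount {
    std::string host_path;
    int flags { 0 };
};

class MountTable {
public:
    void add(Mount mount)
    {
        assert(!mount.host_path.empty());
        std::lock_guard<std::mutex> lock(m_lock);
        m_mounts.push_back(std::move(mount));
    }

    // Runs `callback` on the matching entry while the table lock is held,
    // so the entry cannot go away while the callback is using it.
    // Returns false if no entry matches. No pointer escapes this function.
    bool apply_to_mount(std::string const& host_path,
        std::function<void(Mount const&)> const& callback)
    {
        std::lock_guard<std::mutex> lock(m_lock);
        for (auto const& mount : m_mounts) {
            if (mount.host_path == host_path) {
                callback(mount);
                return true;
            }
        }
        return false;
    }

private:
    std::mutex m_lock;
    std::vector<Mount> m_mounts;
};
```

The caller copies out whatever it needs inside the callback instead of
keeping a pointer to the entry around.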
We don't really need this method anymore, because we could just try to
find the mount entry based on the given mount point host custody.
This also allows us to remove the is_vfs_root and root_inode_id methods
from the VirtualFileSystem class.
We could easily encounter a case where the following:
```
mkdir -p /tmp2
mount /dev/hda /tmp2
```
would produce a bug where `ls /tmp2/tmp2` lists the contents of the
`/dev/hda` ext2 root directory, and so does `ls /tmp2/tmp2/tmp2`, and so
on.
To prevent this, we must compare the current custody against each mount
entry's custody to ensure their paths match.
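To make the difference concrete, here is a small sketch that contrasts
a lookup by inode alone with a lookup that also compares paths. The
MountEntry and Custody structs are simplified stand-ins invented for
this example, and the inode number is arbitrary.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for a mount entry and a custody.
struct MountEntry {
    std::string host_path;     // absolute path of the mount point
    unsigned host_inode { 0 }; // inode number of the mount point directory
};

struct Custody {
    std::string path;
    unsigned inode { 0 };
};

// Lookup by inode alone: any directory that shares the inode number of a
// mount point is treated as that mount point, wherever it is in the tree.
inline bool is_mount_point_by_inode(std::vector<MountEntry> const& mounts,
    Custody const& custody)
{
    for (auto const& mount : mounts) {
        if (mount.host_inode == custody.inode)
            return true;
    }
    return false;
}

// Lookup that also compares paths: only the custody at the exact mount
// point path matches.
inline bool is_mount_point_by_path(std::vector<MountEntry> const& mounts,
    Custody const& custody)
{
    assert(!custody.path.empty());
    for (auto const& mount : mounts) {
        if (mount.host_inode == custody.inode && mount.host_path == custody.path)
            return true;
    }
    return false;
}
```

With the image mounted on /tmp2, the nested /tmp2/tmp2 directory has
the same inode number as the mount point, so only the path comparison
tells the two apart.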
This is not useful, as we have literally zero knowledge about where this
inode is actually located with respect to the entire global path tree,
so we could easily encounter a case where we do the following:
```
mkdir -p /tmp2
mount /dev/hda /tmp2
```
and when traversing the /tmp2 directory entries, we will see the root
inode of /dev/hda on "/tmp2/tmp2", even if it was not mounted.
Therefore, we should just plainly give the raw directory entries as they
are written "on the disk". Anything else that needs to exactly know if
there's an underlying mounted filesystem, can just use the stat syscall
instead.
This ensures that the host mount point custody path is not the same as
that of the new to-be-mounted custody.
A scenario that could happen before adding this check is:
```
mkdir -p /tmp2
mount /dev/hda /tmp2/
mount /dev/hda /tmp2/
mount /dev/hda /tmp2/ # this will fail here
```
After adding this check, the same scenario becomes:
```
mkdir -p /tmp2
mount /dev/hda /tmp2/
mount /dev/hda /tmp2/ # this will fail here
mount /dev/hda /tmp2/ # this will fail here too
```
Currently, ephemeral port allocation is handled by the
allocate_local_port_if_needed() and protocol_allocate_local_port()
methods. Actually binding the socket to an address (which means
inserting the socket/address pair into a global map) is performed either
in protocol_allocate_local_port() (for ephemeral ports) or in
protocol_listen() (for non-ephemeral ports); the latter will fail with
EADDRINUSE if the address is already used by an existing pair present in
the map.
There used to be a bug where for listen() without an explicit bind(),
the port allocation would conflict with itself: first an ephemeral port
would get allocated and inserted into the map, and then
protocol_listen() would check again for the port being free, find the
just-created map entry, and error out. This was fixed in commit
01e5af487f by passing an additional flag
did_allocate_port into protocol_listen() which specifies whether the
port was just allocated, and skipping the check in protocol_listen() if
the flag is set.
However, this only helps if the socket is bound to an ephemeral port
inside of this very listen() call. But calling bind(sin_port = 0) from
userspace should succeed and bind to an allocated ephemeral port, in the
same way as using an unbound socket for connect() does. The port number
can then be retrieved from userspace by calling getsockname(), and it
should be possible to either connect() or listen() on this socket,
keeping the allocated port number. Also, calling bind() when already
bound (either explicitly or implicitly) should always result in EINVAL.
To untangle this, introduce an explicit m_bound state in IPv4Socket,
just like LocalSocket already has. Once a socket is bound, further
attempts to bind it fail. Some operations cause the socket to implicitly
get bound to an (ephemeral) address; this is implemented by the new
ensure_bound() method. The protocol_allocate_local_port() method is
gone; it is now up to a protocol to assign a port to the socket inside
protocol_bind() if it finds that the socket has local_port() == 0.
protocol_bind() is now called in more cases, such as inside listen() if
the socket wasn't bound before that.
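The resulting state machine can be sketched as follows. This is a toy
model, not the actual IPv4Socket code: the SketchSocket class, the
shared std::set of used ports, and the ephemeral range 49152-65535 are
assumptions made for this example.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <set>

// Toy model of a socket with an explicit "bound" state.
class SketchSocket {
public:
    explicit SketchSocket(std::set<std::uint16_t>& used_ports)
        : m_used_ports(used_ports)
    {
    }

    // Explicit bind; port 0 requests an ephemeral port.
    // Returns 0 on success or an errno value.
    int bind(std::uint16_t port)
    {
        if (m_bound)
            return EINVAL; // already bound, explicitly or implicitly
        if (port == 0) {
            port = allocate_ephemeral_port();
            if (port == 0)
                return EADDRINUSE; // no free ephemeral port left
        } else if (m_used_ports.count(port) != 0) {
            return EADDRINUSE;
        }
        assert(port != 0);
        m_used_ports.insert(port);
        m_local_port = port;
        m_bound = true;
        return 0;
    }

    // Implicitly binds to an ephemeral port if not bound yet.
    int ensure_bound() { return m_bound ? 0 : bind(0); }

    // listen() keeps an existing binding and no longer re-checks the port.
    int listen() { return ensure_bound(); }

    std::uint16_t local_port() const { return m_local_port; }

private:
    std::uint16_t allocate_ephemeral_port() const
    {
        // Assumed ephemeral range; the real range may differ.
        for (std::uint32_t port = 49152; port <= 65535; ++port) {
            if (m_used_ports.count(static_cast<std::uint16_t>(port)) == 0)
                return static_cast<std::uint16_t>(port);
        }
        return 0;
    }

    std::set<std::uint16_t>& m_used_ports;
    std::uint16_t m_local_port { 0 };
    bool m_bound { false };
};
```

Because the port is only ever checked and inserted inside bind(), a
later listen() cannot conflict with the socket's own map entry.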
Since this is the block size that file system drivers *should* set,
let's name it the logical block size, just like most file systems such
as ext2 already do anyways.
This never was a logical block size, it always was a device-specific
block size. Ideally the block size would change in accordance with
whatever the driver wants to use, but that is a change for the future.
For now, let's get rid of this confusing naming.
This also makes it easier to understand and reference where these
(sometimes rather arbitrary) calculations come from.
This also fixes a bug where group_index_from_block_index assumed 1KiB
blocks.
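For reference, a sketch of a block-size-independent calculation. In
ext2 the first data block (s_first_data_block) is 1 for 1KiB blocks and
0 for larger block sizes, so hardcoding an offset of 1 only works for
1KiB blocks. The struct and the zero-based group numbering below are
assumptions for this example and may not match the driver's own
conventions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical subset of the ext2 superblock fields used here.
struct SuperBlockSketch {
    std::uint32_t first_data_block; // s_first_data_block
    std::uint32_t blocks_per_group; // s_blocks_per_group
};

// Returns the zero-based index of the block group containing `block_index`.
inline std::uint32_t group_index_from_block_index(SuperBlockSketch const& super_block,
    std::uint32_t block_index)
{
    assert(super_block.blocks_per_group != 0);
    assert(block_index >= super_block.first_data_block);
    return (block_index - super_block.first_data_block) / super_block.blocks_per_group;
}
```

With 4KiB blocks and 32768 blocks per group, block 32768 is the first
block of the second group; subtracting a hardcoded 1 would place it in
the first group instead.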
For a long time, our shutdown procedure has basically been:
- Acquire big process lock.
- Switch framebuffer to Kernel debug console.
- Sync and lock all file systems so that disk caches are flushed and
files are in a good state.
- Use firmware and architecture-specific functionality to perform
hardware shutdown.
This naive and simple shutdown procedure has multiple issues:
- No processes are terminated properly, meaning they cannot perform more
complex cleanup work. If they were in the middle of I/O, for instance,
only the data that already reached the Kernel is written to disk, and
data corruption due to unfinished writes can therefore still occur.
- No file systems are unmounted, meaning that any important unmount work
will never happen. This is important for e.g. Ext2, which has
facilities for detecting improper unmounts (see superblock's s_state
variable) and therefore requires a proper unmount to be performed.
This was also the starting point for this PR, since I wanted to
introduce basic Ext2 file system checking and unmounting.
- No hardware is properly shut down beyond what the system firmware does
on its own.
- Shutdown is performed within the write() call that asked the Kernel to
change its power state. If the shutdown procedure takes longer (i.e.
when it's done properly), this blocks the process causing the shutdown
and prevents any potentially-useful interactions between Kernel and
userland during shutdown.
In essence, current shutdown is a glorified system crash with minimal
file system cleanliness guarantees.
Therefore, this commit is the first step in improving our shutdown
procedure. The new shutdown flow is now as follows:
- From the write() call to the power state SysFS node, a new task is
started, the Power State Switch Task. Its only purpose is to change
the operating system's power state. This task takes over shutdown and
reboot duties, although reboot is not modified in this commit.
- The Power State Switch Task assumes that userland has performed all
shutdown duties it can perform on its own. In particular, it assumes
that all kinds of clean process shutdown have been done, and remaining
processes can be hard-killed without consequence. This is an important
separation of concerns: While this commit does not modify userland, in
the future SystemServer will be responsible for performing proper
shutdown of user processes, including timeouts for stubborn processes
etc.
- As mentioned above, the task hard-kills remaining user processes.
- The task hard-kills all Kernel processes except itself and the
Finalizer Task. Since Kernel processes can delay their own shutdown
indefinitely if they want to, they have plenty of opportunity to perform
proper shutdown if necessary. This may become a problem with
non-cooperative Kernel tasks, but as seen two commits earlier, for now
all tasks will cooperate within a few seconds.
- The task waits for the Finalizer Task to clean up all processes.
- The task hard-kills and finalizes the Finalizer Task itself, meaning
that it now is the only remaining process in the system.
- The task syncs and locks all file systems, and then unmounts them. Due
to an unknown refcount bug we currently cannot unmount the root file
system; therefore the task is able to abort the clean unmount if
necessary.
- The task performs platform-dependent hardware shutdown as before.
This commit has multiple remaining issues (or exposed existing ones)
which will need to be addressed in the future but are out of scope for
now:
- Unmounting the root filesystem is impossible due to remaining
references to the inodes /home and /home/anon. I investigated this
very heavily and could not find whoever is holding the last two
references.
- Userland cannot perform proper cleanup, since the Kernel's power state
variable is accessed directly by tools instead of a proper userland
shutdown procedure directed by SystemServer.
The recently introduced Firmware/PowerState procedures are removed
again, since all of the architecture-independent code can live in the
power state switch task. The architecture-specific code is kept,
however.
Once we move to a more proper shutdown procedure, processes other than
the finalizer task must be able to perform cleanup and finalization
duties, not only because the finalizer task itself needs to be cleaned
up by someone. This global variable, mirroring the early boot flags,
allows a future shutdown process to perform cleanup on its own.
Note that while this *could* be considered a weakening in security, the
attack surface is minimal and the results are not dramatic. To exploit
this, an attacker would have to gain a Kernel write primitive to this
global variable (bypassing KASLR among other things) and then gain some
way of calling the relevant functions, all of this only to destroy some
other running process. The same effect can be achieved with LPE which
can often be gained with significantly simpler userspace exploits (e.g.
of setuid binaries).
Since we never check a kernel process's state the way we do for a
userland process, it's possible for a kernel process to ignore the fact
that someone is trying to kill it, and continue running. This is not
desirable if we want to properly shut down all processes, including
Kernel ones.
This is correct since unmount doesn't treat bind mounts specially. If we
don't do this, unmounting bind mounts will call
prepare_for_last_unmount() on the guest FS much too early, which will
most likely fail due to a busy file system.
Previously, we started parsing the ELF file again in a completely
different place, and without the partial mapping that we do while
validating.
Instead of doing manual parsing in two places, just capture the
requested stack size right after we validated it.
This resolves the various "implicit truncation from int to a one-bit
wide bit-field changes value from 1 to -1" warnings produced by Clang
16+ when assigning to single-bit bitfields.
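For context, a signed one-bit bit-field can only represent 0 and -1,
which is why assigning 1 to it changes the value. One common way to
avoid the warning is to give such fields a bool or unsigned type, as in
the sketch below. The struct and field names are invented for this
example, and the actual change may have used a different fix.

```cpp
#include <cassert>

struct FlagsSketch {
    // With `int enabled : 1;` the assignment `enabled = 1` would store -1.
    // A bool or unsigned one-bit field stores 0 or 1 as expected.
    bool enabled : 1;
    unsigned ready : 1;
};

inline FlagsSketch make_enabled_flags()
{
    FlagsSketch flags {};
    flags.enabled = true;
    flags.ready = 1;
    assert(flags.ready == 1); // holds because the field is unsigned
    return flags;
}
```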
The driver would crash if it was unable to find an output route, and
subsequently the destruction of the controller did not invoke
`GenericInterruptHandler::will_be_destroyed()` because on the level of
`AudioController`, that method is unavailable.
By decoupling the interrupt handling from the controller, we get a new
refcounted class that correctly cleans up after itself :^)
We used to not care about stopping an audio output stream for Intel HDA
since AudioServer would continuously send new buffers to play. Since
707f5ac150ef858760eb9faa52b9ba80c50c4262 however, that has changed.
Intel HDA now uses interrupts to detect when each buffer was completed
by the device, and uses a simple heuristic to detect whether a buffer
underrun has occurred so it can stop the output stream.
This was tested on QEMU's Intel HDA (Linux x86_64) and a bare metal MSI
Starship/Matisse HD Audio Controller.
This is a preparation before we can create a usable mechanism to use
filesystem-specific mount flags.
To keep some compatibility with userland code, the LibC and LibCore
mount functions remain usable, but instead of doing one "atomic"
syscall, they now do multiple syscalls to perform the complete procedure
of mounting a filesystem.
The FileBackedFileSystem IntrusiveList in the VFS code is now changed to
be protected by a Mutex, because when we mount a new filesystem, we need
to check if a filesystem is already created for a given source_fd so we
do a scan for that OpenFileDescription in that list. If we fail to find
an already-created filesystem we create a new one and register it in the
list if we successfully mounted it. We use a Mutex because we might need
to initiate disk access during the filesystem creation, which will take
other mutexes in other parts of the kernel, so holding a spinlock while
doing this is not possible.
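The scan-then-create logic might look like this sketch, with
std::mutex standing in for the Kernel Mutex. FileSystemRegistry,
FileSystemSketch, and the integer source id are hypothetical, and for
brevity the sketch registers the filesystem right after creation rather
than after a successful mount.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a file-backed filesystem instance.
struct FileSystemSketch {
    int source_id;
};

class FileSystemRegistry {
public:
    using Creator = std::function<std::shared_ptr<FileSystemSketch>(int)>;

    // Returns the filesystem already created for `source_id`, or creates
    // and registers a new one. The lock is held across creation, which may
    // block on disk I/O; that is fine for a mutex but not for a spinlock.
    std::shared_ptr<FileSystemSketch> find_or_create(int source_id, Creator const& create)
    {
        assert(create);
        std::lock_guard<std::mutex> lock(m_lock);
        for (auto const& file_system : m_file_systems) {
            if (file_system->source_id == source_id)
                return file_system;
        }
        auto file_system = create(source_id);
        if (file_system)
            m_file_systems.push_back(file_system);
        return file_system;
    }

    std::size_t count()
    {
        std::lock_guard<std::mutex> lock(m_lock);
        return m_file_systems.size();
    }

private:
    std::mutex m_lock;
    std::vector<std::shared_ptr<FileSystemSketch>> m_file_systems;
};
```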
Otherwise, reading will sometimes fail on the Raspberry Pi.
This is mostly a hack; the spec has some info about how the correct
divisor should be calculated and how we can recover from timeouts.
Namely, we previously forgot to configure the SD Host Controller for
4-bit mode after issuing ACMD6, which caused data transfers to fail on
bare metal.
Instead of using ifdefs to use the correct platform-specific methods, we
can just use the same pattern we use for the microseconds_delay function
which has specific implementations for each Arch CPU subdirectory.
When linking a kernel image, the correct platform-specific power-state
changing methods will be called from the Firmware/PowerState.cpp file.
Since https://reviews.llvm.org/D131441, libc++ must be included before
LibC. As clang includes libc++ as one of the system includes, LibC
must be included after those, and the only correct way to do that is
to install LibC's headers into the sysroot.
Targets that don't link with LibC, yet require its headers for one
reason or another must add install_libc_headers as a dependency to
ensure that the correct headers have been (re)installed into the
sysroot.
LibC/stddef.h has been dropped since the built-in stddef.h receives
a higher include priority.
In addition, string.h and wchar.h must
define __CORRECT_ISO_CPP_STRING_H_PROTO and
_LIBCPP_WCHAR_H_HAS_CONST_OVERLOADS respectively in order to tell
libc++ to not try to define methods implemented by LibC.
Once LibC is installed to the sysroot and its conflicts with libc++
are resolved, including LibC headers in such a way will cause errors
with a modern LLVM-based toolchain.
This is needed to avoid including LibC headers in Lagom builds.
Unfortunately, we cannot rely on the build machine to provide a
fully POSIX-compatible ELF header for Lagom builds, so we have to
use our own.