To count the remaining children, we simply need to traverse the
directory and increment a counter. No need for a custom virtual that
all file systems have to implement. :^)
Unless we're accessing mutex-guarded metadata, there's no need to
acquire the inode lock.
The file system ID or inode index of a constructed inode will never
change, for example.
Reimplement directory traversal in terms of read_bytes() instead of
doing direct block access. This lets us avoid taking the inode lock
while iterating over the directory contents.
Once we've finalized all the file system metadata in flush_writes(),
we no longer need to hold the file system lock during the call to
BlockBasedFileSystem::flush_writes().
Ext2FS::get_inode() will remember unknown inode indices that it has
been asked about and put them into the inode cache as null inodes.
flush_writes() was not null-checking these while iterating, which
was a bug I finally managed to hit.
Flushing also seemed like a good time to drop unknown inodes from
the cache, since there's no good reason to hold to them indefinitely.
The file system lock is meant to protect the file system metadata
(super blocks, bitmaps, etc.) Not protect processes from reading
independent parts of the disk at once.
This patch introduces a new lock to protect the *block cache* instead,
which is the real thing that needs synchronization.
Forcing users of a FileDescription to seek before they can read/write
makes it inherently racy. This patch adds variants of read/write that
simply ignore the "current offset" of the description in favor of a
caller-supplied offset.
In case we are about to delete the PID directory, we clear the Process
pointer. If someone still holds a reference to the PID directory (by
opening it), we still need to delete the process, but we can't delete
the directory, so we will keep it alive, but any operation on it will
fail by propogating the error to userspace about that the Process was
deleted and therefore there's no meaning to trying to do operations on
the directory.
Fixes#8576.
We need some overflow checks due to the implementation of TmpFS.
When size_t is 32 bits and off_t is 64 bits, we might overflow our
KBuffer max size and confuse the KBuffer set_size code, causing a VERIFY
failure. Make sure that resulting offset + size will fit in a size_t.
Another constraint, we make sure that the resulting offset + size will
be less than half of the maximum value of a size_t, because we double
the KBuffer size each time we resize it.
Previously, VirtualFileSystem::mkdir() would always return ENOENT if
no parent custody was returned by resolve_path(). This is incorrect when
e.g. the user has no search permission in a component of the path
prefix (=> EACCES), or if on component of the path prefix is a file (=>
ENOTDIR). This patch fixes that behavior.
This adds KLexicalPath::try_join(). As this cannot be done without
allocation, it uses KString and can fail. This patch also uses it at one
place. All the other cases of String::formatted("{}/{}", ...) currently
rely on the return value being a String, which means they cannot easily
be converted to use the new API.
This replaces all uses of LexicalPath in the Kernel with the functions
from KLexicalPath. This also allows the Kernel to stop including
AK::LexicalPath.
This change enforces that paths passed to
VFS::validate_path_against_process_veil are absolute and do not contain
any '..' or '.' parts. We should VERIFY here instead of returning EINVAL
since the code that calls this should resolve non-canonical paths before
calling this function.
Previously, Custody::absolute_path() was called for every call to
validate_path_against_process_veil(). For processes that don't have a
veil, the path is not used by the function. This means that it is
unnecessarily generated. This introduces an overload to
validate_path_against_process_veil(), which takes a Custody const& and
only generates the absolute path if it there is actually a veil and it
is thus needed.
This patch results in a speed up of Assistant's file system cache
building by around 16 percent.
These small changes fix the remaining warnings that come up during
kernel compilation with Clang. These specific fixes were for benign
things: unused lambda captures and braces around scalar initializers.
The `#pragma GCC diagnostic` part is needed because the class has
virtual methods with the same name but different arguments, and Clang
tries to warn us that we are not actually overriding anything with
these.
Weirdly enough, GCC does not seem to care.
Now we use WeakPtrs to break Ref-counting cycle. Also, we call the
prepare_for_deletion method to ensure deleted objects are ready for
deletion. This is necessary to ensure we don't keep dead processes,
which would become zombies.
In addition to that, add some debug prints to aid debug in the future.
This changes the m_parts, m_dirname, m_basename, m_title and m_extension
member variables to StringViews onto the m_string String. It also
removes the m_is_absolute member in favour of computing if a path is
absolute in the is_absolute() getter. Due to this, the canonicalize()
method has been completely rewritten.
The parts() getter still returns a Vector<String>, although it is no
longer a const reference as m_parts is no longer a Vector<String>.
Rather, it is constructed from the StringViews in m_parts upon request.
The parts_view() getter has been added, which returns Vector<StringView>
const&. Most previous users of parts() have been changed to use
parts_view(), except where Strings are required.
Due to this change, it's is now no longer allow to create temporary
LexicalPath objects to call the dirname, basename, title, or extension
getters on them because the returned StringViews will point to possible
freed memory.
The LexicalPath instance methods dirname(), basename(), title() and
extension() will be changed to return StringView const& in a further
commit. Due to this, users creating temporary LexicalPath objects just
to call one of those getters will recieve a StringView const& pointing
to a possible freed buffer.
To avoid this, static methods for those APIs have been added, which will
return a String by value to avoid those problems. All cases where
temporary LexicalPath objects have been used as described above haven
been changed to use the static APIs.
It didn't make any sense to hardcode the modified time of all created
inodes with "mepoch", so we should query the procfs "backend" to get
the modified time value.
Since ProcFS is dynamically changed all the time, the modified time
equals to the querying time.
This could be changed if desired, by making the modified_time()
method virtual and overriding it in different procfs-backed objects :)
The new ProcFS design consists of two main parts:
1. The representative ProcFS class, which is derived from the FS class.
The ProcFS and its inodes are much more lean - merely 3 classes to
represent the common type of inodes - regular files, symbolic links and
directories. They're backed by a ProcFSExposedComponent object, which
is responsible for the functional operation behind the scenes.
2. The backend of the ProcFS - the ProcFSComponentsRegistrar class
and all derived classes from the ProcFSExposedComponent class. These
together form the entire backend and handle all the functions you can
expect from the ProcFS.
The ProcFSExposedComponent derived classes split to 3 types in the
manner of lifetime in the kernel:
1. Persistent objects - this category includes all basic objects, like
the root folder, /proc/bus folder, main blob files in the root folders,
etc. These objects are persistent and cannot die ever.
2. Semi-persistent objects - this category includes all PID folders,
and subdirectories to the PID folders. It also includes exposed objects
like the unveil JSON'ed blob. These object are persistent as long as the
the responsible process they represent is still alive.
3. Dynamic objects - this category includes files in the subdirectories
of a PID folder, like /proc/PID/fd/* or /proc/PID/stacks/*. Essentially,
these objects are always created dynamically and when no longer in need
after being used, they're deallocated.
Nevertheless, the new allocated backend objects and inodes try to use
the same InodeIndex if possible - this might change only when a thread
dies and a new thread is born with a new thread stack, or when a file
descriptor is closed and a new one within the same file descriptor
number is opened. This is needed to actually be able to do something
useful with these objects.
The new design assures that many ProcFS instances can be used at once,
with one backend for usage for all instances.
The intention is to add dynamic mechanism for notifying the userspace
about hotplug events. Currently, the DMI (SMBIOS) blobs and ACPI tables
are exposed in the new filesystem.
This commit converts naked `new`s to `AK::try_make` and `AK::try_create`
wherever possible. If the called constructor is private, this can not be
done, so we instead now use the standard-defined and compiler-agnostic
`new (nothrow)`.
This fixes#8133.
Ext2FSInode::remove_child() searches the lookup cache, so if it's not
initialized, removing the child fails. If the child was a directory,
this led to it being corrupted and having 0 children.
I also added populate_lookup_cache to add_child. I hadn't seen any
bugs there, but if the cache wasn't populated before, adding that
one entry would make it think it was populated, so that would cause
bugs later.
inode identifiers in ProcFS are encoded in a way that the parent ID is
shifted 12 bits to the left and the PID is shifted by 16 bits. This
means that the rightmost 12 bits are reserved for the file type or the
fd.
Since the to_fd and to_proc_file_type decoders only decoded the
rightmost 8 bits, decoded values would wrap around beyond values of 255,
resulting in a different value compared to what was originally encoded.
This is necessary since the Device class does not hold a reference to
its inode (because there could be multiple), and thus doesn't override
File::stat(). For simplicity, we should just always stat via the inode
if there is one, since that shouldn't ever be the wrong thing.
This partially reverts #7867.
Instead of initializing network adapters in init.cpp, let's move that
logic into a separate class to handle this.
Also, it seems like a good idea to shift responsiblity on enumeration
of network adapters after the boot process, so this singleton will take
care of finding the appropriate network adapter when asked to with an
IPv4 address or interface name.
With this change being merged, we simplify the creation logic of
NetworkAdapter derived classes, so we enumerate the PCI bus only once,
searching for driver candidates when doing so, and we let each driver
to test if it is resposible for the specified PCI device.
If m_unveiled_paths.is_empty(), the root node (which is m_unveiled_paths
itself) is the matching veil. This means we should not return nullptr in
this case, but just use the code path for the general case.
This fixes a bug where calling e.g. unveil("/", "r") would refuse you
access to anything, because find_matching_unveiled_path would wrongly
return nullptr.
Since find_matching_unveiled_path can no longer return nullptr, we can
now just return a reference instead.
Problem:
- `static` variables consume memory and sometimes are less
optimizable.
- `static const` variables can be `constexpr`, usually.
- `static` function-local variables require an initialization check
every time the function is run.
Solution:
- If a global `static` variable is only used in a single function then
move it into the function and make it non-`static` and `constexpr`.
- Make all global `static` variables `constexpr` instead of `const`.
- Change function-local `static const[expr]` variables to be just
`constexpr`.
This modifies the error checks in VFS::open after the call to
resolve_path to ignore a null parent custody if there is no error, as
this is expected when the path to resolve points to "/". Rather, a null
parent custody only constitutes an error if it is accompanied by ENOENT.
This behavior is documented in the VFS::resolve_path_without_veil
method.
To accompany this change, the order of the error checks have been
changed to more naturally fit the new logic.
By constraining two implementations, the compiler will select the best
fitting one. All this will require is duplicating the implementation and
simplifying for the `void` case.
This constraining also informs both the caller and compiler by passing
the callback parameter types as part of the constraint
(e.g.: `IterationFunction<int>`).
Some `for_each` functions in LibELF only take functions which return
`void`. This is a minimal correctness check, as it removes one way for a
function to incompletely do something.
There seems to be a possible idiom where inside a lambda, a `return;` is
the same as `continue;` in a for-loop.
The make<T> factory function allocates internally and immediately
dereferences the pointer, and always returns a NonnullOwnPtr<T> making
it impossible to propagate an error on OOM.
This patch modifies InodeWatcher to switch to a one watcher, multiple
watches architecture. The following changes have been made:
- The watch_file syscall is removed, and in its place the
create_iwatcher, iwatcher_add_watch and iwatcher_remove_watch calls
have been added.
- InodeWatcher now holds multiple WatchDescriptions for each file that
is being watched.
- The InodeWatcher file descriptor can be read from to receive events on
all watched files.
Co-authored-by: Gunnar Beutner <gunnar@beutner.name>
The get_dir_entries syscall failed if the serialized form of all the
directory entries together was too large to fit in its temporary buffer.
Now the kernel uses a fixed size buffer, that is flushed to an output
buffer when it is full. If this flushing operation fails because there
is not enough space available, the syscall will return -EINVAL. That
error code is then used in userspace as a signal to allocate a larger
buffer and retry the syscall.
Instead of reading in the entire contents of a directory into a large
buffer, we can iterate block by block. This only requires a small
buffer.
Because directory entries are guaranteed to never span multiple blocks
we do not have to handle any edge cases related to that.
In VFS::rename, if new_path is equal to '/', then, parent custody is
set to null.
VFS::rename would then use parent custody without checking it first.
Fixed VFS::rename to check both old and new path parent custody
before actually using them.
write_bytes is called with a count of 0 bytes if a directory is being
deleted, because in that case even the . and .. pseudo directories are
getting removed. In this case write_bytes is now a no-op.
Before write_bytes would fail because it would check to see if there
were any blocks available to write in (even though it wasn't going to
write in them anyway).
This behaviour was uncovered because of a recent change where
directories are correctly reduced in size. Which in this case results in
all the blocks being removed from the inode, whereas previously there
would be some stale blocks around to pass the check.
e2fsck considers all blocks reachable through any of the pointers in
m_raw_inode.i_block as part of this inode regardless of the value in
m_raw_inode.i_size. When it finds more blocks than the amount that
is indicated by i_size or i_blocks it offers to repair the filesystem
by changing those values. That will actually cause further corruption.
So we must zero all pointers to blocks that are now unused.
Ext2 directory contents are stored in a linked list of ext2_dir_entry
structs. There is no sentinel value to determine where the list ends.
Instead the list fills the entirety of the allocated space for the
inode.
Previously the inode was not correctly resized when it became smaller.
This resulted in stale data being interpreted as part of the linked list
of directory entries.
We use a global setting to determine if Caps Lock should be remapped to
Control because we don't care how keyboard events come in, just that they
should be massaged into different scan codes.
The `proc` filesystem is able to manipulate this global variable using
the `sysctl` utility like so:
```
# sysctl caps_lock_to_ctrl=1
```
When using `sysctl` you can enable/disable values by writing to the
ProcFS. Some drift must have occured where writing was failing due to
a missing `set_mtime` call. Whenever one `write`'s a file the modified
time (mtime) will be updated so we need to implement this interface in
ProcFS.
Previously we would return a 0xdeadc0de frame for every kernel frame
in the real kernel stack when an non super-user issued the request.
This isn't useful, and just produces visual clutter in tools which
attempt to symbolize stacks.
While hacking on `sysctl` an issue in ProcFS was making me unable to
read/write from `/proc/sys/XXX`. Some directories in the ProcFS are not
actually backed by a process and need to return `nullptr` so callbacks
get properly set. We now do an explicit check for the parent to ensure
it's one that is PID-based.
The error handling in all these cases was still using the old style
negative values to indicate errors. We have a nicer solution for this
now with KResultOr<T>. This change switches the interface and then all
implementers to use the new style.
The dance here is not complicated, but it is something that should
be taken note of. Since we append to both lists, we don't want to
orphan the new Inode in the m_links/m_subfolders Vector in the event
that the append to m_parent_fs.m_nodes fails.
When there is more than one file descriptor for a file closing
one of them should not close the underlying file.
Previously this relied on the file's ref_count() but at least
for sockets this didn't work reliably.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.
See: https://spdx.dev/resources/use/#identifiers
This was done with the `ambr` search and replace tool.
ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
This should allow creating intrusive lists that have smart pointers,
while remaining free (compared to the impl before this commit) when
holding raw pointers :^)
As a sidenote, this also adds a `RawPtr<T>` type, which is just
equivalent to `T*`.
Note that this does not actually use such functionality, but is only
expected to pave the way for #6369, to replace NonnullRefPtrVector<T>
with intrusive lists.
As it is with zero-cost things, this makes the interface a bit less nice
by requiring the type name of what an `IntrusiveListNode` holds (and
optionally its container, if not RawPtr), and also requiring the type of
the container (normally `RawPtr`) on the `IntrusiveList` instance.
This should provide some speed up, as currently searches for regions
containing a given address were performed in O(n) complexity, while
this container allows us to do those in O(logn).
This pattern felt really cluttery:
auto result = something();
if (result.is_error())
return result;
Since it leaves "result" lying around in the no-error case.
Let's use some C++17 if initializer expressions to improve this:
if (auto result = something(); result.is_error())
return result;
Now the "result" goes out of scope if we don't need it anymore.
This is doubly nice since we're also free to reuse the "result"
name later in the same function.
It's perfectly valid for ext2 inodes to have blocks with index 0.
It means that no physical block was allocated for that area of an inode
and we should treat it as if it's filled with zeroes.
Fixes#6139.
The end goal of this commit is to allow to boot on bare metal with no
PS/2 device connected to the system. It turned out that the original
code relied on the existence of the PS/2 keyboard, so VirtualConsole
called it even though ACPI indicated the there's no i8042 controller on
my real machine because I didn't plug any PS/2 device.
The code is much more flexible, so adding HID support for other type of
hardware (e.g. USB HID) could be much simpler.
Briefly describing the change, we have a new singleton called
HIDManagement, which is responsible to initialize the i8042 controller
if exists, and to enumerate its devices. I also abstracted a bit
things, so now every Human interface device is represented with the
HIDDevice class. Then, there are 2 types of it - the MouseDevice and
KeyboardDevice classes; both are responsible to handle the interface in
the DevFS.
PS2KeyboardDevice, PS2MouseDevice and VMWareMouseDevice classes are
responsible for handling the hardware-specific interface they are
assigned to. Therefore, they are inheriting from the IRQHandler class.
Alot of code is shared between i386/i686/x86 and x86_64
and a lot probably will be used for compatability modes.
So we start by moving the headers into one Directory.
We will probalby be able to move some cpp files aswell.
If we happen to read with offset that is after the end of file of a
device, return 0 to indicate EOF. If we return a negative value,
userspace will think that something bad happened when it's really not
the case.
This was using up to 12KB of kernel stack in the triply indirect case
and looks generally spooky. Let's just allocate a ByteBuffer for now
and take the performance hit (of heap allocation). Longer term we can
reorganize the code to reduce the majority of the heap churn.
Switch to using type-safe bitwise operators for the BlockFlags class,
this cleans up a lot of boilerplate casts which are necessary when the
enum is declared as `enum class`.
As it turns out, Dr. POSIX doesn't require that post-mmap() changes
to a file are reflected in the memory mappings. So we don't actually
have to care about the file size changing (or the contents.)
IIUC, as long as all the MAP_SHARED mappings that refer to the same
inode are in sync, we're good.
This means that VMObjects don't need resizing capabilities. I'm sure
there are ways we can take advantage of this fact.
The perfcore file format was previously limited to a single process
since the pid/executable/regions data was top-level in the JSON.
This patch moves the process-specific data into a top-level array
named "processes" and we now add entries for each process that has
been sampled during the profile run.
This makes it possible to see samples from multiple threads when
viewing a perfcore file with Profiler. This is extremely cool! :^)
The superuser can now call sys$profiling_enable() with PID -1 to enable
profiling of all running threads in the system. The perf events are
collected in a global PerformanceEventBuffer (currently 32 MiB in size.)
The events can be accessed via /proc/profile
This may seem like a no-op change, however it shrinks down the Kernel by a bit:
.text -432
.unmap_after_init -60
.data -480
.debug_info -673
.debug_aranges 8
.debug_ranges -232
.debug_line -558
.debug_str -308
.debug_frame -40
With '= default', the compiler can do more inlining, hence the savings.
I intentionally omitted some opportunities for '= default', because they
would increase the Kernel size.
We don't need to flush the on-disk inode struct multiple times while
writing out its block list. Just mark the in-memory Inode as having
dirty metadata and the SyncTask will flush it eventually.
Since the inode is the logical owner of its block list, let's move the
code that computes the block list there, and also stop hogging the FS
lock while we compute the block list, as there is no need for it.
There are two locks in the Ext2FS implementation:
* The FS lock (Ext2FS::m_lock)
This governs access to the superblock, block group descriptors,
and the block & inode bitmap blocks. It's held while allocating
or freeing blocks/inodes.
* The inode lock (Ext2FSInode::m_lock)
This governs access to the inode metadata, including the block
list, and to the content data as well. It's held while doing
basically anything with the inode.
Once an on-disk block/inode is allocated, it logically belongs
to the in-memory Inode object, so there's no need for the FS lock
to be taken while manipulating them, the inode lock is all you need.
This dramatically reduces the impact of disk I/O on path resolution
and various other things that look at individual inodes.
Since these filesystems operate on an underlying file descriptor
and rely on its offset for correctness, let's use the FS lock to
serialize these operations.
This also means that FS subclasses can rely on block-level read/write
operations being atomic.
This reverts commit 1e737a5c50.
The cached block list does not include meta-blocks, so we'd end up
leaking those. There's definitely a nice way to avoid work here, but it
turns out it wasn't quite this trivial. Reverting for now.
This patch combines inode the scan for an available inode with the
updating of the bit in the inode bitmap into a single operation.
We also exit the scan immediately when we find an inode, instead of
continuing until we've scanned all the eligible groups(!)
Finally, we stop holding the filesystem lock throughout the entire
operation, and instead only take it while actually necessary
(during inode allocation, flush, and inode cache update.)
Improve a bunch of situations where we'd previously panic the kernel
on failure. We now propagate whatever error we had instead. Usually
that'll be EIO.
Both inode and block allocation operate on bitmap blocks and update
counters in the superblock and group descriptor.
Since we're here, also add some error propagation around this code.
(...and ASSERT_NOT_REACHED => VERIFY_NOT_REACHED)
Since all of these checks are done in release builds as well,
let's rename them to VERIFY to prevent confusion, as everyone is
used to assertions being compiled out in release.
We can introduce a new ASSERT macro that is specifically for debug
checks, but I'm doing this wholesale conversion first since we've
accumulated thousands of these already, and it's not immediately
obvious which ones are suitable for ASSERT.
There's no real system here, I just added it to various functions
that I don't believe we ever want to call after initialization
has finished.
With these changes, we're able to unmap 60 KiB of kernel text
after init. :^)
This enabled trivial ASLR bypass for non-dumpable programs by simply
opening /proc/PID/vm before exec'ing.
We now hold the target process's ptrace lock across the refresh/write
operations, and deny access if the process is non-dumpable. The lock
is necessary to prevent a TOCTOU race on Process::is_dumpable() while
the target is exec'ing.
Fixes#5270.
In preparation for marking BlockingResult [[nodiscard]], there are a few
places that perform infinite waits, which we never observe the result of
the wait. Instead of suppressing them, add an alternate function which
returns void when performing and infinite wait.
Now that we no longer need to support the signal trampolines being
user-accessible inside the kernel memory range, we can get rid of the
"kernel" and "user-accessible" flags on Region and simply use the
address of the region to determine whether it's kernel or user.
This also tightens the page table mapping code, since it can now set
user-accessibility based solely on the virtual address of a page.