e2fsck considers all blocks reachable through any of the pointers in
m_raw_inode.i_block as part of this inode regardless of the value in
m_raw_inode.i_size. When it finds more blocks than the amount that
is indicated by i_size or i_blocks it offers to repair the filesystem
by changing those values. That will actually cause further corruption.
So we must zero all pointers to blocks that are now unused.
Ext2 directory contents are stored in a linked list of ext2_dir_entry
structs. There is no sentinel value to determine where the list ends.
Instead the list fills the entirety of the allocated space for the
inode.
Previously the inode was not correctly resized when it became smaller.
This resulted in stale data being interpreted as part of the linked list
of directory entries.
We use a global setting to determine if Caps Lock should be remapped to
Control because we don't care how keyboard events come in, just that they
should be massaged into different scan codes.
The `proc` filesystem is able to manipulate this global variable using
the `sysctl` utility like so:
```
# sysctl caps_lock_to_ctrl=1
```
When using `sysctl` you can enable/disable values by writing to the
ProcFS. Some drift must have occured where writing was failing due to
a missing `set_mtime` call. Whenever one `write`'s a file the modified
time (mtime) will be updated so we need to implement this interface in
ProcFS.
Previously we would return a 0xdeadc0de frame for every kernel frame
in the real kernel stack when an non super-user issued the request.
This isn't useful, and just produces visual clutter in tools which
attempt to symbolize stacks.
While hacking on `sysctl` an issue in ProcFS was making me unable to
read/write from `/proc/sys/XXX`. Some directories in the ProcFS are not
actually backed by a process and need to return `nullptr` so callbacks
get properly set. We now do an explicit check for the parent to ensure
it's one that is PID-based.
The error handling in all these cases was still using the old style
negative values to indicate errors. We have a nicer solution for this
now with KResultOr<T>. This change switches the interface and then all
implementers to use the new style.
The dance here is not complicated, but it is something that should
be taken note of. Since we append to both lists, we don't want to
orphan the new Inode in the m_links/m_subfolders Vector in the event
that the append to m_parent_fs.m_nodes fails.
When there is more than one file descriptor for a file closing
one of them should not close the underlying file.
Previously this relied on the file's ref_count() but at least
for sockets this didn't work reliably.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.
See: https://spdx.dev/resources/use/#identifiers
This was done with the `ambr` search and replace tool.
ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
This should allow creating intrusive lists that have smart pointers,
while remaining free (compared to the impl before this commit) when
holding raw pointers :^)
As a sidenote, this also adds a `RawPtr<T>` type, which is just
equivalent to `T*`.
Note that this does not actually use such functionality, but is only
expected to pave the way for #6369, to replace NonnullRefPtrVector<T>
with intrusive lists.
As it is with zero-cost things, this makes the interface a bit less nice
by requiring the type name of what an `IntrusiveListNode` holds (and
optionally its container, if not RawPtr), and also requiring the type of
the container (normally `RawPtr`) on the `IntrusiveList` instance.
This should provide some speed up, as currently searches for regions
containing a given address were performed in O(n) complexity, while
this container allows us to do those in O(logn).
This pattern felt really cluttery:
auto result = something();
if (result.is_error())
return result;
Since it leaves "result" lying around in the no-error case.
Let's use some C++17 if initializer expressions to improve this:
if (auto result = something(); result.is_error())
return result;
Now the "result" goes out of scope if we don't need it anymore.
This is doubly nice since we're also free to reuse the "result"
name later in the same function.
It's perfectly valid for ext2 inodes to have blocks with index 0.
It means that no physical block was allocated for that area of an inode
and we should treat it as if it's filled with zeroes.
Fixes#6139.
The end goal of this commit is to allow to boot on bare metal with no
PS/2 device connected to the system. It turned out that the original
code relied on the existence of the PS/2 keyboard, so VirtualConsole
called it even though ACPI indicated the there's no i8042 controller on
my real machine because I didn't plug any PS/2 device.
The code is much more flexible, so adding HID support for other type of
hardware (e.g. USB HID) could be much simpler.
Briefly describing the change, we have a new singleton called
HIDManagement, which is responsible to initialize the i8042 controller
if exists, and to enumerate its devices. I also abstracted a bit
things, so now every Human interface device is represented with the
HIDDevice class. Then, there are 2 types of it - the MouseDevice and
KeyboardDevice classes; both are responsible to handle the interface in
the DevFS.
PS2KeyboardDevice, PS2MouseDevice and VMWareMouseDevice classes are
responsible for handling the hardware-specific interface they are
assigned to. Therefore, they are inheriting from the IRQHandler class.
Alot of code is shared between i386/i686/x86 and x86_64
and a lot probably will be used for compatability modes.
So we start by moving the headers into one Directory.
We will probalby be able to move some cpp files aswell.
If we happen to read with offset that is after the end of file of a
device, return 0 to indicate EOF. If we return a negative value,
userspace will think that something bad happened when it's really not
the case.
This was using up to 12KB of kernel stack in the triply indirect case
and looks generally spooky. Let's just allocate a ByteBuffer for now
and take the performance hit (of heap allocation). Longer term we can
reorganize the code to reduce the majority of the heap churn.
Switch to using type-safe bitwise operators for the BlockFlags class,
this cleans up a lot of boilerplate casts which are necessary when the
enum is declared as `enum class`.
As it turns out, Dr. POSIX doesn't require that post-mmap() changes
to a file are reflected in the memory mappings. So we don't actually
have to care about the file size changing (or the contents.)
IIUC, as long as all the MAP_SHARED mappings that refer to the same
inode are in sync, we're good.
This means that VMObjects don't need resizing capabilities. I'm sure
there are ways we can take advantage of this fact.
The perfcore file format was previously limited to a single process
since the pid/executable/regions data was top-level in the JSON.
This patch moves the process-specific data into a top-level array
named "processes" and we now add entries for each process that has
been sampled during the profile run.
This makes it possible to see samples from multiple threads when
viewing a perfcore file with Profiler. This is extremely cool! :^)
The superuser can now call sys$profiling_enable() with PID -1 to enable
profiling of all running threads in the system. The perf events are
collected in a global PerformanceEventBuffer (currently 32 MiB in size.)
The events can be accessed via /proc/profile
This may seem like a no-op change, however it shrinks down the Kernel by a bit:
.text -432
.unmap_after_init -60
.data -480
.debug_info -673
.debug_aranges 8
.debug_ranges -232
.debug_line -558
.debug_str -308
.debug_frame -40
With '= default', the compiler can do more inlining, hence the savings.
I intentionally omitted some opportunities for '= default', because they
would increase the Kernel size.
We don't need to flush the on-disk inode struct multiple times while
writing out its block list. Just mark the in-memory Inode as having
dirty metadata and the SyncTask will flush it eventually.
Since the inode is the logical owner of its block list, let's move the
code that computes the block list there, and also stop hogging the FS
lock while we compute the block list, as there is no need for it.
There are two locks in the Ext2FS implementation:
* The FS lock (Ext2FS::m_lock)
This governs access to the superblock, block group descriptors,
and the block & inode bitmap blocks. It's held while allocating
or freeing blocks/inodes.
* The inode lock (Ext2FSInode::m_lock)
This governs access to the inode metadata, including the block
list, and to the content data as well. It's held while doing
basically anything with the inode.
Once an on-disk block/inode is allocated, it logically belongs
to the in-memory Inode object, so there's no need for the FS lock
to be taken while manipulating them, the inode lock is all you need.
This dramatically reduces the impact of disk I/O on path resolution
and various other things that look at individual inodes.
Since these filesystems operate on an underlying file descriptor
and rely on its offset for correctness, let's use the FS lock to
serialize these operations.
This also means that FS subclasses can rely on block-level read/write
operations being atomic.
This reverts commit 1e737a5c50.
The cached block list does not include meta-blocks, so we'd end up
leaking those. There's definitely a nice way to avoid work here, but it
turns out it wasn't quite this trivial. Reverting for now.
This patch combines inode the scan for an available inode with the
updating of the bit in the inode bitmap into a single operation.
We also exit the scan immediately when we find an inode, instead of
continuing until we've scanned all the eligible groups(!)
Finally, we stop holding the filesystem lock throughout the entire
operation, and instead only take it while actually necessary
(during inode allocation, flush, and inode cache update.)
Improve a bunch of situations where we'd previously panic the kernel
on failure. We now propagate whatever error we had instead. Usually
that'll be EIO.
Both inode and block allocation operate on bitmap blocks and update
counters in the superblock and group descriptor.
Since we're here, also add some error propagation around this code.
(...and ASSERT_NOT_REACHED => VERIFY_NOT_REACHED)
Since all of these checks are done in release builds as well,
let's rename them to VERIFY to prevent confusion, as everyone is
used to assertions being compiled out in release.
We can introduce a new ASSERT macro that is specifically for debug
checks, but I'm doing this wholesale conversion first since we've
accumulated thousands of these already, and it's not immediately
obvious which ones are suitable for ASSERT.
There's no real system here, I just added it to various functions
that I don't believe we ever want to call after initialization
has finished.
With these changes, we're able to unmap 60 KiB of kernel text
after init. :^)
This enabled trivial ASLR bypass for non-dumpable programs by simply
opening /proc/PID/vm before exec'ing.
We now hold the target process's ptrace lock across the refresh/write
operations, and deny access if the process is non-dumpable. The lock
is necessary to prevent a TOCTOU race on Process::is_dumpable() while
the target is exec'ing.
Fixes#5270.
In preparation for marking BlockingResult [[nodiscard]], there are a few
places that perform infinite waits, which we never observe the result of
the wait. Instead of suppressing them, add an alternate function which
returns void when performing and infinite wait.
Now that we no longer need to support the signal trampolines being
user-accessible inside the kernel memory range, we can get rid of the
"kernel" and "user-accessible" flags on Region and simply use the
address of the region to determine whether it's kernel or user.
This also tightens the page table mapping code, since it can now set
user-accessibility based solely on the virtual address of a page.
The way we read/write directories is very inefficient, and this doesn't
solve any of that. It does however reduce memory usage of directory
entry vectors by 25% which has nice immediate benefits.
This patch adds Space, a class representing a process's address space.
- Each Process has a Space.
- The Space owns the PageDirectory and all Regions in the Process.
This allows us to reorganize sys$execve() so that it constructs and
populates a new Space fully before committing to it.
Previously, we would construct the new address space while still
running in the old one, and encountering an error meant we had to do
tedious and error-prone rollback.
Those problems are now gone, replaced by what's hopefully a set of much
smaller problems and missing cleanups. :^)
We were checking for size_t (unsigned) overflow but the current offset
is actually stored as off_t (signed). Fix this, and also fail with
EOVERFLOW correctly.
This patch adds sys$msyscall() which is loosely based on an OpenBSD
mechanism for preventing syscalls from non-blessed memory regions.
It works similarly to pledge and unveil, you can call it as many
times as you like, and when you're finished, you call it with a null
pointer and it will stop accepting new regions from then on.
If a syscall later happens and doesn't originate from one of the
previously blessed regions, the kernel will simply crash the process.
We had two ways of creating a new Ext2FS inode. Either they were empty,
or they started with some pre-allocated size.
In practice, the pre-sizing code path was only used for new directories
and it didn't actually improve anything as far as I can tell.
This patch simplifies inode creation by simply always allocating empty
inodes. Block allocation and block list generation now always happens
on the same code path.
The enumeration code is already enumerating all buses, recursively
enumerating bridges (which are buses) makes devices on bridges being
enumerated multiple times. Also, the PCI code was incorrectly mixing up
terminology; let's settle down on bus, device and function because ever
since PCIe came along "slots" isn't really a thing anymore.
This eliminates the window between calling Processor::current and
the member function where a thread could be moved to another
processor. This is generally not as big of a concern as with
Processor::current_thread, but also slightly more light weight.
Instead of letting each File subclass do range allocation in their
mmap() override, do it up front in sys$mmap().
This makes us honor alignment requests for file-backed memory mappings
and simplifies the code somwhat.
This was done with the help of several scripts, I dump them here to
easily find them later:
awk '/#ifdef/ { print "#cmakedefine01 "$2 }' AK/Debug.h.in
for debug_macro in $(awk '/#ifdef/ { print $2 }' AK/Debug.h.in)
do
find . \( -name '*.cpp' -o -name '*.h' -o -name '*.in' \) -not -path './Toolchain/*' -not -path './Build/*' -exec sed -i -E 's/#ifdef '$debug_macro'/#if '$debug_macro'/' {} \;
done
# Remember to remove WRAPPER_GERNERATOR_DEBUG from the list.
awk '/#cmake/ { print "set("$2" ON)" }' AK/Debug.h.in
(mode & S_IFDIR) is not enough to check if "mode" is a directory,
we have to check all the bits in the S_IFMT mask.
Use the is_directory() helper to fix this bug.
Besides removing the monolithic DevFSDeviceInode::determine_name()
method, being able to determine a device's name inside the /dev
hierarchy outside of DevFS has its uses.
..and allow implicit creation of KResult and KResultOr from ErrnoCode.
This means that kernel functions that return those types can finally
do "return EINVAL;" and it will just work.
There's a handful of functions that still deal with signed integers
that should be converted to return KResults.
This way, if something goes wrong, we get to keep the actual error.
Also, KResults are nodiscard, so we have to deal with that in Ext2FS
instead of just silently ignoring I/O errors(!)
Path resolution will now refuse to follow symlinks in some cases where
you don't own the symlink, or when it's in a sticky world-writable
directory and the link has a different owner than the directory.
The point of all this is to prevent classic TOCTOU bugs in /tmp etc.
Fixes#4934
This file was useful for debugging a long time ago, but has bitrotted
at this point. Instead of updating it, let's just remove it since
nothing is using it.
When freeing an inode, we were checking if it's a directory *after*
wiping the inode metadata. This caused us to forget updating the block
group descriptor with the new directory count.
This patch adds a new AnonymousFile class which is a File backed by
an AnonymousVMObject that can only be mmap'ed and nothing else, really.
I'm hoping that this can become a replacement for shbufs. :^)
Problem:
- Many constructors are defined as `{}` rather than using the ` =
default` compiler-provided constructor.
- Some types provide an implicit conversion operator from `nullptr_t`
instead of requiring the caller to default construct. This violates
the C++ Core Guidelines suggestion to declare single-argument
constructors explicit
(https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c46-by-default-declare-single-argument-constructors-explicit).
Solution:
- Change default constructors to use the compiler-provided default
constructor.
- Remove implicit conversion operators from `nullptr_t` and change
usage to enforce type consistency without conversion.
This patch merges the profiling functionality in the kernel with the
performance events mechanism. A profiler sample is now just another
perf event, rather than a dedicated thing.
Since perf events were already per-process, this now makes profiling
per-process as well.
Processes with perf events would already write out a perfcore.PID file
to the current directory on death, but since we may want to profile
a process and then let it continue running, recorded perf events can
now be accessed at any time via /proc/PID/perf_events.
This patch also adds information about process memory regions to the
perfcore JSON format. This removes the need to supply a core dump to
the Profiler app for symbolication, and so the "profiler coredump"
mechanism is removed entirely.
There's still a hard limit of 4MB worth of perf events per process,
so this is by no means a perfect final design, but it's a nice step
forward for both simplicity and stability.
Fixes#4848Fixes#4849
We were not handling sticky parents properly in sys$rmdir(). Child
directories of a sticky parent should not be rmdir'able by just anyone.
Only the owner and root.
Fixes#4875.
Before this change, truncating an Ext2FS inode to a larger size than it
was before would give you uninitialized on-disk data.
Fix this by zeroing out all the new space when doing an inode resize.
This is pretty naively implemented via Inode::write_bytes() and there's
lots of room for cleverness here in the future.
These changes are arbitrarily divided into multiple commits to make it
easier to find potentially introduced bugs with git bisect.Everything:
The modifications in this commit were automatically made using the
following command:
find . -name '*.cpp' -exec sed -i -E 's/dbg\(\) << ("[^"{]*");/dbgln\(\1\);/' {} \;
When ProcFS could no longer allocate KBuffer objects to serve calls to
read, it would just return 0, indicating EOF. This then triggered
parsing errors because code assumed it read the file.
Because read isn't supposed to return ENOMEM, change ProcFS to populate
the file data upon file open or seek to the beginning. This also means
that calls to open can now return ENOMEM if needed. This allows the
caller to either be able to successfully open the file and read it, or
fail to open it in the first place.
There is a window between dropping the last reference and removing
a ProcFSInode from the lookup map. So, when looking up we need to
check if that Inode is being destructed.
Before this change, we would sometimes map a region into the address
space with !is_shared(), and then moments later call set_shared(true).
I found this very confusing while debugging, so this patch makes us pass
the initial shared flag to the Region constructor, ensuring that it's in
the correct state by the time we first map the region.
This adds the ability for a Region to define volatile/nonvolatile
areas within mapped memory using madvise(). This also means that
memory purging takes into account all views of the PurgeableVMObject
and only purges memory that is not needed by all of them. When calling
madvise() to change an area to nonvolatile memory, return whether
memory from that area was purged. At that time also try to remap
all memory that is requested to be nonvolatile, and if insufficient
pages are available notify the caller of that fact.
Compared to version 10 this fixes a bunch of formatting issues, mostly
around structs/classes with attributes like [[gnu::packed]], and
incorrect insertion of spaces in parameter types ("T &"/"T &&").
I also removed a bunch of // clang-format off/on and FIXME comments that
are no longer relevant - on the other hand it tried to destroy a couple of
neatly formatted comments, so I had to add some as well.
The unblock_all variant used to ASSERT if a blocker didn't unblock,
but it wasn't clear from the name that it would do that. Because
the BlockCondition already asserts that no blockers are left at
destruction time, it would still catch blockers that haven't been
unblocked for whatever reason.
Fixes#4496
The partitioning code was very outdated, and required a full refactor.
The new subsystem removes duplicated code and uses more AK containers.
The most important change is that all implementations of the
PartitionTable class conform to one interface, which made it possible
to remove unnecessary code in the EBRPartitionTable class.
Finding partitions is now done in the StorageManagement singleton,
instead of doing so in init.cpp.
Also, now we don't try to find partitions on demand - the kernel will
try to detect if a StorageDevice is partitioned, and if so, will check
what is the partition table, which could be MBR, GUID or EBR.
Then, it will create DiskPartitionMetadata object for each partition
that is available in the partition table. This object will be used
by the partition enumeration code to create a DiskPartition with the
correct minor number.
The DevFS along with DevPtsFS give a complete solution for populating
device nodes in /dev. The main purpose of DevFS is to eliminate the
need of device nodes generation when building the system.
Later on, DevFS will assist with exposing disk partition nodes.
BlockBasedFileSystem::read_block method should get a reference of
a UserOrKernelBuffer.
If we need to force caching a block, we will call other method to do so.
clang trunk with -std=c++20 doesn't seem to properly look for an
aggregate initializer here when the type being constructed is a simple
aggregate (e.g. `struct Thing { int a; int b; };`). This template fails
to compile in a usage added 12/16/2020 in `AK/Trie.h`.
Both forms of initialization are supposed to call the
aggregate-initializers but direct-list-initialization delegating to
aggregate initializers is a new addition in c++20 that might not be
implemented yet.
This was a goofy kernel API where you could assign an icon_id (int) to
a process which referred to a global shbuf with a 16x16 icon bitmap
inside it.
Instead of this, programs that want to display a process icon now
retrieve it from the process executable instead.
This new flag controls two things:
- Whether the kernel will generate core dumps for the process
- Whether the EUID:EGID should own the process's files in /proc
Processes are automatically made non-dumpable when their EUID or EGID is
changed, either via syscalls that specifically modify those ID's, or via
sys$execve(), when a set-uid or set-gid program is executed.
A process can change its own dumpable flag at any time by calling the
new sys$prctl(PR_SET_DUMPABLE) syscall.
Fixes#4504.
This is instead of the UID:GID, since that was allowing some very bad
information leaks like spawning "su" as an unprivileged user and having
full /proc access to it.
Work towards #4504.
ProcFS /proc/<pid>/vm map info no longer contains two `purgeable` keys.
The second `purgeable` key has been removed and replaced with keys for
`kernel` and `cacheable`.
This implements a number of changes related to time:
* If a HPET is present, it is now used only as a system timer, unless
the Local APIC timer is used (in which case the HPET timer will not
trigger any interrupts at all).
* If a HPET is present, the current time can now be as accurate as the
chip can be, independently from the system timer. We now query the
HPET main counter for the current time in CPU #0's system timer
interrupt, and use that as a base line. If a high precision time is
queried, that base line is used in combination with quering the HPET
timer directly, which should give a much more accurate time stamp at
the expense of more overhead. For faster time stamps, the more coarse
value based on the last interrupt will be returned. This also means
that any missed interrupts should not cause the time to drift.
* The default system interrupt rate is reduced to about 250 per second.
* Fix calculation of Thread CPU usage by using the amount of ticks they
used rather than the number of times a context switch happened.
* Implement CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE and use it
for most cases where precise timestamps are not needed.
Problem:
- `(void)` simply casts the expression to void. This is understood to
indicate that it is ignored, but this is really a compiler trick to
get the compiler to not generate a warning.
Solution:
- Use the `[[maybe_unused]]` attribute to indicate the value is unused.
Note:
- Functions taking a `(void)` argument list have also been changed to
`()` because this is not needed and shows up in the same grep
command.
Fix some problems with join blocks where the joining thread block
condition was added twice, which lead to a crash when trying to
unblock that condition a second time.
Deferred block condition evaluation by File objects were also not
properly keeping the File object alive, which lead to some random
crashes and corruption problems.
Other problems were caused by the fact that the Queued state didn't
handle signals/interruptions consistently. To solve these issues we
remove this state entirely, along with Thread::wait_on and change
the WaitQueue into a BlockCondition instead.
Also, deliver signals even if there isn't going to be a context switch
to another thread.
Fixes#4336 and #4330
This makes the Scheduler a lot leaner by not having to evaluate
block conditions every time it is invoked. Instead evaluate them as
the states change, and unblock threads at that point.
This also implements some more waitid/waitpid/wait features and
behavior. For example, WUNTRACED and WNOWAIT are now supported. And
wait will now not return EINTR when SIGCHLD is delivered at the
same time.
Use the TimerQueue to expire blocking operations, which is one less thing
the Scheduler needs to check on every iteration.
Also, add a BlockTimeout class that will automatically handle relative or
absolute timeouts as well as overriding timeouts (e.g. socket timeouts)
more consistently.
Also, rework the TimerQueue class to be able to fire events from
any processor, which requires Timer to be RefCounted. Also allow
creating id-less timers for use by blocking operations.
This makes misses in the BlockBasedFS's LRU block cache faster by
storing the cache entries in one of two doubly-linked list.
Dirty and clean cache entries are kept in two separate lists, and
move between them when their state changes. This can probably be
improved upon further.
If the inode's block list cache is empty, we forgot to assign the
result of computing the block list. The fact that this worked anyway
makes me wonder when we actually don't have a cache..
Thanks to szyszkienty for spotting this! :^)
Instead of doing a linear scan of the entire cache when doing a lookup,
we now have a nice O(1) HashMap in front of the cache.
The cache miss case can still be improved, this patch really only helps
the cache hit case.
This dramatically improves cached filesystem I/O. :^)
Since we're using UserOrKernelBuffers, SMAP will be automatically
disabled when we actually access the buffer later on. There's no need
to disable it wholesale across the entire read/write operations.
This is a new "browse" permission that lets you open (and subsequently list
contents of) directories underneath the path, but not regular files or any other
types of files.
e2fsck was complaining about blocks being allocated in an inode's list
of direct blocks while at the same time being free in the block bitmap.
It was easy to reproduce by creating a file with non-zero length and
then truncating it. This fixes the issue by clearing out the direct
block list when resizing a file to 0.
This makes most operations thread safe, especially so that they
can safely be used in the Kernel. This includes obtaining a strong
reference from a weak reference, which now requires an explicit
call to WeakPtr::strong_ref(). Another major change is that
Weakable::make_weak_ref() may require the explicit target type.
Previously we used reinterpret_cast in WeakPtr, assuming that it
can be properly converted. But WeakPtr does not necessarily have
the knowledge to be able to do this. Instead, we now ask the class
itself to deliver a WeakPtr to the type that we want.
Also, WeakLink is no longer specific to a target type. The reason
for this is that we want to be able to safely convert e.g. WeakPtr<T>
to WeakPtr<U>, and before this we just reinterpret_cast the internal
WeakLink<T> to WeakLink<U>, which is a bold assumption that it would
actually produce the correct code. Instead, WeakLink now operates
on just a raw pointer and we only make those constructors/operators
available if we can verify that it can be safely cast.
In order to guarantee thread safety, we now use the least significant
bit in the pointer for locking purposes. This also means that only
properly aligned pointers can be used.
When computing the list of blocks to deallocate when freeing an inode,
we would stop collecting blocks after reaching the inode's block count.
Since we're getting rid of the inode, we need to also include the meta
blocks used by the on-disk block list itself.
If you try to do this (e.g "mv directory directory"), sys$rename() will
now fail with EDIRINTOSELF.
Dr. POSIX says we should return EINVAL for this, but a custom error
code allows us to print a much more helpful error message when this
problem occurs. :^)
Problem:
- `constexpr` functions are decorated with the `inline` specifier
keyword. This is redundant because `constexpr` functions are
implicitly `inline`.
- [dcl.constexpr], §7.1.5/2 in the C++11 standard): "constexpr
functions and constexpr constructors are implicitly inline (7.1.2)".
Solution:
- Remove the redundant `inline` keyword.
The block group indices are 1-based for some reason. Because of that,
we were forgetting to check in the very last block group when doing
block allocation. This caused block allocation to fail even when the
superblock indicated that we had free blocks.
Fixes#3674.
First of all, this fixes a dumb info leak where we'd write kernel heap
addresses (StringImpl*) into userspace memory when reading a watcher.
Instead of trying to pass names to userspace, we now simply pass the
child inode index. Nothing in userspace makes use of this yet anyway,
so it's not like we're breaking anything. We'll see how this evolves.
Before e06362de9487806df92cf2360a42d3eed905b6bf this was a sneaky buffer
overflow. BufferStream did not do range checking and continued to write
past the allocated buffer (the size of which was controlled by the
user.)
The issue surfaced after my changes because OutputMemoryStream does
range checking.
Not sure how exploitable that bug was, directory entries are somewhat
controllable by the user but the buffer was on the heap, so exploiting
that should be tough.
Since the CPU already does almost all necessary validation steps
for us, we don't really need to attempt to do this. Doing it
ourselves doesn't really work very reliably, because we'd have to
account for other processors modifying virtual memory, and we'd
have to account for e.g. pages not being able to be allocated
due to insufficient resources.
So change the copy_to/from_user (and associated helper functions)
to use the new safe_memcpy, which will return whether it succeeded
or not. The only manual validation step needed (which the CPU
can't perform for us) is making sure the pointers provided by user
mode aren't pointing to kernel mappings.
To make it easier to read/write from/to either kernel or user mode
data add the UserOrKernelBuffer helper class, which will internally
either use copy_from/to_user or directly memcpy, or pass the data
through directly using a temporary buffer on the stack.
Last but not least we need to keep syscall params trivial as we
need to copy them from/to user mode using copy_from/to_user.
I decided to modify MappedROM.h because all other entried in Forward.h
are also classes, and this is visually more pleasing.
Other than that, it just doesn't make any difference which way we resolve
the conflicts.
Since "rings" typically refer to code execution and user processes
can also execute in ring 0, rename these functions to more accurately
describe what they mean: kernel processes and user processes.
Instead of FileDescriptor branching on the type of File it's wrapping,
add a File::stat() function that can be overridden to provide custom
behavior for the stat syscalls.
Add an ExpandableHeap and switch kmalloc to use it, which allows
for the kmalloc heap to grow as needed.
In order to make heap expansion to work, we keep around a 1 MiB backup
memory region, because creating a region would require space in the
same heap. This means, the heap will grow as soon as the reported
utilization is less than 1 MiB. It will also return memory if an entire
subheap is no longer needed, although that is rarely possible.
A change introduced in 5e01234 made it the resposibility of each
filesystem to have the file types returned from
'traverse_as_directory' match up with the DT_* types.
However, this caused corruption of the Ext2FS file format because
the Ext2FS uses 'traverse_as_directory' internally when manipulating
the file system. The result was a mixture between EXT2_FT_* and DT_*
file types in the internal Ext2FS structures.
Starting with this commit, the conversion from internal filesystem file
types to the user facing DT_* types happens at a later stage,
in the 'FileDescription::get_dir_entries' function which is directly
used by sys$get_dir_entries.
This fixes an issue we had in the git port where git would not
recognize untracked files (for example in 'git status').
When git used readdir, the 'd_type' field in the dirent struct contained
bad values (Specifically, it contained the values defiend in
Kernel/FileSystem/ext2_fs.h instead of the ones in LibC/dirent.h).
After this fix, we can create a new git repository with 'git init', and
then stage and commit files as usual.
MemoryManager cannot use the Singleton class because
MemoryManager::initialize is called before the global constructors
are run. That caused the Singleton to be re-initialized, causing
it to create another MemoryManager instance.
Fixes#3226
Unlike DirectoryEntry (which is used when constructing directories),
DirectoryEntryView does not manage storage for file names. Names are
just StringViews.
This is much more suited to the directory traversal API and makes
it easier to implement this in file system classes since they no
longer need to create temporary name copies while traversing.
Pledges and Veil state don't really make sense for kernel mode
processes, as they can do what ever they want since they are in
kernel mode. Make this clear in the system monitor UI by marking
these entries as null.
The SI prefixes "k", "M", "G" mean "10^3", "10^6", "10^9".
The IEC prefixes "Ki", "Mi", "Gi" mean "2^10", "2^20", "2^30".
Let's use the correct name, at least in code.
Only changes the name of the constants, no other behavior change.
This would have caused an issue later when we enable -Wmissing-declarations, as
the compiler didn't see that Kernel::all_inodes() was being used elsewhere, too.
Also, this means that if the type changes later, there's not going to be weird
run-time issues, but rather a nice type error during compile time.