For Kernel OOM hardening to work correctly, we need to be able to
call a "nothrow" version of operator new. Unfortunately the default
"throwing" version of operator new assumes that the allocation will
never return on failure and will always throw an exception. This isn't
true in the Kernel, as we don't have exceptions. So if we call the
normal/throwing new and kmalloc returns NULL, the generated code will
happily go and dereference that NULL pointer by invoking the constructor
before we have a chance to handle the failure.
To fix this we declare operator new as noexcept in the Kernel headers,
which will allow the caller to actually handle allocation failure.
The delete implementations need to match the prototype of the new which
allocated them, so we need define delete as noexcept as well. GCC then
errors out declaring that you should implement sized delete as well, so
this change provides those stubs in order to compile cleanly.
Finally the new operator definitions have been standardized as being
declared with [[nodiscard]] to avoid potential memory leaks. So lets
declares the kernel versions that way as well.
This avoids some of the the shortest-lived allocations in the kernel:
StringImpl::create_uninitialized(unsigned long, char*&)
StringImpl::create(char const*, unsigned long, ShouldChomp)
StringBuilder::to_string() const
String::vformatted(StringView, TypeErasedFormatParams)
void Kernel::KBufferBuilder::appendff<unsigned int>(...)
JsonObjectSerializer<Kernel::KBufferBuilder>::add(..., unsigned int)
Kernel::procfs$all(Kernel::InodeIdentifier, ...) const
Kernel::procfs$all(Kernel::InodeIdentifier, Kernel::KBufferBuilder&)
This avoids allocations for initializing the Function<T>
for the NetworkAdapter::for_each callback argument.
Applying this patch decreases CPU utilization for NetworkTask
from 40% to 28% when receiving TCP packets at a rate of 100Mbit/s.
We already have another limit for the total number of packet buffers
allowed (max_packet_buffers). This second limit caused us to
repeatedly allocate and then free buffers.
This patch modifies InodeWatcher to switch to a one watcher, multiple
watches architecture. The following changes have been made:
- The watch_file syscall is removed, and in its place the
create_iwatcher, iwatcher_add_watch and iwatcher_remove_watch calls
have been added.
- InodeWatcher now holds multiple WatchDescriptions for each file that
is being watched.
- The InodeWatcher file descriptor can be read from to receive events on
all watched files.
Co-authored-by: Gunnar Beutner <gunnar@beutner.name>
If the HPET main counter does not support full 64 bits, we should
not expect the upper 32 bit to work. This is a problem when writing
to the upper 32 bit of the comparator value, which requires the
TimerConfiguration::ValueSet bit to be set, but if it's not 64 bit
capable then the bit will not be cleared and leave it in a bad state.
Fixes#6990
This matches what other operating systems like Linux do:
$ ip route get 0.0.0.0
local 0.0.0.0 dev lo src 127.0.0.1 uid 1000
cache <local>
$ ssh 0.0.0.0
gunnar@0.0.0.0's password:
$ ss -na | grep :22 | grep ESTAB
tcp ESTAB 0 0 127.0.0.1:43118 127.0.0.1:22
tcp ESTAB 0 0 127.0.0.1:22 127.0.0.1:43118
When we receive a TCP packet with a sequence number that is not what
we expected we have lost one or more packets. We can signal this to
the sender by sending a TCP ACK with the previous ack number so that
they can resend the missing TCP fragments.
Previously we'd process TCP packets in whatever order we received
them in. In the case where packets arrived out of order we'd end
up passing garbage to the userspace process.
This was most evident for TLS connections:
courage:~ $ git clone https://github.com/SerenityOS/serenity
Cloning into 'serenity'...
remote: Enumerating objects: 178826, done.
remote: Counting objects: 100% (1880/1880), done.
remote: Compressing objects: 100% (907/907), done.
error: RPC failed; curl 56 OpenSSL SSL_read: error:1408F119:SSL
routines:SSL3_GET_RECORD:decryption failed or bad record mac, errno 0
error: 1918 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
When the MSS option header is missing the default maximum segment
size is 536 which results in lots of very small TCP packets that
NetworkTask has to handle.
This adds the MSS option header to outbound TCP SYN packets and
sets it to an appropriate value depending on the interface's MTU.
Note that we do not currently do path MTU discovery so this could
cause problems when hops don't fragment packets properly.
This increases the default TCP window size to a more reasonable
value of 64k. This allows TCP peers to send us more packets before
waiting for corresponding ACKs.
This increases the buffer size for connection-oriented sockets
to 256kB. In combination with the other patches in this series
I was able to receive TCP packets at a rate of about 120Mbps.
The get_dir_entries syscall failed if the serialized form of all the
directory entries together was too large to fit in its temporary buffer.
Now the kernel uses a fixed size buffer, that is flushed to an output
buffer when it is full. If this flushing operation fails because there
is not enough space available, the syscall will return -EINVAL. That
error code is then used in userspace as a signal to allocate a larger
buffer and retry the syscall.
Previously we'd try to load ELF images which did not have
an interpreter set with an incorrect load offset of 0, i.e. way
outside of the part of the address space where we'd expect either
the dynamic loader or the user's executable to reside.
This fixes the problem by using get_load_offset for both executables
which have an interpreter set and those which don't. Notably this
allows us to actually successfully execute the Loader.so binary:
courage:~ $ /usr/lib/Loader.so
You have invoked `Loader.so'. This is the helper program for programs
that use shared libraries. Special directives embedded in executables
tell the kernel to load this program.
This helper program loads the shared libraries needed by the program,
prepares the program to run, and runs it. You do not need to invoke
this helper program directly.
courage:~ $
Previously we'd incorrectly use the default gateway's MAC address.
Instead we must use destination MAC addresses that are derived from
the multicast IPv4 address.
With this patch applied I can query mDNS on a real network.
The Kernel/.gitignore file is a remnant of the prior build system,
where the kernel.map was written directly to to the Kernel folder.
The run.sh was also under Kernel so pcap files and others would get
dropped there when running the system under qemu.
None of these situations are possible now, so lets get rid of it.
Instead of reading in the entire contents of a directory into a large
buffer, we can iterate block by block. This only requires a small
buffer.
Because directory entries are guaranteed to never span multiple blocks
we do not have to handle any edge cases related to that.
On some cases, the FADT could be on the end of a page, so if we don't
have two pages being mapped, we could easily read from a non-mapped
virtual address, which will trigger the UB sanitizer.
Also, we need to treat the FADT structure as volatile and const, as it
may change at any time, but we should not touch (write) it anyhow.
In VFS::rename, if new_path is equal to '/', then, parent custody is
set to null.
VFS::rename would then use parent custody without checking it first.
Fixed VFS::rename to check both old and new path parent custody
before actually using them.
write_bytes is called with a count of 0 bytes if a directory is being
deleted, because in that case even the . and .. pseudo directories are
getting removed. In this case write_bytes is now a no-op.
Before write_bytes would fail because it would check to see if there
were any blocks available to write in (even though it wasn't going to
write in them anyway).
This behaviour was uncovered because of a recent change where
directories are correctly reduced in size. Which in this case results in
all the blocks being removed from the inode, whereas previously there
would be some stale blocks around to pass the check.
e2fsck considers all blocks reachable through any of the pointers in
m_raw_inode.i_block as part of this inode regardless of the value in
m_raw_inode.i_size. When it finds more blocks than the amount that
is indicated by i_size or i_blocks it offers to repair the filesystem
by changing those values. That will actually cause further corruption.
So we must zero all pointers to blocks that are now unused.
The current method of emitting performance events requires a bit of
boiler plate at every invocation, as well as having to ignore the
return code which isn't used outside of the perf event syscall. This
change attempts to clean that up by exposing high level API's that
can be used around the code base.
Ext2 directory contents are stored in a linked list of ext2_dir_entry
structs. There is no sentinel value to determine where the list ends.
Instead the list fills the entirety of the allocated space for the
inode.
Previously the inode was not correctly resized when it became smaller.
This resulted in stale data being interpreted as part of the linked list
of directory entries.
When reading UDP packets from userspace with recvmsg()/recv() we
would hit a VERIFY() if the supplied buffer is smaller than the
received UDP packet. Instead we should just return truncated data
to the caller.
This can be reproduced with:
$ dd if=/dev/zero bs=1k count=1 | nc -u 192.168.3.190 68
We use a global setting to determine if Caps Lock should be remapped to
Control because we don't care how keyboard events come in, just that they
should be massaged into different scan codes.
The `proc` filesystem is able to manipulate this global variable using
the `sysctl` utility like so:
```
# sysctl caps_lock_to_ctrl=1
```
An IP socket can now join a multicast group by using the
IP_ADD_MEMBERSHIP sockopt, which will cause it to start receiving
packets sent to the multicast address, even though this address does
not belong to this host.
When using `sysctl` you can enable/disable values by writing to the
ProcFS. Some drift must have occured where writing was failing due to
a missing `set_mtime` call. Whenever one `write`'s a file the modified
time (mtime) will be updated so we need to implement this interface in
ProcFS.
The fact that current_time can "fail" makes its use a bit awkward.
All callers in the Kernel are trusted besides syscalls, so assert
that they never get there, and make sure all current callers perform
validation of the clock_id with TimeManagement::is_valid_clock_id().
I have fuzzed this change locally for a bit to make sure I didn't
miss any obvious regression.
The variety of checks for Processor::id() == 0 could use some assistance
in the readability department. This change adds a new function to
represent this check, and replaces the comparison everywhere it's used.
FileDescriptionBlocker::m_should_block was shadowing the parent's
FileBlocker::m_should_block variable, which would cause should_block()
to return the wrong value.
Found by @gunnarbeutner
This solves a problem where checking whether a thread is an idle
thread may require iterating all processors if it is not the idle
thread of the current processor.
Previously we would return a 0xdeadc0de frame for every kernel frame
in the real kernel stack when an non super-user issued the request.
This isn't useful, and just produces visual clutter in tools which
attempt to symbolize stacks.
While hacking on `sysctl` an issue in ProcFS was making me unable to
read/write from `/proc/sys/XXX`. Some directories in the ProcFS are not
actually backed by a process and need to return `nullptr` so callbacks
get properly set. We now do an explicit check for the parent to ensure
it's one that is PID-based.
The error handling in all these cases was still using the old style
negative values to indicate errors. We have a nicer solution for this
now with KResultOr<T>. This change switches the interface and then all
implementers to use the new style.
Our current implementation does not work in the special case in which
both shift keys are pressed, and then only one of the keys is released,
as this would result in writing lower case letters, instead of the
expected upper case letters.
This commit fixes that by keeping track of the amount of shift keys
that are pressed (instead of if any are at all), and only switching to
the unshifted keymap once all of them are released.
The previous patch already helped with this, however my idea of only
reading a few packets didn't work and we'd still sometimes end up not
receiving any more packets from the E1000 interface.
With this patch applied my NIC seems to receive packets just fine, at
least for now.
The dance here is not complicated, but it is something that should
be taken note of. Since we append to both lists, we don't want to
orphan the new Inode in the m_links/m_subfolders Vector in the event
that the append to m_parent_fs.m_nodes fails.
Raw sockets don't need a local port, so we shouldn't fail operations
if allocation yields an ENOPROTOOPT.
I'm not in love with the factoring here, just patching up the bug.
Without this patch we end up with sockets in the listener's accept
queue with state 'closed' when doing stealth SYN scans:
Client -> Server: SYN for port 22
Server -> Client: SYN/ACK
Client -> Server: RST (i.e. don't complete the TCP handshake)
Previously, TLS data was always zero-initialized.
To support initializing the values of TLS data, sys$allocate_tls now
receives a buffer with the desired initial data, and copies it to the
master TLS region of the process.
The DynamicLinker gathers the initial TLS image and passes it to
sys$allocate_tls.
We also now require the size passed to sys$allocate_tls to be
page-aligned, to make things easier. Note that this doesn't waste memory
as the TLS data has to be allocated in separate pages anyway.
For example Linux accepts an additional argument for flags in accept4()
that let the user specify what flags they want. However, by default
accept() should not inherit those flags from the listener socket.
When there is more than one file descriptor for a file closing
one of them should not close the underlying file.
Previously this relied on the file's ref_count() but at least
for sockets this didn't work reliably.
Theoretically the append should never fail as we have in-line storage
of FD_SETSIZE, which should always be enough. However I'm planning on
removing the non-try variants of AK::Vector when compiling in kernel
mode in the future, so this will need to go eventually. I suppose it
also protects against some unforeseen bug where we we can append more
than FD_SETSIZE items.
Theoretically the append should never fail as we have in-line storage
of 2, which should be enough. However I'm planning on removing the
non-try variants of AK::Vector when compiling in kernel mode in the
future, so this will need to go eventually. I suppose it also protects
against some unforeseen bug where we we can append more than 2 items.
sys$purge() is a bit unique, in that it is probably in the systems
advantage to attempt to limp along if we hit OOM while processing
the vmobjects to purge. This change modifies the algorithm to observe
OOM and continue trying to purge any previously visited VMObjects.
GCC with -flto is more aggressive when it comes to inlining and
discarding functions which is why we must mark some of the functions
as NEVER_INLINE (because they contain asm labels which would be
duplicated in the object files if the compiler decides to inline
the function elsewhere) and __attribute__((used)) for others so
that GCC doesn't discard them.
This commit will add MSG_PEEK support, which allows a package to be
seen without taking it from the buffer, so that a subsequent recv()
without the MSG_PEEK flag can pick it up.
We had some inconsistencies before:
- Sometimes "The", sometimes "the"
- Sometimes trailing ".", sometimes no trailing "."
I picked the most common one (lowecase "the", trailing ".") and applied
it to all copyright headers.
By using the exact same string everywhere we can ensure nothing gets
missed during a global search (and replace), and that these
inconsistencies are not spread any further (as copyright headers are
commonly copied to new files).
The current implementation would only check the first name.length()
characters match, which means any kernel symbol that the provided name
is a prefix of would match, instead of the actual matching symbol.
This commit fixes that by using StringView::operator==() for the
comparison, which already checks the equality correctly.
Lots of people are confused by the error message you get when the
Toolchain is behind/messed up:
'initializer-list: No such file or directory'
Before this error can happen, catch the problem at CMake configure time,
and provide them with an actionable error message.
Make this stuff a bit easier to maintain by using the
root level variables to build up the Toolchain paths.
Also leave a note for future editors of BuildIt.sh to
give them warning about the other changes they'll need
to make.
Previously the E1000 network adapter would stop receiving further
packets when an RX buffer overrun occurred. This was the case
when connecting the adapter to a real network where enough broadcast
traffic caused the buffer to be full before the kernel had a chance
to clear the RX buffer.
The overall design is the same, but we change a few things,
like decreasing the amount of blocking forever loops. The goal
is to ensure the kernel won't hang forever when dealing with
buggy hardware.
Also, we reset the channel when initializing it, just in case the
hardware was in bad state before we start use it.
The last IP address in an IPv4 subnet is considered the directed
broadcast address, e.g. for 192.168.3.0/24 the directed broadcast
address is 192.168.3.255. We need to consider this address as
belonging to the interface.
Here's an example with this fix applied, SerenityOS has 192.168.3.190:
[gunnar@nyx ~]$ ping -b 192.168.3.255
WARNING: pinging broadcast address
PING 192.168.3.255 (192.168.3.255) 56(84) bytes of data.
64 bytes from 192.168.3.175: icmp_seq=1 ttl=64 time=0.950 ms
64 bytes from 192.168.3.188: icmp_seq=1 ttl=64 time=2.33 ms
64 bytes from 192.168.3.46: icmp_seq=1 ttl=64 time=2.77 ms
64 bytes from 192.168.3.41: icmp_seq=1 ttl=64 time=4.15 ms
64 bytes from 192.168.3.190: icmp_seq=1 ttl=64 time=29.4 ms
64 bytes from 192.168.3.42: icmp_seq=1 ttl=64 time=30.8 ms
64 bytes from 192.168.3.55: icmp_seq=1 ttl=64 time=31.0 ms
64 bytes from 192.168.3.30: icmp_seq=1 ttl=64 time=33.2 ms
64 bytes from 192.168.3.31: icmp_seq=1 ttl=64 time=33.2 ms
64 bytes from 192.168.3.173: icmp_seq=1 ttl=64 time=41.7 ms
64 bytes from 192.168.3.43: icmp_seq=1 ttl=64 time=47.7 ms
^C
--- 192.168.3.255 ping statistics ---
1 packets transmitted, 1 received, +10 duplicates, 0% packet loss,
time 0ms, rtt min/avg/max/mdev = 0.950/23.376/47.676/16.539 ms
[gunnar@nyx ~]$
This turns the perfcore format into more a log than it was before,
which lets us properly log process, thread and region
creation/destruction. This also makes it unnecessary to dump the
process' regions every time it is scheduled like we did before.
Incidentally this also fixes 'profile -c' because we previously ended
up incorrectly dumping the parent's region map into the profile data.
Log-based mmap support enables profiling shared libraries which
are loaded at runtime, e.g. via dlopen().
This enables profiling both the parent and child process for
programs which use execve(). Previously we'd discard the profiling
data for the old process.
The Profiler tool has been updated to not treat thread IDs as
process IDs anymore. This enables support for processes with more
than one thread. Also, there's a new widget to filter which
process should be displayed.
The previous `LOCKER(..)` instrumentation only covered some of the
cases where a lock is actually acquired. By utilizing the new
`AK::SourceLocation` functionality we can now reliably instrument
all calls to lock automatically.
Other changes:
- Tweak the message in `Thread::finalize()` which dumps leaked lock
so it's more readable and includes the function information that is
now available.
- Make the `LOCKER(..)` define a no-op, it will be cleaned up in a
follow up change.
- UBSAN detected cases where we were calling thread->holding_lock(..)
but current_thread was nullptr.
- Fix Lock::force_unlock_if_locked to not pass the correct ref delta to
holding_lock(..).
Userspace can provide a null argument for the `act` argument to the
`sigaction` syscall to not set any new behavior. This is described
here:
https://pubs.opengroup.org/onlinepubs/007904875/functions/sigaction.html
Without this fix, the `copy_from_user(...)` invocation on `user_act`
fails and makes the syscall return early.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.
See: https://spdx.dev/resources/use/#identifiers
This was done with the `ambr` search and replace tool.
ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
The previous implementation could allocate on insertion into the completed / pending
sub request vectors. There's no reason these can't be intrusive lists instead.
This is a very minor step towards improving the ability to handle OOM, as tracked by #6369
It might also help improve performance on the IO path in certain situations.
I'll benchmark that later.
Until we get the goodness that C++ modules are supposed to be, let's try
to shave off some parse time using precompiled headers.
This commit only adds some very common AK headers, only to binaries,
libraries and the kernel (tests are not covered due to incompatibility
with AK/TestSuite.h).
This option is on by default, but can be disabled by passing
`-DPRECOMPILE_COMMON_HEADERS=OFF` to cmake, which will disable all
header precompilations.
This makes the build about 30 seconds faster on my machine (about 7%).
Binding to port 0 is used to signal to listen() to bind to any port
that is available. (in serenity's case, to the port range of 32768 to
60999, which are not privileged ports)
While profiling all processes the profile buffer lives forever.
Once you have copied the profile to disk, there's no need to keep it
in memory. This syscall surfaces the ability to clear that buffer.
This command line flag can be used to disable VirtIO support on
certain configurations (native windows) where interfacing with
virtio devices can cause qemu to freeze.
Pressing this combo will dump a list of all threads and their state
to the debug console.
This might be useful to figure out why the system is not responding.
This adds PT_PEEKDEBUG and PT_POKEDEBUG to allow for reading/writing
the debug registers, and updates the Kernel's debug handler to read the
new information from the debug status register.
Helps with bare metal debugging, as we can't be sure our implementation
will work with a given machine.
As reported by someone on Discord, their machine hangs when we attempt
the dummy transfer.