Previously TCPSocket::send_tcp_packet() would try to send TCP packets
which matched whatever size the userspace program specified. We'd try to
break those packets up into smaller fragments. A much better
approach is to limit TCP packets to the maximum segment size and
avoid fragmentation altogether.
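As a rough illustration (standalone C++ with hypothetical names, not the actual TCPSocket code), capping each segment's payload at the MSS looks like this:
```
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <span>
#include <vector>

// Hypothetical helper: split a userspace buffer into payloads of at most
// `mss` bytes so that no single packet ever needs IP fragmentation.
// Assumes mss > 0.
std::vector<std::span<const uint8_t>> segment_payload(std::span<const uint8_t> data, size_t mss)
{
    std::vector<std::span<const uint8_t>> segments;
    for (size_t offset = 0; offset < data.size(); offset += mss) {
        size_t length = std::min(mss, data.size() - offset);
        segments.push_back(data.subspan(offset, length));
    }
    return segments;
}
```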
Since `s_mm_lock` is a RecursiveSpinlock, if a kernel thread gets
preempted while accidentally holding the lock during switch_context,
another thread running on the same processor could end up manipulating
the state of the memory manager even though they should not be able to.
It will just bump the recursion count and keep going.
This appears to be the root cause of weird bugs like #7359, where
page protection magically appears to be wrong during execution.
To avoid these cases, let's guard this specific unfortunate case and
make sure it can never go unnoticed again.
The assert was Tom's idea to help debug this, so I am going to tag him
as co-author of this commit.
Co-Authored-By: Tom <tomut@yahoo.com>
The Alternate Screen Buffer is used by full-screen terminal applications
(like `vim` and `nano`). Its data is stored separately from the normal
buffer, so after applications using it exit, everything looks like it
did before and the bottom of their interfaces isn't left visible. An
interesting feature is that it does not support scrollback, so it
consumes less memory by not having to allocate lines for history.
Because of the need to save and restore state between the switches, some
correctness issues relating to it were also fixed in this commit.
This commit introduces support for 3 new escape sequences:
1. Stop blinking cursor mode
2. `DECTCEM` mode (enable/disable cursor)
3. `DECSCUSR` (set cursor style)
`TerminalWidget` now supports the following cursor types: block,
underline and vertical bar. Each of these can blink or be steady.
`VirtualConsole` ignores these (just as we were doing before).
I introduced a regression in #7184 where `TTY` would report 1 byte read
in canonical mode even if we had no more characters left. This was
caused by counting the '\0' that denotes EOF into the number of
characters that were read.
The fix was simple: exclude the EOF character from the number of bytes.
This still wouldn't be correct by itself, as the EOF and EOL control
characters could change between when the data was written to the TTY and
when it is read. We fix this by signaling out-of-band whether something
is a special character. End-of-file markers have a value of zero and
have their special bits set. Any other bytes with a special flag are
treated as line endings. This is possible, as POSIX doesn't allow
special characters to be 0.
Fixes #7419
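A standalone sketch of the out-of-band signaling described above; the types and helper are illustrative, not the kernel's TTY structures:
```
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical queued-byte type: the "special" flag travels out-of-band
// next to the byte, set at write time when the byte was EOF or EOL.
struct QueuedByte {
    uint8_t value { 0 };
    bool is_special { false };
};

// Canonical-mode read sketch: copy bytes up to and including a line ending,
// but never count the EOF marker (value 0 with the special flag) itself.
size_t read_canonical(std::vector<QueuedByte>& queue, uint8_t* out, size_t out_size)
{
    size_t count = 0;
    while (!queue.empty() && count < out_size) {
        QueuedByte byte = queue.front();
        queue.erase(queue.begin());
        if (byte.is_special && byte.value == 0)
            break; // end-of-file marker: stop without counting it
        out[count++] = byte.value;
        if (byte.is_special)
            break; // any other special byte is treated as a line ending
    }
    return count;
}
```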
Previously the process' m_profiling flag was ignored for all event
types other than CPU samples.
The kfree tracing code relies on temporarily disabling tracing during
exec. This didn't work for per-process profiles and would instead
panic.
This updates the profiling code so that the m_profiling flag isn't
ignored.
There's no good reason to distinguish between network interfaces based
on their model. It's probably a good idea to try to keep the names more
persistent so scripts written for a specific network interface remain
usable after a hotplug event (or after rebooting with a new hardware
setup).
This is by default left empty, so people won't run the kernel in a mode
which they didn't want to. The embedded string will override the
supplied commandline from the bootloader, which is good for debugging
sessions.
This change seemed important to me, because I debug the kernel on bare
metal with iPXE, and every change to the commandline meant that I needed
to rewrite a new iPXE USB image with a modified iPXE script.
First scan PCI bus 0. Find any device on that bus, and if it's a
PCI-to-PCI bridge, recursively scan it too.
Then, try to handle multiple PCI host bridges on slot 0, device 0.
If we happened to miss some PCI buses because they are not reachable
through the recursive PCI-to-PCI bridge scan starting from bus 0, we
might find them in this second scan.
This reverts commit e95eb7a51d.
This is causing some sort of list corruption, as evidenced by #7313.
I haven't been able to figure it out yet, so let's revert this change
until I can figure out what's going on.
When mmaping a Framebuffer from userspace, we need to check whether the
framebuffer device is actually enabled (e.g. graphical mode is being
used) or a textual VirtualConsole is active.
Considering the above state, we mmap the right VMObject to avoid
graphical artifacts such as this one: if we change the resolution from
DisplaySettings, switch to textual mode, and the resolution change is
then reverted, the Desktop would reappear even though we are still in
textual mode.
Before this patch, trying to change the resolution triggered a kernel
crash due to mmaping the framebuffer device again.
Therefore, on mmaping of the framebuffer device, we create an entirely
new set of VMObjects and Regions for the new settings.
Then, when we change the resolution, the framebuffer console needs to
be updated with the new resolution and also refreshed with the new
settings. To ensure we handle both shrinking and growing of the
resolution, we only copy the right amount of available data from the
cells Region.
Instead of processing the input after receiving an IRQ, we shift the
responsibility to the io work queue to handle this for us, so if a page
fault occurs when trying to switch the VirtualConsole, the kernel can
handle that.
I introduced this bug in e95eb7a51, where it's possible that the
ProcessGroup is created, but we never add it to the list. Make sure we
check that we are in the list before removal. This only broke booting in
self-test mode oddly enough.
Reported-By: Andrew Kaster <andrewdkaster@gmail.com>
Avoid allocating while holding the g_process_groups_lock spinlock, it's
a pattern that has a negative effect on performance and scalability,
especially given that it is a global lock, reachable by all processes.
Currently in SMP mode we hard code support for up to only 8 processors.
There is no reason for this to be a dynamic allocation that needs to be
guarded by a spinlock. Instead, use an Array<T*> with inline storage of
8 entries, allowing each processor to initialize itself in place,
avoiding the need for locks entirely.
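A minimal sketch of the shape of this change, using std::array as a stand-in for AK's Array and hypothetical names:
```
#include <array>
#include <cstddef>

struct Processor; // stand-in for the per-CPU structure

// Fixed-size table with inline storage for up to 8 processors; each CPU
// writes only its own slot during early bring-up, so no lock is required.
static constexpr size_t MAX_PROCESSORS = 8;
static std::array<Processor*, MAX_PROCESSORS> s_processors {};

void register_processor(size_t cpu_index, Processor& processor)
{
    // Each processor initializes its own entry in place.
    s_processors[cpu_index] = &processor;
}
```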
Spinlocks guard short regions, with hopefully no other locks being taken
in the process. Violating these constraints usually has detrimental
effects on platform stability as well as performance and scalability.
Allocating memory takes its own locks, and can in some cases even
allocate new regions, thus violating these tenets.
Move the AnonymousVMObject creation outside of the spinlock as
creation does not modify any shared state.
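The general pattern, sketched here with a std::mutex standing in for the spinlock and made-up types:
```
#include <memory>
#include <mutex>
#include <vector>

struct VMObject { /* ... */ };

static std::mutex s_objects_lock; // stand-in for the kernel spinlock
static std::vector<std::shared_ptr<VMObject>> s_objects;

std::shared_ptr<VMObject> create_and_register()
{
    // Allocate first: object creation touches no shared state and may
    // itself need to allocate, which must not happen under the lock.
    auto object = std::make_shared<VMObject>();

    // Only the brief insertion into shared state happens under the lock.
    std::lock_guard guard(s_objects_lock);
    s_objects.push_back(object);
    return object;
}
```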
Hook the kernel page fault handler and capture page fault events when
the fault has a current thread attached in TLS. We capture the eip and
ebp so we can unwind the stack and locate which pieces of code are
generating the most page faults.
Co-authored-by: Gunnar Beutner <gbeutner@serenityos.org>
Problem:
- `static` variables consume memory and sometimes are less
optimizable.
- `static const` variables can be `constexpr`, usually.
- `static` function-local variables require an initialization check
every time the function is run.
Solution:
- If a global `static` variable is only used in a single function then
move it into the function and make it non-`static` and `constexpr`.
- Make all global `static` variables `constexpr` instead of `const`.
- Change function-local `static const[expr]` variables to be just
`constexpr`.
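A before/after sketch of the function-local case, with made-up values:
```
// Before: a function-local static const needs storage and a guarded
// one-time initialization check on every call.
int scaled_before(int x)
{
    static const int factor = 3;
    return x * factor;
}

// After: constexpr makes the value a compile-time constant, so there is
// no storage and no initialization check.
int scaled_after(int x)
{
    constexpr int factor = 3;
    return x * factor;
}
```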
The WorkQueue class previously had its own inline storage functionality
for function pointers. With the recent changes to the Function class
this is no longer necessary.
This modifies the error checks in VFS::open after the call to
resolve_path to ignore a null parent custody if there is no error, as
this is expected when the path to resolve points to "/". Rather, a null
parent custody only constitutes an error if it is accompanied by ENOENT.
This behavior is documented in the VFS::resolve_path_without_veil
method.
To accompany this change, the order of the error checks has been
changed to more naturally fit the new logic.
Fixes a bug where TTY::write will attempt to write into the underlying
device but will not acknowledge the result of that write, instead
assuming that the write fully completed.
There is a slight race condition in our implementation of write().
We call File::can_write() before attempting to write to it (blocking if
it returns false). If it returns true, we assume that we can write to
the file, and our code assumes that File::write() cannot possibly fail
by being blocked. There is, however, the rare case where another process
writes to the file and prevents further writes in between the call to
File::can_write() and File::write() in the first process. This would
result in the first process calling File::write() when the file cannot
be written to.
We fix this by adding a mechanism for File::can_write() to signal that
it was blocked, making it the responsibility of File::write() to check
whether it can write and then finally making sys$write() check if the
write failed due to it being blocked.
This commit adds support for initializing multiple serial ports per
PCI board, as well as initializing multiple different PCI serial boards.
Currently we just choose the first PCI serial port seen as the debug
port, but this should probably be made configurable somehow in the
future.
This avoids two allocations when receiving network packets. One for
inserting a PacketWithTimestamp into m_packet_queue and another one
when inserting buffers into the list of unused packet buffers.
With this fixed the only allocations in NetworkTask happen when
initially allocating the PacketWithTimestamp structs and when switching
contexts.
Problem:
- `BitmapView` permits changing the underlying `Bitmap`. This violates
the idea of a "view" since views are simply overlays which can
themselves change but do not change the underlying data.
Solution:
- Migrate all non-`const` member functions to Bitmap.
When profiling a single process we didn't disable the profile timer.
enable_profile_timer()/disable_profile_timer() support nested calls,
so no special care has to be taken here to only disable the timer when
nobody else is using it.
These functions should return success when called while profiling has
been requested by multiple callers, because enabling/disabling the
timer is a no-op in that case and thus cannot fail.
This had very bad interactions with ccache, often leading to rebuilds
with 100% cache misses, etc. Ali says it wasn't that big of a speedup
in the end anyway, so let's not bother with it.
We can always bring it back in the future if it seems like a good idea.
This fixes non-periodic comparators not receiving interrupts, as we
were never setting the InterruptEnable bit in their capabilities
register (unlike the periodic comparators' bit, which was set as a side
effect of calling set_periodic on them to set their periodic bit).
This should help get profiling working on bare-metal SerenityOS
installations, which are not guaranteed to have 2 periodic
comparators available.
Before this commit, we would jump to the first column after receiving
the '\n' line feed character. This is not the correct behavior, as it
should only move the cursor now. Translating the typed Return key into
the correct CR LF ("\r\n") is the TTY's job, which was fixed in #7184.
Fixes #6820
Fixes #6960
Problem:
- `BitmapView` permits changing the underlying `Bitmap`. This violates
the idea of a "view" since views are simply overlays which can
themselves change but do not change the underlying data.
Solution:
- Migrate all non-`const` member functions to Bitmap.
This simple driver finds a device in a device definitions list and then
sets up a SerialDevice instance based on the definition.
The driver currently only supports "WCH CH382 2S" PCI serial boards,
as that is the only device available for me to test with, but most
other PCI serial devices should be just as easy to add with a new
board_definitions entry.
The line control option bits (parity, stop bits, word length) were
masked and then combined incorrectly, resulting in them not being set
when requested.
These were accidentally the wrong way around (LSB part of the divisor
into the MSB register, MSB part of the divisor into the LSB register)
as can be seen in the specification (and in the comments themselves).
Unlike accept() the new accept4() system call lets the caller specify
flags for the newly accepted socket file descriptor, such as
SOCK_CLOEXEC and SOCK_NONBLOCK.
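A usage sketch of the new system call:
```
#include <sys/socket.h>

// Accept a connection and atomically mark the new descriptor close-on-exec
// and non-blocking, with no separate fcntl() calls and no window in which
// the descriptor exists without those flags.
int accept_connection(int listener_fd)
{
    return accept4(listener_fd, nullptr, nullptr, SOCK_CLOEXEC | SOCK_NONBLOCK);
}
```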
This commit adds support for the various ECHO* lflags and fixes some
POSIX conformance issues around newline handling. Also included are
error messages when setting options that are not yet implemented.
Because we don't parse ACPI AML yet, if we are not able to shut down
the machine with "hacky" emulation methods, halt and print this state
to the users so they know they can shut down the machine themselves.
This fixes a bug that was reported on the Discord server by
@ElectrodeYT - due to the confusion of passing arguments in different
orders, we messed up and triggered a page fault due to faulty sizes.
If we create a VGACompatibleAdapter object with a preset framebuffer,
always assign the console so we can use it.
This is useful for modesetting done by a Multiboot loader, like GRUB.
As we removed the support of VBE modesetting that was done by GRUB early
on boot, we need to determine if we can modeset the resolution with our
drivers, and if not, we should enable text mode and ensure that
SystemServer knows about it too.
Also, SystemServer should first check if there's a framebuffer device
node, which is an indication that text mode was not enabled even if it
was requested. Then, if it doesn't find it, it should check which
boot_mode argument the user specified (in case it's self-test). This
way if we try to use the bochs-display device (which is not VGA
compatible) and request text mode, the request will not be honored and
the system will continue in graphical mode.
Also try to print critical messages with the minimum number of memory
allocations possible.
In LibVT, we make the implementation flexible for kernel-specific
methods that are implemented in the ConsoleImpl class.
We used GRUB to modeset the resolution for a long time, but for good
reasons I see no point in keeping it supported in our kernel. We
support bochs-display device on QEMU (both the VGA compatible and
non-VGA compatible variants), so for QEMU we can still boot the system
in graphical mode even without GRUB help.
Also, we now have a native driver for Intel graphics and although it
doesn't support most Intel graphics cards out there yet, it's a good
starting point to support more cards. If a user wants to boot on
bare-metal in graphical mode, all they need to do is add the removed
flag back again, as the kernel still supports pre-set framebuffers.
This new subsystem is replacing the old code that was used to
create device nodes of framebuffer devices in /dev.
For now, this subsystem includes 3 roles:
1. GraphicsManagement singleton object that is used in the boot
process to enumerate and initialize display devices.
2. GraphicsDevice(s) that are used to control the display adapter.
3. FramebufferDevice(s) that are used to control the device node in
/dev.
For now, we support the Bochs display adapter and any other
generic VGA compatible adapter that was configured by the boot
loader to a known and fixed resolution.
Two improvements in the Bochs display adapter code are that we can
support the native bochs-display device (this device doesn't expose
any VGA capabilities) and that we use the MMIO region to configure the
device instead of IO ports for such tasks.
This device is a graphics display device that does not support
VGA functionality.
Therefore, it exposes a MMIO region to configure it, so we use that
region to set the framebuffer resolution.
Previously ByteBuffer would internally hold a RefPtr to the byte
buffer and would behave like a reference type, i.e. copying a
ByteBuffer would not create a duplicate byte buffer, but rather
two objects which refer to the same internal buffer.
This also changes ByteBuffer so that it has some internal capacity
much like the Vector<T> type. Unlike Vector<T> however a byte
buffer's data may be uninitialized.
With this commit ByteBuffer makes use of the kmalloc_good_size()
API to pick an optimal allocation size for its internal buffer.
This non-POSIX header is used in Linux/BSD systems for storing the
default termios settings. This lets us set up new TTYs' `m_termios.c_cc`
in a nicer way than using a magic string.
This commit replaces the former, hand-written parser with a new one that
can be generated automatically according to a state change diagram.
The new `EscapeSequenceParser` class provides a more ergonomic interface
to dealing with escape sequences. This interface has been inspired by
Alacritty's [vte library](https://github.com/alacritty/vte/).
I tried to avoid changing the application logic inside the `Terminal`
class. While this code has not been thoroughly tested, I can't find
regressions in the basic command line utilities or `vttest`.
`Terminal` now displays nicer debug messages when it encounters an
unknown escape sequence. Defensive programming and bounds checks have
been added where we access parameters, and as a result, we can now
endure 4-5 seconds of `cat /dev/urandom`. :D
We generate EscapeSequenceStateMachine.h when building the in-kernel
LibVT, and we assume that the file is already in place when the userland
library is being built. This will probably cause problems later on, but
I can't find a way to do it nicely.
By constraining two implementations, the compiler will select the best
fitting one. All this requires is duplicating the implementation and
simplifying it for the `void` case.
This constraining also informs both the caller and compiler by passing
the callback parameter types as part of the constraint
(e.g.: `IterationFunction<int>`).
Some `for_each` functions in LibELF only take functions which return
`void`. This is a minimal correctness check, as it removes one way for a
function to incompletely do something.
There seems to be a possible idiom where inside a lambda, a `return;` is
the same as `continue;` in a for-loop.
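A rough sketch of the constraining technique with C++20 concepts; the concept names and definitions here are illustrative and may differ from AK's:
```
#include <concepts>

enum class IterationDecision { Continue, Break };

// A callback that returns nothing.
template<typename Func, typename... Args>
concept VoidFunction = requires(Func func, Args... args) {
    { func(args...) } -> std::same_as<void>;
};

// A callback that decides whether iteration continues.
template<typename Func, typename... Args>
concept IteratorFunction = requires(Func func, Args... args) {
    { func(args...) } -> std::same_as<IterationDecision>;
};

// The compiler selects the overload whose constraint the callback satisfies.
template<VoidFunction<int> Callback>
void for_each(Callback callback)
{
    for (int i = 0; i < 3; ++i)
        callback(i); // a `return;` inside the lambda acts like `continue;`
}

template<IteratorFunction<int> Callback>
void for_each(Callback callback)
{
    for (int i = 0; i < 3; ++i) {
        if (callback(i) == IterationDecision::Break)
            break;
    }
}
```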
Regressed in 8a4cc735b9.
We stopped generating "process created" when enabling profiling,
which led to Profiler getting confused about the missing events.
Fixes an off-by-one caused by reading the register directly
without adding a 1 to it, because AHCI reports 1 less port than
the actual number of ports supported.
This implements the macOS API malloc_good_size() which returns the
true allocation size for a given requested allocation size. This
allows us to make use of all the available memory in a malloc chunk.
For example, for a malloc request of 35 bytes our malloc would
internally use a chunk of size 64, however the remaining 29 bytes
would be unused.
Knowing the true allocation size allows us to request more usable
memory that would otherwise be wasted and make that available for
Vector, HashTable and potentially other callers in the future.
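A usage sketch; the prototype shown matches the macOS declaration, and the header that provides it varies by platform:
```
#include <stddef.h>
#include <stdlib.h>

// Prototype as provided by macOS' <malloc/malloc.h>; the header that
// declares it differs between platforms.
extern "C" size_t malloc_good_size(size_t size);

// Size the buffer to the allocator's true chunk size so the slack bytes
// (e.g. 64 - 35 = 29 in the example above) become usable capacity
// instead of being wasted.
void* allocate_with_full_capacity(size_t requested, size_t& usable_size)
{
    usable_size = malloc_good_size(requested);
    return malloc(usable_size);
}
```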
Previously calls to perf_event() would end up in a process-specific
perfcore file even though global profiling was enabled. This changes
the behavior for perf_event() so that these events are stored into
the global profile instead.
On my machine, it only sets PRC and not PCC.
Confirmed to happen on:
- 8086:9ca2 (Intel Corporation Wildcat Point-LP SATA Controller
[AHCI Mode] (rev 03))
On my bare metal machine, enabling it at this point causes it to
instantly send an interrupt, and we're too early in the process
to be able to handle AHCI interrupts. The interrupts were being
enabled in the initialize function anyway.
Confirmed to happen on:
- 8086:9ca2 (Intel Corporation Wildcat Point-LP SATA Controller
[AHCI Mode] (rev 03))
- 8086:3b22 (Intel Corporation 5 Series/3400 Series Chipset
6 port SATA AHCI Controller (rev 06))
Occasionally we'll see messages in the serial console like:
handle_tcp: unexpected flags in FinWait1 state
In these cases it would be nice to know what flags we are receiving that
we aren't expecting.
The following KUBSAN crash on startup was reported on discord:
```
UHCI: Started
KUBSAN: reference binding to null pointer of type struct UHCIController
KUBSAN: at ../../Kernel/Devices/USB/UHCIController.cpp, line 67
```
After inspecting the code, it became clear that there's a window of time
where the kernel task which monitors the UHCI port can startup and start
executing before the UHCIController constructor completes. This leaves
the singleton pointing to nullptr, thus in the duration of this race
window the "UHCI port proc" thread will go on and dereference the null
pointer when trying to read for status changes on the UHCI root ports.
Reported-by: @stelar7
Reported-by: @bcoles
Fixes: #6154
AnonymousVMObject::create_with_physical_page(s) can't be NonnullRefPtr
as it allocates internally. Fixing the API then surfaced an issue in
ScatterGatherList, where the code was attempting to create an
AnonymousVMObject in the constructor, a failure which would not be
observable during OOM.
Fix all of these issues and start propagating errors at the callers
of the AnonymousVMObject and ScatterGatherList APIs.
This change looks more involved than it actually is. This simply
reshuffles the previous Process constructor and splits out the
parts which can fail (resource allocation) into separate methods
which can be called from a factory method. The factory is then
used everywhere instead of the constructor.
This code was unlocking the lock directly, but the Locker is still
attached, causing the lock to be unlocked an extra time, hence
corrupting the internal lock state.
This is extra confusing though, as complete_current_request() runs
without a lock which also looks like a bug. But that's a task for
another day.
The separate backtrace that the PANIC emits isn't useful and just adds
more data to visually filter from the output when debugging an issue.
Instead of calling PANIC, just emit the message with dbgln() and
manually halt the system.
If we are attempting to emit debugging information about an unhandleable
page fault, don't crash trying to kill threads or dump processes if the
current_thread isn't set in TLS. Attempt to keep proceeding in order to
dump as much useful information as possible.
Related: #6948
We were using ELF::Image::section(0) to indicate the "undefined"
section, when what we really wanted was just Optional<Section>.
So let's use Optional instead. :^)
The function fstatat can do the same thing as the stat and lstat
functions. However, it can be passed the file descriptor of a directory,
which will be used as the starting point for relative paths. This
is contrary to stat and lstat, which use the current working directory
as the starting point for relative paths.
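A usage sketch of the difference:
```
#include <fcntl.h>
#include <sys/stat.h>

// stat()/lstat() resolve relative paths against the current working
// directory; fstatat() resolves them against an explicit directory fd.
int stat_relative_to(int dir_fd, const char* relative_path, struct stat& st)
{
    // Pass AT_SYMLINK_NOFOLLOW as the last argument for lstat()-like behavior.
    return fstatat(dir_fd, relative_path, &st, 0);
}
```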
Previously we didn't retransmit lost TCP packets which would cause
connections to hang if packets were lost. Also we now time out
TCP connections after a number of retransmission attempts.
This wakes up NetworkTask every 500 milliseconds so that it can send
pending delayed TCP ACKs and isn't forced to send all of them early
when it goes to sleep like it did before.
When establishing the connection we should send ACKs right away so
as to not delay the connection process. This didn't previously
matter because we'd flush all delayed ACKs when NetworkTask waits
for incoming packets.
Ideally we would never allocate under a spinlock, as doing so has many
performance and potential functionality (deadlock) pitfalls.
We violate that rule in many places today, but we need a tool to track
them all down and fix them. This change introduces a new macro option
named `KMALLOC_VERIFY_NO_SPINLOCK_HELD` which can catch these
situations at runtime via an assert.
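A user-space sketch of the idea behind the check (not the kernel's actual implementation):
```
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Keep a per-thread count of held spinlocks and assert that it is zero on
// every allocation whenever the debug option is compiled in.
#define KMALLOC_VERIFY_NO_SPINLOCK_HELD 1

static thread_local int t_spinlocks_held = 0;

void note_spinlock_acquired() { ++t_spinlocks_held; }
void note_spinlock_released() { --t_spinlocks_held; }

void* checked_alloc(size_t size)
{
#if KMALLOC_VERIFY_NO_SPINLOCK_HELD
    // Allocating while any spinlock is held is exactly what we want to catch.
    assert(t_spinlocks_held == 0);
#endif
    return malloc(size);
}
```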
Avoid holding the sockets_by_tuple lock while allocating the TCPSocket.
While checking if the list contains the item we can also hold the lock
in shared mode, as we are only reading the hash table.
In addition the call to from_tuple appears to be superfluous, as we
created the socket, so we should be able to just return it directly.
This avoids the recursive lock acquisition, as well as the unnecessary
hash table lookups.
This ensures that the lost_samples field is set to zero for the
first sample. We didn't lose any samples before the first sample
so this is the correct value. Without this Profiler gets confused
and draws the graph for the process which contains the first CPU
sample incorrectly (all zeroes usually).
We can lose profiling timer events for a few reasons, for example
disabled interrupts or system slowness. This accounts for lost
time between CPU samples by adding a field lost_samples to each
profiling event which tracks how many samples were lost immediately
preceding the event.
This updates the profiling subsystem to use a separate timer to
trigger CPU sampling. This timer has a higher resolution (1000Hz)
and is independent from the scheduler. At a later time the
resolution could even be made configurable with an argument for
sys$profiling_enable() - but not today.
Make messages which should be fatal, actually fail the build.
- FATAL is not a valid mode keyword. The full list is available in the
docs: https://cmake.org/cmake/help/v3.19/command/message.html
- SEND_ERROR doesn't immediately stop processing, FATAL_ERROR does.
We should immediately stop if the Toolchain is not present.
- The app icon size validation was just a WARNING that is easy to
overlook. We should promote it to a FATAL_ERROR so that people will
not overlook the issue when adding a new application. We can only make
the small icon message FATAL_ERROR, as there is currently one
violation of the medium app icon validation.
Note that the changes to IPv4Socket::create are unfortunately needed as
the return type of TCPSocket::create and IPv4Socket::create don't match.
- KResultOr<NonnullRefPtr<TCPSocket>>
  vs
- KResultOr<NonnullRefPtr<Socket>>
To handle this we are forced to manually decompose the KResultOr<T> and
return the value() and error() separately.
The make<T> factory function allocates internally and immediately
dereferences the pointer, and always returns a NonnullOwnPtr<T> making
it impossible to propagate an error on OOM.
The make<T> factory function allocates internally and immediately
dereferences the pointer, and always returns a NonnullOwnPtr<T> making
it impossible to propagate an error on OOM.
Modify the API so it's possible to propagate an error on OOM failure.
NonnullOwnPtr<T> is not appropriate for the ThreadTracer::create() API,
so switch to OwnPtr<T>, use adopt_own_if_nonnull() to handle creation.
Currently, when passing buffers into VirtIOQueues, we use scatter-gather
lists, which contain an internal vector of buffers. This vector is
allocated, filled and then destroyed whenever we try to provide buffers
to a virtqueue, which happens a lot in performance-critical code
(the main transport mechanism for certain paravirtualized devices).
This commit moves it over to using VirtIOQueueChains and building the
chain in place in the VirtIOQueue. Also included are a bunch of fixups
for the VirtIO Console device, making it use an internal VM::RingBuffer
instead.
We want to move this out of the AHCI subsystem into the VM system,
since other parts of the kernel may need to perform scatter-gather IO.
We rename the current VM::ScatterGatherList impl that's used in the
virtio subsystem to VM::ScatterGatherRefList, since its distinguishing
feature from the AHCI scatter-gather list is that it doesn't own its
buffers.
For Kernel OOM hardening to work correctly, we need to be able to
call a "nothrow" version of operator new. Unfortunately the default
"throwing" version of operator new assumes that the allocation will
never return on failure and will always throw an exception. This isn't
true in the Kernel, as we don't have exceptions. So if we call the
normal/throwing new and kmalloc returns NULL, the generated code will
happily go and dereference that NULL pointer by invoking the constructor
before we have a chance to handle the failure.
To fix this we declare operator new as noexcept in the Kernel headers,
which will allow the caller to actually handle allocation failure.
The delete implementations need to match the prototype of the new which
allocated them, so we need to define delete as noexcept as well. GCC then
errors out declaring that you should implement sized delete as well, so
this change provides those stubs in order to compile cleanly.
Finally the new operator definitions have been standardized as being
declared with [[nodiscard]] to avoid potential memory leaks, so let's
declare the kernel versions that way as well.
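The caller-side pattern this enables, sketched in plain C++ with the standard nothrow form:
```
#include <new>

struct Packet {
    int length { 0 };
};

Packet* try_allocate_packet()
{
    // A nothrow allocation returns nullptr on failure instead of assuming
    // an exception will be thrown, so the caller can handle OOM explicitly.
    auto* packet = new (std::nothrow) Packet;
    if (!packet)
        return nullptr; // propagate the failure instead of crashing
    return packet;
}
```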
This avoids some of the shortest-lived allocations in the kernel:
StringImpl::create_uninitialized(unsigned long, char*&)
StringImpl::create(char const*, unsigned long, ShouldChomp)
StringBuilder::to_string() const
String::vformatted(StringView, TypeErasedFormatParams)
void Kernel::KBufferBuilder::appendff<unsigned int>(...)
JsonObjectSerializer<Kernel::KBufferBuilder>::add(..., unsigned int)
Kernel::procfs$all(Kernel::InodeIdentifier, ...) const
Kernel::procfs$all(Kernel::InodeIdentifier, Kernel::KBufferBuilder&)
This avoids allocations for initializing the Function<T>
for the NetworkAdapter::for_each callback argument.
Applying this patch decreases CPU utilization for NetworkTask
from 40% to 28% when receiving TCP packets at a rate of 100Mbit/s.
We already have another limit for the total number of packet buffers
allowed (max_packet_buffers). This second limit caused us to
repeatedly allocate and then free buffers.
This patch modifies InodeWatcher to switch to a one watcher, multiple
watches architecture. The following changes have been made:
- The watch_file syscall is removed, and in its place the
create_iwatcher, iwatcher_add_watch and iwatcher_remove_watch calls
have been added.
- InodeWatcher now holds multiple WatchDescriptions for each file that
is being watched.
- The InodeWatcher file descriptor can be read from to receive events on
all watched files.
Co-authored-by: Gunnar Beutner <gunnar@beutner.name>
If the HPET main counter does not support full 64 bits, we should
not expect the upper 32 bits to work. This is a problem when writing
to the upper 32 bits of the comparator value, which requires the
TimerConfiguration::ValueSet bit to be set, but if it's not 64-bit
capable then the bit will not be cleared, leaving it in a bad state.
Fixes #6990
This matches what other operating systems like Linux do:
$ ip route get 0.0.0.0
local 0.0.0.0 dev lo src 127.0.0.1 uid 1000
cache <local>
$ ssh 0.0.0.0
gunnar@0.0.0.0's password:
$ ss -na | grep :22 | grep ESTAB
tcp ESTAB 0 0 127.0.0.1:43118 127.0.0.1:22
tcp ESTAB 0 0 127.0.0.1:22 127.0.0.1:43118
When we receive a TCP packet with a sequence number that is not what
we expected, we have lost one or more packets. We can signal this to
the sender by sending a TCP ACK with the previous ack number so that
they can resend the missing TCP fragments.
Previously we'd process TCP packets in whatever order we received
them in. In the case where packets arrived out of order we'd end
up passing garbage to the userspace process.
This was most evident for TLS connections:
courage:~ $ git clone https://github.com/SerenityOS/serenity
Cloning into 'serenity'...
remote: Enumerating objects: 178826, done.
remote: Counting objects: 100% (1880/1880), done.
remote: Compressing objects: 100% (907/907), done.
error: RPC failed; curl 56 OpenSSL SSL_read: error:1408F119:SSL
routines:SSL3_GET_RECORD:decryption failed or bad record mac, errno 0
error: 1918 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
When the MSS option header is missing the default maximum segment
size is 536 which results in lots of very small TCP packets that
NetworkTask has to handle.
This adds the MSS option header to outbound TCP SYN packets and
sets it to an appropriate value depending on the interface's MTU.
Note that we do not currently do path MTU discovery so this could
cause problems when hops don't fragment packets properly.
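The arithmetic, sketched for the common IPv4-over-Ethernet case (minimal 20-byte IP and TCP headers assumed):
```
#include <cstddef>
#include <cstdint>

// The advertised MSS is the interface MTU minus the minimal IPv4 and TCP
// header sizes, so a payload of that size still fits in one unfragmented frame.
constexpr size_t ipv4_header_size = 20;
constexpr size_t tcp_header_size = 20;

constexpr uint16_t maximum_segment_size_for_mtu(size_t mtu)
{
    return static_cast<uint16_t>(mtu - ipv4_header_size - tcp_header_size);
}

static_assert(maximum_segment_size_for_mtu(1500) == 1460);
```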
This increases the default TCP window size to a more reasonable
value of 64k. This allows TCP peers to send us more packets before
waiting for corresponding ACKs.
This increases the buffer size for connection-oriented sockets
to 256kB. In combination with the other patches in this series
I was able to receive TCP packets at a rate of about 120Mbps.
The get_dir_entries syscall failed if the serialized form of all the
directory entries together was too large to fit in its temporary buffer.
Now the kernel uses a fixed size buffer, that is flushed to an output
buffer when it is full. If this flushing operation fails because there
is not enough space available, the syscall will return -EINVAL. That
error code is then used in userspace as a signal to allocate a larger
buffer and retry the syscall.
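A sketch of the userspace retry loop; the wrapper prototype is an assumption, not necessarily LibC's actual declaration:
```
#include <errno.h>
#include <sys/types.h>
#include <vector>

// Assumed prototype for the syscall wrapper; the real declaration may differ.
extern "C" ssize_t get_dir_entries(int fd, char* buffer, size_t size);

// Grow the buffer whenever the kernel reports that the serialized entries
// did not fit (surfaced here as EINVAL). The caller seeds `buffer` with a
// non-zero initial size.
ssize_t read_all_dir_entries(int fd, std::vector<char>& buffer)
{
    for (;;) {
        ssize_t nread = get_dir_entries(fd, buffer.data(), buffer.size());
        if (nread >= 0 || errno != EINVAL)
            return nread;
        buffer.resize(buffer.size() * 2); // too small: double and retry
    }
}
```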
Previously we'd try to load ELF images which did not have
an interpreter set with an incorrect load offset of 0, i.e. way
outside of the part of the address space where we'd expect either
the dynamic loader or the user's executable to reside.
This fixes the problem by using get_load_offset for both executables
which have an interpreter set and those which don't. Notably this
allows us to actually successfully execute the Loader.so binary:
courage:~ $ /usr/lib/Loader.so
You have invoked `Loader.so'. This is the helper program for programs
that use shared libraries. Special directives embedded in executables
tell the kernel to load this program.
This helper program loads the shared libraries needed by the program,
prepares the program to run, and runs it. You do not need to invoke
this helper program directly.
courage:~ $
Previously we'd incorrectly use the default gateway's MAC address.
Instead we must use destination MAC addresses that are derived from
the multicast IPv4 address.
With this patch applied I can query mDNS on a real network.
The Kernel/.gitignore file is a remnant of the prior build system,
where the kernel.map was written directly to the Kernel folder.
The run.sh was also under Kernel so pcap files and others would get
dropped there when running the system under qemu.
None of these situations are possible now, so let's get rid of it.
Instead of reading in the entire contents of a directory into a large
buffer, we can iterate block by block. This only requires a small
buffer.
Because directory entries are guaranteed to never span multiple blocks
we do not have to handle any edge cases related to that.
In some cases, the FADT could be at the end of a page, so if we don't
have two pages being mapped, we could easily read from a non-mapped
virtual address, which will trigger the UB sanitizer.
Also, we need to treat the FADT structure as volatile and const, as it
may change at any time, but we should never write to it ourselves.
In VFS::rename, if new_path is equal to '/', then the parent custody is
set to null.
VFS::rename would then use parent custody without checking it first.
Fixed VFS::rename to check both old and new path parent custody
before actually using them.
write_bytes is called with a count of 0 bytes if a directory is being
deleted, because in that case even the . and .. pseudo directories are
getting removed. In this case write_bytes is now a no-op.
Before, write_bytes would fail because it would check to see if there
were any blocks available to write in (even though it wasn't going to
write in them anyway).
This behaviour was uncovered because of a recent change where
directories are correctly reduced in size. Which in this case results in
all the blocks being removed from the inode, whereas previously there
would be some stale blocks around to pass the check.
e2fsck considers all blocks reachable through any of the pointers in
m_raw_inode.i_block as part of this inode regardless of the value in
m_raw_inode.i_size. When it finds more blocks than the amount that
is indicated by i_size or i_blocks it offers to repair the filesystem
by changing those values. That will actually cause further corruption.
So we must zero all pointers to blocks that are now unused.
The current method of emitting performance events requires a bit of
boilerplate at every invocation, as well as having to ignore the
return code, which isn't used outside of the perf event syscall. This
change attempts to clean that up by exposing high-level APIs that
can be used around the code base.
Ext2 directory contents are stored in a linked list of ext2_dir_entry
structs. There is no sentinel value to determine where the list ends.
Instead the list fills the entirety of the allocated space for the
inode.
Previously the inode was not correctly resized when it became smaller.
This resulted in stale data being interpreted as part of the linked list
of directory entries.
When reading UDP packets from userspace with recvmsg()/recv() we
would hit a VERIFY() if the supplied buffer is smaller than the
received UDP packet. Instead we should just return truncated data
to the caller.
This can be reproduced with:
$ dd if=/dev/zero bs=1k count=1 | nc -u 192.168.3.190 68
We use a global setting to determine if Caps Lock should be remapped to
Control because we don't care how keyboard events come in, just that they
should be massaged into different scan codes.
The `proc` filesystem is able to manipulate this global variable using
the `sysctl` utility like so:
```
# sysctl caps_lock_to_ctrl=1
```
An IP socket can now join a multicast group by using the
IP_ADD_MEMBERSHIP sockopt, which will cause it to start receiving
packets sent to the multicast address, even though this address does
not belong to this host.
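A usage sketch with the standard sockets API, joining the mDNS group as an example:
```
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

// Join the multicast group 224.0.0.251 (the mDNS group) on the default
// interface; afterwards the socket receives packets sent to that group
// address even though it is not one of the host's own addresses.
int join_mdns_group(int socket_fd)
{
    struct ip_mreq membership {};
    membership.imr_multiaddr.s_addr = inet_addr("224.0.0.251");
    membership.imr_interface.s_addr = htonl(INADDR_ANY);
    return setsockopt(socket_fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
        &membership, sizeof(membership));
}
```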
When using `sysctl` you can enable/disable values by writing to the
ProcFS. Some drift must have occurred where writing was failing due to
a missing `set_mtime` call. Whenever one writes to a file, the modified
time (mtime) gets updated, so we need to implement this interface in
ProcFS.