Before putting itself back on the wait queue, the finalizer task will
now check if there's more work to do, and if so, do it first. :^)
This patch also puts a bunch of process/thread debug logging behind
PROCESS_DEBUG and THREAD_DEBUG since it was unbearable to debug this
stuff with all the spam.
Move timeout management to the ReadBlocker and WriteBlocker classes.
Also get rid of the specialized ReceiveBlocker since it no longer does
anything that ReadBlocker can't do.
As suggested by Joshua, this commit adds the 2-clause BSD license as a
comment block to the top of every source file.
For the first pass, I've just added myself for simplicity. I encourage
everyone to add themselves as copyright holders of any file they've
added or modified in some significant way. If I've added myself in
error somewhere, feel free to replace it with the appropriate copyright
holder instead.
Going forward, all new source files should include a license header.
It was quite easy to put the system into a heavy churn state by doing
e.g "cat /dev/zero".
It was then basically impossible to kill the "cat" process, even with
"kill -9", since signals are only delivered in two conditions:
a) The target thread is blocked in the kernel
b) The target thread is running in userspace
However, since "cat /dev/zero" command spends most of its time actively
running in the kernel, not blocked, the signal dispatch code just kept
postponing actually handling the signal indefinitely.
To fix this, we now check before returning from a syscall if there are
any pending unmasked signals, and if so, we take a dramatic pause by
blocking the current thread, knowing it will immediately be unblocked
by signal dispatch anyway. :^)
This fixes a null RefPtr deref (which asserts) in the scheduler if a
file descriptor being select()'ed is closed by a second thread while
blocked in select().
Test: Kernel/null-deref-close-during-select.cpp
All threads were running with iomapbase=0 in their TSS, which the CPU
interprets as "there's an I/O permission bitmap starting at offset 0
into my TSS".
Because of that, any bits that were 1 inside the TSS would allow the
thread to execute I/O instructions on the port with that bit index.
Fix this by always setting the iomapbase to sizeof(TSS32), and also
setting the TSS descriptor's limit to sizeof(TSS32), effectively making
the I/O permissions bitmap zero-length.
This should make it no longer possible to do I/O from userspace. :^)
Threads now have numeric priorities with a base priority in the 1-99
range.
Whenever a runnable thread is *not* scheduled, its effective priority
is incremented by 1. This is tracked in Thread::m_extra_priority.
The effective priority of a thread is m_priority + m_extra_priority.
When a runnable thread *is* scheduled, its m_extra_priority is reset to
zero and the effective priority returns to base.
This means that lower-priority threads will always eventually get
scheduled to run, once its effective priority becomes high enough to
exceed the base priority of threads "above" it.
The previous values for ThreadPriority (Low, Normal and High) are now
replaced as follows:
Low -> 10
Normal -> 30
High -> 50
In other words, it will take 20 ticks for a "Low" priority thread to
get to "Normal" effective priority, and another 20 to reach "High".
This is not perfect, and I've used some quite naive data structures,
but I think the mechanism will allow us to build various new and
interesting optimizations, and we can figure out better data structures
later on. :^)
PR #591 defines the rationale for kernel-level timers. They're most
immediately useful for TCP retransmission, but will most likely see use
in many other areas as well.
This patch introduces three separate thread queues, one for each thread
priority available to userspace (Low, Normal and High.)
Each queue operates in a round-robin fashion, but we now always prefer
to schedule the highest priority thread that currently wants to run.
There are tons of tweaks and improvements that we can and should make
to this mechanism, but I think this is a step in the right direction.
This makes WindowServer significantly more responsive while one of its
clients is burning CPU. :^)
The idea of all processes reliably having a main thread was nice in
some ways, but cumbersome in others. More importantly, it didn't match
up with POSIX thread semantics, so let's move away from it.
This thread gets rid of Process::main_thread() and you now we just have
a bunch of Thread objects floating around each Process.
When the finalizer nukes the last Thread in a Process, it will also
tear down the Process.
There's a bunch of more things to fix around this, but this is where we
get started :^)
This patch adds a single "kernel info page" that is mappable read-only
by any process and contains the current time of day.
This is then used to implement a version of gettimeofday() that doesn't
have to make a syscall.
To protect against race condition issues, the info page also has a
serial number which is incremented whenever the kernel updates the
contents of the page. Make sure to verify that the serial number is the
same before and after reading the information you want from the page.
The kernel now supports basic profiling of all the threads in a process
by calling profiling_enable(pid_t). You finish the profiling by calling
profiling_disable(pid_t).
This all works by recording thread stacks when the timer interrupt
fires and the current thread is in a process being profiled.
Note that symbolication is deferred until profiling_disable() to avoid
adding more noise than necessary to the profile.
A simple "/bin/profile" command is included here that can be used to
start/stop profiling like so:
$ profile 10 on
... wait ...
$ profile 10 off
After a profile has been recorded, it can be fetched in /proc/profile
There are various limits (or "bugs") on this mechanism at the moment:
- Only one process can be profiled at a time.
- We allocate 8MB for the samples, if you use more space, things will
not work, and probably break a bit.
- Things will probably fall apart if the profiled process dies during
profiling, or while extracing /proc/profile
Instead of using the generic block mechanism, wait-queued threads now
go into the special Queued state.
This fixes an issue where signal dispatch would unblock a wait-queued
thread (because signal dispatch unblocks blocked threads) and cause
confusion since the thread only expected to be awoken by the queue.
The kernel's Lock class now uses a proper wait queue internally instead
of just having everyone wake up regularly to try to acquire the lock.
We also keep the donation mechanism, so that whenever someone tries to
take the lock and fails, that thread donates the remainder of its
timeslice to the current lock holder.
After unlocking a Lock, the unlocking thread calls WaitQueue::wake_one,
which unblocks the next thread in queue.
It's now possible to block until another thread in the same process has
exited. We can also retrieve its exit value, which is whatever value it
passed to pthread_exit(). :^)
Scheduling priority is now set at the thread level instead of at the
process level.
This is a step towards allowing processes to set different priorities
for threads. There's no userspace API for that yet, since only the main
thread's priority is affected by sched_setparam().
Added the exception_code field to RegisterDump, removing the need
for RegisterDumpWithExceptionCode. To accomplish this, I had to
push a dummy exception code during some interrupt entries to properly
pad out the RegisterDump. Note that we also needed to change some code
in sys$sigreturn to deal with the new RegisterDump layout.
If we didn't find anything else that wants to run, we don't need to
update the current thread's TSS since we're just gonna return to the
same thread anyway.
If we receive an IRQ while the idle task is running, prevent it from
re-halting the CPU after the IRQ handler returns.
Instead have the idle task yield to the scheduler, so we can see if
the IRQ has unblocked something.
This patch adds support for TLS according to the x86 System V ABI.
Each thread gets a thread-specific memory region, and the GS segment
register always points _to a pointer_ to the thread-specific memory.
In other words, to access thread-local variables, userspace programs
start by dereferencing the pointer at [gs:0].
The Process keeps a master copy of the TLS segment that new threads
should use, and when a new thread is created, they get a copy of it.
It's basically whatever the PT_TLS program header in the ELF says.
Each Function is a heap allocation, so let's make an effort to avoid
doing that during scheduling. Because of header dependencies, I had to
put the runnables iteration helpers in Thread.h, which is a bit meh but
at least this cuts out all the kmalloc() traffic in pick_next().
With the presence of signal handlers, it is possible that a thread might
be blocked multiple times. Picture for instance a signal handler using
read(), or wait() while the thread is already blocked elsewhere before
the handler is invoked.
To fix this, we turn m_blocker into a chain of handlers. Each block()
call now prepends to the list, and unblocking will only consider the
most recent (first) blocker in the chain.
Fixes#309
The only two places we set m_blocker now are Thread::set_state(), and
Thread::block(). set_state is mostly just an issue of clarity: we don't
want to end up with state() != Blocked with an m_blocker, because that's
weird. It's also possible: if we yield, someone else may set_state() us.
We also now set_state() and set m_blocker under lock in block(), rather
than unlocking which might allow someone else to mess with our internals
while we're in the process of trying to block.
This seems to fix sending STOP & CONT causing a panic.
My guess as to what was happening is this:
thread A blocks in select(): Blocking & m_blocker != nullptr
thread B sends SIGSTOP: Stopped & m_blocker != nullptr
thread B sends SIGCONT: we continue execution. Runnable & m_blocker != nullptr
thread A tries to block in select() again:
* sets m_blocker
* unlocks (in block_helper)
* someone else tries to unblock us? maybe from the old m_blocker? unclear -- clears m_blocker
* sets Blocked (while unlocked!)
So, thread A is left with state Blocked & m_blocker == nullptr, leading
to the scheduler assert (m_blocker != nullptr) failing.
Long story short, let's do all our data management with the lock _held_.
And use this to return EINTR in various places; some of which we were
not handling properly before.
This might expose a few bugs in userspace, but should be more compatible
with other POSIX systems, and is certainly a little cleaner.
And use it in the scheduler.
IntrusiveList is similar to InlineLinkedList, except that rather than
making assertions about the type (and requiring inheritance), it
provides an IntrusiveListNode type that can be used to put an instance
into many different lists at once.
As a proof of concept, port the scheduler over to use it. The only
downside here is that the "list" global needs to know the position of
the IntrusiveListNode member, so we have to position things a little
awkwardly to make that happen. We also move the runnable lists to
Thread, to avoid having to publicize the node.
Committing some things my hands did while browsing through this code.
- Mark all leaf classes "final".
- FileDescriptionBlocker now stores a NonnullRefPtr<FileDescription>.
- FileDescriptionBlocker::blocked_description() now returns a reference.
- ConditionBlocker takes a Function&&.
"Blocking" is not terribly informative, but now that everything is
ported over, we can force the blocker to provide us with a reason.
This does mean that to_string(State) needed to become a member, but
that's OK.
And use dbgprintf() consistently on a few of the pieces of logging here.
This is useful when trying to track thread switching when you don't
really care about what it's switching _to_.
Replace the class-based snooze alarm mechanism with a per-thread callback.
This makes it easy to block the current thread on an arbitrary condition:
void SomeDevice::wait_for_irq() {
m_interrupted = false;
current->block_until([this] { return m_interrupted; });
}
void SomeDevice::handle_irq() {
m_interrupted = true;
}
Use this in the SB16 driver, and in NetworkTask :^)
This makes waitpid() return when a child process is stopped via a signal.
Use this in Shell to catch stopped children and return control to the
command line. :^)
Fixes#298.
It's kinda funny how I can make a mistake like this in Serenity and then
get so used to it by spending lots of time using this API that I start to
believe that this is how printf() always worked..
After reading a bunch of POSIX specs, I've learned that a file descriptor
is the number that refers to a file description, not the description itself.
So this patch renames FileDescriptor to FileDescription, and Process now has
FileDescription* file_description(int fd).
Passing this flag to recv() temporarily puts the file descriptor into
non-blocking mode.
Also implement LocalSocket::recv() as a simple forwarding to read().
There are now two thread lists, one for runnable threads and one for non-
runnable threads. Thread::set_state() is responsible for moving threads
between the lists.
Each thread also has a back-pointer to the list it's currently in.
Hook this up in Terminal so that the '\a' character generates a beep.
Finally emit an '\a' character in the shell line editing code when
backspacing at the start of the line.
Make the Socket functions take a FileDescriptor& rather than a socket role
throughout the code. Also change threads to block on a FileDescriptor,
rather than either an fd index or a Socket.
Add a Thread::is_thread(void*) helper that we can use to check that the
incoming donation beneficiary is a valid thread. The O(n) here is a bit sad
and we should eventually rethink the process/thread table data structures.
This introduces a tiny amount of timer drift which I will have to fix
somehow eventually, but it's a huge improvement in timing consistency
as we no longer suddenly jump from e.g 10:45:49.123 to 10:45:50.000.
The scheduler now operates on threads, rather than on processes.
Each process has a main thread, and can have any number of additional
threads. The process exits when the main thread exits.
This patch doesn't actually spawn any additional threads, it merely
does all the plumbing needed to make it possible. :^)
This is accomplished using a new Alarm class and a BlockedSnoozing state.
Basically, you call Process::snooze_until(some_alarm) and then the scheduler
won't wake up the process until some_alarm.is_ringing() returns true.
Only the receive timeout is hooked up yet. You can change the timeout by
calling setsockopt(..., SOL_SOCKET, SO_RCVTIMEO, ...).
Use this mechanism to make /bin/ping report timeouts.
Finally fixed the weird flaky crashing when resizing Terminal windows.
It was because we were dispatching a signal to "current" from the scheduler.
Yet another thing I dislike about even having a "current" process while
we're in the scheduler. Not sure yet how to fix this.
Let the signal handler's kernel stack be a kmalloc() allocation for now.
Once we can do allocation of consecutive physical pages in the supervisor
memory region, we can use that for all types of kernel stacks.
This is really cool! :^)
Apps currently refuse to start if the WindowServer isn't listening on the
socket in /wsportal. This makes sense, but I guess it would also be nice
to have some sort of "wait for server on startup" mode.
This has performance issues, and I'll work on those, but this stuff seems
to actually work and I'm very happy with that.
For now, the WindowServer process will run with high priority,
while the Finalizer process will run with low priority.
Everyone else gets to be "normal".
At the moment, priority simply determines the size of your time slices.
Since we know who's holding the lock, and we're gonna have to yield anyway,
we can just ask the scheduler to donate any remaining ticks to that process.
Instead of processes themselves getting scheduled to finish dying,
let's have a Finalizer process that wakes up whenever someone is dying.
This way we can do all kinds of lock-taking in process cleanup without
risking reentering the scheduler.
- Don't cli() in Process::do_exec() unless current is execing.
Eventually this should go away once the scheduler is less retarded
in the face of interrupts.
- Improved memory access validation for ring0 processes.
We now look at the kernel ELF header to determine if an access
is appropriate. :^) It's very hackish but also kinda neat.
- Have Process::die() put the process into a new "Dying" state where
it can still get scheduled but no signals will be dispatched.
This way we can keep executing in die() but won't get our EIP
hijacked by signal dispatch. The main problem here was that die()
wanted to take various locks.
Also add assertion in Lock that the scheduler isn't currently active.
I've been seeing occasional fuckups that I suspect might be someone called
by the scheduler trying to take a busy lock.
Also use an enum for the rather-confusing return value in dispatch_signal().
I will go through the rest of the signals and set them up with the
appropriate default dispositions at some other point.
It automagically computes %CPU usage based on the number of times a process
has been scheduled between samples. The colonel task is used as idle timer.
This is pretty cool. :^)
GObjects can now register a timer with the GEventLoop. This will eventually
cause GTimerEvents to be dispatched to the GObject.
This needed a few supporting changes in the kernel:
- The PIT now ticks 1000 times/sec.
- select() now supports an arbitrary timeout.
- gettimeofday() now returns something in the tv_usec field.
With these changes, the clock window in guitest2 finally ticks on its own.
The system can finally idle without burning CPU. :^)
There are some issues with scheduling making the mouse cursor sloppy
and unresponsive that need to be dealt with.
Userspace programs can now open /dev/gui_events and read a stream of GUI_Event
structs one at a time.
I was stuck on a stupid problem where we'd reenter Scheduler::yield() due to
having one of the has_data_available_for_reading() implementations using locks.
The kernel now bills processes for time spent in kernelspace and userspace
separately. The accounting is forwarded to the parent process in reap().
This makes the "time" builtin in bash work.
This way the scheduler doesn't need to plumb the exit status into the waiter.
We still plumb the waitee pid though, I don't love it but it can be fixed.
I was surprised to find that dup()'ed fds don't share the close-on-exec flag.
That means it has to be stored separately from the FileDescriptor object.
- Process::exec() needs to restore the original paging scope when called
on a non-current process.
- Add missing InterruptDisabler guards around g_processes access.
- Only flush the TLB when modifying the active page tables.