A process that is not in the foreground process group of a TTY should
not be allowed to read/write that TTY. Instead, we now send either a
SIGTTIN (on read) or SIGTTOU (on write) signal to the process, and fail
the I/O syscall with EINTR.
Fixes#205.
- Remove goofy _r suffix from syscall names.
- Don't take a signed buffer size.
- Use Userspace<T>.
- Make TTY::tty_name() return a String instead of a StringView.
This syscall allows a parent process to disown a child process, setting
its parent PID to 0.
Unparented processes are automatically reaped by the kernel upon exit,
and no sys$waitid() is required. This will make it much nicer to do
spawn-and-forget which is common in the GUI environment.
Allow passing in an optional timeout to Thread::block and move
the timeout check out of Thread::Blocker. This way all Blockers
implicitly support timeouts and don't need to implement it
themselves. Do however allow them to override timeouts (e.g.
for sockets).
We need to have a Thread lock to protect threading related
operations, such as Thread::m_blocker which is used in
Thread::block.
Also, if a Thread::Blocker indicates that it should be
unblocking immediately, don't actually block the Thread
and instead return immediately in Thread::block.
This fixes a regression introduced by the new software context
switching where the Kernel would not deliver a signal unless the
process is making system calls. This is because the TSS no longer
updates the CS value, so the scheduler never considered delivery
as the process always appeared to be in kernel mode. With software
context switching we can just set up the signal trampoline at
any time and when the processor returns back to user mode it'll
get executed. This should fix e.g. killing programs that are
stuck in some tight loop that doesn't make any system calls and
is only pre-empted by the timer interrupt.
Fixes#2958
By making the Process class RefCounted we don't really need
ProcessInspectionHandle anymore. This also fixes some race
conditions where a Process may be deleted while still being
used by ProcFS.
Also make sure to acquire the Process' lock when accessing
regions.
Last but not least, there's no reason why a thread can't be
scheduled while being inspected, though in practice it won't
happen anyway because the scheduler lock is held at the same
time.
Upon leaving a critical section (such as a SpinLock) we need to
check if we're already asynchronously invoking the Scheduler.
Otherwise we might end up triggering another context switch
as soon as leaving the scheduler lock.
Fixes#2883
Note: I switched from copying the single element out of the sched_param
struct, to copy struct it self as it is identical in functionality.
This way the types match up nicer with the Userpace<T> api's and it
conforms to the conventions used in other syscalls.
This commit adds an implementation of memmem, using the Bitap text
search algorithm for needles smaller than 32 bytes, and a naive loop
search for longer needles.
Since we already have the type information in the Userspace template,
it was a bit silly to cast manually everywhere. Just add a sufficiently
scary-sounding getter for a typed pointer.
Thanks @alimpfard for pointing out that I was being silly with tossing
out the type.
In the future we may want to make this API non-public as well.
This unbreaks the gcc and binutils ports.
Previously, when _SC_PAGESIZE was missing, these packages opted to
use their own versions of getpagesize which made their build fail
because of conflicting definitions of the function.
Use copy_{to,from}_user() in the various File::ioctl() implementations
instead of disabling SMAP wholesale in sys$ioctl().
This patch does not port IPv4Socket::ioctl() to those API's since that
will be more involved. That function now creates a local SmapDisabler.
This is something I've been meaning to do for a long time, and here we
finally go. This patch moves all sys$foo functions out of Process.cpp
and into files in Kernel/Syscalls/.
It's not exactly one syscall per file (although it could be, but I got
a bit tired of the repetitive work here..)
This makes hacking on individual syscalls a lot less painful since you
don't have to rebuild nearly as much code every time. I'm also hopeful
that this makes it easier to understand individual syscalls. :^)
I decided to play around with trying to run Serenity in VirtualBox.
It crashed WindowServer with a beautiful array of multi-color
flashing letters :^)
Skipping getting side-tracked seeing that it chose MBVGA in the
serial debug and trying to debug why it caused such a display,
I finally checked BXVGA.
While find_framebuffer_address checks for VBoxVGA, init_stage2 didn't.
Whoops!
We were masking the fragment offset bits incorrectly in the IPv4 header
sent out with fragments. This worked up to ~32KB but after that, things
would get very confused. :^)
Because Thread::sleep is an internal interface, it's easy to check that there
are only few callers: Process::sys$sleep, usleep, and nanosleep are happy
with this increased size, because now they support the entire range of their
arguments (assuming small-ish values for ticks_per_second()).
SyncTask doesn't care.
Note that the old behavior wasn't "cap out at 388 days", which would have been
reasonable. Instead, the code resulted in unsigned overflow, meaning that a
very long sleep would "on average" end after about 194 days, sometimes much
quicker.
Fixes#2871.
Ignoring the 'securely generated bytes' constraint seems to
be fine for Linux, so it's probably fine for Serenity.
Note that there *might* be more bottlenecks down the road
if Serenity is started in a non-GUI way. Currently though,
loading the GUI seems to generate enough interrupts to
seed the entropy pool, even on my non-RDRAND setup. Yay! :^)
Add all 6 shortcuts even if the switch between VirtualConsoles is
currently not available in the graphical console.
Also make the case statement more compact.
It is possible to switch to VirtualConsoles 1 to 4 via the shortcut
ALT + [1-4]. Therefor the array of VirtualConsoles should be guaranteed
to be initialized.
Also add an constant for the maximum number of VirtualConsoles to
guarantee consistency.
For now, only the non-standard _SC_NPROCESSORS_CONF and
_SC_NPROCESSORS_ONLN are implemented.
Use them to make ninja pick a better default -j value.
While here, make the ninja package script not fail if
no other port has been built yet.
To be more in line with other parts of Serenity's procfs, the
"key: value" format of /proc/cpuinfo was replaced with JSON, namely
an array of objects (one for each core).
The available keys remain the same, though "features" has been changed
from a space-separated string to an array of strings.
We need to halt the BSP briefly until all APs are ready for the
first context switch, but we can't hold the same spinlock by all
of them while doing so. So, while the APs are waiting on each other
they need to release the scheduler lock, and then once signaled
re-acquire it. Should solve some timing dependent hangs or crashes,
most easily observed using qemu with kvm disabled.
We now have BlockResult::WokeNormally and BlockResult::NotBlocked,
both of which indicate no error. We can no longer just check for
BlockResult::WokeNormally and assume anything else must be an
interruption.
The AT_* entries are placed after the environment variables, so that
they can be found by iterating until the end of the envp array, and then
going even further beyond :^)
MemoryManager::quickmap_pd and MemoryManager::quickmap_pt can only
be called by one processor at the time anyway, since anything using
these must have the MM lock held. So, no need to inform the other
CPUs to flush their TLBs, we can just flush our own.
We can now properly initialize all processors without
crashing by sending SMP IPI messages to synchronize memory
between processors.
We now initialize the APs once we have the scheduler running.
This is so that we can process IPI messages from the other
cores.
Also rework interrupt handling a bit so that it's more of a
1:1 mapping. We need to allocate non-sharable interrupts for
IPIs.
This also fixes the occasional hang/crash because all
CPUs now synchronize memory with each other.
The short-circuit path added for waiting on a queue that already had a
pending wake was able to return with interrupts disabled, which breaks
the API contract of wait_on() always returning with IF=1.
Fix this by adding a way to override the restored IF in ScopedCritical.
If WaitQueue::wake_all, WaitQueue::wake_one, or WaitQueue::wake_n
is called but nobody is currently waiting, we should remember that
fact and prevent someone from waiting after such a request. This
solves a race condition where the Finalizer thread is notified
to finalize a thread, but it is not (yet) waiting on this queue.
Fixes#2693
These changes solve a number of problems with the software
context swithcing:
* The scheduler lock really should be held throughout context switches
* Transitioning from the initial (idle) thread to another needs to
hold the scheduler lock
* Transitioning from a dying thread to another also needs to hold
the scheduler lock
* Dying threads cannot necessarily be finalized if they haven't
switched out of it yet, so flag them as active while a processor
is running it (the Running state may be switched to Dying while
it still is actually running)
We need to be able to prevent a WaitQueue from being
modified by another CPU. So, add a SpinLock to it.
Because this pushes some other class over the 64 byte
limit, we also need to add another 128-byte bucket to
the slab allocator.
The Lock class still permits no reason, but for everything else
require a reason to be passed to Thread::wait_on. This makes it
easier to diagnose why a Thread is in Queued state.
FileBackedFileSystem is one that's backed by (mounted from) a file, in other
words one that has a "source" of the mount; that doesn't mean it deals in
blocks. The hierarchy now becomes:
* FS
* ProcFS
* DevPtsFS
* TmpFS
* FileBackedFS
* (future) Plan9FS
* BlockBasedFS
* Ext2FS
There isn't an easy way to retreive all register contents anymore,
so remove this functionality. We do have the ability to trace
processes, so it shouldn't really be needed anymore.
If we're trying to walk the stack for another thread, we can
no longer retreive the EBP register from Thread::m_tss. Instead,
we need to look at the top of the kernel stack, because all threads
not currently running were last in kernel mode. Context switches
now always trigger a brief switch to kernel mode, and Thread::m_tss
only is used to save ESP and EIP.
Fixes#2678
When delivering urgent signals to the current thread
we need to check if we should be unblocked, and if not
we need to yield to another process.
We also need to make sure that we suppress context switches
during Process::exec() so that we don't clobber the registers
that it sets up (eip mainly) by a context switch. To be able
to do that we add the concept of a critical section, which are
similar to Process::m_in_irq but different in that they can be
requested at any time. Calls to Scheduler::yield and
Scheduler::donate_to will return instantly without triggering
a context switch, but the processor will then asynchronously
trigger a context switch once the critical section is left.
If a partial write succeeded, we could then be in an unexpected state
where the file description was non-blocking, but we could no longer
write to it.
Previously, the kernel would block in that state, but instead we now
handle this as a proper short write and return the number of bytes
we were able to write.
Fixes#2645.
The run script is not in Kernel/ anymore, let's move `.bochsrc` in Meta/
so that it can be used with the new build system.
Also make bochs use `grub_disk_image` instead of `_disk_image`
- If rdseed is not available, fallback to rdrand.
- If rdrand is not available, block for entropy, or use insecure prng
depending on if user wants fast or good random.
CPUs which support RDRAND do not necessarily support RDSEED. This
introduces a flag g_cpu_supports_rdseed which is set appropriately
by CPUID. This causes Haswell CPUs in particular (and probably a lot
of AMD chips) to now fail to boot with #2634, rather than an illegal
instruction.
It seems like the KernelRng needs either an initial reseed call or
more random events added before the first call to get_good_random,
but I don't feel qualified to make that kind of change.
Random now gets entropy from the following drivers:
- KeyboardDevice
- PATAChannel
- PS2MouseDevice
- E1000NetworkAdapter
- RTL8139NetworkAdapter
Of these devices, PS2MouseDevice and PATAChannel provide the vast
majority of the entropy.
These APIs were clearly modeled after Ext2FS internals, and make perfect sense
in Ext2FS context. The new APIs are more generic, and map better to the
semantics exported to the userspace, where inode identifiers only appear in
stat() and readdir() output, but never in any input.
This will also hopefully reduce the potential for races (see commit c44b4d61f3).
Lastly, this makes it way more viable to implement a filesystem that only
synthesizes its inodes lazily when queried, and destroys them when they are no
longer in use. With inode identifiers being used to reference inodes, the only
choice for such a filesystem is to persist any inode it has given out the
identifier for, because it might be queried at any later time. With direct
references to inodes, the filesystem will know when the last reference is
dropped and the inode can be safely destroyed.
These new syscalls allow you to send and receive file descriptors over
a local domain socket. This will enable various privilege separation
techniques and other good stuff. :^)
ppoll() is similar() to poll(), but it takes its timeout
as timespec instead of as int, and it takes an additional
sigmask parameter.
Change the sys$poll parameters to match ppoll() and implement
poll() in terms of ppoll().
pselect() is similar() to select(), but it takes its timeout
as timespec instead of as timeval, and it takes an additional
sigmask parameter.
Change the sys$select parameters to match pselect() and implement
select() in terms of pselect().
It looks like they're considered a bad idea, so let's not add
them before we need them. I figured it's good to have them in
git history if we ever do need them though, hence the add/remove
dance.
Add seteuid()/setegid() under _POSIX_SAVED_IDS semantics,
which also requires adding suid and sgid to Process, and
changing setuid()/setgid() to honor these semantics.
The exact semantics aren't specified by POSIX and differ
between different Unix implementations. This patch makes
serenity follow FreeBSD. The 2002 USENIX paper
"Setuid Demystified" explains the differences well.
In addition to seteuid() and setegid() this also adds
setreuid()/setregid() and setresuid()/setresgid(), and
the accessors getresuid()/getresgid().
Also reorder uid/euid functions so that they are the
same order everywhere (namely, the order that
geteuid()/getuid() already have).
The "Reference" object is not just a counter, it also represents the
permission to map a shbuf itself.
Without this change, a shbuf could not be re-mapped by the same
process after it released all of its refs on it.
We were getting a little overly memey in some places, so let's scale
things back to business-casual.
Informal language is fine in comments, commits and debug logs,
but let's keep the runtime nice and presentable. :^)
This fixes a bug where the mode of a FIFO was reported as 001000 instead
of 0010000 (you see the difference? me nethier), and hopefully doesn't
introduce new bugs. I've left 0777 and similar in a few places, because
that is *more* readable than its symbolic version.
That's not how readlink() is supposed to work: it should copy as many bytes
as fit into the buffer, and return the number of bytes copied. So do that,
but add a twist: make sys$readlink() actually return the whole size, not
the number of bytes copied. We fix up this return value in userspace, to make
LibC's readlink() behave as expected, but this will also allow other code
to allocate a buffer of just the right size.
Also, avoid an extra copy of the link target.
Get rid of the weird old signature:
- int StringType::to_int(bool& ok) const
And replace it with sensible new signature:
- Optional<int> StringType::to_int() const
Since we're not keeping compatibility with OpenBSD about what promises are
required for which syscalls, tighten things up so that they make more sense.
We were not setting the DMA transfer mode correctly. I have absolutely
no clue how this could ever have worked, but it did work for months
until it suddenly didn't.
Anyways, this fixes that. The sound is still a little bit glitchy and
that could probably be fixed by using the SB16's auto-initialized mode.
You can now request an update of the terminal's window progress by
sending this escape sequence:
<esc>]9;<value>;<max_value>;<escape><backslash>
I'm sure we can find many interesting uses for this! :^)
I've been using this in the new HTML parser and it makes it much easier
to understand the state of unfinished code branches.
TODO() is for places where it's okay to end up but we need to implement
something there.
ASSERT_NOT_REACHED() is for places where it's not okay to end up, and
something has gone wrong.
The SDL port failed to build because the CMake toolchain filed pointed
to the old root. Now the toolchain file assumes that the Root is in
Build/Root.
Additionally, the AK/ and Kernel/ headers need to be installed in the
root too.
.. and make travis run it.
I renamed check-license-headers.sh to check-style.sh and expanded it so
that it now also checks for the presence of "#pragma once" in .h files.
It also checks the presence of a (single) blank line above and below the
"#pragma once" line.
I also added "#pragma once" to all the files that need it: even the ones
we are not check.
I also added/removed blank lines in order to make the script not fail.
I also ran clang-format on the files I modified.
This adds support for MS_RDONLY, a mount flag that tells the kernel to disallow
any attempts to write to the newly mounted filesystem. As this flag is
per-mount, and different mounts of the same filesystems (such as in case of bind
mounts) can have different mutability settings, you have to go though a custody
to find out if the filesystem is mounted read-only, instead of just asking the
filesystem itself whether it's inherently read-only.
This also adds a lot of checks we were previously missing; and moves some of
them to happen after more specific checks (such as regular permission checks).
One outstanding hole in this system is sys$mprotect(PROT_WRITE), as there's no
way we can know if the original file description this region has been mounted
from had been opened through a readonly mount point. Currently, we always allow
such sys$mprotect() calls to succeed, which effectively allows anyone to
circumvent the effect of MS_RDONLY. We should solve this one way or another.
If we fail to exec() the target executable, don't leak the thread (this actually
triggers an assertion when destructing the process), and print an error message.
When mounting Ext2FS, we don't care if the file has a custody (it doesn't if
it's a device, which is a common case). When doing a bind-mount, we do need a
custody; if none is provided, let's return an error instead of crashing.
VFS no longer deals with inodes in public API, only with custodies and file
descriptions. Talk directly to the file system if you need to operate on a
inode. In most cases you actually want to go though VFS, to get proper
permission check and other niceties. For this to work, you have to provide a
custody, which describes *how* you have opened the inode, not just what the
inode is.
We're going to make use of it in the next commit. But the idea is we want to
know how this File (more specifically, InodeFile) was opened in order to decide
how chown()/chmod() should behave, in particular whether it should be allowed or
not. Note that many other File operations, such as read(), write(), and ioctl(),
already require the caller to pass a FileDescription.
Together, they replace the old text_debug option.
* boot_mode should be either "graphical" (the default) or "text". We could
potentially support other values here in the future.
* init specifies which userspace process the kernel should spawn to bootstrap
userspace. By default, this is SystemServer, but you can specify e.g.
init=/bin/Shell to run system diagnostics.
And move canonicalized_path() to a static method on LexicalPath.
This is to make it clear that FileSystemPath/canonicalized_path() only
perform *lexical* canonicalization.
You now have to pledge "sigaction" to change signal handlers/dispositions. This
is to prevent malicious code from messing with assertions (and segmentation
faults), which are normally expected to instantly terminate the process but can
do other things if you change signal disposition for them.
The is_error() check on the KResultOr returned when reading the link
target had a stray ! operator which causes link resolution to crash the
kernel with an assertion error.
Allow file system implementation to return meaningful error codes to
callers of the FileDescription::read_entire_file(). This allows both
Process::sys$readlink() and Process::sys$module_load() to return more
detailed errors to the user.
In case WNOHANG was specified, we want to always set should_unblock to
true (which we do since commit 4402207b98), not
wait_finished -- the latter causes us to immediately return this child to our
caller, which is not what we want -- perhaps we should return another child
which has actually exited or stopped, or nobody at all.
To avoid confusion, also rename wait_finished to fits_the_spec.
This fixes service keepalive functionality in SystemServer.
Add a MappedROM::find_chunk_starting_with() helper since that's a very
common usage pattern in clients of this code.
Also convert MultiProcessorParser from a persistent singleton object
to a temporary object constructed via a failable factory function.
This patch adds a MappedROM abstraction to the Kernel VM subsystem.
It's basically the read-only byte buffer equivalent of a TypedMapping.
We use this in the ACPI and MP table parsers to scan for interesting
stuff in low memory instead of doing a bunch of address arithmetic.
For singly-indirect blocks, "callback" is just "add_block".
For doubly-indirect blocks, "callback" is the lambda function
iterating on singly-indirect blocks: so instead of adding itself to the
list, the doubly-indirect block will add all its childs, but they add
themselves again when they run the callback of singly-indirect blocks.
And nothing adds the doubly-indirect block itself :(
This leads to a double free of all child blocks of the doubly-indirect
block, which is the failed assert described in #1549.
Closes: #1549.
Let's not be paying the function call overhead for these tiny ops.
Maybe there's an argument for having fewer gadgets in the kernel but
for now we're actually seeing stac() in profiles so let's put
that above theoretical security issues.
If these methods get inlined, the compiler is able to statically eliminate most
of the assertions. Alas, it doesn't realize this, and believes inlining them to
be too expensive. So give it a strong hint that it's not the case.
This *decreases* the kernel binary size.
Make sure that userspace is always referencing "system" headers in a way
that would build on target :). This means removing the explicit
include_directories of Libraries/LibC in favor of having it export its
headers as SYSTEM. Also remove a redundant include_directories of
Libraries in the 'serenity build' part of the build script. It's already
set at the top.
This causes issues for the Kernel, and for crt0.o. These special cases
are handled individually.
read_block() and write_block() now accept the count (how many bytes to read
or write) and offset (where in the block to start; defaults to 0). Using these
new APIs, we can avoid doing copies between intermediary buffers in a lot more
cases. Hopefully this improves performance or something.
This was a holdover from the old times when each Process had a special
main thread with TID 0. Using it was a total crapshoot since it would
just return whichever thread was first on the process's thread list.
Now that I've removed all uses of it, we don't need it anymore. :^)
Instead of falling back to the suspicious "any_thread()" mechanism,
just fail with ESRCH if you try to kill() a PID that doesn't have a
corresponding TID.
This was supposed to be the foundation for some kind of pre-kernel
environment, but nobody is working on it right now, so let's move
everything back into the kernel and remove all the confusion.
We stopped using gettimeofday() in Core::EventLoop a while back,
in favor of clock_gettime() for monotonic time.
Maintaining an optimization for a syscall we're not using doesn't make
a lot of sense, so let's go back to the old-style sys$gettimeofday().
You can still open files that have sockets attached to them from inside
the kernel via VFS::open() (and in fact, that is what LocalSocket itslef uses),
but trying to do that from userspace using open() will now fail with ENXIO.
This reverts commit 3d342f72a7.
This is causing trouble for macOS users. Also it's painfully slow
compared to using the sudo method. This should definitely not be
the default since it punishes people who have genext2fs installed.
This is very basic and doesn't support many features. Instead
of describing what it *doesn't* support, I'll describe what I
have tested:
1. Public key authentication (password is not supported)
2. Single command execution
3. PTY-less interactive bash shell (/bin/sh doesn't work)
4. Multi-user (i.e you can ssh as 'anon' as well as root)
Step one of moving DesktopServices::open handling out of process. This
makes it easier to do things like read in associations for which program
opens which files or protocols. This gives users the ability to modify
the associations without having to rebuild :^)