This code never worked, as was never used for anything. We can build
a much better SHM implementation on top of TmpFS or similar when we
get to the point when we need one.
The new PCI subsystem is initialized during runtime.
PCI::Initializer is supposed to be called during early boot, to
perform a few tests, and initialize the proper configuration space
access mechanism. Kernel boot parameters can be specified by a user to
determine what tests will occur, to aid debugging on problematic
machines.
After that, PCI::Initializer should be dismissed.
PCI::IOAccess is a class that is derived from PCI::Access
class and implements PCI configuration space access mechanism via x86
IO ports.
PCI::MMIOAccess is a class that is derived from PCI::Access
and implements PCI configurtaion space access mechanism via memory
access.
The new PCI subsystem also supports determination of IO/MMIO space
needed by a device by checking a given BAR.
In addition, Every device or component that use the PCI subsystem has
changed to match the last changes.
This patch hardens /proc a bit by making many things only accessible
to UID 0, and also disallowing access to /proc/PID/ for anyone other
than the UID of that process (and superuser, obviously.)
Threads now have numeric priorities with a base priority in the 1-99
range.
Whenever a runnable thread is *not* scheduled, its effective priority
is incremented by 1. This is tracked in Thread::m_extra_priority.
The effective priority of a thread is m_priority + m_extra_priority.
When a runnable thread *is* scheduled, its m_extra_priority is reset to
zero and the effective priority returns to base.
This means that lower-priority threads will always eventually get
scheduled to run, once its effective priority becomes high enough to
exceed the base priority of threads "above" it.
The previous values for ThreadPriority (Low, Normal and High) are now
replaced as follows:
Low -> 10
Normal -> 30
High -> 50
In other words, it will take 20 ticks for a "Low" priority thread to
get to "Normal" effective priority, and another 20 to reach "High".
This is not perfect, and I've used some quite naive data structures,
but I think the mechanism will allow us to build various new and
interesting optimizations, and we can figure out better data structures
later on. :^)
This is memory that's loaded from an inode (file) but not modified in
memory, so still identical to what's on disk. This kind of memory can
be freed and reloaded transparently from disk if needed.
Dirty private memory is all memory in non-inode-backed mappings that's
process-private, meaning it's not shared with any other process.
This patch exposes that number via SystemMonitor, giving us an idea of
how much memory each process is responsible for all on its own.
We were listing the total number of user/super pages as the number of
"available" pages in the system. This was then misinterpreted in the
SystemMonitor program and displayed wrong in the GUI.
Cautiously use 5 as a limit for now so that we don't blow the stack.
This can be increased in the future if we are sure that we won't be
blowing the stack, or if the implementation is changed to not use
recursion :^)
Every process keeps its own ELF executable mapped in memory in case we
need to do symbol lookup (for backtraces, etc.)
Until now, it was mapped in a way that made it accessible to the
program, despite the program not having mapped it itself.
I don't really see a need for userspace to have access to this right
now, so let's lock things down a little bit.
This patch makes it inaccessible to userspace and exposes that fact
through /proc/PID/vm (per-region "user_accessible" flag.)
Currently only Ext2FS and TmpFS supports InodeWatchers. We now fail
with ENOTSUPP if watch_file() is called on e.g ProcFS.
This fixes an issue with FileManager chewing up all the CPU when /proc
was opened. Watchers don't keep the watched Inode open, and when they
close, the watcher FD will EOF.
Since nothing else kept /proc open in FileManager, the watchers created
for it would EOF immediately, causing a refresh over and over.
Fixes#879.
The kernel now supports basic profiling of all the threads in a process
by calling profiling_enable(pid_t). You finish the profiling by calling
profiling_disable(pid_t).
This all works by recording thread stacks when the timer interrupt
fires and the current thread is in a process being profiled.
Note that symbolication is deferred until profiling_disable() to avoid
adding more noise than necessary to the profile.
A simple "/bin/profile" command is included here that can be used to
start/stop profiling like so:
$ profile 10 on
... wait ...
$ profile 10 off
After a profile has been recorded, it can be fetched in /proc/profile
There are various limits (or "bugs") on this mechanism at the moment:
- Only one process can be profiled at a time.
- We allocate 8MB for the samples, if you use more space, things will
not work, and probably break a bit.
- Things will probably fall apart if the profiled process dies during
profiling, or while extracing /proc/profile
Okay, one "dunce hat" point for me. The new PTY majors conflicted with
PATAChannel. Now they are 200 for master and 201 for slave, not used
by anything else.. I hope!
It's now possible to get purgeable memory by using mmap(MAP_PURGEABLE).
Purgeable memory has a "volatile" flag that can be set using madvise():
- madvise(..., MADV_SET_VOLATILE)
- madvise(..., MADV_SET_NONVOLATILE)
When in the "volatile" state, the kernel may take away the underlying
physical memory pages at any time, without notifying the owner.
This gives you a guilt discount when caching very large things. :^)
Setting a purgeable region to non-volatile will return whether or not
the memory has been taken away by the kernel while being volatile.
Basically, if madvise(..., MADV_SET_NONVOLATILE) returns 1, that means
the memory was purged while volatile, and whatever was in that piece
of memory needs to be reconstructed before use.
Using int was a mistake. This patch changes String, StringImpl,
StringView and StringBuilder to use size_t instead of int for lengths.
Obviously a lot of code needs to change as a result of this.
The main thread of each kernel/user process will take the name of
the process. Extra threads will get a fancy new name
"ProcessName[<tid>]".
Thread backtraces now list the thread name in addtion to tid.
Add the thread name to /proc/all (should it get its own proc
file?).
Add two new syscalls, set_thread_name and get_thread_name.
This patch adds these I/O counters to each thread:
- (Inode) file read bytes
- (Inode) file write bytes
- Unix socket read bytes
- Unix socket write bytes
- IPv4 socket read bytes
- IPv4 socket write bytes
These are then exposed in /proc/all and seen in SystemMonitor.
This defaults to 1500 for all adapters, but LoopbackAdapter increases
it to 65536 on construction.
If an IPv4 packet is larger than the MTU, we'll need to break it into
smaller fragments before transmitting it. This part is a FIXME. :^)
Turns out we can use abi::__cxa_demangle() for this, and all we need to
provide is sprintf(), realloc() and free(), so this patch exposes them.
We now have fully demangled C++ backtraces :^)
Previously it was not possible to see what each thread in a process was
up to, or how much CPU it was consuming. This patch fixes that.
SystemMonitor and "top" now show threads instead of just processes.
"ps" is gonna need some more fixing, but it at least builds for now.
Fixes#66.
This adds a call to set_metadata_dirty(true) to
Ext2FS::write_directory(). This fixes a bug wherein InodeWatchers
weren't alerted on directory updates.
Scheduling priority is now set at the thread level instead of at the
process level.
This is a step towards allowing processes to set different priorities
for threads. There's no userspace API for that yet, since only the main
thread's priority is affected by sched_setparam().
Files opened with O_DIRECT will now bypass the disk cache in read/write
operations (though metadata operations will still hit the disk cache.)
This will allow us to test actual disk performance instead of testing
disk *cache* performance, if that's what we want. :^)
There's room for improvment here, we're very aggressively flushing any
dirty cache entries for the specific block before reading/writing that
block. This is done by walking the entire cache, which may be slow.
Don't keep Inodes around in memory forever after we've interacted with
them once. This is a slight performance pessimization when accessing
the same file repeatedly, but closing it for a while in between.
Longer term we should find a way to keep a limited number of unused
Inodes cached, whichever ones we think are likely to be used again.
When creating a new directory, we set the initial size to 1 block.
This meant that we were allocating a block up front, but the Inode's
internal block list cache was not populated with this block.
This broke write_bytes() on a new directory, since it assumed that
the block list cache would be up to date if the call to write_bytes()
would not change the directory's size.
This patch fixes the issue in two ways: First, we cache the initial
block list created for new directories.
Second, we now repopulate the block list cache in write_bytes() if it
is empty when we get there. This is basically just a safety fallback
to avoid having this kind of bug in the future.
We can't be calling the virtual FS::flush_writes() in order to flush
the disk cache from within the disk cache, since an FS subclass may
try to do cache stuff in its flush_writes() implementation.
Instead, separate out the implementation of DiskBackedFS's flushing
logic into a flush_writes_impl() and call that from the cache code.
Also cache the block group descriptor table in a KBuffer on file system
initialization, instead of on first access.
This reduces pressure on the kmalloc heap somewhat.
We were writing out the full block list whenever Ext2FSInode::resize()
was called, even if the old and new sizes were identical.
This patch makes it a no-op, which drastically improves "cp" speed
since we now take full advantage of the up-front call to ftruncate().
If there are not enough free blocks in the filesystem to accomodate
growing an Inode, we should fail with ENOSPC before even starting to
allocate blocks.
Add a simple cache to Ext2FS where we keep block bitmaps along with a
dirty bit. This allows us to coalesce bitmap flushes, giving us a nice
~3x improvement in disk_benchmark write speeds.
Store the cached super block as an ext2_super_block member instead of
caching it in a ByteBuffer and using a casting helper everywhere.
This patch also combines reading/writing of the super block into a
single disk device operation (instead of two.)
Add dedicated internal types for Int64 and UnsignedInt64. This makes it
a bit more straightforward to work with 64-bit numbers (instead of just
implicitly storing them as doubles.)
I have no idea why this was here. It makes no sense. If you're trying
to find out if something is a directory, why wouldn't you be allowed to
ask that about a FIFO? :^)
Thanks to Brandon for spotting this!
Also, while we're here, cache the directory state in a bool member so
we don't have to keep fetching inode metadata when checking this
repeatedly. This is important since sys$read() now calls it.
Previously, procfs$pid_fds would return nothing when called
for a process that had either no open files or a non-existent
handle. This could cause problems when a userspace program
expected a valid Json response.
Procfs$pid_fs now returns an empty array in the aforementioned
cases.
This reverts commit 1cca5142af.
This appears to be causing intermittent triple-faults and I don't know
why yet, so I'll just revert it to keep the tree in decent shape.
Background: DoubleBuffer is a handy buffer class in the kernel that
allows you to keep writing to it from the "outside" while the "inside"
reads from it. It's used for things like LocalSocket and PTY's.
Internally, it has a read buffer and a write buffer, but the two will
swap places when the read buffer is exhausted (by reading from it.)
Before this patch, it was internally implemented as two Vector<u8>
that we would swap between when the reader side had exhausted the data
in the read buffer. Now instead we preallocate a large KBuffer (64KB*2)
on DoubleBuffer construction and use that throughout its lifetime.
This removes all the kmalloc heap traffic caused by DoubleBuffers :^)
The Cache entries found in `DiskBackedFileSystem` are now stored in a
`KBuffer` object, instead of relying on `kmalloc_eternal`. The number
of entries was exceeding that of the number of bytes allocated to
`kmalloc_eternal`, which in turn caused `mount()` to fail epically
when called.
This patch adds three separate per-process fault counters:
- Inode faults
An inode fault happens when we've memory-mapped a file from disk
and we end up having to load 1 page (4KB) of the file into memory.
- Zero faults
Memory returned by mmap() is lazily zeroed out. Every time we have
to zero out 1 page, we count a zero fault.
- CoW faults
VM objects can be shared by multiple mappings that make their own
unique copy iff they want to modify it. The typical reason here is
memory shared between a parent and child process.
This patch overloads Inode::is_directory() with a faster version that
doesn't require instantiating the whole InodeMetadata.
If you have an Ext2FSInode&, calling is_directory() should be instant
since we can just look directly at the raw inode bits.
If we want to make a new entry in the disk cache when it's completely
full of dirty blocks, we'll now synchronously flush the writes at that
point. Maybe it's not ideal, but at least we can keep going.
This way clients are not required to have instantiated ByteBuffers
and can choose whatever memory scheme works best for them.
Also converted some of the Ext2FS code to use stack buffers instead.
Instead of having DiskBackedFS allocate a ByteBuffer, leave it to each
client to provide buffer space.
This is significantly faster in many cases where we can use a stack
buffer and avoid heap allocation entirely.
The hashmap cache was ridiculously slow and hurt us more than it helped
us. This patch replaces it with a flat memory cache that keeps up to
10'000 blocks in cache with a simple dirty bit.
The syncd task will wake up periodically and call flush_writes() on all
file systems, which now causes us to traverse the cache and write all
dirty blocks to disk.
There's a ton of room for improvement here, but this itself is already
drastically better when doing repeated GCC invocations.
1) Off-by-one in block allocation when block size != 1 KB
Due to a quirk in the on-disk layout of ext2, file systems with a block
size of 1 KB have block #1 as their first block, while all others start
on block #0.
2) We had no fallback mechanism when the preferred group was full
We now allocate blocks from the preferred block group as long as it's
possible, and fall back to a simple scan through all groups when the
preferred one is full.
This is a freelist allocator with static size classes that works as a
complement to the generic kmalloc(). It's a lot faster than kmalloc()
since allocation just means popping from the freelist.
It's also significantly more compact when there are a lot of objects
smaller than the minimum kmalloc chunk size (32 bytes.)
This patch enables it for the Region and PhysicalPage classes.
In the PhysicalPage (8 bytes) case, it's a huge improvement since we
no longer waste 75% of the storage allocated.
There are also a number of ways this can be improved, so let's keep
working on it going forward.
Also added some assertions to DirectoryEntry in case someone tries to
instantiate them with names that would overflow the name buffer.
DirectoryEntry is a crappy data structure, and the name buffer is also
crappy. Added a FIXME about replacing it with something nicer.
Before this patch, the DirectoryEntry::name buffer would overflow if
you did "touch extremely-long-file-name". Duh.
Fixes#538.
This was a workaround to be able to build on case-insensitive file
systems where it might get confused about <string.h> vs <String.h>.
Let's just not support building that way, so String.h can have an
objectively nicer name. :^)
- TmpFSInode::write_bytes() needs to allow non-zero offsets
- TmpFSInode::read_bytes() wasn't respecting the offset
GCC puts the temporary files generated during compilation in /tmp,
so this exposed some bugs in TmpFS.
The complication is around /proc/sys/ variables, which were attached
to inodes. Now they're their own thing, and the corresponding inodes
are lazily created (as all other ProcFS inodes are) and simply refer
to them by index.
It is now possible to unmount file systems from the VFS via `umount`.
It works via looking up the `fsid` of the filesystem from the `Inode`'s
metatdata so I'm not sure how fragile it is. It seems to work for now
though as something to get us going.
This is more logical and allows us to solve the problem of
non-blocking TCP sockets getting stuck in SocketRole::None.
The only complication is that a single LocalSocket may be shared
between two file descriptions (on the connect and accept sides),
and should have two different roles depending from which side
you look at it. To deal with it, Socket::role() is made a
virtual method that accepts a file description, and LocalSocket
internally tracks which FileDescription is the which one and
returns a correct role.
Now that there can't be multiple clones of the same fd,
we only need to track whether or not an fd exists on each
side. Also there's no point in tracking connecting fds.
After a fork, the parent and the child are supposed to share
the same file description. For example, modifying the current
offset of a file description is visible in both of them.
Using a HashTable to track "all instances of Foo" is only useful if we
actually need to look up entries by some kind of index. And since they
are HashTable (not HashMap), the pointer *is* the index.
Since we have the pointer, we can just use it directly. Duh.
This increase sizeof(VMObject) by two pointers, but removes a global
table that had an entry for every VMObject, where the cost was higher.
It also avoids all the general hash tabling business when creating or
destroying VMObjects. Generally we should do more of this. :^)
This is comprised of five small changes:
* Keep a counter for tx/rx packets/bytes per TCP socket
* Keep a counter for tx/rx packets/bytes per network adapter
* Expose that data in /proc/net_tcp and /proc/netadapters
* Convert /proc/netadapters to JSON
* Fix up ifconfig to read the JSON from netadapters
This is not perfect as it uses a lot of VM, but since the buffers are
supposed to be temporary it's not super terrible.
This could be improved by giving back the unused VM to the kernel's
RangeAllocator after finishing the buffer building.
This makes VMObject 8 bytes smaller since we can use the array size as
the page count.
The size() is now also computed from the page count instead of being
a separate value. This makes sizes always be a multiple of PAGE_SIZE,
which is sane.
InodeVMObject is a VMObject with an underlying Inode in the filesystem.
AnonymousVMObject has no Inode.
I'm happy that InodeVMObject::inode() can now return Inode& instead of
VMObject::inode() return Inode*. :^)
The VMObject name was always either the owning region's name, or the
absolute path of the underlying inode.
We can reconstitute this information if wanted, no need to keep copies
of these strings around.
This has several significant changes to the networking stack.
* Significant refactoring of the TCP state machine. Right now it's
probably more fragile than it used to be, but handles quite a lot
more of the handshake process.
* `TCPSocket` holds a `NetworkAdapter*`, assigned during `connect()` or
`bind()`, whichever comes first.
* `listen()` is now virtual in `Socket` and intended to be implemented
in its child classes
* `listen()` no longer works without `bind()` - this is a bit of a
regression, but listening sockets didn't work at all before, so it's
not possible to observe the regression.
* A file is exposed at `/proc/net_tcp`, which is a JSON document listing
the current TCP sockets with a bit of metadata.
* There's an `ETHERNET_VERY_DEBUG` flag for dumping packet's content out
to `kprintf`. It is, indeed, _very debug_.
Instead of generating ByteBuffers and keeping those lying around, have
these filesystems generate KBuffers instead. These are way less spooky
to leave around for a while.
Since FileDescription will keep a generated file buffer around until
userspace has read the whole thing, this prevents trivially exhausting
the kmalloc heap by opening many files in /proc for example.
The code responsible for generating each /proc file is not perfectly
efficient and many of them still use ByteBuffers internally but they
at least go away when we return now. :^)
- "seekable": whether the fd is seekable or sequential.
- "class": which kernel C++ class implements this File.
- "offset": the current implicit POSIX API file offset.
- You must now have superuser privileges to use mount().
- We now verify that the mount point is a valid path first, before
trying to find a filesystem on the specified device.
- Convert some dbgprintf() to dbg().
Processes can now have an icon assigned, which is essentially a 16x16 RGBA32
bitmap exposed as a shared buffer ID.
You set the icon ID by calling set_process_icon(int) and the icon ID will be
exposed through /proc/all.
To make this work, I added a mechanism for making shared buffers globally
accessible. For safety reasons, each app seals the icon buffer before making
it global.
Right now the first call to GWindow::set_icon() is what determines the
process icon. We'll probably change this in the future. :^)
This makes assertion failures generate backtraces again. Sorry to everyone
who suffered from the lack of backtraces lately. :^)
We share code with the /proc/PID/stack implementation. You can now get the
current backtrace for a Thread via Thread::backtrace(), and all the traces
for a Process via Process::backtrace().
The syscall is quite simple:
int watch_file(const char* path, int path_length);
It returns a file descriptor referring to a "InodeWatcher" object in the
kernel. It becomes readable whenever something changes about the inode.
Currently this is implemented by hooking the "metadata dirty bit" in
Inode which isn't perfect, but it's a start. :^)
Committing some things my hands did while browsing through this code.
- Mark all leaf classes "final".
- FileDescriptionBlocker now stores a NonnullRefPtr<FileDescription>.
- FileDescriptionBlocker::blocked_description() now returns a reference.
- ConditionBlocker takes a Function&&.
"Blocking" is not terribly informative, but now that everything is
ported over, we can force the blocker to provide us with a reason.
This does mean that to_string(State) needed to become a member, but
that's OK.
I was messing around with this to tell the compiler that these functions
always return the same value no matter how many times you call them.
It doesn't really seem to improve code generation and it looks weird so
let's just get rid of it.
This is obviously more readable. If we ever run into a situation where
ref count churn is actually causing trouble in the future, we can deal with
it then. For now, let's keep it simple. :^)
Also tweak the kernel's Makefile to use -nostdinc and -nostdinc++.
This prevents us from picking up random headers from ../Root, which may
include older versions of kernel headers.
Since we still need <initializer_list> for Vector, we specifically include
the necessary GCC path. This is a bit hackish but it works for now.
This is prep work for supporting HashMap with NonnullRefPtr<T> as values.
It's currently not possible because many HashTable functions require being
able to default-construct the value type.
Update ProcessManager, top and WSCPUMonitor to handle the new format.
Since the kernel is not allowed to use floating-point math, we now compile
the JSON classes in AK without JsonValue::Type::Double support.
To accomodate large unsigned ints, I added a JsonValue::Type::UnsignedInt.
It's kinda funny how I can make a mistake like this in Serenity and then
get so used to it by spending lots of time using this API that I start to
believe that this is how printf() always worked..
Previously the check for an empty part would happen before the
check for access and for the parent being a directory, and so an
error in those would not be detected.
If a symlink is not the last part of a path, the remaining part
of the path has to be further resolved against the symlink target.
With this, a path containing a symlink always resolves to the target
of the first (leftmost) symlink in it, for example any path of form
/proc/self/... resolves to the corresponding /proc/pid directory.
StringView character buffer is not guaranteed to be null-terminated;
in particular it will not be null-terminated when making a substring.
This means that the buffer can not be used with C functions that expect
a null-terminated string. Instead, StringView provides a convinient
operator == for comparing it with Strings and C stirngs, so use that.
This fixes /proc/self/... resolution failures in ProcFS, since the name
("self") passed to ProcFSInode::lookup() would not be null-terminated.
This significantly reduces the pressure on the kernel heap when
allocating a lot of pages.
Previously at about 250MB allocated, the free page list would outgrow
the kernel's heap. Given that there is no longer a page list, this does
not happen.
The next barrier will be the kernel memory used by the page records for
in-use memory. This kicks in at about 1GB.
After reading a bunch of POSIX specs, I've learned that a file descriptor
is the number that refers to a file description, not the description itself.
So this patch renames FileDescriptor to FileDescription, and Process now has
FileDescription* file_description(int fd).