The CAP.TO is 0 based. Even though I don't see that mentioned in the
spec explicitly, all major OSs such as Linux, FreeBSD add 1 to the
CAP.TO while calculating the timeout.
IO::delay was added as a lazy alternative to looping with a timeout
error if the condition was not satisfied. Now that we have the
wait_for_ready function, remove the delay in the reset and start
controller function.
This mostly just moved the problem, as a lot of the callers are not
capable of propagating the errors themselves, but it's a step in the
right direction.
Since NVME devices end with a digit that indicates the node index we
cannot simply append a partition index. Instead, there will be a "p"
character as separator, e.g. /dev/nvme0n1p3 for the 3rd partition.
So, if the early device name ends in a digit we need to add this
separater before matching for the partition index.
If the partition index is omitted (as is the default) the root file
system is on a disk without any partition table (e.g. using QEMU).
This enables booting from the correct partition on an NVMe drive by
setting the command line variable root to e.g. root=/dev/nvme0n1p1
We need to use the volatile keyword when mapping the device registers,
or the compiler may optimize access, which lead to this QEMU error:
pci_nvme_ub_mmiord_toosmall in nvme_mmio_read: MMIO read smaller than
32-bits, offset=0x0
Add a basic NVMe driver support to serenity
based on NVMe spec 1.4.
The driver can support multiple NVMe drives (subsystems).
But in a NVMe drive, the driver can support one controller
with multiple namespaces.
Each core will get a separate NVMe Queue.
As the system lacks MSI support, PIN based interrupts are
used for IO.
Tested the NVMe support by replacing IDE driver
with the NVMe driver :^)
This will allow File and it's descendants to use RefCounted instead of
having a custom implementation of unref. (Since RefCounted calls
will_be_destroyed automatically)
This commit also removes an erroneous call to `before_removing` in
AHCIPort, this is a duplicate call, as the only reference to the device
is immediately dropped following the call, which in turns calls
`before_removing` via File::unref.
This was a premature optimization from the early days of SerenityOS.
The eternal heap was a simple bump pointer allocator over a static
byte array. My original idea was to avoid heap fragmentation and improve
data locality, but both ideas were rooted in cargo culting, not data.
We would reserve 4 MiB at boot and only ended up using ~256 KiB, wasting
the rest.
This patch replaces all kmalloc_eternal() usage by regular kmalloc().
In order to reduce our reliance on __builtin_{ffs, clz, ctz, popcount},
this commit removes all calls to these functions and replaces them with
the equivalent functions in AK/BuiltinWrappers.h.
Some calls of copy_to_user were converting Userspace<T*> to
Userspace<U*> via the implicit conversion to FlatPtr. Change them to use
the static_ptr_cast overload that is designed to express this conversion
Instead of repeating ourselves with the pattern of waiting for some
condition to be met, we can have a general method for this task,
and then we can provide the retry count, the required delay and a lambda
function for the checked condition.
Don't use interrupts when trying to reset a device that is connected to
a port on the AHCI controller, and instead poll for changes in status to
break out from the loop. At the worst case scenario we can wait 0.01
seconds for each SATA reset.
Don't use interrupts when trying to identify a device that is connected
to a port on the AHCI controller, and instead poll for changes in status
to end the transaction.
Not only this simplifies the initialization sequence, it ensures that
for whatever reason the controller doesn't send an IRQ, we are never
getting stuck at this point.
Like what happened with the PCI and USB code, this feels like the right
thing to do because we can improve on the ATA capabilities and keep it
distinguished from the rest of the subsystem.
We now use AK::Error and AK::ErrorOr<T> in both kernel and userspace!
This was a slightly tedious refactoring that took a long time, so it's
not unlikely that some bugs crept in.
Nevertheless, it does pass basic functionality testing, and it's just
real nice to finally see the same pattern in all contexts. :^)
There's basically no real difference in software between a SATA harddisk
and IDE harddisk. The difference in the implementation is for the host
bus adapter protocol and registers layout.
Therefore, there's no point in putting a distinction in software to
these devices.
This change also greatly simplifies and removes stale APIs and removes
unnecessary parameters in constructor calls, which tighten things
further everywhere.
This change is another minor step towards removing `AK::String` from
the Kernel. Instead of dynamically allocating the storage_name we can
instead allocate it via a KString in the factory for each device, and
then push the device name down into the StorageDevice base class.
We don't have a way of doing `AK::String::formatted(..)` with a KString
at the moment, so cleaning that up will be left for a later day.
Previously there was a mix of returning plain strings and returning
explicit string views using `operator ""sv`. This change switches them
all to standardized on `operator ""sv` as it avoids a call to strlen.
Previously there was a mix of returning plain strings and returning
explicit string views using `operator ""sv`. This change switches them
all to standardized on `operator ""sv` as it avoids a call to strlen.
This allows us to remove the PCI::get_interrupt_line API function. As a
result, this removes a bunch of not so great patterns that we used to
cache PCI interrupt line in many IRQHandler derived classes instead of
just using interrupt_number method of IRQHandler class.
This singleton simplifies many aspects that we struggled with before:
1. There's no need to make derived classes of Device expose the
constructor as public anymore. The singleton is a friend of them, so he
can call the constructor. This solves the issue with try_create_device
helper neatly, hopefully for good.
2. Getting a reference of the NullDevice is now being done from this
singleton, which means that NullDevice no longer needs to use its own
singleton, and we can apply the try_create_device helper on it too :)
3. We can now defer registration completely after the Device constructor
which means the Device constructor is merely assigning the major and
minor numbers of the Device, and the try_create_device helper ensures it
calls the after_inserting method immediately after construction. This
creates a great opportunity to make registration more OOM-safe.
Instead of doing so in the constructor, let's do immediately after the
constructor, so we can safely pass a reference of a Device, so the
SysFSDeviceComponent constructor can use that object to identify whether
it's a block device or a character device.
This allows to us to not hold a device in SysFSDeviceComponent with a
RefPtr.
Also, we also call the before_removing method in both SlavePTY::unref
and File::unref, so because Device has that method being overrided, it
can ensure the device is removed always cleanly.
This is really a basic support for AHCI hotplug events, so we know how
to add a node representing the device in /sys/dev/block and removing it
according to the event type (insertion/removal).
This change doesn't take into account what happens if the device was
mounted or a read/write operation is being handled.
For this to work correctly, StorageManagement now uses the Singleton
container, as it might be accessed simultaneously from many CPUs
for hotplug events. DiskPartition holds a WeakPtr instead of a RefPtr,
to allow removal of a StorageDevice object from the heap.
StorageDevices are now stored and being referenced to via an
IntrusiveList to make it easier to remove them on hotplug event.
In future changes, all of the stated above might change, but for now,
this commit represents the least amount of changes to make everything
to work correctly.
These methods are no longer needed because SystemServer is able to
populate the DevFS on its own.
Device absolute_path no longer assume a path to the /dev location,
because it really should not assume any path to a Device node.
Because StorageManagement still needs to know the storage name, we
declare a virtual method only for StorageDevices to override, but this
technique should really be removed later on.
A couple of things were changed:
1. Semantic changes - PCI segments are now called PCI domains, to better
match what they are really. It's also the name that Linux gave, and it
seems that Wikipedia also uses this name.
We also remove PCI::ChangeableAddress, because it was used in the past
but now it's no longer being used.
2. There are no WindowedMMIOAccess or MMIOAccess classes anymore, as
they made a bunch of unnecessary complexity. Instead, Windowed access is
removed entirely (this was tested, but never was benchmarked), so we are
left with IO access and memory access options. The memory access option
is essentially mapping the PCI bus (from the chosen PCI domain), to
virtual memory as-is. This means that unless needed, at any time, there
is only one PCI bus being mapped, and this is changed if access to
another PCI bus in the same PCI domain is needed. For now, we don't
support mapping of different PCI buses from different PCI domains at the
same time, because basically it's still a non-issue for most machines
out there.
2. OOM-safety is increased, especially when constructing the Access
object. It means that we pre-allocating any needed resources, and we try
to find PCI domains (if requested to initialize memory access) after we
attempt to construct the Access object, so it's possible to fail at this
point "gracefully".
3. All PCI API functions are now separated into a different header file,
which means only "clients" of the PCI subsystem API will need to include
that header file.
4. Functional changes - we only allow now to enumerate the bus after
a hardware scan. This means that the old method "enumerate_hardware"
is removed, so, when initializing an Access object, the initializing
function must call rescan on it to force it to find devices. This makes
it possible to fail rescan, and also to defer it after construction from
both OOM-safety terms and hotplug capabilities.
This expands the reach of error propagation greatly throughout the
kernel. Sadly, it also exposes the fact that we're allocating (and
doing other fallible things) in constructors all over the place.
This patch doesn't attempt to address that of course. That's work for
our future selves.
The default template argument is only used in one place, and it
looks like it was probably just an oversight. The rest of the Kernel
code all uses u8 as the type. So lets make that the default and remove
the unused template argument, as there doesn't seem to be a reason to
allow the size to be customizable.
Now that the old PCI::Device was removed, we can complete the PCI
changes by making the PCI::DeviceController to be named PCI::Device.
Really the entire purpose and the distinction between the two was about
interrupts, but since this is no longer a problem, just rename it to
simplify things further.
...and also RangeAllocator => VirtualRangeAllocator.
This clarifies that the ranges we're dealing with are *virtual* memory
ranges and not anything else.
Problem:
- New `all_of` implementation takes the entire container so the user
does not need to pass explicit begin/end iterators. This is unused
except is in tests.
Solution:
- Make use of the new and more user-friendly version where possible.
We don't need to have a dedicated API for creating a VMObject with a
single page, the multi-page API option works in all cases.
Also make the API take a Span<NonnullRefPtr<PhysicalPage>> instead of
a NonnullRefPtrVector<PhysicalPage>.
The `#pragma GCC diagnostic` part is needed because the class has
virtual methods with the same name but different arguments, and Clang
tries to warn us that we are not actually overriding anything with
these.
Weirdly enough, GCC does not seem to care.
This fixes a bug that occurs when the controller's ports are not
(internally) numbered sequentially.
This is done by checking the bits set in PI.
This bug was found on bare-metal, on a laptop with 1 Port that
was reported as port 4.
If we are in a shared interrupt handler, the called handlers might
indicate it was not their interrupt, so we should not increment the
call counter of these handlers.
This has a quirk with the AMD Hudson-2 SATA controller. [1022:7801]
Having this flag set makes the controller become stuck in a busy loop.
I decided to remove the flag instead of making it a quirk as it still
works with Qemu, VirtualBox, VMware Player and the Intel Wildcat
Point-LP SATA Controller [8086:9c83] without it, thus making it simpler
to just remove it.
Partial fix for #7738 (as it still does not work in IDE mode)
This change allows the controller to utilize interrupts even if no
device was connected to a port when we initialize it, so we can support
hotplug events now.
This was proved to be a problematic option. I tested this option on
bare metal AHCI controller, and if we didn't reset the controller, the
firmware (SeaBIOS) could leave the controller state not clean, so an
plugged device signature was in place although the specific port had no
plugged device after rebooting.
Therefore, we need to ensure we use the controller in a clean state
always.
In addition to that, the Complete option was renamed to Aggressive, as
it represents better the consequences of choosing this option.
Fixes off-by-one caused by reading the register directly
without adding a 1 to it, because AHCI reports 1 less port than
the actual number of ports supported.
On my machine, it only sets PRC and not PCC.
Confirmed to happen on:
- 8086:9ca2 (Intel Corporation Wildcat Point-LP SATA Controller
[AHCI Mode] (rev 03))
On my bare metal machine, enabling it as this point causes it to
instantly send an interrupt, and we're too early in the process
to be able to handle AHCI interrupts. The interrupts were being
enabled in the initialize function anyway.
Confirmed to happen on:
- 8086:9ca2 (Intel Corporation Wildcat Point-LP SATA Controller
[AHCI Mode] (rev 03))
- 8086:3b22 (Intel Corporation 5 Series/3400 Series Chipset
6 port SATA AHCI Controller (rev 06))
AnonymousVMObject::create_with_physical_page(s) can't be NonnullRefPtr
as it allocates internally. Fixing the API then surfaced an issue in
ScatterGatherList, where the code was attempting to create an
AnonymousVMObject in the constructor which will not be observable
during OOM.
Fix all of these issues and start propagating errors at the callers
of the AnonymousVMObject and ScatterGatherList APis.
This code was unlocking the lock directly, but the Locker is still
attached, causing the lock to be unlocked an extra time, hence
corrupting the internal lock state.
This is extra confusing though, as complete_current_request() runs
without a lock which also looks like a bug. But that's a task for
another day.
We want to move this out of the AHCI subsystem into the VM system,
since other parts of the kernel may need to perform scatter-gather IO.
We rename the current VM::ScatterGatherList impl that's used in the
virtio subsystem to VM::ScatterGatherRefList, since its distinguishing
feature from the AHCI scatter-gather list is that it doesn't own its
buffers.
We had some inconsistencies before:
- Sometimes "The", sometimes "the"
- Sometimes trailing ".", sometimes no trailing "."
I picked the most common one (lowecase "the", trailing ".") and applied
it to all copyright headers.
By using the exact same string everywhere we can ensure nothing gets
missed during a global search (and replace), and that these
inconsistencies are not spread any further (as copyright headers are
commonly copied to new files).
The overall design is the same, but we change a few things,
like decreasing the amount of blocking forever loops. The goal
is to ensure the kernel won't hang forever when dealing with
buggy hardware.
Also, we reset the channel when initializing it, just in case the
hardware was in bad state before we start use it.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.
See: https://spdx.dev/resources/use/#identifiers
This was done with the `ambr` search and replace tool.
ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
The first one is for disabling the PS2 controller, the other one is for
disabling physical storage enumeration.
We can't be sure any machine will work with our implementation,
therefore this will help us to test more machines.
We need to do it to let real hardware to put the correct voltages
on the wire.
Apparently my ICH7 machine refused to boot, and was reading lots of
garbage from an unconnected IDE channel. It was fixed after I added a
delay of 20 microseconds. It probably can be reduced, I just took a safe
value and it seems to work correctly without any problems :)
Also handle native and compatibility channel modes together, so if only
one IDE channel was set to work on PCI native mode, we need to handle it
separately, so the other channel continue to operate with the legacy IO
ports and interrupt line.
This is a "quirk" I've observed on a Intel ICH7 test machine. Apparently
we need to select the device (master or slave) before starting to work
with the bus master register.
It's very possible that other machines are requiring this step to happen
before the DMA transfer can occur correctly.
Also, when reading with DMA, we should set the transfer direction before
clearing the interrupt status.
For the sake of completeness, I added a few lines in places that I
deemed it to be reasonable to clear the interrupt status there.
Although unlikely to happen, a user can have an IDE controller that
doesn't support bus master capability. If that's the case, we need to
check for this, and create an IDEChannel (not BMIDEChannel) to allow
IO operations with the controller.
If the user requests to force PIO mode, we just create IDEChannel
objects which are capable of sending PIO commands only.
However, if the user doesn't force PIO mode, we create BMIDEChannel
objects, which are sending DMA commands.
This change is somewhat simplifying the code, so each class is
supporting its type of operation - PIO or DMA. The PATADiskDevice
should not care if DMA is enabled or not.
Later on, we could write an IDEChannel class for UDMA modes,
that are available and documented on Intel specifications for their IDE
controllers.
Technically not supported by the original ATA specification, IDE
hot swapping is still in practice possible, so the only sane way
to start support it is with ref-counting the IDEChannel object so if we
remove a PATADiskDevice, it's not gone with it.
An article about IDE limits states that:
"Hard drives over 8.4 GB are supposed to report their geometry as
16383/16/63. This in effect means that the `geometry' is obsolete, and
the total disk size can no longer be computed from the geometry, but is
found in the LBA capacity field returned by the IDENTIFY command.
Hard drives over 137.4 GB are supposed to report an LBA capacity of
0xfffffff = 268435455 sectors (137438952960 bytes). Now the actual disk
size is found in the new 48-capacity field."
(https://tldp.org/HOWTO/Large-Disk-HOWTO-4.html) which is the main
reason to not support CHS as harddrives with less than 8.4 GB capacity
are completely obsolete.
Another good reason is that virtually any harddrive in the last 20 years
or so, supports LBA mode. Therefore, it's probably OK to just ignore CHS
as it's unlikely to encounter a harddrive that doesn't support LBA.
This is somewhat simplifying the IDE initialization and access code.
Also, we should use the ATAIdentifyBlock structure if possible,
so now we do it instead of using macros to calculate offsets.
With the usage of the ATAIdentifyBlock structure, we now use the
48-bit LBA max count if the drive indicates it supports 48-bit LBA mode.
This reverts commit cfc2f33dcb.
We can't actually change the IRQ line value and expect the device
to work with it (this was my mistake).
That register is R/W so the firmware can figure out IRQ routing and put
the correct value and write it to the Interrupt line register.
As a compromise, if the fimrware decided to set the IRQ line to be 7,
or something else we can't deal with, the user can simply force the code
to work with IRQ 11, with the boot argument "force_ahci_irq_11" being
set to "on".
Instead of polling if the device ended the operation, we can just use
interrupts for signalling about end of IO operation.
In similar way, we use interrupts during device detection.
Also, we use the new Work Queue mechanism introduced by @tomuta to allow
better performance and stability :)
We can't use deferred functions for anything that may require preemption,
such as copying from/to user or accessing the disk. For those purposes
we should use a work queue, which is essentially a kernel thread that
may be preempted or blocked.
These errors are classed as fatal, so we need to recover from them.
Found while trying to debug AHCI boot on VMware Player,
where I got TFES.
From the spec: "Fatal errors (signified by the setting of PxIS.HBFS,
PxIS.HBDS, PxIS.IFS, or PxIS.TFES) will cause the HBA to enter
the ERR:Fatal state"
We were already recovering from IFS.
Instead of blindly resetting every AHCI port, let's just reset only the
controller by default. The user can still request to reset everything
with a new kernel boot argument called ahci_reset_mode which is set
by default to "controller", so the code will only invoke an HBA reset.
This kernel boot argument can be set to 3 different values:
1. "controller" - reset the HBA and skip resetting AHCI ports
2. "none" - don't reset anything, so we rely on the firmware to
initialize the AHCI HBA and ports for us.
3. "complete" - reset the AHCI HBA and ports.
Instead of waiting for the AHCI HBA to reset the signature after SATA
reset sequence, let's just check if the Port x Serial ATA Status
register was set to value 3, indicating that device was detected
and phy communication was established.
The intention is to make the boot to be faster, therefore we should
decrease the time deltas in timeout loops to allow earlier break
from these.
Also, there's no need to wait 10 milliseconds before setting
the interface state to "no action request" during the reset sequence.
This class is used in the AHCI code to handle a big request of
read/write to the disk. If we happen to encounter such request,
we will get the needed amount of physical pages from the
already-allocated physical pages in AHCIPort, and with that we
will create a ScatterList that will create a Region that maps
all of these pages in a contiguous virtual memory range.
Then, we could easily copy to/from this range, before and after
calling the operation on the StorageDevice as needed with
read or write operations.
The hierarchy is AHCIController, AHCIPortHandler, AHCIPort and
SATADiskDevice. Each AHCIController has at least one AHCIPortHandler.
An AHCIPortHandler is an interrupt handler that takes care of
enumeration of handled AHCI ports when an interrupt occurs. Each
AHCIPort takes care of one SATADiskDevice, and later on we can add
support for Port multiplier.
When we implement support of Message signalled interrupts, we can spawn
many AHCIPortHandlers, and allow each one of them to be responsible for
a set of AHCIPorts.
Previously all of the CommandLine parsing was spread out around the
Kernel. Instead move it all into the Kernel CommandLine class, and
expose a strongly typed API for querying the state of options.
This may seem like a no-op change, however it shrinks down the Kernel by a bit:
.text -432
.unmap_after_init -60
.data -480
.debug_info -673
.debug_aranges 8
.debug_ranges -232
.debug_line -558
.debug_str -308
.debug_frame -40
With '= default', the compiler can do more inlining, hence the savings.
I intentionally omitted some opportunities for '= default', because they
would increase the Kernel size.
(...and ASSERT_NOT_REACHED => VERIFY_NOT_REACHED)
Since all of these checks are done in release builds as well,
let's rename them to VERIFY to prevent confusion, as everyone is
used to assertions being compiled out in release.
We can introduce a new ASSERT macro that is specifically for debug
checks, but I'm doing this wholesale conversion first since we've
accumulated thousands of these already, and it's not immediately
obvious which ones are suitable for ASSERT.
If we try to align a number above 0xfffff000 to the next multiple of
the page size (4 KiB), it would wrap around to 0. This is most likely
never what we want, so let's assert if that happens.
This change can be actually seen as two logical changes, the first
change is about to ensure we only read the ATA Status register only
once, because if we read it multiple times, we acknowledge interrupts
unintentionally. To solve this issue, we always use the alternate Status
register and only read the original status register in the IRQ handler.
The second change is how we handle interrupts - if we use DMA, we can
just complete the request and return from the IRQ handler. For PIO mode,
it's more complicated. For PIO write operation, after setting the ATA
registers, we send out the data to IO port, and wait for an interrupt.
For PIO read operation, we set the ATA registers, and wait for an
interrupt to fire, then we just read from the data IO port.
This replaces the current disk detection and disk access code with
code based on https://wiki.osdev.org/IDE
This allows the system to boot on VirtualBox with serial debugging
enabled and VMWare Player.
I believe there were several issues with the current code:
- It didn't utilise the last 8 bits of the LBA in 24-bit mode.
- {read,write}_sectors_with_dma was not setting the obsolete bits,
which according to OSdev wiki aren't used but should be set.
- The PIO and DMA methods were using slightly different copy
and pasted access code, which is now put into a single
function called "ata_access"
- PIO mode doesn't work. This doesn't fix that and should
be looked into in the future.
- The detection code was not checking for ATA/ATAPI.
- The detection code accidentally had cyls/heads/spt as 8-bit,
when they're 16-bit.
- The capabilities of the device were not considered. This is now
brought in and is currently used to check if the device supports
LBA. If not, use CHS.
This was done with the help of several scripts, I dump them here to
easily find them later:
awk '/#ifdef/ { print "#cmakedefine01 "$2 }' AK/Debug.h.in
for debug_macro in $(awk '/#ifdef/ { print $2 }' AK/Debug.h.in)
do
find . \( -name '*.cpp' -o -name '*.h' -o -name '*.in' \) -not -path './Toolchain/*' -not -path './Build/*' -exec sed -i -E 's/#ifdef '$debug_macro'/#if '$debug_macro'/' {} \;
done
# Remember to remove WRAPPER_GERNERATOR_DEBUG from the list.
awk '/#cmake/ { print "set("$2" ON)" }' AK/Debug.h.in
Since devices are enumerable and can compute their own name inside the
/dev hierarchy, there is no need to try and parse "root=/dev/xxx" by
hand.
This also makes any block device a candidate for the boot device, which
now includes ramdisk devices, so SerenityOS can now boot diskless too.
The disk image generated for QEMU is suitable, as long as it fits in
memory with room to spare for the rest of the system.
Besides removing the monolithic DevFSDeviceInode::determine_name()
method, being able to determine a device's name inside the /dev
hierarchy outside of DevFS has its uses.