sapling/eden/docs/InodeLifetime.md
Adam Simpkins b23b73c907 add document describing Inode lifetime management
Summary:
This describes the behavior around Inode ownership, loading and unloading, as
well as the synchronization around these events.

The InodeMap and InodePtr diffs implementing this behavior are still in flight.

Reviewed By: bolinfest

Differential Revision: D4356591

fbshipit-source-id: 298ba2c6aa98feba8b90f858e5633ca793276a1e
2016-12-22 19:01:12 -08:00

220 lines
9.8 KiB
Markdown

Inode Ownership
===============
Inodes are managed via `InodePtr` objects. `InodePtr` is a smart-pointer class
that maintains a reference count on the underlying `InodeBase` object, similar
to `std::shared_ptr`.
However, unlike `std::shared_ptr` Inodes are not necessarily deleted
immediately when their reference count drops to zero. Instead they may remain
in memory for a while in case they are used again soon.
Owners
------
- `InodeMap` holds a reference to the root inode. This ensures that the root
inode remains in existence for as long as the `EdenMount` exists.
- Each Inode holds a reference to its parent `TreeInode`. This ensures that
if an inode exists, all of its parents all the way to the mount point root
also exist.
- For all other call sites, callers obtain a reference to an Inode when they
look it up. The lookup functions return `InodePtr` objects that the call
site should retain for as long as they need access to the Inode.
Non-Owners
----------
- `InodeMap` does not hold a reference to the inodes it contains. Otherwise
it would never be possible to unload or destroy any inodes. Instead the
`InodeMap` holds raw pointers to Inode objects. When `Inode` objects are
unloaded they are always explicitly removed from the `InodeMap`'s list of
loaded inodes.
- A `TreeInode` does not hold a reference to any of its children. Otherwise
this would cause circular reference, since each child holds a reference to
its parent `TreeInode`. The `TreeInode` is always explicitly informed when
one of its children inodes is unloaded, so it can remove the raw pointer to
the child from its child entries map.
Inode Lookup
============
Inodes may be looked up in one of two ways, either by name or by inode number.
`TreeInode::getOrLoadChild()` is the API for doing Inode lookups by name,
and `InodeMap::lookupInode()` is the API for doing Inode lookups by inode
number.
Either of these two APIs may have to create the Inode object. Alternatively,
if the specified Inode already exists they will increment the reference count
to the existing object and return it. It is possible the Inode is already
present in the `InodeMap` but was previously unreferenced, so these APIs may
increment the reference count from 0 to 1.
Simultaneous Lookups
--------------------
The `InodeMap` class keeps track of all currently loaded Inodes as well as
information about Inodes that have inode numbers allocated but are not loaded.
For each unloaded inode, `InodeMap` records if it is currently being loaded.
This allows `InodeMap` to avoid starting two load attempts for the same inode.
If a second lookup attempt occurs for an Inode already being loaded, `InodeMap`
handles notifying both waiting callers when the single load attempt completes.
Inode Unloading
===============
Inode unloading can be triggered by several events:
## Inode reference count going to zero
When the inode reference count drops to zero we have a chance to decide if we
want to unload the inode or not.
When shutting down the mount point we always destroy each Inode as soon as its
reference count goes to zero.
If the inode is unlinked and its FUSE reference count is also zero we also
destroy the inode immediately.
In other cases we generally leave the Inode object loaded, but it would be
valid to decide to unload it based on other criteria. (For instance, we could
decide to immediately unload unreferenced inodes if we are low on memory.)
## FUSE reference count going to zero
When the FUSE reference count goes to zero we should destroy the inode
immediately if it is unlinked and its pointer reference count is also zero.
To simplify synchronization, we currently collapse this case into the one
above: we only decrement the FUSE reference count on a loaded Inode when we are
holding a normal `InodePtr` reference to the Inode. Therefore we will always
see the normal reference count drop to zero at some point after the FUSE
reference count drops to zero, and we process the unload at that time.
## On demand
We will likely add a periodic background task to unload unreferenced inodes
that have not been accessed in some time. This unload operation could also be
triggered in response to other events (for instance, a thrift call, or going
over some memory usage limit).
Synchronization and the Acquire Count
-------------------------------------
Synchronization of Inode loading and unloading is slightly tricky, particularly
for unloading.
### Loading
When loading an inode, we always hold the `InodeMap` lock to check if the inode
in question is already loaded or if a load is in progress. Once the inode is
loaded we acquire its parent `TreeInode`'s `contents_` lock, then the
`InodeMap` lock (in that order), so we can insert the inode into it's parent's
entry list and into the `InodeMap`'s list of loaded inodes.
### Updating Reference Counts
`InodePtr` itself does not hold any extra locks when performing reference
count updates. The main Inode reference count is updated with atomic
operations, but without any other locks held.
However, there is one important item to note here: updates done via `InodePtr`
copying can never increment the reference count from 0 to 1. The lookup APIs
(`TreeInode::getOrLoadChild()` and `InodeMap::lookupInode()`) are the only two
places that can ever increment the reference count from 0 to 1. Both of these
lookup APIs hold a lock when potentially updating the reference count from 0 to
1.
`TreeInode::getOrLoadChild()` holds the parent `TreeInode`'s `contents_` lock,
and `InodeMap::lookupInode()` holds the `InodeMap` lock. This means that if
you hold both of these locks and you see that an Inode's reference count is
currently 0, no other thread can acquire a reference count to that Inode.
### Preventing Multiple Unload Attempts
Holding the parent `TreeInode`'s `contents_` lock and the `InodeMap` lock
ensures that no other thread can acquire a new reference on an Inode, but that
alone does not mean it is safe to destroy the inode. We still need to prevent
multiple threads from both trying to destroy an Inode.
For instance, consider if thread A destroys the last `InodePtr` to an Inode,
dropping its reference count to 0. However, before thread A has a chance to
grab the `TreeInode` and `InodeMap` locks and decide if it wants to unload the
inode, thread B looks up the inode, increasing the reference count from 0 to 1,
but then immediately destroys its `InodePtr`, dropping the reference count back
to 0.
In this situation thread A and thread B have both just dropped the reference
count to 0. We need to make sure that only one of these two threads can try to
destroy the inode.
This is achieved through another counter, called the "acquire" counter.
This counter is incremented each time the Inode reference count goes from 0 to
1, and decremented each time the reference count goes from 1 to 0. However,
unlike the main reference count, the acquire counter is only modified while
holding some additional locks.
Increments to the acquire counter are only done while holding either the
parent `TreeInode`'s `contents_` lock (in the case of
`TreeInode::getOrLoadChild()`) or the `InodeMap` lock (in the case of
`InodeMap::lookupInode()`).
Decrements to the acquire counter are only done while holding both the
parent `TreeInode`'s `contents_` lock and the `InodeMap` lock.
When thread A and thread B both see that the main reference count drops to 0,
they both attempt to acquire both the `TreeInode` and `InodeMap` locks.
Whichever thread acquires the locks first will see that the acquire count is
non-zero (since both threads incremented it when bumping the main reference
count from 0 to 1). This thread decrements the acquire count and does nothing
else since the acquire count is non zero. The second thread can then acquire
the locks, decrement the acquire count and see that it is now zero. This
second thread can then perform the unload (while still holding both locks).
EdenMount Destruction
=====================
All Inode objects store a pointer to the `EdenMount` that they are a part of.
This means that the `EdenMount` itself cannot be destroyed until all of its
Inodes are destroyed.
We achieve this via the root `TreeInode`'s reference count. During normal
operation the `EdenMount` holds a reference to the root `TreeInode`.
(Technically the `InodeMap` holds the reference, but the `EdenMount` owns the
`InodeMap`.) When the `EdenMount` needs to be destroyed, we release the
reference count on the root inode. When the root inode becomes unreferenced we
know that all of its children have been destroyed, and it is now safe to
destroy the `EdenMount` object itself.
All of this is triggered through the `EdenMount::destroy()` function. This
function marks the mount as shutting down, which causes the `InodeMap` to
immediately unload any Inodes that become newly unreferenced. We then trigger
an immediate unload scan to unload any Inodes that were already unreferenced.
Once this is done we release the `InodeMap`'s reference count on the root
inode, allowing it to become unreferenced once all of its children are
destroyed.
FUSE Reference Counts
=====================
In addition to the reference count tracking how many `InodePtr` objects are
currently referring to an inode, `InodeBase` also keeps track of how
outstanding references to this inode from the FUSE layer. (This is the number
of `lookup()`/`create()`/`mkdir()`/`symlink()`/`link()` calls made for this
inode, minus the number of times it was forgotten via `forget()`.)
However, the FUSE reference count is not directly related to the Inode object
lifetime.
Inode objects may be unloaded even when the FUSE reference count is non-zero.
In this case the `InodeMap` retains enough information needed to re-create the
`Inode` object if the inode number is later looked up again by the FUSE API.
The FUSE reference count is only adjusted while holding a normal InodePtr
reference to the Inode.