diff --git a/eden/docs/InodeLifetime.md b/eden/docs/InodeLifetime.md
new file mode 100644
index 0000000000..42f3d4ce1c
--- /dev/null
+++ b/eden/docs/InodeLifetime.md
@@ -0,0 +1,219 @@
+Inode Ownership
+===============
+
+Inodes are managed via `InodePtr` objects. `InodePtr` is a smart-pointer class
+that maintains a reference count on the underlying `InodeBase` object, similar
+to `std::shared_ptr`.
+
+However, unlike with `std::shared_ptr`, Inodes are not necessarily deleted
+immediately when their reference count drops to zero. Instead they may remain
+in memory for a while in case they are used again soon.
+
+Owners
+------
+
+- `InodeMap` holds a reference to the root inode. This ensures that the root
+  inode remains in existence for as long as the `EdenMount` exists.
+
+- Each Inode holds a reference to its parent `TreeInode`. This ensures that
+  if an inode exists, all of its ancestors up to the mount point root
+  also exist.
+
+- For all other call sites, callers obtain a reference to an Inode when they
+  look it up. The lookup functions return `InodePtr` objects that the call
+  site should retain for as long as it needs access to the Inode.
+
+Non-Owners
+----------
+
+- `InodeMap` does not hold a reference to the inodes it contains. Otherwise
+  it would never be possible to unload or destroy any inodes. Instead the
+  `InodeMap` holds raw pointers to Inode objects. When `Inode` objects are
+  unloaded they are always explicitly removed from the `InodeMap`'s list of
+  loaded inodes.
+
+- A `TreeInode` does not hold a reference to any of its children. Otherwise
+  this would cause a circular reference, since each child holds a reference to
+  its parent `TreeInode`. The `TreeInode` is always explicitly informed when
+  one of its child inodes is unloaded, so it can remove the raw pointer to
+  the child from its child entries map.
+
+
+Inode Lookup
+============
+
+Inodes may be looked up in one of two ways, either by name or by inode number.
+`TreeInode::getOrLoadChild()` is the API for doing Inode lookups by name,
+and `InodeMap::lookupInode()` is the API for doing Inode lookups by inode
+number.
+
+Either of these two APIs may have to create the Inode object. Alternatively,
+if the specified Inode already exists they will increment the reference count
+on the existing object and return it. It is possible the Inode is already
+present in the `InodeMap` but was previously unreferenced, so these APIs may
+increment the reference count from 0 to 1.
+
+Simultaneous Lookups
+--------------------
+
+The `InodeMap` class keeps track of all currently loaded Inodes as well as
+information about Inodes that have inode numbers allocated but are not loaded.
+For each unloaded inode, `InodeMap` records whether it is currently being loaded.
+This allows `InodeMap` to avoid starting two load attempts for the same inode.
+If a second lookup attempt occurs for an Inode already being loaded, `InodeMap`
+handles notifying both waiting callers when the single load attempt completes.
+
+
+Inode Unloading
+===============
+
+Inode unloading can be triggered by several events:
+
+## Inode reference count going to zero
+
+When the inode reference count drops to zero we have a chance to decide if we
+want to unload the inode or not.
+
+When shutting down the mount point we always destroy each Inode as soon as its
+reference count goes to zero.
+
+If the inode is unlinked and its FUSE reference count is also zero we also
+destroy the inode immediately.
+
+In other cases we generally leave the Inode object loaded, but it would be
+valid to decide to unload it based on other criteria. (For instance, we could
+decide to immediately unload unreferenced inodes if we are low on memory.)
+
+## FUSE reference count going to zero
+
+When the FUSE reference count goes to zero we should destroy the inode
+immediately if it is unlinked and its pointer reference count is also zero.
+
+To simplify synchronization, we currently collapse this case into the one
+above: we only decrement the FUSE reference count on a loaded Inode when we are
+holding a normal `InodePtr` reference to the Inode. Therefore we will always
+see the normal reference count drop to zero at some point after the FUSE
+reference count drops to zero, and we process the unload at that time.
+
+## On demand
+
+We will likely add a periodic background task to unload unreferenced inodes
+that have not been accessed in some time. This unload operation could also be
+triggered in response to other events (for instance, a Thrift call, or going
+over some memory usage limit).
+
+Synchronization and the Acquire Count
+-------------------------------------
+
+Synchronization of Inode loading and unloading is slightly tricky, particularly
+for unloading.
+
+### Loading
+
+When loading an inode, we always hold the `InodeMap` lock to check if the inode
+in question is already loaded or if a load is in progress. Once the inode is
+loaded we acquire its parent `TreeInode`'s `contents_` lock, then the
+`InodeMap` lock (in that order), so we can insert the inode into its parent's
+entry list and into the `InodeMap`'s list of loaded inodes.
+
+### Updating Reference Counts
+
+`InodePtr` itself does not hold any extra locks when performing reference
+count updates. The main Inode reference count is updated with atomic
+operations, but without any other locks held.
+
+However, there is one important item to note here: updates done via `InodePtr`
+copying can never increment the reference count from 0 to 1. The lookup APIs
+(`TreeInode::getOrLoadChild()` and `InodeMap::lookupInode()`) are the only two
+places that can ever increment the reference count from 0 to 1. Both of these
+lookup APIs hold a lock when potentially updating the reference count from 0 to
+1.
+
+`TreeInode::getOrLoadChild()` holds the parent `TreeInode`'s `contents_` lock,
+and `InodeMap::lookupInode()` holds the `InodeMap` lock. This means that if
+you hold both of these locks and you see that an Inode's reference count is
+currently 0, no other thread can acquire a reference to that Inode.
+
+### Preventing Multiple Unload Attempts
+
+Holding the parent `TreeInode`'s `contents_` lock and the `InodeMap` lock
+ensures that no other thread can acquire a new reference on an Inode, but that
+alone does not mean it is safe to destroy the inode. We still need to prevent
+multiple threads from trying to destroy the same Inode.
+
+For instance, suppose thread A destroys the last `InodePtr` to an Inode,
+dropping its reference count to 0. However, before thread A has a chance to
+grab the `TreeInode` and `InodeMap` locks and decide if it wants to unload the
+inode, thread B looks up the inode, increasing the reference count from 0 to 1,
+but then immediately destroys its `InodePtr`, dropping the reference count back
+to 0.
+
+In this situation thread A and thread B have both just dropped the reference
+count to 0. We need to make sure that only one of these two threads can try to
+destroy the inode.
+
+This is achieved through another counter, called the "acquire" counter.
+This counter is incremented each time the Inode reference count goes from 0 to
+1, and decremented each time the reference count goes from 1 to 0. However,
+unlike the main reference count, the acquire counter is only modified while
+holding some additional locks.
+
+Increments to the acquire counter are only done while holding either the
+parent `TreeInode`'s `contents_` lock (in the case of
+`TreeInode::getOrLoadChild()`) or the `InodeMap` lock (in the case of
+`InodeMap::lookupInode()`).
+
+Decrements to the acquire counter are only done while holding both the
+parent `TreeInode`'s `contents_` lock and the `InodeMap` lock.
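The acquire counter rules above can be illustrated with a small, single-threaded sketch that replays the thread A / thread B interleaving step by step. This is not the Eden implementation: `SketchInode`, `SketchLocks`, `lookup()`, `dropRef()`, `onZeroRef()` and `simulateRace()` are hypothetical names invented for this example, and only the by-name lookup path (which holds the `contents_` lock) is modeled.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for an inode; this is not the real InodeBase.
struct SketchInode {
  std::atomic<int> refCount{0}; // main count, updated without extra locks
  int acquireCount{0};          // only modified while holding the locks below
  int unloads{0};               // number of unloads performed (for the demo)
};

// Hypothetical stand-ins for the two locks named in the text.
struct SketchLocks {
  std::mutex contentsLock; // models the parent TreeInode's contents_ lock
  std::mutex inodeMapLock; // models the InodeMap lock
};

// Lookup path: the only place that may take the count from 0 to 1.
// It holds one lock, so the acquire counter increment is protected.
void lookup(SketchInode& inode, SketchLocks& locks) {
  std::lock_guard<std::mutex> guard(locks.contentsLock);
  if (inode.refCount.fetch_add(1) == 0) {
    ++inode.acquireCount;
  }
}

// Dropping a reference is a plain atomic decrement with no locks held.
// Returns true if this call took the count from 1 to 0.
bool dropRef(SketchInode& inode) {
  return inode.refCount.fetch_sub(1) == 1;
}

// Runs after a thread saw the count hit 0. Both locks are taken, in the
// contents_ then InodeMap order, before the acquire counter is decremented.
void onZeroRef(SketchInode& inode, SketchLocks& locks) {
  std::lock_guard<std::mutex> contents(locks.contentsLock);
  std::lock_guard<std::mutex> map(locks.inodeMapLock);
  assert(inode.acquireCount > 0);
  if (--inode.acquireCount == 0) {
    ++inode.unloads; // only the thread that reaches zero unloads
  }
}

// Replays the interleaving from the text and returns the unload count.
int simulateRace() {
  SketchInode inode;
  SketchLocks locks;
  lookup(inode, locks);                    // reference later dropped by A
  bool aSawZero = dropRef(inode);          // thread A: 1 -> 0, no locks yet
  lookup(inode, locks);                    // thread B: 0 -> 1
  bool bSawZero = dropRef(inode);          // thread B: 1 -> 0
  if (aSawZero) onZeroRef(inode, locks);   // acquire count 2 -> 1
  if (bSawZero) onZeroRef(inode, locks);   // acquire count 1 -> 0, unload
  return inode.unloads;
}
```

In this sketch both threads observe the main count reaching zero, but only the call that brings the acquire counter to zero performs the unload.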
+
+When thread A and thread B both see that the main reference count drops to 0,
+they both attempt to acquire both the `TreeInode` and `InodeMap` locks.
+Whichever thread acquires the locks first will see that the acquire count is
+non-zero (since both threads incremented it when bumping the main reference
+count from 0 to 1). This thread decrements the acquire count and does nothing
+else since the acquire count is still non-zero. The second thread can then
+acquire the locks, decrement the acquire count and see that it is now zero.
+This second thread can then perform the unload (while still holding both locks).
+
+EdenMount Destruction
+=====================
+
+All Inode objects store a pointer to the `EdenMount` that they are a part of.
+This means that the `EdenMount` itself cannot be destroyed until all of its
+Inodes are destroyed.
+
+We achieve this via the root `TreeInode`'s reference count. During normal
+operation the `EdenMount` holds a reference to the root `TreeInode`.
+(Technically the `InodeMap` holds the reference, but the `EdenMount` owns the
+`InodeMap`.) When the `EdenMount` needs to be destroyed, we release the
+reference on the root inode. When the root inode becomes unreferenced we
+know that all of its children have been destroyed, and it is now safe to
+destroy the `EdenMount` object itself.
+
+All of this is triggered through the `EdenMount::destroy()` function. This
+function marks the mount as shutting down, which causes the `InodeMap` to
+immediately unload any Inodes that become newly unreferenced. We then trigger
+an immediate unload scan to unload any Inodes that were already unreferenced.
+Once this is done we release the `InodeMap`'s reference on the root
+inode, allowing it to become unreferenced once all of its children are
+destroyed.
+
+
+FUSE Reference Counts
+=====================
+
+In addition to the reference count tracking how many `InodePtr` objects are
+currently referring to an inode, `InodeBase` also keeps track of how many
+outstanding references to this inode exist in the FUSE layer. (This is the number
+of `lookup()`/`create()`/`mkdir()`/`symlink()`/`link()` calls made for this
+inode, minus the number of times it was forgotten via `forget()`.)
+
+However, the FUSE reference count is not directly related to the Inode object
+lifetime.
+
+Inode objects may be unloaded even when the FUSE reference count is non-zero.
+In this case the `InodeMap` retains enough information to re-create the
+`Inode` object if the inode number is later looked up again by the FUSE API.
+
+The FUSE reference count is only adjusted while holding a normal `InodePtr`
+reference to the Inode.