2018-04-11 01:21:59 +03:00

# Inode Storage Overview

## How Durable is Eden?

We have some guiding principles that affect the design of Eden and its
durability properties.

We intend for Eden to reliably preserve user data if the Eden process aborts
or is killed. If the process dies, none of the user's data should be lost.
Eden crashing ought to be rare, but, especially while it's in development, it's
realistic to expect things to go wrong, including stray `killall edenfs`
commands.

However, we do not guarantee consistent data if a VM suddenly powers off
or if a disk fails. It is a substantial amount of work, and probably a
performance penalty, to be durable under those conditions.

Fortunately, thanks to commit cloud, the risk of losing days of work due to disk
failure or machine shutdown is low. While many engineer-hours will be spent
working in an Eden checkout, the amount of work that builds up prior to a commit
is hopefully bounded. (And perhaps someday we will automatically snapshot your
working copy!)

## Concepts

Git and Mercurial have abstract, hash-indexed tree data structures representing
a file hierarchy. (You'll find the corresponding code in `eden/fs/model`.)
Version control trees and files have a subset of the possible states that a real
filesystem can be in. For example, neither Git nor Mercurial versions a file's
user or group ownership, and the only versioned permission bit is
user-executable. Also, version control systems do not support hard links.

In a non-Eden, traditional version control system, checkout operations
immediately materialize that abstract tree data structure into actual
directories and files on disk. The downside, of course, is that checkout becomes
O(repo) in disk operations and the entire tree is physically allocated on disk.

What makes Eden useful is that it only fetches trees and blobs from version
control as the filesystem is explored. This makes checkout O(changes). But it
raises some questions about how to expose traditional filesystem concepts like
timestamps, permission bits, and inode numbers.

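To make the "fetch only as the filesystem is explored" idea concrete, here is a
minimal Python sketch. It is not Eden's implementation: `BACKING_TREES`,
`LazyTree`, and the hash strings are hypothetical names invented for this
illustration, and the backing store is just an in-memory dictionary.

```python
from typing import Dict, List, Optional

# Hypothetical backing store: tree hash -> {child name: child hash}.
BACKING_TREES: Dict[str, Dict[str, str]] = {
    "t-root": {"src": "t-src", "docs": "t-docs"},
    "t-src": {"main.cpp": "b-main"},
    "t-docs": {"README.md": "b-readme"},
}


class LazyTree:
    """A tree whose entries are fetched only when something looks at them."""

    def __init__(self, tree_hash: str, fetch_log: List[str]) -> None:
        self.tree_hash = tree_hash
        self._fetch_log = fetch_log
        self._entries: Optional[Dict[str, str]] = None  # not fetched yet
        self._children: Dict[str, "LazyTree"] = {}

    def entries(self) -> Dict[str, str]:
        # The first access fetches from the backing store; later ones reuse it.
        if self._entries is None:
            self._fetch_log.append(self.tree_hash)
            self._entries = BACKING_TREES[self.tree_hash]
        return self._entries

    def subtree(self, name: str) -> "LazyTree":
        if name not in self._children:
            self._children[name] = LazyTree(self.entries()[name], self._fetch_log)
        return self._children[name]


log: List[str] = []
root = LazyTree("t-root", log)       # "checkout": nothing is fetched yet
root.subtree("src").entries()        # exploring src fetches only root and src
```

After the last line, `log` contains the root and `src` tree hashes but not
`t-docs`, which is the sense in which the cost tracks what is explored rather
than the size of the repository.
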
## Inode States

As the filesystem is explored through FUSE, inodes are allocated to represent
the accessed source control trees and files. A given inode can then transition
between states as filesystem operations are performed on it.

### Metadata State Machine

Eden inodes transition between a series of states:

Once the parent tree has been loaded, the names, types, and hashes of its
children are known. At this point, questions like "does this entry exist?" or
"what is its hash?" can be answered, in addition to providing any metadata we
have from the backing version control system. (For example, Mononoke will
provide file sizes and SHA-1 hashes so Eden does not have to actually load the
files and compute them.)

To satisfy readdir() or stat() calls, however, we must give the entry an inode
number. Once an inode number has been allocated to an entry and handed out via
the filesystem, it must be remembered as long as programs can reasonably expect
it to be consistent (e.g. for the program's lifetime or until a qualifying
"anything could happen" operation like `hg checkout`). See the `#pragma once`
addendum below.

Inode metadata such as timestamps and permission bits, once accessed, should be
remembered as long as the inode numbers are. See the `make` addendum below. When
Eden forgets an inode number, the timestamps and permission bits are forgotten
too. Moreover, when the inode number is forgotten, the inode numbers of its
children must be forgotten.

There is only one type of inode metadata change that matters from the
perspective of version control: the user-executable bit on files. If that bit
changes, the file and all of its parents must be marked potentially-modified.
Other metadata changes are local-only and can be ignored by version control
operations.

At the risk of repeating myself, here are some other rules. If a source control
tree entry has an inode number, its parent must also have an inode number. If
an inode is marked potentially-modified, its parent must also be marked
potentially-modified. Why? Because Eden needs to be able to crawl from the root
tree and rapidly enumerate the potentially-modified set, even at process
startup.

During a checkout operation (or otherwise) we may determine that the contents of
a file or tree now match its unmodified state. If so, to reduce the size of
the tree Eden is tracking, it may dematerialize the tree (from the parents
down). Dematerialization must preserve inode numbers for any entries that may
currently be referenced by FUSE, but since checkout is an "anything could
happen" operation, inodes for other unmodified files could be forgotten.

For our own sanity, Eden should never hand out duplicate inode numbers.

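The inode number rules in this section (remember a number once handed out,
forget children along with their parent, never hand out a duplicate) can be
sketched roughly as follows. This is an illustrative Python model, not Eden's
code: `InodeNumberAllocator` and its methods are hypothetical names, and the
only outside fact assumed is that FUSE uses inode number 1 for the mount root.

```python
import itertools
from typing import Dict, Tuple

ROOT_INO = 1  # FUSE uses 1 for the root of the mount


class InodeNumberAllocator:
    """Hands out inode numbers that are never reused within one mount."""

    def __init__(self) -> None:
        # A monotonically increasing counter means no number is issued twice.
        self._next = itertools.count(ROOT_INO + 1)
        # (parent inode number, child name) -> remembered inode number
        self._remembered: Dict[Tuple[int, str], int] = {}

    def get_or_allocate(self, parent_ino: int, name: str) -> int:
        key = (parent_ino, name)
        if key not in self._remembered:
            self._remembered[key] = next(self._next)
        return self._remembered[key]

    def forget(self, parent_ino: int, name: str) -> None:
        """Forget an entry and, recursively, every remembered child under it."""
        ino = self._remembered.pop((parent_ino, name), None)
        if ino is None:
            return
        children = [key for key in self._remembered if key[0] == ino]
        for child_parent, child_name in children:
            self.forget(child_parent, child_name)
```

Repeated lookups of the same entry return the same number until it is
forgotten. After a `forget`, the entry gets a fresh number rather than its old
one, so a stale number can never alias a different file.
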
### Data State Machine

The previous section talks about inode numbers and inode metadata (e.g.
timestamps, user, group, and mode bits).

The other half of an inode is its data: the contents of a file (or symlink) and
the entries of a tree. (Note that it's possible for an inode's data to be
modified but metadata untouched or vice versa.)

When an entry's parent is loaded, the child's name, type, and hash are known,
and read operations can be satisfied by fetching a blob of data from the backing
store.

When an inode's data is modified, its state transitions to materialized.

If the modification results in a blank file (e.g. O_TRUNC), Eden doesn't even
need to wait for the blob to finish loading.

For other types of modifications, such as writes, data must be fetched from
source control before the modifications can be applied to it.

For fast status and diff operations, Eden needs to rapidly find all materialized
entries, so when an entry is materialized, its parent must be marked
materialized as well (all the way to the root).

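The "materialized all the way to the root" rule is what makes the status crawl
cheap. Below is a small Python sketch of that idea under simplifying
assumptions: `Node`, `mark_materialized`, and `materialized_paths` are
hypothetical names for illustration, and a real inode carries far more state.

```python
from typing import Dict, Iterator, Optional


class Node:
    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: Dict[str, "Node"] = {}
        self.materialized = False
        if parent is not None:
            parent.children[name] = self

    def path(self) -> str:
        if self.parent is None:
            return self.name
        return f"{self.parent.path()}/{self.name}"


def mark_materialized(node: Node) -> None:
    """Mark node and its ancestors materialized.

    The walk can stop at the first already-materialized ancestor, because the
    invariant guarantees everything above it is materialized too.
    """
    current: Optional[Node] = node
    while current is not None and not current.materialized:
        current.materialized = True
        current = current.parent


def materialized_paths(root: Node) -> Iterator[str]:
    """Yield materialized entries without descending into unmaterialized trees."""
    if not root.materialized:
        return
    yield root.path()
    for name in sorted(root.children):
        yield from materialized_paths(root.children[name])


root = Node("root")
src = Node("src", root)
docs = Node("docs", root)
main_cpp = Node("main.cpp", src)
readme = Node("README.md", docs)
mark_materialized(main_cpp)
```

Here `materialized_paths(root)` visits `root`, `root/src`, and
`root/src/main.cpp`, and skips the whole `docs` subtree after a single flag
check.
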
Note that the materialized state of an inode is independent of whether it has
been modified from the contents of the file in the current source control
commit. If a non-materialized file is renamed, it will still be non-materialized,
but it will be different from the current commit contents at its location.
Conversely, a file can be rewritten with contents that are identical to the
current source control state. The process of writing it will generally leave it
in a materialized state, even though it may be the same as the current source
control state at the end.

### What Does Materialized Mean?

[TODO: not sure where to put this section]

This document talks about an inode entering and leaving the 'materialized'
state. It's a bit of an unintuitive concept. If an inode is materialized,
it is potentially modified relative to its original source control object, as
indicated by its parent's entry's source control hash.

Note that being materialized is orthogonal to whether a file is considered
modified or not. If a file has been overwritten with its original contents, it
will be materialized (at least temporarily) but not show up as modified from the
perspective of version control. On the other hand, if a subtree has been
renamed (imagine root/foo -> root/bar), then everything inside the subtree will
not be materialized, but will show up as modified from a status or diff
operation.

If an inode is materialized, its parent must also be materialized. The
materialized status is used to rapidly determine which set of files is worth
looking at when performing a status or diff operation.

## Concrete Storage

How is all of this represented inside Eden and how do state transitions meet our
durability goals above?

### InodeMap

The InodeMap keeps track of loaded inodes and inodes that FUSE still has a
reference to.

Note that the term "loaded" is used ambiguously in Eden. When talking about
whether an inode is loaded, it means that the InodeMap has in-memory data
tracking its state. On the other hand, a FileInode can have loaded its backing
blob or not.

(TODO: should we rename InodeMap's "loaded" and "unloaded" terminology to
"known" and "remembered"?)

#### loadedInodes_

Inode tree nodes currently loaded in memory.

* For files, that includes their hashes, blob loading state, file handles
  into the overlay, timestamps, and permission bits.
* For trees, that includes tree hashes, entries, and timestamps.
* For both, the entry type, FUSE reference count, internal reference count,
  and location.
* If a child is in loadedInodes_, its parent must be in loadedInodes_ too.

#### unloadedInodes_

In-memory map from inode number to remembered inode state. When an inode is
unloaded, if it has a nonzero FUSE reference count, it is registered into this
table, which contains:

* its FUSE refcount
* its hash (if not materialized)
* its permission bits
* parent inode number and child name (if not unlinked)

If a child is in unloadedInodes_, its parent must be in unloadedInodes_ too.

An inode cannot be in both loadedInodes_ and unloadedInodes_ at the same time.

If an inode has a nonzero FUSE reference count, it should exist in either
loadedInodes_ or unloadedInodes_.

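As a rough illustration of the table contents and two of the invariants above,
here is a Python sketch. The field and function names (`UnloadedInode`,
`check_invariants`, and so on) are invented for this example and do not mirror
Eden's C++ types; the loaded set is reduced to a plain set of inode numbers.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class UnloadedInode:
    fuse_refcount: int
    permissions: int
    tree_hash: Optional[str] = None   # None means the inode is materialized
    parent_ino: Optional[int] = None  # None (with name) means unlinked
    name: Optional[str] = None


def check_invariants(
    loaded: Set[int],
    unloaded: Dict[int, UnloadedInode],
    fuse_refcounts: Dict[int, int],
) -> None:
    # An inode cannot be in both maps at the same time.
    both = loaded.intersection(unloaded)
    assert not both, f"inodes in both maps: {sorted(both)}"
    # Any inode FUSE still references must be tracked in one of the maps.
    for ino, count in fuse_refcounts.items():
        if count > 0:
            assert ino in loaded or ino in unloaded, f"inode {ino} is untracked"
```

A check like this is the kind of thing one might run in a debug build or a
test after each state transition; it only encodes the disjointness and
"referenced implies tracked" rules, not the parent/child rules.
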
#### Overlay

The Overlay is an on-disk map from inode number to its timestamps plus the
file's or tree's contents.

If a tree's child entry does not have a hash (that is, it's marked as
materialized), then data for that inode must be in the overlay. Because of this
invariant, we must write the child's overlay data prior to setting it
materialized in the parent. When dematerializing, we must mark the child as
dematerialized in the parent before deleting the child's overlay data, in case
the process crashes in between those two operations.

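The ordering argument can be checked with a toy model. In the Python sketch
below, `overlay` and `parent_entries` are hypothetical in-memory stand-ins for
the on-disk Overlay and a parent tree's entries; the point is only that the
invariant holds between the two steps of each operation, which is where a
crash could land.

```python
from typing import Dict, Optional

overlay: Dict[int, bytes] = {}  # stand-in for the on-disk Overlay
# Parent's entries: child inode number -> source control hash (None = materialized)
parent_entries: Dict[int, Optional[str]] = {7: "hash-of-original"}


def invariant_holds() -> bool:
    """Every entry without a hash must have data in the overlay."""
    return all(h is not None or ino in overlay for ino, h in parent_entries.items())


def materialize(ino: int, data: bytes) -> None:
    overlay[ino] = data              # step 1: write the child's overlay data
    assert invariant_holds()         # a crash here only leaves unused overlay data
    parent_entries[ino] = None       # step 2: mark it materialized in the parent


def dematerialize(ino: int, tree_hash: str) -> None:
    parent_entries[ino] = tree_hash  # step 1: restore the hash in the parent
    assert invariant_holds()         # a crash here only leaves unused overlay data
    del overlay[ino]                 # step 2: delete the child's overlay data
```

Reversing either pair of steps would open a window in which the parent claims
the child is materialized while the overlay has no data for it.
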
### InodeMap State Transitions

[This section may be incomplete.]

Unknown ⟶ Loading:
- (First, load parent.)
- If parent has this entry marked materialized, load child from overlay and
  immediately transition to loaded. Otherwise...
- Insert entry in unloadedInodes_
- Begin fetching object from ObjectStore

Loading ⟶ Loaded:
- If this is a tree, when the load completes, check the overlay.
  - The overlay might have some remembered inode numbers.
  - TODO: if Eden crashed while materializing up a tree, that state needs to be
    corrected or dropped here.
- Construct Inode type
- Remove from unloadedInodes_ and insert into loadedInodes_

Loaded ⟶ Unloaded:
- If the mount is being unmounted
  - If unlinked, remove it from the overlay (it can never be accessed again)
  - Otherwise, update metadata in Overlay
- Otherwise (we probably need to remember the inode number)
  - If unlinked, remove it from the overlay
  - Otherwise,
    - If fuseCount is nonzero, insert inode in unloadedInodes_
    - If inode is a tree and any of its children are in unloadedInodes_,
      insert inode in unloadedInodes_
    - Otherwise... forget everything about the inode.

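Read as a decision procedure, the Loaded ⟶ Unloaded list above might look like
the following Python sketch. `UnloadAction` and `unload_action` are
hypothetical names, and the function only returns which branch applies; it
does not model the actual overlay or map updates.

```python
from enum import Enum, auto


class UnloadAction(Enum):
    REMOVE_FROM_OVERLAY = auto()
    UPDATE_OVERLAY_METADATA = auto()
    REMEMBER_IN_UNLOADED = auto()
    FORGET = auto()


def unload_action(
    *,
    unmounting: bool,
    unlinked: bool,
    fuse_count: int,
    is_tree: bool,
    has_unloaded_children: bool,
) -> UnloadAction:
    if unlinked:
        # Both the unmounting and non-unmounting branches remove unlinked inodes.
        return UnloadAction.REMOVE_FROM_OVERLAY
    if unmounting:
        return UnloadAction.UPDATE_OVERLAY_METADATA
    if fuse_count > 0 or (is_tree and has_unloaded_children):
        return UnloadAction.REMEMBER_IN_UNLOADED
    return UnloadAction.FORGET
```

Writing it this way makes one property of the list easy to see: the unlinked
check gives the same answer regardless of whether the mount is going away.
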
### TreeInode State Machine

TreeInode can only make two state transitions:

Unmaterialized ⟶ Materialized:
- When a tree is modified, it is marked materialized (recursively up the tree)
- Its contents are written to the Overlay

Materialized ⟶ Unmaterialized:
- When Eden notices the entries match the backing source control Tree, and it
  has no materialized children, it is marked dematerialized.
- Note that the Tree's parent must be updated prior to removing the child's
  overlay data.

### FileInode State Machine

FileInode's transitions are relatively isolated and uninteresting. See the
comments in FileInode.h for details, but I'll enumerate the currently legal
transitions here.

- not loaded ⟶ loading
- not loaded ⟶ materialized (O_TRUNC)
- loading ⟶ loaded
- loading ⟶ materialized (O_TRUNC)
- loaded ⟶ materialized

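The list of legal transitions above is small enough to write down as a table.
The Python sketch below does that; `FileState`, `LEGAL_TRANSITIONS`, and
`transition` are names made up for this illustration and are not taken from
FileInode.h.

```python
from enum import Enum


class FileState(Enum):
    NOT_LOADED = "not loaded"
    LOADING = "loading"
    LOADED = "loaded"
    MATERIALIZED = "materialized"


LEGAL_TRANSITIONS = {
    (FileState.NOT_LOADED, FileState.LOADING),
    (FileState.NOT_LOADED, FileState.MATERIALIZED),  # O_TRUNC
    (FileState.LOADING, FileState.LOADED),
    (FileState.LOADING, FileState.MATERIALIZED),     # O_TRUNC
    (FileState.LOADED, FileState.MATERIALIZED),
}


def transition(current: FileState, target: FileState) -> FileState:
    """Return the new state, or raise if the move is not in the table."""
    if (current, target) not in LEGAL_TRANSITIONS:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Note that nothing in the table leaves the materialized state, which matches
the dematerialization TODO that follows.
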
[TODO: dematerialization]

## Addenda

### atime

It is very hard and probably not useful for Eden to try to accurately maintain
last-access times for files. In fact, FUSE does not really try:

https://sourceforge.net/p/fuse/mailman/message/34448996/

### #pragma once

On a previous version of Eden, I saw some intermittent build failures that
looked like this:

```
rocksdb/src/db/memtable_list.h:40:7: error: redefinition of 'class rocksdb::MemTableListVersion'
rocksdb/src/db/memtable_list.h:40:7: error: previous definition of 'class rocksdb::MemTableListVersion'
```

The issue was that Eden would occasionally allocate a new inode number for a
nonmaterialized file, and `#pragma once` relies on consistent inode numbers to
avoid including the same file twice. Previously, we had some open questions
about whether Eden really did need to provide 100% consistent inode numbers for
nonmaterialized files, but it seems the answer is yes, at least while the mount
is up (including graceful takeover).

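To see why an unstable inode number defeats `#pragma once`, consider a
preprocessor that identifies files by their `(st_dev, st_ino)` pair. That
identification scheme is an assumption made for this sketch; real compilers
differ in the details, and `include_once` is a hypothetical helper, not any
compiler's API.

```python
import os
import tempfile
from typing import List, Set, Tuple

FileId = Tuple[int, int]  # (st_dev, st_ino)


def include_once(path: str, seen: Set[FileId], included: List[str]) -> None:
    """Process path unless a file with the same identity was already included."""
    st = os.stat(path)
    file_id = (st.st_dev, st.st_ino)
    if file_id in seen:
        return  # same identity as an earlier include: skip it
    seen.add(file_id)
    included.append(path)


with tempfile.TemporaryDirectory() as tmp:
    header = os.path.join(tmp, "memtable_list.h")
    with open(header, "w") as f:
        f.write("#pragma once\n")
    seen: Set[FileId] = set()
    included: List[str] = []
    include_once(header, seen, included)
    include_once(header, seen, included)  # skipped: identity is unchanged
```

On an ordinary filesystem the second call is a no-op. If the filesystem
reported a different `st_ino` on the second `stat`, the identity check would
miss and the header would be processed twice, which is the redefinition error
shown above.
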
### make

Make uses the filesystem to remember whether to rebuild a target. It does so
by comparing the mtime of the target with its dependencies. If the target is
newer than all of its dependencies, it is not rebuilt.

For Eden to avoid spurious rebuilds with make projects, it must strive to
remember mtimes allocated to unmodified files (and thus presumably the
unmodified file's inode number). If checking out from unmodified tree A to
tree B forgets a directory's inode number and the inode numbers of its
children, the mtimes allocated to the source files could appear to advance,
causing spurious rebuilds.
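
The mtime comparison described above can be reduced to a few lines. This is a
simplified Python model of the rule, not make's actual implementation;
`needs_rebuild` is a hypothetical helper and mtimes are plain numbers.

```python
from typing import Iterable, Optional


def needs_rebuild(target_mtime: Optional[float], dep_mtimes: Iterable[float]) -> bool:
    """Rebuild if the target is missing or any dependency is newer than it."""
    if target_mtime is None:
        return True
    return any(dep > target_mtime for dep in dep_mtimes)


# A source file last given mtime 90 and an object file built at time 100.
up_to_date = needs_rebuild(100.0, [90.0])
# The same source after its inode was forgotten and given a fresh mtime of 150.
after_forget = needs_rebuild(100.0, [150.0])
```

In the second call the file contents are unchanged, yet the target looks stale
purely because the mtime was reallocated, which is the spurious rebuild the
paragraph above warns about.
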