Summary:
Some linkers (lld being one) use fallocate() or posix_fallocate() on
the output file before writing its contents. EdenFS would return
ENOSYS or ENOTSUP so glibc would fall back and write a single byte to
every 512 byte block, which is terribly slow and generates a bunch of
fake traffic in the Watchman journal.
This diff implements basic support for FUSE_FALLOCATE, avoiding this
slow emulation.
Reviewed By: xavierd
Differential Revision: D25934694
fbshipit-source-id: c6c90ea2b517d4dbedce29d9a4340870c8c177c3
Summary: Something is wrong with this which causes Unity to freak out.
Reviewed By: fanzeyi
Differential Revision: D25453230
fbshipit-source-id: 89f61fd97817403fa65071ddac022a226b775e53
Summary:
A while back, we saw that concurrent directory creation would lead to EdenFS
being confused and failing to record some of the created directories. This then
caused EdenFS to no longer being in sync with what was on disk. To handle this
case, we've had to manually creating these directories recursively.
What I didn't realize at the time was that these concurrent notifications could
also happen on removal this time, and if a directory removal notification wins
the race against the removal of its last children, that directory wouldn't be
removed and EdenFS would once again be confused about the state of the
repository.
Fixing this is a bit trickier than directory creation as it's more racier.
Consider a directory that is being removed, and then immediately recreated with
a file in it in a different process. The naive approach of simply force
removing all of the children of a directory when handling the removal
notification would clash with the file creation. We could argue that nobody
should be doing this, but there would be an unhandled race, and thus a bug
where data would potentially be lost[0].
We can however fix this bug slightly differently. For file/directory removal,
we can actually hook onto the pre-callback, ie: one that happens before the
file/directory is no longer visible on disk. This inherently eliminate the race
altogether as the callback will be guaranteed to run when none of its children
are present, and if a race happens with a file creation in it, we can simply
fail the removal properly.
The only tricky bit is for the renaming logic, as renaming a file is logically
a removal followed by a creation. For that reason, I've moved part of the
renaming bits to the pre-callback too.
In theory, this change may negatively affect workloads that do concurrent
directory removal as the duration during which a file/directory is visible
ondisk now includes the EdenFS callback while it didn't before. Such workflows
should be fairly rare and/or redirected to avoid EdenFS altogether if
performance matters.
[0]: This left-over file that EdenFS wouldn't be aware of would also later
cause the checkout code to fail due to invalidation failures triggered when
trying to invalidate that directory. This would be fairly hard to debug.
Reviewed By: fanzeyi
Differential Revision: D25112381
fbshipit-source-id: 9300499ce872ad93d0a687f0e61b7e2a9caf9556
Summary:
As of right now, opendir is the expensive callbacks due to fetching the sizes
for all the files in a directory. This strategy however breaks down when
timeouts are added as very large directories would trigger the timeout, not
because EdenFS is having a hard time reaching Mononoke, but because of
bandwidth limitation.
To avoid this issue, we need to have a per-file timeout and thus makes opendir
just trigger the futures, but not wait on them. The waiting bit will happen
at readdir time which will also enable having a timeout per file.
The one drawback to this is that the ObjectFetchContext that is passed in by
opendir cannot live long enough to be used in the size future. For now, we can
use a null context, but a proper context will need to be passed in, in the
future.
Reviewed By: wez
Differential Revision: D24895089
fbshipit-source-id: e10ceae2f7c49b4a006b15a34f85d06a2863ae3a
Summary:
It would be nice to see if there was a fsck on startup, the duration of the fsck, and if it was able to repair all of the problems or not. This diff adds external logging for fsck runs on daemon start.
duration: how long the fsck took
success: false if was not able to repair errors, true if repaired all errors or didn't have to repair at all
attempted_repair: true if we found problems, false otherwise
Reviewed By: chadaustin
Differential Revision: D24774065
fbshipit-source-id: 2fa911652abec889299c74aaa2d53718ed3b4f92
Summary:
The EdenFS codebase uses folly/logging/xlog to log, but we were still relying
on glog for the various CHECK macros. Since xlog also contains equivalent CHECK
macros, let's just rely on them instead.
This is mostly codemodded + arc lint + various fixes to get it compile.
Reviewed By: chadaustin
Differential Revision: D24871174
fbshipit-source-id: 4d2a691df235d6dbd0fbd8f7c19d5a956e86b31c
Summary: Fix some build warnings that show up on macOS.
Reviewed By: wez
Differential Revision: D24436572
fbshipit-source-id: bd300110df79fe9d259c8f5ce2b1085cb85a0cac
Summary: Add a reliable, lightweight TraceBus class for publishing events to a background thread. Subscribers can be registered for observing events or computing telemetry about them.
Reviewed By: wez
Differential Revision: D23404525
fbshipit-source-id: 3539466421b0821ffb918ea862168d3cccd19b15
Summary:
Similarly to the other callbacks, this makes the main function return to
ProjectedFS as soon as the future is created which will allow for it to be
interrupted in a subsequent diff.
Reviewed By: fanzeyi
Differential Revision: D23745754
fbshipit-source-id: 2d77d0eacfe0d37eb9075bf9f0660e4f4af77e8f
Summary: Similarly to the other one, this will make it possible to interrupt.
Reviewed By: fanzeyi
Differential Revision: D23643100
fbshipit-source-id: 0daab1cec94d0e177bb707d97bf928b05d5d24a3
Summary: Similarly to the other callback, this will make it possible to interrupt.
Reviewed By: fanzeyi
Differential Revision: D23643101
fbshipit-source-id: 9f9a48e752a850c63255b8867b980163cb6a92c9
Summary:
The opendir callback tend to be the most expensive of all due to having to
fetch the content of all the files. This leads to some frustrating UX as the
`ls` operation cannot be interrupted. By making this asynchronous, the slow
operation can be interrupted. The future isn't cancelled and thus it will
continue to fetch in the background, this will be tackled in a future diff.
Reviewed By: fanzeyi
Differential Revision: D23630462
fbshipit-source-id: f1c4a9fbd9daa18ca4b8f4837c5241a37ccfbcf9
Summary:
Now that all the pieces are in place, we can plumb the request context in. For
now, this adds it to only one callback as I figure out more about it and tweak
it until I have something satisfactory. There are some rough edges with it that
I'm not entirely happy about, but as I change the notification callback to be
more async, I'm hoping to make more convenient to use and less clanky.
Reviewed By: fanzeyi
Differential Revision: D23505508
fbshipit-source-id: d5f12e22a8f67dfa061b8ad82ea718582c323b45
Summary:
This helps make RequestData slightly more generic by depending less on Fuse
specific types/code.
Reviewed By: chadaustin
Differential Revision: D23467487
fbshipit-source-id: 830f8269e2114c2968dcc49d3b5574e687191e4d
Summary:
This commit introduces a new process spawning class derived
from the ChildProcess class in the watchman codebase.
`SpawnedProcess` is similar to folly::Subprocess but is designed around the
idea that we will use a system provided spawning API to start a process, rather
than assuming the use of `fork`.
`fork` is to be avoided because it can be expensive for processes with large
address spaces and also because it interacts poorly with threads on macOS. In
particular, we see the objC runtime terminating our process in some scenarios
where fork and threads are mixed.
There are some important differences from `folly::Subprocess` and that means
that some assumptions and uses need to be altered slightly from their prior
workings. For example, detaching a SpawnedProcess moves the responsibility of
waiting on the child to a periodic task as there is no way to detach via
posix_spawn without also using fork.
On the plus side, this commit allows unifying spawning between posix and
windows systems, which simplifies the code!
Reviewed By: xavierd
Differential Revision: D23287763
fbshipit-source-id: b662af1d7eaaa9ed445c42f6c5765ae9af975eea
Summary:
Having the type of data fetched can help in debugging where these fetches are
comming from. In the currently logs figuring out if a data fetch is blob or
tree requires some manual work. When looking at a big bunch of fetches this is
not super practical.
So this includes this info in our logging.
Reviewed By: chadaustin
Differential Revision: D23243444
fbshipit-source-id: 9abe5180c5d2afc0d02b27ba6a6b76401e86556e
Summary:
With Mercurial now supporting CMD_CAT_TREE for efficiently fetching and reading
trees, we can plumb this onto EdenFS. At startup time, we detect whether
Mercurial supports CMD_CAT_TREE and use that method, otherwise, we fallback to
the old CMD_FETCH_TREE.
Reviewed By: wez
Differential Revision: D23044953
fbshipit-source-id: 9aea5c5b82e97039a75ef18976a155dcb6e150bc
Summary:
As opposed to FUSE, ProjectedFS sends notifications for file/directory creation
after the fact, and for directory that means these will be visible on disk before
EdenFS may be aware of it. While EdenFS usually process it quickly, a heavily
multi-threaded application that tries to concurrently create a directory
hierarchy may end up sending notifications to EdenFS in a somewhat out of order
fashion.
Since this should be a very rare occurence, we make this a very slow path by
being optimistic and calling `getInode` first, and then only if that fails, we
aggressively create all the parent directories. During a buck build of ~1k
jobs, this happened only 3 times.
If we fully think this through, this change doesn't fully fix the race, as a
similar race can now happen when a create and remove/rename operations are
concurrent. However, a client performing these operations concurrently is
either aware that this is racy and should handle these properly, or is most
likely buggy. Both of these should significantly reduce the likelyhod of this
happening, thus, I'm leaving this unfixed for now.
To better understand how frequently this happens, I've added a stat counter.
For now, these aren't published to ODS, but this will be tackled later.
Reviewed By: wez
Differential Revision: D22783484
fbshipit-source-id: ea3aafc2f77b65d3967f697f68114921d5909137
Summary: This diff updated `ObjectStore` to send a `FetchHeavy` event to Scuba when the number of fetching requests of a process has reached 2000.
Reviewed By: fanzeyi
Differential Revision: D22292104
fbshipit-source-id: b7ac48412868216ea960c8681a5fb71c587952bc
Summary:
Both optional and pid_t weren't found and the right includes needed to be
provided. On Windows, the ProcessNameCache isn't compiled (yet), and since it
looks like the process name is optional in the BackingStoreLogger, let's not
provide it for now.
Reviewed By: fanzeyi
Differential Revision: D22215581
fbshipit-source-id: 31a7e7be62cd3d14108dc437d3dfabfb9e62f8d5
Summary:
We have seen that eden will unexpectedly fetch data, we want to know why.
This adds the plumbing to interact with edens current logging to be able to
log when eden fetches data from the server and what caused eden to do this
fetch. Later changes will use the classes created here to log the cause of data
fetches.
Reviewed By: chadaustin
Differential Revision: D22051013
fbshipit-source-id: 27d377d7057e66f3e7a304cd7004f8aa44f8ba62
Summary:
Google Benchmark is easier to use, has more built-in functionality,
and more accurate default behavior than Folly Benchmark, so switch
EdenFS to use it.;
Reviewed By: simpkins
Differential Revision: D20273672
fbshipit-source-id: c90c49878592620a83d2821ed4bc75c20e599a75
Summary:
This sets up the counters that will allow us to expose the number
of pending FUSE requests in Eden top.
As D20846826 mentions adding metrics for FUSE request gives
visibility into fuse requests and overall health of eden.
This provides more insight beyond the metrics for live FUSE requests
since it shows the kernels view of FUSE requests. Looking at the difference
between the number of pending and live request, can identify issues
that arise at the interface between eden and FUSE and monitor how
quickly fuse workers are processing requests.
**note**: this is only for linux since macos has no equivalent to
`/sys/fs/fuse/connections`
Reviewed By: chadaustin
Differential Revision: D21074489
fbshipit-source-id: c0951f0dfd4fa764be28d8686d08cd0dd807db37
Summary:
Tl;DR Reduces costs of fuse request mertics by reducing lock contention.
D20922194 adds tracking for FUSE request metrics, this makes tracking those
metrics more efficient. Since every user request goes through the FUSE channel,
we want to reduce the cost of these metrics as much as possible (originally
mentioned in a comment D20922194).
The synchronization used to track the metrics is costly especially
when the lock is contended.
To reduce the cost, each FuseChannel will have a thread local copy of metrics.
Each will still be synchronized to allow for reading the metrics and for
Requests moved to other threads that will need to access the metrics. However,
the lock should be contended less often since only requests from a single fuse
channel thread will access it.
Reviewed By: chadaustin
Differential Revision: D21043792
fbshipit-source-id: ce58a0cbce334095976233bfac7578d39c81bb55
Summary:
This exposes metrics for the live FUSE requests (the duration
of the longest outstanding request and the number of outstanding
requests).
Because FUSE is the interface through which the user mostly interacts
with the file system they provide good metrics to judge if the perfomance
of eden is normal, or there may be an issue.
Exposing these counters this way will send them to ods, so it will not only
allow for debuging current issues, but can be used to look back at previous
problems. This data could also be used for alerting or more proactive
remediation.
Metrics are exposed per checkout to allow seeing which checkout was
having issues. This data will aggregated in `eden top` to be used as
an overall health indicator, but should more information be needed it
will be logged in ods.
Reviewed By: chadaustin
Differential Revision: D20922194
fbshipit-source-id: 16208883417acb77b62bf712cfdd9068c5420303
Summary:
This sets up the counters that will allow us to expose metrics for
live FUSE requests in Eden top.
This will allow us to track the number of live FUSE requests and the duration
of the longest FUSE request. If the duration of the longest FUSE request has
become very long, the number of FUSE requests is stuck at some value or
increasingly growing these can indicate there is a problem with the Fuse Channel
or something effecting this.
This provides more insight beyond the metrics for imports since many of the
FUSE requests will not cause imports, so this monitors more functionality.
Further because all user interaction with the filesystem goes through FUSE,
these metrics will give a good overall indication weather Eden is performing
normally.
Reviewed By: chadaustin
Differential Revision: D20846826
fbshipit-source-id: 34d834d193833a3c91d95fbb87361ddd863f64ca
Summary:
As mentioned in D20629833, adding metrics for live
imports in `eden top` gives more transparency to the
imports process and makes identifying import related
issues easier. This is set up to expose metrics for live
imports like those for pending imports in `eden top`.
Similar to D20611728 exposing this via these counters
will log this data. Having this data persisted will allow
tracking the performance of imports, and does the set
up for more pro-active fixing of issues. Further we can
look back to see issues that are no longer occurring, but
still of interest.
This also refactors the registration code so that it requires
no copy pasting to add a new counter. Avoiding copy paste
errors when adding more counters and making it easier to
maintain.
Reviewed By: chadaustin
Differential Revision: D20630813
fbshipit-source-id: 8a7a2a0135c7b7a5cde960b84dcb434c6c99eaeb
Summary:
Previously `eden top` shows metrics for imports that are queued
and fetching data (though there is a bug here fixed above), we
want to provide more granularity to help with debugging:
This explicitly separates "current" and "pending" imports:
- "live": The import is live, ie it is currently importing data
- having this metric allows debugging the case that there
is a problem fetching data
- "pending": The import is queued, live, or getting data from the
cache
- having this metric allows debugging the case that there is a
problem initiating the request, for example a request is being
starved on the queue
**note**: I moved the watches closer to the import function call
instead of renaming the currently broken pending import watches
to make it more clear in the code what these are suppose to be
timing
Reviewed By: chadaustin
Differential Revision: D20629833
fbshipit-source-id: 84ef75057149a648a51418a5cc93be87e3b3d6b5
Summary: - added logging only around the import tree call to capture non-queue related wait time
Reviewed By: chadaustin, fanzeyi
Differential Revision: D20207472
fbshipit-source-id: d88bb34ce224a26ff2be100d7789ddeff608006d
Summary:
- added logging only around the import blob call to capture non-queue related wait time
- added to `test_reading_file_gets_file_from_hg` in `integration.stats_test.HgBackingStoreStatsTest` to test import blob logging in addition to the get blob loging
(not yet done for importing trees, will do in next diff)
Reviewed By: chadaustin
Differential Revision: D20201215
fbshipit-source-id: c89281fe7d3d6e89d111ac8cce9014adff44ac40
Summary: records if a start was successful or not
Reviewed By: simpkins
Differential Revision: D19817810
fbshipit-source-id: b67253099781bb534b7e2fb26a09ba41c1f0bd69
Summary: logs information about daemon shutdowns, including duration, if it was a graceful restart or not, and if it was successful or not.
Reviewed By: fanzeyi
Differential Revision: D19817811
fbshipit-source-id: a851de8872f98952d046fb148a27fbf9f05b2212
Summary: In addition to duration and success, log object fetch counts and checkout type to Scuba.
Reviewed By: fanzeyi
Differential Revision: D19334276
fbshipit-source-id: dabf52427f2ebda2b58df93194df39d52f4fcb4f
Summary:
Multiple times I've gotten mixed up by the different casing of the
LogEvent fields versus the names in the structured log. Change the
field names to match the log names.
Reviewed By: genevievehelsel
Differential Revision: D19585307
fbshipit-source-id: 3ccbb9ec5e18155367bc85e9db00bc8d556812c4
Summary:
I have a suspicion that most edenfs starts are unclean. To test that
hypothesis, add a "mount" structured event type that records whether
the mount was cleanly shut down, how long mounting took, and whether
it's a graceful restart.
Reviewed By: genevievehelsel
Differential Revision: D19585161
fbshipit-source-id: 16eedfeb6181742e64b45bca1ed3cce3edb663cb
Summary: scuba logging for mismatched parent commits, it would be nice to have a count on how often this is encountered so we can pinpoint spikes of this case. I am not sure how helpful having the parent commits would be in the event, my thought there was that it could help see how far away the commits are and if one of the commtis doesn't exist (like in the case of stripping commtis with a cancelled hg split), but I am open to having an empty event as well if storing these strings is too expensive.
Reviewed By: fanzeyi
Differential Revision: D19432378
fbshipit-source-id: fa3e7cd6aef832c4f918f339b781590b7484cfdc
Summary:
A spike in automatic GCs usually implies something has gone wrong. Log
an event for each one, recording the cache size prior to the GC and
the cache size after.
Reviewed By: simpkins
Differential Revision: D18902580
fbshipit-source-id: 158b2635733a415a9fcc7c412b2c0f44ed04aa01
Summary:
Allocate and hook up a StructuredLogger at startup if the EdenConfig
has the appropriate values set.
Reviewed By: simpkins
Differential Revision: D18071821
fbshipit-source-id: 3996b6644bbf0c16bb3b9950d57a79d4619a1656
Summary:
Introduce a framework that allows recording structured log events
which are encoded as JSON and piped to a configured command line
program.
Reviewed By: pkaush
Differential Revision: D18025183
fbshipit-source-id: ab6b4d510a905a30252f2cff85d107a0d32d149e
Summary: Formatting had diverged in a few places. Fix that up.
Reviewed By: fanzeyi
Differential Revision: D18123219
fbshipit-source-id: 832cdd70789642f665a029196998928a9173be81
Summary: Introduce a new ScribeLogger class that spawns and maintains a process. Log messages are newline-delimited and written to the process's stdin. If the process stops responding or responds too slowly, log messages are dropped.
Reviewed By: pkaush
Differential Revision: D17777215
fbshipit-source-id: c998d10c73fc103122d69ae19c5d84f58b7939d2
Summary: Tracing was not an accurate name for what this directory had become. So rename it to telemetry.
Reviewed By: wez
Differential Revision: D17923303
fbshipit-source-id: fca07e8447d9b9b3ea5d860809a2d377e3c4f9f2