// sapling/eden/fs/journal/JournalDelta.h

/*
* Copyright (c) Facebook, Inc. and its affiliates.
*
* This software may be used and distributed according to the terms of the
* GNU General Public License version 2.
*/
#pragma once
#include <chrono>
#include <cstdint>
#include <type_traits>
#include <unordered_map>
#include <unordered_set>
#include <variant>
#include "eden/fs/model/Hash.h"
#include "eden/fs/utils/PathFuncs.h"
namespace facebook {
namespace eden {
struct PathChangeInfo {
PathChangeInfo() : existedBefore{false}, existedAfter{false} {}
PathChangeInfo(bool before, bool after)
: existedBefore{before}, existedAfter{after} {}
bool isNew() const {
return !existedBefore && existedAfter;
}
bool operator==(const PathChangeInfo& other) const {
return existedBefore == other.existedBefore &&
existedAfter == other.existedAfter;
}
bool operator!=(const PathChangeInfo& other) const {
return !(*this == other);
}
/// Whether this path existed at the start of this delta.
bool existedBefore : 1;
/**
* Whether this path existed at the end of this delta.
* If existedAfter && !existedBefore, then the file can be considered new in
* this delta.
*/
bool existedAfter : 1;
// TODO: It may make sense to maintain an existenceChanged bit to distinguish
// between a file being changed and it being removed and added in the same
// delta.
};
class JournalDelta {
public:
using SequenceNumber = uint64_t;
/** The ID of this Delta in the Journal */
JournalDelta::SequenceNumber sequenceID;
/** The time at which the change was recorded. */
std::chrono::steady_clock::time_point time;
};
/** A delta that stores information about changed files */
class FileChangeJournalDelta : public JournalDelta {
public:
enum Created { CREATED };
enum Removed { REMOVED };
enum Changed { CHANGED };
enum Renamed { RENAMED };
enum Replaced { REPLACED };
FileChangeJournalDelta() = default;
FileChangeJournalDelta(FileChangeJournalDelta&&) = default;
FileChangeJournalDelta& operator=(FileChangeJournalDelta&&) = default;
FileChangeJournalDelta(const FileChangeJournalDelta&) = delete;
FileChangeJournalDelta& operator=(const FileChangeJournalDelta&) = delete;
FileChangeJournalDelta(RelativePathPiece fileName, Created);
FileChangeJournalDelta(RelativePathPiece fileName, Removed);
FileChangeJournalDelta(RelativePathPiece fileName, Changed);
/**
* "Renamed" means that newName was created as a result of the mv(1).
*/
FileChangeJournalDelta(
RelativePathPiece oldName,
RelativePathPiece newName,
Renamed);
/**
* "Replaced" means that newName was overwritten by oldName as a result
* of the mv(1).
*/
FileChangeJournalDelta(
RelativePathPiece oldName,
RelativePathPiece newName,
Replaced);
RelativePath path1;
RelativePath path2;
PathChangeInfo info1;
PathChangeInfo info2;
/** Which of these paths actually contain information */
bool isPath1Valid = false;
bool isPath2Valid = false;
std::unordered_map<RelativePath, PathChangeInfo> getChangedFilesInOverlay()
const;
/** Checks whether this delta is a modification */
bool isModification() const;
/** Checks whether this delta and `other` represent the same action,
* disregarding time and sequenceID. */
bool isSameAction(const FileChangeJournalDelta& other) const;
/** Get memory used (in bytes) by this Delta */
size_t estimateMemoryUsage() const;
};
/** A delta that stores information about changing commits */
class HashUpdateJournalDelta : public JournalDelta {
public:
HashUpdateJournalDelta() = default;
HashUpdateJournalDelta(HashUpdateJournalDelta&&) = default;
HashUpdateJournalDelta& operator=(HashUpdateJournalDelta&&) = default;
HashUpdateJournalDelta(const HashUpdateJournalDelta&) = delete;
HashUpdateJournalDelta& operator=(const HashUpdateJournalDelta&) = delete;
/** The snapshot hash that we started and ended up on.
* This will often be the same unless we perform a checkout or make
* a new snapshot from the snapshotable files in the overlay. */
Hash fromHash;
/** The set of files that had differing status across a checkout or
* some other operation that changes the snapshot hash */
std::unordered_set<RelativePath> uncleanPaths;
/** Get memory used (in bytes) by this Delta */
size_t estimateMemoryUsage() const;
};
class JournalDeltaPtr {
public:
/* implicit */ JournalDeltaPtr(std::nullptr_t);
/* implicit */ JournalDeltaPtr(FileChangeJournalDelta* p);
/* implicit */ JournalDeltaPtr(HashUpdateJournalDelta* p);
size_t estimateMemoryUsage() const;
explicit operator bool() const noexcept {
return !std::holds_alternative<std::monostate>(data_);
}
/** If this JournalDeltaPtr points to a FileChangeJournalDelta then returns
* the raw pointer, if it does not point to a FileChangeJournalDelta then
* return nullptr. */
FileChangeJournalDelta* getAsFileChangeJournalDelta();
const JournalDelta* operator->() const noexcept;
private:
std::variant<std::monostate, FileChangeJournalDelta*, HashUpdateJournalDelta*>
data_;
};
struct JournalDeltaRange {
/** The current sequence range.
* This is a range to accommodate merging a range into a single entry. */
JournalDelta::SequenceNumber fromSequence;
JournalDelta::SequenceNumber toSequence;
/** The time at which the change was recorded.
* This is a range to accommodate merging a range into a single entry. */
std::chrono::steady_clock::time_point fromTime;
std::chrono::steady_clock::time_point toTime;
/** The snapshot hash that we started and ended up on.
* This will often be the same unless we perform a checkout or make
* a new snapshot from the snapshotable files in the overlay. */
Hash fromHash;
Hash toHash;
/**
* The set of files that changed in the overlay in this update, including
* some information about the changes.
*/
std::unordered_map<RelativePath, PathChangeInfo> changedFilesInOverlay;
/** The set of files that had differing status across a checkout or
* some other operation that changes the snapshot hash */
std::unordered_set<RelativePath> uncleanPaths;
bool isTruncated = false;
JournalDeltaRange() = default;
JournalDeltaRange(JournalDeltaRange&&) = default;
JournalDeltaRange& operator=(JournalDeltaRange&&) = default;
};
} // namespace eden
} // namespace facebook