Summary:
Use `RepoFactory` to construct repositories in the walker.
The walker previously had special handling to allow repositories to
share metadata database and blobstore connections. This is now
implemented in `RepoFactory` itself.
Reviewed By: krallin
Differential Revision: D27400616
fbshipit-source-id: e16b6bdba624727977f4e58be64f8741b91500da
Summary:
We weren't validating any hashes in the walker, so it was entirely possible to
have corrupt content in the repo. Let's start hash validation by validating hg
filenodes.
Reviewed By: ahornby
Differential Revision: D27459516
fbshipit-source-id: 495d59436773d76fcd2ed572e1e724b01012be93
Summary:
When going into a root unode we check whether fastlog is derived. If it's not
derived we don't even try to check fastlog dirs/files, and that's correct.
However, that means that if the fastlog batch for a given commit is derived, then
we could (and should!) check that all fastlog batches referenced from this
commit exist.
Previously we weren't doing that and this diff fixes it. Same applies for hg
filenode.
Reviewed By: ahornby
Differential Revision: D27401597
fbshipit-source-id: 71ad2744eee33208c44447163cf77bc95ffe98d0
Summary:
Currently we treat a missing blob just like any other error, and that makes it hard to
have an alarm on blobs that are missing, because any transient error might make
this alarm go off.
Let's instead differentiate between a blob being missing and failing to fetch a
blob because of any other error (e.g. manifold is unavailable). It turned out this
is not too difficult to do because a lot of the types implement the Loadable trait,
which has a Missing variant in its errors.
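A minimal sketch of that distinction, assuming hypothetical stand-in types (`LoadError`, `StepOutcome`, `classify`) rather than the real Loadable API:

```rust
/// Hypothetical stand-in for a load error with a dedicated `Missing` variant.
#[derive(Debug)]
enum LoadError {
    /// The blob does not exist; carries the key that was looked up.
    Missing(String),
    /// Any other failure, e.g. the storage backend being unavailable.
    Other(String),
}

/// Illustrative outcome the walker could report for one load.
#[derive(Debug, PartialEq)]
enum StepOutcome {
    Loaded,
    Missing(String),
    Failed(String),
}

/// Keep "missing" separate from every other error so that an alarm can be
/// keyed on `Missing` alone without firing on transient failures.
fn classify(result: Result<Vec<u8>, LoadError>) -> StepOutcome {
    match result {
        Ok(_) => StepOutcome::Loaded,
        Err(LoadError::Missing(key)) => StepOutcome::Missing(key),
        Err(LoadError::Other(msg)) => StepOutcome::Failed(msg),
    }
}

fn main() {
    let missing = classify(Err(LoadError::Missing("blob.key".to_string())));
    let transient = classify(Err(LoadError::Other("storage unavailable".to_string())));
    println!("{:?} / {:?}", missing, transient);
}
```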
side-note:
It looks like we have 3 steps which treat missing blobs as not being an
error. These steps are:
1) fastlog_dir_step
2) fastlog_file_step
3) hg_filenode_step
I think this is wrong, and I plan to fix it in the next diff
Reviewed By: ahornby
Differential Revision: D27400280
fbshipit-source-id: e79fff25c41e4d03d77b72b410d6d2f0822c28fd
Summary:
This situation is not normally possible - every hg changeset should have a
corresponding bonsai. So let's return an error if that's not the case.
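As a rough illustration of turning the absent mapping into an error, assuming a hypothetical in-memory map and plain string ids (the real code works with typed changeset ids and a database-backed mapping):

```rust
use std::collections::HashMap;

/// Look up the bonsai id for an hg changeset id, failing loudly when the
/// mapping is absent instead of silently skipping the node.
fn bonsai_for_hg(mapping: &HashMap<String, String>, hg_cs_id: &str) -> Result<String, String> {
    mapping
        .get(hg_cs_id)
        .cloned()
        .ok_or_else(|| format!("no bonsai changeset for hg changeset {}", hg_cs_id))
}

fn main() {
    let mut mapping = HashMap::new();
    mapping.insert("hg1".to_string(), "bonsai1".to_string());
    println!("{:?}", bonsai_for_hg(&mapping, "hg1"));
    println!("{:?}", bonsai_for_hg(&mapping, "hg2"));
}
```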
Reviewed By: farnz
Differential Revision: D27400281
fbshipit-source-id: 4b01b973eeef0e3336c187fb90dd2ab4853b5c02
Summary: Record some more stats so we can see the last finish time. Also record update stats for run and chunk number so we can see how far along a run is.
Differential Revision: D26949482
fbshipit-source-id: 5e7df4412c25149559883b6e15afa70e1c670cdc
Summary:
The existing query to establish HgChangesetId on the path to FileContentMetadata for LFS validation is quite complex, using HgFilenode linknodes.
This change adds an optional edge from BonsaiHgMappingToHgBonsaiMapping that can be used to simplify the LFS validation case and load less data to get there.
Reviewed By: mitrandir77
Differential Revision: D26975799
fbshipit-source-id: 799acb8228721c1878f33254ebfa5e6345673e5d
Summary:
Like it says in the title. We probably need to find a better way to make sure
we have this everywhere, but for now let's add it here because it's missing.
Reviewed By: StanislavGlebik
Differential Revision: D26949401
fbshipit-source-id: 9325e0a367a1a41fed8997a3a13e0764b9d77e2f
Summary:
Walker args to correctly track the originating HgChangeset through history are a bit complicated, so add a test for them.
The test showed a bug where HgBonsaiMapping wasn't being tracked in ValidateRoute, so support for that was added.
Reviewed By: mitrandir77
Differential Revision: D26945254
fbshipit-source-id: 372574b5e9cde530ba8aecaf1bdc7c7d8aaee54b
Summary: Surprisingly this wasn't already in the glog output. Added it so it is easier to correlate with other logs.
Reviewed By: StanislavGlebik
Differential Revision: D26946047
fbshipit-source-id: b2f6b8097bd1ea6e18a79aa9ac0363582c858d55
Summary: It's useful to know the size as well as the fact that it's LFS.
Reviewed By: mitrandir77, farnz
Differential Revision: D26945223
fbshipit-source-id: 42787b983626ceecf822380e8ec6268646b3338f
Summary: Add a validate check that can log which files are over the LFS size threshold
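One way such a check could look, as a sketch only: `FileInfo` and `over_lfs_threshold` are hypothetical names, and treating "over" as strictly greater than the threshold is an assumption.

```rust
/// Hypothetical record for a file seen during a walk.
#[derive(Debug)]
struct FileInfo {
    path: String,
    size: u64,
}

/// Return the files whose size is strictly greater than `threshold` bytes,
/// so that both the path and the size can be logged for each of them.
fn over_lfs_threshold(files: &[FileInfo], threshold: u64) -> Vec<&FileInfo> {
    files.iter().filter(|f| f.size > threshold).collect()
}

fn main() {
    let files = vec![
        FileInfo { path: "README".to_string(), size: 120 },
        FileInfo { path: "assets/video.bin".to_string(), size: 5_000_000 },
    ];
    for f in over_lfs_threshold(&files, 1_000_000) {
        println!("over threshold: {} ({} bytes)", f.path, f.size);
    }
}
```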
Reviewed By: farnz
Differential Revision: D26853691
fbshipit-source-id: 414e608358ae0cf6e3f7f55e21caf253a1dc2f9c
Summary:
Some node types don't hold the path as part of their key (e.g.
FileContent) but can still have interesting path information associated with the edges to them.
This is used in the next diff to add validation that can report on LFS files.
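A small sketch of the idea, with hypothetical `Node` and `OutgoingEdge` shapes that are much simpler than the walker's real graph types:

```rust
/// Hypothetical node keys: only some of them carry a path.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    /// Keyed by content id alone, so the node itself has no path.
    FileContent(String),
    /// Keyed by path and filenode id.
    HgFileNode { path: String, id: String },
}

/// An edge to `target`, optionally annotated with the path it was reached by.
#[derive(Debug, Clone)]
struct OutgoingEdge {
    target: Node,
    path: Option<String>,
}

/// Prefer the path stored in the node key, falling back to the edge's path.
fn reported_path(edge: &OutgoingEdge) -> Option<&str> {
    match &edge.target {
        Node::HgFileNode { path, .. } => Some(path.as_str()),
        Node::FileContent(_) => edge.path.as_deref(),
    }
}

fn main() {
    let edge = OutgoingEdge {
        target: Node::FileContent("content-id".to_string()),
        path: Some("dir/large.bin".to_string()),
    };
    println!("{:?}", reported_path(&edge));
}
```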
Reviewed By: farnz
Differential Revision: D26945222
fbshipit-source-id: b78347bc81fc02fdc3b71a76522b2986c772440a
Summary: Logging from validation shouldn't require a failure. This updates FailureInfo to ValidateInfo in preparation for adding an LFS validation check in the next diff.
Reviewed By: farnz
Differential Revision: D26853692
fbshipit-source-id: 9fbee1e5b31664365a75aa207f055b7880ce326c
Summary:
AsyncVfs provides an async VFS interface.
It will be used in the native checkout instead of the current approach, which spawns blocking tokio tasks for VFS actions.
Reviewed By: quark-zju
Differential Revision: D26801250
fbshipit-source-id: bb26c4fc8acac82f4b55bb3f2f3964a6d0b64014
Summary:
Async the query macros. This change also migrates most callsites, with a few more complicated ones handled as separate diffs, which temporarily use sql01::queries in this diff.
With this change the query string is computed lazily (async fn/blocks being lazy) so we're not holding the extra memory of the query string as well as the query params for quite as long. This is of most interest for queries doing writes, where the query string can be large when large values are passed (e.g. Mononoke sqlblob blobstore).
Reviewed By: krallin
Differential Revision: D26586715
fbshipit-source-id: e299932457682b0678734f44bb4bfb0b966edeec
Summary:
This diff adds a layer of indirection between fbinit and tokio, thus allowing
us to use fbinit with tokio 0.2 or tokio 1.x.
The way this works is that you specify the Tokio you want by adding it as an
extra dependency alongside `fbinit` in your `TARGETS` (before this, you had to
always include `tokio-02`).
If you use `fbinit-tokio`, then `#[fbinit::main]` and `#[fbinit::test]` get you
a Tokio 1.x runtime, whereas if you use `fbinit-tokio-02`, you get a Tokio 0.2
runtime.
This diff is big, because it needs to change all the TARGETS that reference
this in the same diff that introduces the mechanism. I also didn't produce it
by hand.
Instead, I scripted the transformation using this script: P242773846
I then ran it using:
```
{ hg grep -l "fbinit::test"; hg grep -l "fbinit::main"; } | \
sort | \
uniq | \
xargs ~/codemod/codemod.py \
&& yes | arc lint \
&& common/rust/cargo_from_buck/bin/autocargo
```
Finally, I grabbed the files returned by `hg grep`, then fed them to:
```
arc lint-rust --paths-from ~/files2 --apply-patches --take RUSTFIXDEPS
```
(I had to modify the file list a bit: notably I removed stuff from scripts/ because
some of that causes Buck to crash when running lint-rust, and I also had to add
fbcode/ as a prefix everywhere).
Reviewed By: mitrandir77
Differential Revision: D26754757
fbshipit-source-id: 326b1c4efc9a57ea89db9b1d390677bcd2ab985e
Summary:
For dependencies, V2 puts "version" as the first attribute of a dependency, or just after "package" if present.
The workspace section comes after the patch section in V2. Since V2 autoformats the patch section while V1 takes it as is, the manual entries in third-party/rust/Cargo.toml had to be formatted by hand.
The thrift files must have "generated by autocargo", not only "generated", on their first line. This diff also removes some previously generated thrift files that were incorrectly left behind when the corresponding Cargo.toml was removed.
Reviewed By: ikostia
Differential Revision: D26618363
fbshipit-source-id: c45d296074f5b0319bba975f3cb0240119729c92
Summary:
Scrubbing a repo is highly concurrent as it's mostly IO bound. As such we can end up waiting on the sql connection pool for connections where it allows fewer than scheduled_max connections.
This change makes bounded_traversal_unique calls from the walker aware of the database tier and shard a Node may connect to, so that execution can be limited to the bounds of what the connection pool can support without waiting.
We still end up waiting for the connection, but now the waiting is done in bounded_traversal_unique rather than in connection pool code, so we are a) able to process other Nodes while waiting and b) not subject to connection pool timeouts.
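The scheduling idea can be sketched synchronously as follows. This is not the bounded_traversal_unique implementation; `ShardKey`, `Scheduler` and the per-shard limit are hypothetical names chosen for illustration.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical database location that a node's step would connect to.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct ShardKey {
    tier: String,
    shard: usize,
}

/// Holds queued nodes and tracks how many steps are in flight per shard.
struct Scheduler {
    max_per_shard: usize,
    in_flight: HashMap<ShardKey, usize>,
    waiting: VecDeque<(String, Option<ShardKey>)>,
}

impl Scheduler {
    fn new(max_per_shard: usize) -> Self {
        Scheduler { max_per_shard, in_flight: HashMap::new(), waiting: VecDeque::new() }
    }

    fn enqueue(&mut self, node: &str, key: Option<ShardKey>) {
        self.waiting.push_back((node.to_string(), key));
    }

    /// Pick the first waiting node whose shard still has capacity. Nodes
    /// without a shard key are always runnable. Saturated nodes stay queued,
    /// so other nodes can make progress in the meantime.
    fn next_runnable(&mut self) -> Option<String> {
        let pos = self.waiting.iter().position(|(_, key)| match key {
            Some(k) => self.in_flight.get(k).copied().unwrap_or(0) < self.max_per_shard,
            None => true,
        })?;
        let (node, key) = self.waiting.remove(pos)?;
        if let Some(k) = key {
            *self.in_flight.entry(k).or_insert(0) += 1;
        }
        Some(node)
    }

    /// Release one slot on `key` once a step has completed.
    fn finish(&mut self, key: &ShardKey) {
        if let Some(count) = self.in_flight.get_mut(key) {
            *count = count.saturating_sub(1);
        }
    }
}

fn main() {
    let s0 = ShardKey { tier: "meta".to_string(), shard: 0 };
    let mut sched = Scheduler::new(1);
    sched.enqueue("a", Some(s0.clone()));
    sched.enqueue("b", Some(s0.clone()));
    sched.enqueue("c", None);
    println!("{:?}", sched.next_runnable());
    println!("{:?}", sched.next_runnable());
    sched.finish(&s0);
    println!("{:?}", sched.next_runnable());
}
```

The point of the design is that the wait happens in the scheduler's own queue, where it can be interleaved with other work, rather than inside a pool that may time out.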
Differential Revision: D26524074
fbshipit-source-id: 19125388c730f5cef7e9de34b5b550efa8e6b825
Summary: Small clean up. Allows us to pass Logger by reference, removing the FIXME in blobrepo factory
Reviewed By: farnz
Differential Revision: D26551592
fbshipit-source-id: d6bb04b8bb3034ad056f071b67b5ae0ce3c6f224
Summary:
The walker mostly checks for duplicates before emitting a new edge, at the same time recording the edge as visited to prevent duplicate edges.
However for derived data where the node may or may not be present, the node isn't considered visited until the node data is successfully loaded and seen in state.rs record_resolved_visit().
In such cases multiple copies of a node could be enqueued, and then we need to run each one.
With this change, where the walker can detect that such a step has completed previously, it will now short circuit the step and return None.
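A simplified sketch of the short circuit, assuming a hypothetical `WalkState` that tracks resolved nodes by string key (only the name `record_resolved_visit` comes from the description above; its body here is illustrative):

```rust
use std::collections::HashSet;

/// Hypothetical walk state that remembers which nodes were resolved.
#[derive(Default)]
struct WalkState {
    resolved: HashSet<String>,
}

impl WalkState {
    /// Mark a node as visited once its data has actually been loaded.
    fn record_resolved_visit(&mut self, node: &str) {
        self.resolved.insert(node.to_string());
    }

    /// Run a step for `node`. If an earlier copy of the same node already
    /// completed, skip the load and return `None`.
    fn step<F>(&mut self, node: &str, load: F) -> Option<String>
    where
        F: Fn(&str) -> Option<String>,
    {
        if self.resolved.contains(node) {
            return None;
        }
        // Derived data may be absent; in that case the node stays unvisited.
        let data = load(node)?;
        self.record_resolved_visit(node);
        Some(data)
    }
}

/// Illustrative loader: only keys starting with "derived:" have data.
fn load_derived(node: &str) -> Option<String> {
    if node.starts_with("derived:") {
        Some(format!("data for {}", node))
    } else {
        None
    }
}

fn main() {
    let mut state = WalkState::default();
    println!("{:?}", state.step("derived:fastlog", load_derived));
    println!("{:?}", state.step("derived:fastlog", load_derived));
}
```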
Differential Revision: D26369917
fbshipit-source-id: c2bdbbabfaa80dbb7cc7d2bc25a17230531ae111
Summary:
Adding a new configuration that instantiates SegmentedChangelog by downloading
a dag from a prebuilt blob. It then updates in process.
Reviewed By: krallin
Differential Revision: D26508428
fbshipit-source-id: 09166a3c6de499d8813a29afafd4dfe19a19a2a5
Summary:
The changes (and fixes) needed were:
- Ignore rules that are not rust_library or thrift_library (previously only rust_bindgen_library was ignored, so binary and test dependencies were incorrectly added to Cargo.toml)
- Thrift package name to match escaping logic of `tools/build_defs/fbcode_macros/build_defs/lib/thrift/rust.bzl`
- Rearrange some attributes, like features, authors, edition etc.
- Authors to use " instead of '
- Features to be sorted
- Sort all dependencies as one instead of grouping third party and fbcode dependencies together
- Manually format certain entries from third-party/rust/Cargo.toml, since V2 formats third party dependency entries and V1 just takes them as is.
Reviewed By: zertosh
Differential Revision: D26544150
fbshipit-source-id: 19d98985bd6c3ac901ad40cff38ee1ced547e8eb
Summary: Extract a function so the next diff is easier to read
Differential Revision: D26424694
fbshipit-source-id: f7a64b1d8a114f81875b791f022eb590a0014605
Summary:
Autocargo V2 will use a more structured format for the autocargo field.
With the help of the `cargo_toml` crate it will be easy to deserialize and handle
it.
Also, the "include" field is apparently obsolete, as it is used for cargo-publish (see https://doc.rust-lang.org/cargo/reference/manifest.html#the-exclude-and-include-fields). From what I know it might often be wrong; in particular, if someone tries to publish a package from fbcode, the private facebook folders might be shipped. Let's just not set it; in the new system one will be able to set it explicitly via an autocargo parameter on a rule.
Reviewed By: ahornby
Differential Revision: D26339606
fbshipit-source-id: 510a01a4dd80b3efe58a14553b752009d516d651
Summary:
The walker mostly checks for duplicates before emitting a new edge, at the same time recording the edge as visited to prevent duplicate edges.
However for derived data where the node may or may not be present, the node isn't considered visited until the node data is successfully loaded and seen in state.rs record_resolved_visit(). This leaves a gap where we could be executing multiple copies of the same node.
Reviewed By: farnz
Differential Revision: D26319139
fbshipit-source-id: 52ce28f15341f132d94ebc1ff5e8ee2f0dc2564a
Summary: Change from .map_ok() to async move to shift code left a bit
Differential Revision: D26366419
fbshipit-source-id: 833066b45702f36a4ce8d579994d1abb2d739f9e
Summary: It's useful to be able to set the repo bounds used for a walk so that issues can be reproduced with the same inputs as a failing chunk.
Differential Revision: D26342439
fbshipit-source-id: b486387be59a3f4d21e3d3dc407420fc339c150d
Summary: The requirement for boxing is in tail.rs so do the boxing there.
Reviewed By: StanislavGlebik
Differential Revision: D26365153
fbshipit-source-id: 65974fd1e90ca9709ba58200a319ffc6b0fae5db
Summary:
Make the walker use blobstore_factory::make_blobstore like other clients do.
This allows blobstore_factory::{make_blobstore_multiplexed, make_blobstore_put_ops} to be hidden.
Reviewed By: krallin
Differential Revision: D25980596
fbshipit-source-id: 2417ed11d4edc611d19e003122acec6b7ebd341d
Summary: Remove these as common cmdlib ones are preferred.
Reviewed By: krallin
Differential Revision: D25976407
fbshipit-source-id: 30295950e51ad1fb2d88cf396b6a7c7353d17577
Summary: It's useful to be able to pass multiple root types e.g. ChangesetInfoMapping and DeletedManifestMapping to a scrub, but clap was set to deny it.
Reviewed By: markbt
Differential Revision: D26279308
fbshipit-source-id: cacc523ee06ccf50bed0b11b73fa8e84e4990eae
Summary: Update walker to use new common cmdlib scrub options if present. These are common across admin, walker and scrub so they were moved up to cmdlib.
Reviewed By: krallin
Differential Revision: D25976408
fbshipit-source-id: 430bb0c6e8b78470afdfc7cebc44c6645492c6fe
Summary: Add option to allow remaining deferred edges at the end of a walker run so that any repos with unresolved edges can still be tailed.
Reviewed By: StanislavGlebik
Differential Revision: D26230927
fbshipit-source-id: 19eed6a616f722d522c7bca30bbe3bc4dae08655
Summary: Add support for opening the checkpoint database from metadata db config
Differential Revision: D26100513
fbshipit-source-id: 094fab028395ed0324421488bf83b3762c43799a
Summary: If a checkpoint gets too old then we don't want to rely on it; instead we start a new walk from scratch and then update the checkpoint.
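The age check could be as small as the following sketch, where `Checkpoint`, its `age` field and `usable_checkpoint` are hypothetical:

```rust
use std::time::Duration;

/// Hypothetical checkpoint record; only its age matters for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Checkpoint {
    age: Duration,
}

/// Keep the checkpoint only if it is no older than `max_age`. A `None`
/// result tells the caller to start a fresh walk from scratch.
fn usable_checkpoint(checkpoint: Option<Checkpoint>, max_age: Duration) -> Option<Checkpoint> {
    checkpoint.filter(|cp| cp.age <= max_age)
}

fn main() {
    let max_age = Duration::from_secs(24 * 60 * 60);
    let stale = Checkpoint { age: Duration::from_secs(3 * 24 * 60 * 60) };
    println!("{:?}", usable_checkpoint(Some(stale), max_age));
}
```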
Differential Revision: D25995107
fbshipit-source-id: 1e05030926123e1066c9b5a42330028d7786c1f3
Summary:
Walker checkpoints allow a scrub or other walk to continue where it left off. This is useful as we can release new code without making the scrub start from scratch again.
This change adds checkpoint loading and recording to tail.rs along with a new test for it.
When restarting from a checkpoint the code considers the unfinished checkpoint itself as the main_bounds and new commits since the checkpoint as the catchup_bounds.
If there is no checkpoint at all the repo bounds are used as the main_bounds.
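The bounds selection described above can be sketched like this, assuming bounds are half-open ranges of numeric ids (`Bounds`, `WalkPlan` and `plan_walk` are hypothetical names, not the tail.rs types):

```rust
/// Hypothetical half-open range of changeset ids: [lower, upper).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bounds {
    lower: u64,
    upper: u64,
}

/// Which ranges a restarted walk should cover.
#[derive(Debug, PartialEq)]
struct WalkPlan {
    main_bounds: Bounds,
    catchup_bounds: Option<Bounds>,
}

/// With a checkpoint, resume its range as the main bounds and treat anything
/// newer than the checkpoint as catchup. Without one, walk the repo bounds.
fn plan_walk(checkpoint: Option<Bounds>, repo_bounds: Bounds) -> WalkPlan {
    match checkpoint {
        Some(cp) => {
            let catchup_bounds = if repo_bounds.upper > cp.upper {
                Some(Bounds { lower: cp.upper, upper: repo_bounds.upper })
            } else {
                None
            };
            WalkPlan { main_bounds: cp, catchup_bounds }
        }
        None => WalkPlan { main_bounds: repo_bounds, catchup_bounds: None },
    }
}

fn main() {
    let repo = Bounds { lower: 0, upper: 150 };
    let checkpoint = Bounds { lower: 0, upper: 100 };
    println!("{:?}", plan_walk(Some(checkpoint), repo));
    println!("{:?}", plan_walk(None, repo));
}
```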
Differential Revision: D25995106
fbshipit-source-id: e1663091e4b1157541b256f36b354bbf316a92c9
Summary:
Fix some warnings in the Mononoke build:
- URLs in doc comments should be delimited with `<` and `>`.
- Permission checker `try_from_ssh_encoded` parameter is unused.
Reviewed By: krallin
Differential Revision: D26224590
fbshipit-source-id: 49ce62655189a7045b78538642dbf638519f71de
Summary: Use normal ? style rather than panicking with expect.
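For illustration, the general shape of such a change on a hypothetical `parse_limit` helper (not the code touched by this diff):

```rust
use std::num::ParseIntError;

/// Parse a numeric argument, propagating the error to the caller with `?`.
fn parse_limit(raw: &str) -> Result<u64, ParseIntError> {
    // Before: raw.trim().parse::<u64>().expect("invalid limit") would panic
    // on bad input instead of returning an error.
    let limit = raw.trim().parse::<u64>()?;
    Ok(limit)
}

fn main() {
    println!("{:?}", parse_limit(" 42 "));
    println!("{:?}", parse_limit("not a number"));
}
```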
Reviewed By: krallin
Differential Revision: D26176912
fbshipit-source-id: d04ebc4b6c04dd1f8f34b49bee350b52feb11ec1
Summary:
Walker checkpoints will allow a scrub or other walk to continue where it left off. This is useful to release new code and restart jobs without making the job start from scratch again.
This change adds the sqlite schema and SqlCheckpoint implementation for storing checkpoints and is exercised by new tests. It's connected up to walker tailing in the following diff.
Differential Revision: D25995108
fbshipit-source-id: 18a27c95aa7c38f8aa3d432d74de2831213c4ba2
Summary:
When running large repos it's interesting to be able to reduce memory usage by clearing all or part of the walk state between chunks.
This adds the ability to clear the node data by node type, and to clear the interning maps by interned data type.
The only interned type not clearable is the bonsai changeset id, as that is used to maintain the list of deferred edges for later chunks to process.
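A rough sketch of selective clearing, using a hypothetical `WalkState` with made-up `NodeType` and `InternedType` variants; only the rule that the changeset id interning is retained comes from the description above.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical subset of node types.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum NodeType {
    FileContent,
    HgFileNode,
}

/// Hypothetical subset of interned types.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum InternedType {
    ChangesetId,
    MPathHash,
}

#[derive(Default)]
struct WalkState {
    visited: HashMap<NodeType, HashSet<u64>>,
    interned: HashMap<InternedType, Vec<String>>,
}

impl WalkState {
    /// Drop all visited-node data for one node type.
    fn clear_node_type(&mut self, node_type: NodeType) {
        self.visited.remove(&node_type);
    }

    /// Drop one interning map. Returns false, and clears nothing, for
    /// changeset ids, which later chunks need to resolve deferred edges.
    fn clear_interned(&mut self, interned_type: InternedType) -> bool {
        if interned_type == InternedType::ChangesetId {
            return false;
        }
        self.interned.remove(&interned_type);
        true
    }
}

fn main() {
    let mut state = WalkState::default();
    state.visited.entry(NodeType::FileContent).or_default().insert(1);
    state.interned.entry(InternedType::MPathHash).or_default().push("a/b".to_string());
    state.clear_node_type(NodeType::FileContent);
    let cleared = state.clear_interned(InternedType::MPathHash);
    println!("cleared interned paths: {}", cleared);
}
```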
Reviewed By: krallin
Differential Revision: D25867960
fbshipit-source-id: 4b48f03b1a1b8fef1c5ded952bdcd6b1241dcc32
Summary:
Split the WalkVisitor trait into TailingWalkVisitor to allow &mut self references on the methods called from
tail.rs.
This diff also has logging changes to show the chunk bounds and introduce a chunking log tag. This helped in testing.
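The motivation for the split can be sketched as below. The trait names come from the summary, but the methods (`visit`, `start_chunk`) and the `State` type are hypothetical.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Called concurrently during a walk, so methods take `&self`.
trait WalkVisitor {
    /// Returns true the first time a node is seen.
    fn visit(&self, node: &str) -> bool;
}

/// Called between chunks from the tailing loop, where exclusive access is
/// available, so methods can take `&mut self`.
trait TailingWalkVisitor {
    fn start_chunk(&mut self, chunk: u64);
}

struct State {
    chunk: u64,
    seen: Mutex<HashSet<String>>,
}

impl WalkVisitor for State {
    fn visit(&self, node: &str) -> bool {
        self.seen.lock().unwrap().insert(node.to_string())
    }
}

impl TailingWalkVisitor for State {
    fn start_chunk(&mut self, chunk: u64) {
        self.chunk = chunk;
        // With `&mut self` the mutex can be bypassed: no locking is needed.
        self.seen.get_mut().unwrap().clear();
    }
}

fn main() {
    let mut state = State { chunk: 0, seen: Mutex::new(HashSet::new()) };
    println!("first visit: {}", state.visit("node"));
    state.start_chunk(1);
    println!("chunk {} first visit: {}", state.chunk, state.visit("node"));
}
```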
Reviewed By: krallin
Differential Revision: D26054258
fbshipit-source-id: c74af558f1da98689a38ca61363baf7ee52a265e
Summary: Add a new enum representing the types the walker interns so that we can optionally clear them between chunks. This change adds the enum and command line parsing, actual clearing follows in the next diff.
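A sketch of what such an enum and its parsing might look like, using only the standard library. The variant names and the comma-separated argument format are assumptions, and the real change may parse through the command line library instead.

```rust
use std::str::FromStr;

/// Hypothetical subset of the types the walker interns.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum InternedType {
    FileUnodeId,
    HgFileNodeId,
    MPathHash,
}

impl FromStr for InternedType {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "FileUnodeId" => Ok(InternedType::FileUnodeId),
            "HgFileNodeId" => Ok(InternedType::HgFileNodeId),
            "MPathHash" => Ok(InternedType::MPathHash),
            other => Err(format!("unknown interned type: {}", other)),
        }
    }
}

/// Parse a comma-separated argument value into a list of interned types,
/// failing on the first unknown name.
fn parse_interned_types(arg: &str) -> Result<Vec<InternedType>, String> {
    arg.split(',')
        .map(|s| s.trim().parse::<InternedType>())
        .collect()
}

fn main() {
    println!("{:?}", parse_interned_types("FileUnodeId, MPathHash"));
    println!("{:?}", parse_interned_types("Bogus"));
}
```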
Reviewed By: krallin
Differential Revision: D25910622
fbshipit-source-id: 0226b4009bf8199498e21e52f734a9529ee7afaa