Summary:
This adds the first part of new logging from the walker that can be used to gather details on what keys might make sense to pack together.
Unlike the corpus command, which dumps file content by path (useful for analysis of compression approaches), this adds logging to the scrub command that includes the path hash rather than the full path. This should keep memory usage down during the run, hopefully let us log from existing scrub jobs, and make the logs more compact.
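As a sketch of the idea (illustrative only; `DefaultHasher` stands in for whatever hash the walker actually uses), logging a fixed-size hash keeps log rows compact while still letting lines for the same file be correlated within a run:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative sketch, not the walker's actual hashing: log a fixed-size
// hash of the path instead of the path itself, so log rows stay small and
// we never retain full path strings.
fn path_hash(path: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    path.hash(&mut hasher);
    hasher.finish()
}
```

The same path always produces the same hash within a run, so entries for one file can still be grouped without ever logging the path.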
Reviewed By: mitrandir77
Differential Revision: D27974448
fbshipit-source-id: 47b55112b47e9b022f16fbb473cf233a7d46bcf3
Summary:
Update the zstd crates.
This also patches async-compression crate to point at my fork until upstream PR https://github.com/Nemo157/async-compression/pull/117 to update to zstd 1.4.9 can land.
Reviewed By: jsgf, dtolnay
Differential Revision: D27942174
fbshipit-source-id: 26e604d71417e6910a02ec27142c3a16ea516c2b
Summary: It's more efficient to bulk load the mappings for a chunk than to do the queries one by one.
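A minimal sketch of the difference, using a hypothetical `MappingStore` that counts round trips (names are made up, not the real mapping types):

```rust
use std::collections::HashMap;

// Hypothetical store that counts round trips, to show why bulk loading a
// chunk's mappings beats querying one key at a time.
struct MappingStore {
    data: HashMap<u64, String>,
    round_trips: usize,
}

impl MappingStore {
    fn new(data: HashMap<u64, String>) -> Self {
        Self { data, round_trips: 0 }
    }

    // One round trip answers every key in the chunk; querying one by one
    // would cost a round trip per key.
    fn get_many(&mut self, keys: &[u64]) -> HashMap<u64, String> {
        self.round_trips += 1;
        keys.iter()
            .filter_map(|k| self.data.get(k).map(|v| (*k, v.clone())))
            .collect()
    }
}
```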
Differential Revision: D27801830
fbshipit-source-id: 9c38ddfb1c1d827fc3028cd09f9ad51e3cbee5dc
Summary: Add an accessor so that we keep a reverse mapping of the WalkState::bcs_to_hg member as a cache of bonsai to hg mappings and also populate it on derivations.
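The shape of the idea, with illustrative names rather than the real `WalkState` fields:

```rust
use std::collections::HashMap;

// Illustrative sketch of keeping a reverse index alongside a forward one,
// so lookups in either direction are O(1). Field names are made up, not
// the real WalkState members.
#[derive(Default)]
struct WalkStateSketch {
    bcs_to_hg: HashMap<String, String>,
    hg_to_bcs: HashMap<String, String>,
}

impl WalkStateSketch {
    // Populate both directions whenever a mapping is learned, e.g. on
    // derivation, so either map can serve as a cache.
    fn record(&mut self, bcs: &str, hg: &str) {
        self.bcs_to_hg.insert(bcs.to_string(), hg.to_string());
        self.hg_to_bcs.insert(hg.to_string(), bcs.to_string());
    }

    fn hg_for_bcs(&self, bcs: &str) -> Option<&String> {
        self.bcs_to_hg.get(bcs)
    }

    fn bcs_for_hg(&self, hg: &str) -> Option<&String> {
        self.hg_to_bcs.get(hg)
    }
}
```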
Differential Revision: D27800533
fbshipit-source-id: f9b1c279a78ce3791013c3c83a32251fdc3ad77f
Summary: Add an accessor so that we can use the WalkState::hg_to_bcs member as a cache of hg to bonsai mappings
Reviewed By: farnz
Differential Revision: D27797638
fbshipit-source-id: 44322e93849ea78b255b2e3cb05feb8db6b4c7a7
Summary:
There is a very frustrating operation that happens often when working on the
Mononoke code base:
- You want to add a flag
- You want to consume it in the repo somewhere
Unfortunately, when we need to do this, we end up having to thread this from a
million places and parse it out in every single main() we have.
This is a mess, and it results in every single Mononoke binary starting with
heaps of useless boilerplate:
```
let matches = app.get_matches();
let (caching, logger, mut runtime) = matches.init_mononoke(fb)?;
let config_store = args::init_config_store(fb, &logger, &matches)?;
let mysql_options = args::parse_mysql_options(&matches);
let blobstore_options = args::parse_blobstore_options(&matches)?;
let readonly_storage = args::parse_readonly_storage(&matches);
```
So, this diff updates us to just use MononokeEnvironment directly in
RepoFactory, which means none of that has to happen: we can now add a flag,
parse it into MononokeEnvironment, and get going.
While we're at it, we can also remove blobstore options and all that jazz from
MononokeApiEnvironment since now it's there in the underlying RepoFactory.
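Sketched with made-up field and type names (not the real MononokeEnvironment or RepoFactory APIs), the end state looks roughly like:

```rust
// Hypothetical sketch of bundling per-binary options into one environment
// struct that a factory consumes, instead of threading each option through
// every main(). All names here are illustrative.
struct Environment {
    mysql_pool_size: usize,
    readonly_storage: bool,
}

struct RepoFactorySketch {
    env: Environment,
}

impl RepoFactorySketch {
    fn new(env: Environment) -> Self {
        Self { env }
    }

    // A new flag only needs to be parsed into Environment once; every
    // consumer of the factory sees it with no extra plumbing.
    fn describe(&self) -> String {
        format!(
            "pool={} readonly={}",
            self.env.mysql_pool_size, self.env.readonly_storage
        )
    }
}
```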
Reviewed By: HarveyHunt
Differential Revision: D27767700
fbshipit-source-id: e1e359bf403b4d3d7b36e5f670aa1a7dd4f1d209
Summary:
ScrubOptions normally represents options we parsed from the CLI, but right now
we abuse this a little bit to throw a ScrubHandler into them, which we
sometimes mutate before using this config.
In this stack, I'm unifying how we pass configs to RepoFactory, and this little
exception doesn't really fit. So, let's change this up, and make ScrubHandler
something you may give the RepoFactory if you're so inclined.
Reviewed By: HarveyHunt
Differential Revision: D27767699
fbshipit-source-id: fd38bf47eeb723ec7d62f8d34e706d8581a38c43
Summary:
Basically every single Mononoke binary starts with the same preamble:
- Init mononoke
- Init caching
- Init logging
- Init tunables
Some of them forget to do it, some don't, etc. This is a mess.
To make things messier, our initialization consists of a bunch of lazy statics
interacting with each other (init logging & init configerator are kinda
intertwined due to the fact that configerator wants a logger but dynamic
observability wants a logger), and methods you must only call once.
This diff attempts to clean this up by moving all this initialization into the
construction of MononokeMatches. I didn't change all the accessor methods
(though I did update those that would otherwise return things instantiated at
startup).
I'm planning to do a bit more on top of this, as my actual goal here is to make
it easier to thread arguments from MononokeMatches to RepoFactory, and to do so
I'd like to just pass my MononokeEnvironment as an input to RepoFactory.
Reviewed By: HarveyHunt
Differential Revision: D27767698
fbshipit-source-id: 00d66b07b8c69f072b92d3d3919393300dd7a392
Summary:
Use `RepoFactory` to construct repositories in the walker.
The walker previously had special handling to allow repositories to
share metadata database and blobstore connections. This is now
implemented in `RepoFactory` itself.
Reviewed By: krallin
Differential Revision: D27400616
fbshipit-source-id: e16b6bdba624727977f4e58be64f8741b91500da
Summary:
We weren't validating any hashes in the walker, so it was entirely possible to
have corrupt content in the repo. Let's start hash validation by validating hg
filenodes.
Reviewed By: ahornby
Differential Revision: D27459516
fbshipit-source-id: 495d59436773d76fcd2ed572e1e724b01012be93
Summary:
When going into a root unode we check if fastlog is derived or not. if it's not
derived we don't even try to check fastlog dirs/files, and that's correct.
However that means that if fastlog batch for a given commit is derived than
we could (and should!) check if all fastlog batches referenced from this
commit exist.
Previously we weren't doing that and this diff fixes it. Same applies for hg
filenode.
Reviewed By: ahornby
Differential Revision: D27401597
fbshipit-source-id: 71ad2744eee33208c44447163cf77bc95ffe98d0
Summary:
Currently we treat a missing blob just like any other error, and that makes it hard to
alarm on blobs that are missing, because any transient error might make
the alarm go off.
Let's instead differentiate between a blob being missing and failing to fetch a
blob because of any other error (i.e. manifold is unavailable). It turned out this
is not too difficult to do, because a lot of the types implement the Loadable trait,
which has a Missing variant in its errors.
side-note:
It looks like we have 3 steps which treat missing blobs as not being an
error. These steps are:
1) fastlog_dir_step
2) fastlog_file_step
3) hg_filenode_step
I think this is wrong, and I plan to fix it in the next diff
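A minimal sketch of the distinction (illustrative types, not the real `Loadable` error):

```rust
// Hypothetical sketch of splitting "blob missing" from other fetch errors,
// mirroring a Loadable-style error with a distinct Missing variant.
#[derive(Debug, PartialEq)]
enum LoadError {
    Missing(String), // key that was definitively not found
    Other(String),   // transient failure, e.g. the store is unavailable
}

// With the variants separated, an alarm can fire only on genuinely
// missing blobs and ignore transient fetch failures.
fn classify(err: &LoadError) -> &'static str {
    match err {
        LoadError::Missing(_) => "missing", // safe to alarm on
        LoadError::Other(_) => "error",     // transient; don't page
    }
}
```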
Reviewed By: ahornby
Differential Revision: D27400280
fbshipit-source-id: e79fff25c41e4d03d77b72b410d6d2f0822c28fd
Summary:
This situation is not normally possible - every hg changeset should have a
corresponding bonsai. So let's return an error if that's not the case.
Reviewed By: farnz
Differential Revision: D27400281
fbshipit-source-id: 4b01b973eeef0e3336c187fb90dd2ab4853b5c02
Summary: Record some more stats so we can see the last finish time. Also record update stats for run and chunk number so we can see how far along a run is.
Differential Revision: D26949482
fbshipit-source-id: 5e7df4412c25149559883b6e15afa70e1c670cdc
Summary:
The existing query to establish HgChangesetId on the path to FileContentMetadata for LFS validation is quite complex, using HgFilenode linknodes.
This change adds an optional edge from BonsaiHgMappingToHgBonsaiMapping that can be used to simplify the LFS validation case and load less data to get there.
Reviewed By: mitrandir77
Differential Revision: D26975799
fbshipit-source-id: 799acb8228721c1878f33254ebfa5e6345673e5d
Summary:
Like it says in the title. We probably need to find a better way to make sure
we have this everywhere, but for now let's add it here because it's missing.
Reviewed By: StanislavGlebik
Differential Revision: D26949401
fbshipit-source-id: 9325e0a367a1a41fed8997a3a13e0764b9d77e2f
Summary:
Walker args to correctly track the originating HgChangeset through history are a bit complicated, so add a test for them.
Test showed a bug that HgBonsaiMapping wasn't being tracked in ValidateRoute, so added support for that.
Reviewed By: mitrandir77
Differential Revision: D26945254
fbshipit-source-id: 372574b5e9cde530ba8aecaf1bdc7c7d8aaee54b
Summary: Surprisingly this wasn't already in the glog output. Added it so it's easier to correlate with other logs.
Reviewed By: StanislavGlebik
Differential Revision: D26946047
fbshipit-source-id: b2f6b8097bd1ea6e18a79aa9ac0363582c858d55
Summary: It's useful to know the size as well as the fact that it's LFS.
Reviewed By: mitrandir77, farnz
Differential Revision: D26945223
fbshipit-source-id: 42787b983626ceecf822380e8ec6268646b3338f
Summary: Add a validate check that can log which files are over the LFS size threshold
Reviewed By: farnz
Differential Revision: D26853691
fbshipit-source-id: 414e608358ae0cf6e3f7f55e21caf253a1dc2f9c
Summary:
Some node types don't hold the path as part of their key (e.g.
FileContent) but can still have interesting path information associated with the edges to them.
This is used in next diff to add validation that can report on LFS files
Reviewed By: farnz
Differential Revision: D26945222
fbshipit-source-id: b78347bc81fc02fdc3b71a76522b2986c772440a
Summary: Logging from validation shouldn't have to be a failure. This updates FailureInfo to ValidateInfo in preparation for adding an LFS validation check in the next diff.
Reviewed By: farnz
Differential Revision: D26853692
fbshipit-source-id: 9fbee1e5b31664365a75aa207f055b7880ce326c
Summary:
AsyncVfs provides async vfs interface.
It will be used in native checkout in place of the current approach, which spawns blocking tokio tasks for VFS actions.
Reviewed By: quark-zju
Differential Revision: D26801250
fbshipit-source-id: bb26c4fc8acac82f4b55bb3f2f3964a6d0b64014
Summary:
Async the query macros. This change also migrates most callsites, with a few more complicated ones handled as separate diffs; those temporarily use sql01::queries in this diff.
With this change the query string is computed lazily (async fns/blocks being lazy), so we're not holding the extra memory of the query string as well as the query params for quite as long. This matters most for queries doing writes, where the query string can be large when large values are passed (e.g. the Mononoke sqlblob blobstore).
Reviewed By: krallin
Differential Revision: D26586715
fbshipit-source-id: e299932457682b0678734f44bb4bfb0b966edeec
Summary:
This diffs add a layer of indirection between fbinit and tokio, thus allowing
us to use fbinit with tokio 0.2 or tokio 1.x.
The way this works is that you specify the Tokio you want by adding it as an
extra dependency alongside `fbinit` in your `TARGETS` (before this, you had to
always include `tokio-02`).
If you use `fbinit-tokio`, then `#[fbinit::main]` and `#[fbinit::test]` get you
a Tokio 1.x runtime, whereas if you use `fbinit-tokio-02`, you get a Tokio 0.2
runtime.
This diff is big, because it needs to change all the TARGETS that reference
this in the same diff that introduces the mechanism. I also didn't produce it
by hand.
Instead, I scripted the transformation using this script: P242773846
I then ran it using:
```
{ hg grep -l "fbinit::test"; hg grep -l "fbinit::main" } | \
sort | \
uniq | \
xargs ~/codemod/codemod.py \
&& yes | arc lint \
&& common/rust/cargo_from_buck/bin/autocargo
```
Finally, I grabbed the files returned by `hg grep`, then fed them to:
```
arc lint-rust --paths-from ~/files2 --apply-patches --take RUSTFIXDEPS
```
(I had to modify the file list a bit: notably I removed stuff from scripts/ because
some of that causes Buck to crash when running lint-rust, and I also had to add
fbcode/ as a prefix everywhere).
Reviewed By: mitrandir77
Differential Revision: D26754757
fbshipit-source-id: 326b1c4efc9a57ea89db9b1d390677bcd2ab985e
Summary:
For dependencies V2 puts "version" as the first attribute of dependency or just after "package" if present.
The workspace section comes after the patch section in V2, and since V2 autoformats the patch section, the manual entries in third-party/rust/Cargo.toml had to be formatted by hand (V1 takes them as is).
The thrift files are to have "generated by autocargo" and not only "generated" on their first line. This diff also removes some previously generated thrift files that have been incorrectly left when the corresponding Cargo.toml was removed.
Reviewed By: ikostia
Differential Revision: D26618363
fbshipit-source-id: c45d296074f5b0319bba975f3cb0240119729c92
Summary:
Scrubbing a repo is highly concurrent as it's mostly IO bound. As such we can end up waiting on the sql connection pool for connections when it allows fewer than scheduled_max connections.
This change makes bounded_traversal_unique calls from the walker aware of the database tier and shard a Node may connect to, so that execution can be limited to the bounds of what the connection pool can support without waiting.
We still end up waiting for the connection, but now it's done in bounded_traversal_unique, rather than in connection pool code, and are thus a) able to process other Nodes while waiting and b) not subject to connection pool timeouts.
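A simplified sketch of per-shard admission control (illustrative only; the real logic lives inside bounded_traversal_unique and is async):

```rust
use std::collections::HashMap;

// Hypothetical sketch of bounding in-flight work per database shard: a
// node is only admitted when its shard has spare capacity; otherwise the
// traversal keeps it queued and processes other nodes meanwhile.
struct ShardLimiter {
    in_flight: HashMap<String, usize>,
    max_per_shard: usize,
}

impl ShardLimiter {
    fn new(max_per_shard: usize) -> Self {
        Self {
            in_flight: HashMap::new(),
            max_per_shard,
        }
    }

    // Returns true if the node may start now; the caller re-queues it
    // otherwise instead of blocking in connection pool code.
    fn try_admit(&mut self, shard: &str) -> bool {
        let count = self.in_flight.entry(shard.to_string()).or_insert(0);
        if *count < self.max_per_shard {
            *count += 1;
            true
        } else {
            false
        }
    }

    // Called when a node's queries finish, freeing a slot on its shard.
    fn release(&mut self, shard: &str) {
        if let Some(count) = self.in_flight.get_mut(shard) {
            *count = count.saturating_sub(1);
        }
    }
}
```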
Differential Revision: D26524074
fbshipit-source-id: 19125388c730f5cef7e9de34b5b550efa8e6b825
Summary: Small clean up. Allows us to pass Logger by reference, removing the FIXME in blobrepo factory
Reviewed By: farnz
Differential Revision: D26551592
fbshipit-source-id: d6bb04b8bb3034ad056f071b67b5ae0ce3c6f224
Summary:
The walker mostly checks for duplicates before emitting a new edge, at the same time recording the edge as visited to prevent duplicate edges.
However for derived data where the node may or may not be present, the node isn't considered visited until the node data is successfully loaded and seen in state.rs record_resolved_visit().
In such cases multiple copies of a node could be enqueued, and then we need to run each one.
With this change, where the walker can detect that such a step has completed previously, it will now short circuit the step and return None.
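A minimal sketch of the short circuit (illustrative types, not the real walker step signature):

```rust
use std::collections::HashSet;

// Hypothetical sketch: because duplicates of a derived-data node can be
// enqueued before the node is marked visited, each step first consults
// the visited set and returns None if another copy already completed.
fn run_step(visited: &mut HashSet<String>, node: String) -> Option<String> {
    if visited.contains(&node) {
        return None; // a duplicate of this node already ran; short circuit
    }
    // ... load and process the node here ...
    visited.insert(node.clone());
    Some(node)
}
```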
Differential Revision: D26369917
fbshipit-source-id: c2bdbbabfaa80dbb7cc7d2bc25a17230531ae111
Summary:
Adding a new configuration that instantiates SegmentedChangelog by downloading
a dag from a prebuilt blob. It then updates in process.
Reviewed By: krallin
Differential Revision: D26508428
fbshipit-source-id: 09166a3c6de499d8813a29afafd4dfe19a19a2a5
Summary:
The changes (and fixes) needed were:
- Ignore rules that are not rust_library or thrift_library (previously only rust_bindgen_library was ignored, so binary and test dependencies were incorrectly added to Cargo.toml)
- Thrift package name to match escaping logic of `tools/build_defs/fbcode_macros/build_defs/lib/thrift/rust.bzl`
- Rearrange some attributes, like features, authors, edition etc.
- Authors to use " instead of '
- Features to be sorted
- Sort all dependencies as one instead of grouping third party and fbcode dependencies together
- Manually format certain entries from third-party/rust/Cargo.toml, since V2 formats third party dependency entries and V1 just takes them as is.
Reviewed By: zertosh
Differential Revision: D26544150
fbshipit-source-id: 19d98985bd6c3ac901ad40cff38ee1ced547e8eb
Summary: Extract a function so the next diff is easier to read
Differential Revision: D26424694
fbshipit-source-id: f7a64b1d8a114f81875b791f022eb590a0014605
Summary:
Autocargo V2 will use a more structured format for autocargo field
with the help of `cargo_toml` crate it will be easy to deserialize and handle
it.
Also, the "include" field is apparently obsolete, as it is only used by cargo-publish (see https://doc.rust-lang.org/cargo/reference/manifest.html#the-exclude-and-include-fields). From what I know it might often be wrong, especially if someone tries to publish a package from fbcode, as the private facebook folders might be shipped. Let's just not set it; in the new system one will be able to set it explicitly via an autocargo parameter on a rule.
Reviewed By: ahornby
Differential Revision: D26339606
fbshipit-source-id: 510a01a4dd80b3efe58a14553b752009d516d651
Summary:
The walker mostly checks for duplicates before emitting a new edge, at the same time recording the edge as visited to prevent duplicate edges.
However for derived data where the node may or may not be present, the node isn't considered visited until the node data is successfully loaded and seen in state.rs record_resolved_visit(). This leaves a gap where we could be executing multiple copies of the same node.
Reviewed By: farnz
Differential Revision: D26319139
fbshipit-source-id: 52ce28f15341f132d94ebc1ff5e8ee2f0dc2564a
Summary: Change from .map_ok() to async move to shift code left a bit
Differential Revision: D26366419
fbshipit-source-id: 833066b45702f36a4ce8d579994d1abb2d739f9e
Summary: It's useful to be able to set the repo bounds used for a walk so that issues can be reproduced with the same inputs as a failing chunk.
Differential Revision: D26342439
fbshipit-source-id: b486387be59a3f4d21e3d3dc407420fc339c150d
Summary: The requirement for boxing is in tail.rs so do the boxing there.
Reviewed By: StanislavGlebik
Differential Revision: D26365153
fbshipit-source-id: 65974fd1e90ca9709ba58200a319ffc6b0fae5db
Summary:
Make the walker use blobstore_factory::make_blobstore like other clients do.
This allows blobstore_factory::{make_blobstore_multiplexed, make_blobstore_put_ops} to be hidden.
Reviewed By: krallin
Differential Revision: D25980596
fbshipit-source-id: 2417ed11d4edc611d19e003122acec6b7ebd341d
Summary: Remove these as common cmdlib ones are preferred.
Reviewed By: krallin
Differential Revision: D25976407
fbshipit-source-id: 30295950e51ad1fb2d88cf396b6a7c7353d17577
Summary: It's useful to be able to pass multiple root types e.g. ChangesetInfoMapping and DeletedManifestMapping to a scrub, but clap was set to deny it.
Reviewed By: markbt
Differential Revision: D26279308
fbshipit-source-id: cacc523ee06ccf50bed0b11b73fa8e84e4990eae
Summary: Update walker to use new common cmdlib scrub options if present. These are common across admin, walker and scrub so they were moved up to cmdlib.
Reviewed By: krallin
Differential Revision: D25976408
fbshipit-source-id: 430bb0c6e8b78470afdfc7cebc44c6645492c6fe
Summary: Add option to allow remaining deferred edges at the end of a walker run so that any repos with unresolved edges can still be tailed.
Reviewed By: StanislavGlebik
Differential Revision: D26230927
fbshipit-source-id: 19eed6a616f722d522c7bca30bbe3bc4dae08655