Summary:
Fixing the bulkops fetch size to MAX_FETCH_STEP means we can use the chunk size option to control how many changesets are walked together without affecting query performance.
This allows more accurate lookup of the first referencing commit time for files and manifests, as all commits in a chunk could possibly discover them. With smaller chunks the discovered mtime becomes less approximate, at the possible cost of some walk-rate performance if one runs with very small chunk sizes.
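The split between a fixed fetch step and a caller-chosen walk chunk can be sketched as below. This is a hypothetical illustration, not the walker's real API: `MAX_FETCH_STEP`, `fetch_step`, and `walk_chunks` are stand-in names, and the real code fetches changesets from the metadata database via bulkops.

```rust
// Hypothetical sketch: fetch ids in a fixed large step (MAX_FETCH_STEP) so
// query shape/performance stays constant, while chunk_size only controls how
// many changesets are walked together.
const MAX_FETCH_STEP: u64 = 1000;

// Stand-in for a bulkops range fetch: always asks for up to MAX_FETCH_STEP ids.
fn fetch_step(lo: u64, hi: u64) -> Vec<u64> {
    (lo..hi.min(lo + MAX_FETCH_STEP)).collect()
}

// Walk [lo, hi) in chunks of `chunk_size` (must be > 0), refilling the
// pending buffer from fetch_step as needed.
fn walk_chunks(lo: u64, hi: u64, chunk_size: usize) -> Vec<Vec<u64>> {
    let mut chunks = Vec::new();
    let mut pending: Vec<u64> = Vec::new();
    let mut cursor = lo;
    loop {
        while pending.len() < chunk_size && cursor < hi {
            let fetched = fetch_step(cursor, hi);
            cursor += fetched.len() as u64;
            pending.extend(fetched);
        }
        if pending.is_empty() {
            return chunks;
        }
        let take = pending.len().min(chunk_size);
        chunks.push(pending.drain(..take).collect());
    }
}
```

Smaller chunks mean each file/manifest is first seen from a changeset chunk nearer its creation, at the cost of more walk iterations.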
Differential Revision: D28120030
fbshipit-source-id: 0010d0672288c6cc4e19f5e51fd8b543a087a74a
Summary: Knowing the numeric changeset id is useful in the next diff, where the walker's chunking loads from bulkops in large chunks but then walks commits in smaller chunks.
Differential Revision: D28127581
fbshipit-source-id: c5b3e6c2a94e33833d701540428e1ff4f8898225
Summary: Found this useful while debugging the pack sampling
Differential Revision: D28118243
fbshipit-source-id: d94b0b87125a9863f56f72029c484909a3696329
Summary:
Log the sizing metadata about keys that scrub has seen to the pack info logs.
This uses the sampling blobstore to see all blobstore gets and captures info from them.
Also renames the relatedness_key field to mtime, as that way it's less easily confused with similarity_key.
Differential Revision: D28115620
fbshipit-source-id: 666a444c2d91b0ca5bb225cea971f9b183e6a48d
Summary:
Pass BlobstoreGetData to the sampler so that it has a chance to sample the BlobstoreMetadata as well as the BlobstoreBytes.
This is used in the next diff for sampling the sizing information.
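The shape of the change can be sketched as follows. These are simplified stand-ins, not Mononoke's real `BlobstoreGetData`/`SamplingHandler` definitions: the point is that the sampler now receives the whole get result, so it can read the metadata as well as the bytes.

```rust
// Hedged sketch with stand-in types: the sampler sees metadata + bytes.
pub struct BlobstoreMetadata {
    pub ctime: Option<i64>,
}
pub struct BlobstoreBytes(pub Vec<u8>);
pub struct BlobstoreGetData {
    pub metadata: BlobstoreMetadata,
    pub bytes: BlobstoreBytes,
}

pub trait SamplingHandler {
    // Previously a handler like this would only see the bytes; receiving
    // the full get result gives it the metadata too. Returns the sample
    // line it would log, for illustration.
    fn sample_get(&self, key: &str, value: Option<&BlobstoreGetData>) -> Option<String>;
}

pub struct SizeSampler;
impl SamplingHandler for SizeSampler {
    fn sample_get(&self, key: &str, value: Option<&BlobstoreGetData>) -> Option<String> {
        value.map(|v| {
            format!("key={} size={} ctime={:?}", key, v.bytes.0.len(), v.metadata.ctime)
        })
    }
}
```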
Reviewed By: markbt
Differential Revision: D28115619
fbshipit-source-id: 7a79d482c9ba1ed8b08afab5f1c1b8fe7c4f257a
Summary:
The walker does about 50% of our traffic, but it also has fairly predictable
access patterns. It seems unlikely that we really benefit from logging all
scrub activity with the same precision as we do other traffic.
So, let's sample it. This should make a lot of space in our Scuba table and
make us more resilient to sudden changes in activity.
Reviewed By: StanislavGlebik
Differential Revision: D28254057
fbshipit-source-id: da748a565954c31c2d9e087b7b07747a435427bf
Summary: Upstream crate has landed my PR for zstd 1.4.9 support and made a release, so can remove this patch now.
Reviewed By: ikostia
Differential Revision: D28221163
fbshipit-source-id: b95a6bee4f0c8d11f495dc17b2737c9ac9142b36
Summary:
When scrubbing to collect commit times for path info logging, it's much easier to get correct commit times for manifests by walking from the oldest changeset first. That way, when any manifest/tree is discovered, it is discovered from the changeset chunk closest to its creation.
The alternative would have been to use the path data from linknode-associated changesets to prune which sub-manifests to walk when walking forward, which is more complicated and would require holding more state (or reloading changesets continually).
Differential Revision: D28092314
fbshipit-source-id: 871dc80dd88b63959501dd1018b6466afae5c6c7
Summary:
Previously there were two different paths to HgChangeset. This diff unifies them, so that when walker state.rs is checking for a previous visit it will find that it happened.
For existing walks of changesets in the NewestFirst direction this wasn't causing a problem; however, the next diff in the stack adds support for OldestFirst walks. In the OldestFirst case the mismatch in paths to HgChangeset was leaving a deferred edge to visit when everything should have been visited in previous chunks.
Differential Revision: D28095569
fbshipit-source-id: ccba4a679fc28bde042cfc222e5097c84fa968c0
Summary: Start logging mtime as relatedness key in the walker scrub pack info output
Differential Revision: D28055637
fbshipit-source-id: 4c24c5f2af0414ae7df17ade69bba9ff18861264
Summary:
We used to carry patches for Tokio 0.2 to add support for disabling Tokio coop
(which was necessary to make Mononoke work with it), but this was upstreamed
in Tokio 1.x (as a different implementation), so that's no longer needed. Nobody
else besides Mononoke was using this.
For Hyper we used to carry a patch with a bugfix. This was also fixed in Tokio
1.x-compatible versions of Hyper. There are still users of hyper-02 in fbcode.
However, the patched code path only mattered for servers, and only when accepting
websocket connections; those remaining users are just using Hyper as an HTTP client.
Reviewed By: farnz
Differential Revision: D28091331
fbshipit-source-id: de13b2452b654be6f3fa829404385e80a85c4420
Summary:
This used to be used by Mononoke, but we're now on Tokio 1.x and on
corresponding versions of Gotham so it's not needed anymore.
Reviewed By: farnz
Differential Revision: D28091091
fbshipit-source-id: a58bcb4ba52f3f5d2eeb77b68ee4055d80fbfce2
Summary:
Connect up the scrub stream types so they will be uniform for scrubs that log pack info and those that do not.
This is in preparation for the next diff, which connects up the pack info logging of path hashes to scrub. CI for this diff verifies it has not broken the non-path-tracking case.
Differential Revision: D28031868
fbshipit-source-id: 7bf91eb1778f57487f6a2847f215cf7f5cd2dff7
Summary: This moves evolve_path up to WrappedPathLike so that we can use sample route evolution logic for routes that track paths (e.g. corpus sampling) and path hashes (e.g. scrub, where path hashes take less memory than full paths).
Differential Revision: D28031867
fbshipit-source-id: cdabdc466158a8db1c770536747c996dddb27e71
Summary: Name the fields rather than leave it as a tuple struct. This makes it a bit easier to work with in the rest of the stack
Differential Revision: D28062254
fbshipit-source-id: 9e5202b4d6f1f29d44d98b86aa9b6ddb97d821eb
Summary: Makes more sense for this to be a method on NodeType
Differential Revision: D28031869
fbshipit-source-id: 1ddbafa0d7634ac67fd8d5112e6f57759ed91638
Summary: Name the fields rather than leave it as a tuple struct
Differential Revision: D28031866
fbshipit-source-id: 039f004e0b81294aa6d6b13e79cb45ee2b84567c
Summary: This new trait abstracts across WrappedPath and WrappedPathHash. Later in the stack I make path tracking use this to track either full paths (for corpus sampling) or path hashes (for logging from scrub).
Differential Revision: D28031870
fbshipit-source-id: d1c57230f68fffff179929a3cb92c82d92e0588c
Summary:
The changesets object is only valid to access the changesets of a single repo
(other repos may have different metadata database config), so it is pointless
for all methods to require the caller to provide the correct one. Instead,
make the changesets object remember the repo id.
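The before/after API shape can be sketched like this. It is an illustrative simplification, not the real `Changesets` trait: the real object is async and backed by a metadata database, but the point is the same, the repo id moves from every call site into the constructor.

```rust
// Illustrative sketch: the changesets object captures its repo id at
// construction, so methods no longer take a repo_id parameter.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct RepositoryId(pub i32);

pub struct Changesets {
    repo_id: RepositoryId,
}

impl Changesets {
    pub fn new(repo_id: RepositoryId) -> Self {
        Self { repo_id }
    }

    pub fn repo_id(&self) -> RepositoryId {
        self.repo_id
    }

    // Before the change this would have been get(repo_id, cs_id); a caller
    // could pass the wrong repo. Now the object knows which repo it serves.
    pub fn get(&self, cs_id: u64) -> String {
        // A real implementation would query the database configured for
        // self.repo_id; here we only show the API shape.
        format!("repo={} cs={}", self.repo_id.0, cs_id)
    }
}
```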
Reviewed By: krallin
Differential Revision: D27430611
fbshipit-source-id: bf2c398af2e5eb77c1c7c55a89752753020939ab
Summary:
The `get_sql_changesets` method on `Changesets` is an abstraction violation,
and prevents extraction of `SqlChangesets` to a separate crate as it would
introduce a circular dependency.
It is used to allow bulk queries to enumerate changesets by integer unique ID,
so promote this to a full feature of `changesets`, and remove the
`get_sql_changesets` method.
Reviewed By: krallin
Differential Revision: D27426921
fbshipit-source-id: 2839503029b262dd5e6a8be09bb35bb143b4c5ac
Summary:
NOTE: there is one final pre-requisite here, which is that we should default all Mononoke binaries to `--use-mysql-client` because the other SQL client implementations will break once this lands. That said, this is probably the right time to start reviewing.
There's a lot going on here, but Tokio updates being what they are, it has to happen as just one diff (though I did try to minimize churn by modernizing a bunch of stuff in earlier diffs).
Here's a detailed list of what is going on:
- I had to add a number of `cargo_toml_dir` entries for binaries in `eden/mononoke/TARGETS`, because we have to use 2 versions of Bytes concurrently at this time, and the two cannot co-exist in the same Cargo workspace.
- Lots of little Tokio changes:
- Stream abstractions moving to `tokio-stream`
- `tokio::time::delay_for` became `tokio::time::sleep`
- `tokio::sync::watch::Sender::broadcast` became `tokio::sync::watch::Sender::send`
- `tokio::sync::Semaphore::acquire` returns a `Result` now.
- `tokio::runtime::Runtime::block_on` no longer takes a `&mut self` (just a `&self`).
- `Notify` grew a few more methods with different semantics. We only use this in tests, I used what seemed logical given the use case.
- Runtime builders have changed quite a bit:
- My `no_coop` patch is gone in Tokio 1.x, but it has a new `tokio::task::unconstrained` wrapper (also from me), which I included on `MononokeApi::new`.
- Tokio now detects your logical CPUs, not physical CPUs, so we no longer need to use `num_cpus::get()` to figure it out.
- Tokio 1.x now uses Bytes 1.x:
- At the edges (i.e. streams returned to Hyper or emitted by RepoClient), we need to return Bytes 1.x. However, internally we still use Bytes 0.5 in some places (notably: Filestore).
- In LFS, this means we make a copy. We used to do that a while ago anyway (in the other direction) and it was never a meaningful CPU cost, so I think this is fine.
- In Mononoke Server it doesn't really matter because that still generates ... Bytes 0.1 anyway, so there was a copy before (from 0.1 to 0.5), and now it's from 0.1 to 1.x.
- In the very few places where we read stuff using Tokio from the outside world (historical import tools for LFS), we copy.
- tokio-tls changed a lot, they removed all the convenience methods around connecting. This resulted in updates to:
- How we listen in Mononoke Server & LFS
- How we connect in hgcli.
- Note: all this stuff has test coverage.
- The child process API changed a little bit. We used to have a ChildWrapper around the hg sync job to make a Tokio 0.2.x child look more like a Tokio 1.x Child, so now we can just remove this.
- Hyper changed their Websocket upgrade mechanism (you now need the whole `Request` to upgrade, whereas before you needed just the `Body`), so I changed up our code a little bit in Mononoke's HTTP acceptor to defer splitting up the `Request` into parts until after we know whether we plan to upgrade it.
- I removed the MySQL tests that didn't use mysql client, because we're leaving that behind and don't intend to support it on Tokio 1.x.
Reviewed By: mitrandir77
Differential Revision: D26669620
fbshipit-source-id: acb6aff92e7f70a7a43f32cf758f252f330e60c9
Summary:
MyRouter is no longer used by Mononoke services; it is deprecated and will stop working when we upgrade Tokio.
This diff removes MyRouter support from Mononoke and simplifies the Mysql connection type struct.
Before we had `MysqlOptions` and `MysqlConnectionType` enum to represent what kind of a client we want to use. Now we use only MySQL FFI so I removed `MysqlConnectionType` completely and put everything into the options struct.
As setting up the connections (aka conn pool) is not an async operation, some of the methods don't need to be async anymore. Because this diff is already enormous, I'm refactoring this in the next one.
Reviewed By: StanislavGlebik
Differential Revision: D28007850
fbshipit-source-id: 32c3740f4bb132f06e1e256b0530ace755446cdd
Summary:
HgManifestFileNode is one of the last remaining types we don't walk (the other known one is the git derived data).
It is added as a separate NodeType from HgFileNode because HgManifestFileNode is used much less and users may want to see only the HgFileNodes. Server side, the manifest file node is only used to build the bundles returned to the client.
Differential Revision: D28010248
fbshipit-source-id: ce4c773b0f1996df308f1b271890f29947c2c304
Summary:
First, some background on the existing WrappedPath type: in Mononoke a path is an Option<MPath>, where None == Root and Some(MPath) == NonRoot. This means that where a path may or may not be present one needs a double Option, Option<Option<MPath>>, so that Root is Some(None).
To reduce the need for double Option, and subsequently to allow for newtype features like memoization, the walker has WrappedPath, so we can use Option<WrappedPath> instead.
This change introduces a similar type, WrappedPathHash, for MPathHash, which means that the sample_fingerprint for WrappedPath can now be non-optional, as even root paths/manifests can have a sample_fingerprint.
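The double-Option problem and the newtype fix can be sketched as follows. `MPath` here is a minimal stand-in for Mononoke's non-empty path type, and this `WrappedPath` is a simplified model of the walker's real one (which also memoizes hashes).

```rust
// Stand-in for Mononoke's non-empty path type.
#[derive(Clone, PartialEq, Debug)]
struct MPath(String);

// Without a wrapper, "maybe a path" needs Option<Option<MPath>>:
//   None          => path absent
//   Some(None)    => repo root
//   Some(Some(p)) => non-root path
fn describe(raw: &Option<Option<MPath>>) -> &'static str {
    match raw {
        None => "absent",
        Some(None) => "root",
        Some(Some(_)) => "non-root",
    }
}

// With a newtype, a single Option suffices: Option<WrappedPath>.
#[derive(Clone, PartialEq, Debug)]
enum WrappedPath {
    Root,
    NonRoot(MPath),
}

fn describe_wrapped(p: &Option<WrappedPath>) -> &'static str {
    match p {
        None => "absent",
        Some(WrappedPath::Root) => "root",
        Some(WrappedPath::NonRoot(_)) => "non-root",
    }
}
```

Because `Root` is now an ordinary variant rather than `None`, newtype features like a sample fingerprint can apply uniformly to root and non-root paths.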
Reviewed By: mitrandir77
Differential Revision: D27995143
fbshipit-source-id: b674abd4ec94749f4f5797c697ae7381e1a08d02
Summary:
This adds the first part of new logging from the walker that can be used to gather details on what keys might make sense to pack together.
Unlike the corpus command, which dumps file content by path (useful for analysis of compression approaches), this adds logging to the scrub command that includes the path hash rather than the full path. This should keep memory usage down during the run, hopefully let us log from existing scrub jobs, and make the logs more compact.
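The memory argument is simply that a hash is fixed-size while a path is not. A minimal sketch using the standard library's hasher (the walker's real WrappedPathHash is a different, content-addressed type):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative only: logging a fixed-size u64 per path instead of the path
// itself, so memory and log size do not grow with path depth.
fn path_hash(path: &str) -> u64 {
    let mut h = DefaultHasher::new();
    path.hash(&mut h);
    h.finish()
}
```

Equal paths always hash equally, which is all the pack-info analysis needs to group keys by path without retaining the paths themselves.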
Reviewed By: mitrandir77
Differential Revision: D27974448
fbshipit-source-id: 47b55112b47e9b022f16fbb473cf233a7d46bcf3
Summary:
Update the zstd crates.
This also patches async-compression crate to point at my fork until upstream PR https://github.com/Nemo157/async-compression/pull/117 to update to zstd 1.4.9 can land.
Reviewed By: jsgf, dtolnay
Differential Revision: D27942174
fbshipit-source-id: 26e604d71417e6910a02ec27142c3a16ea516c2b
Summary: It's more efficient to bulk load the mappings for a chunk than to do the queries one by one.
Differential Revision: D27801830
fbshipit-source-id: 9c38ddfb1c1d827fc3028cd09f9ad51e3cbee5dc
Summary: Add an accessor so that we keep a reverse mapping of the WalkState::bcs_to_hg member as a cache of bonsai to hg mappings and also populate it on derivations.
Differential Revision: D27800533
fbshipit-source-id: f9b1c279a78ce3791013c3c83a32251fdc3ad77f
Summary: Add an accessor so that we can use the WalkState::hg_to_bcs member as a cache of hg to bonsai mappings
Reviewed By: farnz
Differential Revision: D27797638
fbshipit-source-id: 44322e93849ea78b255b2e3cb05feb8db6b4c7a7
Summary:
There is a very frustrating operation that happens often when working on the
Mononoke code base:
- You want to add a flag
- You want to consume it in the repo somewhere
Unfortunately, when we need to do this, we end up having to thread this from a
million places and parse it out in every single main() we have.
This is a mess, and it results in every single Mononoke binary starting with
heaps of useless boilerplate:
```
let matches = app.get_matches();
let (caching, logger, mut runtime) = matches.init_mononoke(fb)?;
let config_store = args::init_config_store(fb, &logger, &matches)?;
let mysql_options = args::parse_mysql_options(&matches);
let blobstore_options = args::parse_blobstore_options(&matches)?;
let readonly_storage = args::parse_readonly_storage(&matches);
```
So, this diff updates us to just use MononokeEnvironment directly in
RepoFactory, which means none of that has to happen: we can now add a flag,
parse it into MononokeEnvironment, and get going.
While we're at it, we can also remove blobstore options and all that jazz from
MononokeApiEnvironment since now it's there in the underlying RepoFactory.
Reviewed By: HarveyHunt
Differential Revision: D27767700
fbshipit-source-id: e1e359bf403b4d3d7b36e5f670aa1a7dd4f1d209
Summary:
ScrubOptions normally represents options we parsed from the CLI, but right now
we abuse this a little bit to throw a ScrubHandler into them, which we
sometimes mutate before using this config.
In this stack, I'm unifying how we pass configs to RepoFactory, and this little
exception doesn't really fit. So, let's change this up, and make ScrubHandler
something you may give the RepoFactory if you're so inclined.
Reviewed By: HarveyHunt
Differential Revision: D27767699
fbshipit-source-id: fd38bf47eeb723ec7d62f8d34e706d8581a38c43
Summary:
Basically every single Mononoke binary starts with the same preamble:
- Init mononoke
- Init caching
- Init logging
- Init tunables
Some of them forget to do it, some don't, etc. This is a mess.
To make things messier, our initialization consists of a bunch of lazy statics
interacting with each other (init logging & init configerator are kinda
intertwined due to the fact that configerator wants a logger but dynamic
observability wants a logger), and methods you must only call once.
This diff attempts to clean this up by moving all this initialization into the
construction of MononokeMatches. I didn't change all the accessor methods
(though I did update those that would otherwise return things instantiated at
startup).
I'm planning to do a bit more on top of this, as my actual goal here is to make
it easier to thread arguments from MononokeMatches to RepoFactory, and to do so
I'd like to just pass my MononokeEnvironment as an input to RepoFactory.
Reviewed By: HarveyHunt
Differential Revision: D27767698
fbshipit-source-id: 00d66b07b8c69f072b92d3d3919393300dd7a392
Summary:
Use `RepoFactory` to construct repositories in the walker.
The walker previously had special handling to allow repositories to
share metadata database and blobstore connections. This is now
implemented in `RepoFactory` itself.
Reviewed By: krallin
Differential Revision: D27400616
fbshipit-source-id: e16b6bdba624727977f4e58be64f8741b91500da
Summary:
We weren't validating any hashes in the walker, so it was entirely possible to
have corrupt content in the repo. Let's start hash validation by validating hg
filenodes.
Reviewed By: ahornby
Differential Revision: D27459516
fbshipit-source-id: 495d59436773d76fcd2ed572e1e724b01012be93
Summary:
When going into a root unode we check whether fastlog is derived. If it's not
derived we don't even try to check fastlog dirs/files, and that's correct.
However, that means that if the fastlog batch for a given commit is derived,
then we could (and should!) check that all fastlog batches referenced from this
commit exist.
Previously we weren't doing that and this diff fixes it. Same applies for hg
filenode.
Reviewed By: ahornby
Differential Revision: D27401597
fbshipit-source-id: 71ad2744eee33208c44447163cf77bc95ffe98d0
Summary:
Currently we treat missing just as any other error, and that makes it hard to
have an alarm on blobs that are missing, because any transient error might make
this alarm go off.
Let's instead differentiate between a blob being missing and failing to fetch a
blob because of any other error (e.g. Manifold is unavailable). It turned out this
is not too difficult to do because a lot of the types implement Loadable trait,
which has Missing variant in its errors.
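A minimal model of the distinction (hedged: this is not Mononoke's actual Loadable error type, just the shape of separating the two failure modes so an alarm can key on Missing alone):

```rust
// Sketch: "missing" is its own variant, distinct from transient failures.
#[derive(Debug, PartialEq)]
pub enum LoadError {
    Missing(String), // the key definitely does not exist in the blobstore
    Error(String),   // any other failure, e.g. the store was unavailable
}

// Classify a fetch result the way an alarm would want to see it.
pub fn classify(result: Result<Vec<u8>, LoadError>) -> &'static str {
    match result {
        Ok(_) => "ok",
        Err(LoadError::Missing(_)) => "missing",   // possible data loss: alarm
        Err(LoadError::Error(_)) => "transient",   // retry; don't page
    }
}
```

With this split, a transient storage outage no longer trips the missing-blob alarm.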
side-note:
It looks like we have 3 steps which treat missing blobs as not being an
error. These steps are:
1) fastlog_dir_step
2) fastlog_file_step
3) hg_filenode_step
I think this is wrong, and I plan to fix it in the next diff
Reviewed By: ahornby
Differential Revision: D27400280
fbshipit-source-id: e79fff25c41e4d03d77b72b410d6d2f0822c28fd
Summary:
This situation is not normally possible - every hg changeset should have a
corresponding bonsai. So let's return an error if that's not the case.
Reviewed By: farnz
Differential Revision: D27400281
fbshipit-source-id: 4b01b973eeef0e3336c187fb90dd2ab4853b5c02
Summary: Record some more stats so we can see the last finish time. Also record update stats for run and chunk number so we can see how far along a run is.
Differential Revision: D26949482
fbshipit-source-id: 5e7df4412c25149559883b6e15afa70e1c670cdc
Summary:
The existing query to establish HgChangesetId on the path to FileContentMetadata for LFS validation is quite complex, using HgFilenode linknodes.
This change adds an optional edge from BonsaiHgMappingToHgBonsaiMapping that can be used to simplify the LFS validation case and load less data to get there.
Reviewed By: mitrandir77
Differential Revision: D26975799
fbshipit-source-id: 799acb8228721c1878f33254ebfa5e6345673e5d
Summary:
Like it says in the title. We probably need to find a better way to make sure
we have this everywhere, but for now let's add it here because it's missing.
Reviewed By: StanislavGlebik
Differential Revision: D26949401
fbshipit-source-id: 9325e0a367a1a41fed8997a3a13e0764b9d77e2f
Summary:
Walker args to correctly track the originating HgChangeset through history are a bit complicated; add a test for them.
Test showed a bug that HgBonsaiMapping wasn't being tracked in ValidateRoute, so added support for that.
Reviewed By: mitrandir77
Differential Revision: D26945254
fbshipit-source-id: 372574b5e9cde530ba8aecaf1bdc7c7d8aaee54b
Summary: Surprisingly this wasn't already in the glog output. Added it so it's easier to correlate with other logs.
Reviewed By: StanislavGlebik
Differential Revision: D26946047
fbshipit-source-id: b2f6b8097bd1ea6e18a79aa9ac0363582c858d55