Summary: In preparation for multi-repo support, get the repo name into ErrorKind::NotTraversable
Reviewed By: StanislavGlebik
Differential Revision: D25541444
fbshipit-source-id: 8fd99d5d3f144d8a3a72c7c33205ae58bd5f1ae2
Summary:
In preparation for having the walker able to scrub multiple repos at once, define parameter structs. This also simplifies the code in tail.rs.
The param objects are:
* RepoSubcommandParams - per-repo params that can be set up in setup_common and are consumed in the subcommand. They don't get passed through to the walk
* RepoWalkParams - per-repo params that can be set up in setup_common and will get passed all the way into the walk.rs methods
* JobWalkParams - per-job params that can be set up in setup_common and will get passed all the way into the walk.rs methods
* TypeWalkParams - per-repo params that need to be set up in the subcommands, and are passed all the way into walk.rs
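As a sketch, the split might look like the following (the field names here are illustrative assumptions, not the actual fields):

```rust
// Illustrative sketch of the parameter split; all field names are assumptions.

/// Per-repo params consumed in the subcommand; not passed through to the walk.
pub struct RepoSubcommandParams {
    pub progress_interval_secs: u64,
}

/// Per-repo params passed all the way into the walk.rs methods.
pub struct RepoWalkParams {
    pub repo_name: String,
    pub scheduled_max: usize,
}

/// Per-job params shared across repos, also passed into walk.rs.
pub struct JobWalkParams {
    pub enable_derive: bool,
}

/// Per-repo params set up in the subcommands and passed into walk.rs.
pub struct TypeWalkParams {
    pub include_node_types: Vec<String>,
}
```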
Reviewed By: StanislavGlebik
Differential Revision: D25524256
fbshipit-source-id: bfc8e087e386b6ed45121908b48b6535f65debd3
Summary: Parsing of progress options and sampling options was the same in each subcommand; move it to functions in setup.rs.
Reviewed By: StanislavGlebik
Differential Revision: D25524255
fbshipit-source-id: a2f48814f24aa9b3a158cb7d4abbfc2c0c338305
Summary: Simplify open_blobrepo_given_datasources to take fewer arguments, and make it possible to pass the sql_factory by reference.
Reviewed By: krallin
Differential Revision: D25524254
fbshipit-source-id: c324127f42c53a52f388d303e310014f4fa0d7bb
Summary: Allows the walker blobstore code to be used by more than one blobrepo. This is a step to reduce the number of jobs needed to scrub small repos.
Reviewed By: StanislavGlebik
Differential Revision: D25422937
fbshipit-source-id: e2d11239f172f50680bb6e10dd60026c9e6c3c3d
Summary:
By doing the hg-to-hg steps via bonsai, I can later introduce a check that the bonsai is in the current chunk of commits to be processed, as part of allowing walker checkpoint and restart.
On its own this is a minor change to the number of nodes the walk will cover as seen in the updated tests.
Reviewed By: krallin
Differential Revision: D25394085
fbshipit-source-id: 3e50cf76c7032635ce9e6a7375228979b2e9c930
Summary: This is in preparation for all walker hg-to-hg steps (e.g. HgChangeset to Parent HgChangeset) going via bonsai, which without this would continually check if the filenodes are derived.
Reviewed By: krallin
Differential Revision: D25394086
fbshipit-source-id: bb75e7ddf5b09f9d13a0f436627f4c3c95e24430
Summary: Because the mysql connection pool options had both `conflicts_with(myrouter)` and default values, the binary always failed if the myrouter option was provided.
Differential Revision: D25639679
fbshipit-source-id: 21ebf483d4ee88a05db519a14b7e2561b3089ad1
Summary:
A bit of history first. For some time we had a problem in our cross repo sync
library where it used the "current" commit sync version, where "current" meant
"the latest commit sync config version that was added". That was incorrect, and
we migrated away from this model, however there were still a few places that
used get_current_mover_DEPRECATED() mover.
Removing this method from a test file is easy, but it's trickier for
sync_diamond_merge tool. This tool is used to sync a diamond merge from a small
repo to a large repo (this is non-trivial to do, so we don't do it
automatically). To make things simpler this diff requires all involved commits
(i.e. both parents, where to rebase (onto) and root of the diamond merge) to
have the same commit sync config versions.
Reviewed By: markbt
Differential Revision: D25612492
fbshipit-source-id: 6483eed9698551920fb1cf240218db7b7e78f7bd
Summary:
The correct workflow for using a multi-threaded connection pool for multiple DBs is to have a single shared pool for all the use-cases. The pool is smart enough to maintain separate "pools" for each DB locator and limit each to a maximum of 100 connections per key.
In this diff I create a `OnceCell` connection pool that is initialized once and reused for every attempt to connect to the DB.
The pool is stored in `MononokeAppData` in order to bind its lifetime to the lifetime of Mononoke app. Then it is passed down as a part of `MysqlOptions`. Unfortunately this makes `MysqlOptions` not copyable, so the diff also contains lots of "clones".
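A minimal sketch of the initialize-once pattern, using the standard library's `OnceLock` as a stand-in for the `OnceCell` type used in the diff (the pool type and its field are assumptions):

```rust
use std::sync::OnceLock;

/// Stand-in for the real Mysql connection pool type.
#[derive(Debug)]
pub struct ConnectionPool {
    pub max_conn_per_key: usize,
}

static POOL: OnceLock<ConnectionPool> = OnceLock::new();

/// Initialize the pool on first use; every later call returns the same
/// shared instance, so all DB use-cases go through one pool.
pub fn shared_pool() -> &'static ConnectionPool {
    POOL.get_or_init(|| ConnectionPool { max_conn_per_key: 100 })
}
```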
Reviewed By: ahornby
Differential Revision: D25055819
fbshipit-source-id: 21f7d4a89e657fc9f91bf22c56c6a7172fb76ee8
Summary:
In the next diff I'm going to add Mysql connection object to `MysqlOptions` in order to pass it down from `MononokeAppData` to the code that works with sql.
This change will make MysqlOptions un-copyable.
This diff fixes all issues produced by the change.
Reviewed By: ahornby
Differential Revision: D25590772
fbshipit-source-id: 440ae5cba3d49ee6ccd2ff39a93829bcd14bb3f1
Summary:
The benchmark_filestore XDB subcommand uses mysql and has the option of using either myrouter or mysql. In this diff I used the `args::parse_mysql_options` function to parse the arguments instead of processing them manually, and get a `MysqlOptions` object.
This is needed later to pass a connection pool object through the `MysqlOptions` struct (see the next diff).
Reviewed By: ahornby
Differential Revision: D25587898
fbshipit-source-id: 66fcfd98ad8f3f9e285ca9635d8f625aa680d7ff
Summary:
Like it says in the title. This is nice to do because we had old futures
wrapping new futures here, so this lets us get rid of a lot of cruft.
Reviewed By: ahornby
Differential Revision: D25502648
fbshipit-source-id: a34973b32880d859b25dcb6dc455c42eec4c2f94
Summary:
This was kinda almost done. Might as well finish it by updating what's left,
i.e. the tests.
Reviewed By: ahornby
Differential Revision: D25498799
fbshipit-source-id: 65b7b144f5cf86d5f1754f5c7dafe373173b5ece
Summary: Let's not spawn too many futures at once
Reviewed By: markbt
Differential Revision: D25612069
fbshipit-source-id: e48901b981b437f66573a1abfba08eb144af2377
Summary: Forgot to add them when I wrote the test. Let me add them now.
Differential Revision: D25611802
fbshipit-source-id: 0db7bee2034ad6e1566c5eb6de2e80e18140d757
Summary: Convert all BlobRepoHg methods to new type futures
Reviewed By: StanislavGlebik
Differential Revision: D25471540
fbshipit-source-id: c8e99509d39d0e081d082097cbd9dbfca431637e
Summary:
The goal of this stack is to start logging commits to scribe even if a commit was
introduced by scs create_commit/move_bookmark api. Currently we don't do that.
Initially I had bigger plans and I wanted to log to scribe only from bookmarks_movement and remove scribe logging from unbundle/processing.rs, but it turned out to be trickier to implement. In general, the approach we use right now, where in order to log to scribe we need to put a `log_commit_to_scribe` call in all the places that can possibly create commits/move bookmarks, seems wrong, but changing it is a bit hard. So for now I decided to solve the main problem we have, which is the fact that we don't get scribe logs from repos where bookmarks are moved via scs methods.
To fix that I added an additional option to the CreateBookmark/UpdateBookmark structs. If this option is set to true then before moving/creating the bookmark it finds all draft commits that are going to be made public by the bookmark move, i.e. all draft ancestors of the new bookmark destination. It is unfortunate that we have to do this traversal on the critical path of the move_bookmark call, but in practice I hope it won't be too bad, since we do a similar traversal to record bonsai<->git mappings. In case my hopes are wrong we have scuba logging, which should make it clear that this is an expensive operation, and we also have a tunable to disable this behaviour.
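The draft-ancestor traversal described above can be sketched as a walk over parents that stops at public commits (the graph representation and names here are assumptions, not the real changeset types):

```rust
use std::collections::{HashMap, HashSet};

/// Walk back from the new bookmark destination, collecting the draft
/// commits that the bookmark move would make public. The walk stops at
/// commits that are already public.
pub fn draft_ancestors(
    parents: &HashMap<&'static str, Vec<&'static str>>,
    public: &HashSet<&'static str>,
    dest: &'static str,
) -> Vec<&'static str> {
    let mut seen = HashSet::new();
    let mut stack = vec![dest];
    let mut drafts = Vec::new();
    while let Some(cs) = stack.pop() {
        if !seen.insert(cs) || public.contains(cs) {
            continue;
        }
        drafts.push(cs);
        if let Some(ps) = parents.get(cs) {
            stack.extend(ps.iter().copied());
        }
    }
    drafts
}
```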
Also note that we don't use PushParams::commit_scribe_category. This is intentional - PushParams::commit_scribe_category doesn't seem useful at all, and I plan to deprecate it later in the stack. Even later it would make sense to deprecate PushrebaseParams::commit_scribe_category as well, and put the commit_scribe_category option in another place in the config.
Reviewed By: markbt
Differential Revision: D25558248
fbshipit-source-id: f7dedea8d6f72ad40c006693d4f1a265977f843f
Summary:
Those messages like "pulling from ...", "added n commits ..." belong to stderr.
This makes it possible for us to turn on verbose output for auto pull, without
breaking tools that parse stdout.
Reviewed By: sfilipco
Differential Revision: D25315955
fbshipit-source-id: 933f631610840eb5f603ad817f7560c78b19e4ad
Summary:
When tailing to fill or backfill derived data, omit checking the heads from the
previous round of derivation, as we know for sure they've been derived.
Reviewed By: krallin
Differential Revision: D25465445
fbshipit-source-id: 384c7e67e99c561ce6aae324070e7c274c56b736
Summary:
Rustfmt gives up on formatting if strings are too long. Split the long help
strings so that the formatter works again.
Tidy up some of the help text while we're at it.
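The split can be done with `concat!`, which keeps the help text a single string literal at compile time while giving rustfmt short lines to work with (the help text below is made up for illustration):

```rust
// Rustfmt will not reformat a single over-long string literal, but it
// handles concat! arguments fine. The text itself is illustrative.
const BACKFILL_HELP: &str = concat!(
    "Backfill derived data for all commits reachable from the heads, ",
    "slicing the repository by generation number so that interrupted ",
    "runs can resume at the next slice."
);
```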
Reviewed By: krallin
Differential Revision: D25465443
fbshipit-source-id: 360dbedc1e3e2ffbc489a9d9cba008835bce506f
Summary:
ChangesetInfo derivation doesn't depend on the parent ChangesetInfo being
available, so we can derive them in batches very easily.
Reviewed By: krallin
Differential Revision: D25470721
fbshipit-source-id: cc8ce305990eb6c9846158f0e9e3917cf35e169d
Summary:
Add documentation comments to the derived data crate to describe how they fit
together.
Reviewed By: krallin
Differential Revision: D25432449
fbshipit-source-id: b62440bcecae900ad75d74245ce175bd9e07a894
Summary:
Using `backfill-all` on very large repositories is slow to get started and slow
to resume, as it must traverse the repository history all the way to the start
before it can even begin.
Make this more usable by using the skiplist index to slice the repository into
reasonably sized slices of heads with the same range of generation numbers.
Each slice is then derived in turn. If interrupted, derivation can continue
at the next slice more quickly.
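The slicing idea can be sketched as grouping heads into fixed-size generation ranges; the real implementation uses the skiplist index, so this is an assumption-laden simplification:

```rust
use std::collections::BTreeMap;

/// Group (head, generation) pairs into slices covering `slice_size`
/// generations each. Slices come back lowest-generation-first, matching
/// the order in which they would be derived.
pub fn slice_heads(
    heads: &[(&'static str, u64)],
    slice_size: u64,
) -> Vec<Vec<&'static str>> {
    let mut slices: BTreeMap<u64, Vec<&'static str>> = BTreeMap::new();
    for &(head, generation) in heads {
        slices.entry(generation / slice_size).or_default().push(head);
    }
    slices.into_values().collect()
}
```

If derivation is interrupted, restarting only repeats work within the current slice rather than re-traversing the whole history.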
Reviewed By: krallin
Differential Revision: D25371968
fbshipit-source-id: f150ea847f9fbbe84852587d620ae37ba2c58f28
Summary:
Right now, when we upload a hg commit, we check that we have all the content
the client is referencing.
The only problem is we do this by checking that the filenodes the client
mentions exist, but the way we store filenodes right now is we write them
concurrently with content blobs, so it is in fact possible to have a filenode
that references a piece of content that doesn't actually exist.
That isn't quite what one might call satisfactory when it comes to checking the
content does in fact exist, so this diff updates our content checking.
In practice, with the way Mononoke works right now this should be quite free:
the client uploads everything all the time, and then we check later, so this
will just hit in the blobstore cache.
In a future world where clients don't upload stuff they already know we have,
that could be slower, but doing it the way we did it before is simply not
correct, so that's not much better. The ways to make it faster would be:
- Trust that we'll hit in cache when checking for presence (could work out).
- Have the client prove to us that we have this content, and thread it through.
To do the latter, IMO the code here could actually look at the entries that
were actually uploaded, and not check them for presence again, but right now we
have a few layers of indirection that make this a bit tricky (technically, when
`process_one_entry` gets called, that means "I uploaded this", but nothing in
the signature of any of the functions involved really enforces that).
Reviewed By: StanislavGlebik
Differential Revision: D25422596
fbshipit-source-id: 3cf34d38bd6ed1cd83d93c778f04395c942b26c0
Summary:
Add a resolve_repos function to cmdlib::args for use from jobs that will run for multiple repos at once.
Planning to use this from the walker to scrub multiple small repos from a single process.
Differential Revision: D25422755
fbshipit-source-id: 40e5d499cf1068878373706fdaa72effd27e9625
Summary:
This diff does a small refactoring to hopefully make the code a bit clearer.
Previously we were calling log_commits_to_scribe in force_pushrebase and in
run_pushrebase, so log_commits_to_scribe was called twice. It didn't mean that
the commits were logged twice but it was hard to understand just by looking at
the code.
Let's make it a bit clearer.
Reviewed By: krallin
Differential Revision: D25555712
fbshipit-source-id: bed9754b1645008846a86da665b6f3f3483f30da
Summary:
I'm going to do a refactoring later in the stack, so let's add a test to avoid
regressions.
Reviewed By: krallin
Differential Revision: D25535655
fbshipit-source-id: 5ec6633c9c8c25d1affcede0adbc27dd43c48736
Summary:
Add open_existing_sqlite_path so we don't do create_dir_all when we know the db already exists.
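The shape of the change can be sketched with the standard library; the real code goes through the sql crates, so `open_existing` and its return type here are illustrative stand-ins:

```rust
use std::io;
use std::path::Path;

/// Fail if the database file does not already exist, instead of doing
/// create_dir_all and silently creating an empty database the way the
/// create path does.
pub fn open_existing(path: &Path) -> io::Result<std::fs::File> {
    if !path.is_file() {
        return Err(io::Error::new(
            io::ErrorKind::NotFound,
            format!("sqlite db not found: {}", path.display()),
        ));
    }
    std::fs::File::open(path)
}
```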
Noticed it in passing while investigating something else.
Reviewed By: markbt
Differential Revision: D25469502
fbshipit-source-id: 9810489c84220927937c037d69f5e8e70f2d9038
Summary:
If Rust LFS is in use, we currently don't upload LFS blobs to commit cloud.
This is problematic because if you're going to Mononoke that means you can't
upload, and if you're going to Mercurial that means you're silently not backing
up data.
Reviewed By: StanislavGlebik
Differential Revision: D25537672
fbshipit-source-id: fd61f5a69450c97a0bc0895193f67fd22c9773fb
Summary:
The old case-conflict checks were more lenient, and only triggered if a commit
introduced a case conflict compared to its first parent.
This means that commits could still be landed to bookmarks that already had
pre-existing case conflicts.
Relax the new case-conflict checks to allow this same scenario.
Note that we're still a bit more strict: the previous checks ignored other
parents, and would not reject a commit if the act of merging introduces a case
conflict. The new case conflict checks only permit case conflicts in the case
where all conflicting files were present in one of the parents.
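The relaxed rule can be sketched as: a lowercased-path collision is only rejected when the conflicting group was not already fully present in some parent (the function and its inputs are illustrative assumptions, not the real manifest types):

```rust
use std::collections::{HashMap, HashSet};

/// Return the case-conflicting groups in `files` that are NOT fully
/// contained in any single parent manifest, i.e. conflicts this commit
/// would newly introduce. Pre-existing conflicts are permitted.
pub fn new_case_conflicts(
    files: &[&'static str],
    parents: &[HashSet<&'static str>],
) -> Vec<Vec<&'static str>> {
    let mut by_lower: HashMap<String, Vec<&'static str>> = HashMap::new();
    for &f in files {
        by_lower.entry(f.to_lowercase()).or_default().push(f);
    }
    let mut conflicts: Vec<Vec<&'static str>> = by_lower
        .into_values()
        .filter(|group| group.len() > 1)
        // Permit the conflict if every variant was already in one parent.
        .filter(|group| !parents.iter().any(|p| group.iter().all(|f| p.contains(f))))
        .collect();
    conflicts.sort();
    conflicts
}
```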
Reviewed By: StanislavGlebik
Differential Revision: D25508845
fbshipit-source-id: 95f4db1300ee73b8e6495ba8b5c1c2ce5a957d1a
Summary: Spotted this in passing. Was able to remove a call to fetch_root_manifest_id.
Reviewed By: StanislavGlebik
Differential Revision: D25472678
fbshipit-source-id: d450cb97630464be13d22fb37c3356611dc2e1b6
Summary: This makes it easier to run full walks on small repos.
Reviewed By: StanislavGlebik
Differential Revision: D25469485
fbshipit-source-id: 6e5b1426837a396d939e47a5b353e615437ae7cb
Summary: Made unnecessary by using 2018 edition for rust_bindgen_library targets (D25441322).
Reviewed By: jsgf
Differential Revision: D25441329
fbshipit-source-id: d00aaad09451c77c6d05ed5d671468a481ce4e25
Summary:
Some of our tests are marked as flaky because of timeouts. They usually run
under the timeout, but because they are integration tests the times vary a lot.
The `#require slow` marks those tests as slow, giving them more time to run.
Reviewed By: krallin
Differential Revision: D25495555
fbshipit-source-id: 02bc3755992f56f5e743835318cf1233ab7c623d
Summary:
Now that the mapping is separated from BonsaiDerivable, it becomes clear where
batch derivation is incorrectly using the default mapping, rather than the
mapping that has been provided for batch-derivation.
This could mean, for example, if we are backfilling a v2 of these derived
data types, we could accidentally base them on a v1 parent that we obtained
from the default mapping.
Instead, ensure that we don't use `BonsaiDerived` and thus can't use the
default mapping.
Reviewed By: krallin
Differential Revision: D25371963
fbshipit-source-id: fb71e1f1c4bd7a112d3099e0e5c5c7111d457cd2
Summary:
The backfiller may read or write to the blobstore too quickly. Apply QPS
limits to the backfill batch context to keep the read or write rate acceptable.
Reviewed By: ahornby
Differential Revision: D25371966
fbshipit-source-id: 276bf2dd428f7f66f7472aabd9e943eec5733afe
Summary:
The common case of limiting blobstore rates using a leaky bucket rate limiter
is cumbersome to set up. Create a convenience method to do it more easily.
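For context, a minimal leaky-bucket limiter can be sketched with the standard library; the real code uses an async rate-limiting implementation, so this is a simplified illustration of the mechanism only:

```rust
use std::time::Instant;

/// Minimal leaky bucket: holds up to `capacity` tokens, refilled at
/// `rate` tokens per second. Each operation costs one token.
pub struct LeakyBucket {
    capacity: f64,
    tokens: f64,
    rate: f64,
    last: Instant,
}

impl LeakyBucket {
    pub fn new(capacity: f64, rate: f64) -> Self {
        Self { capacity, tokens: capacity, rate, last: Instant::now() }
    }

    /// Try to take one token; returns false if the caller should wait.
    pub fn try_acquire(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last).as_secs_f64();
        self.last = now;
        self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

A convenience method would then just wrap a blobstore's get/put in `try_acquire` checks so each caller doesn't rebuild this plumbing.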
Reviewed By: ahornby
Differential Revision: D25438685
fbshipit-source-id: 821eda7bd0ddf71f22378c1b23e66b6d3f6454e7
Summary:
When fetching many derived data mappings, the use of `FuturesUnordered` means
we may fetch many blobs concurrently, which may overload the blobstore.
Switch to using `buffered` to reduce the number of concurrent blob fetches.
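The difference can be illustrated outside async Rust: instead of launching every fetch at once (the `FuturesUnordered` failure mode), cap how many run concurrently, which is what `buffered(n)` does for a stream. A crude thread-based stand-in using only the standard library (chunking approximates the sliding window):

```rust
use std::thread;

/// Run `work` over `items` with at most `limit` running concurrently,
/// processing chunk by chunk - a crude stand-in for `buffered(limit)`.
pub fn run_bounded<T, R, F>(items: Vec<T>, limit: usize, work: F) -> Vec<R>
where
    T: Send,
    R: Send,
    F: Fn(T) -> R + Sync,
{
    let work = &work;
    let mut out = Vec::new();
    let mut iter = items.into_iter();
    loop {
        let chunk: Vec<T> = iter.by_ref().take(limit).collect();
        if chunk.is_empty() {
            break;
        }
        thread::scope(|s| {
            let handles: Vec<_> = chunk
                .into_iter()
                .map(|item| s.spawn(move || work(item)))
                .collect();
            out.extend(handles.into_iter().map(|h| h.join().unwrap()));
        });
    }
    out
}
```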
Reviewed By: ahornby
Differential Revision: D25371965
fbshipit-source-id: 30417e86bc33defbb821f214a5520ab1b8a8c18c
Summary:
Large batches with parallel derivation can cause problems in large repos.
Allow control of the batch size so that it can be reduced if needed.
Reviewed By: krallin
Differential Revision: D25401205
fbshipit-source-id: 88a76a7745c34e4e34bc9b3ea9228bd5dad857f6
Summary:
Re-introduce parallel backfilling of changesets in a batch using `batch_derive`,
however keep it under the control of a flag, so we can enable or disable it as
necessary.
Reviewed By: ahornby
Differential Revision: D25401207
fbshipit-source-id: f9aeef3415be48fc03220c18fa547e05538ed479
Summary:
Change derived data config to have "enabled" config and "backfilling" config.
The `Mapping` object has the responsibility of encapsulating the configuration options
for the derived data type. Since it is only possible to obtain a `Mapping` from
appropriate configuration, ownership of a `Mapping` means derivation is permitted,
and so the `DeriveMode` enum is removed.
Most callers will use `BonsaiDerived::derive`, or a default `derived_data_utils` implementation
that requires the derived data to be enabled and configured on the repo.
Backfillers can additionally use `derived_data_utils_for_backfill` which will use the
`backfilling` configuration in preference to the default configuration.
Reviewed By: ahornby
Differential Revision: D25246317
fbshipit-source-id: 352fe6509572409bc3338dd43d157f34c73b9eac
Summary:
Currently, data derivation for types that have options (currently unode
version and blame filesize limit) takes the value of the option from the
repository configuration.
This is a side-effect, and means it's not possible to have data derivation
types with different configs active in the same repository (e.g. to
serve unodes v1 while backfilling unodes v2). To have data derivation
with different options, e.g. in tests, we must use `repo.dangerous_override`.
The first step to resolve this is to make the data derivation options a parameter.
Depending on the type of derived data, these options are passed into
`derive_from_parents` so that the right kind of derivation can happen.
The mapping is responsible for storing the options and providing it at the time
of derivation. In this diff it just gets it from the repository config, the same
as was done previously. In a future diff we will change this so that there
can be multiple configurations.
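The shape of the change can be sketched as: the mapping owns the options and hands them to derivation as a parameter, rather than derivation reading repo config as a side effect (the types and fields below are illustrative assumptions):

```rust
/// Illustrative per-type derivation options (e.g. unode version); in
/// this diff they still come from repo config, but later there can be
/// multiple configurations.
#[derive(Clone, Debug, PartialEq)]
pub struct UnodeOptions {
    pub version: u32,
}

/// The mapping stores the options and provides them at derivation time.
pub struct UnodeMapping {
    options: UnodeOptions,
}

impl UnodeMapping {
    pub fn new(options: UnodeOptions) -> Self {
        Self { options }
    }

    pub fn options(&self) -> &UnodeOptions {
        &self.options
    }
}

/// Derivation receives the options as an explicit parameter.
pub fn derive_from_parents(options: &UnodeOptions, parents: &[u32]) -> (u32, usize) {
    (options.version, parents.len())
}
```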
Reviewed By: krallin
Differential Revision: D25371967
fbshipit-source-id: 1cf4c06a4598fccbfa93367fc1f1c2fa00fd8235