099856d71c
Summary: In very mergy repos we can hit a combinatoric explosion by visiting the same node over and over again. The derived data framework has the same problem, and this diff fixes it.

I had a few attempts at implementing it:

**1** Use `bounded_traversal`, but change unfold to filter out parents that were already visited. That isn't correct, because fold will then be called only with the "unvisited" parents. For example, in a case like

```
  D
 / \
C   B
 \ /
  A
```

fold for C or B will be called with empty parents, and that's incorrect.

**2** Use `bounded_traversal`, change unfold to filter out visited parents, but also remember the real parents. That doesn't work either: fold might be called before the unfolds for a node's parents have finished. So in a case like

```
  D
 / \
C   B
 \ /
  A
  |
 ... thousands of commits
```

if C reaches A first, then B won't visit any other node, and the traversal will try to derive data for B right away. However, derived data for A might not be ready yet, so deriving data for B might fail.

**3** Change `bounded_traversal` to support DAGs, not just trees. From the two points above it's clear that `bounded_traversal` should really be called `bounded_tree_traversal`, because on any other DAG it can hit a combinatoric explosion. I looked into changing `bounded_traversal` to support DAGs, and it is possible but not easy. Specifically, we need to make sure that fold operations are called only after all the relevant unfold operations have finished, stop using integers for nodes, etc. It might also have a perf hit for the tree case, though it's not clear how big. While I think supporting DAGs in `bounded_traversal` makes sense, I don't want to block the derived data implementation on it. I'll create a separate task for that.

---------------------------------------------------------------------------------

The approach I took in the end is to use `bounded_stream_traversal` with a visited set, so that it doesn't visit the same node twice. This finds all the commits that need to be regenerated, but it may return them in arbitrary order, so afterwards we need to topo-sort the commits (note that I introduced a bug in hg changeset generation in D16132403, so this diff fixes that as well). This is not the most optimal implementation, because it generates the nodes sequentially even when they could be generated in parallel (e.g. when the nodes are on different branches). I don't think that's a huge concern, so it's worth waiting for the `bounded_dag_traversal` implementation (see point **3** above). A sketch of this two-phase approach is included after this summary.

---------------------------------------------------------------------------------

Finally, there were concerns about the memory usage of the additional hashset that keeps the visited nodes. I think these concerns are unfounded, for a few reasons:

1) We have to keep the nodes we visited *anyway*, because we need to generate derived data from parents to children; in fact, `bounded_traversal` keeps them in a map as well. It's true that `bounded_traversal` can do this a bit more efficiently when there are two different branches that do not intersect, but I'd argue that's a rare case that happens only on merges of two independent, equally sized branches, and even then it's not a huge problem (see below).

2) The hashset keeps only commit ids, which are 32 bytes each. So even if we have 1M commits to generate, that would take 32MB plus the hashset overhead. Cases like that should never happen in the first place: we do not expect to generate derived data for 1M commits except during the initial import of a huge repo, and for those cases we can afford a 32MB memory hit.
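To make the two phases concrete, here is a minimal synchronous sketch of the idea. This is not the Mononoke code: `CommitId`, the `parents` map, and `is_derived` are hypothetical stand-ins, and the real implementation drives an async `bounded_stream_traversal` with a concurrency bound rather than a plain queue. It also assumes, as the derived data framework does, that if derived data exists for a commit it exists for all of its ancestors.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical stand-in for a commit id; real commit ids are 32-byte hashes.
type CommitId = u64;

/// Phase 1: walk the ancestors of `head`, visiting each node at most once
/// (the role of the visited hashset discussed above), collecting every commit
/// that still lacks derived data. Phase 2: topo-sort the result so that
/// parents always come before children.
fn commits_to_derive(
    head: CommitId,
    parents: &HashMap<CommitId, Vec<CommitId>>,
    is_derived: impl Fn(CommitId) -> bool,
) -> Vec<CommitId> {
    // Phase 1: dedup'd traversal. Without `visited`, a mergy DAG would make
    // us re-walk shared ancestors once per path -- the combinatoric explosion.
    let mut visited = HashSet::new();
    let mut to_derive = Vec::new();
    let mut queue = VecDeque::from([head]);
    while let Some(cs) = queue.pop_front() {
        if !visited.insert(cs) || is_derived(cs) {
            // Already seen, or derived data already exists (and therefore
            // exists for all ancestors too), so nothing below here is needed.
            continue;
        }
        to_derive.push(cs);
        for &p in parents.get(&cs).into_iter().flatten() {
            queue.push_back(p);
        }
    }

    // Phase 2: iterative post-order DFS along parent edges. Emitting a node
    // only after its parents yields exactly the parents-first order that
    // sequential derivation needs.
    let in_set: HashSet<_> = to_derive.iter().copied().collect();
    let mut emitted = HashSet::new();
    let mut sorted = Vec::with_capacity(to_derive.len());
    let mut stack: Vec<(CommitId, bool)> =
        to_derive.iter().map(|&c| (c, false)).collect();
    while let Some((cs, parents_done)) = stack.pop() {
        if emitted.contains(&cs) {
            continue;
        }
        if parents_done {
            emitted.insert(cs);
            sorted.push(cs);
        } else {
            // Revisit this node after its not-yet-emitted parents.
            stack.push((cs, true));
            for &p in parents.get(&cs).into_iter().flatten() {
                if in_set.contains(&p) {
                    stack.push((p, false));
                }
            }
        }
    }
    sorted
}
```

The caller would then derive sequentially in the returned order, e.g. `for cs in commits_to_derive(head, &parents, |c| store_has(c)) { derive_one(cs)?; }` (both helpers hypothetical). As noted above, this gives up parallelism across independent branches, which is the part a future `bounded_dag_traversal` could recover.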
If we get into a state where we need to generate too many commits, we should just return an error to the user; that will be added in later diffs.

Reviewed By: krallin

Differential Revision: D16438342

fbshipit-source-id: 4d82ea6111ac882dd5856319a16dda8392dfae81
Files changed:
derive_impl.rs
lib.rs