099856d71c
Summary: In very mergy repos we can hit a combinatoric explosion by visiting the same node over and over again. The derived data framework has the same problem, and this diff fixes it.

I had a few attempts at implementing it:

**1** Use `bounded_traversal`, but change unfold to filter out parents that were already visited. That isn't correct, because fold will then be called only with the "unvisited" parents. For example, in a case like

```
  D
 / \
C   B
 \ /
  A
```

fold for C or B will be called with empty parents, and that's incorrect.

**2** Use `bounded_traversal`, change unfold to filter out visited parents, but also remember the real parents. That doesn't work either: fold might be called before the unfolds for a node's parents have finished. So in a case like

```
  D
 / \
C   B
 \ /
  A
  |
 ... thousands of commits
```

if C reaches A first, then B won't visit any other node, and the traversal will try to derive data for B right away. However, derived data for A might not be ready yet, so deriving data for B might fail.

**3** Change `bounded_traversal` to support DAGs, not just trees. From the two points above it's clear that `bounded_traversal` should really be called `bounded_tree_traversal`, because on any other DAG it can hit a combinatoric explosion. I looked into changing `bounded_traversal` to support DAGs, and it is possible but not easy. Specifically, we need to make sure that fold operations are called only after all the relevant unfold operations have finished, stop using integers for nodes, etc. It might also have a perf hit for the tree case, though it's not clear how big. While I think supporting DAGs in `bounded_traversal` makes sense, I don't want to block the derived data implementation on it. I'll create a separate task for that.

---------------------------------------------------------------------------------

The approach I took in the end is to use `bounded_stream_traversal` with a visited set, so that it doesn't visit the same node twice. This finds all the commits that need to be regenerated, but it may return them in arbitrary order, so afterwards we need to topo-sort the commits (note that I introduced a bug in hg changeset generation in D16132403, so this diff fixes that as well). This is not the most optimal implementation, because it generates the nodes sequentially even when they could be generated in parallel (e.g. when the nodes are on different branches). I don't think that's a huge concern, so it's worth waiting for the `bounded_dag_traversal` implementation (see point **3** above). A sketch of this two-phase approach is included after this summary.

---------------------------------------------------------------------------------

Finally, there were concerns about the memory usage of the additional hashset that keeps the visited nodes. I think these concerns are unfounded, for a few reasons:

1) We have to keep the nodes we visited *anyway*, because we need to generate derived data from parents to children; in fact, `bounded_traversal` keeps them in a map as well. It's true that `bounded_traversal` can do this a bit more efficiently when there are two different branches that do not intersect, but I'd argue that's a rare case that happens only on merges of two independent, equally sized branches, and even then it's not a huge problem (see below).

2) The hashset keeps only commit ids, which are 32 bytes each. So even if we have 1M commits to generate, that would take 32MB plus the hashset overhead. Cases like that should never happen in the first place: we do not expect to generate derived data for 1M commits except during the initial import of a huge repo, and for those cases we can afford a 32MB memory hit.
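To make the two phases concrete, here is a minimal synchronous sketch of the idea. This is not the Mononoke code: `CommitId`, the `parents` map, and `is_derived` are hypothetical stand-ins, and the real implementation drives an async `bounded_stream_traversal` with a concurrency bound rather than a plain queue. It also assumes, as the derived data framework does, that if derived data exists for a commit it exists for all of its ancestors.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical stand-in for a commit id; real commit ids are 32-byte hashes.
type CommitId = u64;

/// Phase 1: walk the ancestors of `head`, visiting each node at most once
/// (the role of the visited hashset discussed above), collecting every commit
/// that still lacks derived data. Phase 2: topo-sort the result so that
/// parents always come before children.
fn commits_to_derive(
    head: CommitId,
    parents: &HashMap<CommitId, Vec<CommitId>>,
    is_derived: impl Fn(CommitId) -> bool,
) -> Vec<CommitId> {
    // Phase 1: dedup'd traversal. Without `visited`, a mergy DAG would make
    // us re-walk shared ancestors once per path -- the combinatoric explosion.
    let mut visited = HashSet::new();
    let mut to_derive = Vec::new();
    let mut queue = VecDeque::from([head]);
    while let Some(cs) = queue.pop_front() {
        if !visited.insert(cs) || is_derived(cs) {
            // Already seen, or derived data already exists (and therefore
            // exists for all ancestors too), so nothing below here is needed.
            continue;
        }
        to_derive.push(cs);
        for &p in parents.get(&cs).into_iter().flatten() {
            queue.push_back(p);
        }
    }

    // Phase 2: iterative post-order DFS along parent edges. Emitting a node
    // only after its parents yields exactly the parents-first order that
    // sequential derivation needs.
    let in_set: HashSet<_> = to_derive.iter().copied().collect();
    let mut emitted = HashSet::new();
    let mut sorted = Vec::with_capacity(to_derive.len());
    let mut stack: Vec<(CommitId, bool)> =
        to_derive.iter().map(|&c| (c, false)).collect();
    while let Some((cs, parents_done)) = stack.pop() {
        if emitted.contains(&cs) {
            continue;
        }
        if parents_done {
            emitted.insert(cs);
            sorted.push(cs);
        } else {
            // Revisit this node after its not-yet-emitted parents.
            stack.push((cs, true));
            for &p in parents.get(&cs).into_iter().flatten() {
                if in_set.contains(&p) {
                    stack.push((p, false));
                }
            }
        }
    }
    sorted
}
```

The caller would then derive sequentially in the returned order, e.g. `for cs in commits_to_derive(head, &parents, |c| store_has(c)) { derive_one(cs)?; }` (both helpers hypothetical). As noted above, this gives up parallelism across independent branches, which is the part a future `bounded_dag_traversal` could recover.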
If we get into a state where we need to generate too many commits, we should just return an error to the user; that will be added in later diffs.

Reviewed By: krallin

Differential Revision: D16438342

fbshipit-source-id: 4d82ea6111ac882dd5856319a16dda8392dfae81
Files changed:
derive_impl.rs
lib.rs