Commit Graph

2 Commits

Author SHA1 Message Date
Jun Wu
3ffa0f28e2 gitignore: avoid quadratic behavior
Summary:
A correct gitignore matcher needs O(N^2) time to check a path that is N
directories deep. For example, to check "a/b/c/d", it needs to check:

  - Whether .gitignore matches a/b/c/d
  - Whether a/.gitignore matches b/c/d
  - Whether a/b/.gitignore matches c/d
  - Whether a/b/c/.gitignore matches d

  - Whether .gitignore matches a/b/c
  - Whether a/.gitignore matches b/c
  - Whether a/b/.gitignore matches c

  - Whether .gitignore matches a/b
  - Whether a/.gitignore matches b

  - Whether .gitignore matches a

It might not look that bad because N=4 in the above example. But when N is
larger (e.g. node_modules/../node_modules/../node_modules/..), things get much
worse.
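
To make the growth concrete, here is a minimal sketch that enumerates the
checks listed above for an arbitrary path. The function name `naive_checks` is
hypothetical and is not part of the pathmatcher crate; it only counts the
tests, it does not perform any gitignore matching.

```rust
/// Enumerate every (gitignore file, relative path) pair that an uncached
/// matcher would test for `path`, in the same order as the list above.
/// Hypothetical helper, for illustration only.
fn naive_checks(path: &str) -> Vec<(String, String)> {
    let parts: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let mut checks = Vec::new();
    // Test the path itself, then each of its parent directories.
    for end in (1..=parts.len()).rev() {
        // Every directory above the tested prefix may contain a .gitignore.
        for start in 0..end {
            let ignore_file = if start == 0 {
                ".gitignore".to_string()
            } else {
                format!("{}/.gitignore", parts[..start].join("/"))
            };
            checks.push((ignore_file, parts[start..end].join("/")));
        }
    }
    checks
}

fn main() {
    let checks = naive_checks("a/b/c/d");
    for (file, rel) in &checks {
        println!("Whether {} matches {}", file, rel);
    }
    // N = 4 gives 4 + 3 + 2 + 1 = N * (N + 1) / 2 checks.
    assert_eq!(checks.len(), 10);
}
```

For a path that is 40 directories deep, the same formula gives 820 checks.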

This patch adds "caching" of whether a directory is ignored or not. For
example, if "a/b/" is ignored, the new code skips checking its subdirectories
(e.g. "a/b/c/"). The time complexity is now roughly O(N) gitignore tests instead
of O(N^2), since each parent directory of the path being tested is checked
only once, and the result is then cached as a boolean value.
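
The idea can be sketched roughly as follows. This is not the code from the
patch: `CachedMatcher`, `raw_test`, and `ignored_names` are hypothetical names,
and a single name lookup stands in for a real gitignore test. The sketch only
shows how a per-directory boolean cache lets an ignored parent short-circuit
the checks for everything below it.

```rust
use std::collections::HashMap;

/// Hypothetical matcher; `ignored_names` stands in for real gitignore rules.
struct CachedMatcher {
    ignored_names: Vec<&'static str>,
    /// Directory path -> whether that directory is ignored.
    dir_cache: HashMap<String, bool>,
    /// Number of raw rule tests performed, for illustration only.
    raw_tests: usize,
}

impl CachedMatcher {
    fn new(ignored_names: Vec<&'static str>) -> Self {
        CachedMatcher { ignored_names, dir_cache: HashMap::new(), raw_tests: 0 }
    }

    /// Stand-in for one gitignore test of a single path (not its parents).
    fn raw_test(&mut self, path: &str) -> bool {
        self.raw_tests += 1;
        let name = path.rsplit('/').next().unwrap_or(path);
        self.ignored_names.iter().any(|n| *n == name)
    }

    /// A directory is ignored if an ancestor is ignored or it matches itself.
    fn is_dir_ignored(&mut self, dir: &str) -> bool {
        if let Some(&cached) = self.dir_cache.get(dir) {
            return cached;
        }
        let parent_ignored = match dir.rfind('/') {
            Some(pos) => self.is_dir_ignored(&dir[..pos]),
            None => false,
        };
        // Skip the rule test entirely when a parent is already ignored.
        let ignored = parent_ignored || self.raw_test(dir);
        self.dir_cache.insert(dir.to_string(), ignored);
        ignored
    }

    fn is_ignored(&mut self, path: &str) -> bool {
        let parent_ignored = match path.rfind('/') {
            Some(pos) => self.is_dir_ignored(&path[..pos]),
            None => false,
        };
        parent_ignored || self.raw_test(path)
    }
}

fn main() {
    let mut m = CachedMatcher::new(vec!["target"]);
    assert!(!m.is_ignored("a/b/c/d"));
    assert_eq!(m.raw_tests, 4); // a, a/b, a/b/c, then a/b/c/d itself
    assert!(!m.is_ignored("a/b/c/e"));
    assert_eq!(m.raw_tests, 5); // parent results come from the cache
    assert!(m.is_ignored("a/target/x/y"));
    assert_eq!(m.raw_tests, 6); // only a/target was tested; x and y were skipped
}
```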

To be clear, the first check of a path that is not ignored still needs O(N^2)
to initialize the trees. But once they are initialized, the next check of a
file in the same directory is O(N).

`LruCache` is replaced by `HashMap` since `LruCache` does not support `.get`,
which the code needs.

The perf issue was previously documented as a "PERF" comment.
This diff removes it.

Reviewed By: DurhamG

Differential Revision: D7496058

fbshipit-source-id: f10895b8f0d7dcdde6faf9daeec5cd78a1f15a2b
2018-04-13 21:51:48 -07:00
Jun Wu
283b8d130d pathmatcher: initial Rust matcher that handles gitignore lazily
Summary:
The "pathmatcher" crate is intended to eventually cover more "matcher"
abilities so all Python "matcher" related logic can be handled by Rust.
For now, it only contains a gitignore matcher.

The gitignore matcher is designed to work in a repo (no need to create
multiple gitignore matchers for a repo from a higher layer), and to be lazy,
i.e. to be tree-aware and not parse ".gitignore" files unless necessary.
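
As a rough illustration of that laziness (not the crate's actual design), the
sketch below parses a directory's ".gitignore" only the first time that
directory is queried. `LazyIgnores`, `patterns_for`, and the in-memory `files`
map that stands in for the filesystem are all assumptions made for this
example.

```rust
use std::collections::HashMap;

/// Hypothetical lazy store: a directory's ".gitignore" is parsed only when
/// that directory is first queried.
struct LazyIgnores {
    /// Stand-in for the filesystem: directory -> ".gitignore" content.
    files: HashMap<&'static str, &'static str>,
    /// Directory -> parsed patterns, filled on demand.
    parsed: HashMap<String, Vec<String>>,
    /// Number of parses performed, for illustration only.
    parse_count: usize,
}

impl LazyIgnores {
    fn new(files: HashMap<&'static str, &'static str>) -> Self {
        LazyIgnores { files, parsed: HashMap::new(), parse_count: 0 }
    }

    /// Return the patterns for `dir`, parsing its ".gitignore" on first use.
    fn patterns_for(&mut self, dir: &str) -> &[String] {
        if !self.parsed.contains_key(dir) {
            self.parse_count += 1;
            // A missing ".gitignore" is treated as an empty rule list.
            let content = self.files.get(dir).copied().unwrap_or("");
            let patterns: Vec<String> = content
                .lines()
                .map(str::trim)
                .filter(|l| !l.is_empty() && !l.starts_with('#'))
                .map(String::from)
                .collect();
            self.parsed.insert(dir.to_string(), patterns);
        }
        &self.parsed[dir]
    }
}

fn main() {
    let mut files = HashMap::new();
    files.insert("", "*.o\n# build output\ntarget\n");
    files.insert("a/b", "*.log\n");
    let mut store = LazyIgnores::new(files);
    assert_eq!(store.parse_count, 0); // nothing is parsed up front
    assert_eq!(store.patterns_for("a/b").to_vec(), vec!["*.log".to_string()]);
    assert_eq!(store.patterns_for("a/b").len(), 1); // served from `parsed`
    assert_eq!(store.parse_count, 1); // the root ".gitignore" was never read
}
```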

It is worth mentioning that the gitignore logic provided by the "ignore" crate
seems decent in time complexity: it uses regular expressions, which use state
machines to achieve "testing against multiple patterns at once", instead of
testing patterns one by one like git currently does.

Note: The "ignore" crate provides a nice "Walker" interface but that does
not fit very well with the required laziness here. So the walker interface
is not used.

Reviewed By: markbt

Differential Revision: D7319609

fbshipit-source-id: ebd131adf45a38f83acdf653f5e49d0624012152
2018-04-13 21:51:40 -07:00