Summary:
A correct gitignore matcher needs O(N^2) time to check a path that is N
directories deep. For example, to check "a/b/c/d", it needs to check:
- Whether .gitignore matches a/b/c/d
- Whether a/.gitignore matches b/c/d
- Whether a/b/.gitignore matches c/d
- Whether a/b/c/.gitignore matches d
- Whether .gitignore matches a/b/c
- Whether a/.gitignore matches b/c
- Whether a/b/.gitignore matches c
- Whether .gitignore matches a/b
- Whether a/.gitignore matches b
- Whether .gitignore matches a
It might not look that bad because N=4 in the above example. But when N is
larger (e.g. node_modules/../node_modules/../node_modules/..), things get much
worse.
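
The enumeration above can be reproduced with a small sketch. This is only an
illustration of the naive check order, not the crate's code; `naive_checks` is
a hypothetical helper name.

```rust
/// List every (gitignore file, relative path) pair that a naive matcher
/// tests for `path`, in the same order as the example in this message.
/// `naive_checks` is a hypothetical name used only for illustration.
fn naive_checks(path: &str) -> Vec<(String, String)> {
    let parts: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let mut checks = Vec::new();
    // Test the full path first, then each of its ancestor directories.
    for end in (1..=parts.len()).rev() {
        // Each prefix is tested against the .gitignore of every directory above it.
        for start in 0..end {
            let dir = parts[..start].join("/");
            let gitignore = if dir.is_empty() {
                ".gitignore".to_string()
            } else {
                format!("{}/.gitignore", dir)
            };
            checks.push((gitignore, parts[start..end].join("/")));
        }
    }
    checks
}
```

For a path with N components this yields N*(N+1)/2 pairs, which is where the
quadratic cost comes from.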
This patch caches whether a directory is ignored or not. For
example, if "a/b/" is ignored, the new code skips checking its subdirectories
(e.g. "a/b/c/"). The time complexity is now roughly O(N) gitignore tests instead
of O(N^2), since each parent directory of the path being tested gets a
gitignore check only once, and the result is then cached as a boolean
value.
To be clear, the first check of a path that is not ignored still
needs O(N^2) to initialize the trees. But once they are initialized, the next
check of a file in the same directory is O(N).
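
A minimal sketch of the caching idea, under simplifying assumptions: the real
code keeps the boolean in its tree structure, while this sketch uses a flat
`HashMap` keyed by directory path. `CachedMatcher`, `rule`, and `tests_run` are
hypothetical names, and the `rule` closure stands in for the actual gitignore
test of a single directory.

```rust
use std::collections::HashMap;

/// Hypothetical illustration: caches, per directory, whether it is ignored.
struct CachedMatcher<F: Fn(&str) -> bool> {
    /// Stand-in for the real gitignore test of one directory.
    rule: F,
    /// Directory path -> "this directory or one of its ancestors is ignored".
    cache: HashMap<String, bool>,
    /// Number of times `rule` was actually invoked.
    tests_run: usize,
}

impl<F: Fn(&str) -> bool> CachedMatcher<F> {
    fn new(rule: F) -> Self {
        CachedMatcher { rule, cache: HashMap::new(), tests_run: 0 }
    }

    /// Returns true if `dir` or any of its ancestors is ignored.
    fn dir_ignored(&mut self, dir: &str) -> bool {
        if let Some(&hit) = self.cache.get(dir) {
            return hit;
        }
        // Resolve the parent first; an ignored parent makes the test unnecessary.
        let parent_ignored = match dir.rfind('/') {
            Some(i) => self.dir_ignored(&dir[..i]),
            None => false,
        };
        let ignored = parent_ignored || {
            self.tests_run += 1;
            (self.rule)(dir)
        };
        self.cache.insert(dir.to_string(), ignored);
        ignored
    }
}
```

With a rule that ignores only "a/b", asking about "a/b/c" runs the rule for "a"
and "a/b" and skips it for "a/b/c"; a later query for "a/b/c/d" is answered
from the cached parent without running the rule again.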
`LruCache` is replaced by `HashMap` since `LruCache` does not support `.get`,
which the new code needs.
The perf issue was previously documented in a "PERF" comment.
This diff removes that comment.
Reviewed By: DurhamG
Differential Revision: D7496058
fbshipit-source-id: f10895b8f0d7dcdde6faf9daeec5cd78a1f15a2b
Summary:
The "pathmatcher" crate is intended to eventually cover more "matcher"
abilities so all Python "matcher" related logic can be handled by Rust.
For now, it only contains a gitignore matcher.
The gitignore matcher is designed to work in a repo (no need to create
multiple gitignore matchers for a repo from a higher layer), and to be lazy,
i.e. tree-aware: it does not parse ".gitignore" files unless necessary.
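
The laziness described above can be sketched as follows. This is a simplified
illustration, not the crate's implementation: `LazyRules`, `load`, and `loads`
are hypothetical names, and the `load` closure stands in for reading and
parsing one directory's ".gitignore".

```rust
use std::collections::HashMap;

/// Hypothetical illustration: per-directory rules are parsed on first use only.
struct LazyRules<L: Fn(&str) -> Vec<String>> {
    /// Stand-in for reading and parsing `<dir>/.gitignore`.
    load: L,
    /// Directories whose rules have already been parsed.
    parsed: HashMap<String, Vec<String>>,
    /// Number of times `load` was actually invoked.
    loads: usize,
}

impl<L: Fn(&str) -> Vec<String>> LazyRules<L> {
    fn new(load: L) -> Self {
        LazyRules { load, parsed: HashMap::new(), loads: 0 }
    }

    /// Returns the rules for `dir`, parsing them only if no earlier
    /// query has touched this directory.
    fn rules_for(&mut self, dir: &str) -> &[String] {
        if !self.parsed.contains_key(dir) {
            self.loads += 1;
            let rules = (self.load)(dir);
            self.parsed.insert(dir.to_string(), rules);
        }
        &self.parsed[dir]
    }
}
```

Directories that are never queried are never loaded, so a single matcher can
serve a whole repo without parsing every ".gitignore" up front.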
It is worth mentioning that the gitignore logic provided by the "ignore" crate
seems decent in time complexity: it uses regular expressions, which use state
machines to achieve "testing against multiple patterns at once", instead of
testing patterns one by one as git currently does.
Note: The "ignore" crate provides a nice "Walker" interface but that does
not fit very well with the required laziness here. So the walker interface
is not used.
Reviewed By: markbt
Differential Revision: D7319609
fbshipit-source-id: ebd131adf45a38f83acdf653f5e49d0624012152