Summary:
This change refactors the Engine so that `lookupItem` uses one
code path to find the first token `Node` matching a rule and a
separate code path for the subsequent ones.
This split gives us stronger invariants and, more importantly,
lets us run full-text regexp matches only when necessary.
This should be particularly useful for longer texts.
Reviewed By: patapizza
Differential Revision: D4953918
fbshipit-source-id: e3a69ad
Summary:
Duration dimension for Vietnamese.
This only uses the common rule.
Reviewed By: niteria
Differential Revision: D4962329
fbshipit-source-id: 9273245
Summary:
The internal details of `Document` were leaked across two files.
This consolidates them.
It took a long time to make this perf neutral (now it's even a tiny
win), for reasons I don't completely understand.
The INLINE pragma on byteStringFromPos I semi-understand,
but I also had to move isRangeValid to Document and that's
a bit of a mystery.
Reviewed By: patapizza
Differential Revision: D4948449
fbshipit-source-id: ffb251a
Summary:
This diff was generated by running `hsclimps`
Reviewed By: niteria
Differential Revision: D4937839
fbshipit-source-id: bb3d330
Summary:
Separating out the first pass lets us avoid repeated filtering
and makes the structure of the algorithm a bit clearer.
Previously `Stash.null` was used as a test for being part of
the first pass or not, but that is a bit indirect. Encoding
the algorithm structure (the state automaton) as function calls
lets us make additional assumptions.
It also has a nice side effect of costs being attributed to
first/subsequent passes in the profile.
I also prepend to `matches` rather than append, because `matches` is likely to be the bigger list.
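
A minimal sketch of the shape described above, not the actual Engine code. `Token`, `firstPass`, `step` and `saturate` are hypothetical names, and the toy rules (keep positive inputs, halve even tokens) only exist to make the loop observable. It shows a dedicated first pass followed by repeated later passes, with new results prepended to the accumulated list:

```haskell
import Data.List (nub)

-- Hypothetical stand-in: a "token" is just an Int here.
type Token = Int

-- | First pass: derive tokens directly from the raw input.
-- No accumulated state exists yet, so nothing has to be filtered against it.
firstPass :: [Int] -> [Token]
firstPass input = [x | x <- input, x > 0]

-- | One subsequent pass: derive tokens from the tokens found in the
-- previous pass (toy rule: halve even tokens), skipping known ones.
step :: [Token] -> [Token] -> [Token]
step seen frontier =
  nub [h | t <- frontier, even t, let h = t `div` 2, h `notElem` seen]

-- | Run the first pass once, then iterate until no new tokens appear.
saturate :: [Int] -> [Token]
saturate input = go first first
  where
    first = nub (firstPass input)
    go matches frontier
      | null new  = matches
      -- Prepending: (++) is linear in its left argument, so this costs
      -- O(length new) rather than O(length matches).
      | otherwise = go (new ++ matches) new
      where
        new = step matches frontier
```

Because each phase is its own function, a profiler attributes cost to `firstPass` and `step` separately instead of to one combined loop.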
Reviewed By: patapizza
Differential Revision: D4922195
fbshipit-source-id: 0aec79f
Summary:
`intersectMB` was a name used for the purpose of migrating.
This is the last part of the migration.
Reviewed By: patapizza
Differential Revision: D4906098
fbshipit-source-id: a70af78
Summary:
This continues the work from:
"[Duckling] Don't produce trivially empty Tokens"
All the Rules should use intervalMB from now on.
Reviewed By: patapizza
Differential Revision: D4906072
fbshipit-source-id: 277b961
Summary:
My change had a couple of problems:
* UTF-8 character width logic was completely wrong for characters that need 3 or 4 bytes
* `Array.listArray (start, end)` produces an array where `end` is a valid index (the bounds are inclusive)
* because of the previous point, the `arraySize` logic also has to change
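
To illustrate the two pitfalls listed above, here is a small sketch. `utf8Width` and `byteOffsets` are hypothetical helpers, not the functions touched by this diff; only `listArray`, `bounds` and `ord` are real library calls. The width function follows the standard UTF-8 code point ranges, and the offset table relies on `listArray` bounds being inclusive:

```haskell
import Data.Array (Array, bounds, listArray, (!))
import Data.Char (ord)

-- | Number of bytes UTF-8 uses to encode a character.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1  -- ASCII
  | n < 0x800   = 2
  | n < 0x10000 = 3  -- rest of the Basic Multilingual Plane
  | otherwise   = 4  -- supplementary planes
  where
    n = ord c

-- | Byte offset of every character position, plus the end offset.
-- The bounds (0, length s) are inclusive, so the array holds
-- length s + 1 elements, which is exactly what scanl produces.
byteOffsets :: String -> Array Int Int
byteOffsets s = listArray (0, length s) (scanl (+) 0 (map utf8Width s))
```

With inclusive bounds, the size of an array built from `(start, end)` is `end - start + 1`, which is why a size computation that assumes an exclusive `end` is off by one.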
Reviewed By: watashi, darshankapashi
Differential Revision: D4894355
fbshipit-source-id: 8d07dfd
Summary:
We can detect certain kinds of contradictions sooner;
producing a token with an unresolvable Predicate is wasteful.
For a text like:
```
"Demain apres midi 14h 15 h 16h vendredi 14 a 15h"
```
it could produce 7000 tokens with empty predicates.
After this change it produces none and we get a 4x improvement in
time and 6x improvement in allocations.
Note that I only covered `ruleIntersect*` here. I need to do this for
the other instances as well.
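
The general idea can be sketched as follows. This is a toy model, not Duckling's Predicate type: `HourRange`, `intersectHours` and `combineAll` are hypothetical names. The intersection returns `Nothing` when the result can never match, so no token is built for it:

```haskell
import Data.Maybe (mapMaybe)

-- | Toy predicate: a closed range of hours (both ends inclusive).
data HourRange = HourRange Int Int
  deriving (Eq, Show)

-- | Intersect two predicates; Nothing means they contradict each other.
intersectHours :: HourRange -> HourRange -> Maybe HourRange
intersectHours (HourRange a b) (HourRange c d)
  | lo <= hi  = Just (HourRange lo hi)
  | otherwise = Nothing
  where
    lo = max a c
    hi = min b d

-- | Combine every pair, dropping contradictions before anything
-- downstream has to allocate or inspect a result for them.
combineAll :: [HourRange] -> [HourRange] -> [HourRange]
combineAll xs ys =
  mapMaybe (uncurry intersectHours) [(x, y) | x <- xs, y <- ys]
```

For instance, intersecting "14h" with "15h" yields `Nothing` in this model, while intersecting "afternoon" (12 to 18) with "14h" yields a usable result.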
Reviewed By: JonCoens
Differential Revision: D4871078
fbshipit-source-id: 9f0e7ad
Summary:
So far, contradictions from intersection only
propagated through intersection. With this change
they also propagate through intervals, and
intervals can generate contradictions themselves.
Reviewed By: patapizza
Differential Revision: D4864160
fbshipit-source-id: 8348267
Summary:
This is a very common pattern (>1k occurrences).
Replacing it with something shorter makes the rules a bit less
boilerplate-y.
Feel free to bikeshed the name, I can easily redo the codemod.
Reviewed By: patapizza
Differential Revision: D4848864
fbshipit-source-id: 7baeee3
Summary:
This makes the code easier to read.
I'm not attached to naming, but this is
standard terminology from topology.
Reviewed By: JonCoens, patapizza
Differential Revision: D4848740
fbshipit-source-id: 79c2c20
Summary: Runs a `snap` server to return the supported targets as well as to do parsing. It's a bit kludgy, but gets the job done.
Reviewed By: patapizza
Differential Revision: D4813197
fbshipit-source-id: 0fa165b
Summary:
This is the easiest way to fix it, but after talking offline
with Julien, we may need to revisit it.
It basically gets rid of time series where we were
producing intervals that are not a multiple of the grain.
Reviewed By: patapizza
Differential Revision: D4841759
fbshipit-source-id: 1c4742a
Summary: Converts large regex lookups to hashmap lookups in `Duckling/Numeral/FR/Rules.hs` and `Duckling/Ordinal/FR/Rules.hs`.
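
A rough sketch of the technique, under stated assumptions: the word table and `lookupNumeral` are hypothetical, and `Data.Map.Strict` from `containers` stands in for a hash map to keep the example dependency-free (`Data.HashMap.Strict` from `unordered-containers` offers `fromList` and `lookup` with the same shape). Instead of testing one regex alternative per word, the matched word is looked up in a table:

```haskell
import Data.Char (toLower)
import qualified Data.Map.Strict as Map

-- | Hypothetical excerpt of a word-to-value table.
frenchNumerals :: Map.Map String Integer
frenchNumerals = Map.fromList
  [ ("zero", 0)
  , ("un", 1)
  , ("deux", 2)
  , ("trois", 3)
  , ("quatre", 4)
  , ("cinq", 5)
  ]

-- | Case-insensitive lookup of a matched word; Nothing if unknown.
lookupNumeral :: String -> Maybe Integer
lookupNumeral = flip Map.lookup frenchNumerals . map toLower
```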
Reviewed By: patapizza
Differential Revision: D4836336
fbshipit-source-id: 2241a3a
Summary: Moves to the 8.6 resolver, updates package limits, and fixes errors due to upgrade.
Reviewed By: patapizza
Differential Revision: D4810924
fbshipit-source-id: c8a64a9
Summary:
This converts the code to monadic style, so that
we can in the future:
* stop threading the `Document` parameter everywhere
* keep some state, such as a regexp match cache (I've already checked that it makes a substantial difference)
There should be no difference in performance or behavior
at this point.
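
As an illustration of what the monadic style buys, here is a self-contained sketch. Everything in it is hypothetical: `Parse` is a hand-rolled reader-like monad, `Document` is a stand-in wrapper around a `String`, and `rangeInBounds` is an invented helper. It is not the monad used in this diff. Functions written in `Parse` read the document from the environment instead of taking it as an argument:

```haskell
-- | Hypothetical stand-in for the real Document type.
newtype Document = Document { docText :: String }

-- | A computation that can read the current Document.
newtype Parse a = Parse { runParse :: Document -> a }

instance Functor Parse where
  fmap f (Parse g) = Parse (f . g)

instance Applicative Parse where
  pure = Parse . const
  Parse f <*> Parse g = Parse (\d -> f d (g d))

instance Monad Parse where
  Parse g >>= k = Parse (\d -> runParse (k (g d)) d)

-- | Fetch the Document from the environment.
askDocument :: Parse Document
askDocument = Parse id

-- | Check that [start, end) lies inside the document, without
-- taking the Document as an explicit parameter.
rangeInBounds :: Int -> Int -> Parse Bool
rangeInBounds start end = do
  doc <- askDocument
  let n = length (docText doc)
  pure (0 <= start && start <= end && end <= n)
```

The document is supplied once at the edge, for example `runParse (rangeInBounds 0 3) (Document "abc")`. Keeping state such as a match cache would need a state-carrying variant of the same idea.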
Reviewed By: patapizza
Differential Revision: D4778808
fbshipit-source-id: a167ed8
Summary:
`stack exe/RegenMain.hs` uses runghc, which is a tool
we don't test with often. Making sure the executable
is rebuilt and using it instead should be enough.
Reviewed By: patapizza
Differential Revision: D4783844
fbshipit-source-id: 459dbc4
Summary:
This works around https://github.com/haskell/cabal/issues/4350
If we don't do this, files get compiled multiple times
and cabal is unhappy.
Reviewed By: patapizza
Differential Revision: D4782749
fbshipit-source-id: 5bbe425