Summary:
`Document` had its internal details leaked across 2 files.
This consolidates it.
It took a long time to make this perf neutral (now it's even a tiny
win), for reasons I don't completely understand.
The INLINE pragma on byteStringFromPos I semi-understand,
but I also had to move isRangeValid to Document and that's
a bit of a mystery.
Reviewed By: patapizza
Differential Revision: D4948449
fbshipit-source-id: ffb251a
Summary:
This diff was generated by running `hsclimps`
Reviewed By: niteria
Differential Revision: D4937839
fbshipit-source-id: bb3d330
Summary:
Separating out the first pass lets us avoid repeated filtering
and makes the structure of the algorithm a bit more clear.
Previously `Stash.null` was used as a test for being part of
the first pass or not, but that is a bit indirect. Encoding
the algorithm structure (the state automaton) as function calls
lets us make additional assumptions.
It also has a nice side effect of costs being attributed to
first/subsequent passes in the profile.
I also prepend to `matches` because it's likely to be bigger.
Reviewed By: patapizza
Differential Revision: D4922195
fbshipit-source-id: 0aec79f
Summary:
`intersectMB` was a name used for the purpose of migrating.
This is the last part of the migration.
Reviewed By: patapizza
Differential Revision: D4906098
fbshipit-source-id: a70af78
Summary:
This continues the work from:
"[Duckling] Don't produce trivially empty Tokens"
All the Rules should use intervalMB from now on.
Reviewed By: patapizza
Differential Revision: D4906072
fbshipit-source-id: 277b961
Summary:
My change had a couple of problems:
* the UTF-8 character width logic was completely wrong for characters that need 3 or 4 bytes
* `Array.listArray (start, end)` produces an array where `end` is a valid index
* because of the previous point, the `arraySize` logic also has to change
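
To make the two pitfalls concrete, here is a minimal sketch (not the code from this diff). `utf8Width`, `byteOffsets`, `entryCount` and `totalBytes` are hypothetical names made up for the illustration; only `listArray`, `bounds` and `(!)` come from `Data.Array`.

```haskell
import Data.Array (Array, bounds, listArray, (!))
import Data.Char (ord)

-- Number of bytes a code point occupies when encoded as UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80 = 1
  | n < 0x800 = 2
  | n < 0x10000 = 3
  | otherwise = 4
  where
    n = ord c

-- Entry i is the byte offset at which character i starts; the final
-- entry is the total byte length. Both bounds passed to listArray are
-- inclusive, so n characters need the bounds (0, n) for n + 1 entries.
byteOffsets :: String -> Array Int Int
byteOffsets s = listArray (0, length s) (scanl (+) 0 (map utf8Width s))

-- Size of the table: the upper bound is a valid index, hence the + 1.
entryCount :: Array Int Int -> Int
entryCount arr = hi - lo + 1
  where
    (lo, hi) = bounds arr

-- Total byte length, read from the last (valid) index.
totalBytes :: Array Int Int -> Int
totalBytes arr = arr ! snd (bounds arr)
```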
Reviewed By: watashi, darshankapashi
Differential Revision: D4894355
fbshipit-source-id: 8d07dfd
Summary:
We can detect certain kinds of contradictions sooner;
producing a token with an unresolvable Predicate is wasteful.
For a text like:
```
"Demain apres midi 14h 15 h 16h vendredi 14 a 15h"
```
it could produce 7000 tokens with empty predicates.
After this change it produces none and we get a 4x improvement in
time and 6x improvement in allocations.
Note I only covered `ruleIntersect*` here. I need to do this for
other instances as well.
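
The idea can be sketched roughly as follows. This is a toy model, not the actual Predicate code: `HourRange`, `intersectMaybe` and `produce` are hypothetical names, and a real predicate is far richer than an inclusive range of hours.

```haskell
-- Hypothetical stand-in for a time predicate: an inclusive range of hours.
data HourRange = HourRange Int Int
  deriving (Eq, Show)

-- Intersect two ranges, reporting a contradiction as Nothing instead of
-- building a range that can never be satisfied.
intersectMaybe :: HourRange -> HourRange -> Maybe HourRange
intersectMaybe (HourRange a b) (HourRange c d)
  | lo <= hi = Just (HourRange lo hi)
  | otherwise = Nothing
  where
    lo = max a c
    hi = min b d

-- A rule production that emits no token at all on contradiction.
produce :: HourRange -> HourRange -> [HourRange]
produce x y = maybe [] pure (intersectMaybe x y)
```

With this shape, `produce (HourRange 14 15) (HourRange 16 17)` yields an empty list, so nothing downstream ever sees the empty predicate.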
Reviewed By: JonCoens
Differential Revision: D4871078
fbshipit-source-id: 9f0e7ad
Summary:
So far contradictions from intersection only
propagated through intersection. This change
makes it so that it also propagates through intervals
and lets intervals also generate contradictions.
Reviewed By: patapizza
Differential Revision: D4864160
fbshipit-source-id: 8348267
Summary:
This is a very common pattern (>1k occurrences).
Replacing it with something shorter makes the rules a bit less
boilerplate-y.
Feel free to bikeshed the name, I can easily redo the codemod.
Reviewed By: patapizza
Differential Revision: D4848864
fbshipit-source-id: 7baeee3
Summary:
This makes the code easier to read.
I'm not attached to naming, but this is
standard terminology from topology.
Reviewed By: JonCoens, patapizza
Differential Revision: D4848740
fbshipit-source-id: 79c2c20
Summary: Runs a `snap` server to return the support targets as well as do parsing. It's a bit kludgy, but gets the job done.
Reviewed By: patapizza
Differential Revision: D4813197
fbshipit-source-id: 0fa165b
Summary:
This is the easiest way to fix it, but talking offline
with Julien, we may need to revisit.
It basically gets rid of time series where we were
producing intervals that are not a multiple of the grain.
Reviewed By: patapizza
Differential Revision: D4841759
fbshipit-source-id: 1c4742a
Summary: Convert large regex lookups to HashMap lookups in Duckling/Numeral/FR/Rules.hs and Duckling/Ordinal/FR/Rules.hs
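
A rough sketch of the technique, under stated assumptions: the table contents and the names `frenchNumerals` and `lookupNumeral` are made up for illustration, and `Data.Map.Strict` from `containers` stands in for a `HashMap` only to keep the example free of extra dependencies. The point is that a regex can match a generic word and the value is then found by a single table lookup instead of a long alternation.

```haskell
import Data.Char (toLower)
import qualified Data.Map.Strict as Map

-- Hypothetical table from a surface form to its value.
frenchNumerals :: Map.Map String Integer
frenchNumerals =
  Map.fromList
    [ ("un", 1)
    , ("deux", 2)
    , ("trois", 3)
    , ("quatre", 4)
    ]

-- Case-insensitive lookup of an already matched word.
lookupNumeral :: String -> Maybe Integer
lookupNumeral word = Map.lookup (map toLower word) frenchNumerals
```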
Reviewed By: patapizza
Differential Revision: D4836336
fbshipit-source-id: 2241a3a
Summary: Moves to the 8.6 resolver, updates package limits, and fixes errors due to upgrade.
Reviewed By: patapizza
Differential Revision: D4810924
fbshipit-source-id: c8a64a9
Summary:
This converts the code to monadic style, so that
we can in the future:
* stop threading the `Document` parameter everywhere
* keep some state, like regexp match cache (I've already checked that it makes a substantial difference)
There should be no difference in performance or behavior
at this point.
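
As an illustration of the first bullet only, here is a minimal hand-rolled reader-style monad. Everything in it (`Parse`, `askDocument`, `textLength`, `startsWith`, `summary`, and this one-field `Document`) is hypothetical and much simpler than the real code; a match cache would additionally need a state layer, which this sketch does not attempt.

```haskell
-- Hypothetical, simplified document type.
newtype Document = Document { rawText :: String }

-- A computation that can read the Document without it being passed
-- explicitly at every call site.
newtype Parse a = Parse { runParse :: Document -> a }

instance Functor Parse where
  fmap f (Parse g) = Parse (f . g)

instance Applicative Parse where
  pure = Parse . const
  Parse f <*> Parse g = Parse (\doc -> f doc (g doc))

instance Monad Parse where
  Parse g >>= k = Parse (\doc -> runParse (k (g doc)) doc)

-- Access the ambient document.
askDocument :: Parse Document
askDocument = Parse id

textLength :: Parse Int
textLength = length . rawText <$> askDocument

startsWith :: Char -> Parse Bool
startsWith c = do
  doc <- askDocument
  pure (take 1 (rawText doc) == [c])

-- Neither helper takes a Document argument here.
summary :: Parse (Int, Bool)
summary = do
  n <- textLength
  d <- startsWith 'D'
  pure (n, d)
```

The document is supplied once at the edge, for example `runParse summary (Document "Demain")`.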
Reviewed By: patapizza
Differential Revision: D4778808
fbshipit-source-id: a167ed8
Summary:
`stack exe/RegenMain.hs` uses runghc which is a tool
we don't test with often. Making sure the executable
is rebuilt and using it should be enough.
Reviewed By: patapizza
Differential Revision: D4783844
fbshipit-source-id: 459dbc4
Summary:
This works around https://github.com/haskell/cabal/issues/4350
If we don't do this, files get compiled multiple times
and cabal is unhappy.
Reviewed By: patapizza
Differential Revision: D4782749
fbshipit-source-id: 5bbe425
Summary: Use HashMaps to speed up string pattern matching for UK (Ukrainian).
Reviewed By: patapizza
Differential Revision: D4747195
fbshipit-source-id: e582dba
Summary:
Contrary to my intuitions, this part is the lion's share
of allocations in `lookupRegexp`. I'd have expected `Text`
operations to dwarf it.
It's a bit dubious that we build such big lists that it
matters; perhaps in the future we can explore limiting the
number of matches considered.
Reviewed By: patapizza
Differential Revision: D4745711
fbshipit-source-id: ebdc1aa
Summary:
`isRangeValid` was doing lots of random indexing inside a Text.
Since we already have a convenient O(1), indexable `Vector Char`
we can just use it instead.
Reviewed By: patapizza
Differential Revision: D4744297
fbshipit-source-id: b23011b
Summary:
`isAdjacent` was doing a ton of useless copies and
redundant work. By pre-computing a `firstNonAdjacent` table
we can answer every `isAdjacent` query in `O(1)` time and
with (almost?) no allocations.
It may be a symptom of algorithmic problems, but we shouldn't
make it more expensive than it needs to be.
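
A minimal sketch of the table idea, assuming that "adjacent" means "separated only by whitespace" and using a boxed `Array` over a `String` for simplicity. The function names mirror the ones mentioned above, but the bodies are an illustration and not the code in this diff.

```haskell
import Data.Array (Array, listArray, (!))
import Data.Char (isSpace)

-- Entry i holds the first position >= i that is not whitespace, or the
-- text length if there is none. Built in one right-to-left pass.
firstNonAdjacent :: String -> Array Int Int
firstNonAdjacent s = listArray (0, n) (scanr step n (zip [0 ..] s))
  where
    n = length s
    step (i, c) next
      | isSpace c = next
      | otherwise = i

-- True when only whitespace lies between positions a (end of one
-- token, exclusive) and b (start of the next). One array read.
isAdjacent :: Array Int Int -> Int -> Int -> Bool
isAdjacent table a b = a <= b && table ! a >= b
```

For the text `"ab  cd"`, the token `ab` ends at 2 and `cd` starts at 4, so `isAdjacent (firstNonAdjacent "ab  cd") 2 4` holds, while a query starting at position 1 does not.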
Reviewed By: patapizza
Differential Revision: D4744172
fbshipit-source-id: dd70be2
Summary:
This will let me do smarter things on document construction,
like precomputing where all the whitespace is so that
I can answer `isAdjacent` in O(1) time.
If I'm measuring things right my next diff will cut down
allocations 4x on problematic inputs.
Reviewed By: patapizza
Differential Revision: D4742664
fbshipit-source-id: 7e14e25
Summary:
`Show` should print things close to source level representation.
I wanted to generate some tests from inputs that cause problems
and there was no way to get source level representation of
Dimension.
Reviewed By: patapizza
Differential Revision: D4723711
fbshipit-source-id: fff658d
Summary:
* 'no dia 20' (on the 20)
* Unifying two rules into one, with a day grain
See https://github.com/wit-ai/wit/issues/388
Reviewed By: blandinw
Differential Revision: D4715780
fbshipit-source-id: e990954
Summary:
No need to reinvent the wheel when `dependent-sum` has what we need. I re-export `Some(..)` from `Duckling.Dimensions.Types` to cut down on import bloat.
Instead of a `Read` instance I created a `fromName` function.
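
Roughly, the shape is as below. To stay self-contained the sketch defines its own `Some` wrapper rather than importing the one from `dependent-sum`, and the two-constructor `Dimension`, the name strings, `toName` and `someName` are assumptions made for illustration.

```haskell
{-# LANGUAGE GADTs #-}

-- Hypothetical, cut-down dimension type indexed by its value type.
data Dimension a where
  Numeral :: Dimension Double
  Url :: Dimension String

-- Local stand-in for the existential wrapper: hides the type index.
data Some f where
  Some :: f a -> Some f

toName :: Dimension a -> String
toName Numeral = "numeral"
toName Url = "url"

-- Parsing a name cannot know the index statically, so it returns the
-- wrapped form; unknown names give Nothing.
fromName :: String -> Maybe (Some Dimension)
fromName "numeral" = Just (Some Numeral)
fromName "url" = Just (Some Url)
fromName _ = Nothing

-- Unwrap to use any function that works for every index.
someName :: Some Dimension -> String
someName (Some d) = toName d
```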
Reviewed By: zilberstein
Differential Revision: D4710014
fbshipit-source-id: 1d4e86d
Summary:
stack creates this directory; we should
prevent it from being committed.
Reviewed By: JonCoens
Differential Revision: D4713790
fbshipit-source-id: 34b723d
Summary:
* Simplified `Url` to only keep track of what we need (we can change back later)
* Normalize domain: remove subdomains like `www`, `www2` and lower case
* Return the full domain in the JSON value field
* Updated offensive url example
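
The normalization step could look something like this. `normalizeDomain` is a hypothetical helper, and it assumes that "subdomains like `www`, `www2`" means a leading `www` label optionally followed by digits.

```haskell
import Data.Char (isDigit, toLower)
import Data.List (stripPrefix)

-- Lowercase the domain and drop a leading "www" or "www<digits>" label.
normalizeDomain :: String -> String
normalizeDomain raw =
  case stripPrefix "www" lowered of
    Just rest
      | (_, '.' : domain) <- span isDigit rest
      , not (null domain) -> domain
    _ -> lowered
  where
    lowered = map toLower raw
```

A domain that merely starts with the letters `www`, such as `wwwexample.com`, is only lowercased because no dot follows the optional digits.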
Reviewed By: JonCoens
Differential Revision: D4705403
fbshipit-source-id: e5d11ee
Summary: `DNumber` is a terrible name and was only there for legacy reasons. `Numeral` makes more sense for this dimension, so let's use that instead.
Reviewed By: patapizza
Differential Revision: D4707167
fbshipit-source-id: cd78aa3
Summary:
The current regexp matches sequences of numbers of unbounded
length with lots of backtracking. Since phone numbers
are shorter than X=20 characters we can put a bound
on every currently unbounded match.
Additionally we can use groups that don't capture, to
avoid marshalling data that we won't need.
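
For illustration only, here is a hypothetical before/after pair of patterns in PCRE-style syntax. Neither is the regexp from this diff, and the bounds are per repetition rather than a limit on the whole match; they only show the two transformations described above.

```haskell
-- Hypothetical "before": unbounded digit runs and capturing groups.
unboundedPattern :: String
unboundedPattern = "(\\d+)([\\s.-]\\d+)*"

-- Hypothetical "after": every repetition is bounded, and the groups
-- use (?:...) so that nothing is captured.
boundedPattern :: String
boundedPattern = "(?:\\d{1,20})(?:[\\s.-]\\d{1,20}){0,20}"
```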
Reviewed By: JonCoens
Differential Revision: D4706862
fbshipit-source-id: 39ca9bb