Summary:
This is the easiest way to fix it, but after talking offline
with Julien, we may need to revisit.
It basically gets rid of time series where we were
producing intervals that are not a multiple of the grain.
Reviewed By: patapizza
Differential Revision: D4841759
fbshipit-source-id: 1c4742a
Summary: Converting large regex lookups to HashMap lookups in Duckling/Numeral/FR/Rules.hs and Duckling/Ordinal/FR/Rules.hs.
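A minimal sketch of the technique, with illustrative names and a tiny subset of entries (not the actual FR tables):
```
{-# LANGUAGE OverloadedStrings #-}
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap
import Data.Text (Text)
import qualified Data.Text as Text

-- Instead of a giant alternation regex with one branch per word,
-- match a broad pattern once and resolve the value with an O(1)
-- HashMap lookup.
numeralMap :: HashMap Text Integer
numeralMap = HashMap.fromList
  [ ("zéro", 0), ("un", 1), ("deux", 2), ("trois", 3)
  , ("seize", 16), ("dix-sept", 17), ("dix-huit", 18), ("dix-neuf", 19)
  ]

lookupNumeral :: Text -> Maybe Integer
lookupNumeral w = HashMap.lookup (Text.toLower w) numeralMap
```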
Reviewed By: patapizza
Differential Revision: D4836336
fbshipit-source-id: 2241a3a
Summary: Moves to the 8.6 resolver, updates package limits, and fixes errors due to upgrade.
Reviewed By: patapizza
Differential Revision: D4810924
fbshipit-source-id: c8a64a9
Summary:
This converts the code to monadic style, so that
we can in the future:
* stop threading the `Document` parameter everywhere
* keep some state, like a regexp match cache (I've already checked that it makes a substantial difference)
There should be no difference in performance or behavior
at this point.
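A minimal sketch of the shape this enables, using illustrative stand-ins rather than Duckling's actual types:
```
import Control.Monad.State (State, gets, modify)
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap
import Data.Text (Text)

-- The Document can travel implicitly, and a regexp match cache can
-- be threaded as state instead of recomputing matches at every call.
data MatchState = MatchState
  { msDocument :: Text                 -- stand-in for Document
  , msCache    :: HashMap Text [Text]  -- pattern -> cached matches
  }

type Match a = State MatchState a

cachedMatches :: Text -> Match [Text] -> Match [Text]
cachedMatches pat compute = do
  hit <- gets (HashMap.lookup pat . msCache)
  case hit of
    Just ms -> pure ms
    Nothing -> do
      ms <- compute
      modify $ \s -> s { msCache = HashMap.insert pat ms (msCache s) }
      pure ms
```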
Reviewed By: patapizza
Differential Revision: D4778808
fbshipit-source-id: a167ed8
Summary:
`stack exe/RegenMain.hs` uses runghc, which is a tool
we don't test with often. Making sure the executable
is rebuilt and using it should be enough.
Reviewed By: patapizza
Differential Revision: D4783844
fbshipit-source-id: 459dbc4
Summary:
This works around https://github.com/haskell/cabal/issues/4350
If we don't do this, files get compiled multiple times
and cabal is unhappy.
Reviewed By: patapizza
Differential Revision: D4782749
fbshipit-source-id: 5bbe425
Summary: Use HashMaps to speed up string pattern matching for UK (Ukrainian).
Reviewed By: patapizza
Differential Revision: D4747195
fbshipit-source-id: e582dba
Summary:
Contrary to my intuition, this part is the lion's share
of allocations in `lookupRegexp`; I'd have expected `Text`
operations to dwarf it.
It's a bit dubious that we build lists big enough for this
to matter; perhaps in the future we can explore limiting the
number of matches considered.
Reviewed By: patapizza
Differential Revision: D4745711
fbshipit-source-id: ebdc1aa
Summary:
`isRangeValid` was doing lots of random indexing inside a `Text`,
which is O(n) per lookup. Since we already have a convenient,
O(1)-indexable `Vector Char`, we can just use that instead.
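Roughly the idea, with an illustrative boundary check (not the real `isRangeValid`):
```
import Data.Char (isSpace)
import qualified Data.Text as Text
import Data.Vector (Vector)
import qualified Data.Vector as Vector

-- Text.index is O(n) per call; boxed Vector indexing is O(1), so
-- repeated random access is far cheaper on a precomputed Vector Char.
fromText :: Text.Text -> Vector Char
fromText = Vector.fromList . Text.unpack

-- An illustrative check in this style: is position i preceded by
-- whitespace or the start of the document?
isBoundary :: Vector Char -> Int -> Bool
isBoundary v i = i <= 0 || isSpace (v Vector.! (i - 1))
```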
Reviewed By: patapizza
Differential Revision: D4744297
fbshipit-source-id: b23011b
Summary:
`isAdjacent` was doing a ton of useless copies and
redundant work. By pre-computing a `firstNonAdjacent` table
we can answer every `isAdjacent` query in `O(1)` time with
(almost?) no allocations.
It may be a symptom of algorithmic problems, but we shouldn't
make it more expensive than it needs to be.
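A minimal sketch of the precomputation; names and details are illustrative, not the exact table Duckling builds:
```
import Data.Char (isSpace)
import Data.Vector (Vector)
import qualified Data.Vector as Vector

-- firstNonAdjacent ! i is the smallest j >= i such that s ! j is not
-- whitespace (or the length of s if there is none). Building it is
-- a single right-to-left pass.
mkFirstNonAdjacent :: Vector Char -> Vector Int
mkFirstNonAdjacent s = Vector.fromList . init $
  scanr step n (zip [0 ..] (Vector.toList s))
  where
    n = Vector.length s
    step (i, c) next = if isSpace c then next else i

-- A range ending at e1 is adjacent to one starting at s2 (e1 <= s2)
-- iff only whitespace separates them: an O(1) query.
isAdjacent :: Vector Int -> Int -> Int -> Bool
isAdjacent t e1 s2 = e1 >= Vector.length t || t Vector.! e1 >= s2
```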
Reviewed By: patapizza
Differential Revision: D4744172
fbshipit-source-id: dd70be2
Summary:
This will let me do smarter things on document construction,
like precomputing where all the whitespace is so that
I can answer `isAdjacent` in O(1) time.
If I'm measuring things right my next diff will cut down
allocations 4x on problematic inputs.
Reviewed By: patapizza
Differential Revision: D4742664
fbshipit-source-id: 7e14e25
Summary:
`Show` should print things close to the source-level representation.
I wanted to generate some tests from inputs that cause problems,
and there was no way to get a source-level representation of
Dimension.
Reviewed By: patapizza
Differential Revision: D4723711
fbshipit-source-id: fff658d
Summary:
* 'no dia 20' (on the 20)
* Unifying two rules into one, with a day grain
See https://github.com/wit-ai/wit/issues/388
Reviewed By: blandinw
Differential Revision: D4715780
fbshipit-source-id: e990954
Summary:
No need to reinvent the wheel when `dependent-sum` has what we need. I re-export `Some(..)` from `Duckling.Dimensions.Types` to cut down on import bloat.
Instead of a `Read` instance I created a `fromName` function.
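Roughly what `fromName` looks like, with local stand-ins for the real types (`Some` actually comes from `dependent-sum`):
```
{-# LANGUAGE ExistentialQuantification, GADTs, OverloadedStrings #-}
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap
import Data.Text (Text)

-- Stand-ins: Some hides the index of a GADT tag, and fromName
-- replaces a Read instance with a plain table lookup.
data Dimension a where
  Numeral :: Dimension Int
  Url     :: Dimension Text

data Some tag = forall a. Some (tag a)

fromName :: Text -> Maybe (Some Dimension)
fromName name = HashMap.lookup name m
  where
    m :: HashMap Text (Some Dimension)
    m = HashMap.fromList [("numeral", Some Numeral), ("url", Some Url)]
```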
Reviewed By: zilberstein
Differential Revision: D4710014
fbshipit-source-id: 1d4e86d
Summary:
stack creates this directory; we should
prevent it from being committed.
Reviewed By: JonCoens
Differential Revision: D4713790
fbshipit-source-id: 34b723d
Summary:
* Simplified `Url` to only keep track of what we need (we can change back later)
* Normalize the domain: strip subdomains like `www` and `www2`, and lower-case it (sketched below)
* Return the full domain in the JSON value field
* Updated offensive url example
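The normalization step, sketched; this is illustrative and the real rule set may differ:
```
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as Text

-- Lower-case the domain and strip a leading www/www2 subdomain,
-- keeping at least a registrable domain behind it, e.g.
-- normalizeDomain "WWW.Example.COM" == "example.com".
normalizeDomain :: Text -> Text
normalizeDomain raw =
  case Text.splitOn "." (Text.toLower raw) of
    (sub : rest@(_ : _ : _))
      | sub `elem` ["www", "www2"] -> Text.intercalate "." rest
    parts -> Text.intercalate "." parts
```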
Reviewed By: JonCoens
Differential Revision: D4705403
fbshipit-source-id: e5d11ee
Summary: `DNumber` is a terrible name and was only there for legacy reasons. `Numeral` makes more sense for this dimension, so let's use that instead.
Reviewed By: patapizza
Differential Revision: D4707167
fbshipit-source-id: cd78aa3
Summary:
The current regexp matches sequences of numbers of unbounded
length with lots of backtracking. Since phone numbers
are shorter than X=20 characters, we can put a bound
on every currently unbounded match.
Additionally, we can use non-capturing groups to avoid
marshalling data that we won't need.
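A hypothetical before/after in that spirit (not the exact Duckling pattern):
```
-- Unbounded repetition plus capturing groups invites backtracking
-- and marshals submatches we never read:
before :: String
before = "\\(?(\\d+)\\)?([\\s.-]*(\\d+))*"

-- Bounding every repetition and using non-capturing (?:...) groups
-- keeps matches short and the marshalled data minimal:
after :: String
after = "(?:\\(?\\d{1,20}\\)?(?:[\\s.-]{0,3}\\d{1,20}){0,4})"
```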
Reviewed By: JonCoens
Differential Revision: D4706862
fbshipit-source-id: 39ca9bb
Summary:
It is no longer necessary after D4676812 and D4698788.
`"I have 9 am 12 pm 1 pm 2pm 4 pm 3 pm on Saturday"` now works in
less than a second, it used to be 10s.
The test suite also got 3s faster.
Reviewed By: patapizza
Differential Revision: D4701890
fbshipit-source-id: 107a55f
Summary:
This is the next step for:
https://fb.facebook.com/groups/527352907463243/permalink/600056483526218/
This:
* changes the time language to be able to track contradictions (`EmptyPredicate`)
* changes the time language to be able to collect non-contradicting pieces, like month and hour, and unify them (see the sketch after this list)
* provides an efficient way to convert those pieces into (past, future) time series
* adds an AMPM predicate runner - there's a bit of overlap with is12H, but it basically works
* changes a test case that was wrong before
* regenerates classifiers; I'm not sure exactly why they changed
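A minimal sketch of the "collect non-contradicting pieces" idea, with illustrative types rather than the actual time language:
```
import Control.Monad (mplus)

-- A partial time is a bag of optional fields; unify succeeds only
-- when the two sides agree wherever both are defined, and a
-- disagreement is a contradiction (EmptyPredicate in this diff).
data PartialTime = PartialTime
  { ptMonth :: Maybe Int
  , ptHour  :: Maybe Int
  } deriving (Eq, Show)

unify :: PartialTime -> PartialTime -> Maybe PartialTime
unify a b = PartialTime <$> field ptMonth <*> field ptHour
  where
    field :: Eq x => (PartialTime -> Maybe x) -> Maybe (Maybe x)
    field f = case (f a, f b) of
      (Just x, Just y) | x /= y -> Nothing
      (x, y)                    -> Just (x `mplus` y)
```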
Before:
```
res <- H.io $ let sentence = "10am thurs 4.30 thurs 12pm sat" in (debugTokens sentence $ analyze sentence (testContext {lang = EN}) HashSet.empty)
(15.50 secs, 6,171,188,928 bytes)
res <- H.io $ let sentence = "I have 9 am 12 pm 1 pm 2pm 4 pm 3 pm on Saturday" in (debugTokens sentence $ analyze sentence (testContext {lang = EN}) HashSet.empty)
(110.82 secs, 44,031,569,512 bytes)
```
After:
```
res <- H.io $ let sentence = "10am thurs 4.30 thurs 12pm sat" in (debugTokens sentence $ analyze sentence (testContext {lang = EN}) HashSet.empty)
(1.24 secs, 703,020,912 bytes)
res <- H.io $ let sentence = "I have 9 am 12 pm 1 pm 2pm 4 pm 3 pm on Saturday" in (debugTokens sentence $ analyze sentence (testContext {lang = EN}) HashSet.empty)
(9.51 secs, 5,891,109,592 bytes)
```
Reviewed By: JonCoens
Differential Revision: D4676812
fbshipit-source-id: 9810203
Summary:
* we weren't checking the right reference time in `takeNth` and `takeN`
* fixing resulting failing tests for `IT`
* `analyzedNTest` to check that input results in `n` parsed tokens
Reviewed By: niteria
Differential Revision: D4698788
fbshipit-source-id: 2cd4762
Summary:
`cabal` is spewing this (it still successfully loads):
```
Warning: 'license: BSD' is not a recognised license. The known licenses are:
GPL, GPL-2, GPL-3, LGPL, LGPL-2.1, LGPL-3, AGPL, AGPL-3, BSD2, BSD3, MIT, ISC,
MPL-2.0, Apache, Apache-2.0, PublicDomain, AllRightsReserved, OtherLicense
```
Looking at the LICENSE file we have in the repo and the Wikipedia page https://en.wikipedia.org/wiki/BSD_licenses, it looks like we're using BSD3.
Reviewed By: patapizza
Differential Revision: D4697670
fbshipit-source-id: 6c80078