Summary:
Contrary to my intuition, this part is the lion's share
of allocations in `lookupRegexp`. I'd have expected `Text`
operations to dwarf it.
It's a bit dubious that we build lists big enough for this
to matter; perhaps in the future we can explore limiting
the number of matches considered.
Reviewed By: patapizza
Differential Revision: D4745711
fbshipit-source-id: ebdc1aa
Summary:
`isRangeValid` was doing lots of random indexing into a
`Text`, which costs O(n) per lookup. Since we already have
a convenient, O(1)-indexable `Vector Char`, we can just use
it instead.
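A minimal sketch of the difference, not Duckling's actual code:
```haskell
import qualified Data.Text as Text
import qualified Data.Vector as Vector

-- Data.Text.index walks the string from the front: O(n) per lookup.
charAtText :: Text.Text -> Int -> Char
charAtText = Text.index

-- A Vector gives genuine constant-time random access.
charAtVector :: Vector.Vector Char -> Int -> Char
charAtVector = (Vector.!)
```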
Reviewed By: patapizza
Differential Revision: D4744297
fbshipit-source-id: b23011b
Summary:
`isAdjacent` was doing a ton of useless copies and
redundant work. By pre-computing a `firstNonAdjacent` table
we can answer every `isAdjacent` query in `O(1)` time with
(almost?) no allocations.
That `isAdjacent` is this hot may be a symptom of
algorithmic problems, but we shouldn't make it more
expensive than it needs to be.
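A minimal sketch of the table, assuming adjacency means "only
whitespace in between"; the names mirror the diff but the code is
illustrative:
```haskell
import Data.Char (isSpace)
import qualified Data.Vector as V

-- firstNonAdjacent V.! i is the smallest j >= i whose character is
-- not whitespace (or n when there is none). Built once per document,
-- in O(n), by a single right-to-left scan.
mkFirstNonAdjacent :: V.Vector Char -> V.Vector Int
mkFirstNonAdjacent v = V.fromList $ scanr step n [0 .. n - 1]
  where
    n = V.length v
    step i next
      | isSpace (v V.! i) = next
      | otherwise         = i

-- Positions a <= b are adjacent iff only whitespace sits in [a, b),
-- i.e. the first non-whitespace character at or after a is at b or
-- later.
isAdjacent :: V.Vector Int -> Int -> Int -> Bool
isAdjacent table a b = table V.! a >= b
```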
Reviewed By: patapizza
Differential Revision: D4744172
fbshipit-source-id: dd70be2
Summary:
This will let me do smarter things at document construction
time, like precomputing where all the whitespace is so that
I can answer `isAdjacent` in O(1) time.
If I'm measuring things right, my next diff will cut
allocations 4x on problematic inputs.
Reviewed By: patapizza
Differential Revision: D4742664
fbshipit-source-id: 7e14e25
Summary:
`Show` should print things close to their source-level
representation. I wanted to generate some tests from inputs
that cause problems, and there was no way to get a
source-level representation of `Dimension`.
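A minimal sketch of the intent, with an illustrative subset of
constructors rather than the real `Dimension`:
```haskell
{-# LANGUAGE GADTs #-}

-- Illustrative stand-in for the real GADT-style Dimension.
data Dimension a where
  Numeral :: Dimension ()
  Time    :: Dimension ()

-- Printing the constructor exactly as written in source means Show
-- output can be pasted straight into generated test files.
instance Show (Dimension a) where
  show Numeral = "Numeral"
  show Time    = "Time"
```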
Reviewed By: patapizza
Differential Revision: D4723711
fbshipit-source-id: fff658d
Summary:
* 'no dia 20' (on the 20th)
* Unifying two rules into one, with a day grain
See https://github.com/wit-ai/wit/issues/388
Reviewed By: blandinw
Differential Revision: D4715780
fbshipit-source-id: e990954
Summary:
No need to reinvent the wheel when `dependent-sum` has what we need. I re-export `Some(..)` from `Duckling.Dimensions.Types` to cut down on import bloat.
Instead of a `Read` instance, I created a `fromName` function.
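A self-contained sketch of the resulting shape; `Some` is hand-rolled
here to keep the example standalone (the diff re-exports the
equivalent from `dependent-sum`), and the constructors are
illustrative:
```haskell
{-# LANGUAGE ExistentialQuantification, GADTs, OverloadedStrings #-}

import Data.Text (Text)

-- Stand-in for the existential wrapper dependent-sum provides.
data Some tag = forall a. Some (tag a)

-- Illustrative subset of constructors, not the real Dimension.
data Dimension a where
  Numeral :: Dimension Int
  Url     :: Dimension Text

-- Unlike a Read instance, fromName is total and fails explicitly.
fromName :: Text -> Maybe (Some Dimension)
fromName "numeral" = Just (Some Numeral)
fromName "url"     = Just (Some Url)
fromName _         = Nothing
```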
Reviewed By: zilberstein
Differential Revision: D4710014
fbshipit-source-id: 1d4e86d
Summary:
stack creates this directory; we should
prevent it from being committed.
Reviewed By: JonCoens
Differential Revision: D4713790
fbshipit-source-id: 34b723d
Summary:
* Simplified `Url` to only keep track of what we need (we can change back later)
* Normalize the domain: remove subdomains like `www` and `www2`, and lower-case it (sketched below)
* Return the full domain in the JSON value field
* Updated the offensive URL example
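A minimal sketch of the normalization; the helper name is
hypothetical and the real rule for which subdomains get stripped may
differ:
```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as Text

-- Lower-case the domain and drop one leading subdomain such as "www"
-- or "www2", as long as at least two labels (e.g. example.com) remain.
normalizeDomain :: Text -> Text
normalizeDomain dom =
  case Text.splitOn "." (Text.toLower dom) of
    (sub : rest@(_ : _ : _))
      | sub `elem` ["www", "www2"] -> Text.intercalate "." rest
    parts -> Text.intercalate "." parts

-- normalizeDomain "WWW2.Example.COM" == "example.com"
```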
Reviewed By: JonCoens
Differential Revision: D4705403
fbshipit-source-id: e5d11ee
Summary: `DNumber` is a terrible name and was only there for legacy reasons. `Numeral` makes more sense for this dimension, so let's use that instead.
Reviewed By: patapizza
Differential Revision: D4707167
fbshipit-source-id: cd78aa3
Summary:
The current regexp matches sequences of numbers of unbounded
length with lots of backtracking. Since phone numbers
are shorter than X=20 characters, we can put a bound
on every currently unbounded match.
Additionally, we can use non-capturing groups to avoid
marshalling data that we won't need.
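An illustrative before/after of the two techniques; these patterns
are made up, not the diff's actual regexp:
```haskell
-- Unbounded quantifiers plus capturing groups: every (...) gets
-- marshalled back to Haskell, and unbounded repetition backtracks
-- badly on long digit runs.
unboundedPattern :: String
unboundedPattern = "(\\d+)([-. ]*\\d+)*"

-- Bounds everywhere, non-capturing (?:...) groups: total length is
-- capped and no unused match data is materialized.
boundedPattern :: String
boundedPattern = "(?:\\d{1,20})(?:[-. ]{0,3}\\d{1,20}){0,5}"
```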
Reviewed By: JonCoens
Differential Revision: D4706862
fbshipit-source-id: 39ca9bb
Summary:
It is no longer necessary after D4676812 and D4698788.
`"I have 9 am 12 pm 1 pm 2pm 4 pm 3 pm on Saturday"` now works in
less than a second; it used to take 10s.
The test suite also got 3s faster.
Reviewed By: patapizza
Differential Revision: D4701890
fbshipit-source-id: 107a55f
Summary:
This is the next step for:
https://fb.facebook.com/groups/527352907463243/permalink/600056483526218/
This:
* changes the time language to be able to track contradictions (`EmptyPredicate`)
* changes the time language to be able to collect non-contradicting pieces, like month and hour, and unify them (see the sketch after the measurements below)
* provides an efficient way to convert those pieces into (past, future) time series
* adds an AMPM predicate runner; there's a bit of overlap with `is12H`, but it basically works
* changes a test case that was wrong before
* regenerates classifiers; I'm not sure exactly why they changed
Before:
```
res <- H.io $ let sentence = "10am thurs 4.30 thurs 12pm sat" in (debugTokens sentence $ analyze sentence (testContext {lang = EN}) HashSet.empty)
(15.50 secs, 6,171,188,928 bytes)
res <- H.io $ let sentence = "I have 9 am 12 pm 1 pm 2pm 4 pm 3 pm on Saturday" in (debugTokens sentence $ analyze sentence (testContext {lang = EN}) HashSet.empty)
(110.82 secs, 44,031,569,512 bytes)
```
After:
```
res <- H.io $ let sentence = "10am thurs 4.30 thurs 12pm sat" in (debugTokens sentence $ analyze sentence (testContext {lang = EN}) HashSet.empty)
(1.24 secs, 703,020,912 bytes)
res <- H.io $ let sentence = "I have 9 am 12 pm 1 pm 2pm 4 pm 3 pm on Saturday" in (debugTokens sentence $ analyze sentence (testContext {lang = EN}) HashSet.empty)
(9.51 secs, 5,891,109,592 bytes)
```
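A minimal, self-contained sketch of the "collect non-contradicting
pieces" idea from the list above; the types and names are
illustrative, not the diff's actual ones:
```haskell
import qualified Data.Map.Strict as Map

-- A "piece" assigns values to time fields, e.g. month = 3, hour = 10.
type Fields = Map.Map String Int

data Predicate
  = EmptyPredicate    -- a known contradiction: matches nothing
  | Pieces Fields

-- Two predicates unify iff they agree on every shared field; a
-- disagreement is recorded once as EmptyPredicate instead of being
-- rediscovered on every evaluation.
compose :: Predicate -> Predicate -> Predicate
compose EmptyPredicate _ = EmptyPredicate
compose _ EmptyPredicate = EmptyPredicate
compose (Pieces a) (Pieces b)
  | and (Map.intersectionWith (==) a b) = Pieces (Map.union a b)
  | otherwise                           = EmptyPredicate
```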
Reviewed By: JonCoens
Differential Revision: D4676812
fbshipit-source-id: 9810203
Summary:
* we weren't checking the right reference time in `takeNth` and `takeN`
* fixing the resulting failing tests for `IT`
* adding `analyzedNTest` to check that an input results in `n` parsed tokens (sketched below)
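A minimal sketch of what such a helper asserts; the signature is an
assumption, with `analyze` standing in for the real analyzer:
```haskell
import Test.HUnit (Assertion, assertEqual)

-- Assert that analyzing an input yields exactly n tokens.
analyzedNTest :: (String -> [token]) -> String -> Int -> Assertion
analyzedNTest analyze input n =
  assertEqual ("token count for " ++ show input) n (length (analyze input))
```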
Reviewed By: niteria
Differential Revision: D4698788
fbshipit-source-id: 2cd4762
Summary:
`cabal` is spewing this warning (the package still loads successfully):
```
Warning: 'license: BSD' is not a recognised license. The known licenses are:
GPL, GPL-2, GPL-3, LGPL, LGPL-2.1, LGPL-3, AGPL, AGPL-3, BSD2, BSD3, MIT, ISC,
MPL-2.0, Apache, Apache-2.0, PublicDomain, AllRightsReserved, OtherLicense
```
Looking at the LICENSE file we have in the repo and the Wikipedia page https://en.wikipedia.org/wiki/BSD_licenses, it looks like we're using BSD3.
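The corresponding one-line change in the `.cabal` file:
```
license: BSD3
```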
Reviewed By: patapizza
Differential Revision: D4697670
fbshipit-source-id: 6c80078