Commit Graph

112 Commits

Author SHA1 Message Date
Bartosz Nitka
8db73688d7 Move Document and helpers to a fresh module
Summary:
Document had its internal details leaked over 2 files.
This consolidates it.

It took a long time to make this perf neutral (now it's even a tiny
win), for reasons I don't completely understand.
The INLINE pragma on byteStringFromPos I semi-understand,
but I also had to move isRangeValid to Document and that's
a bit of a mystery.

Reviewed By: patapizza

Differential Revision: D4948449

fbshipit-source-id: ffb251a
2017-04-25 16:49:18 -07:00
Bartosz Nitka
924516103b Revert Duckling part of 'clean up unused imports'
Summary: it doesn't take .cabal into account

Reviewed By: patapizza

Differential Revision: D4938400

fbshipit-source-id: 8bc99a5
2017-04-24 07:34:27 -07:00
Julien Odent
dbe9e73541 Duration
Summary: Duration dimension for Hebrew.

Reviewed By: niteria

Differential Revision: D4930403

fbshipit-source-id: 690db8f
2017-04-24 06:49:40 -07:00
Julien Odent
efa38401b5 TimeGrain
Summary: TimeGrain dimension for Hebrew.

Reviewed By: niteria

Differential Revision: D4930294

fbshipit-source-id: 9c0f0da
2017-04-24 06:49:40 -07:00
Julien Odent
f5f4889770 Ordinal
Summary: Ordinal dimension for Hebrew.

Reviewed By: niteria

Differential Revision: D4930162

fbshipit-source-id: 02545ae
2017-04-24 06:49:40 -07:00
Julien Odent
bd96d3dd95 Setup + Numeral
Summary: Setup for Hebrew + Numeral dimension

Reviewed By: niteria

Differential Revision: D4930041

fbshipit-source-id: 965132b
2017-04-24 06:49:40 -07:00
Bartosz Nitka
b26aa7d84d clean up unused imports
Summary:
This diff was generated by running `hsclimps`

Reviewed By: niteria

Differential Revision: D4937839

fbshipit-source-id: bb3d330
2017-04-24 05:19:24 -07:00
Bartosz Nitka
7f7cc70d72 Make first pass more obvious
Summary:
Separating out the first pass lets us avoid repeated filtering
and makes the structure of the algorithm a bit more clear.

Previously `Stash.null` was used as a test for being part of
the first pass or not, but that is a bit indirect. Encoding
the algorithm structure (the state automaton) as function calls
lets us make additional assumptions.

It also has a nice side effect of costs being attributed to
first/subsequent passes in the profile.

I also prepend to `matches` because it's likely to be bigger.

Reviewed By: patapizza

Differential Revision: D4922195

fbshipit-source-id: 0aec79f
2017-04-20 11:49:15 -07:00
Bartosz Nitka
878f85b9e1 Codemod intersectMB to intersect
Summary:
`intersectMB` was a name used for the purpose of migrating.
This is the last part of the migration.

Reviewed By: patapizza

Differential Revision: D4906098

fbshipit-source-id: a70af78
2017-04-18 10:19:20 -07:00
Bartosz Nitka
fe39a55a4c Use intervalMB instead of interval
Summary:
This continues the work from:
"[Duckling] Don't produce trivially empty Tokens"
All the Rules should use intervalMB from now on.

Reviewed By: patapizza

Differential Revision: D4906072

fbshipit-source-id: 277b961
2017-04-18 10:19:20 -07:00
Bartosz Nitka
a91e787bb7 Derive Eq, Show for TimeIntervalType
Summary: This is always useful to have.

Reviewed By: patapizza

Differential Revision: D4864208

fbshipit-source-id: b879893
2017-04-18 08:19:20 -07:00
Bartosz Nitka
879b103ca3 Fix indexing problems with new regexp matcher
Summary:
My change had a couple of problems:
* utf8 character width logic was completely wrong for characters that need 3 or 4 bytes
* `Array.listArray (start, end)` produces an array where `end` is a valid index
* because of the previous point, the `arraySize` logic also has to change
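
A minimal sketch of the two pitfalls listed above, not the actual Duckling code. The names `utf8Width`, `mkArray` and `arraySize` here are hypothetical, and the sketch assumes the character width is read off the lead byte of a UTF-8 sequence. It also shows that `listArray` bounds are inclusive on both ends.

```haskell
import Data.Array (Array, bounds, listArray)
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Hypothetical helper: number of bytes in a UTF-8 sequence, read off its lead byte.
-- Returns Nothing for continuation bytes (10xxxxxx) and for bytes 0xF8 and above;
-- it does not reject overlong or out-of-range lead bytes.
utf8Width :: Word8 -> Maybe Int
utf8Width b
  | b .&. 0x80 == 0x00 = Just 1 -- 0xxxxxxx
  | b .&. 0xE0 == 0xC0 = Just 2 -- 110xxxxx
  | b .&. 0xF0 == 0xE0 = Just 3 -- 1110xxxx
  | b .&. 0xF8 == 0xF0 = Just 4 -- 11110xxx
  | otherwise = Nothing

-- Hypothetical helper: listArray takes inclusive bounds, so n elements
-- need the bounds (0, n - 1), not (0, n).
mkArray :: [a] -> Array Int a
mkArray xs = listArray (0, length xs - 1) xs

-- With inclusive bounds the size is (end - start + 1).
arraySize :: Array Int a -> Int
arraySize arr = let (lo, hi) = bounds arr in hi - lo + 1
```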

Reviewed By: watashi, darshankapashi

Differential Revision: D4894355

fbshipit-source-id: 8d07dfd
2017-04-14 15:49:17 -07:00
Bartosz Nitka
e7aeef5436 Avoid allocations and encoding in regexp matching
Summary: The rationale is explained in a new Note.

Reviewed By: patapizza

Differential Revision: D4884104

fbshipit-source-id: 81f36ee
2017-04-14 12:19:21 -07:00
Bartosz Nitka
3d18cf5ea9 Don't produce trivially empty Tokens
Summary:
We can detect certain kinds of contradictions sooner;
producing a token with an unresolvable Predicate is wasteful.
For a text like:
```
"Demain apres midi 14h 15 h 16h vendredi 14 a 15h"
```
it could produce 7000 tokens with empty predicates.
After this change it produces none and we get a 4x improvement in
time and 6x improvement in allocations.
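
The idea can be sketched as follows, under heavy assumptions: `Constraint` and `intersectMaybe` are hypothetical stand-ins, not Duckling's `Predicate` type or its actual intersection function. The point is only that a contradiction yields `Nothing` at intersection time, so no token is built for it.

```haskell
-- Hypothetical, simplified constraint: each field is either fixed or unconstrained.
data Constraint = Constraint
  { cHour :: Maybe Int      -- hour of day, if constrained
  , cDayOfWeek :: Maybe Int -- day of week, if constrained
  } deriving (Eq, Show)

-- Intersect two constraints. Returns Nothing as soon as any field
-- contradicts, instead of producing a constraint that can never match.
intersectMaybe :: Constraint -> Constraint -> Maybe Constraint
intersectMaybe a b =
  Constraint
    <$> merge (cHour a) (cHour b)
    <*> merge (cDayOfWeek a) (cDayOfWeek b)
  where
    merge (Just x) (Just y)
      | x == y = Just (Just x)
      | otherwise = Nothing -- contradiction: two different fixed values
    merge x Nothing = Just x
    merge Nothing y = Just y
```

A caller would then emit a token only in the `Just` case and skip the rule application otherwise.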

Note I only covered `ruleIntersect*` here. I need to do this for
other instances as well.

Reviewed By: JonCoens

Differential Revision: D4871078

fbshipit-source-id: 9f0e7ad
2017-04-11 16:35:05 -07:00
Kevin Cros
62bc5a317b Using hashmap look up instead of 'case of'
Summary: Updating regex with hashmap look ups.

Reviewed By: patapizza

Differential Revision: D4848178

fbshipit-source-id: 4d5ded8
2017-04-11 11:04:20 -07:00
ADAM LIU
928139569c Refactor of Duckling.Numeral.TR to hashmap lookup
Summary: Update of TR Rules hashmap

Reviewed By: patapizza

Differential Revision: D4860819

fbshipit-source-id: 6f5a722
2017-04-11 09:34:23 -07:00
Bartosz Nitka
f7b3f2ed73 Detect interval contradictions sooner
Summary:
So far, contradictions from intersection only
propagated through intersection. With this change
they also propagate through intervals,
and intervals can generate contradictions as well.

Reviewed By: patapizza

Differential Revision: D4864160

fbshipit-source-id: 8348267
2017-04-10 16:35:27 -07:00
Bartosz Nitka
1cf8496967 tt helper for returning Time Tokens
Summary:
This is a very common pattern (>1k occurrences).
Replacing it with something shorter makes the rules a bit less
boilerplate-y.
Feel free to bikeshed the name, I can easily redo the codemod.

Reviewed By: patapizza

Differential Revision: D4848864

fbshipit-source-id: 7baeee3
2017-04-10 12:34:43 -07:00
Bartosz Nitka
f46539ced2 Type for Closed/Open intervals
Summary:
This makes the code easier to read.
I'm not attached to naming, but this is
standard terminology from topology.

Reviewed By: JonCoens, patapizza

Differential Revision: D4848740

fbshipit-source-id: 79c2c20
2017-04-07 12:19:17 -07:00
Jonathan Coens
b3ca32104d Simple example HTTP server
Summary: Runs a `snap` server to return the support targets as well as do parsing. It's a bit kludgy, but gets the job done.

Reviewed By: patapizza

Differential Revision: D4813197

fbshipit-source-id: 0fa165b
2017-04-06 17:04:48 -07:00
ADAM LIU
572ff95adf Update RU Rules HashMap lookups update
Summary: Update of RU Rules hashmap

Reviewed By: patapizza

Differential Revision: D4840947

fbshipit-source-id: 00cb679
2017-04-06 15:49:17 -07:00
Bartosz Nitka
78ecaa3728 Derive NFData for Entity
Summary: This makes benchmarking easier.

Reviewed By: JonCoens

Differential Revision: D4846839

fbshipit-source-id: 9cc8dfa
2017-04-06 15:34:43 -07:00
Bartosz Nitka
290ca48e25 Fix 4:23am returning 5:23am
Summary:
This is the easiest way to fix it, but talking offline
with Julien, we may need to revisit.
It basically gets rid of time series where we were
producing intervals that are not a multiple of the grain.

Reviewed By: patapizza

Differential Revision: D4841759

fbshipit-source-id: 1c4742a
2017-04-06 11:04:16 -07:00
Amelia Wilson
70ef9b1bbe using hashmap lookups
Summary: converting large regex lookups to hashmap lookups in Duckling/Numeral/FR/Rules.hs and Duckling/Ordinal/FR/Rules.hs
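
A rough sketch of the pattern, not the code from those files: the table contents and the names `frenchNumerals` and `lookupNumeral` are made up for illustration, and `Data.Map.Strict` with `String` keys stands in for the `HashMap` the commit refers to. The matched word is looked up in a table instead of being dispatched through a long `case` expression.

```haskell
import Data.Char (toLower)
import qualified Data.Map.Strict as Map

-- Hypothetical lookup table; a real rule file would list many more entries.
frenchNumerals :: Map.Map String Integer
frenchNumerals =
  Map.fromList
    [ ("un", 1)
    , ("deux", 2)
    , ("trois", 3)
    , ("quatre", 4)
    ]

-- Resolve a matched word to its value; Nothing if the word is not in the table.
lookupNumeral :: String -> Maybe Integer
lookupNumeral word = Map.lookup (map toLower word) frenchNumerals
```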

Reviewed By: patapizza

Differential Revision: D4836336

fbshipit-source-id: 2241a3a
2017-04-05 12:20:10 -07:00
Jonathan Coens
7c47431ce5 Upgrade to stackage 8.8
Summary: Just a little bounds bump

Reviewed By: patapizza

Differential Revision: D4835536

fbshipit-source-id: d51fbb8
2017-04-05 11:19:31 -07:00
Jonathan Coens
e2da9bc7fb Upgrade to stackage 8.6
Summary: Moves to the 8.6 resolver, updates package limits, and fixes errors due to upgrade.

Reviewed By: patapizza

Differential Revision: D4810924

fbshipit-source-id: c8a64a9
2017-04-04 15:19:41 -07:00
Bartosz Nitka
e37bb7c186 Duckling monad for Engine
Summary:
This converts the code to monadic style, so that
we can in the future:
* stop threading the `Document` parameter everywhere
* keep some state, like regexp match cache (I've already checked that it makes a substantial difference)

There should be no difference in performance or behavior
at this point.

Reviewed By: patapizza

Differential Revision: D4778808

fbshipit-source-id: a167ed8
2017-03-31 14:19:40 -07:00
Julien Odent
78228dea83 Update email
Summary: Setup the correct email.

Reviewed By: JonCoens

Differential Revision: D4806876

fbshipit-source-id: a52f9f8
2017-03-30 16:20:08 -07:00
Bartosz Nitka
a1917a53f3 Make sure regen is rebuilt
Summary:
`stack exe/RegenMain.hs` uses runghc which is a tool
we don't test with often. Making sure the executable
is rebuilt and using it should be enough.

Reviewed By: patapizza

Differential Revision: D4783844

fbshipit-source-id: 459dbc4
2017-03-28 07:49:19 -07:00
Bartosz Nitka
bd94622f64 Move tests to tests and exes to exe
Summary:
This works around https://github.com/haskell/cabal/issues/4350
If we don't do this, files get compiled multiple times
and cabal is unhappy.

Reviewed By: patapizza

Differential Revision: D4782749

fbshipit-source-id: 5bbe425
2017-03-27 16:04:24 -07:00
Christian Bell
02e74cacd6 HashMap lookups for large regexes
Summary: Use HashMaps to speed up string pattern matching for UK (Ukrainian).

Reviewed By: patapizza

Differential Revision: D4747195

fbshipit-source-id: e582dba
2017-03-22 08:49:17 -07:00
Julien Odent
96f365e927 Expose toName
Summary: .

Reviewed By: niteria

Differential Revision: D4753842

fbshipit-source-id: 2e88e86
2017-03-22 08:19:19 -07:00
Bartosz Nitka
b108ab260f Allocate less in lookupRegexp
Summary:
Contrary to my intuitions, this part is the lion's share
of allocations in `lookupRegexp`. I'd have expected `Text`
operations to dwarf it.

It's a bit dubious that we build such big lists that it
matters; perhaps in the future we can explore limiting the
number of matches considered.

Reviewed By: patapizza

Differential Revision: D4745711

fbshipit-source-id: ebdc1aa
2017-03-21 09:19:18 -07:00
Bartosz Nitka
56a039eef1 Optimize isRangeValid
Summary:
`isRangeValid` was doing lots of random indexing inside a Text.
Since we already have a convenient O(1), indexable `Vector Char`
we can just use it instead.

Reviewed By: patapizza

Differential Revision: D4744297

fbshipit-source-id: b23011b
2017-03-21 08:49:16 -07:00
Bartosz Nitka
58bf36b9f4 Optimize isAdjacent
Summary:
`isAdjacent` was doing a ton of useless copies and
redundant work. By pre-computing a `firstNonAdjacent` table
we can answer every `isAdjacent` query in `O(1)` time and
(almost?) no allocations.
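
One way such a table could work, as a sketch only: it assumes "adjacent" means that nothing but whitespace lies between two positions, and `buildTable` and `isAdjacentAt` are hypothetical names rather than Duckling's API. Entry `i` holds the index of the first non-whitespace character at or after `i`, so each query is a single array read.

```haskell
import Data.Array (Array, listArray, (!))
import Data.Char (isSpace)

-- Entry i is the index of the first non-whitespace character at or after i,
-- or the string length if there is none. Built in one right-to-left pass.
buildTable :: String -> Array Int Int
buildTable s = listArray (0, n) table
  where
    n = length s
    table = foldr step [n] (zip [0 ..] s)
    step (i, c) acc@(next : _)
      | isSpace c = next : acc -- whitespace inherits the entry to its right
      | otherwise = i : acc    -- a non-whitespace character points at itself
    step _ [] = []             -- unreachable: the accumulator starts non-empty

-- True when only whitespace lies in the half-open range [a, b).
-- Assumes 0 <= a <= length of the original string.
isAdjacentAt :: Array Int Int -> Int -> Int -> Bool
isAdjacentAt table a b = a <= b && table ! a >= b
```

The table costs one pass over the input at document construction time, after which no query needs to copy or rescan the text.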

It may be a symptom of algorithmic problems, but we shouldn't
make it more expensive than it needs to be.

Reviewed By: patapizza

Differential Revision: D4744172

fbshipit-source-id: dd70be2
2017-03-21 07:34:24 -07:00
Bartosz Nitka
26b1327bcd Make Document type abstract
Summary:
This will let me do smarter things on document construction,
like precomputing where all the whitespace is so that
I can answer `isAdjacent` in O(1) time.

If I'm measuring things right my next diff will cut down
allocations 4x on problematic inputs.

Reviewed By: patapizza

Differential Revision: D4742664

fbshipit-source-id: 7e14e25
2017-03-20 20:49:24 -07:00
Bartosz Nitka
09acefbcf5 Make Show Dimension "law-abiding"
Summary:
`Show` should print things close to source level representation.
I wanted to generate some tests from inputs that cause problems
and there was no way to get source level representation of
Dimension.

Reviewed By: patapizza

Differential Revision: D4723711

fbshipit-source-id: fff658d
2017-03-16 16:34:16 -07:00
Julien Odent
e76cee3a6d Rename Finance to AmountOfMoney
Summary: Because it makes more sense.

Reviewed By: JonCoens

Differential Revision: D4721646

fbshipit-source-id: 449bfb4
2017-03-16 14:49:44 -07:00
Julien Odent
54c9448fba Rename Number to Numeral
Summary: For consistency with the dimension name.

Reviewed By: JonCoens

Differential Revision: D4722216

fbshipit-source-id: 82c56d3
2017-03-16 13:49:16 -07:00
Julien Odent
33fa98734a Fix 'no dia 20'
Summary:
* 'no dia 20' (on the 20)
* Unifying two rules into one, with a day grain

See https://github.com/wit-ai/wit/issues/388

Reviewed By: blandinw

Differential Revision: D4715780

fbshipit-source-id: e990954
2017-03-15 13:49:17 -07:00
Julien Odent
1c98c0308c Fix Some in README
Summary: #accept2ship

Reviewed By: niteria

Differential Revision: D4715804

fbshipit-source-id: d53ca9a
2017-03-15 13:19:36 -07:00
Jonathan Coens
41800a3171 Move onto dependent-sum instead of custom local data Some
Summary:
No need to reinvent the wheel when `dependent-sum` has what we need. I re-export `Some(..)` from `Duckling.Dimensions.Types` to cut down on import bloat.
Instead of a `Read` instance I created a `fromName` function.

Reviewed By: zilberstein

Differential Revision: D4710014

fbshipit-source-id: 1d4e86d
2017-03-15 10:34:17 -07:00
Bartosz Nitka
d23ae54ab9 .gitignore .stack-work
Summary:
stack creates this directory; we should
prevent it from being committed.

Reviewed By: JonCoens

Differential Revision: D4713790

fbshipit-source-id: 34b723d
2017-03-15 10:04:30 -07:00
Bartosz Nitka
1a251d8e42 Use HashMap.lookupDefault
Summary: This is a small stylistic improvement.

Reviewed By: patapizza

Differential Revision: D4713463

fbshipit-source-id: 47720d3
2017-03-15 08:19:11 -07:00
Julien Odent
1edf62f347 Adding logo
Summary: happy_duck

Reviewed By: niteria

Differential Revision: D4713395

fbshipit-source-id: dd1c141
2017-03-15 08:04:31 -07:00
Julien Odent
ea80ab07d3 Update maintainer email
Summary: .

Reviewed By: niteria

Differential Revision: D4713313

fbshipit-source-id: 4fbeabb
2017-03-15 07:49:12 -07:00
Julien Odent
cc016bb178 Refactoring + return domain
Summary:
* Simplified `Url` to only keep track of what we need (we can change back later)
* Normalize domain: remove subdomains like `www`, `www2` and lower case
* Return the full domain in the JSON value field
* Updated offensive url example

Reviewed By: JonCoens

Differential Revision: D4705403

fbshipit-source-id: e5d11ee
2017-03-14 13:49:20 -07:00
Jonathan Coens
1b91b70c58 codemod DNumber to Numeral
Summary: `DNumber` is a terrible name and was only there for legacy reasons. `Numeral` makes more sense for this dimension, so let's use that instead.

Reviewed By: patapizza

Differential Revision: D4707167

fbshipit-source-id: cd78aa3
2017-03-14 13:34:11 -07:00
Bartosz Nitka
ec39c21593 Make the regexp less dangerous
Summary:
The current regexp matches sequences of numbers of unbounded
length with lots of backtracking. Since phone numbers
are shorter than X=20 characters we can put a bound
on every currently unbounded match.

Additionally we can use groups that don't capture, to
avoid marshalling data that we won't need.

Reviewed By: JonCoens

Differential Revision: D4706862

fbshipit-source-id: 39ca9bb
2017-03-14 12:19:12 -07:00
Julien Odent
2f4ecfba08 Update README
Summary: Doc to extend existing dimension/language support

Reviewed By: JonCoens

Differential Revision: D4706035

fbshipit-source-id: a8ecca4
2017-03-14 11:34:11 -07:00