mirror of https://github.com/facebook/duckling.git synced 2024-09-11 21:27:13 +03:00

Language, engine, and tooling for expressing, testing, and evaluating composable language rules on input strings.


Cody Ohlsen 474ae1b851 Duckling probabilistic layer bug fix

Summary: while computing a score used to rank in Duckling, it currently sums up the log likelihoods learned during training. While ranking, the goal is to find the (same span) parse candidate which is _more_ likely to lead to a correct parse. However, the old logic was summing up the "more confident of the two classes" log likelihood. From what I understand, this is the part which feels wrong.

I created an example of two rules:

#1. a rule where the classifier learns that the rule is very confidently NOT the correct parse.
- okdata (positive class) is very low confidence (high negative number prior)
- kodata (negative class) is very high confidence (low negative number prior)

#2. a rule where the classifier is confident that it is the correct parse, but not Very Confident.
- okdata (positive class) is high confidence (nonzero, but low negative number prior)
- kodata (negative class) is very low confidence (high negative number prior)

These two rules match the same regex, thus the same span. While Duckling parses it, it turns out that rule #1 ranks higher than rule #2. The reason is that #1 is MORE confident that it is the INCORRECT (does not contribute to) parse than rule #2. Does this make sense?

To solve this problem, I changed the ranking score estimation to use only the positive class scores (okdata). In the example above, it fixes it so rule #2 would end up ranking higher because the positive class confidence is higher than #1's positive class confidence.

Would really love some deeper input from Duckling experts. I re-learned haskell and learned haxl to craft a small example here, and I am very new to Duckling (just started reading the ranking code on Friday). I know Duckling is battle-tested but I also don't believe that means a bug can't exist.

And further, this specific bug may not happen a whole lot for 2 reasons:
- there are not a lot of rules which end up with higher negative confidence than positive (requires enough negative corpus examples over positive ones)
- ranking uses span width first, and only when the spans are equivalent does the score based ranking come into play. So it requires that 2 rules match the same span before any actual score calculation even matters.

Reviewed By: patapizza
Differential Revision: D22009276
fbshipit-source-id: 13491689d39d810da526fa4bb8b6e526d4cafd35
2020-06-12 16:06:11 -07:00
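
The ranking change described in the commit message above can be illustrated with a small, self-contained sketch. It is not Duckling's actual ranking code: the types `RuleStats` and `Candidate`, the functions `oldScore`, `newScore` and `best`, and all numbers are hypothetical, and the summed log likelihoods of a full parse are collapsed here into a single pair of scores per rule.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Hypothetical, simplified record of what the classifier learned for one rule:
-- log likelihoods (always <= 0) for the positive (ok) and negative (ko) class.
data RuleStats = RuleStats
  { ruleName :: String
  , okScore  :: Double
  , koScore  :: Double
  } deriving (Show, Eq)

-- A parse candidate: the span it covers plus the stats of the rule behind it.
data Candidate = Candidate
  { spanStart :: Int
  , spanEnd   :: Int
  , stats     :: RuleStats
  } deriving (Show, Eq)

width :: Candidate -> Int
width c = spanEnd c - spanStart c

-- Old behaviour as described above: take the more confident of the two classes.
oldScore :: RuleStats -> Double
oldScore r = max (okScore r) (koScore r)

-- New behaviour: only the positive class counts.
newScore :: RuleStats -> Double
newScore = okScore

-- Wider spans win; the score only breaks ties between equal-width spans.
-- Partial: the candidate list must be non-empty.
best :: (RuleStats -> Double) -> [Candidate] -> Candidate
best score = maximumBy (comparing key)
  where key c = (width c, score (stats c))

-- Illustrative numbers only, not taken from a trained classifier.
rule1, rule2 :: Candidate
rule1 = Candidate 0 5 (RuleStats "rule #1" (-9.0) (-0.1))
rule2 = Candidate 0 5 (RuleStats "rule #2" (-1.0) (-8.0))
```

With these made-up numbers, `best oldScore [rule1, rule2]` selects rule #1 because its negative class is the most confident value in play, while `best newScore [rule1, rule2]` selects rule #2, which matches the behaviour the commit aims for.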
dist-newstyle/cache	Add Numeral dimension for new language TH (#399)	2019-11-27 15:48:38 -08:00
Duckling	Duckling probabilistic layer bug fix	2020-06-12 16:06:11 -07:00
exe	AF Setup + Numeral (#422)	2020-01-10 15:02:50 -08:00
tests	ES/Duration: Add Copyright header to tests file	2020-06-11 10:17:44 -07:00
.dockerignore	Improve Docker build (#341)	2020-04-17 08:22:43 -07:00
.gitignore	Hindi Language Numeral Dimension(minimalistic model). Tests passed.	2017-12-19 13:15:30 -08:00
.travis.yml	Update dependencies to latest version to make duckling compile with ghc 8.6.3 (#334)	2019-02-22 11:46:50 -08:00
CODE_OF_CONDUCT.md	add FB code of conduct	2018-01-02 08:15:29 -08:00
CONTRIBUTING.md	Documentation: Coding style	2018-05-14 14:30:36 -07:00
Dockerfile	Improve Docker build (#341)	2020-04-17 08:22:43 -07:00
duckling.cabal	Added support for parsing new ES duration phrases like half hour, quarter of hour. (#489)	2020-06-09 15:16:38 -07:00
LICENSE	Initial commit	2017-03-08 10:33:56 -08:00
logo.png	Adding logo	2017-03-15 08:04:31 -07:00
README.md	Update: add new dimension to a language	2019-07-09 15:21:34 -07:00
stack.yaml	Update to lts-9.10	2017-10-26 18:34:27 -07:00

README.md

Duckling

Duckling is a Haskell library that parses text into structured data.

"the first Tuesday of October"
=> {"value":"2017-10-03T00:00:00.000-07:00","grain":"day"}

Requirements

A Haskell environment is required. We recommend using stack.

On macOS you'll need to install PCRE development headers. The easiest way to do that is with Homebrew:

brew install pcre

If that doesn't help, try running brew doctor and fix the issues it finds.

Quickstart

To compile and run the binary:

$ stack build
$ stack exec duckling-example-exe

The first time you run it, it will download all required packages.

This runs a basic HTTP server. Example request:

$ curl -XPOST http://0.0.0.0:8000/parse --data 'locale=en_GB&text=tomorrow at eight'

See exe/ExampleMain.hs for an example of how to integrate Duckling into your project. If your backend doesn't run Haskell, or if you don't want to spin up your own Duckling server, you can use wit.ai's built-in entities directly.

Supported dimensions

Duckling supports many languages, but most don't support all dimensions yet (we need your help!). Please look into this directory for language-specific support.

Dimension	Example input	Example value output
`AmountOfMoney`	"42€"	`{"value":42,"type":"value","unit":"EUR"}`
`CreditCardNumber`	"4111-1111-1111-1111"	`{"value":"4111111111111111","issuer":"visa"}`
`Distance`	"6 miles"	`{"value":6,"type":"value","unit":"mile"}`
`Duration`	"3 mins"	`{"value":3,"minute":3,"unit":"minute","normalized":{"value":180,"unit":"second"}}`
`Email`	"duckling-team@fb.com"	`{"value":"duckling-team@fb.com"}`
`Numeral`	"eighty eight"	`{"value":88,"type":"value"}`
`Ordinal`	"33rd"	`{"value":33,"type":"value"}`
`PhoneNumber`	"+1 (650) 123-4567"	`{"value":"(+1) 6501234567"}`
`Quantity`	"3 cups of sugar"	`{"value":3,"type":"value","product":"sugar","unit":"cup"}`
`Temperature`	"80F"	`{"value":80,"type":"value","unit":"fahrenheit"}`
`Time`	"today at 9am"	`{"values":[{"value":"2016-12-14T09:00:00.000-08:00","grain":"hour","type":"value"}],"value":"2016-12-14T09:00:00.000-08:00","grain":"hour","type":"value"}`
`Url`	"https://api.wit.ai/message?q=hi"	`{"value":"https://api.wit.ai/message?q=hi","domain":"api.wit.ai"}`
`Volume`	"4 gallons"	`{"value":4,"type":"value","unit":"gallon"}`

Custom dimensions are also supported.

Extending Duckling

To regenerate the classifiers and run the test suite:

$ stack build :duckling-regen-exe && stack exec duckling-regen-exe && stack test

It's important to regenerate the classifiers after updating the code and before running the test suite.

To extend Duckling's support for a dimension in a given language, typically 4 files need to be updated:

Duckling/<Dimension>/<Lang>/Rules.hs
Duckling/<Dimension>/<Lang>/Corpus.hs
Duckling/Dimensions/<Lang>.hs (if not already present in Duckling/Dimensions/Common.hs)
Duckling/Rules/<Lang>.hs

To add a new language:

Make sure that the language code used follows the ISO-639-1 standard.
The first dimension to implement is Numeral.
Follow this example.

To add a new locale:

There should be a need for diverging rules between the locale and the language.
Make sure that the locale code is a valid ISO 3166-1 alpha-2 country code.
Follow this example.

Rules have a name, a pattern and a production. Patterns are used to perform character-level matching (regexes on input) and concept-level matching (predicates on tokens). Productions are arbitrary functions that take a list of tokens and return a new token.
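
The shape of a rule can be sketched roughly as follows. This is a deliberately simplified illustration, not Duckling's real definitions: the `Token` constructors, the `Regex` item that stores the regex source as plain text instead of a compiled regex, and the example `ruleMinutes` are all assumptions made for this sketch. The production returns a `Maybe` so that it can decline a token list it does not recognize.

```haskell
-- Simplified, hypothetical token type; Duckling's real tokens carry more data.
data Token
  = RegexMatch String    -- text captured by a character-level match
  | Numeral Int          -- an integer that was already parsed
  | Duration Int String  -- amount and unit
  deriving (Show, Eq)

-- One element of a pattern.
data PatternItem
  = Regex String               -- regex source, kept as text here (not compiled)
  | Predicate (Token -> Bool)  -- concept-level test on an existing token

-- A production turns the matched tokens into a new token, or declines.
type Production = [Token] -> Maybe Token

data Rule = Rule
  { name    :: String
  , pattern :: [PatternItem]
  , prod    :: Production
  }

isNumeral :: Token -> Bool
isNumeral (Numeral _) = True
isNumeral _           = False

-- Hypothetical example rule: an integer followed by the word "minute(s)".
ruleMinutes :: Rule
ruleMinutes = Rule
  { name = "<integer> minutes"
  , pattern = [Predicate isNumeral, Regex "minutes?"]
  , prod = \tokens -> case tokens of
      [Numeral n, RegexMatch _] -> Just (Duration n "minute")
      _                         -> Nothing
  }
```

In this sketch, applying `prod ruleMinutes` to `[Numeral 2, RegexMatch "minutes"]` yields `Just (Duration 2 "minute")`. The matching engine that walks the pattern over the input is left out on purpose.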

The corpus (resp. negative corpus) is a list of examples that should (resp. shouldn't) parse. The reference time for the corpus is Tuesday Feb 12, 2013 at 4:30am.
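
The role of the two corpora can be pictured with a toy check. The sketch below is an assumption-laden illustration and does not use Duckling's corpus types: `Parser`, `parseInt`, `corpus`, `negativeCorpus` and `failures` are hypothetical names, the parser only reads plain integers, and the reference time is left out entirely.

```haskell
import Text.Read (readMaybe)

-- Hypothetical stand-in for a parser: Nothing means "does not parse".
type Parser a = String -> Maybe a

-- Toy parser that only accepts plain integers.
parseInt :: Parser Int
parseInt = readMaybe

-- Examples that should parse, paired with the expected value.
corpus :: [(String, Int)]
corpus = [("42", 42), ("7", 7)]

-- Examples that should not parse at all.
negativeCorpus :: [String]
negativeCorpus = ["forty-two?", ""]

-- Inputs that violate either corpus; an empty result means everything passed.
failures :: Eq a => Parser a -> [(String, a)] -> [String] -> [String]
failures parse positives negatives =
  [ input | (input, expected) <- positives, parse input /= Just expected ] ++
  [ input | input <- negatives, parse input /= Nothing ]
```

Here `failures parseInt corpus negativeCorpus` returns an empty list. Moving an input such as `"42"` into the negative corpus would make it show up in the result.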

Duckling.Debug provides a few debugging tools:

$ stack repl --no-load
> :l Duckling.Debug
> debug (makeLocale EN $ Just US) "in two minutes" [This Time]
in|within|after <duration> (in two minutes)
-- regex (in)
-- <integer> <unit-of-duration> (two minutes)
-- -- integer (0..19) (two)
-- -- -- regex (two)
-- -- minute (grain) (minutes)
-- -- -- regex (minutes)
[Entity {dim = "time", body = "in two minutes", value = RVal Time (TimeValue (SimpleValue (InstantValue {vValue = 2013-02-12 04:32:00 -0200, vGrain = Second})) [SimpleValue (InstantValue {vValue = 2013-02-12 04:32:00 -0200, vGrain = Second})] Nothing), start = 0, end = 14}]

License

Duckling is BSD-licensed.