* Bloodhound
#+CAPTION: Bloodhound (dog)
[[./bloodhound.jpg]]
* Elasticsearch client and query DSL for Haskell
** Why?
Because you're tired of obnoxious errors like [[http://i.imgur.com/FKtZYIP.png][this]] and want types to guide your use of the API.
** Stability
Bloodhound is alpha at the moment. The library works fine, but I don't want to mislead anyone into thinking the API is final or stable. I wouldn't call the library "complete" or representative of everything you can do in Elasticsearch, but compared to clients in other languages, the story here so far is good.
* Examples
** Index Operations
*** Create Index
#+BEGIN_SRC haskell
-- Formatted for use in ghci, so there are "let"s in front of the decls.
-- if you see :{ and :}, they're so you can copy-paste
-- the multi-line examples into your ghci REPL.
:set -XDeriveGeneric
-- OverloadedStrings lets the string literals below be used as Text.
:set -XOverloadedStrings
import Database.Bloodhound
import Data.Aeson
import Data.Either (Either(..))
import Data.Maybe (fromJust)
import Data.Time.Calendar (Day(..))
import Data.Time.Clock (secondsToDiffTime, UTCTime(..))
import Data.Text (Text)
import GHC.Generics (Generic)
import Network.HTTP.Conduit
import qualified Network.HTTP.Types.Status as NHTS
-- no trailing slashes in servers, library handles building the path.
let testServer = (Server "http://localhost:9200")
let testIndex = IndexName "twitter"
let testMapping = MappingName "tweet"
-- defaultIndexSettings is exported by Database.Bloodhound as well
let defaultIndexSettings = IndexSettings (ShardCount 3) (ReplicaCount 2)
-- createIndex returns IO Reply
-- response :: Reply, Reply is a synonym for Network.HTTP.Conduit.Response
response <- createIndex testServer defaultIndexSettings testIndex
#+END_SRC
*** Delete Index
#+BEGIN_SRC haskell
-- response :: Reply
response <- deleteIndex testServer testIndex
-- print response if it was a success
Response {responseStatus = Status {statusCode = 200, statusMessage = "OK"}
, responseVersion = HTTP/1.1
, responseHeaders = [("Content-Type", "application/json; charset=UTF-8")
, ("Content-Length", "21")]
, responseBody = "{\"acknowledged\":true}"
, responseCookieJar = CJ {expose = []}
, responseClose' = ResponseClose}
-- if the index to be deleted didn't exist anyway
Response {responseStatus = Status {statusCode = 404, statusMessage = "Not Found"}
, responseVersion = HTTP/1.1
, responseHeaders = [("Content-Type", "application/json; charset=UTF-8")
, ("Content-Length","65")]
, responseBody = "{\"error\":\"IndexMissingException[[twitter] missing]\",\"status\":404}"
, responseCookieJar = CJ {expose = []}
, responseClose' = ResponseClose}
#+END_SRC
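The two responses above differ only in their status, so it can be handy to classify a reply before looking at the body. The following is a minimal sketch, assuming the http-types package is available. DeleteOutcome and classifyDelete are hypothetical names made up for this example and are not part of Bloodhound.

#+BEGIN_SRC haskell
import qualified Network.HTTP.Types.Status as NHTS

-- Hypothetical summary type for this sketch; not part of Bloodhound.
data DeleteOutcome
  = Deleted          -- any 2xx status
  | AlreadyMissing   -- 404, the index was not there to begin with
  | Failed Int       -- anything else, carrying the status code
  deriving (Eq, Show)

-- Classify the status of a reply to a delete request.
classifyDelete :: NHTS.Status -> DeleteOutcome
classifyDelete status
  | NHTS.statusIsSuccessful status = Deleted
  | NHTS.statusCode status == 404  = AlreadyMissing
  | otherwise                      = Failed (NHTS.statusCode status)
#+END_SRC

With the response from deleteIndex in scope, this would be used as classifyDelete (responseStatus response).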
*** Refresh Index
**** Note, you *have* to do this if you expect to read what you just wrote
#+BEGIN_SRC haskell
resp <- refreshIndex testServer testIndex
-- print resp on success
Response {responseStatus = Status {statusCode = 200, statusMessage = "OK"}
, responseVersion = HTTP/1.1
, responseHeaders = [("Content-Type", "application/json; charset=UTF-8")
, ("Content-Length","50")]
, responseBody = "{\"_shards\":{\"total\":10,\"successful\":5,\"failed\":0}}"
, responseCookieJar = CJ {expose = []}
, responseClose' = ResponseClose}
#+END_SRC
** Mapping Operations
*** Create Mapping
#+BEGIN_SRC haskell
-- don't forget imports and the like at the top.
data TweetMapping = TweetMapping deriving (Eq, Show)
-- I know writing the JSON manually sucks.
-- I don't have a proper data type for Mappings yet.
-- Let me know if this is something you need.
:{
instance ToJSON TweetMapping where
  toJSON TweetMapping =
    object ["tweet" .=
      object ["properties" .=
        object ["location" .=
          object ["type" .= ("geo_point" :: Text)]]]]
:}
resp <- createMapping testServer testIndex testMapping TweetMapping
#+END_SRC
*** Delete Mapping
#+BEGIN_SRC haskell
resp <- deleteMapping testServer testIndex testMapping
#+END_SRC
** Document Operations
*** Indexing Documents
#+BEGIN_SRC haskell
-- don't forget the imports and derive generic setting for ghci
-- at the beginning of the examples.
:{
data Location = Location { lat :: Double
                         , lon :: Double } deriving (Eq, Generic, Show)
data Tweet = Tweet { user     :: Text
                   , postDate :: UTCTime
                   , message  :: Text
                   , age      :: Int
                   , location :: Location } deriving (Eq, Generic, Show)
exampleTweet = Tweet { user     = "bitemyapp"
                     , postDate = UTCTime
                                  (ModifiedJulianDay 55000)
                                  (secondsToDiffTime 10)
                     , message  = "Use haskell!"
                     , age      = 10000
                     , location = Location 40.12 (-71.34) }
-- automagic (generic) derivation of instances because we're lazy.
instance ToJSON Tweet
instance FromJSON Tweet
instance ToJSON Location
instance FromJSON Location
:}
-- Should be able to toJSON and encode the data structures like this:
-- λ> toJSON $ Location 10.0 10.0
-- Object fromList [("lat",Number 10.0),("lon",Number 10.0)]
-- λ> encode $ Location 10.0 10.0
-- "{\"lat\":10,\"lon\":10}"
resp <- indexDocument testServer testIndex testMapping exampleTweet (DocId "1")
-- print resp on success
Response {responseStatus =
Status {statusCode = 200, statusMessage = "OK"}
, responseVersion = HTTP/1.1, responseHeaders =
[("Content-Type","application/json; charset=UTF-8"),
("Content-Length","75")]
, responseBody = "{\"_index\":\"twitter\",\"_type\":\"tweet\",\"_id\":\"1\",\"_version\":2,\"created\":false}"
, responseCookieJar = CJ {expose = []}, responseClose' = ResponseClose}
#+END_SRC
*** Deleting Documents
#+BEGIN_SRC haskell
resp <- deleteDocument testServer testIndex testMapping (DocId "1")
#+END_SRC
*** Getting Documents
#+BEGIN_SRC haskell
-- n.b., you'll need the earlier imports. responseBody is from http-conduit
resp <- getDocument testServer testIndex testMapping (DocId "1")
-- responseBody :: Response body -> body
let body = responseBody resp
-- you have two options, you use decode and just get Maybe (EsResult Tweet)
-- or you can use eitherDecode and get Either String (EsResult Tweet)
let maybeResult = decode body :: Maybe (EsResult Tweet)
-- the explicit typing is so Aeson knows how to parse the JSON.
-- use either if you want to know why something failed to parse.
-- (string errors, sadly)
let eitherResult = eitherDecode body :: Either String (EsResult Tweet)
-- print eitherResult should look like:
Right (EsResult {_index = "twitter"
, _type = "tweet"
, _id = "1"
, _version = 2
, found = Just True
, _source = Tweet {user = "bitemyapp"
, postDate = 2009-06-18 00:00:10 UTC
, message = "Use haskell!"
, age = 10000
, location = Location {lat = 40.12, lon = -71.34}}})
-- _source in EsResult is parametric, we dispatch the type by passing in what we expect (Tweet) as a parameter to EsResult.
-- use the _source record accessor to get at your document
λ> fmap _source eitherResult
Right (Tweet {user = "bitemyapp"
, postDate = 2009-06-18 00:00:10 UTC
, message = "Use haskell!"
, age = 10000
, location = Location {lat = 40.12, lon = -71.34}})
#+END_SRC
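The difference between decode and eitherDecode is easiest to see on a body that fails to parse. The following is a minimal sketch, assuming aeson and bytestring are installed. It uses a standalone Location type and hand-written JSON bodies rather than a live reply, and goodBody, badBody, asMaybe and asEither are names made up for this example.

#+BEGIN_SRC haskell
{-# LANGUAGE DeriveGeneric     #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson           (FromJSON, decode, eitherDecode)
import Data.ByteString.Lazy (ByteString)
import GHC.Generics         (Generic)

data Location = Location { lat :: Double
                         , lon :: Double } deriving (Eq, Generic, Show)

-- Generic derivation, as in the examples above.
instance FromJSON Location

-- Hand-written bodies standing in for responseBody resp.
goodBody, badBody :: ByteString
goodBody = "{\"lat\":40.12,\"lon\":-71.34}"
badBody  = "{\"lat\":40.12}"   -- the "lon" field is missing

-- decode collapses every failure into Nothing.
asMaybe :: ByteString -> Maybe Location
asMaybe = decode

-- eitherDecode keeps a String describing why parsing failed.
asEither :: ByteString -> Either String Location
asEither = eitherDecode
#+END_SRC

Here asMaybe badBody is Nothing, while asEither badBody is a Left carrying aeson's error message, which is usually what you want while debugging a mismatch between your type and the stored document.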
** Search
*** Querying
**** Term Query
#+BEGIN_SRC haskell
-- exported by the Client module, just defaults some stuff.
-- mkSearch :: Maybe Query -> Maybe Filter -> Search
-- mkSearch query filter = Search query filter Nothing False 0 10
let query = TermQuery (Term "user" "bitemyapp") Nothing
-- AND'ing identity filter with itself and then tacking it onto a query
-- search should be a null-operation. I include it for the sake of example.
-- <||> (or/plus) should make it into a search that returns everything.
let filter = IdentityFilter <&&> IdentityFilter
-- constructing the search object the searchByIndex function dispatches on.
let search = mkSearch (Just query) (Just filter)
-- you can also searchByType and specify the mapping name.
reply <- searchByIndex testServer testIndex search
let result = eitherDecode (responseBody reply) :: Either String (SearchResult Tweet)
λ> fmap (hits . searchHits) result
Right [Hit {hitIndex = IndexName "twitter"
, hitType = MappingName "tweet"
, hitDocId = DocId "1"
, hitScore = 0.30685282
, hitSource = Tweet {user = "bitemyapp"
, postDate = 2009-06-18 00:00:10 UTC
, message = "Use haskell!"
, age = 10000
, location = Location {lat = 40.12, lon = -71.34}}}]
#+END_SRC
*** Sorting
#+BEGIN_SRC haskell
#+END_SRC
*** Filtering
**** And, Not, and Or filters
Filters form a monoid and seminearring.
#+BEGIN_SRC haskell
instance Monoid Filter where
  mempty = IdentityFilter
  mappend a b = AndFilter [a, b] defaultCache
instance Seminearring Filter where
  a <||> b = OrFilter [a, b] defaultCache
-- AndFilter and OrFilter take [Filter] as an argument.
-- This will return everything, because IdentityFilter matches every document
OrFilter [IdentityFilter, someOtherFilter] False
-- This will return exactly what someOtherFilter returns
AndFilter [IdentityFilter, someOtherFilter] False
-- Thanks to the seminearring and monoid, the above can be expressed as:
-- "and"
IdentityFilter <&&> someOtherFilter
-- "or"
IdentityFilter <||> someOtherFilter
-- Also there is a NotFilter, it only accepts a single filter, not a list.
NotFilter someOtherFilter False
#+END_SRC
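To see why IdentityFilter behaves as the identity for "and", it can help to evaluate filters against in-memory data. The following is a toy model using only base. It is not Bloodhound's definition of Filter: the constructors drop the cache argument, TermFilter is a hypothetical leaf, and Doc and matches are names made up for this sketch.

#+BEGIN_SRC haskell
-- Toy model for illustration; these are not Bloodhound's definitions.
data Filter
  = IdentityFilter
  | TermFilter String String   -- hypothetical leaf: field name, expected value
  | AndFilter [Filter]
  | OrFilter [Filter]
  | NotFilter Filter
  deriving (Eq, Show)

instance Semigroup Filter where
  a <> b = AndFilter [a, b]

instance Monoid Filter where
  mempty = IdentityFilter

infixr 3 <&&>
(<&&>) :: Filter -> Filter -> Filter
(<&&>) = (<>)

infixr 2 <||>
(<||>) :: Filter -> Filter -> Filter
a <||> b = OrFilter [a, b]

-- A document is just a list of (field, value) pairs in this sketch.
type Doc = [(String, String)]

-- Decide whether a document passes a filter.
matches :: Filter -> Doc -> Bool
matches IdentityFilter   _   = True
matches (TermFilter f v) doc = lookup f doc == Just v
matches (AndFilter fs)   doc = all (`matches` doc) fs
matches (OrFilter fs)    doc = any (`matches` doc) fs
matches (NotFilter f)    doc = not (matches f doc)
#+END_SRC

In this model mempty <> f builds AndFilter [IdentityFilter, f], which is not structurally equal to f but matches exactly the same documents, so the monoid laws hold with respect to matching rather than with respect to Eq.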
**** Identity Filter
#+BEGIN_SRC haskell
-- And'ing two IdentityFilters
let queryFilter = IdentityFilter <&&> IdentityFilter
let search = mkSearch Nothing (Just queryFilter)
reply <- searchByType testServer testIndex testMapping search
#+END_SRC
**** Boolean Filter
Similar to boolean queries.
#+BEGIN_SRC haskell
-- Will return only items whose "user" field contains the term "bitemyapp"
let queryFilter = BoolFilter (MustMatch (Term "user" "bitemyapp") False)
-- Will return only items whose "user" field does not contain the term "bitemyapp"
let queryFilter = BoolFilter (MustNotMatch (Term "user" "bitemyapp") False)
-- The clause (query) should appear in the matching document.
-- In a boolean query with no must clauses, one or more should
-- clauses must match a document. The minimum number of should
-- clauses to match can be set using the minimum_should_match parameter.
let queryFilter = BoolFilter (ShouldMatch [(Term "user" "bitemyapp")] False)
#+END_SRC
**** Exists Filter
#+BEGIN_SRC haskell
-- Will filter for documents that have the field "user"
let existsFilter = ExistsFilter (FieldName "user")
#+END_SRC
**** Geo BoundingBox Filter
#+BEGIN_SRC haskell
-- topLeft and bottomRight
let box = GeoBoundingBox (LatLon 40.73 (-74.1)) (LatLon 40.10 (-71.12))
let constraint = GeoBoundingBoxConstraint (FieldName "tweet.location") box False
-- second argument is GeoFilterType, memory or indexed.
let geoFilter = GeoBoundingBoxFilter constraint GeoFilterMemory
#+END_SRC
**** Geo Distance Filter
#+BEGIN_SRC haskell
let geoPoint = GeoPoint (FieldName "tweet.location") (LatLon 40.12 (-71.34))
-- coefficient and units
let distance = Distance 10.0 Miles
-- GeoFilterType or NoOptimizeBbox
let optimizeBbox = OptimizeGeoFilterType GeoFilterMemory
-- SloppyArc is the usual/default optimization in Elasticsearch today
-- but pre-1.0 versions will need to pick Arc or Plane.
let geoFilter = GeoDistanceFilter geoPoint distance SloppyArc optimizeBbox False
#+END_SRC
**** Geo Distance Range Filter
Think of a donut and you won't be far off.
#+BEGIN_SRC haskell
let geoPoint = GeoPoint (FieldName "tweet.location") (LatLon 40.12 (-71.34))
let distanceRange = DistanceRange (Distance 0.0 Miles) (Distance 10.0 Miles)
let geoFilter = GeoDistanceRangeFilter geoPoint distanceRange
#+END_SRC
**** Geo Polygon Filter
#+BEGIN_SRC haskell
-- A rectangle; vertices go around the perimeter so the edges don't cross.
let points = [LatLon 40.0 (-70.00),
              LatLon 40.0 (-72.00),
              LatLon 41.0 (-72.00),
              LatLon 41.0 (-70.00)]
let geoFilter = GeoPolygonFilter (FieldName "tweet.location") points
#+END_SRC
**** Document IDs filter
#+BEGIN_SRC haskell
-- takes a mapping name and a list of DocIds
IdsFilter (MappingName "tweet") [DocId "1"]
#+END_SRC
**** Range Filter
***** Full Range
#+BEGIN_SRC haskell
-- RangeFilter :: FieldName
-- -> Either HalfRange Range
-- -> RangeExecution
-- -> Cache -> Filter
let filter = RangeFilter (FieldName "age")
             (Right (RangeLtGt (LessThan 100000.0) (GreaterThan 1000.0)))
             RangeExecutionIndex False
#+END_SRC
***** Half Range
#+BEGIN_SRC haskell
let filter = RangeFilter (FieldName "age")
             (Left (HalfRangeLt (LessThan 100000.0)))
             RangeExecutionIndex False
#+END_SRC
**** Regexp Filter
#+BEGIN_SRC haskell
-- RegexpFilter
-- :: FieldName
-- -> Regexp
-- -> RegexpFlags
-- -> CacheName
-- -> Cache
-- -> CacheKey
-- -> Filter
let filter = RegexpFilter (FieldName "user") (Regexp "bite.*app")
             RegexpAll (CacheName "test") False (CacheKey "key")
-- RegexpFlags is one of RegexpAll, Complement, Interval, Intersection,
-- AnyString, or a combination of two of those options.
#+END_SRC
* Possible future functionality
** Node discovery and failover
Might require TCP support.
** Support for TCP access to Elasticsearch
Pretend to be a transport client?
** Bulk cluster-join merge
Might require making a lucene index on disk with the appropriate format.
** GeoShapeFilter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-geo-shape-filter.html
** Geohash cell filter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-geohash-cell-filter.html
** HasChild Filter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-has-child-filter.html
** HasParent Filter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-has-parent-filter.html
** Indices Filter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-indices-filter.html
** Query Filter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-filter.html
** Script based sorting
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-sort.html#_script_based_sorting
** Collapsing redundantly nested and/or structures
The Seminearring instance, if deeply nested, can produce redundant nested structure. Depending on how this affects ES performance, reducing this structure might be valuable.
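One possible shape for such a reduction, sketched on a toy type using only base. This is not implemented in Bloodhound: the Filter type below drops the cache argument, Leaf is a hypothetical stand-in for any non-compound filter, and flatten and its helpers are made-up names.

#+BEGIN_SRC haskell
-- Toy model for illustration; these are not Bloodhound's definitions.
data Filter
  = IdentityFilter
  | Leaf String          -- hypothetical stand-in for any other filter
  | AndFilter [Filter]
  | OrFilter [Filter]
  deriving (Eq, Show)

-- Bottom-up rewrite: simplify the children first, then splice any
-- child of the same kind into its parent.
flatten :: Filter -> Filter
flatten (AndFilter fs) = mkAnd (concatMap andChildren (map flatten fs))
flatten (OrFilter fs)  = mkOr  (concatMap orChildren  (map flatten fs))
flatten f              = f

-- What a child contributes to an enclosing "and".
andChildren :: Filter -> [Filter]
andChildren (AndFilter gs) = gs
andChildren IdentityFilter = []   -- identity of "and", safe to drop
andChildren g              = [g]

-- What a child contributes to an enclosing "or".
-- IdentityFilter is kept here: dropping it would change the meaning.
orChildren :: Filter -> [Filter]
orChildren (OrFilter gs) = gs
orChildren g             = [g]

-- Rebuild a node, avoiding wrappers around zero or one child.
mkAnd :: [Filter] -> Filter
mkAnd []  = IdentityFilter
mkAnd [f] = f
mkAnd fs  = AndFilter fs

mkOr :: [Filter] -> Filter
mkOr [f] = f
mkOr fs  = OrFilter fs
#+END_SRC

For example, flatten (AndFilter [AndFilter [Leaf "a", Leaf "b"], Leaf "c"]) gives AndFilter [Leaf "a", Leaf "b", Leaf "c"], and flatten (AndFilter [IdentityFilter, IdentityFilter]) gives IdentityFilter.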
** Runtime checking for cycles in data structures
check for n > 1 occurrences in DFS:
http://hackage.haskell.org/package/stable-maps-0.0.5/docs/System-Mem-StableName-Dynamic.html
http://hackage.haskell.org/package/stable-maps-0.0.5/docs/System-Mem-StableName-Dynamic-Map.html
* Photo Origin
Photo from HA! Designs: https://www.flickr.com/photos/hadesigns/