bloodhound/README.org

533 lines
15 KiB
Org Mode
Raw Normal View History

2014-04-07 22:24:58 +04:00
* Bloodhound
2014-04-12 14:14:27 +04:00
#+CAPTION: Bloodhound (dog)
[[./bloodhound.jpg]]
2014-04-12 23:09:36 +04:00
* Elasticsearch client and query DSL for Haskell
** Why?
Because you're tired of obnoxious errors like [this](http://i.imgur.com/FKtZYIP.png) and want types to guide your use of the API.
2014-04-12 23:09:36 +04:00
** Stability
Bloodhound is alpha at the moment. The library works fine, but I don't want to mislead anyone into thinking the API is final or stable. I wouldn't call the library "complete" or representative of everything you can do in Elasticsearch but being compared to clients in other languages the story here so far is good.
2014-04-12 23:09:36 +04:00
* Examples
** Index Operations
*** Create Index
#+BEGIN_SRC haskell
-- Formatted for use in ghci, so there are "let"s in front of the decls.
2014-04-14 05:34:01 +04:00
-- if you see :{ and :}, they're so you can copy-paste
-- the multi-line examples into your ghci REPL.
2014-04-14 05:16:44 +04:00
:set -XDeriveGeneric
import Database.Bloodhound.Client
import Data.Aeson
import Data.Either (Either(..))
import Data.Maybe (fromJust)
import Data.Time.Calendar (Day(..))
import Data.Time.Clock (secondsToDiffTime, UTCTime(..))
import Data.Text (Text)
import GHC.Generics (Generic)
import Network.HTTP.Conduit
import qualified Network.HTTP.Types.Status as NHTS
-- no trailing slashes in servers, library handles building the path.
let testServer = (Server "http://localhost:9200")
let testIndex = IndexName "twitter"
let testMapping = MappingName "tweet"
-- defaultIndexSettings is exported by Database.Bloodhound.Client as well.
let defaultIndexSettings = IndexSettings (ShardCount 3) (ReplicaCount 2)
-- createIndex returns IO Reply
-- response :: Reply, Reply is a synonym for Network.HTTP.Conduit.Response
response <- createIndex testServer defaultIndexSettings testIndex
#+END_SRC
*** Delete Index
#+BEGIN_SRC haskell
-- response :: Reply
response <- deleteIndex testServer testIndex
-- print response if it was a success
2014-04-14 05:23:03 +04:00
Response {responseStatus = Status {statusCode = 200, statusMessage = "OK"}
, responseVersion = HTTP/1.1
, responseHeaders = [("Content-Type", "application/json; charset=UTF-8")
, ("Content-Length", "21")]
, responseBody = "{\"acknowledged\":true}"
, responseCookieJar = CJ {expose = []}
, responseClose' = ResponseClose}
-- if the index to be deleted didn't exist anyway
2014-04-14 05:23:03 +04:00
Response {responseStatus = Status {statusCode = 404, statusMessage = "Not Found"}
, responseVersion = HTTP/1.1
, responseHeaders = [("Content-Type", "application/json; charset=UTF-8")
, ("Content-Length","65")]
, responseBody = "{\"error\":\"IndexMissingException[[twitter] missing]\",\"status\":404}"
, responseCookieJar = CJ {expose = []}
, responseClose' = ResponseClose}
#+END_SRC
*** Refresh Index
**** Note, you *have* to do this if you expect to read what you just wrote
#+BEGIN_SRC haskell
resp <- refreshIndex testServer testIndex
-- print resp on success
2014-04-14 05:23:03 +04:00
Response {responseStatus = Status {statusCode = 200, statusMessage = "OK"}
, responseVersion = HTTP/1.1
, responseHeaders = [("Content-Type", "application/json; charset=UTF-8")
, ("Content-Length","50")]
, responseBody = "{\"_shards\":{\"total\":10,\"successful\":5,\"failed\":0}}"
, responseCookieJar = CJ {expose = []}
, responseClose' = ResponseClose}
#+END_SRC
** Mapping Operations
*** Create Mapping
#+BEGIN_SRC haskell
-- don't forget imports and the like at the top.
data TweetMapping = TweetMapping deriving (Eq, Show)
2014-04-14 05:24:23 +04:00
-- I know writing the JSON manually sucks.
-- I don't have a proper data type for Mappings yet.
-- Let me know if this is something you need.
:{
instance ToJSON TweetMapping where
toJSON TweetMapping =
object ["tweet" .=
object ["properties" .=
object ["location" .=
object ["type" .= ("geo_point" :: Text)]]]]
:}
resp <- createMapping testServer testIndex testMapping TweetMapping
#+END_SRC
*** Delete Mapping
#+BEGIN_SRC haskell
resp <- deleteMapping testServer testIndex testMapping
#+END_SRC
** Document Operations
*** Indexing Documents
#+BEGIN_SRC haskell
2014-04-14 05:19:18 +04:00
-- don't forget the imports and derive generic setting for ghci
-- at the beginning of the examples.
:{
data Location = Location { lat :: Double
, lon :: Double } deriving (Eq, Generic, Show)
data Tweet = Tweet { user :: Text
, postDate :: UTCTime
, message :: Text
, age :: Int
, location :: Location } deriving (Eq, Generic, Show)
exampleTweet = Tweet { user = "bitemyapp"
, postDate = UTCTime
(ModifiedJulianDay 55000)
(secondsToDiffTime 10)
, message = "Use haskell!"
, age = 10000
, location = Location 40.12 (-71.34) }
-- automagic (generic) derivation of instances because we're lazy.
instance ToJSON Tweet
instance FromJSON Tweet
instance ToJSON Location
instance FromJSON Location
:}
-- Should be able to toJSON and encode the data structures like this:
-- λ> toJSON $ Location 10.0 10.0
-- Object fromList [("lat",Number 10.0),("lon",Number 10.0)]
-- λ> encode $ Location 10.0 10.0
-- "{\"lat\":10,\"lon\":10}"
resp <- indexDocument testServer testIndex testMapping exampleTweet (DocId "1")
-- print resp on success
2014-04-14 05:19:18 +04:00
Response {responseStatus =
Status {statusCode = 200, statusMessage = "OK"}
, responseVersion = HTTP/1.1, responseHeaders =
[("Content-Type","application/json; charset=UTF-8"),
("Content-Length","75")]
, responseBody = "{\"_index\":\"twitter\",\"_type\":\"tweet\",\"_id\":\"1\",\"_version\":2,\"created\":false}"
, responseCookieJar = CJ {expose = []}, responseClose' = ResponseClose}
#+END_SRC
*** Deleting Documents
#+BEGIN_SRC haskell
resp <- deleteDocument testServer testIndex testMapping (DocId "1")
#+END_SRC
*** Getting Documents
#+BEGIN_SRC haskell
-- n.b., you'll need the earlier imports. responseBody is from http-conduit
resp <- getDocument testServer testIndex testMapping (DocId "1")
-- responseBody :: Response body -> body
let body = responseBody resp
-- you have two options, you use decode and just get Maybe (EsResult Tweet)
-- or you can use eitherDecode and get Either String (EsResult Tweet)
let maybeResult = decode body :: Maybe (EsResult Tweet)
-- the explicit typing is so Aeson knows how to parse the JSON.
-- use either if you want to know why something failed to parse.
-- (string errors, sadly)
let eitherResult = decode body :: Either String (EsResult Tweet)
-- print eitherResult should look like:
2014-04-14 05:20:47 +04:00
Right (EsResult {_index = "twitter"
, _type = "tweet"
, _id = "1"
, _version = 2
, found = Just True
, _source = Tweet {user = "bitemyapp"
, postDate = 2009-06-18 00:00:10 UTC
, message = "Use haskell!"
, age = 10000
, location = Location {lat = 40.12, lon = -71.34}}})
-- _source in EsResult is parametric, we dispatch the type by passing in what we expect (Tweet) as a parameter to EsResult.
2014-04-14 05:16:44 +04:00
-- use the _source record accessor to get at your document
λ> fmap _source result
2014-04-14 05:20:47 +04:00
Right (Tweet {user = "bitemyapp"
, postDate = 2009-06-18 00:00:10 UTC
, message = "Use haskell!"
, age = 10000
, location = Location {lat = 40.12, lon = -71.34}})
2014-04-14 05:16:44 +04:00
#+END_SRC
** Search
*** Querying
2014-04-14 05:16:44 +04:00
**** Term Query
#+BEGIN_SRC haskell
-- exported by the Client module, just defaults some stuff.
-- mkSearch :: Maybe Query -> Maybe Filter -> Search
-- mkSearch query filter = Search query filter Nothing False 0 10
let query = TermQuery (Term "user" "bitemyapp") Nothing
-- AND'ing identity filter with itself and then tacking it onto a query
-- search should be a null-operation. I include it for the sake of example.
-- <||> (or/plus) should make it into a search that returns everything.
let filter = IdentityFilter <&&> IdentityFilter
2014-04-14 05:27:31 +04:00
-- constructing the search object the searchByIndex function dispatches on.
2014-04-14 05:16:44 +04:00
let search = mkSearch (Just query) (Just filter)
2014-04-14 05:27:31 +04:00
-- you can also searchByType and specify the mapping name.
2014-04-14 05:16:44 +04:00
reply <- searchByIndex testServer testIndex search
2014-04-14 05:27:31 +04:00
2014-04-14 05:16:44 +04:00
let result = eitherDecode (responseBody reply) :: Either String (SearchResult Tweet)
λ> fmap (hits . searchHits) result
2014-04-14 05:25:26 +04:00
Right [Hit {hitIndex = IndexName "twitter"
, hitType = MappingName "tweet"
, hitDocId = DocId "1"
, hitScore = 0.30685282
, hitSource = Tweet {user = "bitemyapp"
, postDate = 2009-06-18 00:00:10 UTC
, message = "Use haskell!"
, age = 10000
, location = Location {lat = 40.12, lon = -71.34}}}]
2014-04-14 05:16:44 +04:00
#+END_SRC
*** Filtering
2014-04-14 08:30:21 +04:00
**** And, Not, and Or filters
Filters form a monoid and seminearring.
#+BEGIN_SRC haskell
instance Monoid Filter where
mempty = IdentityFilter
mappend a b = AndFilter [a, b] defaultCache
instance Seminearring Filter where
a <||> b = OrFilter [a, b] defaultCache
-- AndFilter and OrFilter take [Filter] as an argument.
-- This will return anything, because IdentityFilter returns everything
OrFilter [IdentityFilter, someOtherFilter] False
-- This will return exactly what someOtherFilter returns
AndFilter [IdentityFilter, someOtherFilter] False
-- Thanks to the seminearring and monoid, the above can be expressed as:
-- "and"
IdentityFilter <&&> someOtherFilter
-- "or"
IdentityFilter <||> someOtherFilter
-- Also there is a NotFilter, it only accepts a single filter, not a list.
NotFilter someOtherFilter False
#+END_SRC
**** Identity Filter
#+BEGIN_SRC haskell
-- And'ing two Identity
let queryFilter = IdentityFilter <&&> IdentityFilter
let search = mkSearch Nothing (Just queryFilter)
reply <- searchByType testServer testIndex testMapping search
#+END_SRC
**** Boolean Filter
Similar to boolean queries.
#+BEGIN_SRC haskell
-- Will return only items whose "user" field contains the term "bitemyapp"
let queryFilter = BoolFilter (MustMatch (Term "user" "bitemyapp") False)
-- Will return only items whose "user" field does not contain the term "bitemyapp"
let queryFilter = BoolFilter (MustNotMatch (Term "user" "bitemyapp") False)
-- The clause (query) should appear in the matching document.
-- In a boolean query with no must clauses, one or more should
-- clauses must match a document. The minimum number of should
-- clauses to match can be set using the minimum_should_match parameter.
let queryFilter = BoolFilter (ShouldMatch [(Term "user" "bitemyapp")] False)
#+END_SRC
**** Exists Filter
#+BEGIN_SRC haskell
-- Will filter for documents that have the field "user"
let existsFilter = ExistsFilter (FieldName "user")
#+END_SRC
**** Geo BoundingBox Filter
#+BEGIN_SRC haskell
-- topLeft and bottomRight
let box = GeoBoundingBox (LatLon 40.73 (-74.1)) (LatLon 40.10 (-71.12))
let constraint = GeoBoundingBoxConstraint (FieldName "tweet.location") box False
-- second argument is GeoFilterType, memory or indexed.
let geoFilter = GeoBoundingBoxFilter constraint GeoFilterMemory
#+END_SRC
**** Geo Distance Filter
#+BEGIN_SRC haskell
let geoPoint = GeoPoint (FieldName "tweet.location") (LatLon 40.12 (-71.34))
-- coefficient and units
let distance = Distance 10.0 Miles
-- GeoFilterType or NoOptimizeBbox
let optimizeBbox = OptimizeGeoFilterType GeoFilterMemory
-- SloppyArc is the usual/default optimization in Elasticsearch today
-- but pre-1.0 versions will need to pick Arc or Plane.
let geoFilter = GeoDistanceFilter geoPoint distance SloppyArc optimizeBbox False
#+END_SRC
**** Geo Distance Range Filter
Think of a donut and you won't be far off.
#+BEGIN_SRC haskell
let geoPoint = GeoPoint (FieldName "tweet.location") (LatLon 40.12 (-71.34))
let distanceRange = DistanceRange (Distance 0.0 Miles) (Distance 10.0 Miles)
let geoFilter = GeoDistanceRangeFilter geoPoint distanceRange
#+END_SRC
**** Geo Polygon Filter
#+BEGIN_SRC haskell
-- I think I drew a square here.
let points = [LatLon 40.0 (-70.00),
LatLon 40.0 (-72.00),
LatLon 41.0 (-70.00),
LatLon 41.0 (-72.00)]
let geoFilter = GeoPolygonFilter (FieldName "tweet.location") points
#+END_SRC
**** Document IDs filter
#+BEGIN_SRC haskell
-- takes a mapping name and a list of DocIds
IdsFilter (MappingName "tweet") [DocId "1"]
#+END_SRC
**** Range Filter
***** Full Range
#+BEGIN_SRC haskell
-- RangeFilter :: FieldName
-- -> Either HalfRange Range
-- -> RangeExecution
-- -> Cache -> Filter
let filter = RangeFilter (FieldName "age")
(Right (RangeLtGt (LessThan 100000.0) (GreaterThan 1000.0)))
RangeExecutionIndex False
#+END_SRC
***** Half Range
#+BEGIN_SRC haskell
let filter = RangeFilter (FieldName "age")
(Left (HalfRangeLt (LessThan 100000.0)))
RangeExecutionIndex False
#+END_SRC
***** Regexp Filter
#+BEGIN_SRC haskell
-- RegexpFilter
-- :: FieldName
-- -> Regexp
-- -> RegexpFlags
-- -> CacheName
-- -> Cache
-- -> CacheKey
-- -> Filter
let filter = RegexpFilter (FieldName "user") (Regexp "bite.*app")
RegexpAll (CacheName "test") False (CacheKey "key")
-- RegexpFlags can be a combination of RegexpAll, Complement,
-- Interval, Intersection, AnyString, and a combination of two options thereof.
#+END_SRC
2014-04-12 23:09:36 +04:00
* Possible future functionality
** Node discovery and failover
Might require TCP support.
** Support for TCP access to Elasticsearch
Pretend to be a transport client?
** Bulk cluster-join merge
Might require making a lucene index on disk with the appropriate format.
2014-04-12 23:09:36 +04:00
** GeoShapeFilter
2014-04-11 05:04:24 +04:00
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-geo-shape-filter.html
2014-04-12 23:09:36 +04:00
** Geohash cell filter
2014-04-11 05:04:24 +04:00
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-geohash-cell-filter.html
2014-04-12 23:09:36 +04:00
** HasChild Filter
2014-04-11 05:04:24 +04:00
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-has-child-filter.html
2014-04-12 23:09:36 +04:00
** HasParent Filter
2014-04-11 05:04:24 +04:00
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-has-parent-filter.html
2014-04-12 23:09:36 +04:00
** Indices Filter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-indices-filter.html
2014-04-12 23:09:36 +04:00
** Query Filter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-filter.html
2014-04-12 23:09:36 +04:00
** Script based sorting
2014-04-12 03:13:19 +04:00
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-sort.html#_script_based_sorting
2014-04-12 23:09:36 +04:00
** Runtime checking for cycles in data structures
check for n > 1 occurrences in DFS:
http://hackage.haskell.org/package/stable-maps-0.0.5/docs/System-Mem-StableName-Dynamic.html
http://hackage.haskell.org/package/stable-maps-0.0.5/docs/System-Mem-StableName-Dynamic-Map.html
2014-04-12 14:12:17 +04:00
2014-04-12 23:09:36 +04:00
* Photo Origin
2014-04-12 14:12:17 +04:00
2014-04-14 05:25:58 +04:00
Photo from HA! Designs: https://www.flickr.com/photos/hadesigns/