* Bloodhound
#+CAPTION: Bloodhound (dog)
[[./bloodhound.jpg]]
#+CAPTION: Build Status
[[https://travis-ci.org/bitemyapp/bloodhound][https://travis-ci.org/bitemyapp/bloodhound.svg]]
* Elasticsearch client and query DSL for Haskell
** Why?
Search doesn't have to be hard. Let the dog do it.
** Endorsements
"Bloodhound makes Elasticsearch almost tolerable!" - Almost-gruntled user
** Version compatibility
Elasticsearch >= 1.0 is recommended. Bloodhound mostly works with 0.9.x, but I don't recommend it if you expect everything to work. As of Bloodhound 0.3, all >= 1.0 versions of Elasticsearch work.
Current versions we test against are 1.0.3, 1.1.2, 1.2.3, and 1.3.2. We also check that GHC 7.6 and 7.8 both build and pass tests. See our [[https://travis-ci.org/bitemyapp/bloodhound][TravisCI]] to learn more.
** Stability
Bloodhound is beta at the moment. We've got a solid library tested across multiple ES versions, but coverage of Elasticsearch functionality isn't 100% yet.
* Hackage page and Haddock documentation
http://hackage.haskell.org/package/bloodhound
* Examples
** Index Operations
*** Create Index
#+BEGIN_SRC haskell
-- Formatted for use in ghci, so there are "let"s in front of the decls.
-- if you see :{ and :}, they're so you can copy-paste
-- the multi-line examples into your ghci REPL.
:set -XDeriveGeneric
import Database.Bloodhound
import Data.Aeson
import Data.Either (Either(..))
import Data.Maybe (fromJust)
import Data.Time.Calendar (Day(..))
import Data.Time.Clock (secondsToDiffTime, UTCTime(..))
import Data.Text (Text)
import GHC.Generics (Generic)
import Network.HTTP.Conduit
import qualified Network.HTTP.Types.Status as NHTS
-- no trailing slash in the Server URL; the library handles building the path.
let testServer = (Server "http://localhost:9200")
let testIndex = IndexName "twitter"
let testMapping = MappingName "tweet"
-- defaultIndexSettings is exported by Database.Bloodhound as well
let defaultIndexSettings = IndexSettings (ShardCount 3) (ReplicaCount 2)
-- createIndex returns IO Reply
-- response :: Reply, Reply is a synonym for Network.HTTP.Conduit.Response
response <- createIndex testServer defaultIndexSettings testIndex
#+END_SRC
*** Delete Index
**** Code
#+BEGIN_SRC haskell
-- response :: Reply
response <- deleteIndex testServer testIndex
#+END_SRC
**** Example Response
#+BEGIN_SRC haskell
-- print response if it was a success
Response {responseStatus = Status {statusCode = 200, statusMessage = "OK"}
, responseVersion = HTTP/1.1
, responseHeaders = [("Content-Type", "application/json; charset=UTF-8")
, ("Content-Length", "21")]
, responseBody = "{\"acknowledged\":true}"
, responseCookieJar = CJ {expose = []}
, responseClose' = ResponseClose}
-- if the index to be deleted didn't exist anyway
Response {responseStatus = Status {statusCode = 404, statusMessage = "Not Found"}
, responseVersion = HTTP/1.1
, responseHeaders = [("Content-Type", "application/json; charset=UTF-8")
, ("Content-Length","65")]
, responseBody = "{\"error\":\"IndexMissingException[[twitter] missing]\",\"status\":404}"
, responseCookieJar = CJ {expose = []}
, responseClose' = ResponseClose}
#+END_SRC
*** Refresh Index
**** Note: you *have* to do this if you expect to read what you just wrote
#+BEGIN_SRC haskell
resp <- refreshIndex testServer testIndex
#+END_SRC
**** Example Response
#+BEGIN_SRC haskell
-- print resp on success
Response {responseStatus = Status {statusCode = 200, statusMessage = "OK"}
, responseVersion = HTTP/1.1
, responseHeaders = [("Content-Type", "application/json; charset=UTF-8")
, ("Content-Length","50")]
, responseBody = "{\"_shards\":{\"total\":10,\"successful\":5,\"failed\":0}}"
, responseCookieJar = CJ {expose = []}
, responseClose' = ResponseClose}
#+END_SRC
** Mapping Operations
*** Create Mapping
#+BEGIN_SRC haskell
-- don't forget imports and the like at the top.
data TweetMapping = TweetMapping deriving (Eq, Show)
-- I know writing the JSON manually sucks.
-- I don't have a proper data type for Mappings yet.
-- Let me know if this is something you need.
:{
instance ToJSON TweetMapping where
toJSON TweetMapping =
object ["tweet" .=
object ["properties" .=
object ["location" .=
object ["type" .= ("geo_point" :: Text)]]]]
:}
resp <- putMapping testServer testIndex testMapping TweetMapping
#+END_SRC
*** Delete Mapping
#+BEGIN_SRC haskell
resp <- deleteMapping testServer testIndex testMapping
#+END_SRC
** Document Operations
*** Indexing Documents
#+BEGIN_SRC haskell
-- don't forget the imports and derive generic setting for ghci
-- at the beginning of the examples.
:{
data Location = Location { lat :: Double
, lon :: Double } deriving (Eq, Generic, Show)
data Tweet = Tweet { user :: Text
, postDate :: UTCTime
, message :: Text
, age :: Int
, location :: Location } deriving (Eq, Generic, Show)
exampleTweet = Tweet { user = "bitemyapp"
, postDate = UTCTime
(ModifiedJulianDay 55000)
(secondsToDiffTime 10)
, message = "Use haskell!"
, age = 10000
, location = Location 40.12 (-71.34) }
-- automagic (generic) derivation of instances because we're lazy.
instance ToJSON Tweet
instance FromJSON Tweet
instance ToJSON Location
instance FromJSON Location
:}
-- Should be able to toJSON and encode the data structures like this:
-- λ> toJSON $ Location 10.0 10.0
-- Object fromList [("lat",Number 10.0),("lon",Number 10.0)]
-- λ> encode $ Location 10.0 10.0
-- "{\"lat\":10,\"lon\":10}"
resp <- indexDocument testServer testIndex testMapping exampleTweet (DocId "1")
#+END_SRC
**** Example Response
#+BEGIN_SRC haskell
Response {responseStatus =
Status {statusCode = 200, statusMessage = "OK"}
, responseVersion = HTTP/1.1, responseHeaders =
[("Content-Type","application/json; charset=UTF-8"),
("Content-Length","75")]
, responseBody = "{\"_index\":\"twitter\",\"_type\":\"tweet\",\"_id\":\"1\",\"_version\":2,\"created\":false}"
, responseCookieJar = CJ {expose = []}, responseClose' = ResponseClose}
#+END_SRC
*** Deleting Documents
#+BEGIN_SRC haskell
resp <- deleteDocument testServer testIndex testMapping (DocId "1")
#+END_SRC
*** Getting Documents
#+BEGIN_SRC haskell
-- n.b., you'll need the earlier imports. responseBody is from http-conduit
resp <- getDocument testServer testIndex testMapping (DocId "1")
-- responseBody :: Response body -> body
let body = responseBody resp
-- you have two options: use decode and get Maybe (EsResult Tweet),
-- or use eitherDecode and get Either String (EsResult Tweet).
let maybeResult = decode body :: Maybe (EsResult Tweet)
-- the explicit typing is so Aeson knows how to parse the JSON.
-- use eitherDecode if you want to know why something failed to parse.
-- (string errors, sadly)
let eitherResult = eitherDecode body :: Either String (EsResult Tweet)
-- print eitherResult should look like:
Right (EsResult {_index = "twitter"
, _type = "tweet"
, _id = "1"
, _version = 2
, found = Just True
, _source = Tweet {user = "bitemyapp"
, postDate = 2009-06-18 00:00:10 UTC
, message = "Use haskell!"
, age = 10000
, location = Location {lat = 40.12, lon = -71.34}}})
-- _source in EsResult is parametric; we dispatch on the type by passing
-- in what we expect (Tweet) as a parameter to EsResult.
-- use the _source record accessor to get at your document
λ> fmap _source eitherResult
Right (Tweet {user = "bitemyapp"
, postDate = 2009-06-18 00:00:10 UTC
, message = "Use haskell!"
, age = 10000
, location = Location {lat = 40.12, lon = -71.34}})
#+END_SRC
** Bulk Operations
*** Bulk create, index
#+BEGIN_SRC haskell
-- don't forget the imports and derive generic setting for ghci
-- at the beginning of the examples.
:{
-- Using the earlier Tweet datatype and exampleTweet data
-- just changing up the data a bit.
let bulkTest = exampleTweet { user = "blah" }
let bulkTestTwo = exampleTweet { message = "woohoo!" }
-- create only bulk operation
-- BulkCreate :: IndexName -> MappingName -> DocId -> Value -> BulkOperation
let firstOp = BulkCreate testIndex
testMapping (DocId "3") (toJSON bulkTest)
-- index operation "create or update"
let sndOp = BulkIndex testIndex
testMapping (DocId "4") (toJSON bulkTestTwo)
-- Some explanation: the final "Value" type that BulkIndex,
-- BulkCreate, and BulkUpdate accept is the actual document
-- data that your operation applies to. BulkDelete doesn't
-- take a value because it's just deleting whatever DocId
-- you pass.
-- list of bulk operations
let stream = [firstOp, sndOp]
-- Fire off the actual bulk request
-- bulk :: Server -> [BulkOperation] -> IO Reply
resp <- bulk testServer stream
:}
#+END_SRC
** Search
*** Querying
**** Term Query
#+BEGIN_SRC haskell
-- exported by the Client module, just defaults some stuff.
-- mkSearch :: Maybe Query -> Maybe Filter -> Search
-- mkSearch query filter = Search query filter Nothing False 0 10
let query = TermQuery (Term "user" "bitemyapp") Nothing
-- AND'ing the identity filter with itself and then tacking it onto a
-- query search should be a no-op. I include it for the sake of example.
-- <||> (or/plus) would instead make it a search that returns everything.
let filter = IdentityFilter <&&> IdentityFilter
-- construct the Search object that the searchByIndex function dispatches on.
let search = mkSearch (Just query) (Just filter)
-- you can also searchByType and specify the mapping name.
reply <- searchByIndex testServer testIndex search
let result = eitherDecode (responseBody reply) :: Either String (SearchResult Tweet)
λ> fmap (hits . searchHits) result
Right [Hit {hitIndex = IndexName "twitter"
, hitType = MappingName "tweet"
, hitDocId = DocId "1"
, hitScore = 0.30685282
, hitSource = Tweet {user = "bitemyapp"
, postDate = 2009-06-18 00:00:10 UTC
, message = "Use haskell!"
, age = 10000
, location = Location {lat = 40.12, lon = -71.34}}}]
#+END_SRC
**** Match Query
#+BEGIN_SRC haskell
let query = QueryMatchQuery $ mkMatchQuery (FieldName "user") (QueryString "bitemyapp")
let search = mkSearch (Just query) Nothing
#+END_SRC
**** Multi-Match Query
#+BEGIN_SRC haskell
let fields = [FieldName "user", FieldName "message"]
let query = QueryMultiMatchQuery $ mkMultiMatchQuery fields (QueryString "bitemyapp")
let search = mkSearch (Just query) Nothing
#+END_SRC
**** Bool Query
#+BEGIN_SRC haskell
let innerQuery = QueryMatchQuery $
mkMatchQuery (FieldName "user") (QueryString "bitemyapp")
let query = QueryBoolQuery $
mkBoolQuery (Just innerQuery) Nothing Nothing
let search = mkSearch (Just query) Nothing
#+END_SRC
**** Boosting Query
#+BEGIN_SRC haskell
let posQuery = QueryMatchQuery $
mkMatchQuery (FieldName "user") (QueryString "bitemyapp")
let negQuery = QueryMatchQuery $
mkMatchQuery (FieldName "user") (QueryString "notmyapp")
let query = QueryBoostingQuery $
BoostingQuery posQuery negQuery (Boost 0.2)
#+END_SRC
**** Rest of the query/filter types
Just follow the pattern you've seen here and check the Hackage API documentation.
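For example, the query constructors compose. Here is a small sketch (untested, following the same pattern and reusing only constructors shown above) that nests the multi-match query inside a bool query and runs it the same way as the term query example:
#+BEGIN_SRC haskell
-- a sketch built from the constructors used earlier in this README;
-- swap in field names and query strings that match your own mapping.
let innerQuery = QueryMultiMatchQuery $
                 mkMultiMatchQuery [FieldName "user", FieldName "message"]
                                   (QueryString "bitemyapp")
let query = QueryBoolQuery $
            mkBoolQuery (Just innerQuery) Nothing Nothing
let search = mkSearch (Just query) Nothing
reply <- searchByIndex testServer testIndex search
let result = eitherDecode (responseBody reply) :: Either String (SearchResult Tweet)
#+END_SRC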
*** Sorting
#+BEGIN_SRC haskell
let sortSpec = DefaultSortSpec $ mkSort (FieldName "age") Ascending
-- mkSort is a shortcut function that takes a FieldName and a SortOrder
-- to generate a vanilla DefaultSort.
-- check the DefaultSort type for the full list of customizable options.
-- From and Size are integers for pagination.
-- When sorting on a field, scores are not computed. By setting TrackSortScores to true, scores will still be computed and tracked.
-- type Sort = [SortSpec]
-- type TrackSortScores = Bool
-- type From = Int
-- type Size = Int
-- Search takes Maybe Query
-- -> Maybe Filter
-- -> Maybe Sort
-- -> TrackSortScores
-- -> From -> Size
-- just add more sortspecs to the list if you want tie-breakers.
let search = Search Nothing (Just IdentityFilter) (Just [sortSpec]) False 0 10
#+END_SRC
*** Filtering
**** And, Not, and Or filters
Filters form a monoid and seminearring.
#+BEGIN_SRC haskell
instance Monoid Filter where
mempty = IdentityFilter
mappend a b = AndFilter [a, b] defaultCache
instance Seminearring Filter where
a <||> b = OrFilter [a, b] defaultCache
-- AndFilter and OrFilter take [Filter] as an argument.
-- This will return anything, because IdentityFilter returns everything
OrFilter [IdentityFilter, someOtherFilter] False
-- This will return exactly what someOtherFilter returns
AndFilter [IdentityFilter, someOtherFilter] False
-- Thanks to the seminearring and monoid, the above can be expressed as:
-- "and"
IdentityFilter <&&> someOtherFilter
-- "or"
IdentityFilter <||> someOtherFilter
-- There is also a NotFilter; it accepts a single filter, not a list.
NotFilter someOtherFilter False
#+END_SRC
**** Identity Filter
#+BEGIN_SRC haskell
-- And'ing two Identity
let queryFilter = IdentityFilter <&&> IdentityFilter
let search = mkSearch Nothing (Just queryFilter)
reply <- searchByType testServer testIndex testMapping search
#+END_SRC
**** Boolean Filter
Similar to boolean queries.
#+BEGIN_SRC haskell
-- Will return only items whose "user" field contains the term "bitemyapp"
let queryFilter = BoolFilter (MustMatch (Term "user" "bitemyapp") False)
-- Will return only items whose "user" field does not contain the term "bitemyapp"
let queryFilter = BoolFilter (MustNotMatch (Term "user" "bitemyapp") False)
-- The clause (query) should appear in the matching document.
-- In a boolean query with no must clauses, one or more should
-- clauses must match a document. The minimum number of should
-- clauses to match can be set using the minimum_should_match parameter.
let queryFilter = BoolFilter (ShouldMatch [(Term "user" "bitemyapp")] False)
#+END_SRC
**** Exists Filter
#+BEGIN_SRC haskell
-- Will filter for documents that have the field "user"
let existsFilter = ExistsFilter (FieldName "user")
#+END_SRC
**** Geo BoundingBox Filter
#+BEGIN_SRC haskell
-- topLeft and bottomRight
let box = GeoBoundingBox (LatLon 40.73 (-74.1)) (LatLon 40.10 (-71.12))
let constraint = GeoBoundingBoxConstraint (FieldName "tweet.location") box False GeoFilterMemory
#+END_SRC
**** Geo Distance Filter
#+BEGIN_SRC haskell
let geoPoint = GeoPoint (FieldName "tweet.location") (LatLon 40.12 (-71.34))
-- coefficient and units
let distance = Distance 10.0 Miles
-- GeoFilterType or NoOptimizeBbox
let optimizeBbox = OptimizeGeoFilterType GeoFilterMemory
-- SloppyArc is the usual/default optimization in Elasticsearch today
-- but pre-1.0 versions will need to pick Arc or Plane.
let geoFilter = GeoDistanceFilter geoPoint distance SloppyArc optimizeBbox False
#+END_SRC
**** Geo Distance Range Filter
Think of a donut and you won't be far off.
#+BEGIN_SRC haskell
let geoPoint = GeoPoint (FieldName "tweet.location") (LatLon 40.12 (-71.34))
let distanceRange = DistanceRange (Distance 0.0 Miles) (Distance 10.0 Miles)
let geoFilter = GeoDistanceRangeFilter geoPoint distanceRange
#+END_SRC
**** Geo Polygon Filter
#+BEGIN_SRC haskell
-- I think I drew a square here.
let points = [LatLon 40.0 (-70.00),
LatLon 40.0 (-72.00),
LatLon 41.0 (-70.00),
LatLon 41.0 (-72.00)]
let geoFilter = GeoPolygonFilter (FieldName "tweet.location") points
#+END_SRC
**** Document IDs filter
#+BEGIN_SRC haskell
-- takes a mapping name and a list of DocIds
IdsFilter (MappingName "tweet") [DocId "1"]
#+END_SRC
**** Range Filter
***** Full Range
#+BEGIN_SRC haskell
-- RangeFilter :: FieldName
-- -> Either HalfRange Range
-- -> RangeExecution
-- -> Cache -> Filter
let filter = RangeFilter (FieldName "age")
(Right (RangeLtGt (LessThan 100000.0) (GreaterThan 1000.0)))
RangeExecutionIndex False
#+END_SRC
***** Half Range
#+BEGIN_SRC haskell
let filter = RangeFilter (FieldName "age")
(Left (HalfRangeLt (LessThan 100000.0)))
RangeExecutionIndex False
#+END_SRC
**** Regexp Filter
#+BEGIN_SRC haskell
-- RegexpFilter
-- :: FieldName
-- -> Regexp
-- -> RegexpFlags
-- -> CacheName
-- -> Cache
-- -> CacheKey
-- -> Filter
let filter = RegexpFilter (FieldName "user") (Regexp "bite.*app")
AllRegexpFlags (CacheName "test") False (CacheKey "key")
-- n.b.
-- data RegexpFlags = AllRegexpFlags
-- | NoRegexpFlags
-- | SomeRegexpFlags (NonEmpty RegexpFlag) deriving (Eq, Show)
-- data RegexpFlag = AnyString
-- | Automaton
-- | Complement
-- | Empty
-- | Intersection
-- | Interval deriving (Eq, Show)
#+END_SRC
* Possible future functionality
** Span Queries
Beginning here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-span-first-query.html
** Function Score Query
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html
** Node discovery and failover
Might require TCP support.
** Support for TCP access to Elasticsearch
Pretend to be a transport client?
** Bulk cluster-join merge
Might require making a lucene index on disk with the appropriate format.
** GeoShapeQuery
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-geo-shape-query.html
** GeoShapeFilter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-geo-shape-filter.html
** Geohash cell filter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-geohash-cell-filter.html
** HasChild Filter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-has-child-filter.html
** HasParent Filter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-has-parent-filter.html
** Indices Filter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-indices-filter.html
** Query Filter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-filter.html
** Script based sorting
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-sort.html#_script_based_sorting
** Collapsing redundantly nested and/or structures
The Seminearring instance, if used in deeply nested expressions, can produce redundant nested structure. Depending on how this affects Elasticsearch performance, reducing this structure might be valuable.
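As a rough illustration, assuming the AndFilter shape used in the Filtering examples above, chained <&&> can build nested AndFilters where a single flat AndFilter would express the same condition:
#+BEGIN_SRC haskell
-- a sketch only; the exact cache value depends on defaultCache.
-- what chained <&&> can build:
let nested = AndFilter [AndFilter [IdentityFilter, IdentityFilter] False, IdentityFilter] False
-- the flat equivalent that collapsing would produce:
let flattened = AndFilter [IdentityFilter, IdentityFilter, IdentityFilter] False
#+END_SRC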
** Runtime checking for cycles in data structures
check for n > 1 occurrences in DFS:
http://hackage.haskell.org/package/stable-maps-0.0.5/docs/System-Mem-StableName-Dynamic.html
http://hackage.haskell.org/package/stable-maps-0.0.5/docs/System-Mem-StableName-Dynamic-Map.html
* Photo Origin
Photo from HA! Designs: https://www.flickr.com/photos/hadesigns/