Getting started with Elasticsearch

Florian Gilcher, asquera GmbH

Goal of this presentation

Skim the surface of Elasticsearch and give you pointers on where to start and what not to ignore.

What is Elasticsearch?

Quickstart

Downloads are found at http://elasticsearch.org/downloads.

$ curl -O https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.3.0.zip
$ unzip elasticsearch-1.3.0.zip
$ cd elasticsearch-1.3.0
$ bin/elasticsearch

Welcome

[Tempus] version[1.3.0], pid[17392], build[c8714e8/2013-09-17T12:50:20Z]

Your First Data

{
  "id": 123,
  "name": "Florian Gilcher",
  "place": "Berlin",
  "birthdate": "1983-10-04T00:00:00+01:00",
  "interests": ["code", "data", "elasticsearch"],
  "age": 30
}

What can I put in there?

Elasticsearch handles:

Reference

Index

Let’s put the stuff in the database.

$ export HOST=http://localhost:9200
$ curl -XPOST $HOST/my_index/person/123 -d @person.json
{"ok":true,"_index":"my_index","_type":"person","_id":"123","_version":1}

This operation is part of the Document API and is called “Index”.
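
The same index call can be prepared from a script. The following is a minimal Python sketch, not an official client: it only builds the HTTP request for the URL layout used in the curl example (index, type, id). The helper name build_index_request is made up for illustration, and sending the request is left commented out because it needs a node running on localhost:9200.

```python
import json
import urllib.request

HOST = "http://localhost:9200"  # same host as in the curl example


def build_index_request(index, doc_type, doc_id, document, host=HOST):
    """Prepare (but do not send) an index request for one document.

    build_index_request is a hypothetical helper, not part of any client library.
    """
    url = "{}/{}/{}/{}".format(host, index, doc_type, doc_id)
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


person = {
    "id": 123,
    "name": "Florian Gilcher",
    "place": "Berlin",
    "birthdate": "1983-10-04T00:00:00+01:00",
    "interests": ["code", "data", "elasticsearch"],
    "age": 30,
}

request = build_index_request("my_index", "person", 123, person)
print(request.get_method(), request.full_url)

# With a node running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything reaches the cluster.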

Retrieval

Here is how we get it back:

$ curl -XGET $HOST/my_index/person/123

Result


"hits" : [ {
    "_index" : "test",
    "_type" : "person",
    "_id" : "1",
    "_score" : 1.0,
    "_source" : {  "id": 123,  "name": "Florian Gilcher",  "place": "Berlin",  "birthdate": "1983-10-04T00:00:00+01:00",  "interests": ["code", "data", "elasticsearch"],  "age": 30}
  } ]
...

Search using the DSL

Beyond debugging, the Query DSL is recommended for search queries:

{
    "query": { "match" : { "name" : "florian" } }
}

Queries and Filters

Queries can be constrained by filtering the data before running the query:

{
    "query": { "match" : { "name" : "florian" } },
    "filter": { "range" : { "age" : { "gte": 25, "lte": 35 } } }
}
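
To make the two steps concrete, here is a small Python sketch. It builds the request body shown above as a dictionary and then imitates the "filter first, then query" idea on an in-memory list. The function names and the sample documents are assumptions for illustration; the emulation ignores analysis and scoring entirely.

```python
import json


def build_search_body(name, min_age, max_age):
    """Build the request body shown above: a match query plus a range filter."""
    return {
        "query": {"match": {"name": name}},
        "filter": {"range": {"age": {"gte": min_age, "lte": max_age}}},
    }


def emulate_search(docs, name, min_age, max_age):
    """Rough in-memory emulation: filter by age first, then match on the name.

    This only mirrors the order of the two steps; it is not how
    Elasticsearch evaluates the request internally.
    """
    in_range = [d for d in docs if min_age <= d["age"] <= max_age]
    return [d for d in in_range if name.lower() in d["name"].lower().split()]


# Hypothetical sample data; only the first entry comes from the slides.
docs = [
    {"name": "Florian Gilcher", "age": 30},
    {"name": "Florian Example", "age": 52},
    {"name": "Felix Gilcher", "age": 28},
]

body = build_search_body("florian", 25, 35)
print(json.dumps(body, indent=2))
print(emulate_search(docs, "florian", 25, 35))
```

The second Florian is dropped by the age range before the name is ever compared, which is the point of constraining a query with a filter.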

Queries and Filters

The Query DSL is a tiny programming language in itself and merits learning properly.

Scoring in a nutshell

All results are ranked by a score. The score represents how well a document matches, based on:

Scoring can be influenced

There are multiple queries that can influence scoring.

The go-to is the function_score query, which can, for example:

Mappings

Mappings describe how incoming values are stored in the Lucene index.

Elasticsearch automatically detects the mapping of newly added types and fields.

Example Mapping

$ curl 'localhost:9200/test/person/_mapping?pretty'
...
"person" : {
  "properties" : {
    "age" : { "type" : "long" },
    "birthdate" : { "type" : "date", "format" : "dateOptionalTime" },
    ...
    "name" : { "type" : "string" }
  }
}
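
The automatic detection can be approximated in a few lines of Python. This is a simplified sketch of the idea, not Elasticsearch's actual detection logic: detect_type and detect_mapping are hypothetical names, and datetime.fromisoformat stands in for the real date detection.

```python
from datetime import datetime


def detect_type(value):
    """Guess a field type for a JSON value, loosely imitating dynamic mapping."""
    if value is None:
        return None  # nothing can be inferred from a null value
    if isinstance(value, bool):  # bool must be checked before int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        # Arrays have no type of their own; use the first element's type.
        return detect_type(value[0]) if value else None
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # stand-in for date detection
            return "date"
        except ValueError:
            return "string"
    return "object"


def detect_mapping(document):
    """Return a {field: type} guess for every top-level field."""
    return {field: detect_type(value) for field, value in document.items()}


person = {
    "id": 123,
    "name": "Florian Gilcher",
    "place": "Berlin",
    "birthdate": "1983-10-04T00:00:00+01:00",
    "interests": ["code", "data", "elasticsearch"],
    "age": 30,
}

for field, field_type in detect_mapping(person).items():
    print(field, "->", field_type)
```

For the person document this yields long for age, date for birthdate and string for name, in line with the mapping output shown above.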

Analysis

Analysis is the step of breaking text data into terms that can be indexed.

Searches are also analyzed.

What we end up with

Lucene builds an inverted index of your data.

keyword   documents
florian   1
gilcher   1,2
felix     2
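
How such a table comes about can be shown with a short Python sketch. It assumes two documents whose text fits the table (the content of document 2 is a guess) and uses plain lowercasing and whitespace splitting instead of a real analyzer.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Document 2 is an assumed example that fits the table above.
docs = {1: "Florian Gilcher", 2: "Felix Gilcher"}

index = build_inverted_index(docs)
print(index)  # {'florian': [1], 'gilcher': [1, 2], 'felix': [2]}
```

Looking up a term is then a single dictionary access, which is why searching the index is fast compared to scanning every document.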

How we get there

step                  Input                  Output
Whitespace Tokenizer  “The quick fox”        “The” “quick” “fox”
lowercase filter      “The” “quick” “fox”    “the” “quick” “fox”
stopword filter       “the” “quick” “fox”    “quick” “fox”
synonym filter        “quick” “fox”          “quick” “fast” “fox”

A match query for quick, fast or fox will find this document.
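
The chain above can be imitated in plain Python to see what each step does. This is an illustrative sketch only: the stopword list and the synonym table are assumptions chosen to reproduce the table, and the function names do not correspond to any Elasticsearch API.

```python
STOPWORDS = {"the", "a", "an"}   # assumed stopword list
SYNONYMS = {"quick": ["fast"]}   # assumed synonym table


def whitespace_tokenizer(text):
    """Split the input on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    return [token.lower() for token in tokens]


def stopword_filter(tokens):
    return [token for token in tokens if token not in STOPWORDS]


def synonym_filter(tokens):
    """Keep each token and add its synonyms right after it."""
    result = []
    for token in tokens:
        result.append(token)
        result.extend(SYNONYMS.get(token, []))
    return result


def analyze(text):
    """Run the tokenizer, then every filter in order."""
    tokens = whitespace_tokenizer(text)
    for token_filter in (lowercase_filter, stopword_filter, synonym_filter):
        tokens = token_filter(tokens)
    return tokens


def matches(query, document_text):
    """A query matches if any analyzed query term is among the document's terms."""
    document_terms = set(analyze(document_text))
    return any(term in document_terms for term in analyze(query))


print(analyze("The quick fox"))           # ['quick', 'fast', 'fox']
print(matches("fast", "The quick fox"))   # True
print(matches("the", "The quick fox"))    # False
```

Because the query goes through the same chain, a search for "the" produces no terms at all and therefore finds nothing, while "fast" matches through the synonym.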

Analysis is important!

Getting analysis right is the difference between a good and a bad search.

Aggregations

Aggregations are split into two parts:

Aggregations

{  "aggregations": {
        "range": {
            "date_range": {
                "field": "date",
                "format": "MM-yyy",
                "ranges": [ { "to": "now" }, { "from": "now" } ]
...

Aggregation nesting

{  "aggregations": {
        "range": {
            "date_range": { ... },
            "aggregations": {
                "monthly" : {
                    "date_histogram" : {
                        "field" : "date",
                        "interval" : "1M",
...

Metrics

{
    "aggs" : {
        "min_price" : { "min" : { "field" : "price" } }
    }
}

Metrics can be used as the last nested aggregation as well.
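
What a metric nested inside a bucketing aggregation computes can be sketched in Python: group the documents into monthly buckets, then take the minimum price per bucket. The sample documents and the function name are assumptions for illustration, and the result layout is simplified compared to a real response.

```python
from collections import defaultdict


def min_price_per_month(docs):
    """Bucket documents by month, then compute a min metric in each bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        month = doc["date"][:7]  # "YYYY-MM" prefix of an ISO date
        buckets[month].append(doc["price"])
    return {
        month: {"doc_count": len(prices), "min_price": min(prices)}
        for month, prices in sorted(buckets.items())
    }


# Hypothetical sample documents with the "date" and "price" fields used above.
docs = [
    {"date": "2014-06-03", "price": 12.5},
    {"date": "2014-06-21", "price": 9.0},
    {"date": "2014-07-02", "price": 20.0},
]

for month, result in min_price_per_month(docs).items():
    print(month, result)
```

The buckets play the role of the date_histogram, and the min over each bucket is the metric sitting at the innermost level.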

Distribution

Distribution works at an index-level by breaking the index into shards and distributing it over the cluster.

Node

A node is a running instance of Elasticsearch. A production system should consist of at least two nodes.

Elasticsearch Index

An Index stores documents. It consists of Shards. The “Elasticsearch index” is not a Lucene index.

Shard / Replica

An index is split into multiple shards (5 by default). For reliability and performance reasons, each shard is copied multiple times. These copies are called “replicas”.
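
Which shard a document lands on is decided by hashing a routing value (the document id unless specified otherwise) modulo the number of primary shards. The Python sketch below illustrates that idea only; zlib.crc32 is a stand-in for the hash function Elasticsearch actually uses, and shard_for is a hypothetical name.

```python
import zlib

NUMBER_OF_SHARDS = 5  # the default mentioned above


def shard_for(doc_id, number_of_shards=NUMBER_OF_SHARDS):
    """Pick a shard by hashing the document id.

    zlib.crc32 is only a stand-in; Elasticsearch uses its own hash function,
    so real shard numbers will differ.
    """
    return zlib.crc32(str(doc_id).encode("utf-8")) % number_of_shards


placement = {doc_id: shard_for(doc_id) for doc_id in ("123", "124", "125")}
print(placement)
```

Since the result depends on the number of shards, the same id would map to a different shard if that number changed, which is why the shard count of an index is fixed when the index is created.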

Distribution

The shards are distributed using the following strategy:

What do Replicas give?

Replicas allow two things:

Pitfalls

The main pitfalls when starting with Elasticsearch are:

Things to definitely poke around with

Photo Credit

Cover: Bonsai Rock, Lake Tahoe

http://www.flickr.com/photos/tensafefrogs/4513403767/
