ElasticSearch: powerful search and analytics engine

From Deimos.fr / Bloc Notes Informatique
Elastic Search

Software version: 1.3
Operating System: Debian 7
Website: ElasticSearch Website
Last Update: 12/08/2014

1 Introduction

Elasticsearch[1] is a flexible and powerful open source, distributed, real-time search and analytics engine. Architected from the ground up for use in distributed environments where reliability and scalability are must haves, Elasticsearch gives you the ability to move easily beyond simple full-text search. Through its robust set of APIs and query DSLs, plus clients for the most popular programming languages, Elasticsearch delivers on the near limitless promises of search technology.

2 Basic concepts

Here are some Lucene concepts that you need to know:

  • Lucene stores its data in a structure called an inverted index, which maps each term to the documents that contain it.
  • A document can't be modified in place: you can only delete it and then insert it again.
  • Deletes create fragmentation. The process that merges the data back together is called a segment merge (comparable to "optimize" on MariaDB XtraDB).
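
To make the idea of an inverted index more concrete, here is a minimal sketch in Python. It is only an illustration and not how Lucene stores its segments: the function name build_inverted_index and the sample documents are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents (id -> text)
docs = {1: "fast red roadster", 2: "red sports bike", 3: "fast bike"}
index = build_inverted_index(docs)

print(index["red"])   # [1, 2]
print(index["bike"])  # [2, 3]
```

Searching for a term is then a simple lookup in the index instead of a scan of every document.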

2.1 Input data

Data analysis is done by the analyzer, which is built of a tokenizer and zero or more token filters. It can also have zero or more character mappers. A tokenizer in Lucene splits the text into tokens, which are then passed through the token filters.

Token filters are processed sequentially. The character mappers are applied before the tokenizer. For example, you can remove HTML tags with them.

Notes
Remove all unnecessary content, such as HTML tags, to avoid skewed scoring.
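
The order of the analysis steps (character mappers, then the tokenizer, then the token filters one after another) can be sketched as follows. This is a simplified Python model, not the Lucene API: every function name here is hypothetical, and the stop word list is an arbitrary example.

```python
import re


def strip_html(text):
    """Character mapper: works on the raw text, before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer: splits the text into tokens."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: lowercases every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens, stop_words=frozenset({"the", "a", "of"})):
    """Token filter: drops a few (arbitrarily chosen) stop words."""
    return [t for t in tokens if t not in stop_words]


def analyze(text, char_mappers, tokenizer, token_filters):
    for mapper in char_mappers:          # 1. character mappers
        text = mapper(text)
    tokens = tokenizer(text)             # 2. exactly one tokenizer
    for token_filter in token_filters:   # 3. token filters, in sequence
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>The <b>King</b> of roadsters</p>",
    [strip_html],
    whitespace_tokenizer,
    [lowercase_filter, stop_filter],
)
print(tokens)  # ['king', 'roadsters']
```

Without the strip_html step, tokens such as "<p>The" would end up in the index and distort the scoring.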

2.2 Index

A query may or may not be analyzed, depending on the query type you choose. For example, the prefix and the term queries are not analyzed, while the match query is. In ElasticSearch, an index is comparable to a database in MariaDB (and a type within it to a table). Data is stored as JSON objects called "documents".

2.3 Architecture

ElasticSearch can run in standalone mode or as a cluster. A cluster implies sharding and replication (figure: Es-cluster.png).

When you send a new document to the cluster, you specify a target index and send it to one node (any of the available nodes). In cluster mode, the node that receives the document forwards it to the node holding the primary shard. For each shard, there is only one primary copy that accepts writes. If its node goes down, a replica on another node takes over as primary.
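
The following Python sketch models these two ideas: choosing a shard from the document id, and promoting a replica when the node holding the primary fails. It is a toy model under stated assumptions: ElasticSearch picks the shard with a hash of the routing value modulo the number of primary shards, but crc32 is only a stand-in for its internal hash, and the function names and data layout are hypothetical.

```python
import zlib


def pick_shard(doc_id, number_of_primary_shards):
    """Choose a shard number for a document id.

    crc32 is a stand-in: it is NOT the hash ElasticSearch uses internally.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards


def handle_node_failure(copies, failed_node):
    """Drop the copies hosted on the failed node and promote a replica if needed."""
    survivors = [dict(c) for c in copies if c["node"] != failed_node]
    if survivors and not any(c["primary"] for c in survivors):
        survivors[0]["primary"] = True  # a replica becomes the new primary
    return survivors


# One shard with its primary on node1 and a replica on node2
copies = [
    {"node": "node1", "primary": True},
    {"node": "node2", "primary": False},
]
print(pick_shard("1", 5))                    # a number between 0 and 4
print(handle_node_failure(copies, "node1"))  # [{'node': 'node2', 'primary': True}]
```

Because the shard is derived from the id, any node can work out where a document lives without asking a central directory.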

3 Installation

To install ElasticSearch, you can take the latest stable version available in the official repository. First of all, install the repository key:

Command
cd /tmp
wget -O - http://packages.elasticsearch.org/GPG-KEY-elasticsearch | apt-key add -

Add the repository file:

Configuration File
deb http://packages.elasticsearch.org/elasticsearch/1.3/debian stable main

Now install elasticsearch with its dependencies:

Command aptitude
aptitude install elasticsearch openjdk-7-jre-headless openntpd

To finish, enable the init script so that the service starts at boot:

Command update-rc.d
update-rc.d elasticsearch defaults 95 10

4 Configuration

4.1 File descriptors

To avoid reaching the maximum number of open file descriptors, update the limits.conf file with these settings:

Configuration File /etc/security/limits.conf
    elasticsearch soft nofile 32000
    elasticsearch hard nofile 32000
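
To verify that the limits are really applied, a small Python check can be run as the elasticsearch user. This sketch uses the standard resource module (Unix only) and reports the limits of the process that runs it, so it only reflects the elasticsearch settings when started under that account. The helper names are made up for this example.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def meets_target(limits, target=32000):
    """True if both the soft and the hard limit reach the target."""
    def ok(value):
        # RLIM_INFINITY means "no limit", which always satisfies the target
        return value == resource.RLIM_INFINITY or value >= target
    soft, hard = limits
    return ok(soft) and ok(hard)


soft, hard = nofile_limits()
print("soft:", soft, "hard:", hard, "ok:", meets_target((soft, hard)))
```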

4.2 JVM

Regarding the JVM parameters, 1G of heap (Xmx) is recommended for small deployments. Check your logs for OutOfMemoryError exceptions. If they appear, increase the value of the 'ES_HEAP_SIZE' variable.

Notes

You should avoid allocating more than 50% of your total system memory to the JVM.
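
As a worked example of this rule, the sketch below clamps a requested heap size to half of the system memory and formats it as a value for ES_HEAP_SIZE. The function names are hypothetical and the memory figures are example values, not measurements.

```python
def heap_upper_bound_mb(total_memory_mb):
    """Upper bound for the JVM heap: half of the total system memory."""
    return total_memory_mb // 2


def es_heap_size(requested_mb, total_memory_mb):
    """Clamp the requested heap to the 50% bound, formatted for ES_HEAP_SIZE."""
    return f"{min(requested_mb, heap_upper_bound_mb(total_memory_mb))}m"


print(es_heap_size(1024, 4096))  # 1024m (1G fits on a 4G machine)
print(es_heap_size(8192, 8192))  # 4096m (capped at half of 8G)
```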


4.3 Cluster

Depending on the configuration you want to have (single or cluster), you have to edit 2 values in the default configuration file:

Configuration File /etc/elasticsearch/elasticsearch.yml
cluster.name: elasticsearch
node.name: "Node 1"

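  • cluster.name: nodes that share the same cluster name join the same cluster. Set it if you want your server to join a cluster.
  • node.name: set a name for the node (for example its hostname). If not set, ElasticSearch generates a random name at startup.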

4.4 Dynamic scripting

You may want to enable dynamic scripting[2] to run advanced queries from the command line. To enable it, add this to the configuration:

Configuration File /etc/elasticsearch/elasticsearch.yml
script.disable_dynamic: false

5 Administration

5.1 Check health

You can check your cluster health like this:

Command curl
> curl -XGET http://127.0.0.1:9200/_cluster/health?pretty
{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 5,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
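
If you want to check the health from a script instead of reading it by eye, the response can be parsed as JSON. The Python sketch below works on a copy of the output above and needs no running cluster. The function name summarize_health is made up, and the descriptions of the three status colors are a short summary of what they mean.

```python
import json

# Copy of the response shown above
HEALTH = """
{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 5,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
"""

MEANING = {
    "green": "all primary and replica shards are allocated",
    "yellow": "all primaries are allocated, some replicas are not",
    "red": "at least one primary shard is not allocated",
}


def summarize_health(raw):
    health = json.loads(raw)
    status = health["status"]
    return {
        "status": status,
        "meaning": MEANING.get(status, "unknown status"),
        # active_shards counts primaries and replicas together
        "replica_shards": health["active_shards"] - health["active_primary_shards"],
        "needs_attention": status != "green" or health["unassigned_shards"] > 0,
    }


print(summarize_health(HEALTH))
```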

5.2 Get node information

To get information about the nodes, you can use the 'cat' API:

Command
> curl -XGET "http://127.0.0.1:9200/_cat/nodes?v&h=name,id,ip,port,v,m"
name  id   ip            port v     m 
node1 YbCv 192.168.33.31 9300 1.2.2 m 
node2 kXy7 192.168.33.32 9300 1.2.2 m 
node3 VNK9 192.168.33.33 9300 1.2.2 *

The interesting part here is the master node, marked by '*' in the last column.
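
The column output is easy to parse in a script. Here is a minimal Python sketch that finds the master in a copy of the output above. It assumes that no column value contains spaces (a node name such as "Node 1" would break this simple split), and the helper names are hypothetical.

```python
# Copy of the _cat/nodes output shown above
CAT_NODES = """\
name  id   ip            port v     m
node1 YbCv 192.168.33.31 9300 1.2.2 m
node2 kXy7 192.168.33.32 9300 1.2.2 m
node3 VNK9 192.168.33.33 9300 1.2.2 *
"""


def parse_cat(text):
    """Turn the column output into a list of dicts keyed by the header names."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def find_master(rows):
    """Return the name of the node flagged with '*', or None."""
    return next((row["name"] for row in rows if row.get("m") == "*"), None)


print(find_master(parse_cat(CAT_NODES)))  # node3
```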

Or you can use this:

Command curl
> curl -XGET "http://127.0.0.1:9200/_nodes/process?pretty"
{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "c8AX1atwQ6C2hl13_S0r4g" : {
      "name" : "node3",
      "transport_address" : "inet[/192.168.33.33:9300]",
      "host" : "node3",
      "ip" : "192.168.33.33",
      "version" : "1.2.2",
      "build" : "9902f08",
      "http_address" : "inet[/192.168.33.33:9200]",
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 3457,
        "max_file_descriptors" : 65535,
        "mlockall" : false
      }
    },
    "pmsqiKGHRMGEo3iWaxv3Gw" : {
      "name" : "node1",
      "transport_address" : "inet[/192.168.33.31:9300]",
      "host" : "node1",
      "ip" : "192.168.33.31",
      "version" : "1.2.2",
      "build" : "9902f08",
      "http_address" : "inet[/192.168.33.31:9200]",
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 3480,
        "max_file_descriptors" : 65535,
        "mlockall" : false
      }
    },
    "xVXb40pgRNKdd9G6u8-7Uw" : {
      "name" : "node2",
      "transport_address" : "inet[/192.168.33.32:9300]",
      "host" : "node2",
      "ip" : "192.168.33.32",
      "version" : "1.2.2",
      "build" : "9902f08",
      "http_address" : "inet[/192.168.33.32:9200]",
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 3886,
        "max_file_descriptors" : 65535,
        "mlockall" : false
      }
    }
  }
}

To get more information and options, look at the official documentation[3].
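
The process section of this response is a convenient way to verify the file descriptor setting on every node at once. The sketch below is a Python illustration on a shortened copy of the response (one node only). The function name process_report and the 32000 threshold, taken from the limits.conf settings, are choices made for this example.

```python
import json

# Shortened copy of the response shown above (one node only)
NODES_PROCESS = """
{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "c8AX1atwQ6C2hl13_S0r4g" : {
      "name" : "node3",
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 3457,
        "max_file_descriptors" : 65535,
        "mlockall" : false
      }
    }
  }
}
"""


def process_report(raw, min_descriptors=32000):
    """Per node: is the descriptor limit high enough, and is memory locked?"""
    report = {}
    for node in json.loads(raw)["nodes"].values():
        process = node["process"]
        report[node["name"]] = {
            "descriptors_ok": process["max_file_descriptors"] >= min_descriptors,
            "mlockall": process["mlockall"],
        }
    return report


print(process_report(NODES_PROCESS))
# {'node3': {'descriptors_ok': True, 'mlockall': False}}
```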

5.3 Shut down a node

To shut down a specific node, use this curl command and replace <nodeid> with the ID of the node:

Command curl
> curl -XPOST "http://127.0.0.1:9200/_cluster/nodes/<nodeid>/_shutdown?pretty"
{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "zfnG3AKMShad0Ti9qgchFQ" : {
      "name" : "node2"
    }
  }
}

5.4 Shut down the cluster

If you want to shut down the whole cluster at once:

Command curl
> curl -XPOST http://127.0.0.1:9200/_cluster/nodes/_shutdown?pretty
{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "FKCjz60DRgWCat7WE9NkBQ" : {
      "name" : "node3"
    },
    "IfQBC4VrRICLyO5pNsohHA" : {
      "name" : "node1"
    },
    "kzlYH_8rRBmWXCdZIjYrlQ" : {
      "name" : "node2"
    }
  }
}

6 Usage

6.1 Create a new entry

To create a new entry (the index is created automatically if it doesn't exist yet), simply insert it like this:

Command curl
> curl -XPUT http://localhost:9200/vehicule/moto/1?pretty -d '{"vendor": "Kawazaki", "model": "Z1000", "tags": ["sports", "roadster"] }'
{
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "1",
  "_version" : 1,
  "created" : true
}

If everything went fine, the "created" value is true. Each time the document is updated, its version is incremented automatically. If you do not specify the id, one is generated automatically:

Command curl
> curl -XPOST http://localhost:9200/vehicule/moto/?pretty -d '{"vendor": "Kawazaki", "model": "Z1000", "tags": ["sports", "roadster"] }'
{
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "q1mSSqHbSqCuOHLdUGVLYQ",
  "_version" : 1,
  "created" : true
}
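
The difference between the two calls (PUT with an id, POST without one) can be captured in a small helper. This is a Python sketch using the standard urllib module. The helper name build_index_request is hypothetical, and the request is only built, not sent, so no running node is needed to try it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # same address as in the curl examples


def build_index_request(index, doc_type, document, doc_id=None):
    """Build (without sending) the request that stores a document."""
    body = json.dumps(document).encode("utf-8")
    if doc_id is None:
        # No id given: POST to the type and let ElasticSearch generate one
        url, method = f"{BASE_URL}/{index}/{doc_type}/", "POST"
    else:
        # Id given: PUT to the full document address
        url, method = f"{BASE_URL}/{index}/{doc_type}/{doc_id}", "PUT"
    return urllib.request.Request(url, data=body, method=method)


doc = {"vendor": "Kawazaki", "model": "Z1000", "tags": ["sports", "roadster"]}
with_id = build_index_request("vehicule", "moto", doc, doc_id=1)
auto_id = build_index_request("vehicule", "moto", doc)

print(with_id.get_method(), with_id.full_url)
print(auto_id.get_method(), auto_id.full_url)

# Sending a request requires a running node, for example:
# with urllib.request.urlopen(with_id) as response:
#     print(response.read().decode("utf-8"))
```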

6.2 Get a document

Getting a document (an entry) is simple:

Command curl
> curl -XGET http://localhost:9200/vehicule/moto/1?pretty
{
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source":{"vendor": "Kawazaki", "model": "Z1000", "tags": ["sports", "roadster"] }
}

You only have to know the id. If a document is not found:

Command curl
> curl -XGET http://localhost:9200/vehicule/moto/4?pretty
{
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "4",
  "found" : false
}

The "found" value is then set to false.
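
In a script, the "found" flag is what tells the two cases apart. A minimal Python sketch, working on copies of the two responses above (the function name source_or_none is made up for this example):

```python
import json

FOUND = """
{
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source":{"vendor": "Kawazaki", "model": "Z1000", "tags": ["sports", "roadster"] }
}
"""

NOT_FOUND = """
{
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "4",
  "found" : false
}
"""


def source_or_none(raw):
    """Return the stored document, or None when "found" is false."""
    reply = json.loads(raw)
    return reply["_source"] if reply.get("found") else None


print(source_or_none(FOUND)["model"])  # Z1000
print(source_or_none(NOT_FOUND))       # None
```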

6.3 Update a document

Lucene doesn't know how to update a document in place. So when you ask ElasticSearch to update a document, it in fact deletes the current one and creates a new one. To modify a document (here the model value), you can do it like this:

Command curl
> curl -XPOST http://localhost:9200/vehicule/moto/1/_update?pretty -d '{"script": "ctx._source.model = \"Z800\""}'
{
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "1",
  "_version" : 2
}

As you can see the version number has been incremented.
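
The "delete then create" behaviour and the version counter can be modelled in a few lines of Python. This is a toy in-memory model to illustrate the principle, not how ElasticSearch or Lucene are implemented. The class TinyIndex and its methods are hypothetical.

```python
class TinyIndex:
    """In-memory stand-in for an index: id -> (version, source)."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, source):
        """Store a document, replacing any previous copy and bumping the version."""
        previous = self._docs.get(doc_id)
        version = previous[0] + 1 if previous else 1
        self._docs[doc_id] = (version, dict(source))
        return version

    def update(self, doc_id, change):
        """Apply a change to a copy of the document, then index the copy."""
        _, source = self._docs[doc_id]
        new_source = dict(source)  # the stored copy itself is never edited
        change(new_source)
        return self.index(doc_id, new_source)

    def get(self, doc_id):
        return self._docs[doc_id]


idx = TinyIndex()
idx.index("1", {"vendor": "Kawazaki", "model": "Z1000"})
version = idx.update("1", lambda src: src.update(model="Z800"))
print(version, idx.get("1")[1]["model"])  # 2 Z800
```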

To add a new field to a current document:

Command curl
> curl -XPOST 'localhost:9200/vehicule/moto/1/_update?pretty' -d '{
    "script" : "ctx._source.power = \"139cv\""
}'
{
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "1",
  "_version" : 11
}

If you want to add a tag in the current tag list of a document:

Command curl
> curl -XPOST 'localhost:9200/vehicule/moto/1/_update?pretty' -d '{
    "script" : "ctx._source.tags += tag",
    "params" : { 
        "tag" : "white/orange"
    }   
}'
 
{
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "1",
  "_version" : 10
}

6.4 Remove a document or its content

To remove a complete document:

Command curl
> curl -XDELETE 'localhost:9200/vehicule/moto/4?pretty'
{
  "found" : true,
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "4",
  "_version" : 3
}

To remove a document field (here power):

Command curl
> curl -XPOST 'localhost:9200/vehicule/moto/1/_update?pretty' -d '{
    "script" : "ctx._source.remove(\"power\")"
}'
 
{
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "1",
  "_version" : 13
}

ElasticSearch knows how to deal with concurrency. However, if you want to be sure to delete a document only at a given version, you can specify that version. The request will fail if the document has changed in the meantime:

Command curl
> curl -XDELETE 'localhost:9200/vehicule/moto/4?version=15'

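The principle behind the version parameter (often called optimistic concurrency control) can be sketched in Python as follows. This is a toy model on a plain dictionary, not the ElasticSearch implementation. The names VersionConflict and delete_if_version, and the sample data, are hypothetical.

```python
class VersionConflict(Exception):
    """Raised when the stored version differs from the expected one."""


def delete_if_version(docs, doc_id, expected_version):
    """Delete a document only if it is still at the expected version.

    docs maps an id to a (version, source) tuple.
    """
    current_version, _ = docs[doc_id]
    if current_version != expected_version:
        raise VersionConflict(
            f"document {doc_id} is at version {current_version}, "
            f"not {expected_version}"
        )
    del docs[doc_id]


docs = {"4": (15, {"model": "Z1000"})}

try:
    delete_if_version(docs, "4", 14)  # stale version: refused
except VersionConflict as error:
    print("refused:", error)

delete_if_version(docs, "4", 15)      # current version: deleted
print("4" in docs)  # False
```
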
7 References

  1. http://www.elasticsearch.org/overview/
  2. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
  3. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cat-nodes.html