ElasticSearch: powerful search and analytics engine
| Software version | 1.3 |
|---|---|
| Operating System | Debian 7 |
| Website | ElasticSearch Website |
| Last Update | 12/08/2014 |
| Others | |
1 Introduction
Elasticsearch[1] is a flexible and powerful open source, distributed, real-time search and analytics engine. Architected from the ground up for use in distributed environments where reliability and scalability are must haves, Elasticsearch gives you the ability to move easily beyond simple full-text search. Through its robust set of APIs and query DSLs, plus clients for the most popular programming languages, Elasticsearch delivers on the near limitless promises of search technology.
2 Basic concepts
Here are a few things about Lucene that you need to know:
- Lucene stores its data in a structure called an inverted index, which maps each term to the documents that contain it.
- A document cannot be modified in place: it can only be deleted and then inserted again.
- Deletes create fragmentation. The process that merges the data back together is called a segment merge (comparable to "optimize" on MariaDB XtraDB).
2.1 Input data
Data analysis is performed by the analyzer, which is built of a tokenizer and zero or more token filters, and can also have zero or more character mappers (character filters). In Lucene, the tokenizer splits the text into tokens, which are then passed through the token filters.
Filters are processed sequentially. The character mappers are applied before the tokenizer; for example, you can use one to remove HTML tags.
Note: remove all unnecessary content, such as HTML tags, before indexing to avoid skewed scoring.
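The analysis chain can be observed with the `_analyze` API. The following is a minimal sketch, assuming a node listening on 127.0.0.1:9200 (the default HTTP port); the sample text is made up for illustration. It chains the `html_strip` character filter, the `standard` tokenizer and the `lowercase` token filter, so the returned tokens should no longer contain the markup:

```shell
# Assumption: a node answers on the default HTTP port.
ES_URL="http://127.0.0.1:9200"
# Hypothetical sample text containing HTML markup.
TEXT='<b>Kawasaki</b> Z800'

# Order of processing: character filter (html_strip), then tokenizer
# (standard), then token filter (lowercase).
curl -s -XGET "$ES_URL/_analyze?char_filters=html_strip&tokenizer=standard&token_filters=lowercase&pretty" -d "$TEXT" \
  || echo "Elasticsearch is not reachable at $ES_URL" >&2
```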
2.2 Index
In ElasticSearch, an index is comparable to a database in MariaDB, and a type inside that index to a table. Each piece of data is stored as a JSON object called a "document". A query may or may not be analyzed, depending on the query type you choose: for example, the prefix and term queries are not analyzed, while the match query is.
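The difference between analyzed and non-analyzed queries can be sketched as follows. This assumes a node on 127.0.0.1:9200, an index `vehicule` with a type `moto`, and a document whose `model` field holds "Z800" and was indexed with the default analyzer (which lowercases tokens); all of these are assumptions for illustration:

```shell
# Assumption: a node answers on the default HTTP port.
ES_URL="http://127.0.0.1:9200"

# term query: the value is NOT analyzed, so "Z800" is compared as-is with the
# lowercased indexed token and is not expected to match.
TERM_QUERY='{"query": {"term": {"model": "Z800"}}}'
# match query: the value IS analyzed (lowercased) before the lookup.
MATCH_QUERY='{"query": {"match": {"model": "Z800"}}}'

for QUERY in "$TERM_QUERY" "$MATCH_QUERY"; do
  curl -s -XGET "$ES_URL/vehicule/moto/_search?pretty" -d "$QUERY" \
    || echo "Elasticsearch is not reachable at $ES_URL" >&2
done
```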
2.3 Architecture
ElasticSearch can run standalone or as a cluster. A cluster implies sharding and replication.
When you send a new document to the cluster, you specify a target index and send it to any available node. That node forwards the document to the node holding the relevant primary shard. Each shard has a single primary copy that accepts writes; if the node holding it goes down, a replica on another node takes over as primary.
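Sharding and replication are set per index. As a sketch, the index could be created with explicit settings; the index name `vehicule` and the values (3 primary shards, 1 replica each) are only examples, and a node is assumed on 127.0.0.1:9200:

```shell
# Assumption: a node answers on the default HTTP port.
ES_URL="http://127.0.0.1:9200"
# Hypothetical values: 3 primary shards, each with 1 replica copy.
SETTINGS='{"settings": {"number_of_shards": 3, "number_of_replicas": 1}}'

# Create the "vehicule" index with explicit sharding and replication settings.
curl -s -XPUT "$ES_URL/vehicule?pretty" -d "$SETTINGS" \
  || echo "Elasticsearch is not reachable at $ES_URL" >&2
```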
3 Installation
To install ElasticSearch, you can take the latest stable version available from the official repository. First of all, install the repository key:
```shell
cd /tmp
wget -O - http://packages.elasticsearch.org/GPG-KEY-elasticsearch | apt-key add -
```
Add the repository file:
```
deb http://packages.elasticsearch.org/elasticsearch/1.3/debian stable main
```
Now refresh the package list and install elasticsearch with its dependencies:
```shell
aptitude update
aptitude install elasticsearch openjdk-7-jre-headless openntpd
```
To finish, register the init script so that the service starts at boot:
```shell
update-rc.d elasticsearch defaults 95 10
```
4 Configuration
4.1 File descriptors
To avoid hitting the maximum number of open file descriptors, update the limits.conf file with these settings:
/etc/security/limits.conf
```
elasticsearch soft nofile 32000
elasticsearch hard nofile 32000
```
4.2 JVM
For the JVM, a 1G heap (Xmx) is recommended for small deployments. Check your logs for OutOfMemoryError exceptions; if they appear, increase the heap through the 'ES_HEAP_SIZE' variable.
Note: avoid allocating more than 50% of your total system memory to the JVM.
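With the Debian package, the variable is usually set in the defaults file read by the init script. A minimal sketch, assuming that file is `/etc/default/elasticsearch` on your installation and that 1g suits your deployment:

```shell
# /etc/default/elasticsearch (assumed location of the init script defaults)
# Heap size given to the JVM; 1g follows the small-deployment recommendation.
ES_HEAP_SIZE=1g
```

Restart the service afterwards so that the new heap size is taken into account.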
4.3 Cluster
Depending on the configuration you want (standalone or cluster), you have to edit two values in the default configuration file:
/etc/elasticsearch/elasticsearch.yml
```
cluster.name: elasticsearch
node.name: "Node 1"
```
- cluster.name: the name of the cluster. Nodes sharing the same cluster name join the same cluster.
- node.name: the name of this node, for example its hostname. If not set, ElasticSearch 1.x picks a random name at startup.
4.4 Dynamic scripting
You may want to enable dynamic scripting[2] to run advanced queries from the command line. To enable it, add this setting to the configuration:
/etc/elasticsearch/elasticsearch.yml
```
script.disable_dynamic: false
```
5 Administration
5.1 Check health
You can check your cluster health like this:
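A minimal sketch using the cluster health API, assuming a node listening on 127.0.0.1:9200 (the default HTTP port):

```shell
# Assumption: a node answers on the default HTTP port.
ES_URL="http://127.0.0.1:9200"

# Ask the cluster for its overall health (status, node and shard counts).
curl -s -XGET "$ES_URL/_cluster/health?pretty" \
  || echo "Elasticsearch is not reachable at $ES_URL" >&2
```

The "status" field of the response is green, yellow or red.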
5.2 Get node information
To get information about the nodes, you can use the 'cat' API:
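For instance, assuming a node listening on 127.0.0.1:9200, the nodes can be listed like this (the `v` parameter adds a header line to the output):

```shell
# Assumption: a node answers on the default HTTP port.
ES_URL="http://127.0.0.1:9200"

# List the nodes of the cluster, one per line, with column headers.
curl -s -XGET "$ES_URL/_cat/nodes?v" \
  || echo "Elasticsearch is not reachable at $ES_URL" >&2
```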
The interesting part here is the master node, which is flagged with '*' in the master column.
Or you can use this:
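One possibility, given here as a sketch rather than as the only option, is the nodes info API, which returns a detailed JSON description of every node. A node is assumed on 127.0.0.1:9200:

```shell
# Assumption: a node answers on the default HTTP port.
ES_URL="http://127.0.0.1:9200"

# Detailed information (settings, JVM, network, ...) about every node.
curl -s -XGET "$ES_URL/_nodes?pretty" \
  || echo "Elasticsearch is not reachable at $ES_URL" >&2
```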
To get more information and options, look at the official documentation[3].
5.3 Shut down a node
To shut down a specific node, use this curl command and replace <nodeid> with the id of the desired node:
```
> curl -XPOST 'http://127.0.0.1:9200/_cluster/nodes/<nodeid>/_shutdown?pretty'
{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "zfnG3AKMShad0Ti9qgchFQ" : {
      "name" : "node2"
    }
  }
}
```
5.4 Shut down the cluster
If you want to shut down the whole cluster at once:
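In ElasticSearch 1.x this can be done with the `_shutdown` endpoint called without a node id. A minimal sketch, assuming a node listening on 127.0.0.1:9200:

```shell
# Assumption: a node answers on the default HTTP port.
ES_URL="http://127.0.0.1:9200"

# Without a node id, the shutdown request applies to every node of the cluster.
curl -s -XPOST "$ES_URL/_shutdown?pretty" \
  || echo "Elasticsearch is not reachable at $ES_URL" >&2
```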
6 Usage
6.1 Create a new entry
To create a new entry, with its index created automatically if it does not exist yet, you simply need to insert it like this:
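A minimal sketch, assuming a node on 127.0.0.1:9200. The index `vehicule`, the type `moto` and the id `1` match the other examples of this page; the document fields are made up for illustration:

```shell
# Assumption: a node answers on the default HTTP port.
ES_URL="http://127.0.0.1:9200"
# Hypothetical document body.
DOC='{"brand": "Kawasaki", "model": "Z1000", "tags": ["roadster"]}'

# PUT /<index>/<type>/<id> stores the document under the given id.
curl -s -XPUT "$ES_URL/vehicule/moto/1?pretty" -d "$DOC" \
  || echo "Elasticsearch is not reachable at $ES_URL" >&2
```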
If everything went fine, the "created" value is set to true. Each time the document is updated, its version is automatically incremented. If you do not specify the id, it is generated automatically:
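As a sketch of the automatic id generation, the document is sent with POST and no id in the URL. A node is assumed on 127.0.0.1:9200 and the document fields are hypothetical:

```shell
# Assumption: a node answers on the default HTTP port.
ES_URL="http://127.0.0.1:9200"
# Hypothetical document body.
DOC='{"brand": "Yamaha", "model": "MT-09"}'

# POST /<index>/<type>/ lets ElasticSearch generate the id, which is
# returned in the "_id" field of the response.
curl -s -XPOST "$ES_URL/vehicule/moto/?pretty" -d "$DOC" \
  || echo "Elasticsearch is not reachable at $ES_URL" >&2
```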
6.2 Get a document
Getting a document (an entry) is simple:
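For example, assuming a node on 127.0.0.1:9200 and a document stored with id 1 in the `vehicule` index under the `moto` type:

```shell
# Assumption: a node answers on the default HTTP port.
ES_URL="http://127.0.0.1:9200"

# GET /<index>/<type>/<id> returns the document in the "_source" field.
curl -s -XGET "$ES_URL/vehicule/moto/1?pretty" \
  || echo "Elasticsearch is not reachable at $ES_URL" >&2
```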
You only have to know the id. If a document is not found:
```
> curl -XGET 'http://localhost:9200/vehicule/moto/4?pretty'
{
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "4",
  "found" : false
}
```
The "found" value is set to false.
6.3 Update a document
Lucene doesn't know how to update a document. When you ask ElasticSearch to update a document, it in fact deletes the current one and creates a new one. To modify a document (here the model value), you can do it like this:
```
> curl -XPOST 'http://localhost:9200/vehicule/moto/1/_update?pretty' -d '{"script": "ctx._source.model = \"Z800\""}'
{
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "1",
  "_version" : 2
}
```
As you can see, the version number has been incremented.
To add a new field to a current document:
```
> curl -XPOST 'localhost:9200/vehicule/moto/1/_update?pretty' -d '{
  "script" : "ctx._source.power = \"139cv\""
}'
{
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "1",
  "_version" : 11
}
```
If you want to add a tag in the current tag list of a document:
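A sketch based on a scripted update with a parameter. It assumes a node on 127.0.0.1:9200, dynamic scripting enabled, and a document with id 1 that already has a `tags` array field; the field name and the tag value are hypothetical:

```shell
# Assumption: a node answers on the default HTTP port.
ES_URL="http://127.0.0.1:9200"
# Append the value of the "tag" parameter to the existing "tags" list.
UPDATE='{"script": "ctx._source.tags += tag", "params": {"tag": "sport"}}'

curl -s -XPOST "$ES_URL/vehicule/moto/1/_update?pretty" -d "$UPDATE" \
  || echo "Elasticsearch is not reachable at $ES_URL" >&2
```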
6.4 Remove a document or its content
To remove a complete document:
```
> curl -XDELETE 'localhost:9200/vehicule/moto/4?pretty'
{
  "found" : true,
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "4",
  "_version" : 3
}
```
To remove a document field (here power):
```
> curl -XPOST 'localhost:9200/vehicule/moto/1/_update?pretty' -d '{
  "script" : "ctx._source.remove(\"power\")"
}'
{
  "_index" : "vehicule",
  "_type" : "moto",
  "_id" : "1",
  "_version" : 13
}
```
ElasticSearch knows how to deal with concurrency. However, if you want to be sure to delete a document only at a given version, you can specify that version; the request fails if the document has changed in the meantime:
```
> curl -XDELETE 'localhost:9200/vehicule/moto/4?version=15'
```