ElasticSearch: Powerful Search and Analytics Engine
Introduction
Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine. Architected from the ground up for use in distributed environments where reliability and scalability are must haves, Elasticsearch gives you the ability to move easily beyond simple full-text search. Through its robust set of APIs and query DSLs, plus clients for the most popular programming languages, Elasticsearch delivers on the near limitless promises of search technology.
Basics concepts
Here are a some Lucene information that you need to know:
- All the information of the structures are called inverted index.
- You can't modify, only delete then insert.
- Deletes (like on MariaDB XtraDB called "optimize") creates fragmentation. To merge data this process is called segment merge.
Input data
Data analysis is made by the analyser which is built of a tokenizer and zero or more token filters, and it can also have zero or more character mappers. A tokenizer in Lucene is used to split the text into tokens and is built of zero or more token filters.
Filters are processed sequentially. The character mappers are used before the tokenizer. For example you can remove HTML tags with it.
Info
Remove all unnecessary fields like html tags to avoid mistaken scoring
Index
A query may be not analyzed (you can decide). For example, the prefix and the term queries are not analyzed while the match query is! In ElasticSearch, an index is like a table in MariaDB. Data is stored in JSON format called a "document".
Architecture
ElasticSearch knows how to work in standalone mode or is able to work in cluster. Cluster implies Sharding + Replication:
When you send a new document to the cluster, you specify a target index and send it to one node (any of available nodes). In cluster mode, ElasticSearch gateways forwards their data to the primary node. In a cluster, there is only one writing node that can switch to another node if this one falls down.
Installation
To install ElasticSearch, you can take the last stable version available on the official repository. First of all install the repository key:
Add the repository file:
Now install elasticsearch with the dependencies:
To finish configure the init file:
Configuration
File descriptors
To avoid reaching maximum file descriptor, you have to update the limits.conf file with those settings:
JVM
Regarding the JVM parameters, it's recommended to use 1G (XMX) for small deployments. Check out your logs to see indications about OutOfMemoryError exceptions 'ES_HEAP_SIZE' variable size.
Info
You should avoid to allocate 50% of your total system memory to the JVM.
Cluster
Depending on the configuration you want to have (single or cluster), you have to edit 2 values in the default configuration file:
- cluster.name: set it if you want your server to join a cluster.
- node.name: set a hostname. If not set, it will take the server hostname.
Dynamic scripting
You may want to enable dynamic scripting to do advanced query in cli. To enable it, add it in the configuration:
Administration
Check health
You can check your cluster health like this:
Get nodes information
To get information regarding nodes, you can use 'cat':
The interesting things here are the master node (last column defined by '*').
Or you can use this:
To get more information and options, look at the official documentation.
Shutdown a node
To shutdown a specific node, use that curl command and replace the nodeid with the desired id number:
Shutdown the cluster
If you want to shutdown the whole cluster at once:
Usage
Create a new entry
To create a new entry with it's automated index, you simply needs to insert like this:
If everything was fine, you should have "created" value to true. Each time there will be an update on the document, the version will automatically increase. If you do not specify the id, it will automatically be generated:
Get a document
To get a document (an entry), this is simple:
You only have to know the id. If a document is not found:
You'll get found value set to false
Update a document
Lucene doesn't know how to update a document. So when you'll ask to ElasticSearch to update a document, you will in fact delete the current and create a new one. To modify a document (here the model value), you can do it like that:
As you can see the version number has been incremented.
To add a new field to a current document:
If you want to add a tag in the current tag list of a document:
Remove a document or it's content
To remove a complete document:
To remove a document field (here power):
ElasticSearch knows how to deal with concurrency, however if you really want to be sure to safely delete a document at a certain version, you can force it. It will fail if the document has changed in the meantime: