
Reindexing Data with Elasticsearch

SIDE NOTE: We run Elasticsearch and ELK trainings, which may be of interest to you and your teammates.

Sooner or later, you’ll face the problem of reindexing the data in your Elasticsearch instances. When we do Elasticsearch consulting for clients, we always look at whether they have some way to efficiently reindex previously indexed data. The reasons for reindexing vary: data type changes, analysis changes, or the introduction of new fields that need to be populated. No matter the case, you may either reindex from your source of truth or treat your Elasticsearch instance as such. Before Elasticsearch 2.3 we had to use external tools to help us with this operation, like Logstash or stream2es. We even wrote about how to approach reindexing of data with Logstash. However, today we would like to look at the new functionality that will be added in Elasticsearch 2.3: the re-index API.

The prerequisites are minimal: you only need Elasticsearch 2.3 (not yet officially released as of this writing) and the ability to send a request to it. That’s it, Elasticsearch will do the rest for us.

For the purpose of this post, we will use the data that we use during our Elasticsearch and Solr talks, which is available on our GitHub account: https://github.com/sematext/berlin-buzzwords-samples/tree/master/2014/sample-documents.

Initial data indexation

Let’s assume that we want to index the mentioned data quickly and we want to use the schema-less approach. We will just send data to an index called videosearch in a type vid by using the following command (I have the downloaded JSON files in a directory called data):

for file in data/*.json; do curl -XPOST "localhost:9200/videosearch/vid/" --data-binary @"$file"; echo; done

After indexing, we should have exactly 18 documents in the index.
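To double-check that number, something like the following sketch could be used. It relies on the standard _refresh and _count endpoints. The count_docs helper and the ES_URL variable are our own illustrative names, not part of Elasticsearch, and the address assumes a node listening on localhost:9200 as in the rest of this post.

```shell
# Address of the Elasticsearch node (assumption: same as in the post's examples).
ES_URL="${ES_URL:-localhost:9200}"

# Hypothetical helper: refresh the index given as $1 and print its document count.
count_docs() {
  # Refresh first so that recently indexed documents are visible to the count.
  curl -s -XPOST "$ES_URL/$1/_refresh" > /dev/null
  curl -s -XGET "$ES_URL/$1/_count?pretty"
}
```

Running count_docs videosearch should then print a small JSON response with a count field.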

The problem

Imagine that we would now like to change how the data was indexed. For example, the uploaded_by field in our data was set up to be an analyzed string field. This is not ideal for aggregations: doc values won’t be used and we will get our data sliced and diced by the analysis process. That is probably not what we want. What we need to do is change the mapping of the field and, of course, we can’t do that without reindexing the old data. Obviously, that is easy in our case, where we have our data outside Elasticsearch, but let’s assume that we don’t have it and that our data exists only in Elasticsearch.

For example, if we use Kibana 4 and run a terms aggregation on the analyzed field, we get a warning and the following results:

[Screenshot: Kibana 4 terms aggregation results on the analyzed field]

One thing to notice here is the uploader names. As you can see, they were split on whitespace characters. Not exactly what we are after.
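If you don’t have Kibana at hand, a roughly equivalent terms aggregation can be sent with curl. This is only a sketch: the uploaders aggregation name, the AGG_BODY and ES_URL variables, and the uploaders_agg helper are names we made up for illustration, and the exact request Kibana builds may differ.

```shell
# Address of the Elasticsearch node (assumption: same as in the post's examples).
ES_URL="${ES_URL:-localhost:9200}"

# Terms aggregation on uploaded_by; "uploaders" is an arbitrary aggregation name.
# "size" : 0 skips the search hits, so only the buckets are returned.
AGG_BODY='{
 "size" : 0,
 "aggs" : {
  "uploaders" : {
   "terms" : { "field" : "uploaded_by" }
  }
 }
}'

# Hypothetical helper: run the aggregation against the index given as $1.
uploaders_agg() {
  curl -s -XPOST "$ES_URL/$1/_search?pretty" -d "$AGG_BODY"
}
```

Calling uploaders_agg videosearch should return one bucket per analyzed token rather than one per uploader name. The same helper can be pointed at any other index to compare the results.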

Introducing re-index API

We start with creating an index called video_new with the following mappings:

curl -XPOST 'localhost:9200/video_new' -d '{
 "mappings" : {
  "vid" : {
   "properties" : {
    "title" : { "type" : "string" },
    "uploaded_by" : { "type" : "string", "index" : "not_analyzed" }
   }
  }
 }
}'

Assuming we have the mappings done and we’ve created an index called video_new, we can run the re-indexing command. Let’s also assume that we would like to preserve the versioning of the documents. So, to reindex our data we would use the following command:

curl -XPOST 'localhost:9200/_reindex' -d '{
 "source" : {
  "index" : "videosearch"
 },
 "dest" : {
  "index" : "video_new",
  "version_type": "external"
 }
}'

As you can see, we need to specify the source index (using the source section), the destination index (using the dest section) and send the command to the _reindex REST end-point. We’ve also specified the version_type and set it to external to preserve document versions. After running the command, Elasticsearch should respond with the following JSON:

{
 "took" : "202.7ms",
 "timed_out" : false,
 "total" : 18,
 "updated" : 0,
 "created" : 18,
 "batches" : 1,
 "version_conflicts" : 0,
 "noops" : 0,
 "retries" : 0,
 "failures" : [ ]
}

We can see a few useful statistics about the re-indexing process here:

  • took – the amount of time the re-indexing operation took,
  • updated – number of documents updated,
  • batches – number of batches used,
  • version_conflicts – how many documents were conflicting,
  • failures – information about documents that failed to be reindexed, none in our case,
  • created – number of created documents, which is 18 in our case.

As you can see, the operation succeeded.

Now that we’ve reindexed our data, we can run the same terms aggregation again, but on the video_new index. If we did that using Kibana 4, we would see the following results:

[Screenshot: Kibana 4 terms aggregation results on the not_analyzed field]

As you can see, now the results are different and match what we would expect.

Limiting source documents

A very nice reindex API feature is the ability to filter the source documents. For example, let’s assume that we would like to copy only a part of the source documents from one index to another. Let’s copy the three newest documents that have the elasticsearch term in the tags field. We can do that with the reindex API by using a query, sorting the results, and limiting their number, like this:

curl -XPOST 'localhost:9200/_reindex' -d '{
 "size" : 3,
 "source" : {
  "index" : "videosearch",
  "type" : "vid",
  "query" : {
   "term" : {
    "tags" : "elasticsearch"
   }
  },
  "sort" : {
   "upload_date" : "desc"
  }
 },
 "dest" : {
  "index" : "video_new_sample"
 }
}'

As you can see, we’ve added the type property, which limited the documents by document type. We’ve added the query and the sort section as well, just like during the standard search operation. And we’ve limited the results to only three documents returned by the query using the size parameter. Simple, ain’t it?

Waiting for completion and timing out

Of course, we can control how Elasticsearch behaves during the reindexing process. Reindexing can take a while if you have a lot of documents. Thus, one of the things we will probably want to control is whether the reindex request should block until the response is ready. To avoid blocking, we can set the wait_for_completion parameter to false. This will cause Elasticsearch to check the prerequisites for the reindexing operation and return the task information, which can be used to check the progress of reindexing.
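A minimal sketch of such a non-blocking request could look like this, with wait_for_completion passed as a URL parameter. The reindex_async and list_tasks helpers and the ES_URL variable are hypothetical names of ours. The tasks request is an assumption about the task management API, so check the documentation of your version for the exact parameters it accepts.

```shell
# Address of the Elasticsearch node (assumption: same as in the post's examples).
ES_URL="${ES_URL:-localhost:9200}"

# Hypothetical helper: start reindexing from index $1 to index $2 without blocking.
# The response carries task information instead of the final statistics.
reindex_async() {
  curl -s -XPOST "$ES_URL/_reindex?wait_for_completion=false&pretty" -d "{
 \"source\" : { \"index\" : \"$1\" },
 \"dest\" : { \"index\" : \"$2\" }
}"
}

# Hypothetical helper: list running tasks to inspect the reindexing progress.
# Assumption: the detailed flag is supported by the tasks API of your version.
list_tasks() {
  curl -s -XGET "$ES_URL/_tasks?detailed=true&pretty"
}
```

For example, reindex_async videosearch video_new followed by list_tasks a moment later.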

The next thing is the timeout. Using the timeout parameter, we can control how long Elasticsearch will wait for unavailable shards to become available for each batch of documents.
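As a rough sketch, the timeout can be passed as a URL parameter using the usual Elasticsearch time units. The reindex_with_timeout helper, the ES_URL variable and the 2m value below are only illustrative choices of ours.

```shell
# Address of the Elasticsearch node (assumption: same as in the post's examples).
ES_URL="${ES_URL:-localhost:9200}"

# Hypothetical helper: reindex from index $1 to index $2, waiting up to $3
# (for example 2m) for unavailable shards on each batch.
reindex_with_timeout() {
  curl -s -XPOST "$ES_URL/_reindex?timeout=$3&pretty" -d "{
 \"source\" : { \"index\" : \"$1\" },
 \"dest\" : { \"index\" : \"$2\" }
}"
}
```

For example: reindex_with_timeout videosearch video_new 2m.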

We also have the ability to control routing, write consistency, and refresh, but let’s leave those properties for the next time we talk about the reindex API.

Summary

As you can see, we now have a tool that lets us reindex data without relying on external tools. This is especially useful when we don’t have access to our original data, or when the data in Elasticsearch is modified by external processes that do not modify the original data (like comments on the documents). Just another step toward an easier life with the search engine. :)

If you need any help with Elasticsearch, check out our Elasticsearch Consulting, Elasticsearch Production Support, and Elasticsearch Training info.