In the previous part of the Solr vs. ElasticSearch series we talked about the general architecture of these two great search engines based on Apache Lucene. Today, we will look at how they handle your data and perform indexing and language analysis.
- Solr vs. ElasticSearch: Part 1 - Overview
- Solr vs. ElasticSearch: Part 2 - Indexing and Language Handling
- Solr vs. ElasticSearch: Part 3 - Searching
- Solr vs. ElasticSearch: Part 4 - Faceting
- Solr vs. ElasticSearch: Part 5 - Management API Capabilities
- Solr vs. ElasticSearch: Part 6 – User & Dev Communities Compared
Data Indexing
Apart from using the Java APIs exposed by both ElasticSearch and Apache Solr, you can index data using an HTTP call. To index data in ElasticSearch you need to prepare your data in JSON format. Solr also accepts JSON, but in addition it lets you use other formats, such as the default XML or CSV. Note that indexing data in different formats has different performance characteristics and comes with different limitations. For example, indexing documents in CSV format is considered to be the fastest, but you can’t use field value boosting with that format. Of course, one will usually use some kind of library or the Java API to index data, as data is rarely stored in a form that can be sent straight into the search engine (at least in most cases that’s true).
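As a rough sketch of the difference, here is how the same document might be sent to each engine over HTTP. The index, type, core, and field names below are made up for illustration, and the `|| true` guards simply let the commands run even without a live cluster on localhost.

```shell
# Hypothetical document; the field names are illustrative.
ES_DOC='{"id":"1","title":"Solr vs. ElasticSearch","author":"sematext"}'

# ElasticSearch: JSON only, posted to an index/type endpoint.
curl -s -XPOST 'localhost:9200/blog/post/1' -d "$ES_DOC" || true

# Solr: the same document in the default XML format...
SOLR_XML='<add><doc><field name="id">1</field><field name="title">Solr vs. ElasticSearch</field></doc></add>'
curl -s 'localhost:8983/solr/update?commit=true' -H 'Content-type:text/xml' -d "$SOLR_XML" || true

# ...or as JSON, which Solr accepts as well.
SOLR_JSON='[{"id":"1","title":"Solr vs. ElasticSearch"}]'
curl -s 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d "$SOLR_JSON" || true
```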
More About ElasticSearch
It is worth noting that ElasticSearch supports two things that Solr does not – nested documents and multiple document types inside a single index.
The nested documents functionality lets you go beyond a flat document structure. For example, imagine you index documents that are bound to some group of users. In addition to document contents, you would like to store which users can access that document. And this is where we run into a little problem – this data changes over time. If you were to store document content and users inside a single index document, you would have to reindex the whole document every time the list of users who can access it changed in any way. Luckily, with ElasticSearch you don’t have to do that – you can use nested documents and then use the appropriate queries for matching. In this example, a nested document would hold the list of users with access rights to the document. Internally, nested documents are indexed as separate documents stored inside the same index. ElasticSearch ensures they are indexed in a way that allows it to use fast join operations to retrieve them. In addition, these documents are not returned by standard queries; you have to use a nested query to get them, which is a very handy feature.
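The access-rights example above can be sketched as follows. The index name, the `acl` nested field, and the user name are all hypothetical, and the `|| true` guards let the commands run without a live ElasticSearch node:

```shell
# Hypothetical mapping: an "acl" nested field holding users with access.
MAPPING='{
  "mappings" : {
    "doc" : {
      "properties" : {
        "title" : { "type" : "string" },
        "acl"   : {
          "type" : "nested",
          "properties" : { "user" : { "type" : "string" } }
        }
      }
    }
  }
}'
curl -s -XPUT 'localhost:9200/docs' -d "$MAPPING" || true

# Finding documents a given user may access requires a nested query:
NESTED_QUERY='{
  "query" : {
    "nested" : {
      "path"  : "acl",
      "query" : { "term" : { "acl.user" : "jsmith" } }
    }
  }
}'
curl -s -XGET 'localhost:9200/docs/doc/_search' -d "$NESTED_QUERY" || true
```

When the user list changes, only the nested `acl` documents need to be reindexed along with their parent, not every copy of the document content scattered across flat fields.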
Multiple document types per index allow just what the name says – you can index different types of documents inside the same index. This is not possible with Solr, as you can have only one schema per core. In ElasticSearch you can filter, query, or facet on document types. You can run queries against all document types or choose just a single document type (both with the Java API and REST).
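A minimal sketch of multiple types in one index, using a hypothetical "library" index with made-up documents (the `|| true` guards let it run without a cluster):

```shell
# Two different document types living in the same "library" index.
BOOK='{"title":"Lucene in Action","pages":475}'
MAG='{"title":"Search Monthly","issue":12}'
curl -s -XPOST 'localhost:9200/library/book/1' -d "$BOOK" || true
curl -s -XPOST 'localhost:9200/library/magazine/1' -d "$MAG" || true

# Search a single type...
curl -s -XGET 'localhost:9200/library/book/_search?q=title:action' || true
# ...or all types in the index at once.
curl -s -XGET 'localhost:9200/library/_search?q=title:action' || true
```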
Index Manipulation
Let’s look at the ability to manage your indices/collections using the HTTP API of both Apache Solr and ElasticSearch.
Solr
Solr lets you control all the cores that live inside your cluster with the CoreAdmin API – you can create cores, rename them, reload them, or even merge them into another core. In addition to the CoreAdmin API, Solr provides the Collections API to create, delete, or reload a collection. The Collections API uses the CoreAdmin API under the hood, but it’s a simpler way to control your collections. Remember that you need to have your configuration pushed to the ZooKeeper ensemble in order to create a collection with a new configuration.
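For illustration, here is roughly what those calls look like. The core, collection, and configuration names are hypothetical, and `myconf` is assumed to have been uploaded to ZooKeeper beforehand; the `|| true` guards let the commands run without a live Solr instance:

```shell
# CoreAdmin API: create and then rename a core.
CORES_CREATE='localhost:8983/solr/admin/cores?action=CREATE&name=core1&instanceDir=core1'
curl -s "$CORES_CREATE" || true
curl -s 'localhost:8983/solr/admin/cores?action=RENAME&core=core1&other=core2' || true

# Collections API: create a collection from a configuration
# previously pushed to ZooKeeper (here called "myconf").
COLL_CREATE='localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&collection.configName=myconf'
curl -s "$COLL_CREATE" || true
```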
When it comes to Solr, there is additional functionality that is in the early stages of development, although it’s already functional – the ability to split your shards. After applying the patch available in SOLR-3755 you can use the SPLIT action to split your index and write it to two separate cores. If you look at the mentioned JIRA issue, you’ll see that once this is committed, Solr will be able not only to create new replicas, but also to dynamically re-shard its indices. This is huge!
ElasticSearch
One of the great things about ElasticSearch is the ability to control your indices using the HTTP API. We will talk about it extensively in the last part of this series, but I have to mention it here, too. In ElasticSearch you can create and delete indices on a live cluster. During creation you can specify the number of shards an index should have, and you can increase or decrease the number of replicas with nothing more than a single API call. You cannot change the number of shards yet. Of course, you can also define mappings and analyzers during index creation, so you have all the control you need to index a new type of data into your cluster.
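A sketch of that lifecycle, with a hypothetical "posts" index (the `|| true` guards let the commands run without a cluster):

```shell
# Create an index with a fixed number of shards and one replica.
SETTINGS='{"settings":{"number_of_shards":5,"number_of_replicas":1}}'
curl -s -XPUT 'localhost:9200/posts' -d "$SETTINGS" || true

# Increase the replica count on the live index with a single call;
# the shard count, by contrast, cannot be changed after creation.
REPLICAS='{"index":{"number_of_replicas":2}}'
curl -s -XPUT 'localhost:9200/posts/_settings' -d "$REPLICAS" || true

# Delete the index when it is no longer needed.
curl -s -XDELETE 'localhost:9200/posts' || true
```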
Partial Document Updates
Both search engines support partial document updates. This is not the true partial document update that people have been after for years – it is really just normal document reindexing, but performed on the search engine side, so it feels like a real update.
Solr
Let’s start with the requirements – because this functionality reconstructs the document on the server side, you need to have your fields set as stored and you need the _version_ field in your index structure. Then you can update a document with a simple API call, for example:
curl 'localhost:8983/solr/update' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'
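Besides "set", Solr 4 atomic updates also accept "add" (append a value to a multi-valued field) and "inc" (increment a numeric field). The field names below are illustrative, and the `|| true` guard lets the command run without a live Solr instance:

```shell
# Hypothetical fields: bump a counter and append a tag in one update.
UPDATE='[{"id":"1","popularity":{"inc":1},"tags":{"add":"sale"}}]'
curl -s 'localhost:8983/solr/update' -H 'Content-type:application/json' -d "$UPDATE" || true
```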
ElasticSearch
In the case of ElasticSearch, you need to have the _source field enabled for the partial update functionality to work. _source is a special ElasticSearch field that stores the original JSON document. This functionality doesn’t have add/set/delete commands; instead, it lets you use a script to modify a document. For example, the following command updates the same document that we updated with the above Solr request:
curl -XPOST 'localhost:9200/sematext/doc/1/_update' -d '{ "script" : "ctx._source.price = price", "params" : { "price" : 100 } }'
Multilingual Data Handling
As we mentioned previously, and as you probably know, both ElasticSearch and Solr use Apache Lucene to index and search data. But, of course, each search engine has its own Java implementation that interacts with Lucene. This is also the case when it comes to language handling. Apache Solr 4.0 beta has an advantage over ElasticSearch here because it can handle more languages out of the box. For example, my native language, Polish, is supported by Solr out of the box (with two different filters for stemming), but not by ElasticSearch. On the other hand, there are many plugins that add support for languages ElasticSearch doesn’t cover by default, though still not as many as Solr supports out of the box. It’s also worth mentioning that there are commercial analyzers that plug into Solr (and Lucene), but none that we are aware of work with ElasticSearch… yet.
Supported Languages
For the full list of languages supported by these two search engines, please refer to the following pages:
- Apache Solr
- ElasticSearch
- Analyzers: http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer.html
- Stemming: http://www.elasticsearch.org/guide/reference/index-modules/analysis/stemmer-tokenfilter.html, http://www.elasticsearch.org/guide/reference/index-modules/analysis/snowball-tokenfilter.html and http://www.elasticsearch.org/guide/reference/index-modules/analysis/kstem-tokenfilter.html
Analysis Chain Definition
Of course, both Apache Solr and ElasticSearch allow you to define a custom analysis chain by specifying your own analyzer/tokenizer and list of filters that should be used to process your data. However, the difference between ElasticSearch and Solr is not only in the list of supported languages. ElasticSearch allows one to specify the analyzer per document and per query. So, if you need to use a different analyzer for each document in the index you can do that in ElasticSearch. The same applies to queries – each query can use a different analyzer.
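To sketch the per-query part: a custom analyzer can be defined in the index settings and then named explicitly at query time. The index and analyzer names are made up, and the `|| true` guards let the commands run without a cluster:

```shell
# Define a custom analysis chain at index creation time.
ANALYSIS='{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "my_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "lowercase", "stop" ]
        }
      }
    }
  }
}'
curl -s -XPUT 'localhost:9200/articles' -d "$ANALYSIS" || true

# Override the analyzer for a single query.
QUERY='{"query":{"query_string":{"query":"Solr vs ElasticSearch","analyzer":"my_analyzer"}}}'
curl -s -XGET 'localhost:9200/articles/_search' -d "$QUERY" || true
```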
Results Grouping
One of the most requested features for Apache Solr was results grouping, and it is still highly anticipated for ElasticSearch, which doesn’t have field grouping as of this writing. You can see the number of +1 votes in the following issue: https://github.com/elasticsearch/elasticsearch/issues/256. You can expect grouping to be supported in ElasticSearch after the changes introduced in 0.20. If you are not familiar with results grouping – it allows you to group results based on the value of a field, a query, or a function, and return matching documents as groups. For example, imagine grouping restaurant results by the value of the city field and returning only five restaurants for each city. A feature like this can be very handy. Currently, of the two search engines we are talking about, only Apache Solr supports results grouping out of the box.
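The restaurants-per-city example maps to a Solr request along these lines; the field name `city` is illustrative, and the `|| true` guard lets it run without a live Solr instance:

```shell
# Group results by the city field, at most five restaurants per city.
GROUP_URL='localhost:8983/solr/select?q=*:*&group=true&group.field=city&group.limit=5&wt=json'
curl -s "$GROUP_URL" || true
```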
Prospective Search
One thing Apache Solr completely lacks compared to ElasticSearch is the functionality ElasticSearch calls the Percolator. Imagine a search engine that, instead of storing documents in the index, stores queries and lets you check which stored/indexed queries match each new document being indexed. Sounds handy, right? For example, this is useful when people want to watch for any new documents (think Social Media, News, etc.) matching their topics of interest, as described through queries. This functionality is also called Prospective Search; some call it Pub-Sub or Stored Searches. At Sematext we’ve implemented this a few times for our clients using Solr, but ElasticSearch has this functionality built in. If you want to know more about the ElasticSearch Percolator, see http://www.elasticsearch.org/blog/2011/02/08/percolator.html.
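A sketch of the pre-1.0 Percolator API described in the linked post: a query is registered under the special _percolator index, and candidate documents are then checked against all registered queries. The index name "alerts", the query name, and the field names are hypothetical; the `|| true` guards let the commands run without a cluster:

```shell
# Register a query under _percolator; "alerts" is the index the
# query belongs to and "new-releases" is the name we give the query.
PERC_QUERY='{"query":{"term":{"body":"elasticsearch"}}}'
curl -s -XPUT 'localhost:9200/_percolator/alerts/new-releases' -d "$PERC_QUERY" || true

# Ask which registered queries match a candidate document.
PERC_DOC='{"doc":{"body":"a new elasticsearch release is out"}}'
curl -s -XGET 'localhost:9200/alerts/doc/_percolate' -d "$PERC_DOC" || true
```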
What’s Next?
In the next part of the series we will focus on comparing the ability to query your indices and leverage the full text search capabilities of Apache Solr and ElasticSearch. We will also look at ways to influence Lucene scoring during query time. Till next time!
