Today, we are happy to announce the release of Elasticsearch 1.2.0, based on Lucene 4.8.1, along with a bug fix release Elasticsearch 1.1.2.
You can download them and read the full change lists here:
Elasticsearch 1.2.0 is a bumper release, containing over 300 new features, enhancements, and bug fixes. You can see the full changes list in the Elasticsearch 1.2.0 release notes, but we will highlight some of the important ones below:
While there are a few more breaking changes than just those listed here, most of those probably won’t affect you. The following, however, are very important:
Elasticsearch now requires Java 7 and will no longer work with Java 6. We recommend using Oracle’s JDK 7u55 or JDK 7u25. Avoid any of the updates in-between as they contain a nasty bug which can cause index corruption.
Elasticsearch allows the use of scripts in several APIs: document updates, searches and aggregations. Scripts can be loaded from disk (static scripts) or specified directly within a request (dynamic scripts). Unfortunately MVEL, the current default scripting language, does not support sandboxing, meaning that a dynamic script can be used to do pretty much anything that the elasticsearch user can do.
While it has been possible to disable dynamic scripting for a long time, we’ve decided to change the default to disable dynamic scripting out of the box. See instructions for how to reenable dynamic scripting. Watch this space for a blog post giving more details about the future of scripting in Elasticsearch.
The JVM heap has to be shared by a number of competing resources such as field data, filter caching, index buffering, aggregations, etc. Field data in particular can be greedy and, in the past, has caused a number of users to experience OOM conditions. We added the fielddata circuit breaker to try to prevent these OOMs. Initially we set the default circuit breaker limit to 80% of the heap size, but that appears to have been too generous.
We have now changed the default circuit breaker limit to 60% of the JVM heap, and the filter cache to 10% of the heap.
Note: Some Logstash users and other users of time-based indices might find that queries that worked correctly the day before have now suddenly stopped working. The reason for this failure is that the field data cache is full of old data which is no longer being used, so the circuit breaker is refusing to load more field data. This can be worked around either by clearing the caches or by setting indices.fielddata.cache.size (which is unbounded by default) to a value like 50% of the heap. We hope to have a better answer for this in the next release.
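To make the arithmetic behind this note concrete, here is a minimal Python sketch. It is a simplified model, not Elasticsearch code: the function names and the 10 GiB heap are assumptions chosen for illustration, and the real circuit breaker works from internal estimates of field data size.

```python
def heap_budgets(heap_mib, breaker_pct=60, filter_cache_pct=10, fielddata_cache_pct=50):
    """Return memory budgets in MiB for a heap of the given size.

    Percentages mirror the defaults and the suggested cache size quoted in the text;
    the function itself is a hypothetical helper.
    """
    return {
        "fielddata_breaker_limit": heap_mib * breaker_pct // 100,
        "filter_cache": heap_mib * filter_cache_pct // 100,
        "fielddata_cache": heap_mib * fielddata_cache_pct // 100,
    }


def breaker_allows(cached_mib, estimated_new_mib, limit_mib):
    """Simplified breaker check: refuse a load that would push usage past the limit."""
    return cached_mib + estimated_new_mib <= limit_mib


budgets = heap_budgets(10240)  # assumed 10 GiB heap
limit = budgets["fielddata_breaker_limit"]

# Unbounded cache that has filled up with 6000 MiB of stale field data:
stale_cache_ok = breaker_allows(6000, 500, limit)

# Cache capped at 50% of the heap, so old entries are evicted before the limit is hit:
capped_cache_ok = breaker_allows(budgets["fielddata_cache"], 500, limit)
```

In this model the capped cache leaves headroom below the breaker limit, which is why bounding indices.fielddata.cache.size lets new field data load again.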
The shared filesystem, S3 and HDFS gateways have been deprecated for a long time, and they have finally been removed. The snapshot/restore functionality should be used instead.
The improvements in this release are heavily focussed on performance and resource usage, specifically during indexing and aggregating.
We tend to think of indexing and merging as separate functions, but really they are very closely related. The indexing process takes the docs in the indexing buffer and writes them to disk as a small segment. Having too many segments slows down indexing and searching, so the merge process merges smaller segments into bigger segments in the background. There is a balance between the size of new segments and the speed at which your changes become searchable.
Very large merges can swamp the I/O on a node, slowing down other functions like search. To control this we have merge throttling, which slows down the merge speed to 20 MB/s by default. However, it is quite possible that the indexing rate is so high that merges just can’t keep up, leading to an explosion of segments. This hurts indexing and searching.
To improve the interplay of all of these factors, we have:
- Switched back to the ConcurrentMergeScheduler (#5817) using Lucene’s default settings (#5882) and removed the SerialMergeScheduler (#6120). Merge settings can now be changed dynamically (#6098).
- Removed the flush threshold based on the number of operations in the transaction log (#5900) — now the transaction log is flushed based on size (200MB) or time (30 minutes).
- Fixed a problem in Lucene that was throttling merges much more than the configured amount (#6018).
- Allowed a backlog of merges to exert back pressure on indexing rates, so that the system is self-correcting (#6066).
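The size-or-time flush rule from the list above can be sketched in a few lines of Python. This is an illustration only: the constant and function names are hypothetical, and treating 200MB as 200 * 1024 * 1024 bytes is an assumption.

```python
FLUSH_SIZE_BYTES = 200 * 1024 * 1024   # size threshold (assumed binary megabytes)
FLUSH_INTERVAL_SECONDS = 30 * 60       # time threshold: 30 minutes


def should_flush(translog_bytes, seconds_since_last_flush):
    """Flush when either threshold is reached; the operation count no longer matters."""
    return (
        translog_bytes >= FLUSH_SIZE_BYTES
        or seconds_since_last_flush >= FLUSH_INTERVAL_SECONDS
    )


small_and_recent = should_flush(10 * 1024 * 1024, 60)      # neither threshold reached
large_translog = should_flush(250 * 1024 * 1024, 60)       # size threshold reached
idle_for_long = should_flush(10 * 1024 * 1024, 45 * 60)    # time threshold reached
```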
Of course, it is difficult to provide good defaults both for users with spinning disks and users with SSDs. If you have spinning disks, you may consider dropping index.merge.scheduler.max_thread_count from its default value to 1. If you are just indexing, without searching, you may want to disable merge throttling completely.
On top of these changes, we have improved indexing performance for the typical logging use case:
Aggregations are awesome. Now we’re making them awesomely fast, and less memory hungry while we’re about it:
- Global ordinals is a data structure on top of field data or doc values, which keeps track of all the unique terms used across all segments of a shard (#5672, #5895, #5854, #5873). This has resulted in a nice performance boost for the significant_terms aggregations (#5895, #5994), and also speeds up parent-child joins by up to 3 times (#5846).
- Global ordinals also transforms the speed of aggregations run on doc values, which are now a viable alternative for high cardinality string fields.
- Hierarchical aggregations can use a lot of memory, so we’re being smarter about how we assign buckets (#5994) and how we collect sub-aggregations (#5975). More improvements to hierarchical aggregations will follow in the next release.
- We’ve extended the circuit breaker functionality to stop enormous requests from consuming all available memory (#6050).
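The idea behind global ordinals can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the Lucene or Elasticsearch data structure: each segment is represented as a sorted list of its unique terms, and the function names are hypothetical.

```python
def build_global_ordinals(segment_terms):
    """Map each segment-local ordinal to a shard-wide (global) ordinal.

    segment_terms: one sorted list of unique terms per segment.
    Returns the sorted global term list and, per segment, a list where
    index = segment ordinal and value = global ordinal.
    """
    global_terms = sorted(set().union(*segment_terms))
    global_ord = {term: i for i, term in enumerate(global_terms)}
    seg_to_global = [[global_ord[t] for t in terms] for terms in segment_terms]
    return global_terms, seg_to_global


def count_terms(segment_docs, seg_to_global, n_global):
    """Count documents per term using integer ordinals only, no string comparisons."""
    counts = [0] * n_global
    for seg_idx, doc_ords in enumerate(segment_docs):
        for seg_ord in doc_ords:
            counts[seg_to_global[seg_idx][seg_ord]] += 1
    return counts


segments = [["apple", "cherry"], ["banana", "cherry"]]
global_terms, seg_to_global = build_global_ordinals(segments)

# Each document is stored as the segment-local ordinal of its term.
docs = [[0, 1, 1], [1, 0]]
counts = count_terms(docs, seg_to_global, len(global_terms))
```

Because buckets are keyed by a shard-wide integer, counts from different segments can be accumulated in a single array and terms only need to be resolved to strings once, at the end.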
We’ve also added some new functionality to aggregations:
- The new reverse_nested aggregation allows you to perform aggregations on the root doc based on the contents of its nested docs (#5485).
- The significant_terms aggregation can now use a filtered subset of the documents of an index to provide background frequencies (#5944).
- The date_histogram aggregation can now be given defined start and end points, even if there isn’t data to populate those buckets (#5444).
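As a sketch of the date_histogram change, the Python helper below builds a request body that asks for buckets across a fixed range. The helper name, the field name and the dates are assumptions for illustration; extended_bounds is combined with min_doc_count set to 0 so that empty buckets are returned.

```python
import json


def date_histogram_with_bounds(field, interval, start, end):
    """Build a date_histogram aggregation body covering start..end, including empty buckets."""
    return {
        "date_histogram": {
            "field": field,
            "interval": interval,
            "min_doc_count": 0,  # keep buckets that contain no documents
            "extended_bounds": {"min": start, "max": end},
        }
    }


# Hypothetical field name and date range.
agg = date_histogram_with_bounds("timestamp", "day", "2014-05-01", "2014-05-31")
request_body = json.dumps({"size": 0, "aggs": {"per_day": agg}}, indent=2)
```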
The completion suggester is very popular, but the number one feature request has been to allow it to perform filtering. Unfortunately, the reason it is so fast is that it does not rely on the usual search infrastructure at all, meaning that adding our normal filters would just slow it down.
The new context suggester builds on the completion suggester by allowing you to specify “contexts”, which can be the value(s) of a field or a geo-location. For instance, a music suggester could use contexts such as popular, and a single song may belong to more than one context. Watch this space for a blog post explaining the context suggester in detail.
Previously, it was only possible to retrieve large numbers of documents from a search request by using ?search_type=scan, which returns results unordered. The scroll API has been improved to keep track of the last document returned from every shard, meaning that deep scrolling of sorted docs is now almost as efficient as scan-scroll.
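Why remembering the last document per shard helps can be shown with a toy Python model. This is an illustrative sketch, not the actual implementation: shards are plain sorted lists, and the per-shard state is an integer cursor rather than a sort value.

```python
import heapq


def sorted_scroll(shards, page_size):
    """Yield pages of globally sorted values from per-shard sorted lists.

    Each shard remembers how far it has been consumed, so every page costs at most
    page_size candidates per shard, no matter how deep the scroll has gone.
    """
    positions = [0] * len(shards)
    while True:
        candidates = []
        for shard_idx, docs in enumerate(shards):
            start = positions[shard_idx]
            for value in docs[start:start + page_size]:
                candidates.append((value, shard_idx))
        if not candidates:
            return
        page = heapq.nsmallest(page_size, candidates)
        # Advance each shard past the documents that made it into this page.
        for _, shard_idx in page:
            positions[shard_idx] += 1
        yield [value for value, _ in page]


shards = [[1, 4, 7], [2, 3, 9], [5, 6, 8]]
pages = list(sorted_scroll(shards, page_size=4))
```

Without the per-shard position, fetching page N would require every shard to produce its top N * page_size documents again, which is what made deep sorted pagination expensive.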
Using a field like popularity to boost individual documents is so popular that we decided to add a bit of sugar to the function_score query. Instead of having to use a script like:
"script": "sqrt(factor * doc['popularity'].value)",
You can now do:
A handy side benefit of this change is that field_value_factor will still work when dynamic scripting is disabled.
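As a hedged sketch of the script-free form, the Python helper below builds a function_score body that uses field_value_factor, and a second function shows the arithmetic it is meant to replace. The helper names, the match_all query and the example values are assumptions for illustration.

```python
import math


def field_value_factor_query(field, factor, modifier):
    """Build a function_score query that scores by a numeric field without a script."""
    return {
        "function_score": {
            "query": {"match_all": {}},
            "field_value_factor": {
                "field": field,
                "factor": factor,
                "modifier": modifier,
            },
        }
    }


def sqrt_score(field_value, factor):
    """The calculation expressed by the script form: sqrt(factor * doc[field].value)."""
    return math.sqrt(factor * field_value)


query = field_value_factor_query("popularity", 2, "sqrt")
score = sqrt_score(50, 2)  # a document whose popularity is 50
```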