Elasticsearch’s distributed nature allows it to store and search vast amounts of information in near real time.
Elasticsearch provides a rich query language that lets users ask better questions and get clearer answers, significantly faster.
Elasticsearch is natively integrated with Hadoop, so there is no gap for the user to bridge. We provide a dedicated InputFormat and OutputFormat for vanilla Map/Reduce, Taps for reading and writing data in Cascading, and Storages for Pig and Hive, so you can access Elasticsearch just as if the data were in HDFS.
The distributed nature of the Map/Reduce model fits well on top of Elasticsearch, because the number of Map/Reduce tasks can be correlated with the number of Elasticsearch shards.
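The correlation between tasks and shards can be pictured with a small model. This is a minimal sketch, not the connector's actual code: the `Shard` and `InputSplit` classes, the `splits_for` function, and the node names are all hypothetical, chosen only to show one unit of work being created per shard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    index: str      # hypothetical index name
    shard_id: int   # shard number within the index
    node: str       # hypothetical host holding this shard


@dataclass(frozen=True)
class InputSplit:
    shard: Shard
    preferred_host: str  # where a task reading this split would ideally run


def splits_for(shards):
    """Create one input split, and so one map task, per shard."""
    return [InputSplit(shard=s, preferred_host=s.node) for s in shards]


# Five shards spread over three illustrative nodes.
shards = [Shard("logs", i, f"es-node-{i % 3}") for i in range(5)]
splits = splits_for(shards)
print(len(splits), "splits for", len(shards), "shards")
```

Under this assumption, adding shards to an index raises the number of tasks that can read it in parallel.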
Elasticsearch enables Hadoop users (including Map/Reduce, Hive, Pig and Cascading) to enhance their workflow with a full-blown search engine.
The integration enables cluster co-location by exposing shard information to Hadoop. Job tasks can then run on the same machines as the Elasticsearch shards themselves, eliminating network traffic and improving performance through data locality.
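The effect of exposing shard locations can be illustrated with a toy scheduler. This is a sketch under stated assumptions, not how Hadoop or the connector schedules work: `assign_tasks`, the host names, and the round-robin fallback are all hypothetical. It places each task on the shard's own host when that host also runs Hadoop, and marks every other task as a remote read.

```python
def assign_tasks(shard_hosts, hadoop_hosts):
    """Map each shard id to (host, is_local), preferring the shard's own host."""
    fallback = sorted(hadoop_hosts)
    if not fallback:
        raise ValueError("at least one Hadoop host is required")
    assignment = {}
    for i, (shard_id, host) in enumerate(sorted(shard_hosts.items())):
        if host in hadoop_hosts:
            # Co-located: the task reads its shard without crossing the network.
            assignment[shard_id] = (host, True)
        else:
            # Not co-located: fall back to some Hadoop host and read remotely.
            assignment[shard_id] = (fallback[i % len(fallback)], False)
    return assignment


# Illustrative layout: three shards, two of them on machines that also run Hadoop.
shard_hosts = {0: "node-a", 1: "node-b", 2: "node-c"}
hadoop_hosts = {"node-a", "node-b"}

assignment = assign_tasks(shard_hosts, hadoop_hosts)
local_count = sum(1 for _, is_local in assignment.values() if is_local)
print(f"{local_count} of {len(assignment)} tasks read their shard locally")
```

In this model, the more the Hadoop and Elasticsearch nodes overlap, the fewer tasks have to fetch data from another machine.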
Elasticsearch provides near-real-time responses (think milliseconds) that significantly shorten a Hadoop job’s execution time and reduce the cost associated with it, especially when running on ‘rented resources’ such as Amazon EMR.