For low-level or performance-sensitive environments, elasticsearch-hadoop provides dedicated
InputFormat and OutputFormat implementations that can read and write data to Elasticsearch. The two I/O interfaces automatically convert JSON documents to
Writable objects and vice versa.
In order to use elasticsearch-hadoop, the jar needs to be available on the job classpath. At ~250kB and without any dependencies, the jar can either be bundled in the job archive, distributed manually or through CLI Generic Options (if your jar implements Tool), shipped through Hadoop's DistributedCache, or made available by provisioning the cluster manually.
All the options above affect only the code running on the distributed nodes. If the code that launches the Hadoop job itself refers to elasticsearch-hadoop, make sure to include the JAR in the HADOOP_CLASSPATH: HADOOP_CLASSPATH="<colon-separated-paths-to-your-jars-including-elasticsearch-hadoop>"
For example, when running a job through the CLI, the elasticsearch-hadoop jar can be shipped through the -libjars option:

```sh
$ bin/hadoop jar myJar.jar -libjars elasticsearch-hadoop.jar
```
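Generic Options such as -libjars are picked up only when the entry point runs through ToolRunner, which is why the jar needs to implement Tool for this approach; a minimal sketch of such an entry point (the MyEsJob class name is illustrative) follows:

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyEsJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects the parsed Generic Options
        JobConf conf = new JobConf(getConf(), MyEsJob.class);
        // ... job setup (es.resource, input/output formats, etc.) ...
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips -libjars and friends before calling run()
        System.exit(ToolRunner.run(new MyEsJob(), args));
    }
}
```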
If automatic index creation is used, please review this section for more information.
elasticsearch-hadoop automatically converts Hadoop built-in
Writable types to Elasticsearch types (and back) as shown in the table below:
Writable Conversion Table

| Writable | Elasticsearch type |
|---|---|
| null / NullWritable | null |
| BooleanWritable | boolean |
| Text | string |
| ByteWritable | byte |
| ShortWritable [1] | short |
| IntWritable / VIntWritable | int |
| LongWritable / VLongWritable | long |
| FloatWritable | float |
| DoubleWritable | double |
| BytesWritable | binary |
| ArrayWritable | array |
| MapWritable | map |

[1] Available only in Apache Hadoop 1.x
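To illustrate, a document assembled from such Writables (the field names and values below are made up) would be indexed as the JSON shown in the trailing comment:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

// illustrative document built from built-in Writable types
MapWritable doc = new MapWritable();
doc.put(new Text("name"), new Text("u2"));           // Text        -> string
doc.put(new Text("listeners"), new IntWritable(42)); // IntWritable -> int

// indexed in Elasticsearch as: { "name": "u2", "listeners": 42 }
```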
EsOutputFormat expects a
Map<Writable, Writable> value that it will convert into a JSON document; the key is ignored.
To write data to ES, use
org.elasticsearch.hadoop.mr.EsOutputFormat on your job along with the relevant configuration properties:
```java
JobConf conf = new JobConf();
conf.setSpeculativeExecution(false);        // disable speculative execution to avoid duplicate writes
conf.set("es.resource", "radio/artists");   // target index/type
conf.setOutputFormat(EsOutputFormat.class);
...
JobClient.runJob(conf);
```
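A mapper feeding such a job could look roughly like the sketch below; the ArtistMapper class name, the input types, and the name field are illustrative:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ArtistMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, MapWritable> {

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<NullWritable, MapWritable> output, Reporter reporter)
            throws IOException {
        // build the document as a MapWritable; the key is ignored by EsOutputFormat
        MapWritable doc = new MapWritable();
        doc.put(new Text("name"), value);
        output.collect(NullWritable.get(), doc);
    }
}
```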
For cases where the job output data is already in JSON, elasticsearch-hadoop allows direct indexing without applying any transformation; the data is taken as-is and sent directly to Elasticsearch. In such cases, one needs to indicate the JSON input by setting the
es.input.json parameter. elasticsearch-hadoop then expects either a Text or a
BytesWritable (preferred, as it requires no
String conversion) object as output; if these types are not used, the library will simply fall back to the
toString representation of the target object.
Writable to use for JSON representation

| Writable | Comment |
|---|---|
| BytesWritable | use this when the JSON data is represented as a byte[] or similar |
| Text | use this if the JSON data is represented as a String |
| anything else | make sure the toString() returns the desired JSON document |
Make sure the data is properly encoded, in
UTF-8. The job output is considered the final form of the document sent to Elasticsearch.
```java
JobConf conf = new JobConf();
conf.set("es.input.json", "yes");                 // mark the output as JSON
conf.setMapOutputValueClass(BytesWritable.class); // the JSON is passed around as raw bytes
...
JobClient.runJob(conf);
```
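A matching mapper that simply forwards already-serialized JSON could then be as small as the following sketch (the class name and input handling are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class JsonPassThroughMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, BytesWritable> {

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<NullWritable, BytesWritable> output, Reporter reporter)
            throws IOException {
        // the incoming line is assumed to already be a valid JSON document;
        // it is handed over as raw bytes, without any transformation
        BytesWritable json = new BytesWritable();
        json.set(value.getBytes(), 0, value.getLength());
        output.collect(NullWritable.get(), json);
    }
}
```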
Using the new Map/Reduce API (org.apache.hadoop.mapreduce) is strikingly similar; in fact, the exact same class (
org.elasticsearch.hadoop.mr.EsOutputFormat) is used:
```java
Configuration conf = new Configuration();
conf.setBoolean("mapred.map.tasks.speculative.execution", false);    // disable speculative execution
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
conf.set("es.resource", "radio/artists");
Job job = new Job(conf);
job.setOutputFormatClass(EsOutputFormat.class);
...
job.waitForCompletion(true);
```
As before, when dealing with JSON directly, under the new API the configuration looks as follows:
```java
Configuration conf = new Configuration();
conf.set("es.input.json", "yes");
Job job = new Job(conf);
job.setMapOutputValueClass(BytesWritable.class);
...
job.waitForCompletion(true);
```
In a similar fashion, to read data from Elasticsearch, one needs to use
org.elasticsearch.hadoop.mr.EsInputFormat. While it can read an entire index, it is much more convenient to execute a query and feed the results back to Hadoop.
EsInputFormat returns a
Map<Writable, Writable> converted from each JSON document returned by Elasticsearch, together with a null key that can be ignored.
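For instance, a hypothetical mapper counting artists by extracting the name field from each returned document could look like the sketch below; the class name and emitted types are illustrative:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ArtistNameMapper extends MapReduceBase
        implements Mapper<Writable, MapWritable, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(Writable key, MapWritable doc,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // each value is one JSON document converted into a MapWritable
        Writable name = doc.get(new Text("name"));
        if (name != null) {
            output.collect(new Text(name.toString()), ONE);
        }
    }
}
```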
Following our example above on radio artists, to get a hold of all the artists whose name starts with "me", one could use the following snippet:
```java
JobConf conf = new JobConf();
conf.set("es.resource", "radio/artists");
conf.set("es.query", "?q=me*");             // replace this with the relevant query
conf.setInputFormat(EsInputFormat.class);
...
JobClient.runJob(conf);
```
As expected, the
mapreduce API version is quite similar:
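```java
// the mapreduce API equivalent of the previous snippet
Configuration conf = new Configuration();
conf.set("es.resource", "radio/artists");
conf.set("es.query", "?q=me*");              // replace this with the relevant query
Job job = new Job(conf);
job.setInputFormatClass(EsInputFormat.class);
...
job.waitForCompletion(true);
```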