Apache Hive integration

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.

 -- Hive website

Hive abstracts Hadoop by exposing it through a SQL-like language called HiveQL, so that users can apply data-defining and data-manipulating operations to it, just as with SQL. In Hive, data sets are defined through tables (that expose type information) into which data can be loaded, selected and transformed through built-in operators or custom/user-defined functions (UDFs).

Installation

Make the elasticsearch-hadoop jar available in the Hive classpath. Depending on your setup, there are various ways to achieve that. Use the ADD command to add files, jars (what we want here) or archives to the classpath:

ADD JAR /path/elasticsearch-hadoop.jar;
Note

the command expects a proper URI that can be found either on the local file-system or remotely. Typically it’s best to use a distributed file-system (like HDFS or Amazon S3), since the script might be executed on various machines.
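For example, assuming the jar has been uploaded to HDFS (the path below is hypothetical), the same ADD JAR command works with a remote URI:

```sql
-- hypothetical HDFS location; adjust the path to your cluster layout
ADD JAR hdfs:///libs/elasticsearch-hadoop.jar;
```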

As an alternative, one can use the command-line:

CLI configuration. 

$ bin/hive -hiveconf hive.aux.jars.path=/path/elasticsearch-hadoop.jar

or, if the hive-site.xml configuration can be modified, one can register additional jars through the hive.aux.jars.path option (which accepts a URI as well):

hive-site.xml configuration. 

<property>
  <name>hive.aux.jars.path</name>
  <value>/path/elasticsearch-hadoop.jar</value>
  <description>A comma separated list (with no spaces) of the jar files</description>
</property>

Configuration

When using Hive, one can use TBLPROPERTIES to specify the configuration properties (as an alternative to the Hadoop Configuration object) when declaring the external table backed by Elasticsearch:

CREATE EXTERNAL TABLE artists (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.index.auto.create' = 'false');

es.index.auto.create is an elasticsearch-hadoop setting, specified alongside the Hive-specific ones in TBLPROPERTIES

Mapping

By default, elasticsearch-hadoop uses the Hive table schema to map the data in Elasticsearch, using both the field names and types in the process. There are cases, however, when the names in Hive cannot be used with Elasticsearch (the field name can contain characters accepted by Elasticsearch but not by Hive). For such cases, one can use the es.mapping.names setting, which accepts a comma-separated list of name mappings in the following format: Hive field name:Elasticsearch field name

To wit:

CREATE EXTERNAL TABLE artists (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
            'es.mapping.names' = 'date:@timestamp, url:url_123');

name mapping for two fields

Hive column date mapped in Elasticsearch to @timestamp

Hive column url mapped in Elasticsearch to url_123

Tip

Elasticsearch accepts only lower-case field names and, as such, elasticsearch-hadoop will always convert Hive column names to lower-case. This poses no issue as Hive is case insensitive; however, it is recommended to use the default Hive style, use upper-case names only for Hive commands, and avoid mixed-case names.

Writing data to Elasticsearch

With elasticsearch-hadoop, Elasticsearch becomes just an external table into which data can be loaded and from which it can be read:

CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    name    STRING,
    links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists');

-- insert data to Elasticsearch from another table called 'source'
INSERT OVERWRITE TABLE artists
    SELECT NULL, s.name, named_struct('url', s.url, 'picture', s.picture)
                    FROM source s;

Elasticsearch Hive StorageHandler

Elasticsearch resource (index and type) associated with the given storage

Writing existing JSON to Elasticsearch

For cases where the job output data is already in JSON, elasticsearch-hadoop allows direct indexing without applying any transformation; the data is taken as is and sent directly to Elasticsearch. In such cases, one needs to indicate the JSON input by setting the es.input.json parameter. elasticsearch-hadoop then expects the output table to contain only one field, whose content is used as the JSON document. That is, the library will recognize specific textual types (such as string or binary) or simply call toString().

Table 3. Hive types to use for JSON representation

Hive type        Comment

binary           use this when the JSON data is represented as a byte[] or similar
string           use this if the JSON data is represented as a String
anything else    make sure the toString() returns the desired JSON document

Available in Hive 0.12 or higher

varchar          use this as an alternative to Hive string

Important

Make sure the data is properly encoded, in UTF-8. The field content is considered the final form of the document sent to Elasticsearch.

CREATE EXTERNAL TABLE json (data STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = '...',
              'es.input.json' = 'yes');
...

The table declaration contains only one field, of type STRING

Indicates to elasticsearch-hadoop that the table content is in JSON format
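To populate such a table, each inserted row simply carries a pre-rendered JSON string. A minimal sketch, assuming a hypothetical json_source table with a STRING column named payload:

```sql
-- each 'payload' value must be a complete, UTF-8 encoded JSON document;
-- it is sent to Elasticsearch as is, without any transformation
INSERT OVERWRITE TABLE json
    SELECT s.payload FROM json_source s;
```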

Writing to dynamic/multi-resources

One can index the data to a different resource, depending on the row being read, by using patterns. Coming back to the aforementioned media example, one could configure it as follows:

CREATE EXTERNAL TABLE media (
    name    STRING,
    type    STRING,
    year    STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'my-collection/{type}');

Table field used by the resource pattern. Any of the declared fields can be used.

Resource pattern using field type

For each row about to be written, elasticsearch-hadoop will extract the type field and use its value to determine the target resource.
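As a sketch, assuming a hypothetical media_source table with matching columns, rows whose type is music would be written to my-collection/music, rows whose type is book to my-collection/book, and so on:

```sql
-- the 'type' column of each row selects the target resource at write time
INSERT OVERWRITE TABLE media
    SELECT s.name, s.type, s.year FROM media_source s;
```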

The functionality is also available when dealing with raw JSON - in this case, the value will be extracted from the JSON document itself. Assuming the JSON source contains documents with the following structure:

{
    "media_type":"music",
    "title":"Surfing With The Alien",
    "year":"1987"
}

field within the JSON document that will be used by the pattern

the table declaration can be as follows:

CREATE EXTERNAL TABLE json (data STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'my-collection/{media_type}',
              'es.input.json' = 'yes');

Schema declaration for the table. Since JSON input is used, the schema is simply a holder for the raw data

Resource pattern relying on fields within the JSON document and not on the table schema

Reading data from Elasticsearch

Reading from Elasticsearch is strikingly similar:

CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    name    STRING,
    links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists', 'es.query' = '?q=me*');

-- stream data from Elasticsearch
SELECT * FROM artists;

same Elasticsearch Hive StorageHandler

Elasticsearch resource

Elasticsearch query
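Since the results map back to the declared schema, nested fields can be accessed with Hive's usual struct syntax; a small sketch against the table above:

```sql
-- project the nested 'url' field out of the 'links' struct
SELECT name, links.url FROM artists;
```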

Type conversion

Important

If automatic index creation is used, please review this section for more information.

Hive provides various types for defining data and internally uses different implementations depending on the target environment (from JDK native types to binary-optimized ones). Elasticsearch integrates with all of them, including Serde2 lazy and lazy binary:

Hive type    Elasticsearch type

void         null
boolean      boolean
tinyint      byte
smallint     short
int          int
bigint       long
double       double
float        float
string       string
binary       binary
timestamp    date
struct       map
map          map
array        array
union        not supported (yet)

Available in Hive 0.11 or higher

decimal      string

Available in Hive 0.12 or higher

date         date
varchar      string

Note

While Elasticsearch understands Hive types up to version 0.12, it is backwards compatible with Hive 0.9.
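As an illustration of the table above, a declaration mixing several of these types (the table and resource names are hypothetical):

```sql
CREATE EXTERNAL TABLE typed (
    id      BIGINT,      -- indexed as long
    price   DECIMAL,     -- indexed as string (Hive 0.11 or higher)
    added   TIMESTAMP)   -- indexed as date
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'store/items');
```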