
Attachment Type in Action

By Lukas Vlcek | 18 Jul 2011

This tutorial walks you through basic setup of the attachment type and its use in search, including highlighting.

Installation

First we need to install the attachments plugin. Follow the instructions listed here.

Make sure you restart ElasticSearch, so the plugins are picked up.

Download some data

Let’s download a PDF document:

curl -C - -O http://www.intersil.com/data/fn/fn6742.pdf

The attachments plugin can parse and index documents in many formats. Check the Tika web page for the list of supported formats.
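Before indexing, it can be worth checking that the download really is a PDF and not, say, an HTML error page saved under the same name. The following is a minimal sketch, not part of the original script: `is_pdf` is a hypothetical helper that only compares the first four bytes of the file with the `%PDF` signature, and it assumes a `head` that supports `-c`.

```shell
#!/bin/sh

# Hypothetical helper: succeed if the file starts with the "%PDF" signature.
# Assumes `head -c` is available (true for GNU and BSD head).
is_pdf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "%PDF" ]
}

# Usage:
#   is_pdf fn6742.pdf && echo "looks like a PDF" || echo "not a PDF"
```

This does not validate the document. It only catches the most common failure, a download that returned something other than the file.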

Setup mapping

Prepare a new index for the data.

curl -X DELETE "localhost:9200/test"

curl -X PUT "localhost:9200/test" -d '{
  "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}
}'

Before we can use the attachments plugin, we need to create the correct mapping for the attachment type.

curl -X PUT "localhost:9200/test/attachment/_mapping" -d '{
  "attachment" : {
    "properties" : {
      "file" : {
        "type" : "attachment",
        "fields" : {
          "title" : { "store" : "yes" },
          "file" : { "term_vector":"with_positions_offsets", "store":"yes" }
        }
      }
    }
  }
}'
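To confirm that the mapping was accepted, it can be read back with the get mapping API. The sketch below assumes the index and type names used in this tutorial. `ES_HOST`, `mapping_url` and `show_mapping` are hypothetical names introduced here for illustration, and the request itself is wrapped in a function because it needs a running node.

```shell
#!/bin/sh

# Build the mapping URL for an index and a type.
# ES_HOST is a hypothetical override; it defaults to the tutorial's host.
mapping_url() {
  printf '%s/%s/%s/_mapping?pretty=true' "${ES_HOST:-localhost:9200}" "$1" "$2"
}

# Fetch the mapping from the node.
show_mapping() {
  curl -s -X GET "$(mapping_url "$1" "$2")"
}

# Usage (needs a running node):
#   show_mapping test attachment
```

The response should list the "file" property with type "attachment". If it does not, the plugin was probably not picked up at startup.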

Indexing the Data

We are ready to index the data. We just need to encode the content of the file with Base64, for which we will use a simple Perl one-liner inside a shell script.

#!/bin/sh

# Slurp the whole file (-0777) and emit Base64 as a single line (empty line ending)
coded=`perl -MMIME::Base64 -0777 -ne 'print encode_base64($_, "")' fn6742.pdf`
json="{\"file\":\"${coded}\"}"
echo "$json" > json.file
curl -X POST "localhost:9200/test/attachment/" -d @json.file
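If Perl is not at hand, the same payload can be built with the `base64` command-line utility. This is a sketch under that assumption. `encode_file` and `build_payload` are hypothetical helper names, and the line breaks that `base64` may insert are stripped so that the JSON string stays on one line.

```shell
#!/bin/sh

# Encode a file as Base64 and remove any line breaks from the output.
encode_file() {
  base64 < "$1" | tr -d '\n'
}

# Wrap the encoded content in the {"file": "..."} document used above.
build_payload() {
  printf '{"file":"%s"}' "$(encode_file "$1")"
}

# Usage (needs a running node):
#   build_payload fn6742.pdf > json.file
#   curl -X POST "localhost:9200/test/attachment/" -d @json.file
```

Reading the file through standard input keeps the call the same on GNU and BSD versions of `base64`, whose file-argument options differ.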

Search for Highlighted results

We are done indexing the data and are now free to search it. To make it more interesting, we want the document title and some highlighted results back (note how we set up the mapping: both the title and the file content are stored, and the file content keeps term vectors with positions and offsets).

curl "localhost:9200/_search?pretty=true" -d '{
  "fields" : ["title"],
  "query" : {
    "query_string" : {
      "query" : "amplifier"
    }
  },
  "highlight" : {
    "fields" : {
      "file" : {}
    }
  }
}'

This will produce the following result:

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.005872132,
    "hits" : [ {
      "_index" : "test",
      "_type" : "attachment",
      "_id" : "UUaHJ6CfTOC3T2I4Kj_pXg",
      "_score" : 0.005872132,
      "fields" : {
        "file.title" : "ISL99201"
      },
      "highlight" : {
        "file" : [ "\nMono <em>Amplifier</em> • Filterless Class D with Efficiency > 86% at 400mW\nThe ISL99201 is a fully integrat", "\nmono <em>amplifier</em>. It is designed to maximize performance for \nmobile phone applications. The applicat" ]
      }
    } ]
  }
}
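The query above goes to `/_search`, so it runs against every index on the node. To repeat it with other terms and limit it to the test index, something along these lines might help. `search_body` and `search_attachments` are hypothetical helpers, and the term is assumed to contain no characters that need JSON escaping. The explicit refresh makes a freshly indexed document visible without waiting for the periodic refresh.

```shell
#!/bin/sh

# Print the search request body for a single term.
# Assumption: the term needs no JSON escaping (no quotes or backslashes).
search_body() {
  printf '{"fields":["title"],"query":{"query_string":{"query":"%s"}},"highlight":{"fields":{"file":{}}}}' "$1"
}

# Refresh the "test" index, then search only its "attachment" type.
search_attachments() {
  curl -s -X POST "localhost:9200/test/_refresh" > /dev/null
  curl -s "localhost:9200/test/attachment/_search?pretty=true" -d "$(search_body "$1")"
}

# Usage (needs a running node):
#   search_attachments amplifier
```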

Want to try it yourself?

Go grab this script and try it yourself.
