Averages can be misleading: try a percentile


With the release of Elasticsearch 1.1.0, there is a new metric aggregation available to users: the Percentile metric. A percentile is the value below which a given percentage of your data falls. So the 95th percentile is the value that is greater than or equal to 95% of your data.

Ok…but why is that useful?

Imagine you are the administrator for a large website. One of your goals is to guarantee fast response times to all website visitors, no matter where in the world they live. How do you analyze your data to guarantee that the latency is small?

Most people reach for basic statistics like mean, median or max. Each has its place, but for large populations of data they often hide the truth. Mean and median tend to hide outliers, since the majority of your data is “normal”. In contrast, the max is a hypochondriac and is easily distorted by a single outlier.

Let’s look at a graph. If you rely on simple metrics like mean or median, you might see a graph that looks like this:

Mean + Median

That doesn’t look so bad, does it? The mean and median response times hover around 50ms and creep up to 100ms for a little while. A different truth is apparent when you include the 99th percentile:

99th Percentile

Wow! That certainly doesn’t look good at all! At 9:30am, the mean is telling you “Don’t worry, the average latency is only 75ms”. In contrast, the 99th percentile says “99% of your values are less than 850ms”, which is a very different picture. One percent of all your customers are experiencing latencies of 850ms or more, which could be very bad for business.
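To make the difference concrete, here is a minimal Python sketch. The latency values are invented for illustration (they are not the data behind the graphs), and `statistics.quantiles` is just one convenient way to get an exact 99th percentile on a small in-memory list.

```python
import statistics

# Hypothetical latencies in ms: mostly fast, a slow tail, and one extreme outlier.
latencies_ms = [50] * 980 + [850] * 19 + [5000]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
max_ms = max(latencies_ms)

# quantiles(..., n=100) returns 99 cut points; index 98 is the 99th percentile.
p99_ms = statistics.quantiles(latencies_ms, n=100)[98]

print(f"mean={mean_ms} median={median_ms} p99={p99_ms} max={max_ms}")
```

The mean and median stay close to 50ms, the max is dominated by a single 5000ms request, and only the 99th percentile exposes the 850ms tail that one in a hundred requests falls into.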

Using the percentile

The new percentile metric works just like the simpler stats metrics like min and avg. It is a metric that can be applied to any aggregation bucket. The percentile metric will then calculate a set of percentiles based on the documents that fall within the bucket. Let’s look at a simple example:

curl -XGET localhost:9200/website/logs/_search -d '
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time" 
            }
        }
    }
}'

By default, the percentiles metric will calculate a set of default percentiles ([ 1, 5, 25, 50, 75, 95, 99 ]) and return the value for each one:

{
    ...

   "aggregations": {
      "load_time_outlier": {
         "1.0": 15,
         "5.0": 20,
         "25.0": 33,
         "50.0": 38,
         "75.0": 45,
         "95.0": 60,
         "99.0": 867
      }
   }
}

Often, only the extreme percentiles are important to you, such as the 95th and 99.9th percentiles. In this case, you can specify just the percentiles you are interested in:

curl -XGET localhost:9200/website/logs/_search -d '
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "percents" : [ 95, 99.9 ]
            }
        }
    }
}'

Because percentiles is a metric, we can nest it inside buckets to get more sophisticated analysis. Going back to our original goal (detecting high latency based on geographical location), we can build an aggregation that buckets users by their country and then computes a percentile on load_time:

curl -XGET localhost:9200/website/logs/_search -d '
{
    "aggs" : {
        "countries" : {
            "terms" : {
                "field" : "country_code"   
            },
            "aggs" : {
                "load_time_outlier" : {
                    "percentiles" : {
                        "field" : "load_time",
                        "percents" : [ 95 ]
                    }
                }    
            }
        }
    }
}'

And now we can see that Antarctica has a particularly slow 95th percentile (for some strange reason):

{
    ...

   "aggregations": {
       "countries" : {
           "buckets": [
                {
                    "key" : "AY",
                    "doc_count" : 20391,
                    "load_time_outlier": {
                         "95.0": 1205
                    }
                },
                ...
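Once a response like this has been parsed into a dictionary, ranking the buckets takes only a few lines. The sketch below assumes the response shape shown in this post; `slowest_buckets` is a hypothetical helper, and every bucket except "AY" is invented for illustration.

```python
def slowest_buckets(response, agg_name, metric_name, percent_key):
    """Return (bucket key, percentile value) pairs, slowest first."""
    buckets = response["aggregations"][agg_name]["buckets"]
    pairs = [(bucket["key"], bucket[metric_name][percent_key]) for bucket in buckets]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Hypothetical parsed response; only the "AY" bucket comes from the example above.
response = {
    "aggregations": {
        "countries": {
            "buckets": [
                {"key": "US", "doc_count": 50000, "load_time_outlier": {"95.0": 120}},
                {"key": "AY", "doc_count": 20391, "load_time_outlier": {"95.0": 1205}},
            ]
        }
    }
}

ranked = slowest_buckets(response, "countries", "load_time_outlier", "95.0")
print(ranked)  # [('AY', 1205), ('US', 120)]
```

The aggregation name passed in must match whatever name the request used for the terms aggregation.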

Percentiles are (usually) approximate

All good things come at a price, and with percentiles it usually boils down to approximations. Fundamentally, exact percentiles are very expensive to calculate. If you want to calculate the exact 95th percentile, you need to sort all your values from least to greatest, then find the value at myArray[ count(myArray) * 0.95 ].
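The exact method can be sketched in a few lines of Python. This version uses the nearest-rank convention, which is only one of several common ways to turn a percentage into an array index; the function name and sample data are illustrative.

```python
import math

def exact_percentile(values, percent):
    """Exact percentile: sort every value, then pick the one at the matching rank."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)  # needs every value in memory at once
    rank = math.ceil(percent * len(ordered) / 100)  # nearest-rank, 1-based
    rank = min(max(rank, 1), len(ordered))  # clamp so 0% and 100% stay in range
    return ordered[rank - 1]

load_times_ms = list(range(1, 101))  # hypothetical data: 1ms to 100ms
print(exact_percentile(load_times_ms, 95))  # 95
```

The sort is the expensive part: it needs all of the values in one place, which is what stops this approach from scaling.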

This works fine for small data that fits in memory, but simply fails when you have terabytes of data spread over a cluster of servers (which is common for Elasticsearch users). The exact method just won't work for Elasticsearch.

Instead, we use an algorithm called T-Digest (you can read more about it here). Without getting bogged down in technical details, it is sufficient to make the following claims about T-Digest:

  • For small datasets, your percentiles will be highly accurate (potentially 100% exact if the data is small enough)
  • For larger datasets, T-Digest will begin to trade accuracy for memory savings so that your node doesn't explode
  • Extreme percentiles (e.g. 95th) tend to be more accurate than interior percentiles (e.g. 50th)

The following chart shows the relative error on a uniform distribution depending on the number of collected values and the requested percentile:

percentiles_error

The absolute error of a percentile is the actual value minus the approximate value. It is often more useful to express that difference as a percentage of the actual value, which is the relative error. In the chart, we can see that at 1000 values, the 50th percentile is 0.26% off the true 50th percentile. In absolute terms, if the true 50th were 100ms, T-Digest might have told us 100.26ms. Practically speaking, the error is often negligible, especially when you are looking at the more extreme percentiles.
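As a quick check of that arithmetic, here is a small Python sketch; the function name is made up for illustration.

```python
def relative_error_percent(true_value, approx_value):
    """Relative error of an approximation, as a percentage of the true value."""
    if true_value == 0:
        raise ValueError("relative error is undefined when the true value is 0")
    return abs(approx_value - true_value) / abs(true_value) * 100

# True 50th percentile of 100ms, approximated as 100.26ms.
error = relative_error_percent(100.0, 100.26)
print(round(error, 2))  # 0.26
```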

The chart also shows how precision improves as you add more data. The error diminishes for large numbers of values because the law of large numbers makes the distribution of values more and more uniform, so the t-digest tree can do a better job of summarizing it. This would not be the case for more skewed distributions.

The memory-vs-accuracy tradeoff is configurable via a compression parameter, which you can find more details about in the documentation.
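As a rough sketch, a request body that sets this parameter could be built as below. The placement of `compression` next to `field` and the value 200 are assumptions for illustration; check the documentation for your Elasticsearch version before relying on either.

```python
import json

def percentiles_request(field, percents, compression):
    """Build a percentiles aggregation body (parameter placement is an assumption)."""
    return {
        "aggs": {
            "load_time_outlier": {
                "percentiles": {
                    "field": field,
                    "percents": percents,
                    "compression": compression,  # assumed: higher means more accuracy, more memory
                }
            }
        }
    }

body = percentiles_request("load_time", [95, 99.9], 200)
print(json.dumps(body, indent=4))
```

The printed JSON can be passed as the `-d` payload of the same curl command used in the earlier examples.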

Conclusion

Now armed with some basic knowledge about percentiles, hopefully you are beginning to see applications all over your data. These approximate algorithms are exciting new territory for Elasticsearch. We look forward to your feedback on the mailing list or Twitter!