Load testing Shared Elasticsearch¶
One of the prerequisites for implementing Shared Elasticsearch (ES) in Grove was to confirm that it works for OpenCraft's use case, namely: "Can we run a large number of instances while limiting resource usage?"
Prepping the test¶
To carry out the load tests, we used both Locust and k6. Initially we relied on Locust, but we had trouble generating a significant load on Elasticsearch with it: Locust required multiple workers, and despite our efforts the results were unreliable. We therefore switched to k6 for better performance and consistency.
The Locust results are not included here for this reason.
Secondly, we needed some data to query, which the Shakespeare dataset provided. Importing the data is fairly simple, using the bulk import endpoint on any of the ES nodes.
curl -XPOST --insecure \
-u elastic:$ELASTIC_PASSWORD \
-H "Content-type: application/json" \
--data-binary @shakespeare.json \
"https://localhost:9200/shakespeare/_bulk?pretty"
Running the load test¶
The test consisted of 2000 users running a search query against our imported shakespeare index. Each test ran for 1 minute in total, and we compared the number of successful requests handled by our ES cluster vs the built-in ES deployment provided by Tutor.
Our DigitalOcean Kubernetes cluster consisted of three nodes, each with 4 CPU cores and 8 GB of memory.
$ kubectl exec -it -nelasticsearch ubuntu-shell -- \
k6 run /cluster.js --vus 2000 --duration 1m
// cluster.js — load test against the shared ES cluster (HTTPS + basic auth).
import http from "k6/http";
import encoding from "k6/encoding";

const username = "elastic";
const password = ""; // set this to the value of $ELASTIC_PASSWORD
const host = "https://elasticsearch-master.elasticsearch.svc.cluster.local:9200";

// The cluster uses a self-signed certificate, so skip TLS verification.
export const options = {
  insecureSkipTLSVerify: true,
};

// Per-request parameters: basic auth header for the elastic user.
const encodedCredentials = encoding.b64encode(`${username}:${password}`);
const params = {
  headers: {
    Authorization: `Basic ${encodedCredentials}`,
  },
};

export default function () {
  // Query a random window of the shakespeare index so requests differ.
  const maxRecords = 10000;
  const start = Math.floor(Math.random() * maxRecords);
  let size = Math.floor(Math.random() * 500);
  if (start + size > maxRecords) {
    size = maxRecords - start;
  }
  const url = `${host}/shakespeare/_search?from=${start}&size=${size}`;
  return http.get(url, params);
}
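For the comparison run against the default Tutor deployment, an equivalent invocation can be used with the script below. The namespace, pod, and script path in this sketch are illustrative; whichever pod runs k6 must be able to resolve the instance's elasticsearch service:
$ kubectl exec -it -n<instance-namespace> <k6-pod> -- \
    k6 run /tutor.js --vus 2000 --duration 1m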
// Load test against the default ES deployment provided by Tutor (plain HTTP, no auth).
import http from "k6/http";

const host = "http://elasticsearch:9200";

export default function () {
  // Same random-window query as the cluster script, minus TLS and auth.
  const maxRecords = 10000;
  const start = Math.floor(Math.random() * maxRecords);
  let size = Math.floor(Math.random() * 500);
  if (start + size > maxRecords) {
    size = maxRecords - start;
  }
  const url = `${host}/shakespeare/_search?from=${start}&size=${size}`;
  return http.get(url);
}
$ kubectl run ubuntu-shell -nelasticsearch --image ubuntu -- sleep 365d
$ kubectl exec -it -nelasticsearch ubuntu-shell -- bash
# apt update && \
apt install ca-certificates gpg -y && \
mkdir -p ~/.gnupg && \
gpg --no-default-keyring \
--keyring /usr/share/keyrings/k6-archive-keyring.gpg \
--keyserver hkp://keyserver.ubuntu.com:80 \
--recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69 && \
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] \
https://dl.k6.io/deb stable main" | \
tee /etc/apt/sources.list.d/k6.list &&
apt-get update && \
apt-get install -y k6
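Before the k6 run commands above will work, the scripts also need to be copied into the pod. Assuming they are saved locally as cluster.js (and similarly for the Tutor variant), something along these lines does the job, and k6 version confirms the install:
$ kubectl cp cluster.js elasticsearch/ubuntu-shell:/cluster.js
$ kubectl exec -it -nelasticsearch ubuntu-shell -- k6 version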
Results¶
Let's start with the results. The first block below is the shared ES cluster; the second is the default Tutor deployment.
data_received..................: 657 MB 10 MB/s
data_sent......................: 3.5 MB 55 kB/s
http_req_blocked...............: avg=745.88ms min=0s med=3.54µs max=57.67s p(90)=16.43µs p(95)=3.13s
http_req_connecting............: avg=39.85ms min=0s med=0s max=1.17s p(90)=80.71ms p(95)=107.98ms
http_req_duration..............: avg=4.13s min=0s med=3.18s max=58.5s p(90)=7.06s p(95)=8.86s
{ expected_response:true }...: avg=4.45s min=190.83ms med=3.38s max=58.5s p(90)=7.31s p(95)=9.1s
http_req_failed................: 9.37% ✓ 1095 ✗ 10584
http_req_receiving.............: avg=10.05ms min=0s med=673.28µs max=4.59s p(90)=3.4ms p(95)=6.49ms
http_req_sending...............: avg=34.87µs min=0s med=18.64µs max=31.36ms p(90)=52.34µs p(95)=77.01µs
http_req_tls_handshaking.......: avg=709.98ms min=0s med=0s max=57.29s p(90)=0s p(95)=2.72s
http_req_waiting...............: avg=4.12s min=0s med=3.18s max=58.5s p(90)=7.04s p(95)=8.86s
http_reqs......................: 11679 184.290917/s
iteration_duration.............: avg=10.37s min=203.28ms med=3.77s max=1m0s p(90)=45.75s p(95)=1m0s
iterations.....................: 11679 184.290917/s
vus............................: 22 min=22 max=2000
vus_max........................: 2000 min=2000 max=2000
data_received..................: 2.4 GB 40 MB/s
data_sent......................: 4.9 MB 81 kB/s
http_req_blocked...............: avg=18.31ms min=841ns med=1.84µs max=585.46ms p(90)=4.23µs p(95)=90.09µs
http_req_connecting............: avg=1.06ms min=0s med=0s max=166.73ms p(90)=0s p(95)=0s
http_req_duration..............: avg=2.94s min=197.98ms med=530.51ms max=59.99s p(90)=966.3ms p(95)=1.27s
{ expected_response:true }...: avg=691.35ms min=197.98ms med=525.98ms max=59.23s p(90)=909.15ms p(95)=991.66ms
http_req_failed................: 3.83% ✓ 1551 ✗ 38909
http_req_receiving.............: avg=1.15ms min=0s med=74.72µs max=1.5s p(90)=166.92µs p(95)=622.84µs
http_req_sending...............: avg=172.06µs min=4.56µs med=8.73µs max=64.83ms p(90)=22.6µs p(95)=157.4µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=2.94s min=197.92ms med=529.85ms max=59.99s p(90)=964.77ms p(95)=1.26s
http_reqs......................: 40460 668.607358/s
iteration_duration.............: avg=2.96s min=198.08ms med=531.05ms max=1m0s p(90)=966.76ms p(95)=1.39s
iterations.....................: 40460 668.607358/s
vus............................: 1000 min=1000 max=2000
vus_max........................: 2000 min=2000 max=2000
Of interest is the number of successful requests: roughly 39k (38,909) for the default deployment vs roughly 10.5k (10,584) for the cluster.
The default deployment offers significantly greater throughput than the clustered version. The primary reason for this discrepancy is the restrictive resource limits imposed by the Helm chart; SSL overhead and internal cluster communication further reduce the clustered version's throughput.
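One way to sanity-check that the chart's limits are the bottleneck is to watch pod resource usage while a test is running. Assuming the metrics server is installed and the pods follow the Elastic Helm chart's elasticsearch-master naming, something like this shows whether the ES pods are pinned at their CPU limit:
$ kubectl top pods -nelasticsearch
$ kubectl describe pod -nelasticsearch elasticsearch-master-0 | grep -A 3 Limits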
Improving the cluster's performance¶
Since the Shared Elasticsearch cluster needs to handle multiple instances, its performance needs to be at least on par with a single node.
In short, we changed the following settings from their defaults.
resources:
  limits:
    cpu: "2000m"
    memory: "4Gi"
threadpool:
  search:
    size: 5000
Both the CPU and memory limits are doubled; we recommend doubling them again if you run a larger instance. The search thread pool is also increased to 5000 from the default of 1000, which allows the cluster to handle more users at once (at the expense of memory).
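As a rough sketch of how to apply these overrides, assuming the cluster is deployed from the official elastic/elasticsearch Helm chart with a release named elasticsearch and the settings above saved in a values.yaml file (adjust the release name and namespace to your setup):
$ helm upgrade elasticsearch elastic/elasticsearch \
    -n elasticsearch \
    -f values.yaml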
The load test for the cluster now looks like this:
data_received..................: 1.4 GB 22 MB/s
data_sent......................: 6.4 MB 102 kB/s
http_req_blocked...............: avg=233.87ms min=0s med=3.7µs max=59.38s p(90)=7.09µs p(95)=153.51µs
http_req_connecting............: avg=26.13ms min=0s med=0s max=1.22s p(90)=0s p(95)=92.61ms
http_req_duration..............: avg=2.73s min=0s med=2.03s max=58.63s p(90)=4.37s p(95)=5.59s
{ expected_response:true }...: avg=2.81s min=14.13ms med=2.09s max=58.4s p(90)=4.41s p(95)=5.67s
http_req_failed................: 4.09% ✓ 923 ✗ 21618
http_req_receiving.............: avg=7.56ms min=0s med=860.68µs max=3.31s p(90)=3.94ms p(95)=6.32ms
http_req_sending...............: avg=31.44µs min=0s med=19.12µs max=4.05ms p(90)=47.34µs p(95)=80.27µs
http_req_tls_handshaking.......: avg=207.73ms min=0s med=0s max=58.89s p(90)=0s p(95)=0s
http_req_waiting...............: avg=2.73s min=0s med=2.02s max=58.63s p(90)=4.36s p(95)=5.59s
http_reqs......................: 22541 361.528155/s
iteration_duration.............: avg=5.38s min=14.45ms med=2.17s max=1m0s p(90)=6.61s p(95)=33.83s
iterations.....................: 22541 361.528155/s
vus............................: 28 min=28 max=2000
vus_max........................: 2000 min=2000 max=2000
Conclusion¶
The most surprising outcome of these tests is that simply adding more nodes to an ES cluster doesn't, by itself, increase throughput.
- Allowing the cluster to use more CPU had the greatest effect: successful requests roughly doubled (from 10,584 to 21,618).
- Running with only a single replica is risky, but not out of the question.
- A cluster with more than 3 nodes will likely be overkill.
References and extra reading¶
- https://logz.io/blog/elasticsearch-performance-tuning/
- https://www.instaclustr.com/blog/understanding-and-configuring-elasticsearch-node-types/
- https://blog.opstree.com/2019/10/01/tuning-of-elasticsearch-cluster/
- https://luis-sena.medium.com/the-complete-guide-to-increase-your-elasticsearch-write-throughput-e3da4c1f9e92