Docker - ELK 7.6 : Elastic Stack with Docker Compose
Git repo: Elasticsearch stack (ELK) with docker-compose.
The following docker-compose.yml brings up Elasticsearch, Logstash, and Kibana containers so we can see how things work. This all-in-one configuration is a handy way to bring up our first dev cluster before we build a distributed deployment with multiple hosts:
version: '3.7'

services:
  elasticsearch:
    build:
      context: elasticsearch/
      args:
        ELK_VERSION: $ELK_VERSION
    volumes:
      - type: bind
        source: ./elasticsearch/config/elasticsearch.yml
        target: /usr/share/elasticsearch/config/elasticsearch.yml
        read_only: true
      - type: volume
        source: elasticsearch
        target: /usr/share/elasticsearch/data
    ports:
      - "9200:9200"
      - "9300:9300"
    environment:
      ES_JAVA_OPTS: "-Xmx256m -Xms256m"
      ELASTIC_PASSWORD: changeme
      # Use single node discovery in order to disable production mode and avoid bootstrap checks
      # see https://www.elastic.co/guide/en/elasticsearch/reference/current/bootstrap-checks.html
      discovery.type: single-node
    networks:
      - elk

  logstash:
    build:
      context: logstash/
      args:
        ELK_VERSION: $ELK_VERSION
    volumes:
      - type: bind
        source: ./logstash/config/logstash.yml
        target: /usr/share/logstash/config/logstash.yml
        read_only: true
      - type: bind
        source: ./logstash/pipeline
        target: /usr/share/logstash/pipeline
        read_only: true
    ports:
      - "5000:5000/tcp"
      - "5000:5000/udp"
      - "9600:9600"
    environment:
      LS_JAVA_OPTS: "-Xmx256m -Xms256m"
    networks:
      - elk
    depends_on:
      - elasticsearch

  kibana:
    build:
      context: kibana/
      args:
        ELK_VERSION: $ELK_VERSION
    volumes:
      - type: bind
        source: ./kibana/config/kibana.yml
        target: /usr/share/kibana/config/kibana.yml
        read_only: true
    ports:
      - "5601:5601"
    networks:
      - elk
    depends_on:
      - elasticsearch

networks:
  elk:
    driver: bridge

volumes:
  elasticsearch:
Run docker-compose to bring up the single-node Elasticsearch, Logstash, and Kibana containers:
$ docker-compose up
Creating network "docker-elk_elk" with driver "bridge"
Creating docker-elk_elasticsearch_1 ... done
Creating docker-elk_kibana_1        ... done
Creating docker-elk_logstash_1      ... done
Attaching to docker-elk_elasticsearch_1, docker-elk_kibana_1, docker-elk_logstash_1
logstash_1       | OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
elasticsearch_1  | Created elasticsearch keystore in /usr/share/elasticsearch/config
...
elasticsearch_1  | OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
elasticsearch_1  | {"type": "server", "timestamp": "2020-04-02T17:34:33,288Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "docker-cluster", "node.name": "9097740a0d56", "message": "using [1] data paths, mounts [[/usr/share/elasticsearch/data (/dev/sda1)]], net usable_space [5.4gb], net total_space [58.4gb], types [ext4]" }
...
logstash_1       | [2020-04-02T17:35:44,439][INFO ][logstash.licensechecker.licensereader] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://elastic:xxxxxx@elasticsearch:9200/]}}
logstash_1       | [2020-04-02T17:35:46,084][WARN ][logstash.licensechecker.licensereader] Restored connection to ES instance {:url=>"http://elastic:xxxxxx@elasticsearch:9200/"}
...
elasticsearch_1  | {"type": "server", "timestamp": "2020-04-02T17:35:51,857Z", "level": "WARN", "component": "o.e.c.r.a.DiskThresholdMonitor", "cluster.name": "docker-cluster", "node.name": "9097740a0d56", "message": "high disk watermark [90%] exceeded on [AxHQR8ZNRy6JultUbzBtVg][9097740a0d56][/usr/share/elasticsearch/data/nodes/0] free: 5.4gb[9.2%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete", "cluster.uuid": "591mkpKPTqmIoyaiUNsc2g", "node.id": "AxHQR8ZNRy6JultUbzBtVg" }
kibana_1         | {"type":"log","@timestamp":"2020-04-02T17:35:52Z","tags":["warning","config","deprecation"],"pid":6,"message":"Setting [elasticsearch.username] to \"elastic\" is deprecated. You should use the \"kibana\" user instead."}
...
kibana_1         | {"type":"log","@timestamp":"2020-04-02T17:36:10Z","tags":["info","http","server","Kibana"],"pid":6,"message":"http server running at http://0:5601"}
Once we see output similar to the following, we know Kibana is ready:
kibana_1 | {"type":"log","@timestamp":"2020-04-02T04:31:03Z","tags":["info","http","server","Kibana"],"pid":6,"message":"http server running at http://0:5601"}
Once everything appears to be OK, we may want to stop it and run the Elastic stack in detached mode using docker-compose up -d.
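For reference, a typical sequence with standard docker-compose commands looks like this:

# start the stack in the background
$ docker-compose up -d

# follow the logs of a single service, e.g. Kibana
$ docker-compose logs -f kibana

# stop and remove the containers when done (add -v to also drop the data volume)
$ docker-compose down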
Point a browser at localhost:5601 and log in with "elastic/changeme".
Now, our ELK stack is running:
$ docker-compose ps
                           Name                                        Command               State                                         Ports
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
einsteinish-elk-stack-with-docker-compose_elasticsearch_1   /usr/local/bin/docker-entr ...   Up      0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp
einsteinish-elk-stack-with-docker-compose_kibana_1          /usr/local/bin/dumb-init - ...   Up      0.0.0.0:5601->5601/tcp
einsteinish-elk-stack-with-docker-compose_logstash_1        /usr/local/bin/docker-entr ...   Up      0.0.0.0:5000->5000/tcp, 0.0.0.0:5000->5000/udp, 5044/tcp, 0.0.0.0:9600->9600/tcp
We can check how many resources our containers are consuming using docker stats:
$ docker stats
CONTAINER ID   NAME                                                        CPU %   MEM USAGE / LIMIT     MEM %    NET I/O          BLOCK I/O   PIDS
af9519caa50f   einsteinish-elk-stack-with-docker-compose_logstash_1        2.81%   466.5MiB / 1.945GiB   23.43%   48.7kB / 216kB   0B / 0B     40
2b9c23122e48   einsteinish-elk-stack-with-docker-compose_kibana_1          1.16%   288MiB / 1.945GiB     14.46%   901kB / 11.7MB   0B / 0B     12
1704e0b35296   einsteinish-elk-stack-with-docker-compose_elasticsearch_1   1.43%   545.7MiB / 1.945GiB   27.40%   1.54MB / 609kB   0B / 0B     63
Our Elastic stack is ready; however, we haven't created an index pattern yet. Before doing that, let's inject some log entries.
The shipped Logstash configuration (/logstash/pipeline/logstash.conf) allows us to send content via TCP:
input {
    tcp {
        port => 5000
    }
}

output {
    elasticsearch {
        hosts => "elasticsearch:9200"
        user => "elastic"
        password => "changeme"
    }
}
$ lsof -nP -iTCP:5000
COMMAND     PID     USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
com.docke 33800  ki.hong   35u  IPv4 0xd2b8428b995337cb      0t0  TCP *:5000 (LISTEN)
com.docke 33800  ki.hong   36u  IPv6 0xd2b8428ba27692eb      0t0  TCP [::1]:5000 (LISTEN)
$ cat /tmp/logstash-tutorial.log |nc -c localhost 5000
We can get the logstash-tutorial.log.gz from here.
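As a minimal sketch of fetching the sample file, assuming it still lives at the Elastic Logstash tutorial download location (adjust the URL if it has moved):

# download and unpack the sample Apache log file used above
# (URL from the Elastic "Getting Started with Logstash" tutorial)
$ curl -LO https://download.elastic.co/demos/logstash/gettingstarted/logstash-tutorial.log.gz
$ gunzip logstash-tutorial.log.gz
$ mv logstash-tutorial.log /tmp/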
Now that we have logs, after creating an index pattern we can see them displayed in Kibana.
In Discover, we have access to every document in every index that matches the selected index pattern. The index pattern tells Kibana which Elasticsearch index we are currently exploring. We can submit search queries, filter the search results, and view document data.
Here are the steps:
In the side navigation, click Discover. Ensure "logstash*" is the current index pattern.
We can see a histogram that shows the distribution of documents over time. A table lists the fields for each matching document. By default, all fields are shown.
To choose which fields to display, hover the pointer over the list of Available fields, and then click add next to each field we want to include as a column in the table. For example, if we add the message and @timestamp fields, the display includes columns for those two fields.
Here, we'll play with queries.
Before we do that, let's make sure the xpack settings in elasticsearch/config/elasticsearch.yml are as follows:
xpack.license.self_generated.type: basic
xpack.security.enabled: true
xpack.monitoring.collection.enabled: true
Note that, with security now enabled, a request sent without authentication credentials is rejected with the following security_exception:
{"error":{"root_cause":[{"type":"security_exception","reason":"missing authentication credentials for REST request [/]","header":{"WWW-Authenticate":"Basic realm=\"security\" charset=\"UTF-8\""}}],"type":"security_exception","reason":"missing authentication credentials for REST request [/]","header":{"WWW-Authenticate":"Basic realm=\"security\" charset=\"UTF-8\""}},"status":401}
After the change, we need to restart the stack.
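Since elasticsearch.yml is bind-mounted into the container, restarting is enough to pick up the change; for example:

# restart only Elasticsearch after editing elasticsearch/config/elasticsearch.yml
$ docker-compose restart elasticsearch

# or recreate the whole stack
$ docker-compose down
$ docker-compose up -d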
We'll start with the cat APIs, which are intended only for human consumption via the Kibana console or the command line.
In a browser, we'd be prompted for authentication (Basic Authentication). If we're using curl, we need to pass the credentials with the request via -u username:password (elastic:changeme).
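For example, a quick check against the cluster health endpoint shows the difference (any endpoint would do; this one is just convenient):

# without credentials: rejected with a 401 security_exception like the one above
$ curl "localhost:9200/_cluster/health?pretty"

# with Basic Authentication: returns the cluster health document
$ curl -u elastic:changeme "localhost:9200/_cluster/health?pretty"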
Though not recommended, we can disable security by setting xpack.security.enabled to false:
xpack.license.self_generated.type: basic
xpack.security.enabled: false
xpack.monitoring.collection.enabled: true
To list all available commands:
- curl:
$ curl -X GET "localhost:9200/_cat" -u elastic:changeme =^.^= /_cat/allocation /_cat/shards /_cat/shards/{index} /_cat/master /_cat/nodes /_cat/tasks /_cat/indices /_cat/indices/{index} /_cat/segments /_cat/segments/{index} /_cat/count /_cat/count/{index} /_cat/recovery /_cat/recovery/{index} /_cat/health /_cat/pending_tasks /_cat/aliases /_cat/aliases/{alias} /_cat/thread_pool /_cat/thread_pool/{thread_pools} /_cat/plugins /_cat/fielddata /_cat/fielddata/{fields} /_cat/nodeattrs /_cat/repositories /_cat/snapshots/{repository} /_cat/templates
- browser: opening http://localhost:9200/_cat returns the same list.
Each of the _cat commands accepts a query string parameter v to turn on verbose output (column headings); the same requests can also be run from the Kibana Dev Tools console. Without v, only the values are returned:
$ curl -X GET "localhost:9200/_cat/nodes?" 192.168.96.2 60 92 5 0.35 0.32 0.41 dilm * 8b73d9076e68
The h query string parameter forces only the specified columns to appear:
$ curl -X GET "localhost:9200/_cat/nodes?h=ip,port,heapPercent,name&pretty" 192.168.96.2 9300 67 8b73d9076e68
We can also request multiple columns using simple wildcards like /_cat/thread_pool?h=ip,queue* to get all headers (or aliases) starting with queue.
$ curl -X GET "localhost:9200/_cat/thread_pool?h=ip,queue*" 192.168.96.2 0 16 192.168.96.2 0 100 ... 192.168.96.2 0 -1 192.168.96.2 0 4 192.168.96.2 0 -1 192.168.96.2 0 1000 192.168.96.2 0 200
If we want to find the largest index in our cluster (by storage used by all the shards, not by number of documents), the /_cat/indices API is ideal. We only need to add three things to the API request:
- The bytes query string parameter with a value of b to get byte-level resolution.
- The s (sort) parameter with a value of store.size:desc to sort the output by shard storage in descending order.
- The v (verbose) parameter to include column headings in the response.
$ curl -X GET "localhost:9200/_cat/indices?bytes=b&s=store.size:desc&v" health status index uuid pri rep docs.count docs.deleted store.size pri.store.size green open .monitoring-es-7-2020.04.06 XQSHRxs7RsOZ3ZeLsa0Y_Q 1 0 77410 55160 34506536 34506536 green open .monitoring-es-7-2020.04.10 Gjs4h4dqTIWKC3owN0cjqQ 1 0 9435 1443 10628315 10628315 green open .monitoring-logstash-7-2020.04.06 6itFo78lShiWIb1e5i6GKg 1 0 43969 0 3007501 3007501 green open .monitoring-kibana-7-2020.04.06 LWbl13UVQLq_cWwkCccatA 1 0 5522 0 1220685 1220685 green open .monitoring-logstash-7-2020.04.10 myGRrPNMRlebzYfOdBbHUQ 1 0 2577 0 542115 542115 green open .monitoring-kibana-7-2020.04.10 knd52K_vSTellI_qhGPYpA 1 0 544 0 269953 269953 green open .security-7 e21FT4JoQ2WML_oax2GFYA 1 0 36 0 99098 99098 green open .kibana_1 2yJ-CzinQ-Czv0H3Rg4mQg 1 0 10 1 39590 39590 yellow open logstash-2020.04.06-000001 YShJ9NKUQO-4TuwhS0MlXA 1 1 100 0 36727 36727 green open ilm-history-1-000001 rVfV3nLQSXOM7c6yN68dbg 1 0 18 0 32919 32919 green open .kibana_task_manager_1 y29CTX98TEuZt3pb6lnXhA 1 0 2 0 6823 6823 green open .apm-agent-configuration zC5fg2AhSVK0TvV3WUcv_Q 1 0 0 0 283 283
The following queries give the same response in JSON format:
$ curl 'localhost:9200/_cat/indices?format=json&pretty'
[
  {
    "health" : "green",
    "status" : "open",
    "index" : ".security-7",
    "uuid" : "e21FT4JoQ2WML_oax2GFYA",
    "pri" : "1",
    "rep" : "0",
    "docs.count" : "36",
    "docs.deleted" : "0",
    "store.size" : "96.7kb",
    "pri.store.size" : "96.7kb"
  },
...

$ curl 'localhost:9200/_cat/indices?pretty' -H "Accept: application/json"
[
  {
    "health" : "green",
    "status" : "open",
    "index" : ".security-7",
    "uuid" : "e21FT4JoQ2WML_oax2GFYA",
    "pri" : "1",
    "rep" : "0",
    "docs.count" : "36",
    "docs.deleted" : "0",
    "store.size" : "96.7kb",
    "pri.store.size" : "96.7kb"
  },
...
The s query string parameter sorts the table by the columns specified as the parameter value. Columns are specified either by name or by alias, and are provided as a comma-separated string. By default, sorting is done in ascending fashion. Appending :desc to a column will invert the ordering for that column. :asc is also accepted but exhibits the same behavior as the default sort order.
For example, with a sort string s=column1,column2:desc,column3, the table will be sorted in ascending order by column1, in descending order by column2, and in ascending order by column3.
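As a concrete illustration, the following request lists the indices sorted by document count in descending order and then by index name:

$ curl -X GET "localhost:9200/_cat/indices?v&s=docs.count:desc,index"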
Let's put JSON documents into an Elasticsearch index.
We can do this directly with a simple PUT request that specifies the index we want to add the document to, a unique document ID, and one or more "field": "value" pairs in the request body:
$ curl -X PUT "localhost:9200/customer/_doc/1?pretty" -H 'Content-Type: application/json' -d' { "name": "John Doe" } ' { "_index" : "customer", "_type" : "_doc", "_id" : "1", "_version" : 1, "result" : "created", "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "_seq_no" : 0, "_primary_term" : 1 }
This request automatically creates the customer index if it doesn’t already exist, adds a new document that has an ID of 1, and stores and indexes the name field.
The new document is available immediately from any node in the cluster. We can retrieve it with a GET request that specifies its document ID:
$ curl -X GET "localhost:9200/customer/_doc/1?pretty" { "_index" : "customer", "_type" : "_doc", "_id" : "1", "_version" : 1, "_seq_no" : 0, "_primary_term" : 1, "found" : true, "_source" : { "name" : "John Doe" } }
If we have a lot of documents to index, we can submit them in batches with the bulk API (https://www.elastic.co/guide/en/elasticsearch/reference/7.6/docs-bulk.html).
Let's download the accounts.json sample data set:
$ curl -L https://github.com/elastic/elasticsearch/blob/master/docs/src/test/resources/accounts.json?raw=true -o accounts.json
The data set is randomly generated and represents user accounts with the following information:
{ "account_number": 0, "balance": 16623, "firstname": "Bradshaw", "lastname": "Mckenzie", "age": 29, "gender": "F", "address": "244 Columbus Place", "employer": "Euron", "email": "bradshawmckenzie@euron.com", "city": "Hobucken", "state": "CO" }
We're going to index the account data into the bank index with the following _bulk request:
$ curl -H "Content-Type: application/json" -XPOST \ "localhost:9200/bank/_bulk?pretty&refresh" \ --data-binary "@accounts.json" { "took" : 913, "errors" : false, "items" : [ { "index" : { "_index" : "bank", "_type" : "_doc", "_id" : "1", "_version" : 1, "result" : "created", "forced_refresh" : true, "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "_seq_no" : 0, "_primary_term" : 1, "status" : 201 } }, ... { "index" : { "_index" : "bank", "_type" : "_doc", "_id" : "995", "_version" : 1, "result" : "created", "forced_refresh" : true, "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "_seq_no" : 999, "_primary_term" : 1, "status" : 201 } } ] }
The --data-binary option posts the data exactly as specified, with no extra processing whatsoever, while --data (or -d) sends the data the way a browser submits an HTML form: curl uses the content type application/x-www-form-urlencoded and, when reading from a file with @, strips carriage returns and newlines. Since the bulk API relies on newline-delimited JSON, --data-binary is the right choice here.
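For context, accounts.json is already in that newline-delimited bulk format: each document is preceded by an action line, and the index name is taken from the request URL. Using the two accounts shown elsewhere in this post, the file looks roughly like this:

{"index":{"_id":"0"}}
{"account_number":0,"balance":16623,"firstname":"Bradshaw","lastname":"Mckenzie","age":29,"gender":"F","address":"244 Columbus Place","employer":"Euron","email":"bradshawmckenzie@euron.com","city":"Hobucken","state":"CO"}
{"index":{"_id":"1"}}
{"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"amberduke@pyrami.com","city":"Brogan","state":"IL"}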
We can check if the 1,000 documents were indexed successfully:
$ curl -X GET "localhost:9200/_cat/indices?v" health status index uuid pri rep docs.count docs.deleted store.size pri.store.size green open .apm-agent-configuration zC5fg2AhSVK0TvV3WUcv_Q 1 0 0 0 283b 283b green open .kibana_1 2yJ-CzinQ-Czv0H3Rg4mQg 1 0 10 1 38.6kb 38.6kb green open .kibana_task_manager_1 y29CTX98TEuZt3pb6lnXhA 1 0 2 0 32kb 32kb green open .monitoring-es-7-2020.04.06 XQSHRxs7RsOZ3ZeLsa0Y_Q 1 0 77410 55160 32.9mb 32.9mb green open .monitoring-es-7-2020.04.10 Gjs4h4dqTIWKC3owN0cjqQ 1 0 30096 0 10.7mb 10.7mb green open .monitoring-es-7-2020.04.11 FMSpb4JKScGYhu8nEzXt1A 1 0 95 18 695.5kb 695.5kb green open .monitoring-kibana-7-2020.04.06 LWbl13UVQLq_cWwkCccatA 1 0 5522 0 1.1mb 1.1mb green open .monitoring-kibana-7-2020.04.10 knd52K_vSTellI_qhGPYpA 1 0 1752 0 534.6kb 534.6kb green open .monitoring-kibana-7-2020.04.11 GxM1BDKvRkGTv0gWyt8U_A 1 0 3 0 42.9kb 42.9kb green open .monitoring-logstash-7-2020.04.06 6itFo78lShiWIb1e5i6GKg 1 0 43969 0 2.8mb 2.8mb green open .monitoring-logstash-7-2020.04.10 myGRrPNMRlebzYfOdBbHUQ 1 0 8617 0 1.1mb 1.1mb green open .monitoring-logstash-7-2020.04.11 TAjxbas0Rd-5mb2JOmvr0A 1 0 15 0 95.5kb 95.5kb green open .security-7 e21FT4JoQ2WML_oax2GFYA 1 0 36 0 96.7kb 96.7kb yellow open bank bDhhObs0SMiHpPJti21rZA 1 1 1000 0 414.1kb 414.1kb yellow open customer Q68qN_NBSOqz3dnWG6P0yQ 1 1 1 0 3.4kb 3.4kb green open ilm-history-1-000001 rVfV3nLQSXOM7c6yN68dbg 1 0 18 0 32.1kb 32.1kb yellow open logstash-2020.04.06-000001 YShJ9NKUQO-4TuwhS0MlXA 1 1 100 0 35.8kb 35.8kb
Just to see the bank index:
$ curl -X GET localhost:9200/_cat/indices/bank
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank  bDhhObs0SMiHpPJti21rZA   1   1       1000            0    414.1kb        414.1kb
The following section is based on Elasticsearch Reference [7.9] » Getting started with Elasticsearch » Start searching.
Now that we have ingested some data into an Elasticsearch index, we can search it by sending requests to the _search endpoint. To access the full suite of search capabilities, we use the Elasticsearch Query DSL to specify the search criteria in the request body. We specify the name of the index we want to search in the request URI.
The following request, for example, retrieves all documents in the bank index sorted by account number:
$ curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d' { "query": { "match_all": {} }, "sort": [ { "account_number": "asc" } ] } ' { "query": { "match_all": {} }, "sort": [ { "account_number": "asc" } ] } ' { "took" : 3, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1000, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "bank", "_type" : "_doc", "_id" : "0", "_score" : null, "_source" : { "account_number" : 0, "balance" : 16623, "firstname" : "Bradshaw", "lastname" : "Mckenzie", "age" : 29, "gender" : "F", "address" : "244 Columbus Place", "employer" : "Euron", "email" : "bradshawmckenzie@euron.com", "city" : "Hobucken", "state" : "CO" }, "sort" : [ 0 ] }, ... { "_index" : "bank", "_type" : "_doc", "_id" : "9", "_score" : null, "_source" : { "account_number" : 9, "balance" : 24776, "firstname" : "Opal", "lastname" : "Meadows", "age" : 39, "gender" : "M", "address" : "963 Neptune Avenue", "employer" : "Cedward", "email" : "opalmeadows@cedward.com", "city" : "Olney", "state" : "OH" }, "sort" : [ 9 ] } ] } }
As we can see from the output above, by default, the hits section of the response includes the first 10 documents that match the search criteria.
The response also provides the following information about the search request:
- took – how long it took Elasticsearch to run the query, in milliseconds
- timed_out – whether or not the search request timed out
- _shards – how many shards were searched and a breakdown of how many shards succeeded, failed, or were skipped.
- hits.total.value - how many matching documents were found
- hits.max_score – the score of the most relevant document found
- hits.sort - the document’s sort position (when not sorting by relevance score)
- hits._score - the document’s relevance score (not applicable when using match_all)
To page through the search hits, specify the from and size parameters in our request. For example, the following request gets hits 10 through 12:
$ curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d' { "query": { "match_all": {} }, "sort": [ { "account_number": "asc" } ], "from": 10, "size": 3 } ' { "took" : 15, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1000, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "bank", "_type" : "_doc", "_id" : "10", "_score" : null, "_source" : { "account_number" : 10, "balance" : 46170, "firstname" : "Dominique", "lastname" : "Park", "age" : 37, "gender" : "F", "address" : "100 Gatling Place", "employer" : "Conjurica", "email" : "dominiquepark@conjurica.com", "city" : "Omar", "state" : "NJ" }, "sort" : [ 10 ] }, { "_index" : "bank", "_type" : "_doc", "_id" : "11", "_score" : null, "_source" : { "account_number" : 11, "balance" : 20203, "firstname" : "Jenkins", "lastname" : "Haney", "age" : 20, "gender" : "M", "address" : "740 Ferry Place", "employer" : "Qimonk", "email" : "jenkinshaney@qimonk.com", "city" : "Steinhatchee", "state" : "GA" }, "sort" : [ 11 ] }, { "_index" : "bank", "_type" : "_doc", "_id" : "12", "_score" : null, "_source" : { "account_number" : 12, "balance" : 37055, "firstname" : "Stafford", "lastname" : "Brock", "age" : 20, "gender" : "F", "address" : "296 Wythe Avenue", "employer" : "Uncorp", "email" : "staffordbrock@uncorp.com", "city" : "Bend", "state" : "AL" }, "sort" : [ 12 ] } ] } }
Now we can start to construct queries that are a bit more interesting than match_all.
To search for specific terms within a field, we can use a match query. For example, the following request searches the address field to find customers whose addresses contain mill or lane:
$ curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d' { "query": { "match": { "address": "mill lane" } } } ' { "took" : 18, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 19, "relation" : "eq" }, "max_score" : 9.507477, "hits" : [ { "_index" : "bank", "_type" : "_doc", "_id" : "136", "_score" : 9.507477, "_source" : { "account_number" : 136, "balance" : 45801, "firstname" : "Winnie", "lastname" : "Holland", "age" : 38, "gender" : "M", "address" : "198 Mill Lane", "employer" : "Neteria", "email" : "winnieholland@neteria.com", "city" : "Urie", "state" : "IL" } }, { "_index" : "bank", "_type" : "_doc", "_id" : "970", "_score" : 5.4032025, "_source" : { "account_number" : 970, "balance" : 19648, "firstname" : "Forbes", "lastname" : "Wallace", "age" : 28, "gender" : "M", "address" : "990 Mill Road", "employer" : "Pheast", "email" : "forbeswallace@pheast.com", "city" : "Lopezo", "state" : "AK" } },
To perform a phrase search rather than matching individual terms, we use match_phrase instead of match. For example, the following request only matches addresses that contain the phrase mill lane:
$ curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d' { "query": { "match_phrase": { "address": "mill lane" } } } ' { "took" : 45, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : 9.507477, "hits" : [ { "_index" : "bank", "_type" : "_doc", "_id" : "136", "_score" : 9.507477, "_source" : { "account_number" : 136, "balance" : 45801, "firstname" : "Winnie", "lastname" : "Holland", "age" : 38, "gender" : "M", "address" : "198 Mill Lane", "employer" : "Neteria", "email" : "winnieholland@neteria.com", "city" : "Urie", "state" : "IL" } } ] } }
To construct more complex queries, we can use a bool query to combine multiple query criteria. We can designate criteria as required (must match), desirable (should match), or undesirable (must not match).
For example, the following request searches the bank index for accounts that belong to customers who are 33 years old, but excludes anyone who lives in Idaho (ID):
$ curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d' { "query": { "bool": { "must": [ { "match": { "age": "33" } } ], "must_not": [ { "match": { "state": "ID" } } ] } } } ' { "took" : 6, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 50, "relation" : "eq" }, "max_score" : 1.0, "hits" : [ { "_index" : "bank", "_type" : "_doc", "_id" : "18", "_score" : 1.0, "_source" : { "account_number" : 18, "balance" : 4180, "firstname" : "Dale", "lastname" : "Adams", "age" : 33, "gender" : "M", "address" : "467 Hutchinson Court", "employer" : "Boink", "email" : "daleadams@boink.com", "city" : "Orick", "state" : "MD" } }, ...
Each must, should, and must_not element in a Boolean query is referred to as a query clause. How well a document meets the criteria in each must or should clause contributes to the document’s relevance score. The higher the score, the better the document matches our search criteria. By default, Elasticsearch returns documents ranked by these relevance scores.
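As an illustration of a should clause (not part of the original walkthrough), the following sketch keeps the same age criterion but ranks accounts whose address contains Lane higher; documents that don't match the should clause are still returned, just with a lower score:

$ curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "33" } }
      ],
      "should": [
        { "match": { "address": "Lane" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}
'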
The criteria in a must_not clause is treated as a filter. It affects whether or not the document is included in the results, but does not contribute to how documents are scored. We can also explicitly specify arbitrary filters to include or exclude documents based on structured data.
For example, the following request uses a range filter to limit the results to accounts with a balance between $20,000 and $30,000 (inclusive).
$ curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d' { "query": { "bool": { "must": { "match_all": {} }, "filter": { "range": { "balance": { "gte": 20000, "lte": 30000 } } } } } } ' { "took" : 3, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 217, "relation" : "eq" }, "max_score" : 1.0, "hits" : [ { "_index" : "bank", "_type" : "_doc", "_id" : "49", "_score" : 1.0, "_source" : { "account_number" : 49, "balance" : 29104, "firstname" : "Fulton", "lastname" : "Holt", "age" : 23, "gender" : "F", "address" : "451 Humboldt Street", "employer" : "Anocha", "email" : "fultonholt@anocha.com", "city" : "Sunriver", "state" : "RI" } },
This section is based on Elasticsearch Reference [7.9] » Getting started with Elasticsearch » Analyze results with aggregations
Elasticsearch aggregations enable us to get meta-information about our search results and answer questions like, "How many account holders are in Texas?" or "What’s the average balance of accounts in Tennessee?" We can search documents, filter hits, and use aggregations to analyze the results all in one request.
For example, the following request uses a terms aggregation to group all of the accounts in the bank index by state, and returns the ten states with the most accounts in descending order:
$ curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d' { "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state.keyword" } } } } ' { "took" : 11, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1000, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "group_by_state" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 743, "buckets" : [ { "key" : "TX", "doc_count" : 30 }, { "key" : "MD", "doc_count" : 28 }, { "key" : "ID", "doc_count" : 27 }, { "key" : "AL", "doc_count" : 25 }, { "key" : "ME", "doc_count" : 25 }, { "key" : "TN", "doc_count" : 25 }, { "key" : "WY", "doc_count" : 25 }, { "key" : "DC", "doc_count" : 24 }, { "key" : "MA", "doc_count" : 24 }, { "key" : "ND", "doc_count" : 24 } ] } } }
The buckets in the response are the values of the state field. The doc_count shows the number of accounts in each state. For example, we can see that there are 27 accounts in ID (Idaho). Because the request set size to 0, the response contains only the aggregation results and does not include the account details, which would otherwise look like this:
"hits" : [ { "_index" : "bank", "_type" : "_doc", "_id" : "1", "_score" : 1.0, "_source" : { "account_number" : 1, "balance" : 39225, "firstname" : "Amber", "lastname" : "Duke", "age" : 32, "gender" : "M", "address" : "880 Holmes Lane", "employer" : "Pyrami", "email" : "amberduke@pyrami.com", "city" : "Brogan", "state" : "IL" } }, ...
We can combine aggregations to build more complex summaries of our data. For example, the following request nests an avg aggregation within the previous group_by_state aggregation to calculate the average account balances for each state.
$ curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d' { "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state.keyword" }, "aggs": { "average_balance": { "avg": { "field": "balance" } } } } } } ' { "took" : 38, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1000, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "group_by_state" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 743, "buckets" : [ { "key" : "TX", "doc_count" : 30, "average_balance" : { "value" : 26073.3 } }, { "key" : "MD", "doc_count" : 28, "average_balance" : { "value" : 26161.535714285714 } }, { "key" : "ID", "doc_count" : 27, "average_balance" : { "value" : 24368.777777777777 } }, { "key" : "AL", "doc_count" : 25, "average_balance" : { "value" : 25739.56 } }, { "key" : "ME", "doc_count" : 25, "average_balance" : { "value" : 21663.0 } }, { "key" : "TN", "doc_count" : 25, "average_balance" : { "value" : 28365.4 } }, { "key" : "WY", "doc_count" : 25, "average_balance" : { "value" : 21731.52 } }, { "key" : "DC", "doc_count" : 24, "average_balance" : { "value" : 23180.583333333332 } }, { "key" : "MA", "doc_count" : 24, "average_balance" : { "value" : 29600.333333333332 } }, { "key" : "ND", "doc_count" : 24, "average_balance" : { "value" : 26577.333333333332 } } ] } } }
Instead of sorting the results by count, we could sort using the result of the nested aggregation by specifying the order within the terms aggregation:
$ curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d' { "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state.keyword", "order": { "average_balance": "desc" } }, "aggs": { "average_balance": { "avg": { "field": "balance" } } } } } } ' { "took" : 37, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1000, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "group_by_state" : { "doc_count_error_upper_bound" : -1, "sum_other_doc_count" : 827, "buckets" : [ { "key" : "CO", "doc_count" : 14, "average_balance" : { "value" : 32460.35714285714 } }, { "key" : "NE", "doc_count" : 16, "average_balance" : { "value" : 32041.5625 } }, { "key" : "AZ", "doc_count" : 14, "average_balance" : { "value" : 31634.785714285714 } }, { "key" : "MT", "doc_count" : 17, "average_balance" : { "value" : 31147.41176470588 } }, { "key" : "VA", "doc_count" : 16, "average_balance" : { "value" : 30600.0625 } }, { "key" : "GA", "doc_count" : 19, "average_balance" : { "value" : 30089.0 } }, { "key" : "MA", "doc_count" : 24, "average_balance" : { "value" : 29600.333333333332 } }, { "key" : "IL", "doc_count" : 22, "average_balance" : { "value" : 29489.727272727272 } }, { "key" : "NM", "doc_count" : 14, "average_balance" : { "value" : 28792.64285714286 } }, { "key" : "LA", "doc_count" : 17, "average_balance" : { "value" : 28791.823529411766 } } ] } } }
Credits: this post is based on the Elastic stack (ELK) on Docker repo.
Docker & K8s
- Docker install on Amazon Linux AMI
- Docker install on EC2 Ubuntu 14.04
- Docker container vs Virtual Machine
- Docker install on Ubuntu 14.04
- Docker Hello World Application
- Nginx image - share/copy files, Dockerfile
- Working with Docker images : brief introduction
- Docker image and container via docker commands (search, pull, run, ps, restart, attach, and rm)
- More on docker run command (docker run -it, docker run --rm, etc.)
- Docker Networks - Bridge Driver Network
- Docker Persistent Storage
- File sharing between host and container (docker run -d -p -v)
- Linking containers and volume for datastore
- Dockerfile - Build Docker images automatically I - FROM, MAINTAINER, and build context
- Dockerfile - Build Docker images automatically II - revisiting FROM, MAINTAINER, build context, and caching
- Dockerfile - Build Docker images automatically III - RUN
- Dockerfile - Build Docker images automatically IV - CMD
- Dockerfile - Build Docker images automatically V - WORKDIR, ENV, ADD, and ENTRYPOINT
- Docker - Apache Tomcat
- Docker - NodeJS
- Docker - NodeJS with hostname
- Docker Compose - NodeJS with MongoDB
- Docker - Prometheus and Grafana with Docker-compose
- Docker - StatsD/Graphite/Grafana
- Docker - Deploying a Java EE JBoss/WildFly Application on AWS Elastic Beanstalk Using Docker Containers
- Docker : NodeJS with GCP Kubernetes Engine
- Docker : Jenkins Multibranch Pipeline with Jenkinsfile and Github
- Docker : Jenkins Master and Slave
- Docker - ELK : ElasticSearch, Logstash, and Kibana
- Docker - ELK 7.6 : Elasticsearch on Centos 7
- Docker - ELK 7.6 : Filebeat on Centos 7
- Docker - ELK 7.6 : Logstash on Centos 7
- Docker - ELK 7.6 : Kibana on Centos 7
- Docker - ELK 7.6 : Elastic Stack with Docker Compose
- Docker - Deploy Elastic Cloud on Kubernetes (ECK) via Elasticsearch operator on minikube
- Docker - Deploy Elastic Stack via Helm on minikube
- Docker Compose - A gentle introduction with WordPress
- Docker Compose - MySQL
- MEAN Stack app on Docker containers : micro services
- MEAN Stack app on Docker containers : micro services via docker-compose
- Docker Compose - Hashicorp's Vault and Consul Part A (install vault, unsealing, static secrets, and policies)
- Docker Compose - Hashicorp's Vault and Consul Part B (EaaS, dynamic secrets, leases, and revocation)
- Docker Compose - Hashicorp's Vault and Consul Part C (Consul)
- Docker Compose with two containers - Flask REST API service container and an Apache server container
- Docker compose : Nginx reverse proxy with multiple containers
- Docker & Kubernetes : Envoy - Getting started
- Docker & Kubernetes : Envoy - Front Proxy
- Docker & Kubernetes : Ambassador - Envoy API Gateway on Kubernetes
- Docker Packer
- Docker Cheat Sheet
- Docker Q & A #1
- Kubernetes Q & A - Part I
- Kubernetes Q & A - Part II
- Docker - Run a React app in a docker
- Docker - Run a React app in a docker II (snapshot app with nginx)
- Docker - NodeJS and MySQL app with React in a docker
- Docker - Step by Step NodeJS and MySQL app with React - I
- Installing LAMP via puppet on Docker
- Docker install via Puppet
- Nginx Docker install via Ansible
- Apache Hadoop CDH 5.8 Install with QuickStarts Docker
- Docker - Deploying Flask app to ECS
- Docker Compose - Deploying WordPress to AWS
- Docker - WordPress Deploy to ECS with Docker-Compose (ECS-CLI EC2 type)
- Docker - WordPress Deploy to ECS with Docker-Compose (ECS-CLI Fargate type)
- Docker - ECS Fargate
- Docker - AWS ECS service discovery with Flask and Redis
- Docker & Kubernetes : minikube
- Docker & Kubernetes 2 : minikube Django with Postgres - persistent volume
- Docker & Kubernetes 3 : minikube Django with Redis and Celery
- Docker & Kubernetes 4 : Django with RDS via AWS Kops
- Docker & Kubernetes : Kops on AWS
- Docker & Kubernetes : Ingress controller on AWS with Kops
- Docker & Kubernetes : HashiCorp's Vault and Consul on minikube
- Docker & Kubernetes : HashiCorp's Vault and Consul - Auto-unseal using Transit Secrets Engine
- Docker & Kubernetes : Persistent Volumes & Persistent Volumes Claims - hostPath and annotations
- Docker & Kubernetes : Persistent Volumes - Dynamic volume provisioning
- Docker & Kubernetes : DaemonSet
- Docker & Kubernetes : Secrets
- Docker & Kubernetes : kubectl command
- Docker & Kubernetes : Assign a Kubernetes Pod to a particular node in a Kubernetes cluster
- Docker & Kubernetes : Configure a Pod to Use a ConfigMap
- AWS : EKS (Elastic Container Service for Kubernetes)
- Docker & Kubernetes : Run a React app in a minikube
- Docker & Kubernetes : Minikube install on AWS EC2
- Docker & Kubernetes : Cassandra with a StatefulSet
- Docker & Kubernetes : Terraform and AWS EKS
- Docker & Kubernetes : Pods and Service definitions
- Docker & Kubernetes : Service IP and the Service Type
- Docker & Kubernetes : Kubernetes DNS with Pods and Services
- Docker & Kubernetes : Headless service and discovering pods
- Docker & Kubernetes : Scaling and Updating application
- Docker & Kubernetes : Horizontal pod autoscaler on minikubes
- Docker & Kubernetes : From a monolithic app to micro services on GCP Kubernetes
- Docker & Kubernetes : Rolling updates
- Docker & Kubernetes : Deployments to GKE (Rolling update, Canary and Blue-green deployments)
- Docker & Kubernetes : Slack Chat Bot with NodeJS on GCP Kubernetes
- Docker & Kubernetes : Continuous Delivery with Jenkins Multibranch Pipeline for Dev, Canary, and Production Environments on GCP Kubernetes
- Docker & Kubernetes : NodePort vs LoadBalancer vs Ingress
- Docker & Kubernetes : MongoDB / MongoExpress on Minikube
- Docker & Kubernetes : Load Testing with Locust on GCP Kubernetes
- Docker & Kubernetes : MongoDB with StatefulSets on GCP Kubernetes Engine
- Docker & Kubernetes : Nginx Ingress Controller on Minikube
- Docker & Kubernetes : Setting up Ingress with NGINX Controller on Minikube (Mac)
- Docker & Kubernetes : Nginx Ingress Controller for Dashboard service on Minikube
- Docker & Kubernetes : Nginx Ingress Controller on GCP Kubernetes
- Docker & Kubernetes : Kubernetes Ingress with AWS ALB Ingress Controller in EKS
- Docker & Kubernetes : Setting up a private cluster on GCP Kubernetes
- Docker & Kubernetes : Kubernetes Namespaces (default, kube-public, kube-system) and switching namespaces (kubens)
- Docker & Kubernetes : StatefulSets on minikube
- Docker & Kubernetes : RBAC
- Docker & Kubernetes Service Account, RBAC, and IAM
- Docker & Kubernetes - Kubernetes Service Account, RBAC, IAM with EKS ALB, Part 1
- Docker & Kubernetes : Helm Chart
- Docker & Kubernetes : My first Helm deploy
- Docker & Kubernetes : Readiness and Liveness Probes
- Docker & Kubernetes : Helm chart repository with Github pages
- Docker & Kubernetes : Deploying WordPress and MariaDB with Ingress to Minikube using Helm Chart
- Docker & Kubernetes : Deploying WordPress and MariaDB to AWS using Helm 2 Chart
- Docker & Kubernetes : Deploying WordPress and MariaDB to AWS using Helm 3 Chart
- Docker & Kubernetes : Helm Chart for Node/Express and MySQL with Ingress
- Docker & Kubernetes : Deploy Prometheus and Grafana using Helm and Prometheus Operator - Monitoring Kubernetes node resources out of the box
- Docker & Kubernetes : Deploy Prometheus and Grafana using kube-prometheus-stack Helm Chart
- Docker & Kubernetes : Istio (service mesh) sidecar proxy on GCP Kubernetes
- Docker & Kubernetes : Istio on EKS
- Docker & Kubernetes : Istio on Minikube with AWS EC2 for Bookinfo Application
- Docker & Kubernetes : Deploying .NET Core app to Kubernetes Engine and configuring its traffic managed by Istio (Part I)
- Docker & Kubernetes : Deploying .NET Core app to Kubernetes Engine and configuring its traffic managed by Istio (Part II - Prometheus, Grafana, pin a service, split traffic, and inject faults)
- Docker & Kubernetes : Helm Package Manager with MySQL on GCP Kubernetes Engine
- Docker & Kubernetes : Deploying Memcached on Kubernetes Engine
- Docker & Kubernetes : EKS Control Plane (API server) Metrics with Prometheus
- Docker & Kubernetes : Spinnaker on EKS with Halyard
- Docker & Kubernetes : Continuous Delivery Pipelines with Spinnaker and Kubernetes Engine
- Docker & Kubernetes : Multi-node Local Kubernetes cluster : Kubeadm-dind (docker-in-docker)
- Docker & Kubernetes : Multi-node Local Kubernetes cluster : Kubeadm-kind (k8s-in-docker)
- Docker & Kubernetes : nodeSelector, nodeAffinity, taints/tolerations, pod affinity and anti-affinity - Assigning Pods to Nodes
- Docker & Kubernetes : Jenkins-X on EKS
- Docker & Kubernetes : ArgoCD App of Apps with Heml on Kubernetes
- Docker & Kubernetes : ArgoCD on Kubernetes cluster
- Docker & Kubernetes : GitOps with ArgoCD for Continuous Delivery to Kubernetes clusters (minikube) - guestbook
Ph.D. / Golden Gate Ave, San Francisco / Seoul National Univ / Carnegie Mellon / UC Berkeley / DevOps / Deep Learning / Visualization