Monitoring Metrics
Collecting data is cheap, but not having it when we need it can be expensive.
This is a broad topic, and the metrics we should use depend on what we are monitoring. So, in this tutorial, we'll cover only a few of the simpler ones.
Web server (at 2017-01-01 01:12:34 UTC)
| Subtype | Description | Value |
|---|---|---|
| throughput (rps) | requests per second | 312 |
| success | percentage of responses that are 2xx since last measurement | 99.1 |
| error | percentage of responses that are 5xx since last measurement | 0.1 |
| performance | 90th percentile response time in seconds | 0.4 |
Requests per Second (RPS)
RPS measures how many requests per second are being sent to a target server. This metric is also called Average Load, and it tells us what load our web application is currently working under. It is usually calculated as the count of requests received during a measurement period divided by the length of that period in seconds. The measurement period (often called the monitoring period) is generally in the range of 1 to 5 minutes.
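As a minimal sketch, the calculation reduces to a count divided by a period length; the request count below is a hypothetical figure chosen to reproduce the 312 rps shown in the table above:

```python
def requests_per_second(request_count, period_seconds):
    """Average Load: requests received during the period, divided by its length."""
    return request_count / period_seconds

# e.g. 93,600 requests observed over a 5-minute monitoring period
print(requests_per_second(93_600, 5 * 60))  # -> 312.0
```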
Error Rates
Naturally, some errors may occur when processing requests, especially under heavy load. The Error Rate is usually calculated as the percentage of problem requests relative to all requests, and it reflects how many HTTP response status codes indicate an error on the server, including any requests that never get a response at all (timeouts).
Web servers return an HTTP status code in the response header. Normal codes are 200 (OK) or something in the 3xx range, indicating a redirect on the server. Common error codes fall in the 4xx range (a problem with the client's request) and the 5xx range (the server knows it has a problem fulfilling the request).
Error Rate is a significant metric because it measures "performance failure" in the application and tells us how many failed requests have occurred at a particular point in time.
There is no universal tolerance for Error Rate; some teams consider an Error Rate below 1% acceptable. In any case, we should try to minimize errors to avoid performance problems and work constantly to eliminate them.
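As a sketch of the calculation, assuming we have collected each response's status code during the measurement period (the sample data is hypothetical, and timeouts are counted as errors, as described above):

```python
def error_rate(status_codes, timed_out=0):
    """Percentage of requests that failed: 5xx responses plus requests
    that never received a response at all (timeouts)."""
    total = len(status_codes) + timed_out
    errors = sum(1 for code in status_codes if 500 <= code <= 599) + timed_out
    return 100.0 * errors / total if total else 0.0

codes = [200] * 991 + [301] * 5 + [404] * 3 + [500]  # hypothetical sample
print(round(error_rate(codes), 1))  # -> 0.1, matching the table above
```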
Average Response Times (ART)
By measuring the duration of every request/response cycle, we can evaluate how long the target web application takes to generate a response. The ART considers every round-trip request/response cycle during a monitoring period and is the mathematical mean of all the response times.
The resulting metric reflects the speed of the web application, and is perhaps the best indicator of how the target site is performing from the users' perspective. Keep in mind that the ART includes the time consumed by every resource used while preparing the response, so the average is significantly affected by any slow component. The recommended standard unit of measurement for ART is milliseconds.
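Both the mean and the 90th percentile reported in the table above can be computed directly from the collected samples; a minimal sketch using only the standard library:

```python
import statistics

def average_response_time_ms(samples_ms):
    """ART: the mathematical mean of every round-trip time in the period."""
    return statistics.mean(samples_ms)

def p90_response_time_ms(samples_ms):
    """90th percentile, the 'performance' subtype shown in the table above."""
    return statistics.quantiles(samples_ms, n=10)[-1]
```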
Peak Response Times (PRT)
Similar to the ART, PRT also measures round-trip request/response cycles, but the peak tells us the longest cycle observed at that point in the test. For instance, if a graph covering a 5-minute monitoring period shows a PRT of 13 seconds, we know that one of our requests took that long. If the average is still sub-second (because the other resources responded quickly), we may not be troubled yet and conclude there is no problem so far.
But when the ART and PRT start becoming comparable, we almost certainly have a problem on the server. Generally, the PRT shows that at least one resource is potentially problematic: it can reflect an anomaly in the application, an "expensive" database query, and so on. As with ART, the recommended unit of measurement for PRT is milliseconds.
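A sketch comparing the two metrics on hypothetical samples; the single 13-second outlier sets the peak and pulls this small sample's mean well above the typical request:

```python
import statistics

samples_ms = [120, 95, 230, 180, 13_000, 140]  # hypothetical round trips (ms)
art = statistics.mean(samples_ms)
prt = max(samples_ms)
print(f"ART = {art:.0f} ms, PRT = {prt} ms")
# A PRT far above the ART points at one slow resource; when the two values
# converge, the slowness is systemic.
```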
Data store (at 2017-01-01 01:12:34 UTC)
| Subtype | Description | Value |
|---|---|---|
| throughput | queries per second | 949 |
| success | percentage of queries successfully executed since last measurement | 100 |
| error | percentage of queries yielding exceptions since last measurement | 0 |
| error | percentage of queries returning stale data since last measurement | 4.2 |
| performance | 90th percentile query time in seconds | 0.02 |
Query load: Monitoring the number of queries currently in progress can give us a rough idea of how many requests our cluster is dealing with at any particular moment in time. Consider alerting on unusual spikes or dips that may point to underlying problems. We may also want to monitor the size of the search thread pool queue.
Query latency: Monitoring tools can derive the average query latency from the available metrics by sampling the total number of queries and the total elapsed query time at regular intervals. Set an alert if latency exceeds a threshold; if it fires, look for potential resource bottlenecks, or investigate whether the queries need to be optimized.
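A sketch of that sampling loop, assuming a hypothetical `get_stats` callable that returns the data store's cumulative query counters; the counter names and the 200 ms alert threshold are illustrative assumptions, not a specific product's API:

```python
import time

def watch_query_latency(get_stats, interval_s=60, threshold_ms=200):
    """Average query latency per interval, derived from the deltas of the
    cumulative `query_total` and `query_time_ms` counters (assumed names)."""
    prev = get_stats()
    while True:
        time.sleep(interval_s)
        cur = get_stats()
        queries = cur["query_total"] - prev["query_total"]
        elapsed_ms = cur["query_time_ms"] - prev["query_time_ms"]
        latency_ms = elapsed_ms / queries if queries else 0.0
        if latency_ms > threshold_ms:
            print(f"ALERT: average query latency {latency_ms:.1f} ms")
        prev = cur
```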
Fetch latency: The fetch phase, the second part of the search process, should typically take much less time than the query phase. If this metric is consistently increasing, it could indicate slow disks, document enrichment (highlighting relevant text in search results, etc.), or requests for too many results.
Resource Metrics (at 2017-01-01 01:12:34 UTC)
| Subtype | Utilization | Saturation | Errors | Availability |
|---|---|---|---|---|
| Disk I/O | % time that device was busy | wait queue length | # device errors | % time writable |
| Memory | % of total memory capacity in use | swap usage | N/A (not usually observable) | N/A |
| Microservice | average % time each request-servicing thread was busy | # enqueued requests | # internal errors such as caught exceptions | % time service is reachable |
| Database | average % time each connection was busy | # enqueued queries | # internal errors, e.g. replication errors | % time database is reachable |
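As a sketch of how a host's utilization and saturation columns can be read in practice, here are the Memory and Disk I/O rows collected with the third-party psutil library (the 5-second sampling window is an arbitrary choice, and the busy_time counter is only reported on some platforms, e.g. Linux):

```python
import time
import psutil  # third-party: pip install psutil

# Memory utilization: % of total capacity in use; saturation: swap usage.
mem = psutil.virtual_memory()
swap = psutil.swap_memory()
print(f"memory utilization: {mem.percent}%  swap used: {swap.percent}%")

# Disk I/O utilization: % of wall-clock time the device was busy, derived
# from the delta of the cumulative busy_time counter (milliseconds, Linux).
before = psutil.disk_io_counters()
time.sleep(5)
after = psutil.disk_io_counters()
busy_ms = after.busy_time - before.busy_time
print(f"disk busy: {100.0 * busy_ms / 5000:.1f}% over the last 5 s")
```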