Monitoring Metrics
Collecting data is cheap, but not having it when we need it can be expensive.
This is a broad topic, and the metrics we should use depend on what we are monitoring. So, in this tutorial, we'll cover only a few of the simpler ones.
Web server (at 2017-01-01 01:12:34 UTC)
| Subtype | Description | Value |
|---|---|---|
| throughput (rps) | requests per second | 312 |
| success | percentage of responses that are 2xx since last measurement | 99.1 |
| error | percentage of responses that are 5xx since last measurement | 0.1 |
| performance | 90th percentile response time in seconds | 0.4 |
Requests per Second (RPS)
RPS measures how many requests per second are being sent to a target server. This metric is also called Average Load, and it tells us what load our web application is currently working under. It is usually calculated as the count of requests received during a measurement period divided by the length of that period in seconds. The measurement period (often called the monitoring period) is generally in the range of 1 to 5 minutes.
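As a minimal sketch, the calculation reduces to a count divided by a period length; the request count below is a hypothetical figure chosen to reproduce the 312 rps shown in the table above:

```python
def requests_per_second(request_count, period_seconds):
    """Average Load: requests received during the period, divided by its length."""
    return request_count / period_seconds

# e.g. 93,600 requests observed over a 5-minute monitoring period
print(requests_per_second(93_600, 5 * 60))  # -> 312.0
```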
Error Rates
Naturally, some errors may occur when processing requests, especially under heavy load. The Error Rate is usually calculated as the percentage of problem requests relative to all requests, and it reflects how many HTTP response status codes indicate an error on the server, including any requests that never get a response at all (timeouts).
Web servers return an HTTP status code in the response header. Normal codes are 200 (OK) or something in the 3xx range, indicating a redirect on the server. Common error codes fall in the 4xx range (a problem with the client's request) and the 5xx range (the server knows it has a problem fulfilling the request).
Error Rate is a significant metric because it measures "performance failure" in the application and tells us how many failed requests have occurred at a particular point in time.
There is no universal tolerance for Error Rate; some teams consider an Error Rate below 1% acceptable. In any case, we should try to minimize errors to avoid performance problems and work constantly to eliminate them.
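As a sketch of the calculation, assuming we have collected each response's status code during the measurement period (the sample data is hypothetical, and timeouts are counted as errors, as described above):

```python
def error_rate(status_codes, timed_out=0):
    """Percentage of requests that failed: 5xx responses plus requests
    that never received a response at all (timeouts)."""
    total = len(status_codes) + timed_out
    errors = sum(1 for code in status_codes if 500 <= code <= 599) + timed_out
    return 100.0 * errors / total if total else 0.0

codes = [200] * 991 + [301] * 5 + [404] * 3 + [500]  # hypothetical sample
print(round(error_rate(codes), 1))  # -> 0.1, matching the table above
```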
Average Response Times (ART)
By measuring the duration of every request/response cycle, we can evaluate how long the target web application takes to generate a response. The ART considers every round-trip request/response cycle during a monitoring period and is the mathematical mean of all the response times.
The resulting metric reflects the speed of the web application, and is perhaps the best indicator of how the target site is performing from the users' perspective. Keep in mind that the ART includes the time consumed by every resource used while preparing the response, so the average is significantly affected by any slow component. The recommended standard unit of measurement for ART is milliseconds.
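Both the mean and the 90th percentile reported in the table above can be computed directly from the collected samples; a minimal sketch using only the standard library:

```python
import statistics

def average_response_time_ms(samples_ms):
    """ART: the mathematical mean of every round-trip time in the period."""
    return statistics.mean(samples_ms)

def p90_response_time_ms(samples_ms):
    """90th percentile, the 'performance' subtype shown in the table above."""
    return statistics.quantiles(samples_ms, n=10)[-1]
```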
Peak Response Times (PRT)
Similar to the ART, PRT also measures round-trip request/response cycles, but the peak tells us the longest cycle observed at that point in the test. For instance, if a graph covering a 5-minute monitoring period shows a PRT of 13 seconds, we know that one of our requests took that long. If the average is still sub-second (because the other resources responded quickly), we may not be troubled yet and conclude there is no problem so far.
But when the ART and PRT start becoming comparable, we almost certainly have a problem on the server. Generally, the PRT shows that at least one resource is potentially problematic: it can reflect an anomaly in the application, an "expensive" database query, and so on. As with ART, the recommended unit of measurement for PRT is milliseconds.
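A sketch comparing the two metrics on hypothetical samples; the single 13-second outlier sets the peak and pulls this small sample's mean well above the typical request:

```python
import statistics

samples_ms = [120, 95, 230, 180, 13_000, 140]  # hypothetical round trips (ms)
art = statistics.mean(samples_ms)
prt = max(samples_ms)
print(f"ART = {art:.0f} ms, PRT = {prt} ms")
# A PRT far above the ART points at one slow resource; when the two values
# converge, the slowness is systemic.
```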
Data store (at 2017-01-01 01:12:34 UTC)
| Subtype | Description | Value |
|---|---|---|
| throughput | queries per second | 949 |
| success | percentage of queries successfully executed since last measurement | 100 |
| error | percentage of queries yielding exceptions since last measurement | 0 |
| error | percentage of queries returning stale data since last measurement | 4.2 |
| performance | 90th percentile query time in seconds | 0.02 |
Query load: Monitoring the number of queries currently in progress can give us a rough idea of how many requests our cluster is dealing with at any particular moment in time. Consider alerting on unusual spikes or dips that may point to underlying problems. We may also want to monitor the size of the search thread pool queue.
Query latency: Monitoring tools can derive the average query latency from the available metrics by sampling the total number of queries and the total elapsed query time at regular intervals. Set an alert if latency exceeds a threshold; if it fires, look for potential resource bottlenecks, or investigate whether the queries need to be optimized.
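A sketch of that sampling loop, assuming a hypothetical `get_stats` callable that returns the data store's cumulative query counters; the counter names and the 200 ms alert threshold are illustrative assumptions, not a specific product's API:

```python
import time

def watch_query_latency(get_stats, interval_s=60, threshold_ms=200):
    """Average query latency per interval, derived from the deltas of the
    cumulative `query_total` and `query_time_ms` counters (assumed names)."""
    prev = get_stats()
    while True:
        time.sleep(interval_s)
        cur = get_stats()
        queries = cur["query_total"] - prev["query_total"]
        elapsed_ms = cur["query_time_ms"] - prev["query_time_ms"]
        latency_ms = elapsed_ms / queries if queries else 0.0
        if latency_ms > threshold_ms:
            print(f"ALERT: average query latency {latency_ms:.1f} ms")
        prev = cur
```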
Fetch latency: The fetch phase, the second part of the search process, should typically take much less time than the query phase. If this metric is consistently increasing, it could indicate slow disks, document enrichment (highlighting relevant text in search results, etc.), or requests for too many results.
Resource Metrics (at 2017-01-01 01:12:34 UTC)
| Subtype | Utilization | Saturation | Errors | Availability |
|---|---|---|---|---|
| Disk I/O | % time that device was busy | wait queue length | # device errors | % time writable |
| Memory | % of total memory capacity in use | swap usage | N/A (not usually observable) | N/A |
| Microservice | average % time each request-servicing thread was busy | # enqueued requests | # internal errors such as caught exceptions | % time service is reachable |
| Database | average % time each connection was busy | # enqueued queries | # internal errors, e.g. replication errors | % time database is reachable |
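As a sketch of how a host's utilization and saturation columns can be read in practice, here are the Memory and Disk I/O rows collected with the third-party psutil library (the 5-second sampling window is an arbitrary choice, and the busy_time counter is only reported on some platforms, e.g. Linux):

```python
import time
import psutil  # third-party: pip install psutil

# Memory utilization: % of total capacity in use; saturation: swap usage.
mem = psutil.virtual_memory()
swap = psutil.swap_memory()
print(f"memory utilization: {mem.percent}%  swap used: {swap.percent}%")

# Disk I/O utilization: % of wall-clock time the device was busy, derived
# from the delta of the cumulative busy_time counter (milliseconds, Linux).
before = psutil.disk_io_counters()
time.sleep(5)
after = psutil.disk_io_counters()
busy_ms = after.busy_time - before.busy_time
print(f"disk busy: {100.0 * busy_ms / 5000:.1f}% over the last 5 s")
```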