Below are sample PromQL expressions that you can use as a starting point for each metric. Note that the exact metric names (and label names) may vary depending on your instrumentation. Adjust the expressions as needed to match your environment.
Response Time
If you’re recording request duration as a histogram (for example, via a metric like http_request_duration_seconds), you can compute the 95th-percentile response time over a 5-minute window:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Note: Your existing cronjob solution for the Thermo Fisher task that stores response times may supply its own metric. In that case, replace the metric name as appropriate.
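If your duration histogram also carries a service-identifying label, you can break the percentile down per service (the service label here is an assumption about your instrumentation; use whatever label your metrics actually expose):
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))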
Latency
Assuming you’re capturing a metric for latency (for example, time-to-first-byte as latency_seconds), you might use an average over the last minute:
avg_over_time(latency_seconds[1m])
If you have latency buckets/histogram, you could similarly apply a quantile function.
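As a sketch, assuming the histogram exposes buckets under a hypothetical latency_seconds_bucket series, the 95th percentile over 5 minutes would be:
histogram_quantile(0.95, sum(rate(latency_seconds_bucket[5m])) by (le))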
Throughput
Using a counter (such as http_requests_total), the throughput (requests per second) over a 1-minute window can be computed as:
sum(rate(http_requests_total[1m]))
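To break throughput down per endpoint, group by a label such as handler (assuming your instrumentation attaches one; swap in whichever label identifies the endpoint in your setup):
sum(rate(http_requests_total[1m])) by (handler)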
Error Rate
If you differentiate errors by status codes (for example, considering 5xx responses as errors), you can compute error rate (percentage) as:
(
sum(rate(http_requests_total{status=~"5.."}[1m]))
/
sum(rate(http_requests_total[1m]))
) * 100
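If one Prometheus server scrapes several services, the same ratio can be computed per scrape job using the built-in job label:
(
sum(rate(http_requests_total{status=~"5.."}[1m])) by (job)
/
sum(rate(http_requests_total[1m])) by (job)
) * 100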
Availability (Uptime)
Assuming that the built-in Prometheus up metric indicates whether a service instance is reachable (1 for up, 0 for down), you can calculate the percentage uptime over the last day as:
avg_over_time(up[1d]) * 100
Note: Depending on your use case, you may want to aggregate by instance/service.
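For example, to get per-job availability rather than one global figure:
avg(avg_over_time(up[1d])) by (job) * 100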
CPU Utilization
Using node exporter data, CPU utilization (as a percentage) is typically computed by subtracting the idle time. For example, over a 5‑minute window:
(
1 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
) * 100
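If you prefer a single fleet-wide figure instead of per-instance values, drop the by (instance) grouping so the average runs across every scraped node:
(
1 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m]))
) * 100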
Memory Usage
If you have metrics for total and available memory (as provided by node exporter), the percentage of used memory can be computed as:
(
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/
node_memory_MemTotal_bytes
) * 100
Alternatively, if you prefer the absolute amount of memory consumed, simply use the difference in bytes.
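For example, the absolute memory consumed in bytes, per instance:
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes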
Disk I/O Performance
For disk read and write speeds, you can use metrics like node_disk_read_bytes_total and node_disk_written_bytes_total. For example, to see the current read throughput:
rate(node_disk_read_bytes_total[1m])
And for write throughput:
rate(node_disk_written_bytes_total[1m])
These expressions return the average number of bytes read or written per second over the given period.
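Node exporter labels these series per device, so you can keep that breakdown explicit (and optionally filter to a specific disk, e.g. device="sda", which is an assumed device name):
sum(rate(node_disk_read_bytes_total[1m])) by (instance, device)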
Network Bandwidth Utilization
Assuming you have metrics for network traffic (e.g. node_network_receive_bytes_total and node_network_transmit_bytes_total) and a known interface capacity metric (say, node_network_speed_bytes), you can compute a utilization percentage like:
(
rate(node_network_receive_bytes_total[1m]) + rate(node_network_transmit_bytes_total[1m])
)
/
node_network_speed_bytes * 100
If the interface capacity isn’t available as a metric, you may need to hardcode that value or use additional labels.
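A minimal sketch with a hardcoded capacity, assuming a 1 Gbps link (125,000,000 bytes per second) and an interface named eth0 (both are assumptions to adjust for your environment):
(
rate(node_network_receive_bytes_total{device="eth0"}[1m]) + rate(node_network_transmit_bytes_total{device="eth0"}[1m])
)
/ 125000000 * 100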
Scalability
Scalability is often evaluated by how well the system maintains performance under increased load. While this is not a single metric, you can track proxy metrics such as:
- The number of instances running (from Kubernetes metrics):
  sum(kube_deployment_status_replicas_available) by (deployment)
- The error rate or response time under load.
For example, if response times or error rates start to degrade as throughput increases, that indicates scalability challenges. Custom load or performance benchmarks may also be necessary to fully assess scalability.
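One rough proxy you could chart is requests handled per available replica; the deployment name my-app below is a placeholder for your own:
sum(rate(http_requests_total[5m]))
/
sum(kube_deployment_status_replicas_available{deployment="my-app"})
If this ratio keeps climbing while latency or error rates worsen, the service is likely not scaling cleanly.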
These expressions should provide a good baseline for monitoring your system’s performance with Prometheus. Adjust windows (e.g., [1m], [5m], [1d]) and label filters to suit your specific environment and monitoring needs.