Alerting

This page provides sample Prometheus alerting rules for monitoring flexFS infrastructure. Adapt the thresholds to your environment and SLAs.

Add the rule groups below to a Prometheus rules file (e.g., flexfs-alerts.yml) and reference that file from your prometheus.yml:

prometheus.yml
rule_files:
  - "flexfs-alerts.yml"
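
The `up{job="..."}` rules on this page only match if the scrape jobs are named `flexfs-meta` and `flexfs-proxy`. A minimal sketch of matching scrape jobs follows; the host names and port are placeholders assumed for illustration, not values documented by flexFS.

```yaml
# prometheus.yml (sketch): job_name must match the job="..." selectors
# used by the FlexFSMetaServerDown and FlexFSProxyServerDown rules.
scrape_configs:
  - job_name: flexfs-meta
    static_configs:
      - targets:
          - "meta-1.example.internal:9100"   # hypothetical host:port
  - job_name: flexfs-proxy
    static_configs:
      - targets:
          - "proxy-1.example.internal:9100"  # hypothetical host:port
```

If your jobs use different names, adjust the `job` selectors in the rules rather than renaming the jobs.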

flexfs-alerts.yml
groups:
  - name: flexfs-meta
    rules:
      - alert: FlexFSMetaServerDown
        expr: up{job="flexfs-meta"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "flexFS metadata server is down"
          description: "{{ $labels.instance }} has been unreachable for more than 2 minutes."
      - alert: FlexFSMetaHighRPCLatency
        expr: |
          histogram_quantile(0.99, sum(rate(flexfs_meta_rpc_duration_seconds_bucket[5m])) by (le, instance))
            > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "flexFS metadata server p99 RPC latency is high"
          description: "{{ $labels.instance }} p99 RPC latency has been above 100ms for 5 minutes."
      - alert: FlexFSMetaNoActiveSessions
        expr: sum(flexfs_meta_sessions) == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "No active flexFS mount sessions"
          description: "No mount clients have been connected for 10 minutes. This may indicate a network or configuration issue."
      # Both sides are aggregated by instance so that the division can match
      # series and {{ $labels.instance }} is populated in the description.
      - alert: FlexFSMetaDiskUsageHigh
        expr: |
          sum by (instance) (flexfs_meta_db_disk_usage_bytes)
            / max by (instance) (flexfs_meta_db_folder_disk_capacity_bytes)
            > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "flexFS metadata database disk usage above 85%"
          description: "Database disk usage on {{ $labels.instance }} is {{ $value | humanizePercentage }}."
      - alert: FlexFSMetaDiskUsageCritical
        expr: |
          sum by (instance) (flexfs_meta_db_disk_usage_bytes)
            / max by (instance) (flexfs_meta_db_folder_disk_capacity_bytes)
            > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "flexFS metadata database disk usage above 95%"
          description: "Database disk usage on {{ $labels.instance }} is {{ $value | humanizePercentage }}. Immediate action required."
      # 1073741824 bytes = 1 GiB
      - alert: FlexFSMetaHighCompactionDebt
        expr: flexfs_meta_db_compaction_estimated_debt_bytes > 1073741824
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "flexFS metadata database compaction debt is high"
          description: "Volume {{ $labels.volume_id }} has {{ $value | humanize1024 }}B of compaction debt."
  - name: flexfs-proxy
    rules:
      - alert: FlexFSProxyServerDown
        expr: up{job="flexfs-proxy"} == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "flexFS proxy server is down"
          description: "{{ $labels.instance }} has been unreachable for more than 2 minutes. Mount clients will fall back to direct storage access."
      - alert: FlexFSProxyHighRESTLatency
        expr: |
          histogram_quantile(0.99, sum(rate(flexfs_proxy_rest_duration_seconds_bucket[5m])) by (le, instance))
            > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "flexFS proxy server p99 REST latency is high"
          description: "{{ $labels.instance }} p99 REST latency has been above 500ms for 5 minutes."
      - alert: FlexFSProxyCacheNearQuota
        expr: |
          (flexfs_proxy_cache_clean_bytes + flexfs_proxy_cache_dirty_bytes)
            / flexfs_proxy_cache_disk_quota_bytes
            > 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "flexFS proxy cache is above 90% of quota"
          description: "{{ $labels.instance }} cache utilization is {{ $value | humanizePercentage }}."
      - alert: FlexFSProxyDirtyBlocksHigh
        expr: flexfs_proxy_cache_dirty_blocks > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "flexFS proxy has a large dirty block queue"
          description: "{{ $labels.instance }} has {{ $value }} dirty blocks pending writeback. Writeback may not be keeping up with write load."
      - alert: FlexFSProxyDiskCapacityLow
        expr: |
          (flexfs_proxy_cache_clean_bytes + flexfs_proxy_cache_dirty_bytes)
            / flexfs_proxy_cache_disk_capacity_bytes
            > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "flexFS proxy disk usage is high"
          description: "{{ $labels.instance }} is using {{ $value | humanizePercentage }} of disk capacity."
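
Before loading the rules, it can help to check that they fire as intended. The following is a minimal sketch of a promtool unit test for FlexFSMetaServerDown. The test file name (flexfs-alerts-test.yml) and the instance value are assumptions chosen for illustration, and the test expects flexfs-alerts.yml to sit in the same directory.

```yaml
# flexfs-alerts-test.yml (hypothetical file name)
rule_files:
  - flexfs-alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The metadata server reports up == 0 for eleven consecutive samples.
      - series: 'up{job="flexfs-meta", instance="meta-1:9100"}'
        values: '0+0x10'
    alert_rule_test:
      # After 5 minutes the 2m "for" clause has elapsed, so the alert fires.
      - eval_time: 5m
        alertname: FlexFSMetaServerDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: flexfs-meta
              instance: meta-1:9100
            exp_annotations:
              summary: "flexFS metadata server is down"
              description: "meta-1:9100 has been unreachable for more than 2 minutes."
```

Run it with `promtool test rules flexfs-alerts-test.yml`. To check syntax only, use `promtool check rules flexfs-alerts.yml`.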

Configure Prometheus to send alerts to Alertmanager:

prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093"
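
On the Alertmanager side, the `severity` label set by the rules can drive routing. The sketch below is one possible alertmanager.yml, not a flexFS-provided configuration: the receiver names and webhook URLs are hypothetical placeholders, and the inhibit rule is an optional choice that mutes the 85% disk warning while the 95% critical alert is firing for the same instance.

```yaml
# alertmanager.yml (sketch; receiver names and URLs are placeholders)
route:
  receiver: flexfs-default            # warnings and anything unmatched
  group_by: ["alertname", "instance"]
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: flexfs-oncall         # critical alerts go to the on-call receiver

receivers:
  - name: flexfs-default
    webhook_configs:
      - url: "http://hooks.example.internal/flexfs-warning"
  - name: flexfs-oncall
    webhook_configs:
      - url: "http://hooks.example.internal/flexfs-critical"

inhibit_rules:
  # Suppress the warning-level disk alert while the critical one is active.
  - source_matchers:
      - 'alertname="FlexFSMetaDiskUsageCritical"'
    target_matchers:
      - 'alertname="FlexFSMetaDiskUsageHigh"'
    equal: ["instance"]
```

The configuration can be validated with `amtool check-config alertmanager.yml` before Alertmanager is reloaded.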

| Alert | Immediate action |
| --- | --- |
| MetaServerDown | Check systemd status, server logs, network connectivity. Restart if needed. |
| MetaHighRPCLatency | Check disk I/O, database compaction debt, network latency. Consider increasing database memory. |
| MetaDiskUsageHigh | Add disk space or move the database folder to a larger volume. Check for excessive retention. |
| ProxyServerDown | Check systemd status, server logs. Mount clients will use direct storage in the meantime. |
| ProxyCacheNearQuota | Increase --diskQuota or add disk space. Eviction will keep the cache within quota. |
| ProxyDirtyBlocksHigh | Check storage backend health and latency. Consider increasing --writebackActive. |