Alerting
This page provides sample Prometheus alerting rules for monitoring flexFS infrastructure. Adapt the thresholds to your environment and SLAs.
Alert rules file
Add the following to a Prometheus rules file (e.g., flexfs-alerts.yml) and reference it in your prometheus.yml:

    rule_files:
      - "flexfs-alerts.yml"

Metadata server alerts

    groups:
      - name: flexfs-meta
        rules:
          # The metadata server scrape target is failing.
          - alert: FlexFSMetaServerDown
            expr: up{job="flexfs-meta"} == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "flexFS metadata server is down"
              description: "{{ $labels.instance }} has been unreachable for more than 2 minutes."

          # p99 RPC latency above 100 ms, per instance.
          - alert: FlexFSMetaHighRPCLatency
            expr: |
              histogram_quantile(0.99,
                sum(rate(flexfs_meta_rpc_duration_seconds_bucket[5m])) by (le, instance)
              ) > 0.1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "flexFS metadata server p99 RPC latency is high"
              description: "{{ $labels.instance }} p99 RPC latency has been above 100ms for 5 minutes."

          # No mount sessions across all metadata servers.
          - alert: FlexFSMetaNoActiveSessions
            expr: sum(flexfs_meta_sessions) == 0
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "No active flexFS mount sessions"
              description: "No mount clients have been connected for 10 minutes. This may indicate a network or configuration issue."

          # Aggregate by instance so both sides of the division match and the
          # result keeps the instance label used in the description. This assumes
          # the capacity metric exposes one series per instance.
          - alert: FlexFSMetaDiskUsageHigh
            expr: |
              sum by (instance) (flexfs_meta_db_disk_usage_bytes)
                / on (instance) flexfs_meta_db_folder_disk_capacity_bytes > 0.85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "flexFS metadata database disk usage above 85%"
              description: "Database disk usage on {{ $labels.instance }} is {{ $value | humanizePercentage }}."

          # Same ratio as above with a higher threshold and a shorter wait.
          - alert: FlexFSMetaDiskUsageCritical
            expr: |
              sum by (instance) (flexfs_meta_db_disk_usage_bytes)
                / on (instance) flexfs_meta_db_folder_disk_capacity_bytes > 0.95
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "flexFS metadata database disk usage above 95%"
              description: "Database disk usage on {{ $labels.instance }} is {{ $value | humanizePercentage }}. Immediate action required."

          # 1073741824 bytes = 1 GiB of estimated compaction debt.
          - alert: FlexFSMetaHighCompactionDebt
            expr: flexfs_meta_db_compaction_estimated_debt_bytes > 1073741824
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "flexFS metadata database compaction debt is high"
              description: "Volume {{ $labels.volume_id }} has {{ $value | humanize1024 }} of compaction debt."

Proxy server alerts

      # Append this group under the same top-level `groups:` key as flexfs-meta.
      - name: flexfs-proxy
        rules:
          - alert: FlexFSProxyServerDown
            expr: up{job="flexfs-proxy"} == 0
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: "flexFS proxy server is down"
              description: "{{ $labels.instance }} has been unreachable for more than 2 minutes. Mount clients will fall back to direct storage access."

          # p99 REST latency above 500 ms, per instance.
          - alert: FlexFSProxyHighRESTLatency
            expr: |
              histogram_quantile(0.99,
                sum(rate(flexfs_proxy_rest_duration_seconds_bucket[5m])) by (le, instance)
              ) > 0.5
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "flexFS proxy server p99 REST latency is high"
              description: "{{ $labels.instance }} p99 REST latency has been above 500ms for 5 minutes."

          # Cached bytes (clean + dirty) relative to the configured quota.
          - alert: FlexFSProxyCacheNearQuota
            expr: |
              (flexfs_proxy_cache_clean_bytes + flexfs_proxy_cache_dirty_bytes)
                / flexfs_proxy_cache_disk_quota_bytes > 0.90
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "flexFS proxy cache is above 90% of quota"
              description: "{{ $labels.instance }} cache utilization is {{ $value | humanizePercentage }}."

          - alert: FlexFSProxyDirtyBlocksHigh
            expr: flexfs_proxy_cache_dirty_blocks > 1000
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "flexFS proxy has a large dirty block queue"
              description: "{{ $labels.instance }} has {{ $value }} dirty blocks pending writeback. Writeback may not be keeping up with write load."

          # Cached bytes (clean + dirty) relative to the capacity of the cache disk.
          - alert: FlexFSProxyDiskCapacityLow
            expr: |
              (flexfs_proxy_cache_clean_bytes + flexfs_proxy_cache_dirty_bytes)
                / flexfs_proxy_cache_disk_capacity_bytes > 0.85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "flexFS proxy disk usage is high"
              description: "{{ $labels.instance }} is using {{ $value | humanizePercentage }} of disk capacity."

Alertmanager integration
Configure Prometheus to send alerts to Alertmanager:

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - "alertmanager:9093"

Operational runbook references
| Alert | Immediate action |
|---|---|
| FlexFSMetaServerDown | Check systemd status, server logs, and network connectivity. Restart if needed. |
| FlexFSMetaHighRPCLatency | Check disk I/O, database compaction debt, and network latency. Consider increasing database memory. |
| FlexFSMetaDiskUsageHigh | Add disk space or move the database folder to a larger volume. Check for excessive retention. |
| FlexFSProxyServerDown | Check systemd status and server logs. Mount clients will use direct storage in the meantime. |
| FlexFSProxyCacheNearQuota | Increase --diskQuota or add disk space. Eviction will keep the cache within quota. |
| FlexFSProxyDirtyBlocksHigh | Check storage backend health and latency. Consider increasing --writebackActive. |
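
To make the actions in the table above reachable from a firing alert, one common approach is to attach a runbook link to each rule as an annotation. The sketch below adds one to the FlexFSMetaServerDown rule from this page. The `runbook_url` annotation name is a community convention, not something Prometheus itself interprets, and the URL is a hypothetical placeholder for wherever your runbooks live.

```yaml
groups:
  - name: flexfs-meta
    rules:
      - alert: FlexFSMetaServerDown
        expr: up{job="flexfs-meta"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "flexFS metadata server is down"
          description: "{{ $labels.instance }} has been unreachable for more than 2 minutes."
          # Hypothetical URL: point this at your own runbook for this alert.
          runbook_url: "https://runbooks.example.com/flexfs/meta-server-down"
```

Annotations are passed through to Alertmanager with the alert, so a notification template can render the link next to the summary and description.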
Next steps
- Metrics reference — full metrics catalog
- Prometheus setup — scrape configuration
- Grafana dashboards — visualization