Alerting

This page provides sample Prometheus alerting rules for monitoring flexFS infrastructure. Adapt the thresholds to your environment and SLAs.

Alert rules file

Save the rule groups from the sections below in a Prometheus rules file (e.g., flexfs-alerts.yml), then reference that file in your prometheus.yml:

rule_files:
  - "flexfs-alerts.yml"
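
The rules on this page select targets with `job="flexfs-meta"` and `job="flexfs-proxy"`, so the scrape job names in prometheus.yml need to match those values. A minimal sketch of how the pieces might fit together is shown here. The host names and ports are placeholders, not flexFS defaults, so substitute the addresses your servers actually expose metrics on:

```yaml
# prometheus.yml (sketch; target addresses are hypothetical)
global:
  evaluation_interval: 1m   # how often alerting rules are evaluated

rule_files:
  - "flexfs-alerts.yml"

scrape_configs:
  - job_name: "flexfs-meta"      # must match job="flexfs-meta" in the alert expressions
    static_configs:
      - targets: ["meta-1.example.internal:9100"]    # placeholder address
  - job_name: "flexfs-proxy"     # must match job="flexfs-proxy" in the alert expressions
    static_configs:
      - targets: ["proxy-1.example.internal:9100"]   # placeholder address
```

If your scrape jobs use different names, adjust the `job` matchers in the rules rather than renaming the jobs.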

Metadata server alerts

groups:
  - name: flexfs-meta
    rules:

      - alert: FlexFSMetaServerDown
        expr: up{job="flexfs-meta"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "flexFS metadata server is down"
          description: "{{ $labels.instance }} has been unreachable for more than 2 minutes."

      - alert: FlexFSMetaHighRPCLatency
        expr: |
          histogram_quantile(0.99, sum(rate(flexfs_meta_rpc_duration_seconds_bucket[5m])) by (le, instance))
          > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "flexFS metadata server p99 RPC latency is high"
          description: "{{ $labels.instance }} p99 RPC latency has been above 100ms for 5 minutes."

      - alert: FlexFSMetaNoActiveSessions
        expr: sum(flexfs_meta_sessions) == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "No active flexFS mount sessions"
          description: "No mount clients have been connected for 10 minutes. This may indicate a network or configuration issue."

      - alert: FlexFSMetaDiskUsageHigh
        expr: |
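          # Aggregate both sides by instance so the label sets match for the division.
          sum by (instance) (flexfs_meta_db_disk_usage_bytes)
          / max by (instance) (flexfs_meta_db_folder_disk_capacity_bytes)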
          > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "flexFS metadata database disk usage above 85%"
          description: "Database disk usage on {{ $labels.instance }} is {{ $value | humanizePercentage }}."

      - alert: FlexFSMetaDiskUsageCritical
        expr: |
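          # Aggregate both sides by instance so the label sets match for the division.
          sum by (instance) (flexfs_meta_db_disk_usage_bytes)
          / max by (instance) (flexfs_meta_db_folder_disk_capacity_bytes)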
          > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "flexFS metadata database disk usage above 95%"
          description: "Database disk usage on {{ $labels.instance }} is {{ $value | humanizePercentage }}. Immediate action required."

      - alert: FlexFSMetaHighCompactionDebt
        expr: flexfs_meta_db_compaction_estimated_debt_bytes > 1073741824  # 1 GiB
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "flexFS metadata database compaction debt is high"
          description: "Volume {{ $labels.volume_id }} has {{ $value | humanize1024 }} of compaction debt."
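
Rules like these can be checked before deployment with `promtool`. The sketch below is a unit test file for `FlexFSMetaServerDown`. It assumes the rules are saved as flexfs-alerts.yml in the same directory; the test file name and the `meta-1:9100` instance are hypothetical.

```yaml
# flexfs-alerts-test.yml (hypothetical file name)
# Run with: promtool test rules flexfs-alerts-test.yml
rule_files:
  - flexfs-alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One sample per minute: up for two samples, then down from t=2m onward.
      - series: 'up{job="flexfs-meta", instance="meta-1:9100"}'
        values: '1 1 0 0 0 0 0'
    alert_rule_test:
      # At 3m the target has been down for only 1 minute, so the alert is
      # still pending and nothing should be firing.
      - eval_time: 3m
        alertname: FlexFSMetaServerDown
      # At 6m the "for: 2m" hold period has elapsed and the alert fires.
      - eval_time: 6m
        alertname: FlexFSMetaServerDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: flexfs-meta
              instance: meta-1:9100
            exp_annotations:
              summary: "flexFS metadata server is down"
              description: "meta-1:9100 has been unreachable for more than 2 minutes."
```

`promtool check rules flexfs-alerts.yml` is a lighter-weight option that only validates syntax and expressions.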

Proxy server alerts

  # Continues the `groups:` list started in the metadata server section.
  - name: flexfs-proxy
    rules:

      - alert: FlexFSProxyServerDown
        expr: up{job="flexfs-proxy"} == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "flexFS proxy server is down"
          description: "{{ $labels.instance }} has been unreachable for more than 2 minutes. Mount clients will fall back to direct storage access."

      - alert: FlexFSProxyHighRESTLatency
        expr: |
          histogram_quantile(0.99, sum(rate(flexfs_proxy_rest_duration_seconds_bucket[5m])) by (le, instance))
          > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "flexFS proxy server p99 REST latency is high"
          description: "{{ $labels.instance }} p99 REST latency has been above 500ms for 5 minutes."

      - alert: FlexFSProxyCacheNearQuota
        expr: |
          (flexfs_proxy_cache_clean_bytes + flexfs_proxy_cache_dirty_bytes)
          / flexfs_proxy_cache_disk_quota_bytes
          > 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "flexFS proxy cache is above 90% of quota"
          description: "{{ $labels.instance }} cache utilization is {{ $value | humanizePercentage }}."

      - alert: FlexFSProxyDirtyBlocksHigh
        expr: flexfs_proxy_cache_dirty_blocks > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "flexFS proxy has a large dirty block queue"
          description: "{{ $labels.instance }} has {{ $value }} dirty blocks pending writeback. Writeback may not be keeping up with write load."

      - alert: FlexFSProxyDiskCapacityLow
        expr: |
          (flexfs_proxy_cache_clean_bytes + flexfs_proxy_cache_dirty_bytes)
          / flexfs_proxy_cache_disk_capacity_bytes
          > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "flexFS proxy cache is using more than 85% of disk capacity"
          description: "The cache on {{ $labels.instance }} is using {{ $value | humanizePercentage }} of disk capacity."

Alertmanager integration

Configure Prometheus to send alerts to Alertmanager:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093"
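
On the Alertmanager side, the `severity` labels set by the rules above can drive routing. The following alertmanager.yml is a sketch only: the receiver names and the webhook URL are hypothetical, and the inhibit rule is one possible way to silence the 85% disk warning once the 95% critical alert is firing for the same instance.

```yaml
# alertmanager.yml (sketch; receiver names and URL are placeholders)
route:
  receiver: default
  group_by: ['alertname', 'instance']
  routes:
    # Send critical flexFS alerts to the on-call receiver.
    - matchers:
        - severity="critical"
      receiver: oncall-pager

receivers:
  - name: default            # no integrations configured: alerts are accepted but not sent anywhere
  - name: oncall-pager
    webhook_configs:
      - url: "http://pager-gateway.example.internal/hook"   # hypothetical endpoint

inhibit_rules:
  # Suppress the warning-level disk alert while the critical one is active.
  - source_matchers:
      - alertname="FlexFSMetaDiskUsageCritical"
    target_matchers:
      - alertname="FlexFSMetaDiskUsageHigh"
    equal: ['instance']
```

`amtool check-config alertmanager.yml` can validate the file before Alertmanager is reloaded.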

Operational runbook references

Alert	Immediate action
FlexFSMetaServerDown	Check systemd status, server logs, and network connectivity. Restart the service if needed.
FlexFSMetaHighRPCLatency	Check disk I/O, database compaction debt, and network latency. Consider increasing database memory.
FlexFSMetaDiskUsageHigh	Add disk space or move the database folder to a larger volume. Check for excessive retention.
FlexFSProxyServerDown	Check systemd status and server logs. Mount clients will use direct storage in the meantime.
FlexFSProxyCacheNearQuota	Increase `--diskQuota` or add disk space. Eviction will keep the cache within quota.
FlexFSProxyDirtyBlocksHigh	Check storage backend health and latency. Consider increasing `--writebackActive`.
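
One common convention is to attach the relevant runbook entry to each alert as an annotation so that it appears in notifications. The sketch below does this for a single rule. The URL is hypothetical, and `runbook_url` is a naming convention rather than a field Prometheus interprets:

```yaml
groups:
  - name: flexfs-meta
    rules:
      - alert: FlexFSMetaServerDown
        expr: up{job="flexfs-meta"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "flexFS metadata server is down"
          description: "{{ $labels.instance }} has been unreachable for more than 2 minutes."
          # Hypothetical URL; point this at wherever your runbooks live.
          runbook_url: "https://wiki.example.internal/runbooks/flexfs#metaserverdown"
```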

Next steps

Metrics reference — full metrics catalog
Prometheus setup — scrape configuration
Grafana dashboards — visualization