
Bioinformatics and HPC

This guide covers flexFS configuration optimized for bioinformatics pipelines (GATK, Cromwell, Nextflow, Snakemake) and HPC workloads that process large genomic data files.

| Format | Typical Size | Access Pattern |
|---|---|---|
| FASTQ | 10-100 GiB | Sequential read |
| BAM/CRAM | 5-50 GiB | Sequential and random read |
| VCF/gVCF | 100 MiB - 10 GiB | Sequential read, small writes |
| Reference genomes | 1-3 GiB | Random read (heavily cached) |
| Intermediate files | Variable | Write-once, read-once |

For large-file genomics workloads, use 4 MiB (default) or 8 MiB block size. Larger blocks reduce metadata overhead for sequential reads of multi-GiB files:

    configure.flexfs create volume \
      --name genomics \
      --metaStoreID 1 --blockStoreID 1 \
      --blockSize 4MiB \
      --compression lz4

| Block Size | Best For |
|---|---|
| 256 KiB | Many small files, random access |
| 1 MiB | Mixed workloads |
| 4 MiB (default) | Large files, sequential reads |
| 8 MiB | Very large files, streaming I/O |
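
To make the metadata trade-off concrete, the sketch below counts how many blocks a single file occupies at each block size in the table. It is plain shell arithmetic and assumes only that a file is split into fixed-size blocks. `blocks_for` is a hypothetical helper, not a flexFS command, and the 50 GiB file size is simply the upper end of the BAM range listed earlier.

```shell
#!/usr/bin/env bash
# blocks_for FILE_GIB BLOCK_KIB
# Prints how many fixed-size blocks a file of FILE_GIB GiB occupies,
# rounding up so that a partial final block is counted.
# (Hypothetical helper for illustration; not part of flexFS.)
blocks_for() {
  local file_kib=$(( $1 * 1024 * 1024 ))
  local block_kib=$2
  echo $(( (file_kib + block_kib - 1) / block_kib ))
}

# Compare the four block sizes from the table for a 50 GiB file.
for block_kib in 256 1024 4096 8192; do
  printf '%5d KiB blocks: %d\n' "$block_kib" "$(blocks_for 50 "$block_kib")"
done
```

For a 50 GiB file this works out to 204,800 blocks at 256 KiB versus 12,800 at 4 MiB, a 16-fold difference in the number of blocks involved in a full sequential read.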

LZ4 (default) offers the best balance of throughput and compression ratio for genomic data. BAM files are already compressed internally, so flexFS block-level compression yields only modest additional savings. Uncompressed FASTQ files compress well with LZ4.
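
The effect of compressing already-compressed data can be seen with a small experiment. The sketch below is an illustration under stated assumptions, not a flexFS benchmark: it uses `gzip` as a stand-in for both the file's internal compression and the block-level pass (flexFS uses LZ4, so actual ratios will differ), and the input is synthetic FASTQ-like text rather than real sequencing data.

```shell
#!/usr/bin/env bash
# Assumption: gzip stands in for both the internal compression of the file
# and the second, block-level pass. Real LZ4 ratios will differ.
workdir=$(mktemp -d)

# Generate 5,000 synthetic FASTQ-like records (random bases, constant quality).
awk 'BEGIN {
  srand(42)
  for (j = 0; j < 100; j++) qual = qual "I"
  for (i = 0; i < 5000; i++) {
    seq = ""
    for (j = 0; j < 100; j++) seq = seq substr("ACGT", int(rand() * 4) + 1, 1)
    printf "@read%d\n%s\n+\n%s\n", i, seq, qual
  }
}' > "$workdir/reads.fastq"

# First pass: compress the plain text. Second pass: compress the result again.
gzip -c "$workdir/reads.fastq" > "$workdir/once.gz"
gzip -c "$workdir/once.gz" > "$workdir/twice.gz"

raw=$(( $(wc -c < "$workdir/reads.fastq") ))
once=$(( $(wc -c < "$workdir/once.gz") ))
twice=$(( $(wc -c < "$workdir/twice.gz") ))
echo "raw: ${raw} bytes, compressed once: ${once} bytes, compressed twice: ${twice} bytes"

rm -rf "$workdir"
```

The first pass shrinks the text substantially, while the second pass leaves the size essentially unchanged, which is the same reason block-level compression adds little for BAM files.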

Set retention based on your compliance requirements:

    configure.flexfs update volume genomics --retention 30d

Use --retention forever for regulatory datasets that must never be purged.

    mount.flexfs start genomics /mnt/genomics \
      --diskFolder /local-nvme/cache \
      --diskQuota 50%

For HPC clusters where each node processes a subset of data:

    mount.flexfs start genomics /mnt/genomics \
      --diskFolder /tmp/flexfs-cache \
      --diskQuota 20G \
      --noATime

The --noATime flag eliminates access-time metadata updates, reducing metadata server load during read-heavy stages.
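
Before mounting with a fixed quota such as the 20G used above, it can be worth confirming that the node's cache filesystem actually has that much room. The following is a minimal sketch using standard `df` and `awk`. `has_free_gib` is a hypothetical helper, the check treats the quota as GiB (an assumption), and whether flexFS performs a similar check itself is not covered here.

```shell
#!/usr/bin/env bash
# has_free_gib DIR GIB
# Succeeds if the filesystem holding DIR has at least GIB GiB available.
# (Hypothetical helper for illustration; not part of flexFS.)
has_free_gib() {
  local avail_kib
  # df -P guarantees one line per filesystem; column 4 is available KiB.
  avail_kib=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ "$avail_kib" -ge $(( $2 * 1024 * 1024 )) ]
}

cache_parent=/tmp   # parent of the cache folder in the per-node example
quota_gib=20        # assumed to correspond to --diskQuota 20G

if has_free_gib "$cache_parent" "$quota_gib"; then
  echo "OK: ${cache_parent} has at least ${quota_gib} GiB free"
else
  echo "WARNING: ${cache_parent} has less than ${quota_gib} GiB free" >&2
fi
```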

Cromwell’s scattered tasks read shared reference files and per-sample BAMs:

  • Reference genomes: Cached after first access on each node. The default disk cache captures these.
  • Per-sample BAMs: Sequential reads benefit from block prefetching (enabled by default).
  • Output files: Written sequentially. LZ4 compression handles the output well.

Nextflow uses a work directory for intermediate files:

  • Mount the work directory on flexFS for shared access across nodes.
  • Use --noATime to avoid metadata overhead from Nextflow’s frequent file polls.
  • Consider a separate volume for scratch data with shorter retention.
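
One way to point Nextflow at a shared work directory is its `NXF_WORK` environment variable or the `-work-dir` run option. The sketch below assumes the volume is mounted at `/mnt/genomics` as in the mount examples. The `nextflow-work` subdirectory and `main.nf` are placeholder names.

```shell
#!/usr/bin/env bash
# Assumption: the flexFS volume is mounted at /mnt/genomics.
# The "nextflow-work" subdirectory name is hypothetical.
export NXF_WORK="/mnt/genomics/nextflow-work"
echo "Nextflow work directory: ${NXF_WORK}"

# Equivalent per-run form (main.nf is a placeholder pipeline script):
#   nextflow run main.nf -work-dir "${NXF_WORK}"
```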

Snakemake’s file-based dependency tracking generates many stat() calls:

  • The attribute cache (--attrValid, default 3600s) absorbs repeated stat calls.
  • For pipelines where files are written and immediately checked, the default 1-second entry cache is appropriate.
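
To see why file-based dependency tracking is stat-heavy, consider a simplified, make-style freshness check. This is an illustrative sketch and not Snakemake's actual implementation: `needs_rebuild` is a hypothetical helper that compares modification times, so every call inspects the output and each input once.

```shell
#!/usr/bin/env bash
# needs_rebuild OUTPUT INPUT...
# Succeeds (exit 0) if OUTPUT is missing or any INPUT is newer than it.
# Each test below reads file metadata, i.e. one stat per file checked.
# (Hypothetical helper; a simplification of mtime-based dependency tracking.)
needs_rebuild() {
  local output="$1"
  shift
  [ -e "$output" ] || return 0
  local input
  for input in "$@"; do
    [ "$input" -nt "$output" ] && return 0
  done
  return 1
}

# Demo with placeholder files: the input is older than the output.
demo=$(mktemp -d)
touch -t 202401010000 "$demo/sample.bam"
touch -t 202401020000 "$demo/sample.vcf"
if needs_rebuild "$demo/sample.vcf" "$demo/sample.bam"; then
  echo "sample.vcf is stale"
else
  echo "sample.vcf is up to date"
fi
rm -rf "$demo"
```

A workflow with thousands of rules repeats checks like this across all inputs and outputs, which is the traffic the attribute cache is meant to absorb.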
  1. Local SSD cache: Always enable disk caching on compute nodes. This is the single largest performance improvement for bioinformatics workloads.
  2. Prefetching: Block prefetching is on by default and significantly improves sequential read throughput. Do not disable it.
  3. Proxy groups: For multi-region HPC clusters, deploy proxy groups in each compute region to avoid cross-region data transfer costs.
  4. Separate volumes: Use separate volumes for reference data (long retention, read-heavy) and scratch/intermediate data (short retention, write-heavy).