# Bioinformatics and HPC
This guide covers flexFS configuration optimized for bioinformatics pipelines (GATK, Cromwell, Nextflow, Snakemake) and HPC workloads that process large genomic data files.
## Typical File Characteristics

| Format | Typical Size | Access Pattern |
|---|---|---|
| FASTQ | 10-100 GiB | Sequential read |
| BAM/CRAM | 5-50 GiB | Sequential and random read |
| VCF/gVCF | 100 MiB - 10 GiB | Sequential read, small writes |
| Reference genomes | 1-3 GiB | Random read (heavily cached) |
| Intermediate files | Variable | Write-once, read-once |
## Volume Configuration

### Block Size

For large-file genomics workloads, use the 4 MiB default or an 8 MiB block size. Larger blocks reduce metadata overhead for sequential reads of multi-GiB files:
```sh
configure.flexfs create volume \
  --name genomics \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 4MiB \
  --compression lz4
```

| Block Size | Best For |
|---|---|
| 256 KiB | Many small files, random access |
| 1 MiB | Mixed workloads |
| 4 MiB (default) | Large files, sequential reads |
| 8 MiB | Very large files, streaming I/O |
### Compression

LZ4 (default) offers the best balance of throughput and compression ratio for genomic data. BAM files are already compressed internally, so flexFS block-level compression provides modest additional savings. FASTQ files compress well with LZ4.
### Retention

Set retention based on your compliance requirements:

```sh
configure.flexfs update volume genomics --retention 30d
```

Use `--retention forever` for regulatory datasets that must never be purged.
## Mount Configuration

### Standard Genomics Pipeline Mount

```sh
mount.flexfs start genomics /mnt/genomics \
  --diskFolder /local-nvme/cache \
  --diskQuota 50%
```

### HPC Cluster Mount (per-node)
For HPC clusters where each node processes a subset of data:

```sh
mount.flexfs start genomics /mnt/genomics \
  --diskFolder /tmp/flexfs-cache \
  --diskQuota 20G \
  --noATime
```

The `--noATime` flag eliminates access-time metadata updates, reducing metadata server load during read-heavy stages.
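One way to apply the per-node mount automatically is a cluster prolog script that runs before each job. This is a sketch only: the Slurm prolog mechanism, the cache directory, and the quota value are assumptions, not flexFS or scheduler defaults.

```sh
#!/bin/bash
# Hypothetical Slurm prolog: mount the genomics volume on a compute
# node before the job starts. Paths and quota are illustrative.
CACHE_DIR=/tmp/flexfs-cache
MOUNT_POINT=/mnt/genomics

mkdir -p "$CACHE_DIR" "$MOUNT_POINT"

# Skip the mount if a previous job on this node already performed it.
if ! mountpoint -q "$MOUNT_POINT"; then
  mount.flexfs start genomics "$MOUNT_POINT" \
    --diskFolder "$CACHE_DIR" \
    --diskQuota 20G \
    --noATime
fi
```

A matching epilog could leave the mount in place, since later jobs on the same node benefit from the warm cache.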
## Pipeline-Specific Guidance

### GATK / Cromwell

Cromwell’s scattered tasks read shared reference files and per-sample BAMs:
- Reference genomes: Cached after first access on each node. The default disk cache captures these.
- Per-sample BAMs: Sequential reads benefit from block prefetching (enabled by default).
- Output files: Written sequentially. LZ4 compression handles the output well.
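Because reference genomes are cached after first access, some sites warm each node’s cache before a scatter wave starts. A minimal sketch, assuming hypothetical reference paths under the mount:

```sh
# Stream the reference and its index once to populate the node-local
# disk cache. The paths below are examples, not a flexFS convention.
cat /mnt/genomics/refs/GRCh38.fa \
    /mnt/genomics/refs/GRCh38.fa.fai > /dev/null
```

Subsequent scattered tasks on that node then read the reference from the local SSD cache rather than the backing store.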
### Nextflow

Nextflow uses a work directory for intermediate files:
- Mount the work directory on flexFS for shared access across nodes.
- Use `--noATime` to avoid metadata overhead from Nextflow’s frequent file polls.
- Consider a separate volume for scratch data with shorter retention.
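Pointing the work directory at the mount can be done per run with Nextflow’s own `-work-dir` option; the pipeline script and path here are placeholders:

```sh
# Place intermediate files on flexFS so every node in the cluster
# sees the same work directory. Paths are illustrative.
nextflow run main.nf -work-dir /mnt/genomics/work
```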
### Snakemake

Snakemake’s file-based dependency tracking generates many `stat()` calls:
- The attribute cache (`--attrValid`, default 3600 s) absorbs repeated `stat()` calls.
- For pipelines where files are written and immediately checked, the default 1-second entry cache is appropriate.
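If a Snakemake DAG re-stats thousands of files, you can set the attribute cache lifetime explicitly at mount time. The value shown simply restates the documented default, and the cache path is an assumption:

```sh
mount.flexfs start genomics /mnt/genomics \
  --diskFolder /local-nvme/cache \
  --attrValid 3600
```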
## Performance Tips

- Local SSD cache: Always enable disk caching on compute nodes. This is the single largest performance improvement for bioinformatics workloads.
- Prefetching: Block prefetching is on by default and significantly improves sequential read throughput. Do not disable it.
- Proxy groups: For multi-region HPC clusters, deploy proxy groups in each compute region to avoid cross-region data transfer costs.
- Separate volumes: Use separate volumes for reference data (long retention, read-heavy) and scratch/intermediate data (short retention, write-heavy).
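The separate-volumes tip might look like this in practice, reusing the commands shown earlier; the volume names, store IDs, and retention values are illustrative:

```sh
# Reference data: long retention, read-heavy.
configure.flexfs create volume \
  --name refs \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 4MiB \
  --compression lz4
configure.flexfs update volume refs --retention forever

# Scratch/intermediate data: short retention, write-heavy.
configure.flexfs create volume \
  --name scratch \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 4MiB
configure.flexfs update volume scratch --retention 7d
```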