
Bioinformatics and HPC

This guide covers flexFS configuration optimized for bioinformatics pipelines (GATK, Cromwell, Nextflow, Snakemake) and HPC workloads that process large genomic data files.

| Format | Typical Size | Access Pattern |
| --- | --- | --- |
| FASTQ | 10-100 GiB | Sequential read |
| BAM/CRAM | 5-50 GiB | Sequential and random read |
| VCF/gVCF | 100 MiB - 10 GiB | Sequential read, small writes |
| Reference genomes | 1-3 GiB | Random read (heavily cached) |
| Intermediate files | Variable | Write-once, read-once |

For large-file genomics workloads, use a 4 MiB (default) or 8 MiB block size. Larger blocks reduce metadata overhead for sequential reads of multi-GiB files:

```sh
configure.flexfs create volume \
  --name genomics \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 4MiB \
  --compression lz4
```

| Block Size | Best For |
| --- | --- |
| 256 KiB | Many small files, random access |
| 1 MiB | Mixed workloads |
| 4 MiB (default) | Large files, sequential reads |
| 8 MiB | Very large files, streaming I/O |
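The metadata savings from larger blocks are easy to quantify: the number of blocks a file occupies scales inversely with block size. A quick illustration (the 32 GiB file size is arbitrary):

```sh
# Blocks needed to store a 32 GiB BAM at each block size. Fewer blocks
# means fewer block-map entries and fewer requests per sequential scan.
file_mib=$((32 * 1024))   # 32 GiB expressed in MiB

for bs in 1 4 8; do
  echo "${bs} MiB blocks: $((file_mib / bs))"
done
# 1 MiB blocks: 32768
# 4 MiB blocks: 8192
# 8 MiB blocks: 4096
```

Moving from the 4 MiB default to 8 MiB halves the block count for the same file.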

LZ4 (default) offers the best balance of throughput and compression ratio for genomic data. BAM files are already compressed internally, so flexFS block-level compression provides modest additional savings. FASTQ files compress well with LZ4.
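For a volume that will hold only internally compressed formats such as BAM/CRAM, it may be worth skipping block-level compression altogether; this sketch assumes the CLI accepts a `none` compression mode (verify with `configure.flexfs create volume --help`):

```sh
# Assumption: "--compression none" disables block-level compression,
# avoiding wasted CPU on data that is already compressed internally.
configure.flexfs create volume \
  --name bam-archive \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 8MiB \
  --compression none
```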

Set retention based on your compliance requirements:

```sh
configure.flexfs update volume genomics --retention 30d
```

Use --retention forever for regulatory datasets that must never be purged.
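For example, to switch the volume created above to indefinite retention:

```sh
configure.flexfs update volume genomics --retention forever
```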

On compute nodes with local NVMe, mount with a generous disk cache:

```sh
mount.flexfs start genomics /mnt/genomics \
  --diskFolder /local-nvme/cache \
  --diskQuota 50%
```

For HPC clusters where each node processes a subset of data:

```sh
mount.flexfs start genomics /mnt/genomics \
  --diskFolder /tmp/flexfs-cache \
  --diskQuota 20G \
  --noATime
```

The --noATime flag eliminates access-time metadata updates, reducing metadata server load during read-heavy stages.

Cromwell’s scattered tasks read shared reference files and per-sample BAMs:

  • Reference genomes: Cached after first access on each node. The default disk cache captures these.
  • Per-sample BAMs: Sequential reads benefit from block prefetching (enabled by default).
  • Output files: Written sequentially. LZ4 compression handles the output well.
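One way to put Cromwell's execution tree on the mount is to override the backend execution root at launch; the `Local` backend name, jar path, and workflow file below are placeholders for your deployment:

```sh
# Share the execution root across nodes via the flexFS mount.
# "Local", cromwell.jar, and workflow.wdl are illustrative names.
java -Dbackend.providers.Local.config.root=/mnt/genomics/cromwell-executions \
     -jar cromwell.jar run workflow.wdl
```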

Nextflow uses a work directory for intermediate files:

  • Mount the work directory on flexFS for shared access across nodes.
  • Use --noATime to avoid metadata overhead from Nextflow’s frequent file polls.
  • Consider a separate volume for scratch data with shorter retention.
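For instance, pointing the work directory at the mount from the command line (`main.nf` is a placeholder; `-work-dir`, also abbreviated `-w`, is Nextflow's standard flag):

```sh
# Intermediate task outputs land on the shared volume, so resumed or
# downstream tasks on other nodes can read them directly.
nextflow run main.nf -work-dir /mnt/genomics/work
```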

Snakemake’s file-based dependency tracking generates many stat() calls:

  • The attribute cache (--attrValid, default 3600s) absorbs repeated stat calls.
  • For pipelines where files are written and immediately checked, the default 1-second entry cache is appropriate.
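The attribute cache is set at mount time; a sketch reusing the mount command from above (whether `--attrValid` takes a bare number of seconds is an assumption, so check `mount.flexfs --help`):

```sh
# Keep the 3600 s default explicit; lowering it (e.g. to 60) trades
# extra metadata traffic for faster visibility of remote changes.
mount.flexfs start genomics /mnt/genomics \
  --diskFolder /tmp/flexfs-cache \
  --attrValid 3600
```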

General recommendations:

  1. Local SSD cache: Always enable disk caching on compute nodes. This is the single largest performance improvement for bioinformatics workloads.
  2. Prefetching: Block prefetching is on by default and significantly improves sequential read throughput. Do not disable it.
  3. Proxy groups: For multi-region HPC clusters, deploy proxy groups in each compute region to avoid cross-region data transfer costs.
  4. Separate volumes: Use separate volumes for reference data (long retention, read-heavy) and scratch/intermediate data (short retention, write-heavy).
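The split in point 4 can be expressed with the commands shown earlier; the `7d` scratch retention is an example value, not a recommendation:

```sh
# Long-lived, read-heavy reference data: never purged.
configure.flexfs create volume \
  --name references \
  --metaStoreID 1 --blockStoreID 1 \
  --compression lz4
configure.flexfs update volume references --retention forever

# Short-lived scratch/intermediate data: purged after a week.
configure.flexfs create volume \
  --name scratch \
  --metaStoreID 1 --blockStoreID 1
configure.flexfs update volume scratch --retention 7d
```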