# Bioinformatics and HPC
This guide covers flexFS configuration optimized for bioinformatics pipelines (GATK, Cromwell, Nextflow, Snakemake) and HPC workloads that process large genomic data files.
## Typical File Characteristics

| Format | Typical Size | Access Pattern |
|---|---|---|
| FASTQ | 10-100 GiB | Sequential read |
| BAM/CRAM | 5-50 GiB | Sequential and random read |
| VCF/gVCF | 100 MiB - 10 GiB | Sequential read, small writes |
| Reference genomes | 1-3 GiB | Random read (heavily cached) |
| Intermediate files | Variable | Write-once, read-once |
## Volume Configuration

### Block Size

For large-file genomics workloads, use the 4 MiB default or an 8 MiB block size. Larger blocks reduce metadata overhead for sequential reads of multi-GiB files:
```sh
configure.flexfs create volume \
  --name genomics \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 4MiB \
  --compression lz4
```

| Block Size | Best For |
|---|---|
| 256 KiB | Many small files, random access |
| 1 MiB | Mixed workloads |
| 4 MiB (default) | Large files, sequential reads |
| 8 MiB | Very large files, streaming I/O |
### Compression

LZ4 (default) offers the best balance of throughput and compression ratio for genomic data. BAM files are already compressed internally, so flexFS block-level compression provides modest additional savings. FASTQ files compress well with LZ4.
### Retention

Set retention based on your compliance requirements:

```sh
configure.flexfs update volume genomics --retention 30d
```

Use `--retention forever` for regulatory datasets that must never be purged.
## Mount Configuration

### Standard Genomics Pipeline Mount

```sh
mount.flexfs start genomics /mnt/genomics \
  --diskFolder /local-nvme/cache \
  --diskQuota 50%
```

### HPC Cluster Mount (per-node)
For HPC clusters where each node processes a subset of data:

```sh
mount.flexfs start genomics /mnt/genomics \
  --diskFolder /tmp/flexfs-cache \
  --diskQuota 20G \
  --noATime
```

The `--noATime` flag eliminates access-time metadata updates, reducing metadata server load during read-heavy stages.
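One way to apply the per-node mount automatically is a cluster prolog script that runs before each job. This is a sketch only: the Slurm prolog mechanism, the cache directory, and the quota value are assumptions, not flexFS or scheduler defaults.

```sh
#!/bin/bash
# Hypothetical Slurm prolog: mount the genomics volume on a compute
# node before the job starts. Paths and quota are illustrative.
CACHE_DIR=/tmp/flexfs-cache
MOUNT_POINT=/mnt/genomics

mkdir -p "$CACHE_DIR" "$MOUNT_POINT"

# Skip the mount if a previous job on this node already performed it.
if ! mountpoint -q "$MOUNT_POINT"; then
  mount.flexfs start genomics "$MOUNT_POINT" \
    --diskFolder "$CACHE_DIR" \
    --diskQuota 20G \
    --noATime
fi
```

A matching epilog could leave the mount in place, since later jobs on the same node benefit from the warm cache.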
## Pipeline-Specific Guidance

### GATK / Cromwell

Cromwell’s scattered tasks read shared reference files and per-sample BAMs:
- Reference genomes: Cached after first access on each node. The default disk cache captures these.
- Per-sample BAMs: Sequential reads benefit from block prefetching (enabled by default).
- Output files: Written sequentially. LZ4 compression handles the output well.
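Because reference genomes are cached after first access, some sites warm each node’s cache before a scatter wave starts. A minimal sketch, assuming hypothetical reference paths under the mount:

```sh
# Stream the reference and its index once to populate the node-local
# disk cache. The paths below are examples, not a flexFS convention.
cat /mnt/genomics/refs/GRCh38.fa \
    /mnt/genomics/refs/GRCh38.fa.fai > /dev/null
```

Subsequent scattered tasks on that node then read the reference from the local SSD cache rather than the backing store.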
### Nextflow

Nextflow uses a work directory for intermediate files:
- Mount the work directory on flexFS for shared access across nodes.
- Use `--noATime` to avoid metadata overhead from Nextflow’s frequent file polls.
- Consider a separate volume for scratch data with shorter retention.
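Pointing the work directory at the mount can be done per run with Nextflow’s own `-work-dir` option; the pipeline script and path here are placeholders:

```sh
# Place intermediate files on flexFS so every node in the cluster
# sees the same work directory. Paths are illustrative.
nextflow run main.nf -work-dir /mnt/genomics/work
```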
### Snakemake

Snakemake’s file-based dependency tracking generates many `stat()` calls:
- The attribute cache (`--attrValid`, default 3600 s) absorbs repeated `stat()` calls.
- For pipelines where files are written and immediately checked, the default 1-second entry cache is appropriate.
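If a Snakemake DAG re-stats thousands of files, you can set the attribute cache lifetime explicitly at mount time. The value shown simply restates the documented default, and the cache path is an assumption:

```sh
mount.flexfs start genomics /mnt/genomics \
  --diskFolder /local-nvme/cache \
  --attrValid 3600
```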
## Performance Tips

- Local SSD cache: Always enable disk caching on compute nodes. This is the single largest performance improvement for bioinformatics workloads.
- Prefetching: Block prefetching is on by default and significantly improves sequential read throughput. Do not disable it.
- Proxy groups: For multi-region HPC clusters, deploy proxy groups in each compute region to avoid cross-region data transfer costs.
- Separate volumes: Use separate volumes for reference data (long retention, read-heavy) and scratch/intermediate data (short retention, write-heavy).
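The separate-volumes tip might look like this in practice, reusing the commands shown earlier; the volume names, store IDs, and retention values are illustrative:

```sh
# Reference data: long retention, read-heavy.
configure.flexfs create volume \
  --name refs \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 4MiB \
  --compression lz4
configure.flexfs update volume refs --retention forever

# Scratch/intermediate data: short retention, write-heavy.
configure.flexfs create volume \
  --name scratch \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 4MiB
configure.flexfs update volume scratch --retention 7d
```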