
# AI/ML Data Pipelines

This guide covers flexFS configuration for machine learning workflows including training data ingest, model checkpointing, and multi-GPU/multi-node distributed training.

| Workload | File Sizes | Access Pattern |
| --- | --- | --- |
| Training datasets (ImageNet, etc.) | Millions of small files or large archives | Random read, high IOPS |
| Large datasets (video, medical imaging) | GiB-scale files | Sequential read |
| Model checkpoints | 1-100 GiB | Sequential write, occasional read |
| TFRecord / WebDataset shards | 100 MiB - 1 GiB | Sequential read |
| Logs and metrics | Small, append-only | Sequential write |

## For Sharded Datasets (TFRecord, WebDataset)
```sh
configure.flexfs create volume \
  --name training-data \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 4MiB \
  --compression lz4
```

If your dataset consists of millions of small files (e.g., individual images):

```sh
configure.flexfs create volume \
  --name image-data \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 256KiB \
  --compression lz4
```

The smaller block size reduces wasted space when files are smaller than the block size.
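To see the effect, here is an illustrative sketch. It assumes, as a simplification, that each file is allocated a whole number of fixed-size blocks; the file sizes are hypothetical examples, not measurements:

```python
import math

KiB, MiB = 1024, 1024 * 1024

def wasted_bytes(file_size: int, block_size: int) -> int:
    """Internal fragmentation: bytes allocated minus bytes actually stored."""
    blocks = math.ceil(file_size / block_size)
    return blocks * block_size - file_size

# A typical 150 KiB JPEG under the two block sizes above:
print(wasted_bytes(150 * KiB, 4 * MiB))    # 4040704 bytes (~3.85 MiB) wasted
print(wasted_bytes(150 * KiB, 256 * KiB))  # 108544 bytes (106 KiB) wasted
```

With millions of such files, the 4 MiB block size would waste roughly 25x more space than the file data itself occupies.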

## For Model Checkpoints

```sh
configure.flexfs create volume \
  --name checkpoints \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 4MiB \
  --compression zstd \
  --retention 30d
```

Zstd provides better compression ratios than LZ4 for checkpoint data, which is written infrequently and read occasionally.

## Mounting on Training Nodes

```sh
mount.flexfs start training-data /mnt/data \
  --diskFolder /local-nvme/cache \
  --diskQuota 80% \
  --noATime
```

Key settings:

- `--diskFolder` on a local NVMe drive absorbs repeated reads of hot data (e.g., the same dataset across multiple epochs).
- `--diskQuota 80%` allows the cache to use most of the local SSD.
- `--noATime` eliminates metadata writes from reads.

## Distributed Multi-Node Training

Every training node mounts the same volume:

```sh
# On each node
mount.flexfs start training-data /mnt/data \
  --diskFolder /local-nvme/cache \
  --diskQuota 50% \
  --noATime
```

All nodes see the same filesystem. Data loaders on each node read different shards, and the local disk cache captures the working set per node.
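The shard split itself can be as simple as round-robin assignment by node rank. This is a minimal sketch (the shard paths and world size are hypothetical; libraries such as WebDataset provide their own splitting hooks):

```python
def shards_for_rank(shards, rank, world_size):
    """Give each training node a disjoint, round-robin subset of shards."""
    return shards[rank::world_size]

# Hypothetical shard layout on the shared mount:
shards = [f"/mnt/data/shards/train-{i:05d}.tar" for i in range(8)]

# Node 1 of 4 reads shards 1 and 5; no two nodes read the same shard.
print(shards_for_rank(shards, rank=1, world_size=4))
```

Because each node revisits only its own subset every epoch, the per-node disk cache only has to hold that subset, not the full dataset.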

PyTorch’s `DataLoader` with `num_workers > 0` spawns multiple reader processes. flexFS handles concurrent reads from the same mount point:

```python
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# `transform` is your preprocessing pipeline (e.g., from torchvision.transforms)
dataset = ImageFolder('/mnt/data/imagenet/train', transform=transform)
loader = DataLoader(dataset, batch_size=256, num_workers=8, pin_memory=True)
```

The FUSE `allow_other` option is enabled by default, so processes from any user can access the mount.

Write checkpoints directly to flexFS. For large models, the sequential write throughput is determined by the network path to storage (direct or via proxy):

```python
torch.save(model.state_dict(), '/mnt/checkpoints/epoch_10.pt')
```

For hybrid cloud deployments with on-prem compute, enable `--diskWriteback` or use a proxy group to mask write latency.
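If a job can be preempted mid-write, a common safeguard (not flexFS-specific) is to write the checkpoint to a temporary file and atomically rename it into place, so readers never observe a partial file. This sketch assumes flexFS provides POSIX rename semantics; verify that against your deployment:

```python
import os
import tempfile

def atomic_save(save_fn, path):
    """Write via a temp file in the same directory, then rename into place,
    so a crash mid-write never leaves a partial checkpoint at `path`."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        save_fn(tmp)
        os.replace(tmp, path)  # atomic rename on POSIX filesystems
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)

# Usage with PyTorch (model and path names are hypothetical):
# atomic_save(lambda p: torch.save(model.state_dict(), p),
#             '/mnt/checkpoints/epoch_10.pt')
```

The temp file must live on the same filesystem as the destination; `os.replace` across filesystems would fail rather than rename atomically.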

The Hugging Face datasets library uses memory-mapped Arrow files. Mount the cache directory on flexFS for shared access:

```python
import os

# Set before importing datasets so the library picks up the shared cache path.
os.environ["HF_DATASETS_CACHE"] = "/mnt/data/hf-cache"

import datasets
```

You can also pass `cache_dir="/mnt/data/hf-cache"` to `load_dataset` on a per-call basis.
## Best Practices

1. **Pre-shard your data:** Use formats like TFRecord or WebDataset instead of millions of small files. This reduces metadata overhead.
2. **Cache sizing:** Size the local disk cache to hold at least one epoch’s worth of data per node if possible.
3. **Proxy groups:** For multi-region training, deploy proxy groups near GPU clusters to avoid cross-region reads.
4. **Separate volumes:** Use different volumes for immutable training data (read-only tokens, long retention) and mutable checkpoints (write access, shorter retention).
5. **Read-only mounts:** Mount training data as read-only (`--ro`) on training nodes to eliminate any write-path overhead.
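Following the last practice, a read-only mount of the training volume might look like this (a sketch combining `--ro` with the cache flags from the mount example above; verify flag behavior against your mount.flexfs version):

```sh
mount.flexfs start training-data /mnt/data \
  --ro \
  --diskFolder /local-nvme/cache \
  --diskQuota 80% \
  --noATime
```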