
# AI/ML Data Pipelines

This guide covers flexFS configuration for machine learning workflows including training data ingest, model checkpointing, and multi-GPU/multi-node distributed training.

| Workload | File Sizes | Access Pattern |
| --- | --- | --- |
| Training datasets (ImageNet, etc.) | Millions of small files or large archives | Random read, high IOPS |
| Large datasets (video, medical imaging) | GiB-scale files | Sequential read |
| Model checkpoints | 1-100 GiB | Sequential write, occasional read |
| TFRecord / WebDataset shards | 100 MiB - 1 GiB | Sequential read |
| Logs and metrics | Small, append-only | Sequential write |

## For Sharded Datasets (TFRecord, WebDataset)
```sh
configure.flexfs create volume \
  --name training-data \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 4MiB \
  --compression lz4
```

If your dataset consists of millions of small files (e.g., individual images):

```sh
configure.flexfs create volume \
  --name image-data \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 256KiB \
  --compression lz4
```

The smaller block size reduces wasted space when files are smaller than the block size.
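To see the effect, here is an illustrative sketch. It assumes, as a simplification, that each file is allocated a whole number of fixed-size blocks; the file sizes are hypothetical examples, not measurements:

```python
import math

KiB, MiB = 1024, 1024 * 1024

def wasted_bytes(file_size: int, block_size: int) -> int:
    """Internal fragmentation: bytes allocated minus bytes actually stored."""
    blocks = math.ceil(file_size / block_size)
    return blocks * block_size - file_size

# A typical 150 KiB JPEG under the two block sizes above:
print(wasted_bytes(150 * KiB, 4 * MiB))    # 4040704 bytes (~3.85 MiB) wasted
print(wasted_bytes(150 * KiB, 256 * KiB))  # 108544 bytes (106 KiB) wasted
```

With millions of such files, the 4 MiB block size would waste roughly 25x more space than the file data itself occupies.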

## For Model Checkpoints

```sh
configure.flexfs create volume \
  --name checkpoints \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 4MiB \
  --compression zstd \
  --retention 30d
```

Zstd provides better compression ratios than LZ4 for checkpoint data, which is written infrequently and read occasionally.

## Mounting on Training Nodes

```sh
mount.flexfs start training-data /mnt/data \
  --diskFolder /local-nvme/cache \
  --diskQuota 80% \
  --noATime
```

Key settings:

- `--diskFolder` on a local NVMe drive absorbs repeated reads of hot data (e.g., the same dataset across multiple epochs).
- `--diskQuota 80%` allows the cache to use most of the local SSD.
- `--noATime` eliminates metadata writes from reads.

## Distributed Multi-Node Training

Every training node mounts the same volume:

```sh
# On each node
mount.flexfs start training-data /mnt/data \
  --diskFolder /local-nvme/cache \
  --diskQuota 50% \
  --noATime
```

All nodes see the same filesystem. Data loaders on each node read different shards, and the local disk cache captures the working set per node.
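The shard split itself can be as simple as round-robin assignment by node rank. This is a minimal sketch (the shard paths and world size are hypothetical; libraries such as WebDataset provide their own splitting hooks):

```python
def shards_for_rank(shards, rank, world_size):
    """Give each training node a disjoint, round-robin subset of shards."""
    return shards[rank::world_size]

# Hypothetical shard layout on the shared mount:
shards = [f"/mnt/data/shards/train-{i:05d}.tar" for i in range(8)]

# Node 1 of 4 reads shards 1 and 5; no two nodes read the same shard.
print(shards_for_rank(shards, rank=1, world_size=4))
```

Because each node revisits only its own subset every epoch, the per-node disk cache only has to hold that subset, not the full dataset.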

PyTorch’s `DataLoader` with `num_workers > 0` spawns multiple reader processes. flexFS handles concurrent reads from the same mount point:

```python
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# `transform` is your preprocessing pipeline (e.g., from torchvision.transforms)
dataset = ImageFolder('/mnt/data/imagenet/train', transform=transform)
loader = DataLoader(dataset, batch_size=256, num_workers=8, pin_memory=True)
```

The FUSE `allow_other` option is enabled by default, so processes from any user can access the mount.

Write checkpoints directly to flexFS. For large models, the sequential write throughput is determined by the network path to storage (direct or via proxy):

```python
torch.save(model.state_dict(), '/mnt/checkpoints/epoch_10.pt')
```

For hybrid cloud deployments with on-prem compute, enable `--diskWriteback` or use a proxy group to mask write latency.
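If a job can be preempted mid-write, a common safeguard (not flexFS-specific) is to write the checkpoint to a temporary file and atomically rename it into place, so readers never observe a partial file. This sketch assumes flexFS provides POSIX rename semantics; verify that against your deployment:

```python
import os
import tempfile

def atomic_save(save_fn, path):
    """Write via a temp file in the same directory, then rename into place,
    so a crash mid-write never leaves a partial checkpoint at `path`."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        save_fn(tmp)
        os.replace(tmp, path)  # atomic rename on POSIX filesystems
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)

# Usage with PyTorch (model and path names are hypothetical):
# atomic_save(lambda p: torch.save(model.state_dict(), p),
#             '/mnt/checkpoints/epoch_10.pt')
```

The temp file must live on the same filesystem as the destination; `os.replace` across filesystems would fail rather than rename atomically.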

The Hugging Face datasets library uses memory-mapped Arrow files. Mount the cache directory on flexFS for shared access:

```python
import os

# Set before importing datasets so the library picks up the shared cache path.
os.environ["HF_DATASETS_CACHE"] = "/mnt/data/hf-cache"

import datasets
```

You can also pass `cache_dir="/mnt/data/hf-cache"` to `load_dataset` on a per-call basis.
## Best Practices

1. **Pre-shard your data:** Use formats like TFRecord or WebDataset instead of millions of small files. This reduces metadata overhead.
2. **Cache sizing:** Size the local disk cache to hold at least one epoch’s worth of data per node if possible.
3. **Proxy groups:** For multi-region training, deploy proxy groups near GPU clusters to avoid cross-region reads.
4. **Separate volumes:** Use different volumes for immutable training data (read-only tokens, long retention) and mutable checkpoints (write access, shorter retention).
5. **Read-only mounts:** Mount training data as read-only (`--ro`) on training nodes to eliminate any write-path overhead.
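Following the last practice, a read-only mount of the training volume might look like this (a sketch combining `--ro` with the cache flags from the mount example above; verify flag behavior against your mount.flexfs version):

```sh
mount.flexfs start training-data /mnt/data \
  --ro \
  --diskFolder /local-nvme/cache \
  --diskQuota 80% \
  --noATime
```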