# AI/ML Data Pipelines
This guide covers flexFS configuration for machine learning workflows, including training-data ingest, model checkpointing, and multi-GPU/multi-node distributed training.
## Common AI/ML File Patterns

| Workload | File Sizes | Access Pattern |
|---|---|---|
| Training datasets (ImageNet, etc.) | Millions of small files or large archives | Random read, high IOPS |
| Large datasets (video, medical imaging) | GiB-scale files | Sequential read |
| Model checkpoints | 1-100 GiB | Sequential write, occasional read |
| TFRecord / WebDataset shards | 100 MiB - 1 GiB | Sequential read |
| Logs and metrics | Small, append-only | Sequential write |
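The "large archives" and "shards" rows above both point at the same pattern: packing many small samples into a few large sequential files. As a hedged illustration (not a flexFS feature — the shard size and naming scheme here are assumptions), the sketch below packs `(name, bytes)` samples into fixed-size tar shards, the layout WebDataset reads:

```python
import io
import os
import tarfile

def write_shards(samples, out_dir, shard_bytes=100 * 2**20):
    """Pack (name, bytes) samples into sequential tar shards of
    roughly `shard_bytes` each (WebDataset-style layout)."""
    os.makedirs(out_dir, exist_ok=True)
    shard_idx, written, tar = 0, 0, None
    paths = []
    for name, data in samples:
        # Roll over to a new shard once the current one is full.
        if tar is None or written >= shard_bytes:
            if tar is not None:
                tar.close()
            path = os.path.join(out_dir, f"shard-{shard_idx:06d}.tar")
            paths.append(path)
            tar = tarfile.open(path, "w")
            shard_idx, written = shard_idx + 1, 0
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
        written += len(data)
    if tar is not None:
        tar.close()
    return paths
```

Reading a shard sequentially then amortizes one metadata lookup over thousands of samples, which is where the "reduces metadata overhead" advice later in this guide comes from.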
## Volume Configuration

### For Sharded Datasets (TFRecord, WebDataset)

```sh
configure.flexfs create volume \
  --name training-data \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 4MiB \
  --compression lz4
```

### For Many Small Files (ImageNet-style)

If your dataset consists of millions of small files (e.g., individual images):

```sh
configure.flexfs create volume \
  --name image-data \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 256KiB \
  --compression lz4
```

The smaller block size reduces wasted space when files are smaller than the block size.
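The waste argument is simple arithmetic, sketched below with the two block sizes above; the 150 KiB average image size is an assumption for illustration:

```python
def wasted_bytes(file_size, block_size):
    """Slack in the last block when a file is stored in fixed-size
    blocks (0 when the file fills its blocks exactly)."""
    remainder = file_size % block_size
    return 0 if remainder == 0 else block_size - remainder

avg_image = 150 * 1024  # assumed 150 KiB average image

print(wasted_bytes(avg_image, 4 * 2**20) // 1024)   # KiB of slack per file at 4 MiB blocks
print(wasted_bytes(avg_image, 256 * 1024) // 1024)  # KiB of slack per file at 256 KiB blocks
```

At 4 MiB blocks, each 150 KiB file strands nearly a full block; at 256 KiB blocks the slack drops to roughly 106 KiB per file, which adds up quickly across millions of images.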
### For Checkpoints

```sh
configure.flexfs create volume \
  --name checkpoints \
  --metaStoreID 1 --blockStoreID 1 \
  --blockSize 4MiB \
  --compression zstd \
  --retention 30d
```

Zstd provides better compression ratios than LZ4 for checkpoint data, which is written infrequently and read occasionally.
## Mount Configuration

### GPU Training Node

```sh
mount.flexfs start training-data /mnt/data \
  --diskFolder /local-nvme/cache \
  --diskQuota 80% \
  --noATime
```

Key settings:

- `--diskFolder` on a local NVMe drive absorbs repeated reads of hot data (e.g., the same dataset across multiple epochs).
- `--diskQuota 80%` allows the cache to use most of the local SSD.
- `--noATime` eliminates metadata writes from reads.
### Multi-Node Distributed Training

Every training node mounts the same volume:

```sh
# On each node
mount.flexfs start training-data /mnt/data \
  --diskFolder /local-nvme/cache \
  --diskQuota 50% \
  --noATime
```

All nodes see the same filesystem. Data loaders on each node read different shards, and the local disk cache captures the working set per node.
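The "different shards per node" division of labor can be sketched as a plain round-robin split by rank — a simplified stand-in for what `torch.utils.data.DistributedSampler` or WebDataset's shard splitting does; in a real job, `rank` and `world_size` would come from the launcher, not be hard-coded:

```python
def shards_for_rank(shards, rank, world_size):
    """Round-robin split of a shard list across ranks, so each
    training process reads a disjoint subset of the shared mount."""
    return shards[rank::world_size]

# Hypothetical 8-shard dataset split across an assumed 4-process job.
shards = [f"/mnt/data/shard-{i:06d}.tar" for i in range(8)]
for rank in range(4):
    print(rank, shards_for_rank(shards, rank, 4))
```

Because the subsets are disjoint, each node's local disk cache only ever holds its own slice of the dataset, which is what makes the per-node cache sizing advice below workable.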
## Framework Integration

### PyTorch DataLoader

PyTorch's `DataLoader` with `num_workers > 0` spawns multiple reader processes. FlexFS handles concurrent reads from the same mount point:

```python
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
from torchvision.transforms import ToTensor

transform = ToTensor()  # substitute your own transform pipeline
dataset = ImageFolder('/mnt/data/imagenet/train', transform=transform)
loader = DataLoader(dataset, batch_size=256, num_workers=8, pin_memory=True)
```

The FUSE `allow_other` option is enabled by default, allowing all user processes to access the mount.
### Checkpoint Saving

Write checkpoints directly to flexFS. For large models, the sequential write throughput is determined by the network path to storage (direct or via proxy):

```python
torch.save(model.state_dict(), '/mnt/checkpoints/epoch_10.pt')
```

For hybrid cloud deployments with on-prem compute, enable `--diskWriteback` or use a proxy group to mask write latency.
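On a shared filesystem, a crash mid-save can leave a truncated checkpoint visible to every node. A common mitigation (a general POSIX pattern, not a flexFS-specific feature) is to write to a temporary name and rename into place; the sketch below uses plain bytes, but `torch.save` can target `tmp_path` the same way:

```python
import os

def atomic_write(path, data: bytes):
    """Write to a sibling temp file, flush to storage, then rename
    into place. On POSIX-compliant filesystems os.replace is atomic,
    so readers see either the old file or the complete new one."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)
```

Keeping the temp file in the same directory as the final path ensures the rename stays within one filesystem, which is what makes it a single atomic operation.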
### Hugging Face / Datasets Library

The Hugging Face `datasets` library uses memory-mapped Arrow files. Mount the cache directory on flexFS for shared access:

```python
import datasets

datasets.config.HF_DATASETS_CACHE = "/mnt/data/hf-cache"
```

## Performance Tips

- Pre-shard your data: Use formats like TFRecord or WebDataset instead of millions of small files. This reduces metadata overhead.
- Cache sizing: Size the local disk cache to hold at least one epoch’s worth of data per node if possible.
- Proxy groups: For multi-region training, deploy proxy groups near GPU clusters to avoid cross-region reads.
- Separate volumes: Use different volumes for immutable training data (read-only tokens, long retention) and mutable checkpoints (write access, shorter retention).
- Read-only mounts: Mount training data as read-only (`--ro`) on training nodes to eliminate any write-path overhead.
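The cache-sizing tip above reduces to simple arithmetic once shards are split evenly across nodes; the dataset size, node count, and NVMe capacity below are assumptions for illustration, and the helper name is hypothetical:

```python
def min_disk_quota_pct(dataset_gib, nodes, nvme_gib):
    """Smallest --diskQuota percentage of local NVMe that still holds
    one epoch's per-node share of an evenly sharded dataset."""
    per_node_gib = dataset_gib / nodes
    return min(100, round(100 * per_node_gib / nvme_gib))

# Assumed: 6 TiB dataset, 8 nodes, 1.92 TiB local NVMe per node.
print(min_disk_quota_pct(dataset_gib=6000, nodes=8, nvme_gib=1920))
```

If the result comes out near or above 100%, the working set will not fit and the cache will churn every epoch — a signal to add nodes, shrink the dataset slice, or accept re-reads from the block store.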