Menu
    AiStor for AI/ML Training Data: Why Your Model Pipeline Shouldn't Start in S3

    AiStor for AI/ML Training Data: Why Your Model Pipeline Shouldn't Start in S3

    Bit Refinery TeamFebruary 26, 20266 min read

    Everybody starts in S3. It's the path of least resistance — you've got an AWS account, you dump your training data in a bucket, and you're off to the races. Fine for a prototype. But somewhere between your first successful training run and your fifth production model, something starts to feel wrong.

    The jobs are slow. The bills are climbing. Your data engineers are complaining about egress charges every time someone pulls a dataset to a new region. And your GPU cluster is sitting idle half the time waiting on data.

    This is the S3 trap. And it's more common than most teams want to admit.

    The Real Problem With S3 for Training Data

    Cost comparison chart between AWS S3 and AiStor for AI training data

    S3 is a great general-purpose object store. It's durable, it's accessible, and it integrates with basically everything. But it wasn't designed for the access patterns that AI/ML workloads demand — and that mismatch creates real, measurable pain.

    Throughput is the first issue. Large model training jobs need to stream massive datasets — we're talking hundreds of gigabytes to terabytes per run. S3's throughput is... fine. But "fine" doesn't cut it when you've got 8x H100s burning $30/hour waiting on data to load. The I/O bottleneck becomes the bottleneck, full stop.

    Egress fees are the second issue, and honestly the more insidious one. Every time your training cluster pulls data from S3 — especially if that cluster isn't in the exact same region — you're paying. At $0.09/GB, moving 50TB of training data in a month adds up to $4,500 in egress alone. That's before compute. Before storage. Before anything else.

    And then there's the latency question. Random access to small files (think image datasets with millions of individual JPEGs) hammers S3 in ways that flat throughput benchmarks don't capture. The per-request overhead adds up fast when you're doing millions of reads per training epoch.

    What AiStor Actually Is

    AiStor is MinIO's commercial platform built specifically for AI/ML workloads at scale — and we deploy it on bare metal at Bit Refinery. That last part matters a lot.

    At its core, AiStor gives you S3-compatible object storage with a few key differences from vanilla S3:

    • It runs on your hardware (or our hardware, depending on how you look at it), which means no egress fees when your training cluster is co-located
    • Throughput is dramatically higher — we're talking multi-hundred GB/s on properly configured NVMe clusters, not the shared, rate-limited throughput you get from a public cloud bucket
    • It's built for the access patterns AI workloads actually have — large sequential reads, parallel data loading across multiple workers, metadata-heavy operations on massive file counts

    The S3 compatibility piece is genuinely important. You don't have to rewrite your data loaders. PyTorch's DataLoader, Hugging Face datasets, Ray Data, NVIDIA DALI — they all speak S3. Point them at AiStor instead and they just work.

    The Co-location Angle

    Here's where things get interesting for teams running their own GPU hardware. If you're using our BYOGPU service — shipping your H100s or A100s to our Denver or Seattle facility — your training cluster and your storage can live on the same high-speed internal network.

    No egress. No cross-region hops. No throttling from a shared cloud fabric.

    Instead you get bare-metal NVMe storage connected to your GPUs over 100GbE. Data loading stops being the bottleneck. GPU utilization goes up. Training jobs finish faster. The math on that is pretty compelling when you're paying for GPU time by the month.

    Even if you're not co-locating GPUs, running AiStor on dedicated storage hardware in the same facility as your compute — whether that's on-prem or at a colo — eliminates the egress problem entirely.

    Concrete Numbers Worth Thinking About

    Let's say you've got a team doing serious ML work: 100TB of training data, multiple active experiments, a few training runs per week pulling 10-20TB each.

    On AWS S3:

    • Storage: ~$2,300/month (at $0.023/GB)
    • Egress (assuming 60TB/month out): ~$5,400/month
    • API request costs: another few hundred dollars depending on access patterns
    • Total: $7,000-8,000/month just for storage and data movement

    On AiStor at Bit Refinery:

    • Dedicated bare-metal storage node with NVMe, configured for your workload
    • $0 egress when your compute is co-located
    • Predictable flat monthly pricing
    • Typically 60-70% less than the equivalent cloud configuration

    And that's before you factor in the performance difference, which translates to faster iteration cycles and better GPU utilization.

    It's Not Just About Cost

    I want to be clear that this isn't purely a cost argument, even though the numbers are pretty damning. There's a real performance story here too.

    Data pipelines that bottleneck on storage slow down your entire ML iteration cycle. Slower iteration means fewer experiments. Fewer experiments means slower model improvement. In competitive ML environments — whether you're building internal tooling or shipping a product — iteration speed is everything.

    AiStor on bare metal with NVMe gives you storage that can actually keep up with modern GPU clusters. That's not a small thing.

    There's also a data governance angle. Keeping your training data on infrastructure you control (or that's explicitly dedicated to you) is a lot cleaner from a compliance and security standpoint than shared cloud storage. Healthcare data, financial data, anything with PII — the regulatory picture is simpler when you're not on a shared multi-tenant platform.

    When S3 Is Still Fine

    Look, I'm not going to pretend S3 is always the wrong answer. For small teams, early-stage projects, or workloads where data volumes are modest and egress is minimal, S3 is totally reasonable. The convenience is real.

    But if you're running production ML infrastructure, doing regular large-scale training runs, or managing datasets measured in tens of terabytes — the S3 convenience premium starts to look a lot less convenient when you're staring at the monthly bill.

    The Bottom Line

    Your model pipeline starts with data. If that data lives in a place that's slow, expensive to move, and not optimized for the access patterns your training jobs need — everything downstream suffers.

    AiStor on bare metal isn't a perfect fit for every team. But for organizations doing serious AI/ML work, it's worth a hard look at what you're actually spending on S3 and what you're leaving on the table in terms of performance.

    We help teams figure this out. If you want to talk through your current storage setup and whether a dedicated AiStor deployment makes sense, reach out to the Bit Refinery team — we're happy to dig into the numbers with you.

    Ready to Get Started?

    Contact us to learn more about our bare metal and GPU hosting solutions.