What models can I run?

Any open-source or open-weight model available on Hugging Face or other model registries. This includes Qwen, Llama, Mistral, Gemma, DeepSeek, Phi, and hundreds more. You can also upload your own custom or fine-tuned models directly through Foundry. We deploy GGUF, GPTQ, AWQ, and SafeTensors formats.

How fast is inference?

GPU-accelerated deployments deliver 50–150+ tokens/sec on 7B–14B models — fast enough for real-time chat with streaming responses. Larger models (30B–70B) run at 20–60 tokens/sec. Exact speed depends on model size, quantization, and concurrent load. CPU-only servers are also available for lighter workloads or cost-sensitive deployments.

How does this compare to running LLMs on AWS / GCP / Azure?

Cloud providers charge for compute (GPU instances), storage, and egress. A single A100 GPU instance on AWS costs $3–5/hour ($2,200–$3,600/mo). Add egress fees for inference traffic and you're looking at significantly higher costs. Bit Refinery offers dedicated hardware with GPU acceleration, zero egress, and fixed monthly pricing.

How does this compare to OpenAI / Anthropic / Google APIs?

Cloud APIs charge per token — which works at low volume but becomes expensive fast. At 500K tokens/day, GPT-4o costs ~$2,250/mo. Bit Refinery charges a flat monthly rate with unlimited tokens. The trade-off: cloud APIs offer frontier models. Bit Refinery offers open-source models that handle 80–90% of focused enterprise tasks at a fraction of the cost with complete data privacy.

Is this HIPAA / SOC 2 compliant?

Our data centers maintain SOC 2 Type II, HIPAA, PCI DSS Level 1, ISO 27001, and FedRAMP certifications. Private LLM hosting on dedicated hardware with private networking, encryption at rest (AES-256), and encryption in transit (TLS 1.3) provides the infrastructure controls required for regulated workloads.

Can I run multiple models?

Yes. Deploy, start, and stop models on demand through Foundry's model library. Run different models for different tasks, A/B test models, or build multi-step agent pipelines with specialized models at each step.

What if I need help choosing a model?

Our team recommends models based on your use case, deploys and configures the inference stack, and provides ongoing support. We handle the infrastructure — you focus on building your application.

Can I fine-tune models?

Yes. We can help architect a setup that supports both training and serving, or deploy models you've already fine-tuned elsewhere.

What's the minimum commitment?

Month-to-month. No long-term contracts required on any deployment.

Every deployment is custom-built. Timeline depends on your specific requirements — contact us and we'll scope it out. Once provisioned, you get immediate access to Foundry to deploy models, generate API keys, and start building.

How do I manage my deployment?

Every deployment includes access to Foundry, Bit Refinery's management portal. From your browser you can browse and deploy models, generate and rotate API keys, restrict access with an IP whitelist, test models in a built-in chat playground, and monitor usage metrics and server health. No CLI or DevOps experience required.

Do I own my data and models?

Yes. Your data, your models, your fine-tuned weights — they're yours. We provide infrastructure. If you leave, you take everything with you.

PRIVATE LLM HOSTING

Run AI Models on Your Own Infrastructure

Deploy open-source language models on your own dedicated server in our data centers. Full root access. Your data never leaves. Zero egress fees. No per-token API costs.

99.99% Uptime SLA

$0 Egress Fees

Data Never Leaves

Denver & Seattle DCs

PRIVATE LLM INFERENCE

SECURE

MODEL: QWEN3-8B127 tok/s

RESPONSE STREAM

Hello

GPU87%

VRAM5.2GB

LATENCY12ms

EGRESS$0

What is Private LLM Hosting?

Private LLM hosting means running open-source AI language models — like Llama, Qwen, Mistral, and Gemma — on dedicated hardware that you control. Instead of sending every prompt to a cloud API and paying per token, your model runs on infrastructure in our data center with dedicated compute, memory, and GPU resources. Your data never touches a third-party API. There are no per-token charges, no rate limits, and no surprise bills.

Data Privacy

Every prompt sent to OpenAI, Anthropic, or Google touches external servers. For companies handling sensitive data, that's a compliance risk.

Cost Predictability

Per-token pricing scales unpredictably. A single busy chatbot can generate thousands of dollars in monthly API costs.

No Vendor Lock-in

Open-source models run on standard hardware. Switch models anytime without changing providers.

Zero Egress Fees

Inference responses are data transfer. On AWS or GCP, that's metered egress. On Bit Refinery, it's $0.

72%

of enterprises cite data privacy as the top barrier to LLM adoption

Kong 2025

per-token cost — fixed monthly pricing regardless of usage

egress fees on all inference traffic

How It Works

Tell Us What You Need

Share your use case — which models you want to run, how many users, what compliance requirements. We'll design a deployment tailored to your workload and budget.

Choose Your Model

Select from any open-source or custom model — Qwen, Llama, Mistral, Gemma, DeepSeek, Phi, and hundreds more, or bring your own fine-tuned model. We deploy it using production-grade inference engines like vLLM, Ollama, or llama.cpp.

Get Your Dedicated Server

Your dedicated server is custom-built for your workload — with or without GPU. Full root access, a managed firewall, your choice of inference engine, and an OpenAI-compatible API endpoint ready to go.

Scale When You Need To

Add models. Increase capacity. Move to multi-model deployments. Scale down if your needs change. Month-to-month, no long-term contracts.

Choose the Right Model for Your Workload

You don't need a 400-billion-parameter frontier model for most business tasks. Modern small language models in the 7B–30B range deliver 80–90% of frontier model quality on focused tasks — at a fraction of the cost.

GENERAL-PURPOSE REASONING & CHAT

Qwen3-8B

8B parameters

The current best all-around small model. Dual-mode architecture supports both fast dialogue and deep reasoning within a single model. 131K token context window. Ideal for internal copilots, customer support, and general Q&A.

~6 GB (Q4) · 50–150 tokens/sec (GPU)

CODE GENERATION & REVIEW

DeepSeek-R1-Distill-Qwen-7B

7B parameters

Distilled from DeepSeek-R1 with exceptional reasoning capabilities. 92.8% accuracy on MATH-500 and strong code generation benchmarks. Purpose-built for development workflows, code review automation, and technical documentation.

~5 GB (Q4) · 50–150 tokens/sec (GPU)

FINE-TUNING & CUSTOM MODELS

Mistral 7B

7B parameters

The most fine-tuning-friendly open architecture. Widely used as a base model for domain-specific SLMs in legal, medical, financial, and security applications. Large ecosystem of adapters and tooling.

~5 GB (Q4) · 50–150 tokens/sec (GPU)

DOCUMENT PROCESSING & SUMMARIZATION

Gemma 4 27B

27B parameters (Google)

Google's latest open model (Apache 2.0). Per-layer embedding architecture delivers frontier-class summarization, document analysis, and structured extraction while fitting on a single consumer GPU. Strong multilingual support.

~20 GB (Q4) · 20–60 tokens/sec (GPU)

MULTIMODAL (VISION + TEXT)

Qwen2.5-VL-7B-Instruct

7B parameters

Processes images and video alongside text. Useful for document OCR, image classification, visual inspection, and any workflow where the model needs to "see" input.

~6 GB (Q4) · 30–80 tokens/sec (GPU)

ENTERPRISE-SCALE REASONING

Qwen3 32B

32B parameters

The strongest open model in the 30B class. Deeper reasoning, more nuanced output, larger context handling. For workloads where 8B models aren't quite enough.

~20 GB (Q4) · 20–60 tokens/sec (GPU)

CUSTOM & FINE-TUNED MODELS

Bring Your Own Model

Any size

Deploy models you've fine-tuned in-house or trained from scratch. We support GGUF, GPTQ, AWQ, and SafeTensors formats from Hugging Face or your own registry. Your weights stay on your hardware.

Varies by model · We handle deployment & optimization

MULTILINGUAL APPLICATIONS

Qwen 2.5 (7B / 14B / 32B)

Best non-English language support across major open models. Strong for multilingual customer support, translation workflows, and global enterprise deployments.

SECURITY & VULNERABILITY ANALYSIS

Custom SLMs / Fine-tuned Models

Task-specific small language models for cybersecurity — vulnerability discovery, code auditing, threat intelligence. Bring your own model, we provide the infrastructure.

Custom-Built for Your Workload

Every deployment is different. We build a private LLM setup tailored to your models, your users, and your budget — with on a dedicated server with full root access, $0 egress fees, and GPU acceleration available based on your workload.

Every deployment includes:

Dedicated server — your hardware, your workload only

Full root and SSH access

Managed firewall included

GPU acceleration available (50–150+ tokens/sec on 7B–14B models)

OpenAI-compatible API endpoint

Managed deployment, monitoring, and updates

$0 egress fees — unlimited bandwidth included

99.99% uptime SLA

Month-to-month — no long-term contracts

Hosted in Denver or Seattle data centers

Configuration	Models	Use Case
Single Model	1 model (7B–14B)	Small team copilot, internal chatbot, dev/test
Multi-Model	2–5 models (7B–30B)	Production apps, multiple departments, A/B testing
Enterprise	5+ models (7B–70B)	Organization-wide AI, RAG pipelines, compliance workloads
Full Stack	Unlimited + managed services	LLM + ClickHouse + MinIO + GCP Interconnect

All configurations are custom. Contact us for a quote based on your specific requirements.

Bit Refinery vs. RunPod

RunPod is a popular cloud GPU platform. Here's how private LLM hosting on Bit Refinery compares for teams running persistent AI inference workloads.

Feature	Bit Refinery	RunPod
Pricing model	Fixed monthly — custom quote	Per-second / per-hour GPU rental
Cost at scale (24/7)	Predictable flat rate	RTX 4090 ~$0.44/hr = ~$317/mo; H100 ~$3.35/hr = ~$2,412/mo
Egress fees	$0 — unlimited bandwidth	$0 on network storage; standard egress elsewhere
Hardware	Dedicated server — full root access, your workload only	Shared cloud GPUs
Uptime SLA	99.99%	99.9%
Data residency	Denver, CO and Seattle, WA	31 regions — varies by availability
Management	Fully managed — we deploy, monitor, update	Self-service — you manage containers
Compliance	SOC 2, HIPAA, PCI DSS, ISO 27001, FedRAMP	SOC 2 Type II
GCP Interconnect	Free private peering (Denver)	Not available
Best for	Managed, private, predictable-cost AI inference	Self-service elastic cloud GPU compute

The key difference: RunPod is a self-service cloud GPU platform — great for developers who want to manage their own infrastructure. Bit Refinery is a fully managed private AI hosting service — we handle the deployment, monitoring, and operations so you just use the API. If you need compliance certifications, dedicated hardware, and predictable costs without managing infrastructure, Bit Refinery is built for that.

RunPod pricing based on published rates at runpod.io/pricing as of April 2026. Bit Refinery pricing is custom — contact us for a quote.

Cost Comparison

Three ways to run AI inference. Here's how they compare at real-world token volumes.

vs. Cloud LLM APIs (per-token pricing)

Daily Tokens	GPT-4o/mo	GPT-4o Mini/mo	Bit Refinery/mo
500K	~$2,250	~$225	Fixed monthly rate
2M	~$9,000	~$900	Fixed monthly rate
10M	~$45,000	~$4,500	Fixed monthly rate
50M	~$225,000	~$22,500	Fixed monthly rate

Cloud API pricing based on published per-token rates as of April 2026. Bit Refinery pricing is fixed regardless of token volume.

vs. RunPod Cloud GPU (24/7 inference)

GPU	RunPod Hourly	RunPod 24/7 Monthly	Bit Refinery
RTX 4090 (24 GB)	~$0.44/hr	~$317/mo	Custom quote — dedicated, fully managed
L4 (24 GB)	~$0.24/hr	~$173/mo	Custom quote — dedicated, fully managed
A100 80 GB	~$1.64/hr	~$1,181/mo	Custom quote — dedicated, fully managed
H100 SXM	~$3.35/hr	~$2,412/mo	Custom quote — dedicated, fully managed

RunPod pricing from runpod.io/pricing (Community Cloud, April 2026). Does not include storage costs ($0.07–$0.20/GB/mo). Bit Refinery includes storage, management, monitoring, and $0 egress.

RunPod charges per hour. Cloud APIs charge per token. Bit Refinery charges a flat monthly rate. At low or bursty usage, per-hour pricing can be cheaper. At sustained 24/7 inference — which is what most production AI workloads look like — fixed monthly pricing eliminates billing surprises and typically costs less when you factor in management overhead, egress, and storage.

Feature Comparison at a Glance

	Cloud APIs	Cloud GPUs	Bit Refinery
Pricing	Per token	Per hour/second	Fixed monthly
Data privacy	Data sent to provider	Shared infrastructure	Dedicated server, full access
Management	None needed (API)	Self-service	Fully managed
Model choice	Provider's models only	Any (you deploy)	Any (we deploy)
Egress fees	N/A	Varies	$0
Compliance	Varies	SOC 2	SOC 2, HIPAA, PCI, ISO
Uptime SLA	Varies	99.9%	99.99%
Best for	Low volume, elastic	Dev/test, experimentation	Production, compliance

Built for Real Workloads

Internal AI Copilot

Deploy a private ChatGPT alternative for your team. Employees ask questions, generate content, and get coding assistance — without sending proprietary data to external APIs.

Customer Support Automation

Power chatbots and support ticket triage with a fine-tuned SLM. Process thousands of customer interactions daily with consistent, fast responses. Zero per-token cost.

Document Processing Pipeline

Extract, classify, and summarize documents at scale. Medical records, legal contracts, financial reports — process sensitive documents without sending them to external services.

Code Review & Security Scanning

Run specialized code analysis models that scan repositories for vulnerabilities, generate code reviews, and flag security issues. Source code never leaves your infrastructure.

RAG (Retrieval-Augmented Generation)

Combine a private LLM with your own knowledge base. Connect to ClickHouse or Trino for data retrieval, MinIO for document storage — all on Bit Refinery with zero egress between services.

Agentic Workflows & Automation

Multi-step AI tasks like report generation, data entry automation, and tool orchestration. Agents generate 10–100x more tokens — fixed pricing removes the ceiling.

Regulated Industry Compliance

Healthcare (HIPAA), finance (SOC 2, PCI DSS), legal, and government workloads where data residency and access controls are non-negotiable. Private LLM hosting on dedicated hardware in SOC 2 compliant data centers with private networking and encryption at rest.

Production-Grade Inference Stack

We deploy your models using battle-tested open-source inference engines — not experimental tooling.

vLLM

Production-grade serving with continuous batching and PagedAttention for high-concurrency workloads. Best for multi-user API endpoints.

Ollama

Simple deployment with automatic hardware detection and an OpenAI-compatible REST API. Best for development, internal tools, and single-tenant deployments.

llama.cpp

Lightweight C++ runtime optimized for efficient inference. Best for air-gapped environments and maximum hardware efficiency.

Model Formats

GGUF — Universal format for CPU and hybrid CPU/GPU inference with quantization support.
GPTQ / AWQ — GPU-optimized quantization formats for maximum throughput on dedicated GPUs.

Quantization & API

All models deployed with Q4_K_M quantization by default — retaining ~95% of full-precision quality while reducing memory by 75%. Higher precision available on request.

Every deployment exposes an OpenAI-compatible API endpoint. If your application works with the OpenAI SDK, it works with your private Bit Refinery endpoint — just change the base URL.

Why Bit Refinery for Private AI

$0 Egress Fees

Every inference response is data transfer. On AWS, that's $0.09/GB. On GCP, $0.12/GB. On Bit Refinery, it's $0 — unlimited 1 Gbps bandwidth included.

Your Server, Your Rules

You get a dedicated server with full root access — not a shared VM where another tenant's workload degrades your inference latency. Install what you need, configure it how you want.

GPU-Accelerated Inference

GPU acceleration available on every deployment. 50–150+ tokens/sec on 7B–14B models — fast enough for real-time chat, streaming responses, and production applications. CPU-only configurations also available for lighter workloads.

Data Never Leaves

Your prompts, documents, and outputs stay on your hardware in our Tier 3 data centers in Denver or Seattle. No third-party data processing agreements.

Predictable Monthly Pricing

Fixed monthly rate. No per-token charges, no bandwidth overages, no compute-hour surprises. Budget AI infrastructure the same way you budget rent.

Google Cloud Interconnect (Denver)

Every Denver deployment includes free private peering to Google Cloud. Run your LLM on Bit Refinery and connect to BigQuery, Vertex AI, or Cloud Storage over a sub-millisecond private link.

Full Stack Integration

Combine private LLM hosting with Bit Refinery's other managed services for a complete AI infrastructure stack:

Included with every deployment

Foundryby Bit Refinery

Your control plane for private AI. Browse models, manage API keys, secure access.

foundry.bitrefinery.com

Foundry

Models

API keys

IP whitelist

Usage

Playground

Settings

Models

Qwen3-8BLive

8B params · Q4_K_M · General reasoning

Running · 5.8 GB · 142 t/s · 3 active users

DeepSeek-R1-Distill-7BLive

7B params · Q4_K_M · Code generation

Running · 4.9 GB · 156 t/s · 1 active user

Mistral-7B-v0.3

7B params · Q4_K_M · Fine-tuned (custom)

Stopped · 4.6 GB

API endpointhttps://acme-corp.foundry.bitrefinery.com/v1

API key

sk-br-••••••••••••••a4f8

IP whitelist10.0.1.0/24203.0.113.42

Model library

Browse, deploy, or upload your own models

API & keys

OpenAI-compatible endpoint. Generate and rotate keys.

IP whitelist

Restrict API access to only your approved IPs

Chat playgroundUsage metricsServer monitoringAudit logging

No CLI required. No DevOps. Just log in and manage your AI.

Frequently Asked Questions

Ready to Run AI on Your Own Infrastructure?

Get a private LLM deployment — no per-token costs, your data stays yours. Custom-built for your workload.

Trusted since 2008 · 99.99% SLA · Denver & Seattle data centers · $0 egress fees

Run AI Models on Your Own Infrastructure

What is Private LLM Hosting?

Data Privacy

Cost Predictability

No Vendor Lock-in

Zero Egress Fees

How It Works

Tell Us What You Need

Choose Your Model

Get Your Dedicated Server

Scale When You Need To

Choose the Right Model for Your Workload

Qwen3-8B

DeepSeek-R1-Distill-Qwen-7B

Mistral 7B

Gemma 4 27B

Qwen2.5-VL-7B-Instruct

Qwen3 32B

Bring Your Own Model

Qwen 2.5 (7B / 14B / 32B)

Custom SLMs / Fine-tuned Models

Custom-Built for Your Workload

Every deployment includes:

Bit Refinery vs. RunPod

Cost Comparison

vs. Cloud LLM APIs (per-token pricing)

vs. RunPod Cloud GPU (24/7 inference)

Feature Comparison at a Glance

Built for Real Workloads

Internal AI Copilot

Customer Support Automation

Document Processing Pipeline

Code Review & Security Scanning

RAG (Retrieval-Augmented Generation)

Agentic Workflows & Automation

Regulated Industry Compliance

Production-Grade Inference Stack

vLLM

Ollama

llama.cpp

Model Formats

Quantization & API

Why Bit Refinery for Private AI

$0 Egress Fees

Your Server, Your Rules

GPU-Accelerated Inference

Data Never Leaves

Predictable Monthly Pricing

Google Cloud Interconnect (Denver)

Full Stack Integration

MinIO Object Storage

ClickHouse

VergeOS VMs

BYOGPU

Models

Model library

API & keys

IP whitelist

Frequently Asked Questions

What models can I run?

How fast is inference?

How does this compare to running LLMs on AWS / GCP / Azure?

How does this compare to OpenAI / Anthropic / Google APIs?

Is this HIPAA / SOC 2 compliant?

Can I run multiple models?

What if I need help choosing a model?

Can I fine-tune models?

What's the minimum commitment?

How fast is setup?

How do I manage my deployment?

Do I own my data and models?

Ready to Run AI on Your Own Infrastructure?