
The 2025 Self-Hosting Field Guide to Open LLMs

Yevhenii Hordiienko
July 2025

Why this guide?

Open large-language models (LLMs) have reached a point where you can match—or at least approach—proprietary API quality without handing your data to someone else. But the catalogue is crowded and the marketing noise is loud. Before we start, let’s make one distinction:

  • Open-source: the model code and training data are publicly available and can be viewed, modified, and reused. You can reproduce the training process from scratch, modify any aspect of the system, and fully understand how the model was created.
  • Open-weight: the pretrained weights are published so you can run or fine-tune the model, but the code, training data, and architectural details may not be available to the public.

This blog will focus on both types of models.

1. How to read the numbers

| Column | What it really means | Gotchas |
|---|---|---|
| Params (B) | Total parameters. For Mixture-of-Experts (MoE) models the active parameters per token are lower. | Check the "active" figure; only that many parameters are used during inference for any given input. |
| VRAM | Minimum GPU memory to run the model at INT4 or FP16 precision. | Leave headroom for the operating system, tokenizer, and longer contexts. |
| Typical GPU setup & hardware cost | What you'd buy on eBay or from a cloud spot market today. | Prices fluctuate; treat figures as ball-park. |
| Licence | What you can legally do. | Read the fine print before commercial deployment. |
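
As a quick sanity check on the VRAM column, here is a back-of-envelope sketch; the bytes-per-parameter values and the fixed 4 GB headroom are my assumptions, not benchmark figures:

# Back-of-envelope VRAM estimate: weight memory plus a fixed headroom.
# Assumptions: INT4 ~ 0.5 bytes/param, FP16 ~ 2 bytes/param, and ~4 GB of
# headroom for the KV cache, activations, and the serving runtime.

def estimate_vram_gb(params_billion: float, bytes_per_param: float, headroom_gb: float = 4.0) -> float:
    return params_billion * bytes_per_param + headroom_gb

print(f"Llama-3 8B  @ FP16: ~{estimate_vram_gb(8, 2.0):.0f} GB")   # ~20 GB
print(f"Llama-3 70B @ INT4: ~{estimate_vram_gb(70, 0.5):.0f} GB")  # ~39 GB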

2. Tier-by-tier tour

🥾 Hobby & edge tier (≤ 10 B params, ≤ 8 GB VRAM)

| Model | Why consider it | Watch-outs |
|---|---|---|
| Phi-3 mini (3.8 B) | Runs on CPU with < 8 GB RAM. MIT licence. Great for embedded chat or retrieval-augmented-generation (RAG) prototypes. | Way behind GPT-4o; you'll feel it on reasoning. |
| Mistral 7B / Llama-3 8B / Gemma 7B / Phi-3 small (7 B) | Fit on any RTX 3060 or 4060-class card. Apache-2.0 or similar licences. Solid coding and Q&A baseline. | Still way behind GPT-4o; hallucinations creep in when tasks get long. |

Use when: privacy trumps quality, you need sub-second latency on-device, or you’re cost-constrained (< $0.30/h on serverless).
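
If you want to try this tier locally, a minimal inference sketch with Hugging Face transformers might look like the following; it assumes a recent transformers release (with torch and accelerate installed) and uses Microsoft's public Phi-3 mini instruct checkpoint:

# Minimal local chat with Phi-3 mini via the transformers pipeline.
# Assumes a recent transformers release plus torch and accelerate installed.
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",   # GPU if available, otherwise CPU
    torch_dtype="auto",
)

messages = [{"role": "user", "content": "Summarise retrieval-augmented generation in two sentences."}]
print(chat(messages, max_new_tokens=128)[0]["generated_text"])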

⚙️ Sweet-spot single-GPU tier (15 – 50 B)

| Model | Strength | Ideal hardware |
|---|---|---|
| Llama-3 22B (dense) | Balanced reasoning, multilingual. Meta Community licence. | One RTX 4090 (24 GB) or A6000 (48 GB). |
| Mixtral 8×7B (MoE) | MoE gives near-GPT-4-class quality at 22 GB VRAM. Apache-2.0. | Same as above; shines with multi-user loads. |
| Mistral-Small-3.2-24B-Instruct | Long context, coding bias. Apache-2.0. | A 4090 is enough; an A6000 (48 GB) if you need longer contexts. |

Why this tier? It’s the smallest class that regularly beats GPT-3.5, yet fits in a $2.5 k workstation. If you’re migrating away from OpenAI for cost/privacy but still want “wow” moments, start here.

🚀 Big-single-card & dual-GPU tier (50 – 90 B active)

| Model | What it's good at | Cost reality |
|---|---|---|
| Llama-3 70B | General chat, multilingual, RAG. Single-GPU with INT4. | Needs 40 GB VRAM; an A6000 (≈ $7 k) or a rented A100 80 GB. |

Rule of thumb: If you can amortise $10–30 k of hardware over many users—or pass the cost to clients—these models give near-GPT-4o quality while keeping your data on-prem.

🏢 Cluster-scale giants (≥ 100 B active)

Mostly for research labs, cloud providers, or competitive benchmark chasing.

| Model | Notes |
|---|---|
| Qwen 72B (FP16) | Excellent Chinese/English mix, but the licence forbids commercial use. |
| Mixtral 8×22B | 39 B active parameters; stunning quality/latency if you have NVLink. |
| DeepSeek R1 671B | A benchmark curiosity; only distilled checkpoints are practical. |

Unless you already own an InfiniBand cluster, you can safely ignore this tier.

3. Don’t forget the classics: classification models

When you only need embeddings or text classification, 2019-vintage transformers still shine:

  • DistilBERT-base (66 M params, 1 GB VRAM) — keeps ~97 % of BERT-base quality with 40 % fewer parameters.
  • BERT-large / RoBERTa-large — gold standards for sentiment, intent, and named-entity recognition (NER).
  • DeBERTa-v3-large — state of the art on many GLUE (General Language Understanding Evaluation) tasks; trades VRAM (4.5 GB) for F1 points.

They cost pennies to run (≈ $0.05/h on a cheap T4 pod) and stay robust in production.
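
For reference, wiring one of these up is a few lines with the transformers pipeline; this sketch assumes the widely used SST-2 fine-tune of DistilBERT and that transformers and torch are installed:

# Tiny text-classification sketch with a distilled BERT checkpoint.
# Swap in your own fine-tuned checkpoint for intent or NER.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier([
    "Self-hosting saved us a fortune.",
    "The latency on the old endpoint was unacceptable.",
]))
# Expected shape: [{'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]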

4. Licensing quick-look

  • MIT / Apache-2.0 (Phi-3, Mistral, Mixtral, DistilBERT…) — do almost anything, including SaaS resale.
  • Meta Llama 3 Community licence (Llama-3) — free unless your product serves ≥ 700 M MAU; above that you need a separate licence from Meta.
  • Google Gemma — commercial use allowed but must follow each provider’s Acceptable Use Policy.
  • Research-only (Qwen 72B) — no-go for paid products.
  • TII Falcon — free to host, but SaaS requires a notice and some attribution.

Always double-check downstream obligations (copyright, trademark, safety guardrails) before shipping.

5. Chain-of-thought & prompting tips

Every model in the table can emit reasoning traces if you ask:

“Let’s think step by step.”

Quality varies—Phi-class models need coaxing while Mixtral/Llama-3 often self-reflect unprompted. For code, prepend:

You are an expert software engineer. ###

and add “### End” after your request to curb rambling.
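
Put together, a reusable prompt template for the coding tip above might look like this; the helper name and exact delimiters are illustrative, not a standard:

# Illustrative prompt template combining the system line, the step-by-step cue,
# and the "### End" sentinel used to curb rambling.
SYSTEM = "You are an expert software engineer. ###"

def build_prompt(task: str, reasoning: bool = True) -> str:
    cue = "Let's think step by step.\n" if reasoning else ""
    return f"{SYSTEM}\n{cue}{task}\n### End"

print(build_prompt("Write a Python function that deduplicates a list while preserving order."))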

6. Deployment instructions

If you're looking to serve high-throughput, low-latency inference with large language models like Mistral 24B, vLLM is the perfect backend. In this post, I’ll walk through how to deploy the Mistral-Small-3.2-24B-Instruct-2506 model on an RTX 6000 Ada GPU using the vllm-openai container image.

✅ This setup uses:

  • vLLM with OpenAI-compatible API
  • RTX 6000 Ada GPU (48GB VRAM)
  • INT4 quantized model (GPTQ)
  • 32K token context window

🧱 Create a Deployment Template

Use your preferred orchestration platform (e.g., Modal, RunPod, Lambda Cloud, etc.). Here's an example pod template configuration.

🔧 General Settings

| Field | Value |
|---|---|
| Name | mixtrall_pod |
| Type | Pod |
| Compute | Nvidia GPU |
| Visibility | Private or Public (your choice) |
| Container Image | vllm/vllm-openai:latest |

▶️ Container Start Command

--host 0.0.0.0 \
--port 8000 \
--model dwetzel/Mistral-Small-3.2-24B-Instruct-2506-GPTQ-INT4 \
--gpu-memory-utilization 0.90 \
--max-model-len 32768 \
--api-key <key>

🔍 Notes:

  • --host 0.0.0.0: Exposes the service on all interfaces.
  • --model: Replace with your HuggingFace path or preloaded model.
  • --gpu-memory-utilization 0.90: Uses up to 90% of VRAM to maximize performance while preventing OOM.
  • --max-model-len 32768: Enables 32K context length, great for summarization and long chats.
  • --api-key <key>: Set a real API key or secure this via environment variables.

💾 Storage Configuration

| Disk Type | Size | Purpose |
|---|---|---|
| Container Disk | 30 GB | Temporary storage for runtime |
| Volume Disk | 60 GB | Persistent storage for model weights & artifacts |
| Mount Path | /workspace | Mounted inside the container |

Ensure you have enough disk space for:

  • Model weights (~12–18 GB for 24B GPTQ)
  • Tokenizer + config files
  • Any downloaded dependencies
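
The ~12–18 GB figure for the weights follows from simple arithmetic: a 4-bit GPTQ checkpoint stores roughly params × 4 bits, plus quantisation scales, embeddings, and tokenizer/config files. A rough sketch (the 1.2 overhead factor is a guess):

# Rough on-disk size of a GPTQ checkpoint: params * bits / 8, times an overhead
# factor for quantisation scales, embeddings, and config/tokenizer files.
def checkpoint_size_gb(params_billion: float, bits: int = 4, overhead: float = 1.2) -> float:
    return params_billion * bits / 8 * overhead

print(f"24 B @ INT4: ~{checkpoint_size_gb(24):.0f} GB on disk")  # lands inside the 12-18 GB range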

🖥️ Hardware Considerations

The RTX 6000 Ada offers 48 GB of VRAM, which is sufficient for running Mistral 24B in INT4 format with vLLM. Expect:

  • Smooth inference with the 32K context
  • Fast token generation

Make sure the GPU is idle (no leftover VRAM allocations) before launch, especially with --gpu-memory-utilization 0.90; a quick check is sketched below.
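
One way to confirm the card is clean is to query nvidia-smi before starting the container; a minimal sketch, assuming nvidia-smi is on the PATH (the 1 GiB threshold is arbitrary):

# Pre-launch check that the GPU is idle; assumes nvidia-smi is on the PATH.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for i, line in enumerate(out.strip().splitlines()):
    used, total = (int(x) for x in line.split(","))
    print(f"GPU {i}: {used} MiB / {total} MiB in use")
    if used > 1024:  # arbitrary 1 GiB threshold
        print(f"  -> GPU {i} is not clean; stop leftover processes before starting vLLM.")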

🔐 Security Best Practice

Avoid hardcoding --api-key values directly in your container command for public or shared pods. Use environment variables or secrets management instead:

--api-key $API_KEY

And inject API_KEY using your platform's secret manager.

✅ Test the Deployment

Once your pod is up and running, test it with an OpenAI-compatible client like:

curl http://<your-host>:8000/v1/completions \
  -H "Authorization: Bearer<key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dwetzel/Mistral-Small-3.2-24B-Instruct-2506-GPTQ-INT4",
    "prompt": "Explain transformers in 3 bullet points.",
    "max_tokens": 300
  }'
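
Because vLLM exposes an OpenAI-compatible API, the same request also works through the official openai Python client; just point base_url at your pod (placeholders as above):

# Same request via the OpenAI Python client (pip install openai);
# host, port, and key are placeholders for your own pod.
from openai import OpenAI

client = OpenAI(base_url="http://<your-host>:8000/v1", api_key="<key>")

resp = client.completions.create(
    model="dwetzel/Mistral-Small-3.2-24B-Instruct-2506-GPTQ-INT4",
    prompt="Explain transformers in 3 bullet points.",
    max_tokens=300,
)
print(resp.choices[0].text)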

7. Final recommendations

  1. Start small, scale later. Spin up Mistral 7B or Llama-3 8B on a cheap pod; measure latency & accuracy against your real workload.
  2. MoE is your friend. Mixtral models punch above their VRAM weight, especially under concurrent load.
  3. Budget for context. VRAM rules of thumb assume 4 k tokens. If you need 32 k, double the headroom or quantise more aggressively.
  4. Mind the ecosystem. Tooling (vLLM, TGI, Ollama, OpenLLM, BentoML) determines day-two Ops pain more than the raw model.
  5. Keep an eye on Llama-3 fine-tunes. Community checkpoints are closing the final few percent to GPT-4o every month.

The bottom line

  • Hobbyists & edge devices: Phi-3 mini is a revelation—privacy-first chat under 4 GB.
  • Small teams: A single 4090 plus Llama-3 22B or Mixtral 8×7B gets you fair quality for <$1/h, no data leaks.
  • Enterprises: Llama-3 70B or Mixtral 8×22B give strong multilingual reasoning with tractable TCO and permissive licences (Apache-2.0 for Mixtral, Meta Community for Llama-3).
  • Researchers: DeepSeek R1 671B and Qwen 2.5 72B exist, but your power bill may outpace your curiosity.

Self-hosting is no longer a bragging right; it's a viable alternative to proprietary APIs. Choose a tier, match the hardware, and enjoy owning your stack.
