Why this guide?
Open large-language models (LLMs) have reached a point where you can match—or at least approach—proprietary API quality without handing your data to someone else. But the catalogue is crowded and the marketing noise is loud. Before we start, let’s make one distinction:
Open-source: both the model code and the training data are publicly available, so they can be viewed, modified, and reused. You can reproduce the training process from scratch, modify any aspect of the system, and fully understand how the model was created.
Open-weight: the pretrained weights are available to run or fine-tune, but the code, training data, and architectural details may not be public.
This blog will focus on both types of models.
1. How to read the numbers
2. Tier-by-tier tour
🥾 Hobby & edge tier (≤ 10 B params, ≤ 8 GB VRAM)
Use when: privacy trumps quality, you need sub-second latency on-device, or you’re cost-constrained (< $0.30/h on serverless).
⚙️ Sweet-spot single-GPU tier (15 – 50 B)
Why this tier? It’s the smallest class that regularly beats GPT-3.5, yet fits in a $2.5 k workstation. If you’re migrating away from OpenAI for cost/privacy but still want “wow” moments, start here.
🚀 Big-single-card & dual-GPU tier (50 – 90 B active)
Rule of thumb: If you can amortise $10–30 k of hardware over many users—or pass the cost to clients—these models give near-GPT-4o quality while keeping your data on-prem.
🏢 Cluster-scale giants (≥ 100 B active)
Mostly for research labs, cloud providers, or competitive benchmark chasing.
Unless you already own an InfiniBand cluster, you can safely ignore this tier.
3. Don’t forget the classics: classification models
When you only need embeddings or text classification, 2019-vintage transformers still shine:
- DistilBERT-base (66 M params, 1 GB VRAM) — retains ~97 % of BERT's quality while being 40 % smaller.
- BERT-large / RoBERTa-large — gold standards for sentiment, intent, and named-entity recognition (NER).
- DeBERTa-v3-large — state of the art on many GLUE (General Language Understanding Evaluation) tasks; trades extra VRAM (4.5 GB) for F1 points.
They cost pennies to run (≈ $0.05/h on a cheap T4 pod) and stay robust in production.
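To show how little code this takes, here is a minimal sketch using the Hugging Face transformers pipeline. The checkpoint shown (distilbert-base-uncased-finetuned-sst-2-english) is one public sentiment model, used purely as an example; swap in whichever fine-tuned encoder matches your task.

```python
# Minimal sketch: text classification with a small encoder model.
# Assumes `pip install transformers torch`; the checkpoint below is a
# public DistilBERT sentiment model and is only an example.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,  # set device=-1 to run on CPU
)

print(classifier("The new release fixed every bug I reported."))
# -> [{'label': 'POSITIVE', 'score': ...}]
```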
4. Licensing quick-look
- MIT / Apache-2.0 (Phi-3, Mistral, Mixtral, DistilBERT…) — do almost anything, including SaaS resale.
- Meta Community (Llama-3) — free for commercial use unless your product serves ≥ 700 M monthly active users; above that threshold you must request a licence from Meta.
- Google Gemma — commercial use is allowed, but you must comply with Google's Gemma terms and prohibited-use policy.
- Research-only or otherwise restricted licences (e.g., some Qwen 72B releases) — treat as a no-go for paid products until you have checked the exact terms.
- TII Falcon — free to host, but shipping it inside a SaaS comes with notice and attribution obligations.
Always double-check downstream obligations (copyright, trademark, safety guardrails) before shipping.
5. Chain-of-thought & prompting tips
Every model discussed above can emit reasoning traces if you ask:
“Let’s think step by step.”
Quality varies—Phi-class models need coaxing while Mixtral/Llama-3 often self-reflect unprompted. For code, prepend:
You are an expert software engineer. ###
and add “### End” after your request to curb rambling.
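As a concrete illustration, here is a small helper that applies exactly this framing to a coding request before it is sent to a chat endpoint. The function name is an illustrative choice, not a library API; the delimiters and cue are the ones suggested above.

```python
# Sketch: wrap a coding request in the prompt framing described above
# (expert-engineer system prompt, step-by-step cue, "### End" delimiter).
# `build_code_prompt` is an illustrative helper, not a library function.
def build_code_prompt(request: str) -> list[dict]:
    return [
        {"role": "system", "content": "You are an expert software engineer. ###"},
        {
            "role": "user",
            "content": f"{request}\nLet's think step by step.\n### End",
        },
    ]

# Pass the result to any OpenAI-compatible chat endpoint
# (see the vLLM deployment in the next section).
print(build_code_prompt("Write a Python function that merges two sorted lists."))
```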
6. Deployment instructions
If you're looking to serve high-throughput, low-latency inference with large language models like Mistral 24B, vLLM is the perfect backend. In this post, I’ll walk through how to deploy the Mistral-Small-3.2-24B-Instruct-2506 model on an RTX 6000 Ada GPU using the vllm-openai container image.
✅ This setup uses:
- vLLM with OpenAI-compatible API
- RTX 6000 Ada GPU (48GB VRAM)
- INT4 quantized model (GPTQ)
- 32K token context window
🧱 Create a Deployment Template
Use your preferred orchestration platform (e.g., Modal, RunPod, Lambda Cloud). Here's an example pod template configuration.
🔧 General Settings
▶️ Container Start Command
--host 0.0.0.0 \
--port 8000 \
--model dwetzel/Mistral-Small-3.2-24B-Instruct-2506-GPTQ-INT4 \
--gpu-memory-utilization 0.90 \
--max-model-len 32768 \
--api-key <key>
🔍 Notes:
- --host 0.0.0.0: Exposes the service on all interfaces.
- --model: Replace with your Hugging Face repo ID or a preloaded local model path.
- --gpu-memory-utilization 0.90: Lets vLLM use up to 90% of VRAM, keeping a small buffer to avoid out-of-memory errors.
- --max-model-len 32768: Enables 32K context length, great for summarization and long chats.
- --api-key <key>: Set a real API key or secure this via environment variables.
💾 Storage Configuration
Ensure you have enough disk space for:
- Model weights (~12–18 GB for 24B GPTQ)
- Tokenizer + config files
- Any downloaded dependencies
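If you are unsure how big a checkpoint actually is, one quick way to check before provisioning disk is to sum the file sizes reported by the Hugging Face Hub. This is a sketch, assuming the huggingface_hub package and the GPTQ repo used above.

```python
# Sketch: estimate on-disk size of a checkpoint before provisioning the pod.
# Assumes `pip install huggingface_hub`; the repo ID matches the model above.
from huggingface_hub import HfApi

repo_id = "dwetzel/Mistral-Small-3.2-24B-Instruct-2506-GPTQ-INT4"
info = HfApi().model_info(repo_id, files_metadata=True)

total_bytes = sum(f.size or 0 for f in info.siblings)
print(f"{repo_id}: ~{total_bytes / 1e9:.1f} GB of files (plus cache overhead)")
```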
🖥️ Hardware Considerations
RTX 6000 Ada offers 48 GB of VRAM, which is sufficient for running Mistral-24B in INT4 format with vLLM. Expect:
- ✅ Smooth inference with 32K context
- ✅ Fast token generation
- ❗ The GPU should be free of other processes before launch; at 90% memory utilisation there is little headroom left (a quick check is sketched below)
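One way to confirm the card really is idle before starting the server is to query it via NVIDIA's nvidia-ml-py bindings. This is a rough sketch, assuming a single-GPU pod with pynvml available.

```python
# Sketch: check the GPU is idle before launching vLLM at 90 % memory use.
# Assumes a single-GPU pod and `pip install nvidia-ml-py` (imported as pynvml).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)

print(f"VRAM in use: {mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB")
if procs:
    print(f"Warning: {len(procs)} process(es) already hold GPU memory")

pynvml.nvmlShutdown()
```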
🔐 Security Best Practice
Avoid hardcoding --api-key values directly in your container command for public or shared pods. Use environment variables or secrets management instead:
--api-key $API_KEY
And inject API_KEY using your platform's secret manager.
✅ Test the Deployment
Once your pod is up and running, test it with an OpenAI-compatible client like:
curl http://<your-host>:8000/v1/completions \
-H "Authorization: Bearer<key>" \
-H "Content-Type: application/json" \
-d '{
"model": "dwetzel/Mistral-Small-3.2-24B-Instruct-2506-GPTQ-INT4",
"prompt": "Explain transformers in 3 bullet points.",
"max_tokens": 300
}'
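The same test can be run from Python with the openai client pointed at your endpoint; the host, key, and model path below are the same placeholders as in the curl call.

```python
# Sketch: the curl test above, expressed with the `openai` Python client.
# Replace <your-host> and <key> with your deployment's values.
from openai import OpenAI

client = OpenAI(base_url="http://<your-host>:8000/v1", api_key="<key>")

completion = client.completions.create(
    model="dwetzel/Mistral-Small-3.2-24B-Instruct-2506-GPTQ-INT4",
    prompt="Explain transformers in 3 bullet points.",
    max_tokens=300,
)
print(completion.choices[0].text)
```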
7. Final recommendations
- Start small, scale later. Spin up Mistral 7B or Llama-3 8B on a cheap pod; measure latency & accuracy against your real workload.
- MoE is your friend. Mixtral models punch above their VRAM weight, especially under concurrent load.
- Budget for context. VRAM rules of thumb assume 4 k tokens; if you need 32 k, double the headroom or quantise more aggressively (a back-of-envelope KV-cache estimate follows this list).
- Mind the ecosystem. Tooling (vLLM, TGI, Ollama, OpenLLM, BentoML) determines day-two Ops pain more than the raw model.
- Keep an eye on Llama-3 fine-tunes. Community checkpoints are closing the final few percent to GPT-4o every month.
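To make the context-budget point concrete, here is a back-of-envelope KV-cache estimate. The layer/head numbers are illustrative defaults for a mid-size GQA model, not figures from any specific model card; plug in the values from your model's config.json.

```python
# Back-of-envelope KV-cache sizing: why long context eats VRAM.
# The defaults below (layers, KV heads, head dim) are illustrative only.
def kv_cache_gb(seq_len: int, n_layers: int = 40, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2,  # fp16/bf16
                batch_size: int = 1) -> float:
    # Two tensors (K and V) per layer, each [n_kv_heads, seq_len, head_dim].
    total = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch_size
    return total / 1e9

for ctx in (4_096, 32_768):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache per sequence")
# ~0.7 GB at 4 k vs ~5.4 GB at 32 k with these defaults; multiply by the
# number of concurrent sequences to see why the context budget matters.
```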
The bottom line
- Hobbyists & edge devices: Phi-3 mini is a revelation—privacy-first chat under 4 GB.
- Small teams: A single 4090 plus Llama-3 22B or Mixtral 8×7B gets you fair quality for <$1/h, no data leaks.
- Enterprises: Llama-3 70B or Mixtral 8×22B give strong multilingual reasoning with tractable TCO and business-friendly licences (Apache-2.0 for Mixtral, the Meta community licence for Llama-3).
- Researchers: DeepSeek R1 671B and Qwen 2.5 72B exist, but your power bill may outpace your curiosity.
Self-hosting is no longer a bragging right; it's a viable alternative to proprietary APIs. Choose a tier, match the hardware, and enjoy owning your stack.