whichllm: Hardware-Aware LLM Selection Using Evidence-Graded Benchmarks
Hook
Most developers choose local LLMs by parameter count—a 13B model should fit in 16GB VRAM, right? Wrong. The actual memory footprint depends on quantization, context length, batch size, KV cache configuration, and whether your model uses Grouped Query Attention or Mixture-of-Experts routing.
Context
The local LLM ecosystem has exploded with thousands of models on HuggingFace, each claiming to be "state-of-the-art" or "optimized for performance." Developers face three hard questions at once: which model actually runs on their hardware, which benchmarks matter, and which scores are legitimate rather than self-reported marketing. Manual approaches fail at scale: checking VRAM requirements by hand means understanding transformer architecture details, while benchmark comparison requires tracking LiveBench, Chatbot Arena, Aider coding evals, and dozens of other leaderboards that measure different capabilities.
The problem compounds when hardware constraints enter the picture. A MacBook Pro with 64GB unified memory has radically different characteristics than an RTX 4090 with 24GB VRAM, yet model cards rarely specify compatibility beyond vague "GPU recommended" warnings. Developers waste hours downloading multi-gigabyte models only to discover they won't fit in memory, or that quantization destroyed quality, or that the benchmark scores were inherited from a base model rather than actually measured. whichllm solves this by automating hardware detection, calculating architecture-aware memory requirements, and ranking models using evidence-graded benchmarks that reject fabricated claims.
Technical Insight
whichllm's core innovation is its evidence-based scoring system, which combines multiple benchmark sources while tagging each score's provenance. When you run whichllm, it queries the HuggingFace API for models compatible with your hardware, then ranks them by weighting each benchmark result according to its confidence level. Direct measurements (the model itself was tested) receive full weight, variant scores (tested on a quantized version) get partial credit, base model scores (inherited from the foundation model) are heavily discounted, and self-reported claims are flagged with warnings.
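The weighting described above can be sketched roughly as follows. This is a minimal illustration, not whichllm's actual code: the numeric weights, the `BenchmarkScore` type, and the `aggregate` function are all assumptions, since only the ordering of the evidence tiers is given here.

```python
from dataclasses import dataclass

# Hypothetical weights: only the ordering of the tiers
# (direct > variant > base model > self-reported) comes from the text.
EVIDENCE_WEIGHTS = {
    "direct": 1.0,
    "variant": 0.8,
    "base_model": 0.4,
    "self_reported": 0.2,
}


@dataclass
class BenchmarkScore:
    benchmark: str
    value: float   # normalized to a 0-100 scale
    evidence: str  # one of the EVIDENCE_WEIGHTS keys


def aggregate(scores: list[BenchmarkScore]) -> tuple[float, list[str]]:
    """Return a confidence-discounted mean score plus any warnings."""
    if not scores:
        return 0.0, ["no benchmark evidence"]
    warnings = [
        f"{s.benchmark}: self-reported, not independently measured"
        for s in scores
        if s.evidence == "self_reported"
    ]
    # Each score is scaled down by its evidence weight before averaging,
    # so weakly supported scores contribute less to the final ranking.
    discounted = sum(s.value * EVIDENCE_WEIGHTS[s.evidence] for s in scores)
    return discounted / len(scores), warnings


total, warnings = aggregate([
    BenchmarkScore("mmlu", 70.0, "direct"),
    BenchmarkScore("gsm8k", 80.0, "self_reported"),
])
print(total, warnings)
```

Dividing by the number of scores rather than by the sum of weights is a deliberate choice in this sketch: a model backed only by self-reported numbers ends up with a low aggregate instead of having its weak evidence normalized away.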
The VRAM estimation engine demonstrates architecture-aware calculations that go far beyond naive parameter counting:
    # Simplified version of whichllm's VRAM calculation
    def estimate_vram(model_config, quantization="fp16"):
        """Estimate inference memory in GiB for batch size 1."""
        params = model_config['num_parameters']
        context_length = model_config.get('max_position_embeddings', 2048)
        num_layers = model_config['num_hidden_layers']
        hidden_size = model_config['hidden_size']
        num_heads = model_config['num_attention_heads']

        # Weight memory based on quantization (bytes per parameter)
        bytes_per_param = {
            'fp16': 2, 'q8': 1, 'q4': 0.5, 'q3': 0.375
        }[quantization]
        weight_memory = params * bytes_per_param

        # KV cache for attention (accounts for GQA): K and V per layer,
        # one head_dim-sized vector per KV head per token, stored in fp16
        num_kv_heads = model_config.get('num_key_value_heads', num_heads)
        head_dim = hidden_size // num_heads
        kv_cache_size = 2 * num_layers * num_kv_heads * head_dim * context_length * 2

        # Activation memory (batch_size=1 assumption), fp16
        activation_memory = hidden_size * context_length * 2 * num_layers

        # Overhead (CUDA kernels, fragmentation)
        overhead = (weight_memory + kv_cache_size + activation_memory) * 0.2

        total_gb = (weight_memory + kv_cache_size + activation_memory + overhead) / (1024**3)
        return total_gb
This calculation shows why a 13B parameter model at Q4 quantization (6.5GB of weights) actually requires ~11GB of VRAM once you factor in the KV cache for 4K context, activations during generation, and memory fragmentation. For Mixture-of-Experts models like Mixtral 8x7B, whichllm distinguishes between total parameters (about 47B) and active parameters (about 13B per token). Every expert still has to be resident in memory, so the total count sets the weight footprint while the active count sets per-token compute; keeping the two apart prevents false rejections on hardware that could actually run the model.
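To make the arithmetic concrete, here is a small worked sketch of the memory terms for a 13B model at 4K context. The layer and head counts are an assumed Llama-2-13B-like shape (40 layers, 40 attention heads, head dimension 128), and the 8-KV-head variant is hypothetical, included only to show what Grouped Query Attention changes.

```python
GIB = 1024**3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_length, bytes_per_value=2):
    """Bytes for the K and V tensors across all layers at batch size 1."""
    return 2 * num_layers * num_kv_heads * head_dim * context_length * bytes_per_value


# Assumed Llama-2-13B-like shape: every attention head has its own K/V (no GQA).
mha_kv = kv_cache_bytes(num_layers=40, num_kv_heads=40, head_dim=128, context_length=4096)

# Hypothetical GQA variant of the same shape: 8 shared KV heads.
gqa_kv = kv_cache_bytes(num_layers=40, num_kv_heads=8, head_dim=128, context_length=4096)

weights_q4 = 13e9 * 0.5             # 0.5 bytes per parameter at Q4
activations = 5120 * 4096 * 2 * 40  # hidden_size * context * fp16 bytes * layers
subtotal = weights_q4 + mha_kv + activations

print(f"weights:        {weights_q4 / GIB:.3f} GiB")
print(f"KV cache (MHA): {mha_kv / GIB:.3f} GiB")
print(f"KV cache (GQA): {gqa_kv / GIB:.3f} GiB")
print(f"activations:    {activations / GIB:.3f} GiB")
print(f"subtotal:       {subtotal / GIB:.3f} GiB")
print(f"with 20% overhead: {subtotal * 1.2 / GIB:.3f} GiB")
```

Under these assumptions the weights come to about 6.05 GiB, the full-attention KV cache to 3.125 GiB, and the activation term to about 1.56 GiB, for roughly 10.7 GiB before overhead and about 12.9 GiB with the 20% margin. That is the same ballpark as the figure quoted above; the exact number depends on the assumed shape and overhead factor. With 8 KV heads the cache shrinks fivefold to 0.625 GiB, which is why GQA matters for long contexts.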
The benchmark merging strategy uses recency-aware scoring to prevent older models from dominating rankings. When a new model generation is released (Llama 3 → Llama 3.1 → Llama 3.2), benchmark scores naturally improve, but legacy models retain high scores from older evaluations. whichllm applies temporal decay along model lineages:
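    # Recency penalty for inherited scores
    def apply_recency_penalty(score, model_date, benchmark_date, lineage_depth):
        """Discount a score by its age and by how far up the lineage it was inherited.

        model_date and benchmark_date are datetime.date (or datetime) objects.
        """
        # Clamp at zero so a benchmark dated before the model never boosts the score
        age_months = max(0.0, (benchmark_date - model_date).days / 30)
        age_penalty = max(0.7, 1.0 - (age_months * 0.05))  # never discount more than 30%
        lineage_penalty = 0.95 ** lineage_depth  # Decay for base model inheritance
        return score * age_penalty * lineage_penalty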
This ensures that Llama 3.1 8B doesn't get outranked by Llama 2 13B just because the older model has more benchmark coverage. The confidence tagging system also guards against score pollution: if a fine-tuned model claims the same MMLU score as its base model without providing evidence, whichllm downgrades it to "interpolated" status.
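The downgrade rule for copied scores could look something like the sketch below. The function name `grade_claim`, the `tolerance` parameter, and the tag names other than "interpolated" are assumptions made for illustration; the text only states that an unverified score identical to the base model's is downgraded.

```python
def grade_claim(claimed_score, base_score, has_eval_evidence, tolerance=0.05):
    """Return an evidence tag for a fine-tuned model's claimed benchmark score."""
    if has_eval_evidence:
        # Backed by an actual evaluation run of this model.
        return "direct"
    if base_score is not None and abs(claimed_score - base_score) <= tolerance:
        # Unverified and identical to the base model: likely copied, not measured.
        return "interpolated"
    # Unverified but distinct from the base model's score.
    return "self_reported"


print(grade_claim(68.9, 68.9, has_eval_evidence=False))
print(grade_claim(70.2, 68.9, has_eval_evidence=False))
print(grade_claim(70.2, 68.9, has_eval_evidence=True))
```

The small tolerance is there because scores are often rounded differently on model cards and leaderboards; an exact equality check would miss a copied score that differs only in the last digit.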
The execution layer leverages uv for isolated environments, enabling one-command model runs without dependency conflicts. When you execute whichllm run mistral-7b-instruct-v0.2, it detects your model format (GGUF vs transformers vs AWQ), creates a project-specific virtual environment, installs the appropriate backend (llama-cpp-python for GGUF, vLLM for AWQ, standard transformers otherwise), downloads the model via HuggingFace Hub, and launches an interactive session. This eliminates the "works on my machine" problem where global Python environments cause library version conflicts.
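The format-to-backend step can be illustrated with a short sketch. The detection heuristics here are assumptions rather than whichllm's confirmed logic: a `.gguf` file marks a GGUF model, and a `quantization_config` entry with `quant_method` set to `awq` in `config.json` marks an AWQ model. The sketch stops short of creating the uv environment or downloading anything.

```python
import json
from pathlib import Path

# Backend packages named in the text, keyed by detected model format.
BACKENDS = {
    "gguf": "llama-cpp-python",
    "awq": "vllm",
    "transformers": "transformers",
}


def detect_format(model_dir: Path) -> str:
    """Classify a local model directory as 'gguf', 'awq', or 'transformers'."""
    if any(model_dir.glob("*.gguf")):
        return "gguf"
    config_path = model_dir / "config.json"
    if config_path.exists():
        config = json.loads(config_path.read_text())
        quant = config.get("quantization_config") or {}
        if quant.get("quant_method") == "awq":
            return "awq"
    # Anything else is treated as a standard transformers checkpoint.
    return "transformers"


def select_backend(model_dir: Path) -> str:
    """Return the package that would be installed for this model."""
    return BACKENDS[detect_format(model_dir)]
```

Keeping detection separate from installation makes the mapping easy to test against a temporary directory without touching the network or the package manager.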
The hardware detection subsystem uses platform-specific APIs to gather accurate capabilities. On NVIDIA GPUs, it queries nvidia-smi for VRAM and CUDA compute capability. On Apple Silicon, it parses system_profiler to determine unified memory and Neural Engine availability. On AMD, it attempts ROCm detection before falling back to CPU-only mode. This multi-platform support means the same command works whether you're on a Linux workstation with dual A6000s, a MacBook Pro with M3 Max, or a Framework laptop with integrated graphics.
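As a rough illustration of the NVIDIA path only, the sketch below shells out to `nvidia-smi` and parses its CSV output. The function names are hypothetical and whichllm's real detection code may differ; the Apple Silicon and ROCm branches are omitted, and an empty list stands in for the CPU-only fallback.

```python
import shutil
import subprocess


def parse_nvidia_smi(csv_text: str) -> list[dict]:
    """Parse 'name, memory.total' rows (MiB, no header, no units) into dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        try:
            name, memory_mib = (field.strip() for field in line.rsplit(",", 1))
            gpus.append({"name": name, "vram_gb": int(memory_mib) / 1024})
        except ValueError:
            # Skip rows that are malformed or report a non-numeric memory value.
            continue
    return gpus


def detect_nvidia_gpus() -> list[dict]:
    """Return detected NVIDIA GPUs, or an empty list to signal CPU-only fallback."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_nvidia_smi(result.stdout)


if __name__ == "__main__":
    print(detect_nvidia_gpus() or "no NVIDIA GPU detected; falling back to CPU")
```

Splitting parsing from the subprocess call means the parser can be exercised with canned output on machines that have no GPU at all.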
Gotcha
whichllm depends on live HuggingFace API queries. If HF is down or rate-limiting, the tool falls back to frozen benchmark snapshots, which go stale quickly in a fast-moving ecosystem. The bundled cache makes offline operation possible, but it undercuts the goal of real-time benchmark awareness: a model released last week won't appear in your rankings until the cache is manually refreshed.
The VRAM estimation models, while sophisticated, remain estimates rather than measurements. Actual memory consumption varies based on inference engine optimizations (FlashAttention vs standard attention, continuous batching, speculative decoding), operating system overhead, and concurrent processes. The tool assumes clean-slate scenarios where your GPU is dedicated to LLM inference, but real-world developers often run models alongside IDEs, Docker containers, and browser tabs that fragment VRAM. I've seen cases where a model predicted to use 18GB actually triggered OOM errors at 22GB usage due to PyTorch memory allocator behavior. The tool would benefit from post-execution telemetry where it measures actual consumption and refines estimates over time.
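The telemetry idea suggested above could be prototyped along these lines. This is purely a hypothetical sketch of a feature the tool does not have: the `EstimateCalibrator` class and its smoothing factor are assumptions, and the measured values would have to come from whatever the inference backend reports after a run.

```python
class EstimateCalibrator:
    """Tracks the ratio of measured to predicted memory as an exponential moving average."""

    def __init__(self, smoothing: float = 0.3):
        self.smoothing = smoothing
        self.ratio = 1.0  # start by trusting the estimate as-is

    def record(self, predicted_gb: float, measured_gb: float) -> None:
        """Fold one observed run into the running correction ratio."""
        observed = measured_gb / predicted_gb
        self.ratio = (1 - self.smoothing) * self.ratio + self.smoothing * observed

    def adjust(self, predicted_gb: float) -> float:
        """Scale a fresh estimate by the learned correction ratio."""
        return predicted_gb * self.ratio


calibrator = EstimateCalibrator(smoothing=0.5)
calibrator.record(predicted_gb=18.0, measured_gb=22.0)
print(calibrator.adjust(18.0))
```

A moving average is used here so that a single outlier run, for example one where a browser was holding several gigabytes of VRAM, nudges the correction rather than replacing it.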
The benchmark merging methodology, while evidence-based, embeds opinionated weights about which evaluations matter. If you care primarily about code generation, the current formula may underweight Aider benchmarks relative to general-purpose evals like MMLU. The confidence grading system, though designed to prevent score fabrication, can't detect sophisticated gaming where model creators fine-tune specifically to benchmark tasks without generalizable improvements. There's no mechanism to adjust scoring priorities—you get the tool's opinion of "best" without customization beyond hardware constraints.
Verdict
Use whichllm if you're evaluating local LLMs across diverse hardware configurations, want evidence-based recommendations that reject marketing hype, or need to quickly prototype with models that actually fit your VRAM constraints. It excels at hardware purchase planning ("will a 4070 Ti Super run Mixtral 8x7B?"), team environments where standardizing on proven models matters, and rapid experimentation where manual benchmark comparison wastes hours. The one-command execution is particularly valuable for developers new to local LLM deployment who don't yet understand the quantization/context-length/VRAM tradeoff space.
Skip it if you need offline-first operation for airgapped environments, require precise control over benchmark weighting and scoring criteria, or already have deep expertise in LLM performance characteristics; manual evaluation with tools like llama.cpp's built-in benchmarks will give you more control. Also skip if you're deploying production inference servers where measured latency and throughput matter more than aggregate benchmark scores; in those cases, dedicated profiling tools like vLLM's benchmarking suite provide actionable metrics that whichllm's scoring abstraction obscures.