How LLM Checker Solves the Local Model Selection Problem with 4D Scoring
Hook
With access to 200+ models through Ollama—each with multiple quantization levels—developers waste hours downloading incompatible models only to discover they won’t fit in VRAM or run too slowly to be useful. LLM Checker eliminates this trial-and-error loop entirely.
Context
The local LLM ecosystem has exploded in complexity, with models available in numerous parameter sizes and quantization levels. Each combination has radically different memory requirements and performance characteristics. Developers with limited VRAM face a paradox of choice: should they run a larger model with aggressive quantization or a smaller model at higher precision? What about context length trade-offs?
This complexity is compounded by hardware diversity. Apple Silicon users have unified memory architectures where RAM doubles as VRAM. NVIDIA users must account for CUDA compatibility. AMD ROCm support varies by configuration. Intel Arc owners have distinct requirements. Existing solutions provide raw size estimates but no comparative analysis. Ollama’s model browser lists available models without hardware-aware filtering. LLM Checker bridges this gap with a deterministic scoring engine that transforms hardware specifications into actionable model recommendations.
Technical Insight
LLM Checker’s architecture centers on a multi-dimensional scoring system that evaluates models across four weighted axes: Quality (parameter count and quantization level), Speed (memory bandwidth requirements), Fit (available headroom after loading), and Context (maximum sequence length support). The tool operates in pure JavaScript with zero native dependencies, making it uniquely portable—it runs identically on Linux workstations, macOS laptops, and even Android devices via Termux.
The hardware detection layer introspects system capabilities to identify Apple Silicon, NVIDIA CUDA, AMD ROCm, Intel Arc, CPU, and integrated/dedicated GPU configurations. The detection module returns a normalized hardware profile containing discrete CPU RAM, dedicated GPU VRAM, and integrated GPU memory allocations.
Memory estimation uses what the README describes as a “bytes-per-parameter formula validated against real Ollama sizes.” While the specific implementation details aren’t publicly documented, the system applies quantization-aware coefficients to parameter counts plus runtime overhead to estimate total memory requirements.
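Since the actual coefficients aren't published, here is a rough sketch of how such a formula typically works. The bytes-per-parameter values and overhead constant below are assumptions based on common quantization bit widths, not LLM Checker's internal numbers.

```javascript
// Illustrative bytes-per-parameter estimate. Coefficients approximate common
// GGUF quantization bit widths; the overhead term stands in for KV cache,
// activations, and runtime buffers. All constants here are assumptions.
const BYTES_PER_PARAM = {
  q8_0: 1.0,    // ~8 bits per weight
  q5_K_M: 0.69, // ~5.5 bits
  q4_K_M: 0.6,  // ~4.8 bits
  q3_K_M: 0.48, // ~3.9 bits
};

function estimateMemoryGB(paramsBillions, quant, overheadGB = 0.75) {
  const weightBytes = paramsBillions * 1e9 * BYTES_PER_PARAM[quant];
  return weightBytes / 1024 ** 3 + overheadGB;
}

// A 7B model at Q4_K_M lands in the same ballpark as the ~4.4 GB
// Ollama reports for comparable models.
console.log(estimateMemoryGB(7, "q4_K_M").toFixed(1)); // prints 4.7
```

The appeal of this approach is determinism: the same parameter count and quantization always yield the same estimate, which makes the downstream scoring reproducible.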
The scoring engine evaluates each model against the hardware profile. Quality scores favor higher parameter counts and less aggressive quantization. Speed scores penalize models whose memory requirements approach available bandwidth limits. Fit scores reward models that leave headroom for context and runtime allocations. Context scores prioritize models supporting longer sequence lengths for use cases like document analysis.
Use-case profiles apply different weightings to these dimensions. The coding category weighs Quality and Context heavily, since developers need accurate completions across long code files. The chat category balances all four dimensions equally. The speed category prioritizes Speed and Fit over Quality, accepting lower precision for faster response times. Users can specify categories via the --category flag or let the tool apply balanced defaults.
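Putting the two previous paragraphs together, the engine plausibly reduces to a weighted sum over the four normalized dimension scores. The per-dimension formulas and category weights below are invented for illustration; only the four axes and the category names come from the tool itself.

```javascript
// Hypothetical 4D weighted scoring. Dimension formulas and weight values
// are illustrative assumptions, not LLM Checker's actual constants.
const CATEGORY_WEIGHTS = {
  coding: { quality: 0.35, speed: 0.15, fit: 0.15, context: 0.35 },
  chat:   { quality: 0.25, speed: 0.25, fit: 0.25, context: 0.25 },
  speed:  { quality: 0.10, speed: 0.45, fit: 0.35, context: 0.10 },
};

function scoreModel(model, hw, category = "chat") {
  const w = CATEGORY_WEIGHTS[category];
  const headroom = hw.availableGB - model.estMemoryGB;

  const quality = Math.min(1, model.paramsBillions / 70) * model.quantFactor;
  const fit = headroom <= 0 ? 0 : Math.min(1, headroom / 4); // reward spare memory
  const speed = Math.min(1, hw.bandwidthGBs / (model.estMemoryGB * 20));
  const context = Math.min(1, model.maxContext / 128000);

  return w.quality * quality + w.speed * speed + w.fit * fit + w.context * context;
}

const hw = { availableGB: 16, bandwidthGBs: 200 };
const model = { paramsBillions: 7, estMemoryGB: 4.7, quantFactor: 0.8, maxContext: 32000 };
console.log(scoreModel(model, hw, "coding").toFixed(2));
```

Note how a model that exceeds available memory collapses its Fit score to zero, which is how incompatible models drop out of the rankings without a separate filtering pass.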
The calibration workflow takes this further by generating empirical routing policies. Running llm-checker calibrate executes a benchmark suite against actual models, measuring real inference latency and memory consumption. The output is a routing policy file mapping use cases to specific models based on measured performance rather than theoretical estimates. This calibrated approach eliminates the gap between formula-based predictions and runtime reality.
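The policy file's actual schema isn't documented, but the concept is simple enough to sketch. Everything in this fragment is invented for illustration (the model names are real Ollama tags, but the fields and measurements are made up):

```json
{
  "generatedAt": "2025-01-15T10:00:00Z",
  "hardwareFingerprint": "apple-m2-16gb",
  "routes": {
    "coding": { "model": "qwen2.5-coder:7b", "measuredTokensPerSec": 31.2, "peakMemoryGB": 5.1 },
    "chat":   { "model": "llama3.1:8b",      "measuredTokensPerSec": 27.8, "peakMemoryGB": 5.6 },
    "speed":  { "model": "phi3:mini",        "measuredTokensPerSec": 58.4, "peakMemoryGB": 2.8 }
  }
}
```

The key property is that the routes are keyed to a hardware fingerprint: a policy calibrated on one machine says nothing about another, which is exactly why measurement beats formulas here.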
Ollama integration is first-class. The ai-run command combines hardware detection, model selection, and execution in a single flow. Running llm-checker ai-run --category coding --prompt "Write a Python REST API" detects hardware, scores the dynamic model pool, selects the top-ranked compatible model, verifies it’s pulled via ollama list, pulls it if missing, and executes the prompt. The --calibrated flag switches from deterministic scoring to policy-based routing using pre-generated calibration data.
The optional SQLite integration unlocks advanced search capabilities. After running npm install sql.js and executing llm-checker sync, the tool builds a local database of the full Ollama model registry—typically 200+ models versus the 35-model curated fallback list. The search command queries model names, tags, and descriptions. The smart-recommend command combines database search with hardware scoring to filter the entire catalog against your specific system constraints.
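The database schema isn't documented, but a search like this typically boils down to a straightforward query. The table and column names below are a hypothetical sketch of what a synced registry might look like:

```sql
-- Hypothetical registry schema; table and column names are illustrative.
CREATE TABLE models (
  name        TEXT PRIMARY KEY,  -- e.g. "llama3.1:8b"
  params_b    REAL,              -- parameter count in billions
  quant       TEXT,              -- e.g. "q4_K_M"
  tags        TEXT,
  description TEXT
);

-- What a search for "code" might execute under the hood.
SELECT name, params_b, quant
FROM models
WHERE name LIKE '%code%' OR tags LIKE '%code%' OR description LIKE '%code%'
ORDER BY params_b DESC;
```

smart-recommend would then feed the result set through the same hardware scoring used for the curated list, just over a much larger candidate pool.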
Gotcha
LLM Checker’s Ollama-only integration is both its strength and constraint. If you’re running models through llama.cpp directly, vLLM for production serving, or HuggingFace Transformers for custom inference pipelines, this tool provides no value—it won’t detect those models or provide recommendations for non-Ollama workflows. The architecture assumes Ollama as the inference runtime, and all recommendations output Ollama-specific pull and run commands.
The memory estimation formulas, while described as calibrated against real model sizes, remain approximations. System-specific factors like background processes consuming VRAM or varying runtime overhead can affect actual memory usage. A model that scores as perfectly compatible might still encounter out-of-memory errors depending on your specific system state. The calibration workflow mitigates this by measuring actual consumption on your hardware, but users relying on the deterministic scoring mode should treat memory estimates as informed predictions rather than guarantees.
Verdict
Use LLM Checker if you’re running Ollama locally and need to eliminate guesswork around model compatibility—especially valuable if you have constrained hardware like 8GB MacBooks, mid-range NVIDIA GPUs with limited VRAM, or AMD cards where ROCm support may be inconsistent. The calibration workflow is essential for production use cases where you need empirical routing policies rather than heuristic estimates. It’s also ideal for developers new to local LLMs who lack intuition about quantization trade-offs and parameter scaling. Skip it if you’re using inference frameworks other than Ollama, running in cloud environments where you can provision arbitrary resources on demand, or if you’re already fluent in model sizing calculations and prefer manual control over your model selection process. Also skip if you need multi-framework support—LLM Checker won’t help you compare Ollama vs llama.cpp vs vLLM performance for the same model.