LLM Checker: Hardware-Aware Model Selection for Local AI Inference
Hook
Pulling a 40GB LLM only to discover it won't fit in your GPU's VRAM is a rite of passage for local AI developers. LLM Checker eliminates this trial-and-error tax by predicting model compatibility before you download a single byte.
Context
The explosion of open-source LLMs created a paradox of choice that hardware amplifies. Ollama's registry alone contains over 200 model families spanning 7,000+ variants—each with different quantization levels (Q2_K, Q4_0, Q5_K_M, Q8_0), parameter counts (7B to 405B), and context windows (2K to 128K tokens). A developer with 16GB of unified memory on an M2 MacBook faces a combinatorial nightmare: Which Llama 3.1 variant will actually run? Should you sacrifice quality for speed? Will that 70B model at Q2_K quantization even fit, and if it does, will inference be uselessly slow?
The traditional workflow is brutal: read model cards, estimate memory requirements using rough rules of thumb (about 0.5 GB per billion parameters at Q4, about 1 GB at Q8, plus room for context), pull a multi-gigabyte model, attempt to run it, watch it fail or crawl, repeat. LLM Checker emerged to solve this by packaging hardware detection, a SQLite catalog of real Ollama model metadata, and a deterministic scoring engine into a single CLI tool. Instead of guessing, you run one command and get ranked recommendations calibrated to your exact hardware profile, whether you're on Apple Silicon, NVIDIA CUDA, AMD ROCm, or Intel Arc.
Technical Insight
LLM Checker's architecture revolves around three core components: hardware detection, a packaged SQLite catalog, and a multi-dimensional scoring engine. The tool is pure JavaScript with zero native dependencies, which means it runs on Node.js 16+ without compilation—including Android via Termux.
Hardware detection starts by probing system specs through Node.js APIs and parsing the output of system utilities. On macOS, it detects Apple Silicon unified memory via sysctl. On Linux and Windows, it queries NVIDIA's nvidia-smi for CUDA GPUs and VRAM, checks for AMD ROCm devices, and identifies Intel Arc GPUs. The tool distinguishes between GPU VRAM and system RAM because memory architecture fundamentally changes model viability: on unified memory systems the GPU draws from the same pool as the CPU, so a large share of total RAM can hold model weights, whereas a discrete GPU is bounded by its VRAM and slows down sharply once layers spill into system RAM.
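As a rough illustration of the NVIDIA probe described above, the sketch below shells out to nvidia-smi and parses its CSV output. The function names (parseNvidiaSmi, detectNvidiaGpus) are hypothetical and this is not LLM Checker's actual detection code; only the nvidia-smi query flags are standard.

```javascript
const { execFileSync } = require("node:child_process");

// Parse the CSV printed by:
//   nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits
// Each line looks like "NVIDIA GeForce RTX 3090, 24576" (memory in MiB).
function parseNvidiaSmi(csvText) {
  return csvText
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      // The memory value is the last comma-separated field.
      const idx = line.lastIndexOf(",");
      const name = line.slice(0, idx).trim();
      const vramMiB = Number(line.slice(idx + 1).trim());
      return { name, vramGiB: vramMiB / 1024 };
    });
}

// Hypothetical helper: returns [] when nvidia-smi is missing or fails,
// so callers can fall back to CPU-only or other GPU probes.
function detectNvidiaGpus() {
  try {
    const out = execFileSync(
      "nvidia-smi",
      ["--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
      { encoding: "utf8" }
    );
    return parseNvidiaSmi(out);
  } catch {
    return [];
  }
}
```

Keeping the parser separate from the process call makes it easy to test against captured output on machines that have no GPU.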
The SQLite catalog is the secret weapon. Rather than hitting Ollama's API on every invocation, LLM Checker ships with a pre-populated database containing metadata for 200+ model families and 7,000+ variants. This includes parameter counts, quantization schemes, typical memory footprints, and context window sizes. The database is versioned and can be refreshed on demand via the llm-checker sync command, which fetches fresh data from Ollama's registry. This offline-first design means the tool works in air-gapped environments and responds in milliseconds instead of seconds.
Here's how you'd use it to find models for a system with 24GB VRAM:
# Basic hardware scan and model recommendations
npx llm-checker scan
# Output:
# Hardware Profile:
# GPU: NVIDIA RTX 3090 (24GB VRAM)
# RAM: 64GB
# CPU: AMD Ryzen 9 5950X
#
# Top Recommended Models:
# 1. llama3.1:70b-instruct-q4_0 (Score: 87.3)
# Quality: 92, Speed: 78, Fit: 95, Context: 84
# 2. qwen2.5:32b-instruct-q5_k_m (Score: 85.1)
# Quality: 88, Speed: 84, Fit: 89, Context: 79
# 3. mistral:22b-instruct-v0.3-q8_0 (Score: 82.7)
# Quality: 95, Speed: 71, Fit: 88, Context: 76
The scoring engine is where LLM Checker differentiates itself. Each model receives four scores:
- Quality: Based on parameter count and quantization level. A 70B model at Q8_0 scores higher than a 7B at Q4_0.
- Speed: Inversely related to model size and directly related to available memory headroom. A model using 80% of VRAM scores lower than one using 60%.
- Fit: How well the model's memory footprint matches available hardware, accounting for OS overhead and context window allocation.
- Context: Evaluates the model's context window size relative to its memory footprint—longer contexts score higher if memory permits.
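The headroom idea behind the Speed score can be sketched as follows. This is a hypothetical linear formula written for illustration, not LLM Checker's actual scoring code; the function name headroomScore and the 2 GB reserve are assumptions.

```javascript
// Hypothetical headroom score: higher when the model leaves more of the
// usable memory free, 0 when the model does not fit at all.
function headroomScore(footprintGb, totalMemoryGb, reservedGb = 2) {
  const usableGb = totalMemoryGb - reservedGb; // keep room for OS and runtime
  if (usableGb <= 0 || footprintGb > usableGb) return 0;
  const utilization = footprintGb / usableGb; // fraction of usable memory taken
  return Math.round((1 - utilization) * 100);
}
```

Under this sketch, a model occupying 80% of usable memory scores lower than one occupying 60%, which matches the ordering the Speed bullet describes.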
These scores are weighted by use case. The default profile balances all four, but you can override:
# Prioritize speed for real-time applications
npx llm-checker scan --policy speed
# Prioritize quality for offline batch processing
npx llm-checker scan --policy quality
# Custom weights via calibration fixtures
npx llm-checker scan --calibration ./my-weights.json
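To make the policy weighting concrete, here is a minimal sketch of a weighted score and ranking. The weight values and the names POLICIES, overallScore, and rank are assumptions made for illustration; the tool's real weights and calibration file format are not documented here.

```javascript
// Hypothetical policy weights; each set sums to 1.
const POLICIES = {
  balanced: { quality: 0.25, speed: 0.25, fit: 0.25, context: 0.25 },
  speed: { quality: 0.15, speed: 0.5, fit: 0.25, context: 0.1 },
  quality: { quality: 0.5, speed: 0.15, fit: 0.25, context: 0.1 },
};

// Weighted sum of the four component scores under the chosen policy.
function overallScore(scores, policy = "balanced") {
  const weights = POLICIES[policy];
  if (!weights) throw new Error(`Unknown policy: ${policy}`);
  return Object.keys(weights).reduce(
    (sum, key) => sum + weights[key] * scores[key],
    0
  );
}

// Sort models from best to worst under a policy.
function rank(models, policy = "balanced") {
  // Copy before sorting so the caller's array is left untouched.
  return [...models].sort(
    (a, b) => overallScore(b.scores, policy) - overallScore(a.scores, policy)
  );
}
```

Because the scores are a pure function of the inputs and the weights, the same hardware profile and catalog always produce the same ranking, and switching policy can reorder two models whose component scores trade off quality against speed.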
Memory estimation uses a bytes-per-parameter formula calibrated against actual Ollama model sizes. For example, Q4_0 quantization typically requires ~0.5 bytes per parameter plus overhead for context and KV cache. The formula looks like:
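// Approximate bytes per parameter for each quantization (illustrative values)
const QUANT_MAP = { Q4_0: 0.5, Q8_0: 1.0 };

// params is the parameter count in billions, so sizes come out in GB
function estimateMemory(params, quantization, contextWindow) {
  const bytesPerParam = QUANT_MAP[quantization]; // e.g., 0.5 for Q4_0
  const modelSize = params * bytesPerParam; // weights, in GB
  // Rough placeholder: actual KV cache size depends on layer count, KV heads, and head dimension
  const kvCacheSize = (contextWindow * params * 2 * 2) / 1e9;
  const overhead = modelSize * 0.1; // OS and runtime overhead
  return modelSize + kvCacheSize + overhead;
}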
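The KV cache term is usually estimated from a model's architecture rather than from its total parameter count. The sketch below uses the standard per-sequence formula (keys plus values, for every layer and KV head). The function name kvCacheGiB is hypothetical, and nothing here is taken from LLM Checker's source.

```javascript
// KV cache size in GiB for a single sequence.
// The leading 2 accounts for storing both keys and values;
// bytesPerElement defaults to 2 for a 16-bit cache.
function kvCacheGiB({ layers, kvHeads, headDim, contextTokens, bytesPerElement = 2 }) {
  const bytes = 2 * layers * kvHeads * headDim * contextTokens * bytesPerElement;
  return bytes / 1024 ** 3;
}

// Example: a 32-layer model with 8 KV heads of dimension 128
// (the shape used by Llama 3.1 8B) at an 8,192-token context.
const exampleGiB = kvCacheGiB({ layers: 32, kvHeads: 8, headDim: 128, contextTokens: 8192 });
```

The cache grows linearly with context length, so the same model at a 128K context would need 16 times as much memory for the cache as it does at 8K.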
LLM Checker also includes an ai-run command that executes models through Ollama and streams responses with live tokens-per-second metrics:
# Run a model and benchmark it
npx llm-checker ai-run llama3.1:70b-instruct-q4_0 "Explain quantum entanglement"
# Output streams tokens with live metrics:
# [32.4 tok/s] Quantum entanglement is a phenomenon where...
# [31.8 tok/s] particles become correlated in such a way...
# [33.1 tok/s] that measuring one instantly affects the other...
#
# Final: 847 tokens in 26.3s (32.2 tok/s avg)
This tight integration with Ollama means you can validate LLM Checker's recommendations immediately without switching tools. The deterministic scoring also means you can script model selection in CI/CD pipelines or edge deployment workflows where hardware varies.
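The same average throughput figure can be reproduced directly against Ollama's REST API, whose non-streaming /api/generate response reports eval_count (tokens generated) and eval_duration (nanoseconds). This is a sketch assuming a local Ollama server on its default port and Node.js 18+ for the global fetch; the function names are hypothetical and this is not how ai-run is necessarily implemented.

```javascript
// eval_count: tokens generated; eval_duration: generation time in nanoseconds.
function tokensPerSecond({ eval_count, eval_duration }) {
  if (!eval_duration) return 0;
  return eval_count / (eval_duration / 1e9);
}

// Hypothetical helper: run one prompt and return average tokens per second.
async function benchmark(model, prompt, baseUrl = "http://localhost:11434") {
  const res = await fetch(`${baseUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  return tokensPerSecond(await res.json());
}
```

For the run shown above, 847 tokens over 26.3 seconds works out to roughly 32.2 tokens per second.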
Gotcha
LLM Checker's biggest limitation is its dependency on Ollama for actual model execution. The tool doesn't bundle inference engines—it's purely a recommendation and orchestration layer. If Ollama isn't installed or you're using llama.cpp, vLLM, or another runtime, LLM Checker's ai-run command won't work. You'll still get hardware-aware recommendations, but you lose the integrated benchmarking.
The memory estimation formula is calibrated to Ollama's quantization schemes as of the tool's last update. Ollama occasionally changes how models are quantized or adopts new quantization formats (like the recent IQ variants). If you're using bleeding-edge quantizations or custom GGUF files, LLM Checker's predictions may be off by 10-20%. The SQLite catalog also needs periodic syncing via llm-checker sync; if you forget, you'll miss newly released models.
The scoring algorithm is deterministic and policy-based, so it won't learn from your actual inference patterns. If you consistently run models at non-standard context lengths or with custom sampling parameters, the scores won't reflect your real-world performance. You'd need to create custom calibration fixtures, which requires understanding the tool's internals.
Verdict
Use if: You're working with Ollama and need to quickly identify which models will run on specific hardware without downloading dozens of multi-gigabyte files. LLM Checker shines on constrained systems (laptops, edge devices, developer workstations with mixed GPU setups) where trial-and-error is expensive. It's invaluable for teams deploying local LLMs across heterogeneous hardware or developers exploring Ollama's catalog for the first time. The offline SQLite catalog and pure JavaScript implementation also make it ideal for air-gapped or resource-limited environments.
Skip if: You're already intimately familiar with your hardware limits and Ollama's model zoo, prefer cloud-hosted inference where hardware constraints don't matter, or need adaptive model selection based on runtime profiling rather than static hardware specs. If you're using inference engines other than Ollama as your primary runtime, or require real-time model switching based on load patterns, LLM Checker's deterministic approach won't fit your workflow.