LLM Checker: Hardware-Aware Model Selection for Local Inference
Hook
Developers routinely lose hours cycling through LLM models before finding one that actually runs on their hardware. LLM Checker eliminates most of that churn with deterministic hardware scanning and intelligent model ranking.
Context
Running large language models locally has become increasingly popular as developers seek privacy, cost control, and offline capability. Ollama democratized local LLM deployment by packaging models with optimized runtimes, but it introduced a new problem: choice paralysis combined with resource constraints. With 200+ models available—ranging from 1B parameter models that fit on a Raspberry Pi to 70B+ behemoths requiring high-end GPUs—developers spend significant time in trial-and-error cycles. Download a 40GB model, attempt to run it, watch it crash from out-of-memory errors, delete it, and repeat.
The fundamental issue is information asymmetry. Model cards list parameter counts and quantization schemes (Q4_K_M, Q8_0), but translating those specifications into “will this run on my 16GB MacBook Pro?” requires understanding memory estimation formulas, GPU architecture differences, and context window overhead. LLM Checker emerged to solve this coordination problem by building a compatibility layer between hardware capabilities and model requirements, turning an empirical guessing game into a deterministic selection process.
Technical Insight
LLM Checker’s architecture is deceptively simple—a Node.js CLI that orchestrates hardware detection, model catalog fetching, memory estimation, and multi-dimensional scoring. The beauty lies in how these stages compose without requiring native compilation or platform-specific binaries.
The hardware detection phase uses Node’s os module combined with platform-specific command execution to identify GPU capabilities. On macOS, it shells out to system_profiler SPDisplaysDataType to detect Metal-compatible GPUs and unified memory. For NVIDIA systems, it parses nvidia-smi output to extract VRAM. AMD and Intel Arc detection follow similar patterns with rocm-smi and driver queries. The tool gracefully degrades—if GPU detection fails, it falls back to CPU-only recommendations based on available RAM.
// Simplified hardware detection example
const os = require('os');
const { execSync } = require('child_process');

const detectHardware = async () => {
  const platform = process.platform;
  const availableMemory = os.totalmem();
  let gpuMemory = 0;
  let accelerator = 'cpu';

  if (platform === 'darwin') {
    // Check for Apple Silicon unified memory (M1, M2, M3, ...)
    const output = execSync('system_profiler SPHardwareDataType').toString();
    if (/Apple M\d/.test(output)) {
      accelerator = 'metal';
      gpuMemory = availableMemory; // Unified memory architecture
    }
  } else if (platform === 'linux' || platform === 'win32') {
    try {
      const nvidiaOutput = execSync(
        'nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits'
      ).toString();
      gpuMemory = parseInt(nvidiaOutput.trim(), 10) * 1024 * 1024; // Convert MB to bytes
      accelerator = 'cuda';
    } catch (e) {
      // nvidia-smi not present; fall through to check AMD/Intel
    }
  }

  return { availableMemory, gpuMemory, accelerator };
};
The memory estimation formula is where LLM Checker differentiates itself from naive calculators. It doesn’t just multiply parameters by bytes-per-parameter—it accounts for quantization schemes, context window overhead, and inference runtime requirements. A Q4_K_M quantized 7B model uses approximately 4.5 bits per parameter (not a clean 4), plus 20-30% overhead for KV cache and inference state. The tool maintains a calibration database that maps quantization schemes to empirical multipliers, validated against actual Ollama model sizes.
// Memory estimation with quantization awareness
const estimateMemory = (params, quantization, contextWindow = 2048,
                        nLayers = 32, hiddenDim = 4096) => {
  // Empirical bits-per-parameter for common quantization schemes
  const quantizationMap = {
    'Q4_K_M': 4.5,
    'Q4_K_S': 4.3,
    'Q5_K_M': 5.5,
    'Q8_0': 8.5,
    'fp16': 16
  };
  const bitsPerParam = quantizationMap[quantization] || 4.5;
  const modelBytes = (params * 1e9 * bitsPerParam) / 8;

  // KV cache: K and V tensors per layer, fp16 (2 bytes) per element.
  // Defaults approximate a 7B-class architecture (32 layers, 4096 hidden dim).
  const kvCacheBytes = contextWindow * nLayers * hiddenDim * 2 * 2;

  // Runtime overhead (tokenizer, buffers, etc.)
  const runtimeOverhead = modelBytes * 0.2;

  return modelBytes + kvCacheBytes + runtimeOverhead;
};
The scoring system is where pragmatic engineering shines. Rather than a single fitness metric, LLM Checker evaluates models across four dimensions: Quality (benchmark performance from model cards), Speed (inversely proportional to parameter count and quantization level), Fit (memory headroom percentage), and Context (context window size relative to use case). These dimensions are weighted differently based on user-selected categories—‘coding’ tasks prioritize Quality and Context, ‘chat’ balances all four, ‘embedding’ focuses exclusively on Speed and Fit.
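As a rough illustration of how such category-weighted scoring might work, here is a sketch. The specific weights, the normalization formulas, and the scoreModel helper are assumptions for illustration, not LLM Checker's actual implementation.

```javascript
// Hypothetical sketch of category-weighted scoring -- weights and
// normalization are illustrative, not LLM Checker's real values.
const CATEGORY_WEIGHTS = {
  coding:    { quality: 0.4,  speed: 0.1,  fit: 0.2,  context: 0.3 },
  chat:      { quality: 0.25, speed: 0.25, fit: 0.25, context: 0.25 },
  embedding: { quality: 0.0,  speed: 0.5,  fit: 0.5,  context: 0.0 },
};

// Each dimension is normalized to [0, 1] before weighting.
const scoreModel = (model, hardware, category = 'chat') => {
  const w = CATEGORY_WEIGHTS[category] || CATEGORY_WEIGHTS.chat;
  // Fit: remaining memory headroom as a fraction of what's available
  const fit = Math.max(0, 1 - model.estimatedBytes / hardware.availableMemory);
  // Speed: smaller models score higher (7B used as a reference point)
  const speed = 1 / (1 + model.params / 7);
  // Context: saturate at an assumed 8K-token sweet spot
  const context = Math.min(1, model.contextWindow / 8192);
  // Quality: assumed pre-normalized benchmark score in [0, 1]
  const quality = model.benchmarkScore;
  return w.quality * quality + w.speed * speed + w.fit * fit + w.context * context;
};
```

The per-category weights make the trade-offs explicit: an 'embedding' workload ignores quality entirely and rewards small, fast models that leave memory headroom, while 'coding' tolerates slower inference in exchange for quality and context length.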
The optional SQLite integration enables more sophisticated queries. When enabled, the tool creates an indexed database of model metadata, allowing developers to filter by specific quantization schemes, search by capability tags, or generate side-by-side comparisons. This is particularly valuable in CI/CD environments where model selection needs to be scripted and auditable.
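The shape of such a scripted query might look like the following sketch, which builds a parameterized SQL statement from filter options. The table name, column names, and the buildModelQuery helper are assumptions for illustration; LLM Checker's actual schema may differ.

```javascript
// Hypothetical sketch of the kind of filterable query the SQLite
// integration enables. Schema names are assumptions, not the real ones.
const buildModelQuery = ({ quantization, maxBytes, tag } = {}) => {
  const clauses = [];
  const params = [];
  if (quantization) {
    clauses.push('quantization = ?');
    params.push(quantization);
  }
  if (maxBytes) {
    clauses.push('estimated_bytes <= ?');
    params.push(maxBytes);
  }
  if (tag) {
    clauses.push('capability_tags LIKE ?');
    params.push(`%${tag}%`);
  }
  const where = clauses.length ? ` WHERE ${clauses.join(' AND ')}` : '';
  return {
    sql: `SELECT name, quantization, estimated_bytes FROM models${where} ORDER BY quality_score DESC`,
    params,
  };
};
```

Parameterized statements like this are what make the selection auditable in CI/CD: the query and its bound values can be logged alongside the chosen model.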
The calibration framework represents the most advanced feature. Developers can run llm-checker calibrate --suite coding to benchmark actual models with predefined prompt sets, measuring real-world latency, throughput, and quality metrics. The tool generates a routing policy file—a JSON mapping of use cases to optimal models—that overrides the deterministic scoring. This bridges the gap between theoretical estimates and production performance, allowing teams to codify their empirical findings.
// Example calibration policy output
{
  "hardware_profile": "M2_MacBook_32GB",
  "policies": [
    {
      "use_case": "code_completion",
      "model": "codellama:13b-q4_K_M",
      "rationale": "Best quality/speed tradeoff; 87ms p95 latency",
      "fallback": "codellama:7b-q4_K_M"
    },
    {
      "use_case": "chat",
      "model": "mistral:7b-q5_K_M",
      "rationale": "Highest user satisfaction in blind tests",
      "fallback": "llama2:7b-q4_K_M"
    }
  ]
}
Gotcha
LLM Checker’s tight coupling to Ollama creates both convenience and constraint. The calibration mode exclusively supports Ollama as the inference runtime, meaning developers running llama.cpp directly, using vLLM, or working with custom GGUF files can’t leverage benchmarking features. This is a pragmatic trade-off—Ollama provides a stable API surface and consistent model packaging—but it limits flexibility for teams with heterogeneous inference stacks.
The memory estimation formula, while calibrated, remains approximate. Real-world performance depends on factors the tool can’t detect: quantization quality variations between model families, context window usage patterns (rarely do users hit max context), inference optimizations in specific Ollama versions, and GPU memory fragmentation. A model estimated to need 14GB might run comfortably in 12GB or crash at 15GB depending on these variables. Developers should treat recommendations as starting points requiring validation, not guarantees. The tool also lacks awareness of concurrent model loading—if you’re running multiple models simultaneously or have other GPU-intensive processes, the memory calculations become meaningless. It assumes exclusive access to detected hardware resources.
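One practical way to act on that caveat is to build an explicit safety margin into any "will it fit" decision rather than trusting a point estimate. The fitsWithHeadroom helper below is a hypothetical sketch with illustrative thresholds, not part of LLM Checker itself.

```javascript
// Hypothetical wrapper applying a safety margin to memory estimates
// before trusting a "will it fit" verdict -- thresholds are illustrative.
const fitsWithHeadroom = (estimatedBytes, availableBytes, margin = 0.15) => {
  // Reserve `margin` of memory for the OS, other processes, and
  // fragmentation that the estimator cannot see.
  const usable = availableBytes * (1 - margin);
  if (estimatedBytes > usable) return 'no';
  // Within 10% of the usable budget: likely to work, but validate first.
  if (estimatedBytes > usable * 0.9) return 'borderline';
  return 'yes';
};
```

Treating "borderline" as a prompt to run a real smoke test, rather than a green light, matches the tool's own framing of recommendations as starting points.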
Verdict
Use LLM Checker if you’re onboarding developers to local LLM workflows and want to eliminate the “which model should I try first?” confusion, deploying models across heterogeneous hardware (mixed Mac/Linux/Windows environments) and need consistent selection criteria, or building CI/CD pipelines that need programmatic model selection based on runner specs. The calibration framework justifies adoption for production teams standardizing model choices across development, staging, and production environments.
Skip it if you’re already successfully running models and have empirically determined what works on your hardware, need runtime-agnostic benchmarking beyond Ollama’s ecosystem, or require deep performance profiling with latency percentiles and throughput metrics rather than binary “will it run” compatibility checks. For quick compatibility screening and intelligent recommendations, it’s the most pragmatic tool available; for comprehensive performance engineering, you’ll need additional profiling tools.