
LLM Checker: Hardware-Aware Model Selection for Local Inference


Hook

Developers routinely burn hours downloading and testing LLMs before finding one that actually runs on their hardware. LLM Checker aims to eliminate that trial-and-error cycle with deterministic hardware scanning and intelligent model ranking.

Context

Running large language models locally has become increasingly popular as developers seek privacy, cost control, and offline capability. Ollama democratized local LLM deployment by packaging models with optimized runtimes, but it introduced a new problem: choice paralysis combined with resource constraints. With 200+ models available—ranging from 1B parameter models that fit on a Raspberry Pi to 70B+ behemoths requiring high-end GPUs—developers spend significant time in trial-and-error cycles. Download a 40GB model, attempt to run it, watch it crash from out-of-memory errors, delete it, and repeat.

The fundamental issue is information asymmetry. Model cards list parameter counts and quantization schemes (Q4_K_M, Q8_0), but translating those specifications into “will this run on my 16GB MacBook Pro?” requires understanding memory estimation formulas, GPU architecture differences, and context window overhead. LLM Checker emerged to solve this coordination problem by building a compatibility layer between hardware capabilities and model requirements, turning an empirical guessing game into a deterministic selection process.

Technical Insight

System architecture (auto-generated diagram, summarized): a CLI entry point drives a hardware detector, whose GPU detection shells out to nvidia-smi (CUDA VRAM), system_profiler (Metal unified memory), or rocm-smi (AMD VRAM), falling back to CPU RAM only when no accelerator is found. The detected system info, together with model specs from a model catalog fetcher, feeds a memory calculator whose memory estimates form a compatibility matrix consumed by the scoring engine and results display.

LLM Checker’s architecture is deceptively simple—a Node.js CLI that orchestrates hardware detection, model catalog fetching, memory estimation, and multi-dimensional scoring. The beauty lies in how these stages compose without requiring native compilation or platform-specific binaries.

The hardware detection phase uses Node’s os module combined with platform-specific command execution to identify GPU capabilities. On macOS, it shells out to system_profiler SPDisplaysDataType to detect Metal-compatible GPUs and unified memory. For NVIDIA systems, it parses nvidia-smi output to extract VRAM. AMD and Intel Arc detection follow similar patterns with rocm-smi and driver queries. The tool gracefully degrades—if GPU detection fails, it falls back to CPU-only recommendations based on available RAM.

// Simplified hardware detection example
const os = require('os');
const { execSync } = require('child_process');

const detectHardware = async () => {
  const platform = process.platform;
  const availableMemory = os.totalmem();
  let gpuMemory = 0;
  let accelerator = 'cpu';

  if (platform === 'darwin') {
    // Check for Apple Silicon unified memory (M1, M2, M3, ...)
    const output = execSync('system_profiler SPHardwareDataType').toString();
    if (/Apple M\d/.test(output)) {
      accelerator = 'metal';
      gpuMemory = availableMemory; // Unified memory architecture
    }
  } else if (platform === 'linux' || platform === 'win32') {
    try {
      const nvidiaOutput = execSync(
        'nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits'
      ).toString();
      gpuMemory = parseInt(nvidiaOutput.trim(), 10) * 1024 * 1024; // MiB to bytes
      accelerator = 'cuda';
    } catch (e) {
      // nvidia-smi missing or failed; fall through to check AMD/Intel
    }
  }

  return { availableMemory, gpuMemory, accelerator };
};

The memory estimation formula is where LLM Checker differentiates itself from naive calculators. It doesn’t just multiply parameters by bytes-per-parameter—it accounts for quantization schemes, context window overhead, and inference runtime requirements. A Q4_K_M quantized 7B model uses approximately 4.5 bits per parameter (not a clean 4), plus 20-30% overhead for KV cache and inference state. The tool maintains a calibration database that maps quantization schemes to empirical multipliers, validated against actual Ollama model sizes.

// Memory estimation with quantization awareness
const estimateMemory = (params, quantization, contextWindow = 2048) => {
  const quantizationMap = {
    'Q4_K_M': 4.5,
    'Q4_K_S': 4.3,
    'Q5_K_M': 5.5,
    'Q8_0': 8.5,
    'fp16': 16
  };
  
  const bitsPerParam = quantizationMap[quantization] || 4.5;
  const modelBytes = (params * 1e9 * bitsPerParam) / 8;
  
  // KV cache: roughly 0.5 MB per token for a 7B model at fp16, scaled
  // approximately linearly with parameter count (a rough heuristic)
  const kvCacheBytes = contextWindow * params * (512 * 1024 / 7);
  
  // Runtime overhead (tokenizer, buffers, etc.)
  const runtimeOverhead = modelBytes * 0.2;
  
  return modelBytes + kvCacheBytes + runtimeOverhead;
};

The scoring system is where pragmatic engineering shines. Rather than a single fitness metric, LLM Checker evaluates models across four dimensions: Quality (benchmark performance from model cards), Speed (inversely proportional to parameter count and quantization level), Fit (memory headroom percentage), and Context (context window size relative to use case). These dimensions are weighted differently based on user-selected categories—‘coding’ tasks prioritize Quality and Context, ‘chat’ balances all four, ‘embedding’ focuses exclusively on Speed and Fit.
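A minimal sketch of what such category-weighted scoring might look like. The four dimension names come from the article; the specific weights, normalization to a 0–1 range, and function shape are illustrative assumptions, not LLM Checker's actual values:

```javascript
// Hypothetical weighted scoring across the four dimensions described above.
// Each dimension score is assumed pre-normalized to 0..1; weights are
// illustrative only, chosen to mirror the category priorities in the text.
const CATEGORY_WEIGHTS = {
  coding:    { quality: 0.4,  speed: 0.1,  fit: 0.2,  context: 0.3 },
  chat:      { quality: 0.25, speed: 0.25, fit: 0.25, context: 0.25 },
  embedding: { quality: 0,    speed: 0.5,  fit: 0.5,  context: 0 },
};

const scoreModel = (dimensions, category = 'chat') => {
  const weights = CATEGORY_WEIGHTS[category];
  // Weighted sum; missing dimensions score zero
  return Object.entries(weights).reduce(
    (total, [dim, w]) => total + w * (dimensions[dim] ?? 0),
    0
  );
};

// Example: a mid-size model with comfortable memory headroom
const score = scoreModel(
  { quality: 0.8, speed: 0.6, fit: 0.9, context: 0.7 },
  'coding'
);
console.log(score.toFixed(2)); // → "0.77"
```

The per-category weight table makes the trade-offs auditable: changing what 'coding' prioritizes is a one-line diff rather than a change to the scoring logic itself.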

The optional SQLite integration enables more sophisticated queries. When enabled, the tool creates an indexed database of model metadata, allowing developers to filter by specific quantization schemes, search by capability tags, or generate side-by-side comparisons. This is particularly valuable in CI/CD environments where model selection needs to be scripted and auditable.
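The actual schema isn't documented publicly, so as a sketch, here is the kind of filter such an indexed catalog makes scriptable, shown over a plain in-memory array with the equivalent SQL in a comment (all field names are assumptions):

```javascript
// Hypothetical model metadata rows, mirroring what an indexed SQLite
// catalog might hold (field names illustrative, not the real schema).
const catalog = [
  { name: 'codellama:13b', quantization: 'Q4_K_M', tags: ['coding'], sizeGB: 7.4 },
  { name: 'mistral:7b',    quantization: 'Q5_K_M', tags: ['chat'],   sizeGB: 5.1 },
  { name: 'llama2:7b',     quantization: 'Q4_K_M', tags: ['chat'],   sizeGB: 3.8 },
];

// Roughly equivalent SQL:
//   SELECT name FROM models
//   WHERE quantization = 'Q4_K_M' AND size_gb <= 8
//   ORDER BY size_gb DESC;
const filterByQuant = (rows, quant, maxSizeGB) =>
  rows
    .filter((m) => m.quantization === quant && m.sizeGB <= maxSizeGB)
    .sort((a, b) => b.sizeGB - a.sizeGB)
    .map((m) => m.name);

console.log(filterByQuant(catalog, 'Q4_K_M', 8));
// → [ 'codellama:13b', 'llama2:7b' ]
```

In a CI/CD context the same query runs against the on-disk database, so the selection is reproducible across runners rather than depending on whatever happens to be pulled locally.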

The calibration framework represents the most advanced feature. Developers can run llm-checker calibrate --suite coding to benchmark actual models with predefined prompt sets, measuring real-world latency, throughput, and quality metrics. The tool generates a routing policy file—a JSON mapping of use cases to optimal models—that overrides the deterministic scoring. This bridges the gap between theoretical estimates and production performance, allowing teams to codify their empirical findings.

// Example calibration policy output
{
  "hardware_profile": "M2_MacBook_32GB",
  "policies": [
    {
      "use_case": "code_completion",
      "model": "codellama:13b-q4_K_M",
      "rationale": "Best quality/speed tradeoff; 87ms p95 latency",
      "fallback": "codellama:7b-q4_K_M"
    },
    {
      "use_case": "chat",
      "model": "mistral:7b-q5_K_M",
      "rationale": "Highest user satisfaction in blind tests",
      "fallback": "llama2:7b-q4_K_M"
    }
  ]
}
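A policy file like this is straightforward to consume at runtime. As a sketch (the resolver function and availability check are assumptions, not part of LLM Checker's documented API), resolution with fallback might look like:

```javascript
// Resolve a use case against a calibration policy, preferring the primary
// model and falling back when it is unavailable (e.g. not pulled locally).
const resolveModel = (policyFile, useCase, isAvailable) => {
  const policy = policyFile.policies.find((p) => p.use_case === useCase);
  if (!policy) return null;
  if (isAvailable(policy.model)) return policy.model;
  if (policy.fallback && isAvailable(policy.fallback)) return policy.fallback;
  return null;
};

const policyFile = {
  hardware_profile: 'M2_MacBook_32GB',
  policies: [
    {
      use_case: 'code_completion',
      model: 'codellama:13b-q4_K_M',
      fallback: 'codellama:7b-q4_K_M',
    },
  ],
};

// Pretend only the 7B fallback has been pulled locally
const pulled = new Set(['codellama:7b-q4_K_M']);
console.log(resolveModel(policyFile, 'code_completion', (m) => pulled.has(m)));
// → 'codellama:7b-q4_K_M'
```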

Gotcha

LLM Checker’s tight coupling to Ollama creates both convenience and constraint. The calibration mode exclusively supports Ollama as the inference runtime, meaning developers running llama.cpp directly, using vLLM, or working with custom GGUF files can’t leverage benchmarking features. This is a pragmatic trade-off—Ollama provides a stable API surface and consistent model packaging—but it limits flexibility for teams with heterogeneous inference stacks.

The memory estimation formula, while calibrated, remains approximate. Real-world performance depends on factors the tool can’t detect: quantization quality variations between model families, context window usage patterns (rarely do users hit max context), inference optimizations in specific Ollama versions, and GPU memory fragmentation. A model estimated to need 14GB might run comfortably in 12GB or crash at 15GB depending on these variables. Developers should treat recommendations as starting points requiring validation, not guarantees. The tool also lacks awareness of concurrent model loading—if you’re running multiple models simultaneously or have other GPU-intensive processes, the memory calculations become meaningless. It assumes exclusive access to detected hardware resources.
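Given that uncertainty, a defensive pattern is to demand explicit headroom before trusting a recommendation. A minimal sketch, where the 20% margin is an assumed safety factor rather than any documented threshold:

```javascript
// Treat a model as safe only if the estimate leaves a safety margin,
// since real usage can overshoot the estimate by several GB.
const GiB = 1024 ** 3;

const checkHeadroom = (estimatedBytes, availableBytes, margin = 0.2) => {
  const required = estimatedBytes * (1 + margin);
  return {
    fits: required <= availableBytes,
    headroomGiB: (availableBytes - estimatedBytes) / GiB,
  };
};

// A 14 GiB estimate against 16 GiB of memory nominally fits,
// but fails once the 20% safety margin is applied.
const result = checkHeadroom(14 * GiB, 16 * GiB);
console.log(result.fits, result.headroomGiB.toFixed(1));
// → false 2.0
```

The same check should also subtract memory claimed by other loaded models or GPU-intensive processes before comparing, since the tool itself assumes exclusive access.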

Verdict

Use LLM Checker if you’re onboarding developers to local LLM workflows and want to eliminate the “which model should I try first?” confusion, deploying models across heterogeneous hardware (mixed Mac/Linux/Windows environments) and need consistent selection criteria, or building CI/CD pipelines that need programmatic model selection based on runner specs. The calibration framework justifies adoption for production teams standardizing model choices across development, staging, and production environments.

Skip it if you’re already running models successfully and have empirically determined what works on your hardware, need runtime-agnostic benchmarking beyond Ollama’s ecosystem, or require deep performance profiling with latency percentiles and throughput metrics rather than binary “will it run” compatibility checks. For quick compatibility screening and intelligent recommendations, it’s the most pragmatic tool available; for comprehensive performance engineering, you’ll need additional profiling tools.
