AI without a GPU
The GPU became the default hardware for AI because neural networks need it. Neural networks are not AI itself — they are one class of AI model. Change the model architecture and the hardware assumption changes with it. Logic-based AI runs on any 32-bit processor. No GPU. No NPU. No specialist silicon. Just compute.
A history, not a requirement.
GPUs became the standard hardware for AI because of the arithmetic at the core of neural networks. Understanding why this happened — and what it would take to change — requires understanding what that arithmetic actually is.
A neural network performs inference by passing input data through a sequence of layers. Each layer applies a matrix multiplication: a grid of numerical weights is multiplied against the input values, the products are summed, and each sum is passed through a non-linear function. A moderately sized network may perform billions of these multiply-and-add steps per inference. A large language model performs trillions.
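As a minimal sketch of the layer arithmetic described above, the C function below computes one fully connected layer. The function name, the row-major weight layout, and the choice of ReLU as the non-linear function are assumptions made for illustration; real networks often add a bias term and use other activations.

```c
#include <assert.h>
#include <stddef.h>

/* Non-linear function: ReLU, used here only as a common example. */
static float relu(float x)
{
    return x > 0.0f ? x : 0.0f;
}

/*
 * One fully connected layer (illustrative sketch):
 *   out[j] = relu( sum over i of weights[j][i] * in[i] )
 * weights is stored row-major: n_out rows of n_in values each.
 */
void dense_forward(const float *weights, const float *in, float *out,
                   size_t n_in, size_t n_out)
{
    assert(weights && in && out);
    for (size_t j = 0; j < n_out; ++j) {
        float acc = 0.0f;
        for (size_t i = 0; i < n_in; ++i) {
            /* one multiply-and-add step */
            acc += weights[j * n_in + i] * in[i];
        }
        out[j] = relu(acc);
    }
}
```

A single call performs n_in × n_out multiply-and-add steps, and the outer loop iterations do not depend on one another, which is the property that parallel hardware exploits.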
The multiplications within each layer are independent of one another. They do not need to happen in sequence; they can all happen simultaneously. This is what GPUs are designed for: thousands of small, simple cores executing the same operation on different data at the same moment. A GPU does not execute neural network inference faster because it has a faster clock. It executes it faster because it performs thousands of the multiply-and-add steps inside each matrix multiplication in parallel rather than one at a time.
A CPU does not work this way. A modern CPU has tens of cores, each far more powerful than an individual GPU core, but suited to sequential computation with branches and dependencies. Running neural network inference on a CPU means working through, largely in sequence, the same matrix operations that a GPU handles in parallel. For large models, the throughput difference can reach orders of magnitude.
This is the origin of the GPU requirement. It is not a physical law. It is a consequence of the mathematical structure of neural networks and the parallel execution model that structure maps onto. Change the mathematical structure and the hardware requirement changes with it.
When the assumption breaks down.
Edge and embedded inference. Anomaly detection on a motor bearing. Acoustic classification on a microcontroller. State prediction in a building management system. These are classification problems with well-defined, constrained feature spaces. They do not require the representational depth of a large neural network, and they do not need the parallelism that makes GPUs worthwhile. A correctly designed model for these tasks runs efficiently on a CPU.
Battery-operated devices. A GPU's minimum power draw typically exceeds the entire power budget of a battery-operated sensor. Cloud inference carries its own energy cost: transmitting raw data over a radio link often consumes more energy than running a small model locally. For a device on battery or harvested power, on-device inference on a low-power CPU or MCU is often the only viable architecture.
Cost-sensitive deployments. When a product ships at volume on a microcontroller costing under one pound, adding a GPU or NPU to the bill of materials is not commercially viable. The hardware economics determine the deployment architecture. AI must run on the hardware already in the product, or it does not run at all.
Privacy-sensitive applications. When data must not leave the device — audio from occupied spaces, medical signals, commercially sensitive operational data — cloud inference is excluded by design. On-device inference on a CPU or MCU is the only architecture that satisfies the requirement. No network transmission means no interception risk, no third-party data storage, and no cloud governance implications.
Connectivity-independent applications. Industrial plant floors, remote infrastructure, maritime and aviation systems, underground networks. In these environments, network connectivity cannot be assumed. Cloud inference requires connectivity. On-device GPU inference requires GPU silicon that is neither economical nor appropriate for embedded hardware. On-device CPU inference requires neither.
Not all GPU-free AI is equal.
Three distinct approaches to GPU-free AI exist. They produce meaningfully different results in performance, accuracy, hardware compatibility, and deployment complexity.
Approach 1: Optimised neural networks on CPUs
The most common approach in TinyML. Quantisation reduces weight precision from 32-bit float to 8-bit integer, cutting memory requirements and accelerating inference on hardware with INT8 support. Pruning removes weights with low magnitude. Knowledge distillation trains a smaller network to approximate a larger one. Architecture design choices — MobileNet, EfficientNet-Lite, SqueezeNet — start compact by design.
These are genuine engineering achievements. They extend the reach of neural networks into hardware that was previously impractical. The limitation is structural: a quantised neural network is still a neural network. It still requires multiply-accumulate operations — just smaller ones. On a microcontroller with no INT8 hardware accelerator and no FPU, even a heavily quantised network is slow and power-hungry. The trade-off between model size, inference speed, and accuracy is a real constraint that optimisation cannot eliminate, only negotiate.
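To make the quantisation step and the remaining multiply-accumulate work concrete, here is a minimal sketch in C. It assumes symmetric per-tensor quantisation onto the range [-127, 127], which is one common scheme among several; the function names are hypothetical and are not taken from any particular framework.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round to nearest and clamp to [-127, 127]. */
static int8_t round_to_int8(float x)
{
    float r = (x >= 0.0f) ? x + 0.5f : x - 0.5f;
    if (r > 127.0f)  r = 127.0f;
    if (r < -127.0f) r = -127.0f;
    return (int8_t)r;  /* conversion truncates toward zero */
}

/*
 * Symmetric per-tensor quantisation (illustrative sketch).
 * Returns the scale, so that real_value is approximately scale * dst[i].
 */
float quantise_int8(const float *src, int8_t *dst, size_t n)
{
    assert(src && dst);
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float a = (src[i] < 0.0f) ? -src[i] : src[i];
        if (a > max_abs) max_abs = a;
    }
    /* An all-zero tensor gets scale 1 to avoid dividing by zero. */
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (size_t i = 0; i < n; ++i) {
        dst[i] = round_to_int8(src[i] / scale);
    }
    return scale;
}

/* The arithmetic that remains after quantisation: int8 operands
 * multiplied and accumulated into a 32-bit integer. */
int32_t dot_int8(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

Under this scheme the real-valued dot product is approximated by multiplying the integer result by the two scales. The weights shrink to a quarter of their 32-bit size, but the inner loop is still one multiply and one add per weight.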
Approach 2: CPU-native inference runtimes
TensorFlow Lite Micro, ONNX Runtime for embedded targets, and similar frameworks make neural network deployment on constrained hardware substantially more accessible. They handle model serialisation, memory layout, operator selection, and hardware-specific optimisations. For teams without deep embedded ML expertise, they lower the barrier to a first deployment significantly.
The limitation is that these frameworks optimise the execution of neural networks on CPU hardware. They cannot change the fundamental arithmetic requirements of the models they execute. A runtime cannot make a neural network that requires a GPU run efficiently on a Cortex-M0 with no FPU. The hardware-software mismatch remains; the framework makes it more manageable within its practical range.
Approach 3: AI designed for CPUs and MCUs
The third approach addresses the problem at its root: use an AI model architecture that does not require floating-point matrix multiplication. Logic-Based Networks (LBNs) learn propositional logic clauses (structured rules describing patterns in training data) rather than weighted connections. Inference is the evaluation of those clauses against new inputs: bitwise AND, OR, and NOT operations on integer data.
This is not a workaround. It is a different computational paradigm with different hardware requirements. Bitwise operations run natively on every digital processor. There is no FPU requirement, no parallel execution requirement, no specialist silicon of any kind. A Cortex-M0 with 4 KB of RAM executes LBN inference in the same instruction set it uses for everything else.
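The sketch below illustrates what evaluating propositional clauses with bitwise operations can look like in C. It is not the Literal Labs implementation: the clause layout, the 32-feature limit, and the rule that the output is the OR of all clauses are assumptions chosen to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical clause layout, for illustration only: up to 32 Boolean
 * input features packed into one machine word, one bit per feature. */
typedef struct {
    uint32_t must_be_set;    /* features appearing as plain literals   */
    uint32_t must_be_clear;  /* features appearing as negated literals */
} clause_t;

/* A clause is the AND of its literals: it fires only when every
 * required bit is set and every forbidden bit is clear. */
static int clause_fires(const clause_t *c, uint32_t features)
{
    uint32_t missing   = c->must_be_set & ~features;   /* AND with NOT */
    uint32_t forbidden = c->must_be_clear & features;  /* AND */
    return (missing | forbidden) == 0;                 /* OR */
}

/* Returns 1 if any clause fires (an OR over the clauses), else 0. */
int rule_set_matches(const clause_t *clauses, size_t n, uint32_t features)
{
    assert(clauses || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (clause_fires(&clauses[i], features)) {
            return 1;
        }
    }
    return 0;
}
```

Each clause costs a handful of integer instructions and no floating-point work, and the same inputs always produce the same output.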
LBN vs neural network on the same chip.
The MLPerf Tiny anomaly detection benchmark provides a standardised comparison framework for AI inference on embedded hardware. The Literal Labs results below are from validated benchmark submissions on an Arm Cortex-M4 (STM32F4-class hardware, under $5 per unit), comparing an LBN against a neural network fully-connected autoencoder on identical hardware.
| Metric | Neural network | LBN | Improvement |
|---|---|---|---|
| Inference latency | Baseline | 54× faster | 54× |
| Energy per inference | Baseline | 52× less | 52× |
| Memory footprint | Several hundred KB | Low KB range | Orders of magnitude |
| FPU required | Yes (or significant slowdown) | No | — |
| Deterministic output | No | Yes | — |
The latency and energy figures reflect the architectural difference, not optimisation effort on either side. A neural network on a Cortex-M4 performs floating-point matrix multiplication. An LBN on the same Cortex-M4 performs bitwise operations on integers. The hardware executes the second class of operation faster and at lower energy — not because of any trick or compression, but because that is how the processor is built.
The practical consequence runs in two directions. A deployment that needs a Cortex-M4 with a hardware FPU to run a quantised neural network at acceptable speed can run the same classification task on a Cortex-M0 with an LBN, so hardware cost per node falls substantially. And a deployment that keeps the Cortex-M4 for both sensor acquisition and LBN inference has a significant portion of its compute budget left for other tasks, such as communication handling, sensor fusion, and control logic, without adding silicon.
Beyond hardware: the full cost structure.
GPU dependency in cloud AI has a direct cost structure: compute time charged per inference, API fees, and the infrastructure burden of running or subscribing to a cloud inference service. At low volumes, this is manageable. At the volumes typical of dense IoT deployments — thousands of sensors, hundreds of inferences per day each — the accumulated cost becomes a significant operating expense.
GPU-free on-device inference eliminates per-inference cost entirely. Once the model is trained and deployed, inference is free. No cloud API. No compute charge. No infrastructure fee. The inference cost model changes from ongoing operational spend to a fixed training investment.
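To put rough numbers on the per-inference cost that on-device inference removes, the sketch below works through the fleet arithmetic in C. The function names are hypothetical, and any price passed in is a placeholder to be replaced with a real quote; no actual cloud pricing is implied.

```c
#include <assert.h>
#include <stdint.h>

/* Total inferences per year across a fleet (365-day year). */
uint64_t annual_inferences(uint64_t sensors, uint64_t per_sensor_per_day)
{
    /* Guard against 64-bit overflow for very large inputs. */
    assert(per_sensor_per_day == 0 ||
           sensors <= UINT64_MAX / 365u / per_sensor_per_day);
    return sensors * per_sensor_per_day * 365u;
}

/*
 * Annual cloud spend in whole currency units, rounded down.
 * The price is given in millionths of a currency unit per inference
 * so that the calculation stays in integer arithmetic.
 */
uint64_t annual_cloud_cost(uint64_t inferences,
                           uint64_t price_micro_per_inference)
{
    assert(price_micro_per_inference == 0 ||
           inferences <= UINT64_MAX / price_micro_per_inference);
    return (inferences * price_micro_per_inference) / 1000000u;
}
```

For example, a fleet of 5,000 sensors each running 200 inferences per day produces 365 million inferences per year. At an assumed placeholder price of one ten-thousandth of a currency unit per inference, that comes to 36,500 currency units annually, a line item that does not exist when inference runs on the device.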
The hardware implications are equally direct. When the target specification for an edge AI node does not include GPU or NPU capability, the device is specified to its actual functional requirements: sensor interfaces, connectivity, control I/O, and power management. The cost of specialist AI hardware is not added to the bill of materials. For products shipping at volume — tens of thousands to millions of units — this is a material cost difference that determines commercial viability.
The full commercial picture, including qualification costs, power infrastructure implications, and fleet-level operating economics, is covered in the total cost of AI deployment analysis.
Getting started with GPU-free AI.
The workflow from labelled data to a deployed GPU-free model follows three steps. The training infrastructure is cloud-based; the inference is entirely on-device.
1. Collect and label your data
Capture sensor or input data representing the states you want the model to classify. The labelling requirement is the same as for any supervised learning problem: examples of each class the model should recognise. Data volume requirements are modest compared to large neural network training — LBNs are sample-efficient.
2. Train with ModelMill
ModelMill takes labelled data and trains a Logic-Based Network, handling configuration, hyperparameter search, and hardware-specific optimisation automatically. The output is a model validated for the target hardware constraints: memory footprint, inference latency, and energy per inference.
3. Deploy the C-code SDK
ModelMill generates a self-contained C SDK: trained LBN, inference engine, build configuration, and integration documentation. It compiles with standard embedded toolchains and integrates with bare-metal or RTOS firmware environments. Supported targets include Arm Cortex-M (M0 through M7), RISC-V, ESP32, and x86.
The result is inference running on-device, on hardware already in the product, at no marginal cost per inference. No GPU. No cloud service. No ongoing compute bill. See AI on microcontrollers for the hardware-specific deployment details.
A decision framework.
The CPU vs GPU question resolves cleanly once the application requirements are specified. The following conditions define when each architecture is appropriate.
Choose GPU when:
The task involves large neural network inference on batched inputs: language model serving, large-scale image processing, generative model inference. The deployment is in a data centre or cloud environment with access to GPU infrastructure. Power consumption is not a binding constraint. Per-inference cost at data-centre scale is acceptable. The workload is training any neural network of meaningful size.
Choose CPU (or MCU) when:
The deployment is on embedded or battery-powered hardware. The application requires single-sample inference at low latency. Power budget is constrained by battery, energy harvesting, or thermal limits. The hardware is already deployed and GPU silicon cannot be added without a hardware redesign. The application requires deterministic inference for safety or regulatory reasons. Cost per unit at volume makes GPU or NPU silicon impractical.
For edge and embedded inference — which describes the vast majority of industrial IoT, automotive embedded, infrastructure monitoring, and connected sensor applications — the CPU is not a compromise position. It is the correct architecture, provided the model running on it was designed for CPU execution rather than adapted from a GPU-native neural network. See the ModelMill platform for how LBN training produces models validated for the target CPU hardware from the outset.