AI on microcontrollers

There are more than 30 billion microcontrollers in deployment. They monitor, control, and measure. Most of them do not think — not because they cannot compute, but because the AI algorithms available to them were designed for hardware they do not have. That constraint is being resolved. Not by making MCUs more like GPUs, but by making AI work the way MCUs actually do.

01 • The MCU as an AI platform

Thirty billion processors. Most of them idle.

The modern microcontroller is a sophisticated device. A mid-range Arm Cortex-M4 running at 168 MHz, with a hardware FPU, 192 KB of SRAM, and 1 MB of flash is a capable embedded processor. A typical lower-end Cortex-M0 part (simpler, cheaper, and present in enormous volumes in industrial sensors, smart meters, and appliance controllers) has 32 KB of flash and 4 KB of RAM, runs at 48 MHz, and draws on the order of milliwatts in active operation.

These devices already run complex firmware. They handle sensor acquisition, digital signal processing, wireless protocol stacks, real-time control loops, and power management. The claim that they cannot run AI is not quite accurate. The more precise claim is that they cannot run neural networks at practical latency without hardware they do not have. The distinction matters, because the solution is not to replace the hardware — it is to change the algorithm.

The MCU opportunity is significant and largely untapped. Predictive maintenance on an industrial pump. Anomaly detection on a sewer sensor. Fault classification in a vehicle ECU. Battery health monitoring in a handheld device. Each of these tasks could be performed locally, in real time, on hardware already deployed — if the AI model were designed for the hardware rather than against it.

02 • Why neural networks struggle on MCUs

Three hard problems.

Memory

Even an aggressively compressed, quantised neural network requires substantial flash to store weights and SRAM to run inference. An entry-level Cortex-M0 has 4 KB of RAM. A practical anomaly detection neural network requires tens to hundreds of kilobytes. The model simply does not fit. Reducing the network further to meet the constraint degrades accuracy to the point where the model is no longer useful.
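As a rough illustration of the arithmetic, the sketch below counts the flash needed by a small fully connected network with 8-bit weights and 32-bit biases. The function name and the layer widths used in the usage note are assumptions chosen for illustration, not figures from any deployment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flash needed by a fully connected network: one int8 weight per
 * connection plus one int32 bias per neuron. widths[0] is the input
 * size; each following entry is the width of the next layer. */
size_t mlp_flash_bytes(const uint16_t *widths, size_t n_widths)
{
    assert(widths != NULL && n_widths >= 2);
    size_t bytes = 0;
    for (size_t i = 0; i + 1 < n_widths; i++) {
        bytes += (size_t)widths[i] * widths[i + 1];        /* int8 weights */
        bytes += (size_t)widths[i + 1] * sizeof(int32_t);  /* int32 biases */
    }
    return bytes;
}
```

With assumed layer widths of 256, 128, 64, and 2, this returns 41,864 bytes, which is already more than 32 KB of flash before any code or activation buffers are counted.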

Floating-point operations

Neural network inference is built on floating-point matrix multiplication. Lower-end Cortex-M parts — M0, M0+, M1 — have no hardware FPU. Software-emulated floating point is slow and energy-expensive: orders of magnitude slower than native integer arithmetic on the same chip. Even on Cortex-M4 and M7 devices with hardware FPU support, the energy draw of continuous neural network inference can exceed what a battery-operated sensor can sustain.

Power consumption

Neural networks are energy-intensive by design. The compute intensity of floating-point matrix multiplication translates directly to current draw. For a sensor node on a coin cell or energy harvester, the power budget for inference may be measured in tens of microwatts. Standard neural network inference runs in the milliwatt range or above. The gap between what is available and what is required is often too large to bridge by optimisation alone.

TinyML frameworks address this partially. Quantisation reduces weight precision from 32-bit float to 8-bit integer. Pruning removes the connections that contribute least to accuracy. Compact architectures such as MobileNet and SqueezeNet start smaller. These are useful tools, but they do not solve the underlying problem: a neural network, however compressed, is still a stack of weight matrices evaluated with multiply-accumulate operations. On hardware that lacks INT8 acceleration or an FPU, even a heavily optimised network is slow, power-hungry, or both.
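To make the multiply-accumulate point concrete, here is a minimal sketch of symmetric 8-bit quantisation and the integer dot product that remains after it. The function names and the per-tensor scale are illustrative assumptions, not the API of any particular TinyML framework.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Symmetric quantisation: map a float to int8 using one scale per tensor. */
int8_t quantize_s8(float x, float scale)
{
    assert(scale > 0.0f);
    float q = x / scale;
    q += (q >= 0.0f) ? 0.5f : -0.5f;   /* round half away from zero */
    if (q > 127.0f)  return 127;       /* saturate instead of wrapping */
    if (q < -127.0f) return -127;
    return (int8_t)q;                  /* truncation completes the rounding */
}

/* Even after quantisation, every output value still costs one
 * multiply-accumulate per weight. */
int32_t dot_s8(const int8_t *w, const int8_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)w[i] * (int32_t)x[i];
    }
    return acc;
}
```

The loop in `dot_s8` is the part that compression cannot remove: a smaller network shortens it, but the work per weight stays the same.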

The correct response to a hardware-software mismatch is not always to compress the software. Sometimes it is to use a different class of software.

03 • Logic-based AI for MCUs

A different architecture.

Logic-Based Networks were built for constrained hardware, not adapted to it. Rather than compressing a GPU-native architecture into an MCU-hostile footprint, LBNs operate in propositional logic: AND, OR, and NOT applied to binarised inputs. These are operations that every digital processor executes natively in a single instruction, without floating-point hardware, without a co-processor, without any specialist silicon.
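One common way to produce binarised inputs is to compare a reading against a set of thresholds and set one bit per threshold passed. The sketch below assumes that scheme; the source does not say how ModelMill binarises data, and the function name and threshold values are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Thermometer-encode one reading: bit i is set when value >= thresholds[i].
 * Supports up to 32 thresholds so the result fits one machine word. */
uint32_t binarise(int16_t value, const int16_t *thresholds, size_t n)
{
    assert(thresholds != NULL && n <= 32u);
    uint32_t bits = 0;
    for (size_t i = 0; i < n; i++) {
        if (value >= thresholds[i]) {
            bits |= (uint32_t)1 << i;
        }
    }
    return bits;
}
```

With thresholds of 100, 200, 300, and 400, a reading of 250 sets the two lowest bits, so later logic works on a bit pattern rather than on the raw number.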

The model is not a matrix of weights. It is a set of logical clauses describing patterns in the training data. Inference evaluates those clauses against the current input: a sequence of bitwise comparisons that completes in nanoseconds to low microseconds on constrained hardware. Memory requirements shrink accordingly: production LBN models for common MCU inference tasks fit in a few kilobytes of flash.
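A minimal sketch of what clause evaluation can look like in C, assuming each clause is stored as two 32-bit masks over a bit-packed input word. The `clause_t` layout and the function names are hypothetical illustrations; the actual format generated by the SDK is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One clause is an AND of literals over the input bits. */
typedef struct {
    uint32_t require_set;    /* bits that must be 1 (literals x_i)     */
    uint32_t require_clear;  /* bits that must be 0 (literals NOT x_i) */
} clause_t;

/* True when every literal in the clause is satisfied by input x. */
int clause_fires(const clause_t *c, uint32_t x)
{
    assert(c != NULL);
    return ((x & c->require_set) == c->require_set) &&
           ((x & c->require_clear) == 0u);
}

/* OR across clauses: returns 1 as soon as any clause matches. */
int any_clause_fires(const clause_t *clauses, size_t n, uint32_t x)
{
    for (size_t i = 0; i < n; i++) {
        if (clause_fires(&clauses[i], x)) {
            return 1;
        }
    }
    return 0;
}
```

Each clause costs two AND operations and two comparisons, with no multiplication and no floating point. A table of such clauses can be declared `const` so that it lives in flash rather than RAM.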

Integer-only arithmetic

No floating-point operations. No FPU requirement. LBNs run natively on Cortex-M0 parts with no hardware acceleration, and on every processor above them. The same model code runs on an entry-level sensor MCU and a high-performance embedded processor without modification.

Compact model footprint

Logical rule structures are substantially smaller than weight matrices. A model for a production anomaly detection task typically occupies a few kilobytes of flash. This fits within the constraints of low-end MCUs that cannot accommodate even the smallest neural network variant.

Deterministic inference

Every inference follows a fixed, traceable logical path. The same input always produces the same output — no probabilistic variation, no stochastic sampling, no hallucination. For safety-critical or regulated applications, this is often a certification requirement, not merely a nice property.

Because bitwise logical operations are among the cheapest computations in digital electronics, the energy cost of LBN inference is in a different class from neural network inference. At a sensor node level, this is the difference between a product that requires battery replacement every few months and one that operates for years unattended. The battery-powered AI section covers the field-measured energy figures in detail. For the broader architectural question of why GPUs became standard and when that assumption breaks down, see AI without a GPU.

04 • Supported MCU families

Standard C deployment.

ModelMill generates deployable C-code SDKs for standard 32-bit processor architectures. The inference engine is portable C with no external dependencies, no dynamic memory allocation, and no runtime requirements. If the target runs a C compiler, it runs Logic-Based Networks.

Arm Cortex-M

The dominant MCU architecture across industrial, automotive, consumer, and IoT applications. LBNs are validated across the Cortex-M range: M0, M0+, M3, M4, and M7. Because there is no FPU requirement, the most constrained Cortex-M0 parts are viable deployment targets alongside high-performance M7 devices. A single SDK builds for every Cortex-M variant with the target's own toolchain, and the LBN itself needs no changes between them.

ESP32

Espressif's dual-core Xtensa processor with integrated Wi-Fi and Bluetooth. The standard platform for connected IoT products in consumer and light industrial applications. LBN deployment on ESP32 is straightforward: the SDK integrates into standard ESP-IDF projects, adding on-device intelligence to existing sensor designs without hardware changes. Wireless connectivity becomes an uplink for classified events rather than a raw data pipe.

RISC-V

The open instruction set gaining traction in industrial embedded applications and cost-sensitive designs. LBNs deploy on RISC-V through the standard C-code SDK, without architecture-specific modification. As RISC-V silicon matures and reaches higher production volumes, the absence of any ISA-specific dependency in the inference engine makes forward compatibility straightforward.

05 • Firmware integration

The embedded engineer's view.

The deployment question for embedded engineers is not whether an LBN can run on their hardware: it can, on anything with a 32-bit processor and a C compiler. The question is how it integrates with existing firmware. The integration is designed to be unobtrusive.

The C SDK contains four components: the trained LBN model as a set of compiled logical clauses, the inference engine as a portable C library, a build configuration matching the target hardware, and integration documentation with example code. There are no runtime dependencies. No dynamic memory allocation. No background threads or interrupt handlers that the firmware team does not own.

Integration is a function call within the sensor sampling loop already present in the firmware. The function takes preprocessed sensor data, runs inference, and returns a classification result. On a Cortex-M4 at 168 MHz, this takes microseconds. The loop continues. The firmware carries the same structure it had before the model was added.
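As a sketch of that integration pattern, assume a hypothetical entry point named `lbn_classify`; the real function name and signature come from the SDK documentation. The added code then amounts to one call inside the existing loop body. The stub bodies below are placeholders so the example compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL stand-in for the SDK inference call. The placeholder rule
 * flags an anomaly when feature bits 0 and 1 are both set. */
int lbn_classify(uint32_t feature_bits)
{
    return (feature_bits & 0x3u) == 0x3u;
}

/* Firmware-owned preprocessing (placeholder): one bit per sample that
 * exceeds a fixed threshold. */
uint32_t preprocess(const int16_t *window, size_t n, int16_t threshold)
{
    assert(window != NULL && n <= 32u);
    uint32_t bits = 0;
    for (size_t i = 0; i < n; i++) {
        if (window[i] > threshold) {
            bits |= (uint32_t)1 << i;
        }
    }
    return bits;
}

/* One pass of the existing sampling loop with the single added call. */
int sampling_loop_step(const int16_t *window, size_t n)
{
    uint32_t features = preprocess(window, n, 500);
    int label = lbn_classify(features);  /* the only new line in the loop */
    return label;                        /* firmware acts on the label as before */
}
```

Nothing here allocates memory or starts a thread: the call is synchronous and the loop structure is unchanged.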

The preprocessing step — converting raw sensor readings into the feature format the model expects — is documented in the SDK and performed by the firmware, not the inference engine. For vibration analysis this typically means an FFT producing a frequency spectrum. For time-series sensor data it may mean windowing and normalisation. The preprocessing logic lives in the firmware engineer's code; the inference logic lives in the SDK. The boundary between them is explicit.
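For the windowing and normalisation case, a firmware-side step might look like the sketch below, which rescales a window of raw samples to the range 0 to 255 using integer arithmetic only. The function name and the 8-bit output format are assumptions; the feature format a given model expects is defined in its SDK documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Min-max normalise a window of raw samples to 0..255, integer-only.
 * A flat window (all samples equal) maps to all zeros. */
void normalise_window(const int16_t *raw, uint8_t *out, size_t n)
{
    assert(raw != NULL && out != NULL);
    if (n == 0) {
        return;
    }
    int16_t lo = raw[0];
    int16_t hi = raw[0];
    for (size_t i = 1; i < n; i++) {
        if (raw[i] < lo) lo = raw[i];
        if (raw[i] > hi) hi = raw[i];
    }
    int32_t span = (int32_t)hi - (int32_t)lo;  /* at most 65535 */
    for (size_t i = 0; i < n; i++) {
        int32_t offset = (int32_t)raw[i] - (int32_t)lo;
        out[i] = (span == 0) ? 0u : (uint8_t)((offset * 255) / span);
    }
}
```

Keeping this step in integer arithmetic means the preprocessing does not reintroduce the floating-point cost that the inference engine avoids.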

Compatibility with bare-metal environments is complete: no RTOS is required. FreeRTOS environments are supported equally. The SDK does not impose any threading model on the firmware around it.

06 • Real-world performance

Validated in production.

The figures below come from production deployments and validated benchmark submissions, measured on real hardware in real operating conditions.

Wastewater network monitoring

Anomaly detection on distributed sewer infrastructure across England and Wales. The LBN model runs on an existing low-cost sensor with a lithium battery, consuming 455 µJ per inference at a five-second prediction interval. Projected operational life before battery replacement: ten years. The alternative approach, a neural network requiring mains-power infrastructure, would have cost approximately £15,000 per installation site. At 100,000 sites, that is roughly £1.5 billion in mains-power infrastructure alone, which made the neural network approach commercially unviable before any software cost was considered.
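The energy figures above can be checked with a few lines of integer arithmetic. The sketch below uses only the quoted numbers (455 µJ per inference, one inference every five seconds, ten years) and assumes 365-day years; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Number of inferences performed over a deployment lifetime. */
uint64_t inferences_over(uint32_t years, uint32_t interval_s)
{
    assert(interval_s > 0u);
    uint64_t seconds = (uint64_t)years * 365u * 24u * 3600u;
    return seconds / interval_s;
}

/* Total inference energy in microjoules. */
uint64_t total_energy_uj(uint64_t n_inferences, uint32_t uj_per_inference)
{
    return n_inferences * uj_per_inference;
}

/* Average inference power: microjoules per second equals microwatts. */
uint32_t average_power_uw(uint32_t uj_per_inference, uint32_t interval_s)
{
    assert(interval_s > 0u);
    return uj_per_inference / interval_s;
}
```

Under these figures the inference workload averages 91 µW and uses about 28.7 kJ, roughly 8 Wh, over ten years. This covers inference energy only; sleep current, radio traffic, and battery self-discharge are outside the sketch.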

Automotive ADAS

Vehicle dynamics classification on an NXP PowerPC e200, a common automotive embedded processor from approximately 2006. LBN inference completes in four microseconds. The existing SoC handles the workload without modification, without a new qualification exercise, and without the 200% SoC cost increase the neural network alternative required. The customer's hardware programme timeline was unaffected because no new silicon was introduced.

Industrial predictive maintenance

Multimodal sensor fusion for predictive maintenance on an MCU hardware platform in the semiconductor industry. LBN inference: 0.7 ms latency, 2.6 µJ per prediction. The workload that previously required a dedicated deep-learning hardware accelerator was moved to the on-chip CPU, freeing the accelerator for other use. The inference power draw fell by a factor of 40.

The pattern across deployments is consistent: existing hardware, new capability, no additional silicon. The total cost of AI deployment section covers the commercial implications of this pattern in detail.

07 • How to deploy AI on a microcontroller

Four steps from data to firmware.

1. Collect and label sensor data

Capture readings representing the conditions the model needs to classify: healthy operation versus anomaly, state A versus state B, fault condition versus normal. Data volume requirements are modest compared to large neural network training sets — LBNs are sample-efficient by design.

2. Train in ModelMill

Upload labelled data to ModelMill and define the target deployment hardware: memory constraints, target architecture, acceptable inference latency, and optimisation priority. ModelMill auto-configures, trains, and benchmarks hundreds of LBN candidates, selecting models that meet the target constraints and accuracy requirements.

3. Export the C-code SDK

The SDK is a self-contained archive: inference engine, trained model, build configuration for the target architecture, example integration code, and documentation. No external libraries. No cloud connection required at inference time.

4. Integrate and flash

The SDK integrates into standard embedded C projects. One function call in the sensor sampling loop. Compatible with bare-metal and FreeRTOS environments. Build with the existing toolchain. Flash to the target device. Inference runs locally, in real time, deterministically.

Get started

ModelMill takes labelled sensor data and delivers a deployable edge AI model — no GPU, no NPU, no new hardware. The SDK integrates with your existing embedded toolchain.
