Edge Inference Journey — Dianmu Zhang

01 · The Constraint

Real-time edge inference has brutal requirements. For a video-based feature the model must produce at least 30 inferences per second. For inking prediction — where ink must appear to flow from your pen tip as you write — the bar is even higher: 200+ inferences per second.

And the model can't monopolize your machine. No GPU. No cloud call. It has to run on a CPU or a specialized chip called an NPU — fast enough to be invisible, light enough to share your computer with everything else you're doing.

200+ inferences / sec (inking)

30+ frames / sec (video)

0 GPU allowed
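To make those rates concrete, it helps to turn each one into a per-inference time budget. The sketch below is only the arithmetic; the function name `latency_budget_ms` is a hypothetical helper, and it assumes inferences run one after another with no batching or overlap.

```python
def latency_budget_ms(inferences_per_second: float) -> float:
    """Return the wall-clock time available for one inference, in milliseconds."""
    if inferences_per_second <= 0:
        raise ValueError("inferences_per_second must be positive")
    return 1000.0 / inferences_per_second


# The rates quoted above are lower bounds, so these budgets are upper bounds.
for name, rate in [("video", 30), ("inking", 200)]:
    print(f"{name}: at most {latency_budget_ms(rate):.1f} ms per inference")
# prints:
# video: at most 33.3 ms per inference
# inking: at most 5.0 ms per inference
```

The budget covers everything on the inference path, not just the model's forward pass, so the model itself has less than this to work with.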

02 · The Process

It starts with gathering the right data, then training a model from scratch with an architecture designed for efficiency from the first layer. Then comes model optimization: quantization (shrinking weights from 32-bit floats to 8- or 4-bit integers) and pruning (removing weights that contribute little to accuracy).
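As a rough illustration of those two optimization steps, here is a minimal pure-Python sketch. It assumes symmetric per-tensor int8 quantization and uses weight magnitude as a stand-in for "contributes little to accuracy"; both are common simplifications, not necessarily what the pipeline described here used. The function names and the toy weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest absolute value."""
    k = int(len(weights) * fraction)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Toy weights, made up for illustration only.
weights = [0.82, -0.03, 0.40, -1.27, 0.01, 0.65]
pruned = prune_by_magnitude(weights, 0.5)   # [0.82, 0.0, 0.0, -1.27, 0.0, 0.65]
quantized, scale = quantize_int8(pruned)    # [82, 0, 0, -127, 0, 65]
restored = dequantize(quantized, scale)
print(quantized, restored)
```

Each restored weight differs from its pruned original by at most half a quantization step (`scale / 2`). Whether that error is acceptable can only be judged by re-measuring accuracy on real data.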

Before it can run on a real device, the model needs to be ported to use the NPU's instruction set — a non-trivial translation of math into specialized hardware calls. Then you test it on the real device. And often, even after all that, it's still too slow.
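The "test it on the real device" step comes down to timing many inferences and comparing the slow tail against the budget. The harness below is a sketch using only the Python standard library. `fake_inference` is a hypothetical stand-in for the ported model's entry point, and the 5 ms budget assumes the 200 inferences per second inking target.

```python
import statistics
import time


def measure_latency_ms(run_inference, warmup=10, iterations=200):
    """Time repeated calls to run_inference and return per-call latencies in ms."""
    for _ in range(warmup):          # let caches and clocks settle before timing
        run_inference()
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms, budget_ms):
    """Report median and approximate 99th-percentile latency against a budget."""
    ordered = sorted(latencies_ms)
    p99 = ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": p99,
        "within_budget": p99 <= budget_ms,
    }


def fake_inference():
    # Hypothetical placeholder workload; replace with the real model call.
    sum(i * i for i in range(1000))


report = summarize(measure_latency_ms(fake_inference), budget_ms=5.0)
print(report)
```

Checking the 99th percentile rather than the mean is a deliberate choice here: for a real-time feature, an occasional slow inference is what the user actually perceives.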

So you go back. Redesign the architecture. Run the optimization pipeline again. Test again. This loop repeats — on average more than 30 times.

03 · The Inking Model

The inking prediction model I built makes ink feel like it flows directly from your pen tip when you write or draw on a Windows tablet. The system predicts where your stroke is going before your pen physically gets there — eliminating the perceptible lag between stylus and ink.
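To show what "predicting where the stroke is going" means in the simplest possible terms, the sketch below extrapolates from the last two pen samples at constant velocity. This is a naive baseline for illustration only, not the shipped model, which is a small neural network. The function name, the sample format `(t_ms, x, y)`, and the numbers are all assumptions.

```python
def predict_next_point(points, lookahead_ms):
    """Extrapolate the pen position lookahead_ms past the last sample.

    points is a list of (t_ms, x, y) tuples ordered by time. The estimate
    assumes the pen keeps the velocity seen between the last two samples.
    """
    if len(points) < 2:
        raise ValueError("need at least two samples to estimate velocity")
    (t0, x0, y0), (t1, x1, y1) = points[-2], points[-1]
    dt = t1 - t0
    if dt <= 0:
        raise ValueError("samples must have strictly increasing timestamps")
    vx = (x1 - x0) / dt
    vy = (y1 - y0) / dt
    return (x1 + vx * lookahead_ms, y1 + vy * lookahead_ms)


# Two made-up samples 4 ms apart; predict 8 ms ahead.
samples = [(0.0, 10.0, 20.0), (4.0, 12.0, 21.0)]
print(predict_next_point(samples, 8.0))  # (16.0, 23.0)
```

A straight-line guess like this overshoots on curves and sharp turns, which is one reason a learned model is attractive for the job.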

It was the first widely available AI model running on-device for Windows, shipped across Microsoft's entire Windows ecosystem. Most users never notice it — which is exactly the point.

"Real AI features work like air. You breathe and feel good, but barely notice it."
(On what shipping AI that works actually looks like.)

04 · What's Behind Every Edge Model

Every AI model you use on your phone or laptop has a graveyard of predecessors, typically 30 or more, each with elaborate architecture decisions and painstaking optimization. The one that shipped doesn't look much like the one that started. It was forged through iteration, and through the discipline to discard what worked: treating "good enough" as a reason to start over, not to stop.

The inking model is exactly that. A small neural network, invisible in use, representing thousands of engineering hours, 30+ design cycles, and a precise understanding of what real-time actually means at the hardware level.

The Hero's Journey to Edge Device Inference