Infervisor — rebuilding AI inference from first principles

The inference platform · pre-release

infervisor — runtime v0.x · pre-release

infervisor:~$ plowrt serve

One inference runtime: LLMs today, diffusion and voice next.

A first-principles rebuild of the inference engine in Rust. Models compile ahead of time into a fixed on-device execution plan, optimized by e-graph rewriting — every graph rewrite backed by a machine-checked Lean proof. Measured head-to-head against vLLM before we claim anything.
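To make "every graph rewrite backed by a machine-checked Lean proof" concrete, here is a minimal Lean 4 sketch of what a soundness proof for a single rewrite rule can look like. The expression type `Expr`, the evaluator `eval`, and the rule `x * 1 ⇒ x` are illustrative assumptions, not Infervisor's actual IR or rule set.

```lean
-- Hypothetical toy expression language; not Infervisor's IR.
inductive Expr where
  | const : Int → Expr
  | var   : Nat → Expr
  | add   : Expr → Expr → Expr
  | mul   : Expr → Expr → Expr

-- Reference semantics: evaluate an expression under a variable environment.
def eval (env : Nat → Int) : Expr → Int
  | .const c => c
  | .var i   => env i
  | .add a b => eval env a + eval env b
  | .mul a b => eval env a * eval env b

-- Soundness of the rewrite `x * 1 ⇒ x`: both sides evaluate to the same
-- value for every environment, so an optimizer may apply it freely.
theorem mul_one_sound (env : Nat → Int) (x : Expr) :
    eval env (.mul x (.const 1)) = eval env x := by
  simp [eval]
```

The idea is that a rewrite only enters the optimizer's rule set once a theorem of this shape has been checked, so the e-graph can explore equivalent forms without changing what the model computes.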


modalities: LLM · diffusion/voice next
targets: NVIDIA · AMD · DPU* planned
deploy: on-prem │ cloud
pre-release

* A DPU doesn't run the model — it moves data. We plan to target DPUs to offload data movement in the inference pipeline (transfer, networking, staging), not for model compute; the model runs on the NVIDIA and AMD accelerators.

WHAT IT IS

Optimized at the system level. Not just the kernel.

The inference layer the category is missing — a single engine, rethought from first principles, that replaces a stack of vendor-locked runtimes.

System-level, from first principles

Not a bag of hand-tuned kernels. We’re rebuilding the inference engine end to end — the whole model compiled ahead of time into one fixed execution plan, so the accelerators run it without checking back with the host CPU.

One runtime, every modality

LLMs, diffusion, and voice on a single dataflow engine — one stack instead of three.

Any accelerator

One model graph, one compiler. NVIDIA and AMD run the model today; DPU-offloaded data movement is on the roadmap. No vendor lock-in.

PRODUCT · PLOW

Plow: a Packet Language for On-device Workers.

The engine inside Infervisor — where a worker is a warp on NVIDIA and a wave on AMD. A model becomes a validated, target-specific execution artifact before it reaches production, so the serving path can stay small and predictable.
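As a rough illustration of the "validated artifact first, small serving path after" idea, the Rust sketch below separates an ahead-of-time `validate` step from an `execute` loop that makes no decisions of its own. Every name here (`Step`, `Plan`, `validate`, `execute`) is hypothetical, and the CPU loop only stands in for on-device execution; it is not Plow's artifact format or API.

```rust
// Hypothetical names throughout; this is not Plow's artifact format or API.

/// One pre-planned step: which op to run and which buffer slots it touches.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Step {
    /// dst[i] = a[i] + b[i]
    Add { a: usize, b: usize, dst: usize },
    /// dst[i] = src[i] * factor
    Scale { src: usize, factor: f32, dst: usize },
}

/// The prepared "artifact": a fixed step list plus the buffer count it needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Plan {
    pub steps: Vec<Step>,
    pub num_buffers: usize,
}

/// Ahead-of-time check: every slot index a step names must be in range.
pub fn validate(plan: &Plan) -> Result<(), String> {
    for (i, step) in plan.steps.iter().enumerate() {
        let slots = match *step {
            Step::Add { a, b, dst } => [a, b, dst],
            Step::Scale { src, dst, .. } => [src, src, dst],
        };
        if slots.iter().any(|&s| s >= plan.num_buffers) {
            return Err(format!("step {i} references a buffer out of range"));
        }
    }
    Ok(())
}

/// Serving path: walk the fixed plan in order. No scheduling or shape
/// decisions happen here. Assumes `validate` passed and that
/// `buffers.len() == plan.num_buffers`.
pub fn execute(plan: &Plan, buffers: &mut [Vec<f32>]) {
    for step in &plan.steps {
        match *step {
            Step::Add { a, b, dst } => {
                let out: Vec<f32> = buffers[a]
                    .iter()
                    .zip(&buffers[b])
                    .map(|(x, y)| x + y)
                    .collect();
                buffers[dst] = out;
            }
            Step::Scale { src, factor, dst } => {
                let out: Vec<f32> = buffers[src].iter().map(|x| x * factor).collect();
                buffers[dst] = out;
            }
        }
    }
}
```

In this sketch, a plan that fails `validate` never reaches `execute`, which is the property the paragraph above describes: problems surface before deployment, not on the request path.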

Plan before serving

The expensive work of understanding and optimizing a model happens before deployment, leaving fewer decisions on the request path.

Keep the runtime small

A lightweight Rust runtime loads a prepared artifact and executes it with the host CPU outside the steady-state model loop.

Measure every claim

Correctness gates come first. Latency, serving overhead, startup, and multi-model behavior are reported as separate results.
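One way to read "correctness gates come first" is that timings are only reported for outputs that already match a reference. The Rust sketch below shows that ordering under stated assumptions: the names `max_abs_diff`, `Report`, and `gated_report`, the absolute-tolerance comparison, and the shape of the per-phase results are all hypothetical, not Infervisor's evaluation harness.

```rust
// Hypothetical helper names; the tolerance rule and report shape are assumptions.

/// Largest absolute element-wise difference, or None if the lengths differ.
/// A NaN difference is propagated so it can never pass the gate.
pub fn max_abs_diff(candidate: &[f32], reference: &[f32]) -> Option<f32> {
    if candidate.len() != reference.len() {
        return None;
    }
    let worst = candidate
        .iter()
        .zip(reference)
        .map(|(c, r)| (c - r).abs())
        .fold(0.0_f32, |m, d| if d.is_nan() || d > m { d } else { m });
    Some(worst)
}

#[derive(Debug, PartialEq)]
pub enum Report {
    /// Outputs disagreed with the reference; no timings are reported.
    Rejected { max_abs_diff: Option<f32> },
    /// Outputs matched; each measured phase is kept as its own entry.
    Accepted { phases: Vec<(String, f64)> },
}

/// Apply the correctness gate first, then attach per-phase timings
/// (in milliseconds) only if the gate passes.
pub fn gated_report(
    candidate: &[f32],
    reference: &[f32],
    tol: f32,
    phases: &[(&str, f64)],
) -> Report {
    match max_abs_diff(candidate, reference) {
        Some(d) if d <= tol => Report::Accepted {
            phases: phases.iter().map(|&(name, ms)| (name.to_string(), ms)).collect(),
        },
        other => Report::Rejected { max_abs_diff: other },
    }
}
```

Keeping the phases as separate entries, instead of summing them into one number, mirrors the point above that latency, serving overhead, startup, and multi-model behavior are reported as separate results.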

Read the research note for the motivation, evidence, limits, and evaluation plan. Detailed measured results live on the blog.

COMPANY

A high-performance research company for AI inference.

We study where serving time actually goes (host orchestration, synchronization, startup, multi-model hand-offs) and build the systems that remove that overhead: a compiler that plans the model before deployment and a small runtime that executes it on NVIDIA and AMD. Results are correctness-gated, measured by workload phase, and published with their limits.

Building production AI inference? Let's talk.
