Build AI
that stays
on your machine.
Ferrumox builds high-performance AI infrastructure for local and private deployment. No cloud dependency. No data leaving your hardware. Built in Spain, for Europe.
Fox Inference Engine
A high-performance LLM runtime built from the ground up in Rust. Drop-in replacement for Ollama — with vLLM-level architecture.
The inference engine Europe was missing.
Fox implements PagedAttention and continuous batching — the same techniques behind vLLM — entirely in Rust. The result: dramatically lower first-token latency and 2× the throughput of Ollama on identical hardware. No cloud. No telemetry. Full control.
- PagedAttention with ref-counted CoW KV block management
- Block-level prefix caching — shared prefixes served from cache
- Continuous batching with LIFO preemption for sustained throughput
- Real stochastic sampling: temperature, top_p, top_k, seed
- Prometheus metrics — TTFT, throughput, KV usage, prefix hit ratio
- OpenAI-compatible REST API — drop-in, no client changes
- Dual-licensed MIT + Apache 2.0
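To make the sampling bullet concrete: a minimal sketch of how temperature, top-k, and top-p (nucleus) filtering compose before a token is drawn. This is an illustration, not Fox's actual sampler; the `filter_logits` function and its signature are hypothetical, and it assumes `temperature > 0`.

```rust
/// Scale logits by temperature, keep the top-k candidates, then truncate
/// to the smallest nucleus whose cumulative probability reaches top_p.
/// Returns (token_id, probability) pairs renormalized over the survivors;
/// a sampler would then draw from this reduced distribution with its seed.
fn filter_logits(logits: &[f64], temperature: f64, top_k: usize, top_p: f64) -> Vec<(usize, f64)> {
    // Softmax over temperature-scaled logits (max-subtracted for stability).
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&l| ((l - max) / temperature).exp()).collect();
    let sum: f64 = exps.iter().sum();
    let mut probs: Vec<(usize, f64)> = exps.iter().enumerate().map(|(i, &e)| (i, e / sum)).collect();

    // Sort descending by probability and keep the top-k candidates.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(top_k);

    // Nucleus cut: smallest prefix whose cumulative mass reaches top_p.
    let mut cum = 0.0;
    let mut kept = Vec::new();
    for (id, p) in probs {
        kept.push((id, p));
        cum += p;
        if cum >= top_p {
            break;
        }
    }

    // Renormalize so the surviving candidates sum to 1.
    let mass: f64 = kept.iter().map(|&(_, p)| p).sum();
    kept.into_iter().map(|(id, p)| (id, p / mass)).collect()
}
```

A fixed seed then makes the draw from the surviving candidates reproducible, which is what makes stochastic sampling debuggable in practice.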
Research-grade internals.
Every component built for correctness first, then performance. No shortcuts in the critical path.
PagedAttention
Logical→physical KV block mapping with ref-counted copy-on-write infrastructure. Eliminates KV cache fragmentation and enables memory-efficient multi-tenant serving.
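The core idea can be sketched in a few dozen lines of Rust. This is a toy illustration of ref-counted copy-on-write block mapping, not Fox's actual internals; the `BlockManager` type and its methods are hypothetical names:

```rust
use std::collections::HashMap;

/// Toy block manager: maps each sequence's logical KV blocks to physical
/// block ids, shares physical blocks between forked sequences via
/// reference counts, and copies a block only when a shared one is written.
struct BlockManager {
    ref_counts: HashMap<usize, usize>, // physical block id -> ref count
    tables: HashMap<u64, Vec<usize>>,  // sequence id -> logical-to-physical map
    next_block: usize,
}

impl BlockManager {
    fn new() -> Self {
        Self { ref_counts: HashMap::new(), tables: HashMap::new(), next_block: 0 }
    }

    /// Allocate a fresh physical block for `seq`'s next logical slot.
    fn append_block(&mut self, seq: u64) -> usize {
        let phys = self.next_block;
        self.next_block += 1;
        self.ref_counts.insert(phys, 1);
        self.tables.entry(seq).or_default().push(phys);
        phys
    }

    /// Fork `parent` into `child`: copy the block table, bump ref counts.
    /// No KV data moves; both sequences read the same physical blocks.
    fn fork(&mut self, parent: u64, child: u64) {
        let table = self.tables[&parent].clone();
        for &phys in &table {
            *self.ref_counts.get_mut(&phys).unwrap() += 1;
        }
        self.tables.insert(child, table);
    }

    /// Before writing into `seq`'s logical block `idx`: if the physical
    /// block is shared, copy it and point only `seq` at the copy (CoW).
    fn write(&mut self, seq: u64, idx: usize) -> usize {
        let phys = self.tables[&seq][idx];
        if self.ref_counts[&phys] > 1 {
            *self.ref_counts.get_mut(&phys).unwrap() -= 1;
            let copy = self.next_block;
            self.next_block += 1;
            self.ref_counts.insert(copy, 1);
            self.tables.get_mut(&seq).unwrap()[idx] = copy;
            copy
        } else {
            phys
        }
    }
}
```

Because blocks are fixed-size and indirected through the table, the KV cache never fragments the way contiguous per-request allocations do.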
Prefix Caching
Block-level chain-hash prefix sharing. Repeated system prompts and multi-turn conversation history served directly from cache — zero recompute cost.
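How chain-hashing makes a block hit imply a full prefix match can be sketched as follows. A toy illustration under stated assumptions, not Fox's actual cache: the `PrefixCache` type is hypothetical, and `DefaultHasher` stands in for whatever hash the real engine uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy chain-hash prefix cache: each fixed-size token block is keyed by
/// the hash of (previous block's chain hash, this block's tokens), so a
/// cache hit on a block guarantees the entire prefix up to it matches.
struct PrefixCache {
    block_size: usize,
    cached: HashMap<u64, usize>, // chain hash -> cached block id (toy)
    next_block: usize,
}

impl PrefixCache {
    fn new(block_size: usize) -> Self {
        Self { block_size, cached: HashMap::new(), next_block: 0 }
    }

    /// Walk the prompt block by block; return how many leading blocks were
    /// served from cache. Misses are "computed" and inserted for next time.
    fn lookup_or_insert(&mut self, tokens: &[u32]) -> usize {
        let mut chain: u64 = 0;
        let mut hits = 0;
        let mut still_hitting = true;
        // Partial trailing blocks are not cacheable at block granularity.
        for block in tokens.chunks_exact(self.block_size) {
            let mut h = DefaultHasher::new();
            chain.hash(&mut h);
            block.hash(&mut h);
            chain = h.finish();
            if still_hitting && self.cached.contains_key(&chain) {
                hits += 1;
            } else {
                still_hitting = false;
                if !self.cached.contains_key(&chain) {
                    self.cached.insert(chain, self.next_block);
                    self.next_block += 1;
                }
            }
        }
        hits
    }
}
```

Two prompts that share a system prompt share the same leading chain hashes, so those blocks never get recomputed; the chains diverge at the first differing block.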
Continuous Batching
LIFO preemption scheduler saturates GPU compute across concurrent requests. Requests join and leave the batch dynamically — no idle cycles waiting for stragglers.
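The admission-and-preemption loop can be sketched like this. A simplified illustration, not Fox's scheduler: the `Scheduler` type, its fixed per-request block cost, and its method names are all hypothetical.

```rust
/// Toy continuous-batching scheduler with LIFO preemption: requests are
/// admitted while free KV blocks last; when a running request needs to
/// grow and no blocks are free, the most recently admitted request is
/// preempted back to the wait queue so older requests can finish.
struct Scheduler {
    free_blocks: usize,
    waiting: Vec<u64>,          // request ids waiting for admission
    running: Vec<(u64, usize)>, // (request id, blocks held), admission order
}

impl Scheduler {
    fn new(free_blocks: usize) -> Self {
        Self { free_blocks, waiting: Vec::new(), running: Vec::new() }
    }

    fn submit(&mut self, id: u64) {
        self.waiting.push(id);
    }

    /// Admit waiting requests while blocks remain (each needs `need` blocks).
    fn admit(&mut self, need: usize) {
        while !self.waiting.is_empty() && self.free_blocks >= need {
            let id = self.waiting.remove(0);
            self.free_blocks -= need;
            self.running.push((id, need));
        }
    }

    /// A running request needs one more KV block; if none are free, preempt
    /// the newest other running request (LIFO) to release its blocks.
    fn grow(&mut self, id: u64) -> bool {
        while self.free_blocks == 0 {
            // Pick the most recently admitted victim that isn't `id`.
            match self.running.iter().rposition(|&(rid, _)| rid != id) {
                Some(pos) => {
                    let (vid, held) = self.running.remove(pos);
                    self.free_blocks += held;
                    self.waiting.insert(0, vid); // retried first later
                }
                None => return false, // nothing left to preempt
            }
        }
        self.free_blocks -= 1;
        if let Some(entry) = self.running.iter_mut().find(|e| e.0 == id) {
            entry.1 += 1;
        }
        true
    }
}
```

LIFO victim selection is the key design choice: the oldest requests, which have the most computed KV state at stake, are the last to be preempted, so work already done is rarely thrown away.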
Rust Runtime
No garbage collector. Predictable latency under load, with no GC pauses in the inference hot path. Memory safety without runtime overhead.
GGUF via llama.cpp FFI
Full quantization support through a thin, safe FFI layer over llama.cpp. Run Q4, Q5, Q8 models on consumer hardware with hardware-native acceleration.
Observability
Prometheus metrics at /metrics — request rates, TTFT histogram, KV cache utilization, prefix hit ratio. Production-ready monitoring from day one.
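As an illustration of what a TTFT histogram looks like on the wire, here is a toy histogram rendered in the Prometheus text exposition format (`_bucket` with `le` labels, `_sum`, `_count`). The metric name and bucket bounds are assumptions for the example, not Fox's actual metric schema.

```rust
/// Toy histogram rendered in the text format Prometheus scrapes.
/// Buckets are cumulative: each `le` bound counts all observations <= it.
struct Histogram {
    buckets: Vec<f64>, // upper bounds, in seconds
    counts: Vec<u64>,  // cumulative count per bucket
    sum: f64,
    total: u64,
}

impl Histogram {
    fn new(buckets: Vec<f64>) -> Self {
        let n = buckets.len();
        Self { buckets, counts: vec![0; n], sum: 0.0, total: 0 }
    }

    fn observe(&mut self, v: f64) {
        for (i, &le) in self.buckets.iter().enumerate() {
            if v <= le {
                self.counts[i] += 1;
            }
        }
        self.sum += v;
        self.total += 1;
    }

    /// Render in the Prometheus text exposition format.
    fn render(&self, name: &str) -> String {
        let mut out = format!("# TYPE {name} histogram\n");
        for (i, &le) in self.buckets.iter().enumerate() {
            out += &format!("{name}_bucket{{le=\"{le}\"}} {}\n", self.counts[i]);
        }
        out += &format!("{name}_bucket{{le=\"+Inf\"}} {}\n", self.total);
        out += &format!("{name}_sum {}\n", self.sum);
        out += &format!("{name}_count {}\n", self.total);
        out
    }
}
```

A scrape of `/metrics` returns exactly this kind of plain text, which Prometheus ingests without any agent or SDK on the client side.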
What we're building.
Fox — Inference Engine
High-performance local LLM runtime in Rust. Open-source, privacy-first, production-ready.
Ferrumox Models
First proprietary models trained from scratch. Small, efficient, Spanish-first, optimized for local deployment on consumer hardware.
Research Lab
Published research on efficient inference, privacy-preserving architectures, and on-device AI. Europe's answer to the foundation model labs.
European AI Sovereignty
Full stack: runtime, models, training infrastructure — entirely European, entirely open, entirely yours.
AI that respects your data shouldn't require a compromise on performance.
- Your data never leaves your infrastructure. Not as a feature — as an architectural guarantee.
- Open weights, open source, reproducible training. No black boxes.
- Built for European regulatory reality — AI Act compliance by design, not by patch.
- Consumer hardware is enough. An RTX 4060 should run a capable local model. We prove it.
- Spanish and Iberian languages are first-class citizens, not an afterthought.
Try Fox today.
Drop-in replacement for Ollama. Install in one command, benchmark against your current setup.