European AI Research Lab · Spain

Build AI
that stays
on your machine.

Ferrumox builds high-performance AI infrastructure for local and private deployment. No cloud dependency. No data leaving your hardware. Built in Spain, for Europe.

312 GitHub stars · Open source · MIT + Apache 2.0

Fox Inference Engine

A high-performance LLM runtime built from the ground up in Rust. Drop-in replacement for Ollama — with vLLM-level architecture.

🦊 fox v1.0-beta

The inference engine Europe was missing.

Fox implements PagedAttention and continuous batching — the same techniques behind vLLM — entirely in Rust. The result: dramatically lower first-token latency and 2× the throughput of Ollama on identical hardware. No cloud. No telemetry. Full control.

  • PagedAttention with ref-counted CoW KV block management
  • Block-level prefix caching — shared prefixes served from cache
  • Continuous batching with LIFO preemption for sustained throughput
  • Real stochastic sampling: temperature, top_p, top_k, seed
  • Prometheus metrics — TTFT, throughput, KV usage, prefix hit ratio
  • OpenAI-compatible REST API — drop-in, no client changes
  • Dual-licensed MIT + Apache 2.0
Performance vs Ollama (RTX 4060 · 4 clients · 50 requests)

  Metric          Fox        Ollama     Change
  TTFT P50        87 ms      310 ms     −72%
  TTFT P95        134 ms     480 ms     −72%
  Response P50    412 ms     890 ms     −54%
  Response P95    823 ms     1740 ms    −53%
  Throughput      312 t/s    148 t/s    +111%
  VRAM usage      4.1 GB     6.8 GB     −40%

Research-grade internals.

Every component built for correctness first, then performance. No shortcuts in the critical path.

PagedAttention

Logical→physical KV block mapping with ref-counted copy-on-write infrastructure. Eliminates KV cache fragmentation and enables memory-efficient multi-tenant serving.
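The core idea can be sketched in a few lines of Rust. This is an illustrative toy, not Fox's actual internals: `BlockAllocator`, `alloc`, `fork`, and `write` are hypothetical names, and real code would also copy the KV tensor data on a CoW split.

```rust
use std::collections::HashMap;

/// Toy ref-counted KV block allocator with copy-on-write semantics
/// (hypothetical sketch, not Fox's API).
struct BlockAllocator {
    ref_counts: HashMap<usize, usize>, // physical block id -> refcount
    next_id: usize,
}

impl BlockAllocator {
    fn new() -> Self {
        Self { ref_counts: HashMap::new(), next_id: 0 }
    }

    /// Allocate a fresh physical block with refcount 1.
    fn alloc(&mut self) -> usize {
        let id = self.next_id;
        self.next_id += 1;
        self.ref_counts.insert(id, 1);
        id
    }

    /// Share an existing block (e.g. when a sequence forks): bump its refcount.
    fn fork(&mut self, id: usize) {
        *self.ref_counts.get_mut(&id).expect("unknown block") += 1;
    }

    /// Before writing into a block: if it is shared, split off a private
    /// copy first (copy-on-write). Returns the block id safe to write to.
    fn write(&mut self, id: usize) -> usize {
        if self.ref_counts[&id] > 1 {
            *self.ref_counts.get_mut(&id).unwrap() -= 1;
            self.alloc() // real code would also memcpy the KV data here
        } else {
            id
        }
    }
}
```

Because blocks are only copied when a shared block is actually written, two sequences with a common prefix pay for one physical copy of that prefix, not two.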

Prefix Caching

Block-level chain-hash prefix sharing. Repeated system prompts and multi-turn conversation history served directly from cache — zero recompute cost.
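Chain hashing is what makes block-level lookup sound: each block's cache key hashes its own tokens together with the previous block's key, so a hit on block *k* implies the entire prefix through block *k* matches. A minimal sketch, with a hypothetical block size and cache shape (not Fox's actual data structures):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const BLOCK: usize = 4; // tokens per KV block (illustrative)

/// Chain-hash keys for each full block of a token sequence.
/// Key k depends on key k-1, so equal keys imply equal prefixes.
fn chain_hashes(tokens: &[u32]) -> Vec<u64> {
    let mut keys = Vec::new();
    let mut prev: u64 = 0;
    for block in tokens.chunks_exact(BLOCK) {
        let mut h = DefaultHasher::new();
        prev.hash(&mut h);
        block.hash(&mut h);
        prev = h.finish();
        keys.push(prev);
    }
    keys
}

/// How many leading blocks of `tokens` can be served from cache
/// (cache maps chain-hash key -> physical KV block id).
fn cached_prefix_blocks(cache: &HashMap<u64, usize>, tokens: &[u32]) -> usize {
    chain_hashes(tokens)
        .iter()
        .take_while(|k| cache.contains_key(k))
        .count()
}
```

A request that repeats a cached system prompt matches on its leading blocks and skips prefill for them; only the divergent tail is computed.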

Continuous Batching

LIFO preemption scheduler saturates GPU compute across concurrent requests. Requests join and leave the batch dynamically — no idle cycles waiting for stragglers.
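The LIFO policy is simple to state: when a running request needs another KV block and none are free, evict the most recently admitted request back to the wait queue (it has the least work invested). A toy sketch under assumed names (`Scheduler`, `submit`, `grow` are illustrative, not Fox's API), with capacity counted in KV blocks:

```rust
/// Toy LIFO-preemption batching scheduler (illustrative, not Fox's).
struct Scheduler {
    capacity: usize,            // total KV blocks available
    running: Vec<(u64, usize)>, // (request id, blocks held), in admission order
    waiting: Vec<u64>,          // queued / preempted request ids
}

impl Scheduler {
    fn new(capacity: usize) -> Self {
        Self { capacity, running: Vec::new(), waiting: Vec::new() }
    }

    fn used(&self) -> usize {
        self.running.iter().map(|&(_, b)| b).sum()
    }

    /// Admit a request needing `blocks` KV blocks, or queue it.
    fn submit(&mut self, id: u64, blocks: usize) {
        if self.used() + blocks <= self.capacity {
            self.running.push((id, blocks));
        } else {
            self.waiting.push(id);
        }
    }

    /// Request `id` generated a token and needs one more block;
    /// preempt most-recently-admitted requests (LIFO) until it fits.
    fn grow(&mut self, id: u64) {
        while self.used() + 1 > self.capacity {
            let victim_idx = self
                .running
                .iter()
                .rposition(|&(rid, _)| rid != id)
                .expect("cannot preempt: only the growing request remains");
            let (vid, _) = self.running.remove(victim_idx);
            self.waiting.push(vid); // its KV blocks are freed for reuse
        }
        if let Some(entry) = self.running.iter_mut().find(|(rid, _)| *rid == id) {
            entry.1 += 1;
        }
    }
}
```

Preempting the newest request minimizes wasted work: it holds the fewest KV blocks and has produced the least output, so re-running it later is the cheapest recovery.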

Rust Runtime

No garbage collector. Predictable latency under load, with no GC pauses in the inference hot path. Memory safety without runtime overhead.

GGUF via llama.cpp FFI

Full quantization support through a thin, safe FFI layer over llama.cpp. Run Q4, Q5, Q8 models on consumer hardware with hardware-native acceleration.

Observability

Prometheus metrics at /metrics — request rates, TTFT histogram, KV cache utilization, prefix hit ratio. Production-ready monitoring from day one.

What we're building.

Now · 2026

Fox — Inference Engine

High-performance local LLM runtime in Rust. Open-source, privacy-first, production-ready.

Next · 2026–2027

Ferrumox Models

First proprietary models trained from scratch. Small, efficient, Spanish-first, optimized for local deployment on consumer hardware.

2027–2028

Research Lab

Published research on efficient inference, privacy-preserving architectures, and on-device AI. Europe's answer to the foundation model labs.

2028+

European AI Sovereignty

Full stack: runtime, models, training infrastructure — entirely European, entirely open, entirely yours.

AI that respects your data shouldn't require a compromise on performance.

  • Your data never leaves your infrastructure. Not as a feature — as an architectural guarantee.
  • Open weights, open source, reproducible training. No black boxes.
  • Built for European regulatory reality — AI Act compliance by design, not by patch.
  • Consumer hardware is enough. A 4060 should run a capable local model. We prove it.
  • Spanish and Iberian languages are first-class citizens, not an afterthought.
Open Source · MIT + Apache 2.0

Try Fox today.

Drop-in replacement for Ollama. Install in one command, benchmark against your current setup.