European AI Research Lab · Spain

Build AI
that stays
on your machine.

Ferrumox builds high-performance AI infrastructure for local and private deployment. No cloud dependency. No data leaving your hardware. Built in Spain, for Europe.

312 GitHub stars · Open source · MIT + Apache 2.0

Fox Inference Engine

A high-performance LLM runtime built from the ground up in Rust. Drop-in replacement for Ollama — with vLLM-level architecture.

🦊 fox v1.0-beta

The inference engine Europe was missing.

Fox implements PagedAttention and continuous batching — the same techniques behind vLLM — entirely in Rust. The result: dramatically lower first-token latency and 2× the throughput of Ollama on identical hardware. No cloud. No telemetry. Full control.

  • PagedAttention with ref-counted CoW KV block management
  • Block-level prefix caching — shared prefixes served from cache
  • Continuous batching with LIFO preemption for sustained throughput
  • Real stochastic sampling: temperature, top_p, top_k, seed
  • Prometheus metrics — TTFT, throughput, KV usage, prefix hit ratio
  • OpenAI-compatible REST API — drop-in, no client changes
  • Dual-licensed MIT + Apache 2.0
Performance vs Ollama (RTX 4060 · 4 clients · 50 requests)

  Metric          Fox        Ollama     Change
  TTFT P50        87 ms      310 ms     −72%
  TTFT P95        134 ms     480 ms     −72%
  Response P50    412 ms     890 ms     −54%
  Response P95    823 ms     1740 ms    −53%
  Throughput      312 t/s    148 t/s    +111%
  VRAM usage      4.1 GB     6.8 GB     −40%

Research-grade internals.

Every component built for correctness first, then performance. No shortcuts in the critical path.

PagedAttention

Logical→physical KV block mapping with ref-counted copy-on-write infrastructure. Eliminates KV cache fragmentation and enables memory-efficient multi-tenant serving.
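The core idea can be sketched in a few lines of Rust. This is an illustrative toy, not Fox's actual internals: `BlockAllocator`, `alloc`, `fork`, and `write` are hypothetical names, and real code would also copy the KV tensor data on a CoW split.

```rust
use std::collections::HashMap;

/// Toy ref-counted KV block allocator with copy-on-write semantics
/// (hypothetical sketch, not Fox's API).
struct BlockAllocator {
    ref_counts: HashMap<usize, usize>, // physical block id -> refcount
    next_id: usize,
}

impl BlockAllocator {
    fn new() -> Self {
        Self { ref_counts: HashMap::new(), next_id: 0 }
    }

    /// Allocate a fresh physical block with refcount 1.
    fn alloc(&mut self) -> usize {
        let id = self.next_id;
        self.next_id += 1;
        self.ref_counts.insert(id, 1);
        id
    }

    /// Share an existing block (e.g. when a sequence forks): bump its refcount.
    fn fork(&mut self, id: usize) {
        *self.ref_counts.get_mut(&id).expect("unknown block") += 1;
    }

    /// Before writing into a block: if it is shared, split off a private
    /// copy first (copy-on-write). Returns the block id safe to write to.
    fn write(&mut self, id: usize) -> usize {
        if self.ref_counts[&id] > 1 {
            *self.ref_counts.get_mut(&id).unwrap() -= 1;
            self.alloc() // real code would also memcpy the KV data here
        } else {
            id
        }
    }
}
```

Because blocks are only copied when a shared block is actually written, two sequences with a common prefix pay for one physical copy of that prefix, not two.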

Prefix Caching

Block-level chain-hash prefix sharing. Repeated system prompts and multi-turn conversation history served directly from cache — zero recompute cost.
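Chain hashing is what makes block-level lookup sound: each block's cache key hashes its own tokens together with the previous block's key, so a hit on block *k* implies the entire prefix through block *k* matches. A minimal sketch, with a hypothetical block size and cache shape (not Fox's actual data structures):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const BLOCK: usize = 4; // tokens per KV block (illustrative)

/// Chain-hash keys for each full block of a token sequence.
/// Key k depends on key k-1, so equal keys imply equal prefixes.
fn chain_hashes(tokens: &[u32]) -> Vec<u64> {
    let mut keys = Vec::new();
    let mut prev: u64 = 0;
    for block in tokens.chunks_exact(BLOCK) {
        let mut h = DefaultHasher::new();
        prev.hash(&mut h);
        block.hash(&mut h);
        prev = h.finish();
        keys.push(prev);
    }
    keys
}

/// How many leading blocks of `tokens` can be served from cache
/// (cache maps chain-hash key -> physical KV block id).
fn cached_prefix_blocks(cache: &HashMap<u64, usize>, tokens: &[u32]) -> usize {
    chain_hashes(tokens)
        .iter()
        .take_while(|k| cache.contains_key(k))
        .count()
}
```

A request that repeats a cached system prompt matches on its leading blocks and skips prefill for them; only the divergent tail is computed.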

Continuous Batching

LIFO preemption scheduler saturates GPU compute across concurrent requests. Requests join and leave the batch dynamically — no idle cycles waiting for stragglers.
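The LIFO policy is simple to state: when a running request needs another KV block and none are free, evict the most recently admitted request back to the wait queue (it has the least work invested). A toy sketch under assumed names (`Scheduler`, `submit`, `grow` are illustrative, not Fox's API), with capacity counted in KV blocks:

```rust
/// Toy LIFO-preemption batching scheduler (illustrative, not Fox's).
struct Scheduler {
    capacity: usize,            // total KV blocks available
    running: Vec<(u64, usize)>, // (request id, blocks held), in admission order
    waiting: Vec<u64>,          // queued / preempted request ids
}

impl Scheduler {
    fn new(capacity: usize) -> Self {
        Self { capacity, running: Vec::new(), waiting: Vec::new() }
    }

    fn used(&self) -> usize {
        self.running.iter().map(|&(_, b)| b).sum()
    }

    /// Admit a request needing `blocks` KV blocks, or queue it.
    fn submit(&mut self, id: u64, blocks: usize) {
        if self.used() + blocks <= self.capacity {
            self.running.push((id, blocks));
        } else {
            self.waiting.push(id);
        }
    }

    /// Request `id` generated a token and needs one more block;
    /// preempt most-recently-admitted requests (LIFO) until it fits.
    fn grow(&mut self, id: u64) {
        while self.used() + 1 > self.capacity {
            let victim_idx = self
                .running
                .iter()
                .rposition(|&(rid, _)| rid != id)
                .expect("cannot preempt: only the growing request remains");
            let (vid, _) = self.running.remove(victim_idx);
            self.waiting.push(vid); // its KV blocks are freed for reuse
        }
        if let Some(entry) = self.running.iter_mut().find(|(rid, _)| *rid == id) {
            entry.1 += 1;
        }
    }
}
```

Preempting the newest request minimizes wasted work: it holds the fewest KV blocks and has produced the least output, so re-running it later is the cheapest recovery.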

Rust Runtime

No garbage collector. Predictable latency under load, with no GC pauses in the inference hot path. Memory safety without runtime overhead.

GGUF via llama.cpp FFI

Full quantization support through a thin, safe FFI layer over llama.cpp. Run Q4, Q5, Q8 models on consumer hardware with hardware-native acceleration.

Observability

Prometheus metrics at /metrics — request rates, TTFT histogram, KV cache utilization, prefix hit ratio. Production-ready monitoring from day one.

What we're building.

Now · 2026

Fox — Inference Engine

High-performance local LLM runtime in Rust. Open-source, privacy-first, production-ready.

Next · 2026–2027

Ferrumox Models

First proprietary models trained from scratch. Small, efficient, Spanish-first, optimized for local deployment on consumer hardware.

2027–2028

Research Lab

Published research on efficient inference, privacy-preserving architectures, and on-device AI. Europe's answer to the foundation model labs.

2028+

European AI Sovereignty

Full stack: runtime, models, training infrastructure — entirely European, entirely open, entirely yours.

AI that respects your data shouldn't require a compromise on performance.

  • Your data never leaves your infrastructure. Not as a feature — as an architectural guarantee.
  • Open weights, open source, reproducible training. No black boxes.
  • Built for European regulatory reality — AI Act compliance by design, not by patch.
  • Consumer hardware is enough. A 4060 should run a capable local model. We prove it.
  • Spanish and Iberian languages are first-class citizens, not an afterthought.
Open Source · MIT + Apache 2.0

Try Fox today.

Drop-in replacement for Ollama. Install in one command, benchmark against your current setup.