Self-hosting

Run the full InferLlama stack — registry, storage, tracker, and web UI — on your own infrastructure.

Architecture overview

A self-hosted InferLlama deployment consists of six services, all defined in docker-compose.yml:

Service   Port         Purpose
server    8000         FastAPI registry API (models, files, auth, admin)
web       3000         Next.js web UI
postgres  5432         Primary database (users, models, files, API keys)
redis     6379         Upload progress cache and rate-limit counters
minio     9000 / 9001  S3-compatible object storage (dev; replace with R2 in prod)
tracker   3001         WebTorrent tracker + .torrent generation for P2P downloads

Note

In production, replace MinIO with Cloudflare R2 (or any S3-compatible store) by setting the R2_* environment variables. See Configuration.

Docker Compose (local / dev)

Prerequisites

  • Docker 24+ with the Compose plugin (docker compose)
  • 4 GB RAM minimum for all services
  • Ports 3000, 8000, 3001, 9000–9001, 5432, 6379 available
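Before starting the stack, it can help to confirm that none of the required ports is already taken. The following is a minimal sketch, not part of InferLlama: the helper names `port_is_free` and `busy_ports` are hypothetical, and the check only probes the local loopback interface.

```python
import socket

# Ports listed in the prerequisites above.
REQUIRED_PORTS = [3000, 8000, 3001, 9000, 9001, 5432, 6379]


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 only when a listener accepted the connection.
        return sock.connect_ex((host, port)) != 0


def busy_ports(ports=REQUIRED_PORTS):
    """Return the subset of ports that already have a listener."""
    return [p for p in ports if not port_is_free(p)]


if __name__ == "__main__":
    taken = busy_ports()
    if taken:
        print("Ports already in use:", taken)
    else:
        print("All required ports are free")
```

If any port is reported as busy, stop the conflicting process or remap the port in docker-compose.yml before running docker compose up.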

Start everything

bash
# Clone the repo
git clone https://github.com/inferllama/inferllama.git
cd inferllama

# Start all services (builds images on first run)
docker compose up -d

# Follow logs
docker compose logs -f server web

The web UI is now available at http://localhost:3000 and the API at http://localhost:8000.

Run database migrations

bash
docker compose exec server alembic upgrade head

Create the first admin user

bash
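# A quoted heredoc keeps an interactive shell from expanding "!" or "$" in the script.
# -T disables the pseudo-TTY so the script can be piped in on stdin.
docker compose exec -T server python - <<'PY'
import asyncio
from database import AsyncSessionLocal
from models.user import User
from passlib.context import CryptContext

# bcrypt hasher: only the hash is stored, never the plaintext password
pwd = CryptContext(schemes=['bcrypt'])

async def create():
    async with AsyncSessionLocal() as db:
        u = User(username='admin', email='admin@localhost',
                 hashed_password=pwd.hash('ChangeMe123!'), role='admin')
        db.add(u)
        await db.commit()
        print('Created admin user')

asyncio.run(create())
PY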

Warning

Change the default password immediately after the first login.

Production deployment

Replace MinIO with Cloudflare R2

Set the following environment variables in your server service (via .env or your secrets manager). MinIO is automatically bypassed when R2_BUCKET is set.

bash
R2_ENDPOINT_URL=https://<account-id>.r2.cloudflarestorage.com
R2_ACCESS_KEY_ID=<key>
R2_SECRET_ACCESS_KEY=<secret>
R2_BUCKET=inferllama-models
R2_PUBLIC_URL=https://models.inferllama.com   # optional CDN URL
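The switch described above ("MinIO is bypassed when R2_BUCKET is set") can be pictured roughly as follows. This is an illustrative sketch only, not the server's actual code: the function name `resolve_storage`, the returned dictionary shape, and the MinIO fallback endpoint `http://minio:9000` (derived from the Compose service name and port) are all assumptions.

```python
import os

# Variables that must accompany R2_BUCKET; R2_PUBLIC_URL is optional.
REQUIRED_R2_VARS = ("R2_ENDPOINT_URL", "R2_ACCESS_KEY_ID", "R2_SECRET_ACCESS_KEY")


def resolve_storage(env=None) -> dict:
    """Pick the storage backend from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env

    if env.get("R2_BUCKET"):
        missing = [name for name in REQUIRED_R2_VARS if not env.get(name)]
        if missing:
            raise ValueError(f"R2_BUCKET is set but these are missing: {missing}")
        return {
            "backend": "r2",
            "endpoint_url": env["R2_ENDPOINT_URL"],
            "access_key_id": env["R2_ACCESS_KEY_ID"],
            "secret_access_key": env["R2_SECRET_ACCESS_KEY"],
            "bucket": env["R2_BUCKET"],
            "public_url": env.get("R2_PUBLIC_URL"),  # optional CDN URL
        }

    # Assumed dev default: the minio service from docker-compose.yml.
    return {"backend": "minio", "endpoint_url": "http://minio:9000"}
```

Failing fast on a partially configured R2 setup avoids silently falling back to MinIO in production.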

Reverse proxy with nginx

nginx
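server {
    listen 443 ssl;
    server_name inferllama.com;

    # "listen ... ssl" requires a certificate; these paths are placeholders
    ssl_certificate     /path/to/fullchain.pem;
    ssl_certificate_key /path/to/privkey.pem;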

    location /api/ {
        proxy_pass         http://localhost:8000/;
        proxy_http_version 1.1;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        # Large uploads need a high body size limit
        client_max_body_size 50G;
        proxy_request_buffering off;
    }

    location / {
        proxy_pass         http://localhost:3000/;
        proxy_http_version 1.1;
        proxy_set_header   Upgrade $http_upgrade;
        proxy_set_header   Connection "upgrade";
    }
}
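One detail of the configuration above is easy to miss: because proxy_pass ends with a URI (the trailing slash), nginx replaces the matched location prefix with that URI, so the API server never sees the /api/ prefix. The small sketch below models that mapping; `upstream_path` is a hypothetical helper written only for illustration, and it ignores query strings.

```python
def upstream_path(request_path: str, location: str = "/api/", proxy_uri: str = "/") -> str:
    """Model how nginx rewrites a path when proxy_pass carries a URI part.

    The portion of the request path that matched the location prefix is
    replaced by the URI given in proxy_pass.
    """
    if not request_path.startswith(location):
        raise ValueError(f"{request_path!r} does not match location {location!r}")
    return proxy_uri + request_path[len(location):]


if __name__ == "__main__":
    # A browser request to /api/v1/files/... reaches FastAPI as /v1/files/...
    print(upstream_path("/api/v1/files/abc/download/1"))
```

Dropping the trailing slash from proxy_pass would forward the path unchanged, including the /api/ prefix.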

Scaling

  • API server — stateless FastAPI; scale horizontally behind a load balancer. All state is in Postgres and Redis.
  • Storage — offload to R2 / S3. The server never stores blobs locally; it only proxies presigned URLs back to clients.
  • Tracker — the WebTorrent tracker is lightweight. One instance handles thousands of peers. If you need redundancy, run two and list both in the client config.
  • Database — use managed Postgres (Neon, Supabase, AWS RDS) with a connection pooler (PgBouncer) for high concurrency.

How storage works

InferLlama uses content-addressed chunked storage to eliminate redundant bytes:

  • Every uploaded file is split into 256 MB chunks. Each chunk is SHA-256 hashed to produce its key.
  • Before uploading a chunk, the server checks whether that hash already exists in the store. If it does, the chunk is skipped entirely — zero bytes transferred or billed.
  • A Q4_K_M and a Q8_0 variant of the same model typically share 60–80% of their chunks. Only the quantization-specific deltas are stored.
  • Fine-tunes that share a base model share all base-model chunks. Only the fine-tuned layers add new storage.

The result: storing ten quantization variants of a 7B model costs roughly the same as storing two or three, because the transformer weights are identical across variants.
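The deduplication idea can be sketched in a few lines. This is a toy in-memory model under stated assumptions, not InferLlama's storage layer: `ChunkStore`, `put_file`, and `get_file` are hypothetical names, the store is a plain dictionary, and 256 MB is taken to mean 256 × 1024 × 1024 bytes.

```python
import hashlib

CHUNK_SIZE = 256 * 1024 * 1024  # assumed interpretation of "256 MB"


def iter_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield fixed-size slices of data; the last one may be shorter."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


class ChunkStore:
    """Toy content-addressed store: the key of a chunk is its SHA-256 hex digest."""

    def __init__(self):
        self._blobs = {}

    def put_file(self, data: bytes, chunk_size: int = CHUNK_SIZE):
        """Store a file and return (manifest, bytes_actually_uploaded)."""
        manifest = []
        uploaded = 0
        for chunk in iter_chunks(data, chunk_size):
            key = hashlib.sha256(chunk).hexdigest()
            if key not in self._blobs:  # known hash: skip the upload entirely
                self._blobs[key] = chunk
                uploaded += len(chunk)
            manifest.append(key)
        return manifest, uploaded

    def get_file(self, manifest) -> bytes:
        """Reassemble a file from its ordered list of chunk keys."""
        return b"".join(self._blobs[key] for key in manifest)


if __name__ == "__main__":
    store = ChunkStore()
    base = b"AAAABBBBCCCC"
    tuned = b"AAAABBBBDDDD"  # shares the first two chunks with base
    print(store.put_file(base, chunk_size=4)[1])   # bytes uploaded for base
    print(store.put_file(tuned, chunk_size=4)[1])  # bytes uploaded for the variant
```

With a 4-byte chunk size for demonstration, the second file uploads only its final chunk because the first two hashes are already present.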


P2P distribution with WebTorrent

InferLlama uses a BitTorrent tracker to distribute model chunks peer-to-peer. When multiple users have downloaded the same model, they automatically seed chunks to each other — reducing your egress bill and improving download speed for everyone.

  • When a file finishes uploading, the tracker service generates a .torrent file and stores it in R2.
  • The GET /v1/files/{model_id}/download/{file_id} endpoint returns both a direct presigned URL and the torrent_info_hash.
  • The CLI prefers the torrent path (via aria2c) when the tool is installed; it falls back to direct HTTP chunk download automatically.
  • The web UI's Playground downloads model metadata from the registry but does not participate in the P2P swarm (browsers use WebRTC; the CLI uses TCP BitTorrent).
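The CLI's choice between the torrent path and direct HTTP, as described in the list above, might look something like this. It is a sketch under assumptions rather than the CLI's real code: `choose_transport` is a hypothetical function, and only the `torrent_info_hash` field name comes from the endpoint description; the shape of the rest of the response is not specified here.

```python
import shutil
from typing import Optional


def choose_transport(download_info: dict, aria2c_available: Optional[bool] = None) -> str:
    """Return "torrent" when aria2c is installed and a torrent exists, else "http".

    download_info is the parsed JSON body of the download endpoint.
    aria2c_available can be passed explicitly (useful in tests); when it is
    None, the PATH is searched for the aria2c binary.
    """
    if aria2c_available is None:
        aria2c_available = shutil.which("aria2c") is not None

    if aria2c_available and download_info.get("torrent_info_hash"):
        return "torrent"
    # Fallback: direct chunked download over the presigned URL.
    return "http"


if __name__ == "__main__":
    info = {"torrent_info_hash": "0123456789abcdef0123456789abcdef01234567"}
    print(choose_transport(info))
```

Checking for both the tool and the info hash means a file without a generated torrent still downloads over HTTP.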

Tip

Install aria2c on client machines to enable the torrent fast path:
bash
# macOS
brew install aria2

# Ubuntu / Debian
sudo apt install aria2

Tracker ports

The tracker listens on port 3001 (HTTP). For WebRTC peers to connect, the tracker must be reachable from the public internet on this port (or behind a reverse proxy that forwards it).

Warning

If deploying behind a firewall, ensure TCP and UDP port 3001 are open for the tracker, otherwise peers can only discover each other via the HTTP tracker (slower than WebRTC STUN).

Seeding models from Hugging Face

The sync_hf.py script indexes popular GGUF models from the Hugging Face Hub directly into your registry, with no full downloads required.

bash
# Dry-run: preview the top 50 models
python server/sync_hf.py --dry-run --limit 50

# Seed top 100 models into your registry
export INFERLLAMA_TOKEN=<your-admin-token>
python server/sync_hf.py --limit 100

# With an HF token for 10× rate limits
python server/sync_hf.py --limit 500 --hf-token hf_xxxx

Models are registered with status: remote — their metadata appears immediately in the registry, but the actual GGUF bytes are fetched on first download (pull-through caching).
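Pull-through caching can be illustrated with a small in-memory model. This is a conceptual sketch only: `PullThroughCache`, `register_remote`, and the `cached` status value are hypothetical names chosen for the example, and `fetch_upstream` stands in for whatever actually downloads the GGUF bytes from the upstream source.

```python
from typing import Callable, Dict


class PullThroughCache:
    """Serve from local storage when possible; fetch upstream on first request."""

    def __init__(self, fetch_upstream: Callable[[str], bytes]):
        self._fetch = fetch_upstream
        self._local: Dict[str, bytes] = {}
        self.status: Dict[str, str] = {}

    def register_remote(self, model_id: str) -> None:
        """Index metadata only; no bytes are downloaded at this point."""
        self.status[model_id] = "remote"

    def download(self, model_id: str) -> bytes:
        if model_id not in self.status:
            raise KeyError(f"unknown model: {model_id}")
        if model_id not in self._local:
            # First download: pull the bytes through and keep a local copy.
            self._local[model_id] = self._fetch(model_id)
            self.status[model_id] = "cached"  # hypothetical status name
        return self._local[model_id]


if __name__ == "__main__":
    cache = PullThroughCache(lambda model_id: f"bytes-of-{model_id}".encode())
    cache.register_remote("org/model-7b")
    print(cache.status["org/model-7b"])   # before the first download
    cache.download("org/model-7b")
    print(cache.status["org/model-7b"])   # after the first download
```

Only the first download pays the upstream transfer cost; later requests are served from the local copy.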


Backup and restore

bash
# Backup Postgres
docker compose exec -T postgres pg_dump -U postgres inferllama \
  | gzip > inferllama-$(date +%F).sql.gz

# Restore
gunzip -c inferllama-2025-01-01.sql.gz \
  | docker compose exec -T postgres psql -U postgres inferllama

Object storage (R2 / MinIO) is backed up via your cloud provider's versioning and cross-region replication features.