Self-hosting
Run the full InferLlama stack — registry, storage, tracker, and web UI — on your own infrastructure.
Architecture overview
A self-hosted InferLlama deployment consists of six services, all defined in docker-compose.yml:
| Service | Port | Purpose |
|---|---|---|
| server | 8000 | FastAPI registry API — models, files, auth, admin |
| web | 3000 | Next.js web UI |
| postgres | 5432 | Primary database (users, models, files, API keys) |
| redis | 6379 | Upload progress cache and rate-limit counters |
| minio | 9000 / 9001 | S3-compatible object storage (dev; replace with R2 in prod) |
| tracker | 3001 | WebTorrent tracker + .torrent generation for P2P downloads |
ℹ Note
In production, replace MinIO with Cloudflare R2 by setting the R2_* environment variables. See Configuration.
Docker Compose (local / dev)
Prerequisites
- ▸Docker 24+ with the Compose plugin (docker compose)
- ▸4 GB RAM minimum for all services
- ▸Ports 3000, 8000, 3001, 9000–9001, 5432, 6379 available
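Before starting the stack, it can help to confirm that the required ports are free. The following is a minimal sketch and not part of InferLlama: the function name `ports_in_use` and the choice to probe 127.0.0.1 are assumptions made for illustration.

```python
import socket

# Ports listed in the prerequisites above.
REQUIRED_PORTS = [3000, 8000, 3001, 9000, 9001, 5432, 6379]


def ports_in_use(ports, host="127.0.0.1", timeout=0.5):
    """Return the subset of `ports` that already accept TCP connections on `host`.

    Hypothetical helper for illustration; not an InferLlama API.
    """
    busy = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 when something is already listening.
            if sock.connect_ex((host, port)) == 0:
                busy.append(port)
    return busy


if __name__ == "__main__":
    taken = ports_in_use(REQUIRED_PORTS)
    if taken:
        print("Ports already in use:", ", ".join(map(str, taken)))
    else:
        print("All required ports are free.")
```

An empty result means no local process is listening on any of the listed ports, so `docker compose up -d` should be able to bind them.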
Start everything
# Clone the repo
git clone https://github.com/inferllama/inferllama.git
cd inferllama
# Start all services (builds images on first run)
docker compose up -d
# Follow logs
docker compose logs -f server web
The web UI is now available at http://localhost:3000 and the API at http://localhost:8000.
Run database migrations
docker compose exec server alembic upgrade head
Create the first admin user
docker compose exec server python -c "
import asyncio
from database import AsyncSessionLocal
from models.user import User
from passlib.context import CryptContext

pwd = CryptContext(schemes=['bcrypt'])

async def create():
    async with AsyncSessionLocal() as db:
        u = User(username='admin', email='admin@localhost',
                 hashed_password=pwd.hash('ChangeMe123!'), role='admin')
        db.add(u)
        await db.commit()
    print('Created admin user')

asyncio.run(create())
"
⚠ Warning
Change the default admin password (ChangeMe123!) immediately after the first login.
Production deployment
Replace MinIO with Cloudflare R2
Set the following environment variables in your server service (via .env or your secrets manager). MinIO is automatically bypassed when R2_BUCKET is set.
R2_ENDPOINT_URL=https://<account-id>.r2.cloudflarestorage.com
R2_ACCESS_KEY_ID=<key>
R2_SECRET_ACCESS_KEY=<secret>
R2_BUCKET=inferllama-models
R2_PUBLIC_URL=https://models.inferllama.com # optional CDN URL
Reverse proxy with nginx
server {
    listen 443 ssl;
    server_name inferllama.com;
    # ssl_certificate and ssl_certificate_key directives are also required here

    location /api/ {
        proxy_pass http://localhost:8000/;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        # Large uploads need a high body size limit
        client_max_body_size 50G;
        proxy_request_buffering off;
    }

    location / {
        proxy_pass http://localhost:3000/;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
Scaling
- ▸API server — stateless FastAPI; scale horizontally behind a load balancer. All state is in Postgres and Redis.
- ▸Storage — offload to R2 / S3. The server never stores blobs locally; it only proxies presigned URLs back to clients.
- ▸Tracker — the WebTorrent tracker is lightweight. One instance handles thousands of peers. If you need redundancy, run two and list both in the client config.
- ▸Database — use managed Postgres (Neon, Supabase, AWS RDS) with a connection pooler (PgBouncer) for high concurrency.
How storage works
InferLlama uses content-addressed chunked storage to eliminate redundant bytes:
- ▸Every uploaded file is split into 256 MB chunks. Each chunk is SHA-256 hashed to produce its key.
- ▸Before uploading a chunk, the server checks whether that hash already exists in the store. If it does, the chunk is skipped entirely — zero bytes transferred or billed.
- ▸A Q4_K_M and a Q8_0 variant of the same model typically share 60–80% of their chunks. Only the quantization-specific deltas are stored.
- ▸Fine-tunes that share a base model share all base-model chunks. Only the fine-tuned layers add new storage.
The result: storing ten quantization variants of a 7B model costs roughly the same as storing two or three, because the transformer weights are identical across variants.
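The deduplication steps above can be sketched as follows. This is an illustrative in-memory model, not InferLlama's implementation: `ChunkStore`, `split_chunks`, and `put_file` are hypothetical names, a dictionary stands in for the object store, and 256 MB is assumed to mean 256 MiB.

```python
import hashlib

CHUNK_SIZE = 256 * 1024 * 1024  # 256 MB chunks (assumed to mean MiB)


def split_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield consecutive fixed-size chunks of `data`; the last may be shorter."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


class ChunkStore:
    """In-memory stand-in for the object store, keyed by SHA-256 hex digest."""

    def __init__(self):
        self.blobs = {}

    def put_file(self, data: bytes, chunk_size: int = CHUNK_SIZE):
        """Store `data` and return (manifest, bytes_written).

        The manifest is the ordered list of chunk keys needed to rebuild the file.
        """
        manifest, written = [], 0
        for chunk in split_chunks(data, chunk_size):
            key = hashlib.sha256(chunk).hexdigest()
            if key not in self.blobs:  # chunk already stored: skip the write
                self.blobs[key] = chunk
                written += len(chunk)
            manifest.append(key)
        return manifest, written

    def get_file(self, manifest):
        """Reassemble a file from its manifest."""
        return b"".join(self.blobs[key] for key in manifest)


# Tiny chunk size so the example runs instantly.
store = ChunkStore()
base = b"A" * 8 + b"B" * 8
variant = b"A" * 8 + b"C" * 8  # shares its first chunk with `base`
_, first_written = store.put_file(base, chunk_size=8)
variant_manifest, second_written = store.put_file(variant, chunk_size=8)
print(first_written, second_written)  # 16 8
```

The second upload writes only the 8 bytes that differ, because the shared first chunk hashes to a key that is already present. Note that fixed-size chunking only deduplicates when identical bytes fall on the same chunk boundaries.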
P2P distribution with WebTorrent
InferLlama uses a BitTorrent tracker to distribute model chunks peer-to-peer. When multiple users have downloaded the same model, they automatically seed chunks to each other — reducing your egress bill and improving download speed for everyone.
- ▸When a file finishes uploading, the tracker service generates a .torrent file and stores it in R2.
- ▸The GET /v1/files/{model_id}/download/{file_id} endpoint returns both a direct presigned URL and the torrent_info_hash.
- ▸The CLI prefers the torrent path (via aria2c) when the tool is installed; it falls back to direct HTTP chunk download automatically.
- ▸The web UI's Playground downloads model metadata from the registry but does not participate in the P2P swarm (browsers use WebRTC; the CLI uses TCP BitTorrent).
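The CLI's choice between the torrent path and direct HTTP can be sketched roughly as below. The actual CLI logic is not shown here, so treat this as an assumption-laden illustration: `choose_download_path` is a hypothetical name, and the only response field taken from the text above is `torrent_info_hash`.

```python
import shutil


def choose_download_path(download_info: dict, which=shutil.which) -> str:
    """Return "torrent" or "http" for a download response.

    `download_info` is assumed to be the parsed JSON body of the download
    endpoint. `which` is injectable so the check can be exercised without
    aria2c installed.
    """
    has_torrent = bool(download_info.get("torrent_info_hash"))
    has_aria2c = which("aria2c") is not None
    # Use the swarm only when both the hash and the tool are available;
    # otherwise fall back to the direct presigned URL.
    return "torrent" if has_torrent and has_aria2c else "http"
```

Calling `choose_download_path({"torrent_info_hash": "abc"})` on a machine without aria2c returns `"http"`, which mirrors the automatic fallback described above.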
✦ Tip
Install aria2c on client machines to enable the torrent fast path:
# macOS
brew install aria2
# Ubuntu / Debian
sudo apt install aria2
Tracker ports
The tracker listens on port 3001 (HTTP). For WebRTC peers to connect, the tracker must be reachable from the public internet on this port (or behind a reverse proxy that forwards it).
⚠ Warning
Seeding models from Hugging Face
The sync_hf.py script indexes popular GGUF models from the Hugging Face Hub directly into your registry, with no full downloads required.
# Dry-run: preview the top 50 models
python server/sync_hf.py --dry-run --limit 50
# Seed top 100 models into your registry
export INFERLLAMA_TOKEN=<your-admin-token>
python server/sync_hf.py --limit 100
# With an HF token for 10× rate limits
python server/sync_hf.py --limit 500 --hf-token hf_xxxx
Models are registered with status: remote. Their metadata appears immediately in the registry, but the actual GGUF bytes are fetched on first download (pull-through caching).
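Pull-through caching can be illustrated with a small sketch. This is not the registry's code: `PullThroughCache` and `fetch_upstream` are hypothetical names, a dictionary stands in for object storage, and the `"cached"` status label is an assumption (only `remote` appears in the text above).

```python
class PullThroughCache:
    """Serve bytes from a local cache, fetching from upstream on first request."""

    def __init__(self, fetch_upstream):
        # fetch_upstream: callable(model_id) -> bytes, a hypothetical stand-in
        # for downloading the GGUF file from the upstream hub.
        self._fetch_upstream = fetch_upstream
        self._cache = {}

    def status(self, model_id: str) -> str:
        """Return "remote" until the bytes have been pulled, then "cached"."""
        return "cached" if model_id in self._cache else "remote"

    def download(self, model_id: str) -> bytes:
        """Return the model bytes, pulling them from upstream only once."""
        if model_id not in self._cache:
            self._cache[model_id] = self._fetch_upstream(model_id)
        return self._cache[model_id]
```

Only the first `download` call for a given model reaches upstream; later calls are served from the cache, which is why indexing many models costs little until someone actually downloads them.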
Backup and restore
# Backup Postgres
docker compose exec -T postgres pg_dump -U postgres inferllama \
  | gzip > inferllama-$(date +%F).sql.gz
# Restore
gunzip -c inferllama-2025-01-01.sql.gz \
  | docker compose exec -T postgres psql -U postgres inferllama
Back up object storage (R2 / MinIO) using your storage provider's versioning and cross-region replication features.