Documentation

Configuration

Environment variables, storage backends, inference settings, auth, and client config.

Server environment variables

Set these in your .env file, docker-compose.ymlenvironment section, or your deployment platform's secret store.

Core

Variable	Default	Description
DATABASE_URL	(required)	PostgreSQL connection string — postgresql+asyncpg://user:pass@host/db
SECRET_KEY	(required)	JWT signing secret — generate with: openssl rand -hex 32
REDIS_URL	redis://localhost:6379	Redis connection string
ACCESS_TOKEN_EXPIRE_MINUTES	10080	Session token lifetime in minutes (default: 7 days)
ALLOWED_ORIGINS	*	CORS allowed origins — comma-separated list or * for all

Object storage (R2 / MinIO)

All variables are prefixed R2_. Despite the name, any S3-compatible store works (AWS S3, MinIO, Backblaze B2, Cloudflare R2).

Variable	Default	Description
R2_ENDPOINT_URL	http://minio:9000	S3-compatible endpoint URL
R2_ACCESS_KEY_ID	minioadmin	Access key / key ID
R2_SECRET_ACCESS_KEY	minioadmin	Secret key
R2_BUCKET	inferllama	Bucket name — must already exist
R2_PUBLIC_URL	—	Public CDN URL prefix (optional; enables direct browser downloads)
CHUNK_SIZE_BYTES	268435456	Chunk size for content-addressed splitting (256 MB default)

bash

# Cloudflare R2 example
R2_ENDPOINT_URL=https://abc123.r2.cloudflarestorage.com
R2_ACCESS_KEY_ID=your_key_id
R2_SECRET_ACCESS_KEY=your_secret
R2_BUCKET=inferllama-prod
R2_PUBLIC_URL=https://models.inferllama.com

Rate limits

Variable	Default	Description
RATE_LIMIT_REQUESTS	100	Max requests per window per user
RATE_LIMIT_WINDOW_SECONDS	60	Rate limit window in seconds
UPLOAD_RATE_LIMIT_GB_PER_DAY	10	Max upload bytes per user per day (GB)

Registration

Variable	Default	Description
REGISTRATION_OPEN	true	Allow new user registration — set false to invite-only
DEFAULT_QUOTA_GB	10	Storage quota for new users in GB

Tracker environment variables

Variable	Default	Description
PORT	3001	Port the tracker HTTP server listens on
ANNOUNCE_URL	http://tracker:3001/announce	WebTorrent announce URL embedded in .torrent files
R2_*	(same as server)	Object storage config for storing .torrent files

CLI client config

The CLI stores its configuration in ~/.inferllama/config.json:

json

{
  "server_url": "http://localhost:8000",
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9…"
}

Updated automatically by inferllama login.

Directory structure

Path	Contents
~/.inferllama/config.json	Registry URL and auth token
~/.inferllama/models/	Assembled GGUF files ready for inference
~/.inferllama/chunks/	Content-addressed 256 MB chunks (cache)
~/.inferllama/bin/	llama-server binary (if not installed via brew)

Disk management

Chunks and models accumulate over time. To reclaim disk space:

bash

# Remove a model and its exclusive chunks
inferllama rm qwen2-0.5b-instruct

# Prune all chunks not referenced by any local model
inferllama cache prune

# Show disk usage by model
inferllama list --size

✦ Tip

Shared chunks (used by multiple models) are never deleted by rm. They are only removed when the last model referencing them is deleted.

Inference configuration

inferllama run and inferllama serve launchllama-server under the hood. These flags control its behavior:

CLI flag	llama-server param	Default	Description
--port	--port	8080	Port for the inference server
--ctx	--ctx-size	4096	Context window in tokens
--temperature	n/a (per-request)	0.7	Default sampling temperature
--max-tokens	n/a (per-request)	unlimited	Max tokens per completion
--system	n/a (per-session)	—	System prompt prepended to every message

GPU / Metal acceleration

llama-server auto-detects Metal (macOS) and CUDA (Linux/Windows). If you installed via Homebrew, Metal is already enabled. For CUDA:

bash

# Install the CUDA-enabled build
inferllama setup --force --cuda

# Verify GPU layers are loaded (look for "ggml_cuda_init" in output)
inferllama run qwen2-0.5b-instruct 2>&1 | head -20

Full .env example

bash

# ── Core ─────────────────────────────────────
DATABASE_URL=postgresql+asyncpg://postgres:password@postgres:5432/inferllama
SECRET_KEY=replace-me-with-32-hex-chars
REDIS_URL=redis://redis:6379

# ── Storage (production — Cloudflare R2) ─────
R2_ENDPOINT_URL=https://abc123.r2.cloudflarestorage.com
R2_ACCESS_KEY_ID=your_key
R2_SECRET_ACCESS_KEY=your_secret
R2_BUCKET=inferllama-prod
R2_PUBLIC_URL=https://models.inferllama.com

# ── Storage (dev — local MinIO) ─────────────
# R2_ENDPOINT_URL=http://minio:9000
# R2_ACCESS_KEY_ID=minioadmin
# R2_SECRET_ACCESS_KEY=minioadmin
# R2_BUCKET=inferllama

# ── Auth & registration ───────────────────────
REGISTRATION_OPEN=true
DEFAULT_QUOTA_GB=10
ACCESS_TOKEN_EXPIRE_MINUTES=10080

# ── Rate limiting ─────────────────────────────
RATE_LIMIT_REQUESTS=100
RATE_LIMIT_WINDOW_SECONDS=60
UPLOAD_RATE_LIMIT_GB_PER_DAY=10

# ── Next.js web ───────────────────────────────
NEXT_PUBLIC_API_URL=http://localhost:8000
NEXT_PUBLIC_DAEMON_URL=http://localhost:11434

Self-hosting