Documentation
Configuration
Environment variables, storage backends, inference settings, auth, and client config.
Server environment variables
Set these in your .env file, docker-compose.ymlenvironment section, or your deployment platform's secret store.
Core
| Variable | Default | Description |
|---|---|---|
| DATABASE_URL | (required) | PostgreSQL connection string — postgresql+asyncpg://user:pass@host/db |
| SECRET_KEY | (required) | JWT signing secret — generate with: openssl rand -hex 32 |
| REDIS_URL | redis://localhost:6379 | Redis connection string |
| ACCESS_TOKEN_EXPIRE_MINUTES | 10080 | Session token lifetime in minutes (default: 7 days) |
| ALLOWED_ORIGINS | * | CORS allowed origins — comma-separated list or * for all |
Object storage (R2 / MinIO)
All variables are prefixed R2_. Despite the name, any S3-compatible store works (AWS S3, MinIO, Backblaze B2, Cloudflare R2).
| Variable | Default | Description |
|---|---|---|
| R2_ENDPOINT_URL | http://minio:9000 | S3-compatible endpoint URL |
| R2_ACCESS_KEY_ID | minioadmin | Access key / key ID |
| R2_SECRET_ACCESS_KEY | minioadmin | Secret key |
| R2_BUCKET | inferllama | Bucket name — must already exist |
| R2_PUBLIC_URL | — | Public CDN URL prefix (optional; enables direct browser downloads) |
| CHUNK_SIZE_BYTES | 268435456 | Chunk size for content-addressed splitting (256 MB default) |
bash
# Cloudflare R2 example
R2_ENDPOINT_URL=https://abc123.r2.cloudflarestorage.com
R2_ACCESS_KEY_ID=your_key_id
R2_SECRET_ACCESS_KEY=your_secret
R2_BUCKET=inferllama-prod
R2_PUBLIC_URL=https://models.inferllama.comRate limits
| Variable | Default | Description |
|---|---|---|
| RATE_LIMIT_REQUESTS | 100 | Max requests per window per user |
| RATE_LIMIT_WINDOW_SECONDS | 60 | Rate limit window in seconds |
| UPLOAD_RATE_LIMIT_GB_PER_DAY | 10 | Max upload bytes per user per day (GB) |
Registration
| Variable | Default | Description |
|---|---|---|
| REGISTRATION_OPEN | true | Allow new user registration — set false to invite-only |
| DEFAULT_QUOTA_GB | 10 | Storage quota for new users in GB |
Tracker environment variables
| Variable | Default | Description |
|---|---|---|
| PORT | 3001 | Port the tracker HTTP server listens on |
| ANNOUNCE_URL | http://tracker:3001/announce | WebTorrent announce URL embedded in .torrent files |
| R2_* | (same as server) | Object storage config for storing .torrent files |
CLI client config
The CLI stores its configuration in ~/.inferllama/config.json:
json
{
"server_url": "http://localhost:8000",
"token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9…"
}Updated automatically by inferllama login.
Directory structure
| Path | Contents |
|---|---|
| ~/.inferllama/config.json | Registry URL and auth token |
| ~/.inferllama/models/ | Assembled GGUF files ready for inference |
| ~/.inferllama/chunks/ | Content-addressed 256 MB chunks (cache) |
| ~/.inferllama/bin/ | llama-server binary (if not installed via brew) |
Disk management
Chunks and models accumulate over time. To reclaim disk space:
bash
# Remove a model and its exclusive chunks
inferllama rm qwen2-0.5b-instruct
# Prune all chunks not referenced by any local model
inferllama cache prune
# Show disk usage by model
inferllama list --size✦ Tip
Shared chunks (used by multiple models) are never deleted by
rm. They are only removed when the last model referencing them is deleted.Inference configuration
inferllama run and inferllama serve launchllama-server under the hood. These flags control its behavior:
| CLI flag | llama-server param | Default | Description |
|---|---|---|---|
| --port | --port | 8080 | Port for the inference server |
| --ctx | --ctx-size | 4096 | Context window in tokens |
| --temperature | n/a (per-request) | 0.7 | Default sampling temperature |
| --max-tokens | n/a (per-request) | unlimited | Max tokens per completion |
| --system | n/a (per-session) | — | System prompt prepended to every message |
GPU / Metal acceleration
llama-server auto-detects Metal (macOS) and CUDA (Linux/Windows). If you installed via Homebrew, Metal is already enabled. For CUDA:
bash
# Install the CUDA-enabled build
inferllama setup --force --cuda
# Verify GPU layers are loaded (look for "ggml_cuda_init" in output)
inferllama run qwen2-0.5b-instruct 2>&1 | head -20Full .env example
bash
# ── Core ─────────────────────────────────────
DATABASE_URL=postgresql+asyncpg://postgres:password@postgres:5432/inferllama
SECRET_KEY=replace-me-with-32-hex-chars
REDIS_URL=redis://redis:6379
# ── Storage (production — Cloudflare R2) ─────
R2_ENDPOINT_URL=https://abc123.r2.cloudflarestorage.com
R2_ACCESS_KEY_ID=your_key
R2_SECRET_ACCESS_KEY=your_secret
R2_BUCKET=inferllama-prod
R2_PUBLIC_URL=https://models.inferllama.com
# ── Storage (dev — local MinIO) ─────────────
# R2_ENDPOINT_URL=http://minio:9000
# R2_ACCESS_KEY_ID=minioadmin
# R2_SECRET_ACCESS_KEY=minioadmin
# R2_BUCKET=inferllama
# ── Auth & registration ───────────────────────
REGISTRATION_OPEN=true
DEFAULT_QUOTA_GB=10
ACCESS_TOKEN_EXPIRE_MINUTES=10080
# ── Rate limiting ─────────────────────────────
RATE_LIMIT_REQUESTS=100
RATE_LIMIT_WINDOW_SECONDS=60
UPLOAD_RATE_LIMIT_GB_PER_DAY=10
# ── Next.js web ───────────────────────────────
NEXT_PUBLIC_API_URL=http://localhost:8000
NEXT_PUBLIC_DAEMON_URL=http://localhost:11434