CLI reference
Complete reference for every inferllama command, flag, and in-session slash command.
Global flags
| Flag | Description |
|---|---|
| --help | Show help for any command |
| --version | Print the installed CLI version |
| --server <url> | Override the registry URL for this invocation only |
inferllama run
Pull the model if needed, start a local inference server, and open an interactive streaming chat session — all in one command. This is the primary way to use InferLlama.
inferllama run <model> [options]
| Flag | Default | Description |
|---|---|---|
| --port <int> | 8080 | Port for the local llama-server process |
| --ctx <int> | 4096 | Context window size in tokens |
| --system <string> | — | System prompt prepended to every conversation |
| --temperature <float> | 0.7 | Sampling temperature (0.0 = deterministic, 2.0 = very random) |
| --max-tokens <int> | — | Maximum tokens per response (unlimited by default) |
| --server-only | false | Start inference server without opening chat (useful for integrations) |
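For integrations that launch the CLI from a script, the flags in the table above translate directly into an argument list. The Python sketch below is illustrative only: `build_run_command` is a hypothetical helper, not part of InferLlama, and it assumes the flags behave exactly as documented in this table.

```python
from typing import List, Optional


def build_run_command(
    model: str,
    port: int = 8080,
    ctx: int = 4096,
    system: Optional[str] = None,
    temperature: float = 0.7,
    max_tokens: Optional[int] = None,
    server_only: bool = False,
) -> List[str]:
    """Build an argv list for `inferllama run` (hypothetical helper).

    Defaults mirror the table above; optional flags are emitted only when set.
    """
    cmd = [
        "inferllama", "run", model,
        "--port", str(port),
        "--ctx", str(ctx),
        "--temperature", str(temperature),
    ]
    if system is not None:
        cmd += ["--system", system]
    if max_tokens is not None:
        cmd += ["--max-tokens", str(max_tokens)]
    if server_only:
        cmd.append("--server-only")
    return cmd


# Possible usage (requires the inferllama CLI on PATH):
#   import subprocess
#   proc = subprocess.Popen(
#       build_run_command("llama3-8b-instruct", port=9000, server_only=True)
#   )
```

Passing the command as a list rather than a single string avoids shell quoting problems when the system prompt contains spaces or quotes.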
# Basic interactive chat
inferllama run qwen2-0.5b-instruct
# Custom system prompt and low temperature for coding
inferllama run qwen2-0.5b-instruct \
--system "You are a concise expert software engineer." \
--temperature 0.2
# Start inference server only, don't open chat
inferllama run llama3-8b-instruct --server-only --port 9000
✦ Tip
If the model is not already downloaded, inferllama run pulls it first, so there is no need to run inferllama pull separately.
inferllama pull
Download a model from the registry to local storage without starting inference. Models are transferred as content-addressed 256 MB chunks. Chunks already present locally are skipped, so fine-tunes that share a base model download only the new bytes.
inferllama pull <model>
inferllama pull qwen2-0.5b-instruct
inferllama pull llama3-8b-instruct
Downloads are shown as a progress bar with transfer speed and estimated time remaining. If the registry exposes a .torrent for the model, InferLlama automatically tries the torrent path first (via aria2c if installed) and falls back to direct HTTP if no peers are available.
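The chunk skipping described in this section can be pictured with a small Python sketch. It is not InferLlama's implementation: it assumes each chunk is identified by its SHA-256 digest and that a model is described by an ordered list of chunk IDs (a "manifest"). The hash choice, the manifest shape, and all function names are hypothetical.

```python
import hashlib
from typing import Iterator, List, Set

CHUNK_SIZE = 256 * 1024 * 1024  # 256 MB, the chunk size stated in this section


def iter_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield consecutive fixed-size slices of `data` (the last may be shorter)."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def chunk_ids(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Content address of each chunk; SHA-256 is an assumption of this sketch."""
    return [hashlib.sha256(c).hexdigest() for c in iter_chunks(data, chunk_size)]


def chunks_to_download(manifest: List[str], local: Set[str]) -> List[str]:
    """Chunk IDs in `manifest` that are not stored locally, in order, no repeats."""
    seen = set(local)
    missing = []
    for cid in manifest:
        if cid not in seen:
            seen.add(cid)
            missing.append(cid)
    return missing


# Demo with a 4-byte chunk size so the example stays tiny.
base = b"AAAABBBBCCCC"   # already pulled
tuned = b"AAAABBBBDDDD"  # fine-tune sharing the first two chunks
local_store = set(chunk_ids(base, 4))
missing = chunks_to_download(chunk_ids(tuned, 4), local_store)
print(len(missing))  # 1: only the chunk holding b"DDDD" needs to be fetched
```

Because a chunk's ID depends only on its bytes, two models that contain an identical block map to the same ID, which is what lets the shared block be skipped.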
inferllama list
List all models stored on the local machine.
inferllama list
NAME                 SIZE    MODIFIED
qwen2-0.5b-instruct  954 MB  2 hours ago
llama3-8b-instruct   4.7 GB  yesterday
mistral-7b-instruct  4.1 GB  3 days ago
inferllama serve
Start an OpenAI-compatible inference proxy daemon. The daemon listens on a local port and exposes POST /v1/chat/completions, so it works with any OpenAI SDK, LangChain, or plain HTTP client.
inferllama serve [options]
| Flag | Default | Description |
|---|---|---|
| --port <int> | 11434 | Port for the daemon to listen on |
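Since the daemon exposes POST /v1/chat/completions, a client can be written with nothing but the Python standard library. The sketch below is a minimal example under stated assumptions: the daemon is on the default port 11434, and the response follows the OpenAI chat-completions shape (`choices[0].message.content`), which this reference implies but does not spell out. `build_chat_request`, `extract_reply`, and `chat` are hypothetical names.

```python
import json
import urllib.request

DEFAULT_BASE_URL = "http://localhost:11434"  # default --port for inferllama serve


def build_chat_request(
    prompt: str, model: str, base_url: str = DEFAULT_BASE_URL
) -> urllib.request.Request:
    """Build the POST request for a single-turn chat completion."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of an OpenAI-style response (assumed shape)."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def chat(
    prompt: str,
    model: str,
    base_url: str = DEFAULT_BASE_URL,
    timeout: float = 60.0,
) -> str:
    """Send the request to a running daemon and return the reply text."""
    request = build_chat_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(response.read())


# Possible usage, with `inferllama serve` running in another terminal:
#   print(chat("ping", "qwen2-0.5b-instruct"))
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a running daemon.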
# Start the daemon (foreground)
inferllama serve
# Verify it is running
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen2-0.5b-instruct","messages":[{"role":"user","content":"ping"}]}'
ℹ Note
To keep the daemon running in the background, use a process manager such as systemd, launchd, or pm2 start "inferllama serve".
inferllama push
Upload a local model file to the registry. The file is split into 256 MB content-addressed chunks. If you already uploaded the base model, fine-tune variants only transfer the new chunks — identical blocks are deduplicated automatically.
inferllama push <file> [options]
| Flag | Description |
|---|---|
| --name <string> | Registry model name (defaults to filename stem) |
| --display-name <string> | Human-readable display name |
| --arch <string> | Architecture (llama, mistral, qwen, phi, gemma, …) |
| --quant <string> | Quantization (Q4_K_M, Q8_0, F16, …) |
| --license <string> | License identifier (MIT, Apache-2.0, …) |
| --tags <list> | Comma-separated tags |
| --description <string> | Short model description |
inferllama push ./my-model-7b-q4_k_m.gguf \
--name my-model-7b \
--arch llama \
--quant Q4_K_M \
--license MIT
inferllama login
Authenticate with a registry server. The token is saved to ~/.inferllama/config.json.
inferllama login [options]
| Flag | Description |
|---|---|
| --server <url> | Your registry server URL (e.g. https://inferllama.com) |
| --username <string> | Username (prompted interactively if omitted) |
| --password <string> | Password (prompted interactively if omitted) |
# Interactive login — replace with your registry URL
inferllama login --server https://inferllama.com
# Non-interactive (CI / automation)
inferllama login --server https://inferllama.com \
--username ci-user \
--password "$INFERLLAMA_PASSWORD"
inferllama setup
Install the local inference backend and download accelerator. Run once after install; re-run with --force to upgrade.
inferllama setup [--force]
Installs two tools:
- llama-server (llama.cpp): runs models locally. macOS: via Homebrew. Linux: prebuilt binary in ~/.inferllama/bin/. Windows: via winget or a GitHub Releases binary.
- aria2c: multi-source download accelerator for 3–5× faster model pulls. Uses the same OS detection as llama-server; a failed install is non-fatal.
inferllama search
Search the registry for models by name, architecture, or tag.
inferllama search <query>
inferllama search qwen
inferllama search llama 7b
inferllama search mistral instruct
inferllama rm
Remove a locally downloaded model. Shared chunks (used by other models) are kept; only chunks exclusive to this model are deleted.
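The rule for which chunks survive a removal amounts to a set difference. The Python sketch below is one way to picture it, not the CLI's actual code: it assumes each local model has a manifest listing its chunk IDs, and both the `manifests` mapping and the `chunks_safe_to_delete` helper are hypothetical.

```python
from typing import Dict, List, Set


def chunks_safe_to_delete(manifests: Dict[str, List[str]], model: str) -> Set[str]:
    """Return chunk IDs referenced by `model` and by no other local model."""
    # Every chunk that some *other* model still depends on must be kept.
    still_needed = {
        chunk_id
        for name, chunk_list in manifests.items()
        if name != model
        for chunk_id in chunk_list
    }
    return set(manifests[model]) - still_needed


# Hypothetical local state: a base model and a fine-tune sharing two chunks.
manifests = {
    "base-7b": ["c1", "c2", "c3"],
    "base-7b-tuned": ["c1", "c2", "c4"],
}
print(chunks_safe_to_delete(manifests, "base-7b-tuned"))  # {'c4'}
```

Removing the fine-tune in this example frees only its exclusive chunk, while the base model stays fully intact.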
inferllama rm <model>
In-session slash commands
These commands are typed inside an active inferllama run session:
| Command | Description |
|---|---|
| /clear | Clear conversation history and start fresh |
| /bye | End the session and exit |
| /history | Print the full conversation so far |
| /model | Show the current model name, context size, temperature, and token usage |
> /model
Model: qwen2-0.5b-instruct
Context: 4096 tokens
Temperature: 0.7
Session: 12 messages, ~830 tokens used
> /clear
Conversation cleared.
> /bye
Goodbye.
Config file
The CLI stores its state in ~/.inferllama/config.json. You can edit it directly or use inferllama login to update it.
{
"server_url": "https://inferllama.com",
"token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
}
ℹ Note
For a local self-hosted registry, set server_url to http://localhost:8000. See Self-hosting for local Docker Compose setup.
See Configuration for the full reference.
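Scripts that need the registry URL or token can read the config file directly. The Python sketch below assumes only the two keys shown in the example config, `server_url` and `token`; `parse_config` and `load_config` are hypothetical helpers, and any other keys the file may hold are ignored.

```python
import json
from pathlib import Path
from typing import Optional, Tuple

# Location stated in this section.
DEFAULT_CONFIG = Path.home() / ".inferllama" / "config.json"


def parse_config(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (server_url, token) from config JSON; missing keys become None."""
    data = json.loads(text)
    return data.get("server_url"), data.get("token")


def load_config(path: Path = DEFAULT_CONFIG) -> Tuple[Optional[str], Optional[str]]:
    """Read the config file, or return (None, None) if it does not exist yet."""
    if not path.exists():
        return None, None
    return parse_config(path.read_text(encoding="utf-8"))


# Possible usage:
#   server_url, token = load_config()
#   if token is None:
#       print("Not logged in; run inferllama login first.")
```

Treat the token like a password: avoid printing it or committing the config file to version control.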