CLI reference

Complete reference for every inferllama command, flag, and in-session slash command.

Global flags

Flag	Description
--help	Show help for any command
--version	Print the installed CLI version
--server <url>	Override the registry URL for this invocation only

inferllama run

Pull the model if needed, start a local inference server, and open an interactive streaming chat session — all in one command. This is the primary way to use InferLlama.

bash

inferllama run <model> [options]

Flag	Default	Description
--port <int>	8080	Port for the local llama-server process
--ctx <int>	4096	Context window size in tokens
--system <string>	—	System prompt prepended to every conversation
--temperature <float>	0.7	Sampling temperature (0.0 = deterministic, 2.0 = very random)
--max-tokens <int>	—	Maximum tokens per response (unlimited by default)
--server-only	false	Start inference server without opening chat (useful for integrations)

bash

# Basic interactive chat
inferllama run qwen2-0.5b-instruct

# Custom system prompt and low temperature for coding
inferllama run qwen2-0.5b-instruct \
  --system "You are a concise expert software engineer." \
  --temperature 0.2

# Start inference server only, don't open chat
inferllama run llama3-8b-instruct --server-only --port 9000

✦ Tip

If the model is not yet downloaded, inferllama run automatically pulls it first — no need to run inferllama pull separately.
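
When an integration starts the server with --server-only, a script usually needs to wait until the port answers before it sends requests. The following is a minimal sketch, assuming the chosen port exposes llama-server's own /health endpoint (a llama.cpp feature; this page does not confirm that InferLlama passes it through unchanged). wait_for_server is a hypothetical helper, not part of the CLI.

```shell
# Poll a URL until it answers with a successful HTTP status, or give up.
# Usage: wait_for_server <url> [attempts] [delay-seconds]
# Hypothetical helper; the /health path is an assumption borrowed from llama.cpp.
wait_for_server() {
  local url="$1" attempts="${2:-30}" delay="${3:-1}" i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP errors (for example 503 while a model loads).
    if curl -fsS --max-time 2 -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example, with the server started separately via
#   inferllama run llama3-8b-instruct --server-only --port 9000
#
#   wait_for_server http://localhost:9000/health && echo "server ready"
```

The function returns non-zero once the attempts are exhausted, so it can gate later steps in a script with && or an if statement.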

inferllama pull

Download a model from the registry to local storage without starting inference. Uses content-addressed 256 MB chunks — chunks already present locally are skipped, so fine-tunes that share a base model only download the new bytes.

bash

inferllama pull <model>

bash

inferllama pull qwen2-0.5b-instruct
inferllama pull llama3-8b-instruct

Downloads are shown as a progress bar with transfer speed and estimated time remaining. If the registry exposes a .torrent for the model, InferLlama automatically tries the torrent path first (via aria2c if installed) and falls back to direct HTTP if no peers are available.
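
The chunk skipping described above amounts to a set difference between the chunk digests a model needs and the digests already stored locally. The sketch below illustrates only that idea. The one-digest-per-line manifest files, the helper names, and reading "256 MB" as 256 MiB are all assumptions, not InferLlama's actual storage format.

```shell
# Chunk size from the text. Whether "256 MB" means 10^6 or 2^20 byte units is
# not stated, so MiB is an assumption here.
CHUNK_BYTES=$((256 * 1024 * 1024))

# Number of chunks needed to cover a file of the given byte size
# (ceiling division).
chunk_count() {
  echo $(( ($1 + CHUNK_BYTES - 1) / CHUNK_BYTES ))
}

# Print the digests listed in <wanted> that are absent from <have>.
# Both files hold one chunk digest per line (hypothetical manifest format).
# Usage: missing_chunks <wanted-file> <have-file>
missing_chunks() {
  awk 'FILENAME == ARGV[1] { have[$0] = 1; next }
       !($0 in have)' "$2" "$1"
}

# Example:
#   missing_chunks finetune.manifest local.manifest
```

If a fine-tune shares most digests with a base model that is already stored, missing_chunks prints only the digests unique to the fine-tune, which corresponds to the "only download the new bytes" behaviour.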

inferllama list

List all models stored on the local machine.

bash

inferllama list

text

NAME                      SIZE      MODIFIED
qwen2-0.5b-instruct       954 MB    2 hours ago
llama3-8b-instruct        4.7 GB    yesterday
mistral-7b-instruct       4.1 GB    3 days ago
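
For scripting, the table can be reduced to bare model names. This is a small sketch, assuming the column layout shown above is stable and that model names contain no whitespace; model_names and has_model are hypothetical helpers, not CLI features.

```shell
# Print only the model names from `inferllama list` output read on stdin.
# Skips the header row and any blank lines.
model_names() {
  awk 'NR > 1 && NF > 0 { print $1 }'
}

# Succeed if the named model appears in the listing read on stdin.
# -x requires a whole-line match, so "llama3" does not match "llama3-8b-instruct".
has_model() {
  model_names | grep -qxF -- "$1"
}

# Example:
#   inferllama list | has_model qwen2-0.5b-instruct || inferllama pull qwen2-0.5b-instruct
```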

inferllama serve

Start an OpenAI-compatible inference proxy daemon. The daemon listens on a local port and exposes POST /v1/chat/completions, so any OpenAI SDK, LangChain, or plain HTTP client can talk to it.

bash

inferllama serve [options]

Flag	Default	Description
--port <int>	11434	Port for the daemon to listen on

bash

# Start the daemon (foreground)
inferllama serve

# Verify it is running
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2-0.5b-instruct","messages":[{"role":"user","content":"ping"}]}'

ℹ Note

The daemon runs in the foreground. To keep it running persistently, use a process supervisor such as systemd or launchd, or run pm2 start "inferllama serve".
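
When the prompt comes from a variable rather than a fixed string, the JSON body has to be escaped before it is sent to the daemon. A hedged sketch of one way to wrap the curl call: json_escape, chat_payload, and ask are hypothetical helper names, and the escaping covers only backslashes and double quotes in single-line text. For multi-line or arbitrary input, a real JSON encoder such as jq is the safer choice.

```shell
# Escape backslashes and double quotes for use inside a JSON string.
# Limitation: newlines and other control characters are NOT handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a single-turn chat request body.
# Usage: chat_payload <model> <prompt>
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# Send the request to a running `inferllama serve` daemon.
# Usage: ask <model> <prompt> [port]   (port defaults to 11434)
ask() {
  curl -sS "http://localhost:${3:-11434}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1" "$2")"
}

# Example:
#   ask qwen2-0.5b-instruct 'Say "hello" in French'
```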

inferllama push

Upload a local model file to the registry. The file is split into 256 MB content-addressed chunks. If you already uploaded the base model, fine-tune variants only transfer the new chunks — identical blocks are deduplicated automatically.

bash

inferllama push <file> [options]

Flag	Description
--name <string>	Registry model name (defaults to filename stem)
--display-name <string>	Human-readable display name
--arch <string>	Architecture (llama, mistral, qwen, phi, gemma, …)
--quant <string>	Quantization (Q4_K_M, Q8_0, F16, …)
--license <string>	License identifier (MIT, Apache-2.0, …)
--tags <list>	Comma-separated tags
--description <string>	Short model description

bash

inferllama push ./my-model-7b-q4_k_m.gguf \
  --name my-model-7b \
  --arch llama \
  --quant Q4_K_M \
  --license MIT
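
To make "content-addressed chunks" concrete, the sketch below splits a file into fixed-size pieces and prints one digest per piece. It assumes a chunk's address is the SHA-256 of its bytes and that 256 MB means 256 MiB; neither detail is confirmed on this page, and chunk_digests is a hypothetical helper rather than what inferllama push runs internally.

```shell
# Print one SHA-256 digest per fixed-size chunk of a file, in file order.
# Usage: chunk_digests <file> [chunk-bytes]   (default: 256 MiB, an assumption)
# Note: split writes a temporary copy of the file, so this needs free disk
# space equal to the file size. sha256sum is GNU coreutils; on macOS,
# `shasum -a 256` is the usual equivalent.
chunk_digests() {
  local file="$1" size="${2:-268435456}" tmp part
  tmp=$(mktemp -d) || return 1
  split -b "$size" "$file" "$tmp/chunk."
  for part in "$tmp"/chunk.*; do
    [ -e "$part" ] || continue          # empty input produces no chunks
    sha256sum "$part" | cut -d' ' -f1
  done
  rm -rf "$tmp"
}

# Example:
#   chunk_digests ./my-model-7b-q4_k_m.gguf > my-model-7b.manifest
```

Two files that share identical chunk-aligned blocks produce identical digests for those blocks, which is what lets a registry skip them on upload.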

inferllama login

Authenticate with a registry server. The token is saved to ~/.inferllama/config.json.

bash

inferllama login [options]

Flag	Description
--server <url>	Your registry server URL (e.g. https://inferllama.com)
--username <string>	Username (prompted interactively if omitted)
--password <string>	Password (prompted interactively if omitted)

bash

# Interactive login — replace with your registry URL
inferllama login --server https://inferllama.com

# Non-interactive (CI / automation)
inferllama login --server https://inferllama.com \
  --username ci-user \
  --password "$INFERLLAMA_PASSWORD"
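
In CI, an unset secret would otherwise pass an empty --password value. A small guard can fail fast instead. This is only a sketch: ci_login is a hypothetical wrapper, and it uses the INFERLLAMA_PASSWORD variable name from the example above. Keep in mind that values passed as command-line arguments can be visible to other users in process listings.

```shell
# Log in non-interactively, refusing to run if the secret is missing.
# Usage: ci_login <server-url> <username>
ci_login() {
  if [ -z "${INFERLLAMA_PASSWORD:-}" ]; then
    echo "INFERLLAMA_PASSWORD is not set" >&2
    return 1
  fi
  inferllama login --server "$1" \
    --username "$2" \
    --password "$INFERLLAMA_PASSWORD"
}

# Example:
#   ci_login https://inferllama.com ci-user
```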

inferllama setup

Install the local inference backend and the download accelerator. Run once after install; re-run with --force to upgrade.

bash

inferllama setup [--force]

Installs two tools:

▸ llama-server (llama.cpp): runs models locally. macOS: via Homebrew. Linux: prebuilt binary installed to ~/.inferllama/bin/. Windows: via winget or a GitHub Releases binary.
▸ aria2c: multi-source download accelerator for 3–5× faster model pulls. Uses the same OS detection as above; an install failure is non-fatal.
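
To check what setup installed, a script can look for each tool on PATH and then in the CLI's own bin directory. A minimal sketch, assuming the Linux binaries in ~/.inferllama/bin/ keep the names llama-server and aria2c; find_tool is a hypothetical helper.

```shell
# Print the path of a tool found on PATH or in ~/.inferllama/bin/.
# Returns non-zero if the tool is in neither place.
find_tool() {
  local name="$1" local_bin="${HOME:-}/.inferllama/bin/$1"
  if command -v "$name" >/dev/null 2>&1; then
    command -v "$name"
  elif [ -x "$local_bin" ]; then
    echo "$local_bin"
  else
    return 1
  fi
}

# Example:
#   find_tool llama-server || echo "llama-server missing: run inferllama setup"
#   find_tool aria2c       || echo "aria2c missing: pulls fall back to direct HTTP"
```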

inferllama search

Search the registry for models by name, architecture, or tag.

bash

inferllama search <query>

bash

inferllama search qwen
inferllama search llama 7b
inferllama search mistral instruct

inferllama rm

Remove a locally downloaded model. Shared chunks (used by other models) are kept; only chunks exclusive to this model are deleted.

bash

inferllama rm <model>
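
The rule for which chunks get deleted can be pictured as "digests in this model's manifest that no other model references". The sketch below shows that rule only. The one-digest-per-line manifest files and the exclusive_chunks helper are assumptions for illustration, not the CLI's real on-disk layout.

```shell
# Print the chunk digests that appear in <target> but in none of the others.
# Usage: exclusive_chunks <target-manifest> [other-manifest...]
# The target is passed to awk last, so every other manifest is read first.
exclusive_chunks() {
  local target="$1"
  shift
  awk 'FILENAME != ARGV[ARGC - 1] { shared[$0] = 1; next }
       !($0 in shared) && !seen[$0]++' "$@" "$target"
}

# Example: digests that would be safe to delete when removing a fine-tune
#   exclusive_chunks finetune.manifest base.manifest other.manifest
```

With no other manifests given, every digest of the target is exclusive, which matches removing the only model on the machine.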

In-session slash commands

These commands are typed inside an active inferllama run session:

Command	Description
/clear	Clear conversation history and start fresh
/bye	End the session and exit
/history	Print the full conversation so far
/model	Show the current model name, context size, temperature, and session token usage

text

> /model
Model:       qwen2-0.5b-instruct
Context:     4096 tokens
Temperature: 0.7
Session:     12 messages, ~830 tokens used

> /clear
Conversation cleared.

> /bye
Goodbye.

Config file

The CLI stores its state in ~/.inferllama/config.json. You can edit it directly or use inferllama login to update it.

json

{
  "server_url": "https://inferllama.com",
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
}

ℹ Note

When running locally for development, replace the server URL with http://localhost:8000. See Self-hosting for local Docker Compose setup.

See Configuration for the full reference.
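
Scripts sometimes need to know which registry the CLI is pointed at. The sketch below reads a top-level string value from the config file with sed. It assumes the flat, one-key-per-line layout shown above and is not a general JSON parser; config_get is a hypothetical helper, and jq is the more robust option where it is available.

```shell
# Print the string value of a top-level key from the config file.
# Usage: config_get <key> [config-file]
# Assumes each "key": "value" pair sits on its own line, as in the example above.
config_get() {
  local key="$1" file="${2:-$HOME/.inferllama/config.json}"
  sed -n 's/.*"'"$key"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$file" |
    head -n 1
}

# Example:
#   config_get server_url
```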
