CLI reference

Complete reference for every inferllama command, flag, and in-session slash command.

Global flags

Flag              Description
--help            Show help for any command
--version         Print the installed CLI version
--server <url>    Override the registry URL for this invocation only

inferllama run

Pull the model if needed, start a local inference server, and open an interactive streaming chat session — all in one command. This is the primary way to use InferLlama.

bash
inferllama run <model> [options]

Flag                     Default   Description
--port <int>             8080      Port for the local llama-server process
--ctx <int>              4096      Context window size in tokens
--system <string>        (none)    System prompt prepended to every conversation
--temperature <float>    0.7       Sampling temperature (0.0 = deterministic, 2.0 = very random)
--max-tokens <int>       (none)    Maximum tokens per response (unlimited by default)
--server-only            false     Start the inference server without opening a chat (useful for integrations)

bash
# Basic interactive chat
inferllama run qwen2-0.5b-instruct

# Custom system prompt and low temperature for coding
inferllama run qwen2-0.5b-instruct \
  --system "You are a concise expert software engineer." \
  --temperature 0.2

# Start inference server only, don't open chat
inferllama run llama3-8b-instruct --server-only --port 9000

Tip

If the model is not yet downloaded, inferllama run automatically pulls it first — no need to run inferllama pull separately.
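
For integrations, the --server-only mode can be driven from a script. The sketch below shows one possible way to launch the server from Python and wait until its port accepts TCP connections. The helper names (server_only_argv, wait_for_port, start_server) are illustrative and not part of InferLlama, and the 60-second timeout is an arbitrary assumption that may be too short when the model still has to be pulled.

```python
import socket
import subprocess
import time


def server_only_argv(model: str, port: int) -> list[str]:
    """Build the command line using only flags documented above."""
    return ["inferllama", "run", model, "--server-only", "--port", str(port)]


def wait_for_port(port: int, host: str = "127.0.0.1", timeout: float = 60.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # not listening yet; retry shortly
    return False


def start_server(model: str, port: int = 9000) -> subprocess.Popen:
    """Start the server process and return it once the port is reachable."""
    proc = subprocess.Popen(server_only_argv(model, port))
    if not wait_for_port(port):
        proc.terminate()
        raise RuntimeError(f"server did not start listening on port {port}")
    return proc


# Usage (requires inferllama on PATH):
# proc = start_server("llama3-8b-instruct", port=9000)
# ... talk to the server, then proc.terminate()
```

An open port only shows that the process is listening; it does not guarantee the model has finished loading.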

inferllama pull

Download a model from the registry to local storage without starting inference. Models are transferred as content-addressed 256 MB chunks. Chunks already present locally are skipped, so a fine-tune that shares a base model downloads only the chunks that differ.

bash
inferllama pull <model>
bash
inferllama pull qwen2-0.5b-instruct
inferllama pull llama3-8b-instruct

Download progress is shown as a bar with transfer speed and estimated time remaining. If the registry exposes a .torrent file for the model, InferLlama tries the torrent path first (via aria2c, if installed) and falls back to direct HTTP if no peers are available.
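
The chunk skipping described at the start of this section can be pictured with a small sketch. It assumes each chunk is identified by its SHA-256 digest and that "256 MB" means 256 × 2^20 bytes. This page does not state the hash function or the exact chunking rules, and the function names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 256 * 1024 * 1024  # assumed byte size of a "256 MB" chunk


def chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return each chunk's SHA-256 hex digest."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def chunks_to_download(manifest: list[str], local_store: set[str]) -> list[str]:
    """Return the digests from the manifest that are not stored locally (no duplicates)."""
    missing: list[str] = []
    seen = set(local_store)
    for digest in manifest:
        if digest not in seen:
            missing.append(digest)
            seen.add(digest)
    return missing


# Toy example with 4-byte chunks: the fine-tune differs only in its last chunk.
base = b"AAAABBBBCCCC"
fine_tune = b"AAAABBBBDDDD"
local = set(chunk_digests(base, chunk_size=4))
needed = chunks_to_download(chunk_digests(fine_tune, chunk_size=4), local)
print(len(needed))  # 1
```

Because a chunk's identifier is derived from its bytes, two models that contain an identical chunk refer to the same stored object.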


inferllama list

List all models stored on the local machine.

bash
inferllama list
text
NAME                      SIZE      MODIFIED
qwen2-0.5b-instruct       954 MB    2 hours ago
llama3-8b-instruct        4.7 GB    yesterday
mistral-7b-instruct       4.1 GB    3 days ago

inferllama serve

Start an OpenAI-compatible inference proxy daemon. The daemon listens on a local port and exposes POST /v1/chat/completions, which works with any OpenAI SDK, LangChain, or plain HTTP client.

bash
inferllama serve [options]

Flag            Default   Description
--port <int>    11434     Port for the daemon to listen on

bash
# Start the daemon (foreground)
inferllama serve

# Verify it is running
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2-0.5b-instruct","messages":[{"role":"user","content":"ping"}]}'

Note

The daemon stays in the foreground. To run it persistently, use a process supervisor such as systemd, launchd, or pm2 (pm2 start "inferllama serve").
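
The same request as the curl check can be sent from Python using only the standard library. This is a minimal sketch: it assumes the daemon is running on the default port 11434 and that the response follows the standard OpenAI chat completion shape (choices[0].message.content), which "OpenAI-compatible" implies but this page does not show. The function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # default daemon port from the table above


def build_chat_request(model: str, prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /v1/chat/completions request with a single user message."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, base_url: str = BASE_URL, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text.

    Assumes the OpenAI response layout; adjust if the daemon differs.
    """
    request = build_chat_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running daemon):
# print(chat("qwen2-0.5b-instruct", "ping"))
```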

inferllama push

Upload a local model file to the registry. The file is split into 256 MB content-addressed chunks. If you already uploaded the base model, fine-tune variants only transfer the new chunks — identical blocks are deduplicated automatically.

bash
inferllama push <file> [options]

Flag                      Description
--name <string>           Registry model name (defaults to filename stem)
--display-name <string>   Human-readable display name
--arch <string>           Architecture (llama, mistral, qwen, phi, gemma, …)
--quant <string>          Quantization (Q4_K_M, Q8_0, F16, …)
--license <string>        License identifier (MIT, Apache-2.0, …)
--tags <list>             Comma-separated tags
--description <string>    Short model description

bash
inferllama push ./my-model-7b-q4_k_m.gguf \
  --name my-model-7b \
  --arch llama \
  --quant Q4_K_M \
  --license MIT

inferllama login

Authenticate with a registry server. The token is saved to ~/.inferllama/config.json.

bash
inferllama login [options]

Flag                   Description
--server <url>         Your registry server URL (e.g. https://inferllama.com)
--username <string>    Username (prompted interactively if omitted)
--password <string>    Password (prompted interactively if omitted)

bash
# Interactive login — replace with your registry URL
inferllama login --server https://inferllama.com

# Non-interactive (CI / automation)
inferllama login --server https://inferllama.com \
  --username ci-user \
  --password "$INFERLLAMA_PASSWORD"
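
In a CI job written in Python, the non-interactive form can be wrapped so that a missing secret fails fast instead of triggering an interactive prompt. The sketch below uses only the flags documented above. The helper names and the choice of the INFERLLAMA_PASSWORD environment variable (taken from the example) are assumptions, not part of the CLI.

```python
import os
import subprocess


def login_argv(server: str, username: str, password: str) -> list[str]:
    """Build the non-interactive login command line."""
    return [
        "inferllama", "login",
        "--server", server,
        "--username", username,
        "--password", password,
    ]


def ci_login(server: str, username: str, env_var: str = "INFERLLAMA_PASSWORD") -> None:
    """Log in using a password read from the environment; raise if it is missing."""
    password = os.environ.get(env_var)
    if not password:
        raise RuntimeError(f"{env_var} is not set; refusing to fall back to a prompt")
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    subprocess.run(login_argv(server, username, password), check=True)


# Usage (requires inferllama on PATH and the secret in the environment):
# ci_login("https://inferllama.com", "ci-user")
```

Passing the password as a command-line argument can expose it to other local users through the process list, so keep this to isolated CI runners.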

inferllama setup

Install the local inference backend and download accelerator. Run once after install; re-run with --force to upgrade.

bash
inferllama setup [--force]

Installs two tools:

  • llama-server (llama.cpp) — runs models locally. macOS: via Homebrew. Linux: prebuilt binary → ~/.inferllama/bin/. Windows: via winget or GitHub Releases binary.
  • aria2c — multi-source download accelerator for 3–5× faster model pulls. Same OS detection as above; failure is non-fatal.

inferllama search

Search the registry for models by name, architecture, or tag.

bash
inferllama search <query>
bash
inferllama search qwen
inferllama search llama 7b
inferllama search mistral instruct

inferllama rm

Remove a locally downloaded model. Shared chunks (used by other models) are kept; only chunks exclusive to this model are deleted.

bash
inferllama rm <model>
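
The rule for shared chunks can be sketched as a set difference: a chunk is deleted only if no other local model still lists it. The manifest layout (model name mapped to a list of chunk IDs) and the function name are hypothetical; the actual on-disk bookkeeping is not documented here.

```python
def chunks_safe_to_delete(manifests: dict[str, list[str]], model: str) -> set[str]:
    """Return chunk IDs used by `model` and by no other locally stored model."""
    still_needed = {
        chunk
        for name, chunks in manifests.items()
        if name != model
        for chunk in chunks
    }
    return set(manifests[model]) - still_needed


# Toy example: the fine-tune shares c1 and c2 with its base model.
manifests = {
    "base-7b": ["c1", "c2", "c3"],
    "base-7b-finetune": ["c1", "c2", "c9"],
}
print(chunks_safe_to_delete(manifests, "base-7b-finetune"))  # {'c9'}
```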

In-session slash commands

These commands are typed inside an active inferllama run session:

Command     Description
/clear      Clear conversation history and start fresh
/bye        End the session and exit
/history    Print the full conversation so far
/model      Show the current model name, context size, temperature, and token usage

text
> /model
Model:       qwen2-0.5b-instruct
Context:     4096 tokens
Temperature: 0.7
Session:     12 messages, ~830 tokens used

> /clear
Conversation cleared.

> /bye
Goodbye.

Config file

The CLI stores its state in ~/.inferllama/config.json. You can edit it directly or use inferllama login to update it.

json
{
  "server_url": "https://inferllama.com",
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
}
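
Scripts that need the registry URL or token can read the same file. A minimal sketch, assuming the file contains exactly the two fields shown above (server_url and token) and that token may be absent before the first login; the helper names are illustrative.

```python
import json
from pathlib import Path
from typing import Optional

CONFIG_PATH = Path.home() / ".inferllama" / "config.json"


def parse_config(text: str) -> tuple[str, Optional[str]]:
    """Return (server_url, token) from the JSON text; token is None if missing."""
    config = json.loads(text)
    return config["server_url"], config.get("token")


def load_config(path: Path = CONFIG_PATH) -> tuple[str, Optional[str]]:
    """Read and parse the config file from disk."""
    return parse_config(path.read_text(encoding="utf-8"))


def redact(token: Optional[str], visible: int = 6) -> str:
    """Shorten a token for logging so the full secret is never printed."""
    if not token:
        return "(not logged in)"
    return token[:visible] + "..."


# Usage (requires an existing config file):
# server_url, token = load_config()
# print(server_url, redact(token))
```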

Note

When running locally for development, replace the server URL with http://localhost:8000. See Self-hosting for local Docker Compose setup.

See Configuration for the full reference.