Quickstart
Get InferLlama installed and running a local AI model in under five minutes.
Prerequisites
- ▸macOS 12+, Windows 10/11, or Linux (Ubuntu 20.04+, Debian 11+, Fedora 38+, Arch, Alpine, and more)
- ▸Python 3.11 or later (check with python3 --version)
- ▸4 GB RAM minimum; 8 GB recommended for 7B-parameter models
- ▸No root / admin access required — everything installs into your user directories
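The Python requirement above can also be checked from a script. This is a minimal sketch that assumes only the 3.11 minimum stated in the list; the helper name meets_python_requirement is made up for illustration and is not part of InferLlama.

```python
import platform
import sys

MINIMUM_PYTHON = (3, 11)  # minimum version from the prerequisites list


def meets_python_requirement(version=None, minimum=MINIMUM_PYTHON):
    """Return True if `version` (default: the running interpreter) is new enough."""
    if version is None:
        version = sys.version_info
    # Compare only (major, minor); patch level does not matter here.
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "ok" if meets_python_requirement() else "too old"
    print(f"Python {platform.python_version()} on {platform.system()}: {status}")
```

Run it with the same python3 you plan to use for the CLI, since a machine can have several interpreters installed.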
Step 1 — Install the CLI
Choose your operating system below. The installer also runs inferllama setup automatically, which installs the inference backend (llama-server) and download accelerator (aria2c).
macOS
Run the bootstrap script:
curl -fsSL https://cli.inferllama.com/install.sh | bash
llama-server and aria2c are installed via Homebrew. If Homebrew is not present, the installer falls back to prebuilt binaries in ~/.inferllama/bin/.
✦ Tip
pip install inferllama-cli && inferllama setup
Linux
All major distributions are supported:
curl -fsSL https://cli.inferllama.com/install.sh | bash
The script auto-detects your package manager (apt-get, dnf, yum, pacman, zypper, apk) and installs dependencies through it. If none are available, prebuilt binaries are downloaded to ~/.inferllama/bin/.
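To make the detection step concrete, here is a rough sketch of how such a check could work. It is not the installer's actual logic: the probing order simply follows the list above, and detect_package_manager is a hypothetical helper name.

```python
import shutil

# Order copied from the list above; the real installer's order is an assumption.
PACKAGE_MANAGERS = ("apt-get", "dnf", "yum", "pacman", "zypper", "apk")


def detect_package_manager(candidates=PACKAGE_MANAGERS, which=shutil.which):
    """Return the first candidate found on PATH, or None if there is none."""
    for name in candidates:
        if which(name) is not None:
            return name
    return None


if __name__ == "__main__":
    found = detect_package_manager()
    if found:
        print(f"Dependencies would be installed with {found}")
    else:
        print("No known package manager; prebuilt binaries would be used instead")
```

The which parameter exists only so the lookup can be swapped out in tests; by default it is the standard library's shutil.which.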
ℹ Note
If your package manager needs sudo, you will be prompted only for that step.
Windows
Open PowerShell (no admin rights needed) and run:
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
.\scripts\install.ps1
The installer adds %USERPROFILE%\.inferllama\bin and Python's Scripts folder to your user PATH automatically. Restart PowerShell after it completes for PATH changes to take effect.
✦ Tip
llama-server and aria2c are installed via winget when available; otherwise they are downloaded as prebuilt binaries directly from GitHub Releases. No admin rights or manual downloads are needed.
Verify
Once installed, confirm the CLI is working:
inferllama --version
If this errors, see the PATH note in the Configuration docs.
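The same check can be run from a script, for example in a provisioning job. The sketch below assumes only that inferllama --version exits with status 0 when the CLI works; the format of its output is not documented here, so the text is returned as-is. The helper name cli_version is hypothetical.

```python
import subprocess
import sys


def cli_version(command="inferllama"):
    """Run `<command> --version`; return its output, or None if it cannot be run."""
    try:
        result = subprocess.run(
            [command, "--version"],
            capture_output=True,
            text=True,
            timeout=10,
            check=True,
        )
    except (OSError, subprocess.SubprocessError):
        # OSError: command not found or not executable.
        # SubprocessError: non-zero exit status or timeout.
        return None
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    version = cli_version()
    if version is None:
        print("inferllama could not be run from this shell; check your PATH", file=sys.stderr)
    else:
        print(version)
```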
Step 2 — Install the inference backend
The bootstrap script calls this automatically, but you can also run it manually at any time — for example to upgrade or reinstall:
inferllama setup
This installs two tools:
- ▸llama-server — the local inference engine (llama.cpp). Runs models entirely on your machine.
- ▸aria2c — multi-source download accelerator. Enables BitTorrent and parallel HTTP downloads so large models pull 3–5× faster.
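To see where the two tools ended up, a script can look on PATH first and then in the fallback directory mentioned in the install notes. This is a sketch under the assumption that the binaries are named llama-server and aria2c (with an .exe suffix on Windows); locate_tool is a hypothetical helper, not an InferLlama command.

```python
import shutil
from pathlib import Path

# Fallback directory named in the install notes (~/.inferllama/bin).
FALLBACK_BIN = Path.home() / ".inferllama" / "bin"


def locate_tool(name, fallback_dir=FALLBACK_BIN):
    """Return the path to `name` from PATH or the fallback directory, else None."""
    on_path = shutil.which(name)
    if on_path:
        return Path(on_path)
    # Not on PATH: look for a prebuilt binary, with or without a Windows suffix.
    for candidate in (fallback_dir / name, fallback_dir / f"{name}.exe"):
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    for tool in ("llama-server", "aria2c"):
        print(f"{tool}: {locate_tool(tool) or 'not found'}")
```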
✦ Tip
inferllama setup --force
Step 3 — Connect to a registry
A registry is the InferLlama server that stores model metadata, files, and user accounts, much as Docker Hub is a registry for container images. Before you can pull or push models, point the CLI at the InferLlama registry and log in with your account:
inferllama login --server https://inferllama.com
You will be prompted for your username and password. The token is saved to ~/.inferllama/config.json and used automatically for all future commands.
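To confirm that login wrote a usable config file, a script can check that the file exists and parses as JSON. The file's schema is not described here, so this sketch assumes only that it is a JSON object and prints key names rather than values, because the file holds your token. The helper names are hypothetical.

```python
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".inferllama" / "config.json"  # location given above


def parse_config(text):
    """Parse config text; return a dict, or None if it is not a JSON object."""
    try:
        data = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    return data if isinstance(data, dict) else None


def load_config(path=CONFIG_PATH):
    """Read and parse the config file; return None if it is missing or unreadable."""
    try:
        return parse_config(Path(path).read_text(encoding="utf-8"))
    except (OSError, UnicodeDecodeError):
        return None


if __name__ == "__main__":
    config = load_config()
    if config is None:
        print("No readable config found; run `inferllama login` first")
    else:
        # Key names only: never echo the stored token.
        print("Config keys:", ", ".join(sorted(config)))
```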
✦ Tip
inferllama login --server https://… --username alice --password hunter2
Step 4 — Run your first model
Pull and launch an interactive chat session in one command. InferLlama downloads the model if it is not already cached locally, starts the inference server in the background, and opens a streaming REPL:
inferllama run qwen2-0.5b-instruct
Loading qwen2-0.5b-instruct-q4_k_m.gguf…
You: Hello! What can you help me with?
Assistant: I can help with writing, coding, analysis, math, research,
and much more. What would you like to work on?
You: /bye
Goodbye.
Useful in-session slash commands:
- ▸/clear: wipe conversation history and start fresh
- ▸/history: print the full conversation so far
- ▸/model: show model name, context size, and token usage
- ▸/bye: end the session
✦ Tip
inferllama pull qwen2-0.5b-instruct
Use the OpenAI-compatible API
Start the local inference daemon, which exposes an OpenAI-compatible API on port 11434:
inferllama serve
Any app that works with OpenAI can now be pointed at http://localhost:11434/v1 with api_key="local" and it will use your local model.
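When the daemon is started from a script, it can help to wait until the port accepts connections before sending requests. The sketch below uses only the standard library and the port number given above. The helper names are hypothetical, and how long the daemon takes to start is an assumption, so adjust the attempts and delay to taste.

```python
import socket
import time


def port_is_open(host="localhost", port=11434, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def wait_for_server(host="localhost", port=11434, attempts=30, delay=1.0):
    """Poll until the port accepts connections; return False if it never does."""
    for _ in range(attempts):
        if port_is_open(host, port):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    ready = wait_for_server(attempts=3, delay=0.5)
    print("API port is reachable" if ready else "Nothing is listening on port 11434 yet")
```

A successful TCP connection only shows that something is listening; it does not prove that a model is loaded, so a real request is still the definitive test.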
curl
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-0.5b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about open source."}],
    "stream": false
  }'
Python (OpenAI SDK)
from openai import OpenAI

# Point the SDK at the local daemon instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="local",  # any non-empty string works
)

resp = client.chat.completions.create(
    model="qwen2-0.5b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
JavaScript / TypeScript
import OpenAI from 'openai'

// Point the SDK at the local daemon instead of api.openai.com.
const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'local',
})

const resp = await client.chat.completions.create({
  model: 'qwen2-0.5b-instruct',
  messages: [{ role: 'user', content: 'Hello!' }],
})
console.log(resp.choices[0].message.content)
Web UI
The web interface is available at inferllama.com.
- ▸Browse and search the model registry
- ▸Chat with any running model in the Playground
- ▸Create and revoke API keys from Settings
Next steps
- ▸CLI reference — every command, flag, and option
- ▸API reference — REST API for programmatic access