Quickstart

Get InferLlama installed and running a local AI model in under five minutes.

Prerequisites

▸macOS 12+, Windows 10/11, or Linux (Ubuntu 20.04+, Debian 11+, Fedora 38+, Arch, Alpine, and more)
▸Python 3.11 or later — check with python3 --version
▸4 GB RAM minimum; 8 GB recommended for 7B-parameter models
▸No root / admin access required — everything installs into your user directories
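The Python version requirement above can also be checked from a script. This is a minimal sketch using only the standard library. The function name is illustrative and not part of InferLlama, and the RAM check is left out because the standard library has no portable way to read installed memory.

```python
import sys

# Minimum Python version stated in the prerequisites above.
MIN_PYTHON = (3, 11)


def meets_python_requirement(version=None):
    """Return True if `version` (default: the running interpreter) is 3.11 or later."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_python_requirement() else "too old, 3.11 or later is required"
    print(f"Python {current}: {status}")
```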

Step 1 — Install the CLI

Choose your operating system below. The installer also runs inferllama setup automatically, which installs the inference backend (llama-server) and download accelerator (aria2c).

macOS

Run the bootstrap script:

bash

curl -fsSL https://cli.inferllama.com/install.sh | bash

llama-server and aria2c are installed via Homebrew. If Homebrew is not present, the installer falls back to prebuilt binaries in ~/.inferllama/bin/.

✦ Tip

Or install the CLI directly: pip install inferllama-cli && inferllama setup

Linux

All major distributions are supported:

bash

curl -fsSL https://cli.inferllama.com/install.sh | bash

The script auto-detects your package manager (apt-get, dnf, yum, pacman, zypper, apk) and installs dependencies through it. If none are available, prebuilt binaries are downloaded to ~/.inferllama/bin/.

ℹ Note

The script itself does not need sudo. If your package manager requires elevated privileges, you are prompted for that step only.

Windows

Open PowerShell (no admin rights needed) in the directory that contains scripts\install.ps1 and run:

powershell

Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
.\scripts\install.ps1

The installer adds %USERPROFILE%\.inferllama\bin and Python's Scripts folder to your user PATH automatically. Restart PowerShell after it completes for PATH changes to take effect.

✦ Tip

llama-server and aria2c are installed via winget when available, otherwise downloaded as prebuilt binaries directly from GitHub Releases — no admin rights, no manual downloads.

Verify

Once installed, confirm the CLI is working:

bash

inferllama --version

If the command fails or is not found, see the PATH note in the Configuration docs.
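The same check can be scripted, for example in a setup test. The sketch below looks up the CLI on PATH and runs the inferllama --version command shown above. The helper name cli_version is an assumption made for illustration, and the format of the version output is not assumed.

```python
import shutil
import subprocess


def cli_version(executable="inferllama"):
    """Return the output of `<executable> --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some tools print the version to stderr, so fall back to it.
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    version = cli_version()
    if version is None:
        print("inferllama was not found on PATH")
    else:
        print(version)
```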

Step 2 — Install the inference backend

The bootstrap script calls this automatically, but you can also run it manually at any time — for example to upgrade or reinstall:

bash

inferllama setup

This installs two tools:

▸llama-server — the local inference engine (llama.cpp). Runs models entirely on your machine.
▸aria2c — multi-source download accelerator. Enables BitTorrent and parallel HTTP downloads so large models pull 3–5× faster.
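To confirm that both tools are reachable after setup, a script can search PATH and then the fallback directory mentioned in the install notes. This is a sketch under the assumption that prebuilt binaries keep their plain names (llama-server, aria2c) inside ~/.inferllama/bin; the helper name locate_tool is illustrative.

```python
import shutil
from pathlib import Path

# Fallback directory named in the install notes above.
FALLBACK_BIN = Path.home() / ".inferllama" / "bin"


def locate_tool(name, fallback_dir=FALLBACK_BIN):
    """Return the path to `name` from PATH or the fallback directory, else None."""
    found = shutil.which(name)
    if found:
        return found
    # shutil.which also accepts an explicit search path.
    return shutil.which(name, path=str(fallback_dir))


if __name__ == "__main__":
    for tool in ("llama-server", "aria2c"):
        print(f"{tool}: {locate_tool(tool) or 'not found'}")
```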

✦ Tip

Force a re-install or upgrade: inferllama setup --force

Step 3 — Connect to a registry

A registry is the InferLlama server that stores model metadata, files, and user accounts — similar to how Docker Hub is a registry for container images. Before you can pull or push models, you need to tell the CLI where your registry lives and log in.

To use the hosted InferLlama registry, run:

bash

inferllama login --server https://inferllama.com

You will be prompted for your username and password. After a successful login, an access token is saved to ~/.inferllama/config.json and used automatically for all future commands.
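A script can check whether a login has already happened by looking for that file. The sketch below makes no assumption about the file's schema, which is not documented here: it only reports the top-level key names and never returns values, since one of them holds the token. The helper name config_summary is illustrative.

```python
import json
from pathlib import Path

# Location given in the text above.
CONFIG_PATH = Path.home() / ".inferllama" / "config.json"


def config_summary(path=CONFIG_PATH):
    """Return the sorted top-level keys of the config file, or None if it is absent.

    No key names are assumed, and no values are returned.
    """
    path = Path(path)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    return sorted(data) if isinstance(data, dict) else []


if __name__ == "__main__":
    keys = config_summary()
    print("not logged in yet" if keys is None else f"config keys: {keys}")
```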

✦ Tip

Skip the prompt by passing credentials inline (useful for CI):
inferllama login --server https://… --username alice --password hunter2
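In CI, credentials usually come from secret environment variables rather than from text typed into a script. The sketch below builds the login command from two variables. The names INFERLLAMA_USERNAME and INFERLLAMA_PASSWORD are hypothetical choices for this example, and nothing here assumes the CLI reads them itself; only the --server, --username, and --password flags come from the command above.

```python
import os


def build_login_command(server, env=None):
    """Build the `inferllama login` argument list from environment variables.

    INFERLLAMA_USERNAME and INFERLLAMA_PASSWORD are hypothetical variable
    names chosen for this sketch.
    """
    env = os.environ if env is None else env
    try:
        username = env["INFERLLAMA_USERNAME"]
        password = env["INFERLLAMA_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"missing environment variable: {missing.args[0]}")
    return [
        "inferllama", "login",
        "--server", server,
        "--username", username,
        "--password", password,
    ]
```

Pass the result to subprocess.run(command, check=True) to perform the login. Passing a list rather than a shell string avoids quoting problems with special characters in the password. Command-line arguments can still be visible to other processes on the same machine, so this is best kept to isolated CI runners.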

Step 4 — Run your first model

Pull and launch an interactive chat session in one command. InferLlama downloads the model if it is not already cached locally, starts the inference server in the background, and opens a streaming REPL:

bash

inferllama run qwen2-0.5b-instruct

text

Loading qwen2-0.5b-instruct-q4_k_m.gguf…

You: Hello! What can you help me with?
Assistant: I can help with writing, coding, analysis, math, research,
and much more. What would you like to work on?

You: /bye
Goodbye.

Useful in-session slash commands:

▸/clear — wipe conversation history and start fresh
▸/history — print the full conversation so far
▸/model — show model name, context size, and token usage
▸/bye — end the session

✦ Tip

To download only (no chat): inferllama pull qwen2-0.5b-instruct

Use the OpenAI-compatible API

Start the local inference daemon, which exposes an OpenAI-compatible API on port 11434:

bash

inferllama serve

Any application that uses the OpenAI API can now be pointed at http://localhost:11434/v1 with api_key="local" to use your local model.
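Scripts that start inferllama serve and then send requests may need to wait until the port accepts connections. The sketch below polls port 11434 with a plain TCP connect. The helper name wait_for_port is illustrative, and a successful connection only shows that the port is open, not that a model has finished loading.

```python
import socket
import time


def wait_for_port(host="localhost", port=11434, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    ready = wait_for_port(timeout=5.0)
    print("server is accepting connections" if ready else "server not reachable")
```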

curl

bash

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-0.5b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about open source."}],
    "stream": false
  }'

Python (OpenAI SDK)

python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="local",   # any non-empty string works
)

resp = client.chat.completions.create(
    model="qwen2-0.5b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
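If installing the OpenAI SDK is not an option, the same request can be sent with the standard library alone. This sketch mirrors the curl example: same endpoint, same JSON body, and the same response path (choices[0].message.content) used in the SDK example. The helper names build_chat_request and chat are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # endpoint from the text above


def build_chat_request(prompt, model="qwen2-0.5b-instruct", base_url=BASE_URL):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With inferllama serve running, print(chat("Hello!")) prints the model's reply.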

JavaScript / TypeScript

typescript

import OpenAI from 'openai'

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'local',
})

const resp = await client.chat.completions.create({
  model: 'qwen2-0.5b-instruct',
  messages: [{ role: 'user', content: 'Hello!' }],
})
console.log(resp.choices[0].message.content)

Web UI

The web interface is available at inferllama.com. From there you can:

▸Browse and search the model registry
▸Chat with any running model in the Playground
▸Create and revoke API keys from Settings

Next steps

▸CLI reference — every command, flag, and option
▸API reference — REST API for programmatic access
