API reference

REST API for browsing and downloading models from the InferLlama registry.

Base URL

All API paths are relative to the registry base URL:

Environment           Base URL
InferLlama cloud      https://api.inferllama.com
Self-hosted (local)   http://localhost:8000

All paths begin with /v1/.

Authentication

Pass a Bearer token in the Authorization header on every authenticated request. Obtain a session token by logging in, or create a long-lived API key from Settings.

bash
curl https://api.inferllama.com/v1/models \
  -H "Authorization: Bearer <your-token>"
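
The same request can be prepared from Python. This is a minimal sketch using only the standard library; the helper name build_request and the INFERLLAMA_TOKEN environment variable are assumptions made for illustration and are not part of the InferLlama API.

```python
import os
import urllib.request
from typing import Optional

BASE_URL = "https://api.inferllama.com"  # use http://localhost:8000 when self-hosted


def build_request(path: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for a registry path, adding the Bearer header if a token is given."""
    req = urllib.request.Request(BASE_URL + path, method="GET")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req


# INFERLLAMA_TOKEN is a hypothetical variable name; store the token however you prefer.
request = build_request("/v1/models", os.environ.get("INFERLLAMA_TOKEN"))
# To send it: urllib.request.urlopen(request)
```

Keeping the token out of the source file and reading it from the environment suits the script and CI use case.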

Tip

For scripts and CI pipelines, use a long-lived API key rather than a session token. Create one from the Settings page in the web UI.

Errors

All errors return a JSON body with a detail field:

json
{ "detail": "Model not found" }
Status  Meaning
400     Bad request: missing or invalid field
401     Unauthorized: missing or expired token
403     Forbidden: authenticated but not permitted
404     Not found
409     Conflict, e.g. username already taken
422     Validation error: request body schema mismatch
429     Rate limited: retry after the Retry-After header value
500     Internal server error
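
A client might turn these responses into a readable message and a retry delay. This is a sketch only. The function name interpret_error is hypothetical, and it assumes Retry-After carries a whole number of seconds (the HTTP date form is ignored).

```python
import json
from typing import Mapping, Optional, Tuple


def interpret_error(
    status: int, body: bytes, headers: Mapping[str, str]
) -> Tuple[str, Optional[float]]:
    """Return (message, retry_after_seconds) for an error response."""
    try:
        # Documented shape: {"detail": "..."}
        detail = json.loads(body).get("detail", "")
    except (ValueError, AttributeError):
        # Body was not a JSON object; fall back to the raw text.
        detail = body.decode("utf-8", errors="replace")

    retry_after = None
    if status == 429:
        value = headers.get("Retry-After")
        # Assumption: the header holds seconds, not an HTTP date.
        if value is not None and value.strip().isdigit():
            retry_after = float(value)
    return f"{status}: {detail}", retry_after
```

A caller would sleep for the returned delay before retrying a 429 and treat other statuses as final. Real HTTP header objects are usually case-insensitive; a plain dict, as used for testing, is not.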

Models

List models

GET /v1/models

Query param    Type    Description
q              string  Full-text search query
architecture   string  Filter by architecture (llama, mistral, qwen, …)
limit          int     Max results (default 20, max 200)
offset         int     Pagination offset
sort           string  Sort by: downloads (default), likes, created_at
bash
curl "https://api.inferllama.com/v1/models?q=qwen&limit=10"
json
// Response 200
[
  {
    "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
    "name": "qwen2-0.5b-instruct",
    "display_name": "Qwen2 0.5B Instruct",
    "description": "A compact yet capable instruction-following model.",
    "license": "Apache-2.0",
    "architecture": "qwen",
    "parameter_count": 500000000,
    "context_length": 32768,
    "tags": ["chat", "instruct"],
    "downloads": 1042,
    "likes": 87,
    "is_public": true
  }
]
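
The limit and offset parameters can be combined to page through results. The sketch below only builds the request URLs; the helper names list_models_url and page_urls are hypothetical, and the validation simply mirrors the limits in the parameter table.

```python
from typing import List, Optional
from urllib.parse import urlencode

BASE_URL = "https://api.inferllama.com"
SORT_KEYS = {"downloads", "likes", "created_at"}


def list_models_url(
    q: Optional[str] = None,
    architecture: Optional[str] = None,
    limit: int = 20,
    offset: int = 0,
    sort: str = "downloads",
) -> str:
    """Build a GET /v1/models URL from the documented query parameters."""
    if limit > 200:
        raise ValueError("limit may not exceed 200")
    if sort not in SORT_KEYS:
        raise ValueError(f"sort must be one of {sorted(SORT_KEYS)}")
    params = {"q": q, "architecture": architecture,
              "limit": limit, "offset": offset, "sort": sort}
    # Drop unset filters so they are not sent as the string "None".
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}/v1/models?{query}"


def page_urls(pages: int, page_size: int = 50, **filters) -> List[str]:
    """URLs for the first `pages` pages, advancing offset by page_size each time."""
    return [list_models_url(limit=page_size, offset=i * page_size, **filters)
            for i in range(pages)]
```

In practice a client would keep requesting pages until one comes back with fewer than page_size items.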

Get model

GET /v1/models/{id}

Fetch full metadata for a single model by ID or slug name.

bash
curl https://api.inferllama.com/v1/models/qwen2-0.5b-instruct

Like / unlike

POST /v1/models/{id}/like
DELETE /v1/models/{id}/like

POST adds a like to a model and DELETE removes it. Both require authentication and return 204 No Content.

List model files

GET /v1/models/{id}/files

Returns all GGUF files available for a model, including quantization and size.

json
// Response 200
[
  {
    "id": "…",
    "model_id": "…",
    "filename": "qwen2-0.5b-instruct-q4_k_m.gguf",
    "quantization": "Q4_K_M",
    "file_size_bytes": 954000000,
    "sha256_hash": "a1b2c3d4…",
    "format": "gguf",
    "status": "ready",
    "torrent_info_hash": "deadbeef…"
  }
]

Downloads

Get download URL

GET /v1/files/{model_id}/download/{file_id}

Returns a presigned download URL (valid for 24 hours), a torrent URL and magnet URI for BitTorrent, and the chunk manifest used by the CLI for fast parallel downloads. Each call also increments the download counter.

json
// Response 200
{
  "direct_url": "https://r2.inferllama.com/models/…?X-Amz-Expires=86400&…",
  "torrent_url": "https://r2.inferllama.com/torrents/….torrent",
  "magnet_uri": "magnet:?xt=urn:btih:deadbeef…&dn=qwen2-0.5b…&ws=…",
  "torrent_info_hash": "deadbeef1234…",
  "chunk_hashes": [
    { "index": 0, "hash": "a1b2c3…", "size": 268435456 },
    { "index": 1, "hash": "d4e5f6…", "size": 268435456 }
  ],
  "file_size_bytes": 954000000,
  "sha256_hash": "a1b2c3d4…"
}

Tip

Use magnet_uri with any BitTorrent client (aria2c, qBittorrent, Transmission) for faster multi-source downloads. The web seed in the magnet URI falls back to HTTP if no peers are available.
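
Whichever transport is used, the finished file can be checked against file_size_bytes and sha256_hash from the response. This is a minimal sketch, assuming sha256_hash is the hex-encoded SHA-256 digest of the whole file; the function names are hypothetical.

```python
import hashlib
import os


def sha256_of_file(path: str, block_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size blocks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_download(path: str, expected_sha256: str, expected_size: int) -> bool:
    """Return True only if both the size and the SHA-256 digest match."""
    # The size check is cheap, so do it before hashing.
    if os.path.getsize(path) != expected_size:
        return False
    # Assumption: the API reports the digest as hex; compare case-insensitively.
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A caller would pass the sha256_hash and file_size_bytes values from the download response, then delete or re-fetch the file if the check fails.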

Download chunk

GET /v1/files/chunks/{chunk_hash}

Redirects (302) to a presigned chunk URL. Used by the CLI for parallel chunk downloads and as a BitTorrent HTTP web seed.
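
To fetch chunks in parallel, a client needs each chunk's URL and its position in the final file. The sketch below derives both from a chunk_hashes manifest. It assumes chunks are contiguous and laid out in index order, which the reference does not state explicitly, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def chunk_ranges(chunk_hashes: List[Dict]) -> List[Tuple[str, int, int]]:
    """Return (hash, start_offset, size) for each chunk, ordered by index."""
    ranges = []
    offset = 0
    for chunk in sorted(chunk_hashes, key=lambda c: c["index"]):
        ranges.append((chunk["hash"], offset, chunk["size"]))
        # Assumption: the next chunk starts right where this one ends.
        offset += chunk["size"]
    return ranges


def chunk_url(base_url: str, chunk_hash: str) -> str:
    """Registry URL for one chunk; the server answers with a 302 to a presigned URL."""
    return f"{base_url}/v1/files/chunks/{chunk_hash}"
```

Each worker can then request chunk_url(...), follow the redirect, and write the bytes at start_offset in a preallocated output file.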


Inference (OpenAI-compatible)

When inferllama serve is running locally, it exposes an OpenAI-compatible API on port 11434. Point any OpenAI SDK at http://localhost:11434/v1 with any non-empty API key.

Note

The local inference endpoint is separate from the registry API. The registry handles model metadata and downloads; the inference daemon handles running models on your machine.

Chat completions

POST /v1/chat/completions
json
// Request (to http://localhost:11434/v1/chat/completions)
{
  "model": "qwen2-0.5b-instruct",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user",   "content": "Explain transformers in simple terms." }
  ],
  "stream": true,
  "temperature": 0.7,
  "max_tokens": 1024
}

When stream: true is set, the response is a stream of server-sent events (SSE) of the form data: …, terminated by data: [DONE]. This is identical to the OpenAI streaming format.
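
For clients that do not use an SDK, the stream can be decoded by hand. This is a rough sketch that assumes each event is a single data: line holding a JSON chunk in the OpenAI shape (choices[].delta.content); the function names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each data: line until data: [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and any non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        yield json.loads(data)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the text deltas from a streamed chat completion."""
    parts = []
    for payload in iter_sse_payloads(lines):
        for choice in payload.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)
```

The lines argument can be any iterable of decoded text lines, such as the lines of an HTTP response body read incrementally.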

python
from openai import OpenAI

# The local daemon accepts any non-empty API key.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

# Stream the reply and print each text delta as it arrives.
for chunk in client.chat.completions.create(
    model="qwen2-0.5b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="", flush=True)

List loaded models

GET /v1/models

Returns models currently loaded in the inference daemon and ready to serve requests.