API reference
REST API for browsing and downloading models from the InferLlama registry.
Base URL
All API paths are relative to the registry base URL:
| Environment | Base URL |
|---|---|
| InferLlama cloud | https://api.inferllama.com |
| Self-hosted (local) | http://localhost:8000 |
All paths begin with /v1/.
Authentication
Pass a Bearer token in the Authorization header on every authenticated request. Obtain a session token by logging in, or create a long-lived API key from Settings.
curl https://api.inferllama.com/v1/models \
  -H "Authorization: Bearer <your-token>"
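The same header can be attached from Python's standard library. This is a minimal sketch: `build_request` is a hypothetical helper written for illustration, not part of any InferLlama SDK, and the token is a placeholder.

```python
import urllib.request

BASE_URL = "https://api.inferllama.com"  # cloud base URL from the table above


def build_request(path: str, token: str) -> urllib.request.Request:
    """Return a GET request for `path` carrying the Bearer token."""
    # Hypothetical helper: it only prepares the request, it does not send it.
    return urllib.request.Request(
        BASE_URL + path,
        headers={"Authorization": f"Bearer {token}"},
    )


req = build_request("/v1/models", "<your-token>")
# Sending it would be: urllib.request.urlopen(req)
```

For a self-hosted registry, `BASE_URL` would be swapped for the local base URL.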
Errors
All errors return a JSON body with a detail field:
{ "detail": "Model not found" }
| Status | Meaning |
|---|---|
| 400 | Bad request — missing or invalid field |
| 401 | Unauthorized — missing or expired token |
| 403 | Forbidden — authenticated but not permitted |
| 404 | Not found |
| 409 | Conflict — e.g. username already taken |
| 422 | Validation error — request body schema mismatch |
| 429 | Rate limited — retry after the Retry-After header value |
| 500 | Internal server error |
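One way a client might act on these responses is sketched below. `ApiError`, `parse_error`, and `retry_delay` are hypothetical names, and the sketch assumes `Retry-After` carries a number of seconds; if the header holds an HTTP date instead, it falls back to a default delay.

```python
import json


class ApiError(Exception):
    """Hypothetical exception type carrying the status and `detail` text."""

    def __init__(self, status: int, detail: str):
        super().__init__(f"{status}: {detail}")
        self.status = status
        self.detail = detail


def parse_error(status: int, body: bytes) -> ApiError:
    """Build an ApiError from a status code and a JSON error body."""
    try:
        detail = json.loads(body).get("detail", "")
    except (ValueError, AttributeError):
        # Body was not a JSON object; keep the raw text instead.
        detail = body.decode("utf-8", errors="replace")
    return ApiError(status, str(detail))


def retry_delay(status: int, headers: dict, default: float = 1.0):
    """Return seconds to wait before retrying, or None if not retryable."""
    if status != 429:
        return None
    try:
        # Assumption: Retry-After is expressed in seconds.
        return float(headers.get("Retry-After", default))
    except ValueError:
        return default
```

Only 429 is treated as retryable here; whether other statuses (such as 500) are worth retrying is a client policy decision that this reference does not prescribe.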
Models
List models
/v1/models
| Query param | Type | Description |
|---|---|---|
| q | string | Full-text search query |
| architecture | string | Filter by architecture (llama, mistral, qwen, …) |
| limit | int | Max results (default 20, max 200) |
| offset | int | Pagination offset |
| sort | string | Sort by: downloads (default), likes, created_at |
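The query parameters above can be assembled programmatically. The sketch below builds list URLs and walks pages with `limit` and `offset`; `models_url` and `page_urls` are hypothetical helpers, and `total` is a caller-supplied upper bound, since the response shown in this reference does not include a total count.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.inferllama.com"
MAX_LIMIT = 200  # documented maximum for `limit`


def models_url(q=None, architecture=None, limit=20, offset=0, sort="downloads"):
    """Build a /v1/models URL, dropping unset filters and clamping `limit`."""
    params = {
        "q": q,
        "architecture": architecture,
        "limit": max(1, min(limit, MAX_LIMIT)),
        "offset": offset,
        "sort": sort,
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}/v1/models?{query}"


def page_urls(total, limit=20, **filters):
    """Yield one URL per page; `total` is a caller-supplied upper bound."""
    for offset in range(0, total, limit):
        yield models_url(limit=limit, offset=offset, **filters)
```

In practice a client would more likely keep requesting pages until one comes back with fewer than `limit` items, rather than rely on a known total.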
curl "https://api.inferllama.com/v1/models?q=qwen&limit=10"
// Response 200
[
  {
    "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
    "name": "qwen2-0.5b-instruct",
    "display_name": "Qwen2 0.5B Instruct",
    "description": "A compact yet capable instruction-following model.",
    "license": "Apache-2.0",
    "architecture": "qwen",
    "parameter_count": 500000000,
    "context_length": 32768,
    "tags": ["chat", "instruct"],
    "downloads": 1042,
    "likes": 87,
    "is_public": true
  }
]
Get model
/v1/models/{id}
Fetch full metadata for a single model by ID or slug name.
curl https://api.inferllama.com/v1/models/qwen2-0.5b-instruct
Like / unlike
/v1/models/{id}/like
Toggle a like on a model. Requires authentication. Returns 204 No Content.
List model files
/v1/models/{id}/files
Returns all GGUF files available for a model, including quantization and size.
// Response 200
[
  {
    "id": "…",
    "model_id": "…",
    "filename": "qwen2-0.5b-instruct-q4_k_m.gguf",
    "quantization": "Q4_K_M",
    "file_size_bytes": 954000000,
    "sha256_hash": "a1b2c3d4…",
    "format": "gguf",
    "status": "ready",
    "torrent_info_hash": "deadbeef…"
  }
]
Downloads
Get download URL
/v1/files/{model_id}/download/{file_id}
Returns a presigned download URL (valid 24 h), a magnet URI for BitTorrent, and the chunk manifest used by the CLI for fast parallel downloads. Also increments the download counter.
// Response 200
{
  "direct_url": "https://r2.inferllama.com/models/…?X-Amz-Expires=86400&…",
  "torrent_url": "https://r2.inferllama.com/torrents/….torrent",
  "magnet_uri": "magnet:?xt=urn:btih:deadbeef…&dn=qwen2-0.5b…&ws=…",
  "torrent_info_hash": "deadbeef1234…",
  "chunk_hashes": [
    { "index": 0, "hash": "a1b2c3…", "size": 268435456 },
    { "index": 1, "hash": "d4e5f6…", "size": 268435456 }
  ],
  "file_size_bytes": 954000000,
  "sha256_hash": "a1b2c3d4…"
}
✦ Tip
Use magnet_uri with any BitTorrent client (aria2c, qBittorrent, Transmission) for faster multi-source downloads. The web seed in the magnet URI falls back to HTTP if no peers are available.
Download chunk
/v1/files/chunks/{chunk_hash}
Redirects (302) to a presigned chunk URL. Used by the CLI for parallel chunk downloads and as a BitTorrent HTTP web seed.
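A client doing its own parallel download needs to know where each chunk belongs in the final file and whether the bytes it received are intact. The sketch below shows one way to do that, under two assumptions this reference does not confirm: chunks are laid out contiguously in `index` order, and each `hash` is a hex SHA-256 digest of the chunk bytes. `chunk_ranges` and `chunk_is_valid` are hypothetical helper names.

```python
import hashlib


def chunk_ranges(chunk_hashes):
    """Map each manifest entry to its (start, end) byte range, end exclusive.

    Assumes chunks are stored back to back in `index` order.
    """
    ranges = {}
    start = 0
    for entry in sorted(chunk_hashes, key=lambda e: e["index"]):
        ranges[entry["index"]] = (start, start + entry["size"])
        start += entry["size"]
    return ranges


def chunk_is_valid(data: bytes, entry: dict) -> bool:
    """Check a downloaded chunk against its manifest entry.

    Assumes `hash` is a hex SHA-256 digest; the algorithm is not
    named in this reference.
    """
    if len(data) != entry["size"]:
        return False
    return hashlib.sha256(data).hexdigest() == entry["hash"]
```

Once all chunks are written at their offsets, the assembled file can be checked the same way against the top-level `sha256_hash` and `file_size_bytes` fields.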
Inference (OpenAI-compatible)
When inferllama serve is running locally, it exposes an OpenAI-compatible API on port 11434. Point any OpenAI SDK at http://localhost:11434/v1 with any non-empty API key.
Chat completions
/v1/chat/completions
// Request (to http://localhost:11434/v1/chat/completions)
{
  "model": "qwen2-0.5b-instruct",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Explain transformers in simple terms." }
  ],
  "stream": true,
  "temperature": 0.7,
  "max_tokens": 1024
}
When stream: true, the response is a stream of data: … SSE events terminated by data: [DONE], identical to the OpenAI streaming format.
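For clients that do not use an OpenAI SDK, the stream can be read line by line. The following is a minimal sketch of that parsing: `iter_sse_content` is a hypothetical helper, and the `sample` lines are made up for illustration, shaped like OpenAI streaming chunks (`choices[].delta.content`).

```python
import json


def iter_sse_content(lines):
    """Yield text deltas from an iterable of decoded SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Made-up sample events, not captured server output.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))  # prints "Hello"
```

The same generator could be fed from any HTTP client that exposes the response body as an iterator of text lines.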
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")
# Stream the reply and print each piece of text as it arrives.
for chunk in client.chat.completions.create(
    model="qwen2-0.5b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="", flush=True)
List loaded models
/v1/models
Returns models currently loaded in the inference daemon and ready to serve requests.