# Llamafile

Run inference through a local `llamafile` binary's HTTP server.

The provider spawns the binary as a subprocess listening on `--host`/`--port` (in llamafile 0.10+, server mode is implicit when a port is given) and passes `--no-webui` to suppress the web UI. It then polls `GET /health` for readiness and issues `POST /v1/chat/completions` calls. Output is normalized to the same shape that `HuggingFaceProvider.generate_chat` returns, so guardrails are provider-agnostic.
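The readiness poll and the chat call described above can be sketched with the standard library alone. This is a minimal illustration, not the provider's actual code: the helper names `wait_until_ready` and `build_chat_request`, the polling interval, and the bare `{"messages": ...}` payload are all assumptions made for the example.

```python
import json
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, timeout: float = 120.0, interval: float = 0.5) -> bool:
    """Poll GET {base_url}/health until it answers HTTP 200 or the timeout elapses.

    Hypothetical helper; the real provider's polling logic may differ.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=interval + 1.0) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused or a non-2xx status: the server is not ready yet.
            pass
        time.sleep(interval)
    return False


def build_chat_request(base_url: str, messages: list) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/chat/completions request.

    The payload shape is an assumption based on the OpenAI-style chat format.
    """
    body = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Once `wait_until_ready` returns `True`, passing the built request to `urllib.request.urlopen` with a per-request timeout would issue the call.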

The provider implements the context manager protocol for deterministic cleanup of the spawned subprocess:

```python
with LlamafileProvider() as provider:
    guardrail = GraniteGuardian(
        criteria=GraniteGuardianRisk.HARM, provider=provider
    )
    result = guardrail.validate("hello")
# subprocess is terminated here, even if validate() raised.
```

Outside a `with` block the provider still cleans up via `atexit` on interpreter exit, so notebook and REPL usage works without explicit teardown. Call `provider.close()` directly to release the port early.
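The cleanup behavior described here (context manager, `atexit` fallback, explicit `close()`) follows a common pattern for owning a child process. The sketch below shows that pattern in isolation; `ManagedProcess` is a hypothetical class written for this example, and the 5-second grace period is an assumption, not a documented value.

```python
import atexit
import subprocess


class ManagedProcess:
    """Hypothetical helper: owns one child process and ensures it is terminated."""

    def __init__(self, argv):
        self._proc = subprocess.Popen(argv)
        # Fallback for REPL/notebook use, where no `with` block is present.
        atexit.register(self.close)

    def close(self):
        """Terminate the child process. Safe to call more than once."""
        proc, self._proc = self._proc, None
        if proc is None:
            return  # Already closed.
        proc.terminate()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()  # Escalate if the child ignores SIGTERM.
            proc.wait()
        # Nothing left to clean up, so drop the interpreter-exit hook.
        atexit.unregister(self.close)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs on normal exit and when the body raised.
        self.close()
```

Used as `with ManagedProcess([...]) as mp:`, the child is terminated when the block exits; without `with`, the `atexit` hook covers interpreter shutdown.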

Constructor arguments:

- `binary_path`: Path to a pre-downloaded `.llamafile`. If omitted, the artifact is auto-downloaded: first by trying `repo_id`/`filename` if both were supplied, otherwise by looking up the `model_id` passed to `load_model` in the curated `any_guardrail.providers._llamafile_artifacts.LLAMAFILE_ARTIFACTS` map. Mutually exclusive with `base_url`.
- `repo_id`: Power-user override for the HuggingFace repo containing the llamafile. Used together with `filename`. Mutually exclusive with `base_url`.
- `filename`: Power-user override for the artifact filename inside `repo_id`. Used together with `repo_id`. Mutually exclusive with `base_url`.
- `base_url`: External-server mode. Point at a llamafile server you spun up yourself (e.g. `"http://localhost:9999"`). When set, the provider skips download and subprocess spawn entirely; `load_model` only polls the server for readiness, and `close()` is a no-op. Mutually exclusive with `binary_path`, `repo_id`/`filename`, `port`, `n_gpu_layers`, `context_size`, and `extra_args`. Must start with `http://` or `https://`.
- `port`: TCP port to bind the llamafile HTTP server. Defaults to a kernel-chosen free port. Mutually exclusive with `base_url`.
- `host`: Bind address. Defaults to `"127.0.0.1"`.
- `startup_timeout`: Seconds to wait for the server to become ready. Llamafiles can take about 30 s to memory-map and warm up; the default is generous. Also applies to external-server readiness polling.
- `request_timeout`: Per-request timeout for `/v1/chat/completions`.
- `cache_dir`: Directory passed to `hf_hub_download` for auto-downloaded binaries.
- `n_gpu_layers`: Optional number of model layers to offload to GPU. Passed as `--n-gpu-layers`. `None` (default) lets llamafile decide. Mutually exclusive with `base_url`.
- `context_size`: Optional context window size. Passed as `--ctx-size`. Mutually exclusive with `base_url`.
- `extra_args`: Optional list of additional command-line arguments appended after the standard server flags. Use this for advanced llamafile flags not surfaced above. Mutually exclusive with `base_url`.
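The exclusivity rules around `base_url` can be easier to follow as code. The function below is a hypothetical sketch of such a check, written only from the argument descriptions on this page; the function name, the returned mode strings, and the use of `ValueError` are assumptions, and the provider's real validation may behave differently.

```python
from typing import Optional, Sequence


def validate_llamafile_args(
    binary_path: Optional[str] = None,
    repo_id: Optional[str] = None,
    filename: Optional[str] = None,
    base_url: Optional[str] = None,
    port: Optional[int] = None,
    n_gpu_layers: Optional[int] = None,
    context_size: Optional[int] = None,
    extra_args: Optional[Sequence[str]] = None,
) -> str:
    """Return "external" or "spawn" after applying the documented rules.

    Hypothetical helper for illustration only.
    """
    if base_url is None:
        return "spawn"  # The provider downloads/locates the binary and spawns it.
    if not base_url.startswith(("http://", "https://")):
        raise ValueError("base_url must start with http:// or https://")
    # Arguments that only make sense when the provider spawns the subprocess.
    spawn_only = {
        "binary_path": binary_path,
        "repo_id": repo_id,
        "filename": filename,
        "port": port,
        "n_gpu_layers": n_gpu_layers,
        "context_size": context_size,
        "extra_args": extra_args,
    }
    conflicts = sorted(name for name, value in spawn_only.items() if value is not None)
    if conflicts:
        raise ValueError("base_url is mutually exclusive with: " + ", ".join(conflicts))
    return "external"
```

For example, `validate_llamafile_args(base_url="http://localhost:9999", port=8080)` raises in this sketch because `port` only applies when the provider binds its own server.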

## Constructor

| Parameter         | Type                | Required | Default       |
| ----------------- | ------------------- | -------- | ------------- |
| `binary_path`     | `str \| None`       | No       | `None`        |
| `repo_id`         | `str \| None`       | No       | `None`        |
| `filename`        | `str \| None`       | No       | `None`        |
| `base_url`        | `str \| None`       | No       | `None`        |
| `port`            | `int \| None`       | No       | `None`        |
| `host`            | `str`               | No       | `"127.0.0.1"` |
| `startup_timeout` | `float`             | No       | `120.0`       |
| `request_timeout` | `float`             | No       | `120.0`       |
| `cache_dir`       | `str \| None`       | No       | `None`        |
| `n_gpu_layers`    | `int \| None`       | No       | `None`        |
| `context_size`    | `int \| None`       | No       | `None`        |
| `extra_args`      | `list[str] \| None` | No       | `None`        |

Initialize the llamafile provider.

## load\_model

Resolve the llamafile binary for `model_id` and start its HTTP server.

If the port is auto-picked and the subprocess fails to come up (e.g. another process grabbed the port between the `_free_port()` probe and the binary's `bind()`), `load_model` retries up to `_BIND_RACE_RETRIES` times with a fresh port. When the caller pinned a port via the `port=` constructor argument, there is no retry: the failure is surfaced immediately.
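The probe-then-bind race and the retry policy can be illustrated as follows. This is a sketch under stated assumptions: `free_port` and `start_with_retry` are hypothetical names, the `start` callback stands in for spawning the binary, startup failure is modeled as `OSError`, and the default of 3 retries is a placeholder because the actual value of `_BIND_RACE_RETRIES` is not given on this page.

```python
import socket
from typing import Callable, Optional


def free_port(host: str = "127.0.0.1") -> int:
    """Ask the kernel for a currently unused TCP port by binding to port 0.

    The port is released on return, so another process can still grab it
    before the caller binds: that window is the race being retried below.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def start_with_retry(
    start: Callable[[int], None],
    port: Optional[int] = None,
    retries: int = 3,  # Placeholder value; stands in for _BIND_RACE_RETRIES.
) -> int:
    """Start a server via `start(port)` and return the port that worked."""
    if port is not None:
        # Caller pinned the port: no retry, let any failure propagate.
        start(port)
        return port
    last_error: Optional[OSError] = None
    for _ in range(retries + 1):  # One initial attempt plus `retries` retries.
        candidate = free_port()
        try:
            start(candidate)
            return candidate
        except OSError as err:
            last_error = err  # Probably lost the race; try a fresh port.
    raise RuntimeError("server failed to start on an auto-picked port") from last_error
```

Keeping the pinned-port path free of retries matches the behavior described above: a caller who asked for a specific port gets the original error rather than a silently different port.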

In external-server mode (`base_url` supplied to the constructor), the binary lookup and subprocess spawn are skipped — the provider only polls the user's server for readiness.

**Parameters**

| Parameter  | Type  | Required | Default |
| ---------- | ----- | -------- | ------- |
| `model_id` | `str` | Yes      | —       |

**Returns:** `None`

## pre\_process

Not supported — llamafile is a chat-style backend.

Use `generate_chat` instead. Decoder-LLM guardrails like `GraniteGuardian` route through `generate_chat` automatically.

**Returns:** `GuardrailPreprocessOutput[AnyDict]`

## infer

Not supported — llamafile is a chat-style backend.

Use `generate_chat` instead.

**Parameters**

| Parameter      | Type                                 | Required | Default |
| -------------- | ------------------------------------ | -------- | ------- |
| `model_inputs` | `GuardrailPreprocessOutput[AnyDict]` | Yes      | —       |

**Returns:** `GuardrailInferenceOutput[AnyDict]`

## close

Terminate the llamafile subprocess. Idempotent.

In external-server mode there is no subprocess to terminate and `self.base_url` is preserved so the provider stays reusable.

**Returns:** `None`

