Core LlamaCPP

`serapeum.llama_cpp.llm`

LlamaCPP provider — local GGUF inference via llama-cpp-python.

Contains the `LlamaCPP` class, a concrete `serapeum.core.llms.LLM` implementation that runs quantised GGUF models on-device using the llama-cpp-python backend (https://github.com/abetlen/llama-cpp-python).

Key capabilities:

- Model sources: local path, direct URL download, or HuggingFace Hub.
- Prompt formatters: pluggable `messages_to_prompt` / `completion_to_prompt` per model family (Llama 2, Llama 3, …).
- GPU offloading: `n_gpu_layers` controls layer offloading via cuBLAS / Metal / Vulkan.
- Model caching: a module-level `weakref.WeakValueDictionary` reuses loaded `Llama` instances across `LlamaCPP` objects with identical model path and kwargs.
- Async-safe: `LlamaCPP.acomplete` offloads CPU-bound inference to a thread pool so the event loop is never blocked.

See Also

- serapeum.llama_cpp.formatters: Ready-made prompt formatters.
- serapeum.llama_cpp.utils: Internal download helpers.

`LlamaCPP`

Bases: Retry, CompletionToChat, LLM

LlamaCPP LLM — local inference via llama-cpp-python.

Runs GGUF models locally using the llama-cpp-python backend. The model is loaded (or downloaded) once at construction time.
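
Before loading, the class has to decide where the model comes from. The sketch below mirrors the priority order documented in the source listing (`model_path`, then `hf_model_id` with `hf_filename`, then `model_url`). The function `pick_source` and its return labels are hypothetical and exist only for illustration; no download or file access happens here.

```python
from typing import Optional, Tuple


def pick_source(
    model_path: Optional[str] = None,
    hf_model_id: Optional[str] = None,
    hf_filename: Optional[str] = None,
    model_url: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (kind, location) for the first usable model source."""
    if model_path is not None:
        # A local file wins over every remote source.
        return ("local", model_path)
    if hf_model_id is not None:
        if hf_filename is None:
            raise ValueError("hf_filename is required when hf_model_id is provided.")
        return ("huggingface", f"{hf_model_id}/{hf_filename}")
    if model_url is not None:
        return ("url", model_url)
    raise ValueError("One of model_path, model_url, or hf_model_id must be provided.")


print(pick_source(model_path="/models/a.gguf", model_url="https://example.com/b.gguf"))
# ('local', '/models/a.gguf')
```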

messages_to_prompt and completion_to_prompt are required. GGUF models each have a specific chat template; using the wrong one produces garbage output. Pass the formatter that matches the model family you are loading. Ready-made formatters live in serapeum.llama_cpp.formatters.
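
To make the role of a formatter pair concrete, here is a minimal sketch that wraps text in a Llama 2 style `[INST]` template. The function names, the default system prompt, and the plain `(role, content)` tuples are assumptions made for this example; the formatters shipped in serapeum.llama_cpp.formatters may differ in signature and detail.

```python
DEFAULT_SYSTEM = "You are a helpful assistant."  # hypothetical default


def completion_to_prompt_sketch(text: str, system: str = DEFAULT_SYSTEM) -> str:
    """Wrap a bare completion request in a Llama 2 style instruct template."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{text.strip()} [/INST]"


def messages_to_prompt_sketch(messages: list[tuple[str, str]]) -> str:
    """Format (role, content) pairs into one prompt.

    Simplification: user turns are concatenated and assistant turns are
    ignored, so this does not reproduce a full multi-turn template.
    """
    system = DEFAULT_SYSTEM
    user_parts = []
    for role, content in messages:
        if role == "system":
            system = content
        elif role == "user":
            user_parts.append(content)
    return completion_to_prompt_sketch("\n".join(user_parts), system)


print(messages_to_prompt_sketch([("system", "Be brief."), ("user", "What is GGUF?")]))
```

A model trained on a different template (for example Llama 3 header tokens) would see these markers as ordinary text, which is why the formatter has to match the model family.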

Warning

Construction is blocking. Loading a large GGUF file can take 10-30 seconds. To construct inside an async context without blocking the event loop, wrap the call in asyncio.to_thread:

llm = await asyncio.to_thread(LlamaCPP, model_path="...", ...)
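
The effect of this pattern can be checked without a real model. In the sketch below, `SlowModel` is a hypothetical stand-in whose constructor sleeps to simulate a long load, and `heartbeat` is a helper invented for this example; only `asyncio.to_thread` itself is the real standard-library call (Python 3.9+).

```python
import asyncio
import time


class SlowModel:
    """Hypothetical stand-in for a class whose constructor blocks."""

    def __init__(self, model_path: str, load_seconds: float = 0.2) -> None:
        time.sleep(load_seconds)  # simulates reading a large file
        self.model_path = model_path


async def heartbeat(ticks: list[float]) -> None:
    """Record timestamps to show the event loop keeps running."""
    for _ in range(4):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.05)


async def main() -> tuple[SlowModel, int]:
    ticks: list[float] = []
    beat = asyncio.create_task(heartbeat(ticks))
    # Keyword arguments are forwarded to the constructor in the worker thread.
    model = await asyncio.to_thread(SlowModel, model_path="model.gguf")
    ticks_during_load = len(ticks)  # heartbeats that ran while loading
    await beat
    return model, ticks_during_load


model, ticks_during_load = asyncio.run(main())
print(model.model_path, ticks_during_load)
```

If the constructor were called directly inside `main`, the heartbeat task could not run until the load had finished.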

Examples:

Load a model with Llama 3 formatters and explore the instance

>>> import os
>>> from serapeum.llama_cpp import LlamaCPP
>>> from serapeum.llama_cpp.formatters.llama3 import (
...     messages_to_prompt_v3_instruct,
...     completion_to_prompt_v3_instruct,
... )
>>> llm_v3 = LlamaCPP(
...     model_path=os.environ["LLAMA_MODEL_PATH"],
...     temperature=0.1,
...     max_new_tokens=256,
...     context_window=512,
...     messages_to_prompt=messages_to_prompt_v3_instruct,
...     completion_to_prompt=completion_to_prompt_v3_instruct,
... )
>>> llm_v3.temperature
0.1
>>> llm_v3.max_new_tokens
256
>>> llm_v3.context_window
512

Load a model with Llama 2 formatters and different settings

>>> import os
>>> from serapeum.llama_cpp import LlamaCPP
>>> from serapeum.llama_cpp.formatters.llama2 import (
...     messages_to_prompt,
...     completion_to_prompt,
... )
>>> llm_v2 = LlamaCPP(
...     model_path=os.environ["LLAMA_MODEL_PATH"],
...     temperature=0.5,
...     max_new_tokens=128,
...     context_window=512,
...     messages_to_prompt=messages_to_prompt,
...     completion_to_prompt=completion_to_prompt,
... )
>>> llm_v2.temperature
0.5
>>> llm_v2.max_new_tokens
128
>>> LlamaCPP.class_name()
'LlamaCPP'

Source code in libs/providers/llama-cpp/src/serapeum/llama_cpp/llm.py

class LlamaCPP(Retry, CompletionToChat, LLM):  # type: ignore[misc]
    """LlamaCPP LLM — local inference via llama-cpp-python.

    Runs GGUF models locally using the llama-cpp-python backend.  The model is
    loaded (or downloaded) once at construction time.

    ``messages_to_prompt`` and ``completion_to_prompt`` are **required**.
    GGUF models each have a specific chat template; using the wrong one
    produces garbage output.  Pass the formatter that matches the model family
    you are loading.  Ready-made formatters live in
    ``serapeum.llama_cpp.formatters``.

    Warning:
        Construction is **blocking**. Loading a large GGUF file can take
        10-30 seconds. To construct inside an async context without blocking
        the event loop wrap the call in ``asyncio.to_thread``::

            llm = await asyncio.to_thread(LlamaCPP, model_path="...", ...)

    Examples:
        - Load a model with Llama 3 formatters and explore the instance
            ```python
            >>> import os
            >>> from serapeum.llama_cpp import LlamaCPP
            >>> from serapeum.llama_cpp.formatters.llama3 import (
            ...     messages_to_prompt_v3_instruct,
            ...     completion_to_prompt_v3_instruct,
            ... )
            >>> llm_v3 = LlamaCPP(
            ...     model_path=os.environ["LLAMA_MODEL_PATH"],
            ...     temperature=0.1,
            ...     max_new_tokens=256,
            ...     context_window=512,
            ...     messages_to_prompt=messages_to_prompt_v3_instruct,
            ...     completion_to_prompt=completion_to_prompt_v3_instruct,
            ... )
            >>> llm_v3.temperature
            0.1
            >>> llm_v3.max_new_tokens
            256
            >>> llm_v3.context_window
            512

            ```
        - Load a model with Llama 2 formatters and different settings
            ```python
            >>> import os
            >>> from serapeum.llama_cpp import LlamaCPP
            >>> from serapeum.llama_cpp.formatters.llama2 import (
            ...     messages_to_prompt,
            ...     completion_to_prompt,
            ... )
            >>> llm_v2 = LlamaCPP(
            ...     model_path=os.environ["LLAMA_MODEL_PATH"],
            ...     temperature=0.5,
            ...     max_new_tokens=128,
            ...     context_window=512,
            ...     messages_to_prompt=messages_to_prompt,
            ...     completion_to_prompt=completion_to_prompt,
            ... )
            >>> llm_v2.temperature
            0.5
            >>> llm_v2.max_new_tokens
            128
            >>> LlamaCPP.class_name()
            'LlamaCPP'

            ```
    """

    model_config = ConfigDict(arbitrary_types_allowed=True, extra="forbid")
    max_retries: int = Field(
        default=0,
        description="Maximum number of retry attempts. Disabled by default for local inference.",
        ge=0,
    )
    model_url: str | None = Field(
        default=None,
        description="URL of a GGUF model to download and cache locally.",
    )
    model_path: str | None = Field(
        default=None,
        description="Path to a local GGUF model file.",
    )
    hf_model_id: str | None = Field(
        default=None,
        description=(
            "HuggingFace Hub repo ID (e.g. 'TheBloke/Llama-2-13B-chat-GGUF'). "
            "Requires ``pip install huggingface-hub``."
        ),
    )
    hf_filename: str | None = Field(
        default=None,
        description=(
            "Filename within the HuggingFace Hub repo "
            "(e.g. 'llama-2-13b-chat.Q4_0.gguf'). Required when hf_model_id is set."
        ),
    )
    temperature: float = Field(
        default=DEFAULT_TEMPERATURE,
        description="The temperature to use for sampling.",
        ge=0.0,
        le=1.0,
    )
    max_new_tokens: int = Field(
        default=DEFAULT_NUM_OUTPUTS,
        description="The maximum number of tokens to generate.",
        gt=0,
    )
    context_window: int = Field(
        default=DEFAULT_CONTEXT_WINDOW,
        description="The maximum number of context tokens for the model.",
        gt=0,
    )
    n_gpu_layers: int = Field(
        default=0,
        description=(
            "Number of model layers to offload to GPU. "
            "Set to -1 to offload all layers."
        ),
    )
    stop: list[str] = Field(
        default_factory=list,
        description="Token sequences that stop generation (e.g. ['</s>', '<|eot_id|>']).",
    )
    generate_kwargs: dict[str, Any] = Field(
        default_factory=dict, description="Kwargs used for generation."
    )
    model_kwargs: dict[str, Any] = Field(
        default_factory=dict, description="Kwargs used for model initialization."
    )
    verbose: bool = Field(
        default=DEFAULT_MODEL_VERBOSITY,
        description="Whether to print verbose output.",
    )

    _model: Any = PrivateAttr(default=None)
    # Serializes concurrent model calls: llama_cpp releases the GIL during
    # C-level inference, so two asyncio.to_thread calls on the same Llama
    # instance can race and abort().  One lock per LlamaCPP instance is
    # sufficient for the common "one model, many callers" pattern.
    _model_lock: threading.Lock = PrivateAttr(default_factory=threading.Lock)

    @field_validator("model_url")
    @classmethod
    def _validate_model_url(cls, v: str | None) -> str | None:
        """Reject empty-string model_url to prevent silent fallback to default."""
        if v is not None and not v.strip():
            raise ValueError("model_url must be a valid URL, not an empty string.")
        return v

    @field_validator("model_path")
    @classmethod
    def _validate_model_path_exists(cls, v: str | None) -> str | None:
        """Validate that model_path points to an existing file when provided."""
        if v is not None and not Path(v).exists():
            raise ValueError(
                "Provided model path does not exist. "
                "Please check the path or provide a model_url to download."
            )
        return v

    @model_validator(mode="after")
    def _check_model_source(self) -> "LlamaCPP":
        """Ensure the cross-field model source combination is valid."""
        if (
            self.model_path is None
            and self.model_url is None
            and self.hf_model_id is None
        ):
            raise ValueError(
                "One of model_path, model_url, or hf_model_id must be provided. "
                "Set model_path to a local GGUF file, model_url to download one, "
                "or hf_model_id + hf_filename to download from HuggingFace Hub."
            )
        if self.hf_model_id is not None and self.hf_filename is None:
            raise ValueError(
                "hf_filename is required when hf_model_id is provided. "
                "Example: hf_filename='llama-2-13b-chat.Q4_0.gguf'."
            )
        return self

    @model_validator(mode="after")
    def _check_formatters(self) -> "LlamaCPP":
        """Ensure both prompt formatters were explicitly provided by the caller.

        The base LLM silently falls back to a generic lambda when
        messages_to_prompt / completion_to_prompt are omitted, producing no
        instruct template and therefore garbage output from any GGUF instruct
        model.  model_fields_set contains only the fields the caller explicitly
        passed, so omitted fields are detected even if they have a default.
        """
        missing = [
            name
            for name in ("messages_to_prompt", "completion_to_prompt")
            if name not in self.model_fields_set
        ]
        if missing:
            raise ValueError(
                f"LlamaCPP requires explicit prompt formatters: {', '.join(missing)}.\n"
                "Pass a formatter that matches your model's chat template.\n"
                "Ready-made formatters are available in serapeum.llama_cpp.formatters:\n"
                "  Llama 2 / Mistral:  messages_to_prompt, completion_to_prompt\n"
                "  Llama 3:            messages_to_prompt_v3_instruct, completion_to_prompt_v3_instruct"
            )
        return self

    @model_validator(mode="before")
    @classmethod
    def _prepare_kwargs(cls, data: Any) -> Any:
        """Merge n_ctx, verbose, and n_gpu_layers defaults into model_kwargs.

        User-supplied model_kwargs take precedence over these defaults.
        """
        if isinstance(data, dict):
            context_window = data.get("context_window", DEFAULT_CONTEXT_WINDOW)
            verbose = data.get("verbose", DEFAULT_MODEL_VERBOSITY)
            n_gpu_layers = data.get("n_gpu_layers", 0)
            model_kwargs = dict(data.get("model_kwargs") or {})
            model_kwargs.setdefault("n_ctx", context_window)
            model_kwargs.setdefault("verbose", verbose)
            model_kwargs.setdefault("n_gpu_layers", n_gpu_layers)
            data = {**data, "model_kwargs": model_kwargs}
        return data

    def model_post_init(self, __context: Any) -> None:
        """Resolve the model path, download if needed, then load the model.

        Called automatically by Pydantic after ``__init__``.  All validation
        has already completed before this method runs; it performs only I/O
        (path resolution, optional download, GGUF loading).

        See Also:
            _resolve_model_path: Locates or downloads the GGUF file.
            _load_model: Loads (or retrieves from cache) the Llama instance.
        """
        model_path = self._resolve_model_path()
        self._model = self._load_model(model_path)

    def _resolve_model_path(self) -> Path:
        """Return the local Path to the GGUF file, downloading it if required.

        Checks :attr:`model_path`, :attr:`hf_model_id`, and :attr:`model_url`
        in that priority order.  Downloads or fetches from HuggingFace Hub
        when a local path is not available.  Sets :attr:`model_path` as a
        side-effect so subsequent reloads skip the network step.

        Returns:
            :class:`~pathlib.Path` pointing to the resolved local GGUF file.

        Raises:
            RuntimeError: If a URL download appears to succeed but the file
                is not present on disk afterwards.

        See Also:
            _load_model: Called immediately after this method in
                :meth:`model_post_init`.
            serapeum.llama_cpp.utils._fetch_model_file: URL download helper.
            serapeum.llama_cpp.utils._fetch_model_file_hf: HuggingFace Hub
                download helper.
        """
        if self.model_path is not None:
            model_path = Path(self.model_path)
        elif self.hf_model_id is not None and self.hf_filename is not None:
            cache_dir = Path(get_cache_dir()) / "models"
            cache_dir.mkdir(parents=True, exist_ok=True)
            model_path = _fetch_model_file_hf(
                self.hf_model_id,
                self.hf_filename,
                cache_dir,
            )
            self.model_path = str(model_path)
        else:
            model_url = self.model_url or DEFAULT_LLAMA_CPP_GGUF_MODEL
            model_path = Path(get_cache_dir()) / "models" / model_url.rsplit("/", 1)[-1]
            if not model_path.exists():
                model_path.parent.mkdir(parents=True, exist_ok=True)
                _fetch_model_file(model_url, model_path)
                if not model_path.exists():
                    raise RuntimeError(
                        f"Download appeared to succeed but model not found at {model_path!r}"
                    )
            self.model_path = str(model_path)
        return model_path

    def _load_model(self, model_path: Path) -> Llama:  # type: ignore[valid-type]
        """Return a Llama instance for *model_path*, reusing the cache if possible.

        Checks the cache under a lock, loads outside the lock, then re-checks
        before storing, so threads loading different models do not serialise
        on a single global lock.  Two threads that miss at the same time may
        both load the same model, but only the first instance stored is kept
        and returned to both.  The cache key is a ``(path, kwargs_json)``
        tuple so models with different ``model_kwargs`` are kept separate.

        Args:
            model_path: Absolute path to the local GGUF file.

        Returns:
            A :class:`llama_cpp.Llama` instance — either freshly loaded or
            retrieved from the module-level ``_MODEL_CACHE``.

        See Also:
            _resolve_model_path: Resolves the path before this method is called.
            _MODEL_CACHE: Module-level WeakValueDictionary that holds cached
                Llama instances.
        """
        try:
            kwargs_key = json.dumps(self.model_kwargs, sort_keys=True)
        except TypeError:
            kwargs_key = repr(sorted(self.model_kwargs.items()))

        cache_key = (str(model_path), kwargs_key)

        with _MODEL_CACHE_LOCK:
            result = _MODEL_CACHE.get(cache_key)

        if result is None:
            try:
                loaded = Llama(model_path=str(model_path), **self.model_kwargs)  # type: ignore[operator]
            except Exception as exc:
                raise RuntimeError(
                    f"Failed to load GGUF model from {model_path!s}. "
                    f"model_kwargs={self.model_kwargs!r}. "
                    "Ensure the file is a valid GGUF model and that you have "
                    f"enough memory. Original error: {exc}"
                ) from exc

            with _MODEL_CACHE_LOCK:
                existing = _MODEL_CACHE.get(cache_key)
                if existing is None:
                    _MODEL_CACHE[cache_key] = loaded
                    result = loaded
                else:
                    result = existing

        return result

    @classmethod
    def class_name(cls) -> str:
        """Return the canonical class identifier used in serialisation.

        Returns:
            The string ``"LlamaCPP"``.
        """
        return "LlamaCPP"

    @property
    def metadata(self) -> Metadata:
        """LLM metadata derived from the loaded model's configuration.

        Returns:
            :class:`~serapeum.core.llms.Metadata` instance with:

            - ``context_window``: effective context size from the loaded model.
            - ``num_output``: :attr:`max_new_tokens` configured for generation.
            - ``model_name``: resolved local path to the GGUF file.

        Examples:
            - Inspect metadata fields of a loaded model
                ```python
                >>> import os
                >>> from serapeum.llama_cpp import LlamaCPP
                >>> from serapeum.llama_cpp.formatters.llama3 import (
                ...     messages_to_prompt_v3_instruct,
                ...     completion_to_prompt_v3_instruct,
                ... )
                >>> llm = LlamaCPP(
                ...     model_path=os.environ["LLAMA_MODEL_PATH"],
                ...     temperature=0.1,
                ...     max_new_tokens=256,
                ...     context_window=512,
                ...     messages_to_prompt=messages_to_prompt_v3_instruct,
                ...     completion_to_prompt=completion_to_prompt_v3_instruct,
                ... )
                >>> meta = llm.metadata
                >>> meta.context_window
                512
                >>> meta.num_output
                256
                >>> meta.model_name.split(".")[-1]
                'gguf'

                ```

        See Also:
            class_name: Class identifier used for serialisation.
        """
        return Metadata(
            context_window=self._model.context_params.n_ctx,
            num_output=self.max_new_tokens,
            model_name=self.model_path or "unknown",
        )

    def tokenize(self, text: str) -> list[int]:
        """Return the token IDs for *text* using the loaded model's vocabulary.

        Args:
            text: The input string to tokenize.

        Returns:
            List of integer token IDs produced by the model's tokenizer.

        Examples:
            - Tokenize a short string and explore the token IDs
                ```python
                >>> import os
                >>> from serapeum.llama_cpp import LlamaCPP
                >>> from serapeum.llama_cpp.formatters.llama3 import (
                ...     messages_to_prompt_v3_instruct,
                ...     completion_to_prompt_v3_instruct,
                ... )
                >>> llm = LlamaCPP(
                ...     model_path=os.environ["LLAMA_MODEL_PATH"],
                ...     temperature=0.1,
                ...     max_new_tokens=256,
                ...     context_window=512,
                ...     messages_to_prompt=messages_to_prompt_v3_instruct,
                ...     completion_to_prompt=completion_to_prompt_v3_instruct,
                ... )
                >>> tokens = llm.tokenize("Hello!")
                >>> tokens[0]
                1
                >>> all(t >= 0 for t in tokens)
                True

                ```
            - Longer text produces more tokens
                ```python
                >>> import os
                >>> from serapeum.llama_cpp import LlamaCPP
                >>> from serapeum.llama_cpp.formatters.llama3 import (
                ...     messages_to_prompt_v3_instruct,
                ...     completion_to_prompt_v3_instruct,
                ... )
                >>> llm = LlamaCPP(
                ...     model_path=os.environ["LLAMA_MODEL_PATH"],
                ...     temperature=0.1,
                ...     max_new_tokens=256,
                ...     context_window=512,
                ...     messages_to_prompt=messages_to_prompt_v3_instruct,
                ...     completion_to_prompt=completion_to_prompt_v3_instruct,
                ... )
                >>> short = llm.tokenize("Hi")
                >>> long = llm.tokenize("Hello, how are you doing today?")
                >>> short[0] == long[0]  # both start with the BOS token
                True
                >>> len(long) > len(short)  # more text yields more tokens
                True

                ```

        See Also:
            count_tokens: Returns the token count instead of the full list.
            _guard_context: Uses token count to validate prompt length.
        """
        return self._model.tokenize(text.encode())  # type: ignore[no-any-return]

    def count_tokens(self, text: str) -> int:
        """Return the number of tokens *text* encodes to.

        Args:
            text: The input string to count tokens for.

        Returns:
            Integer token count for *text*.

        Examples:
            - count_tokens is consistent with tokenize
                ```python
                >>> import os
                >>> from serapeum.llama_cpp import LlamaCPP
                >>> from serapeum.llama_cpp.formatters.llama3 import (
                ...     messages_to_prompt_v3_instruct,
                ...     completion_to_prompt_v3_instruct,
                ... )
                >>> llm = LlamaCPP(
                ...     model_path=os.environ["LLAMA_MODEL_PATH"],
                ...     temperature=0.1,
                ...     max_new_tokens=256,
                ...     context_window=512,
                ...     messages_to_prompt=messages_to_prompt_v3_instruct,
                ...     completion_to_prompt=completion_to_prompt_v3_instruct,
                ... )
                >>> llm.count_tokens("Hello!") == len(llm.tokenize("Hello!"))
                True

                ```
            - Longer text yields a higher count
                ```python
                >>> import os
                >>> from serapeum.llama_cpp import LlamaCPP
                >>> from serapeum.llama_cpp.formatters.llama3 import (
                ...     messages_to_prompt_v3_instruct,
                ...     completion_to_prompt_v3_instruct,
                ... )
                >>> llm = LlamaCPP(
                ...     model_path=os.environ["LLAMA_MODEL_PATH"],
                ...     temperature=0.1,
                ...     max_new_tokens=256,
                ...     context_window=512,
                ...     messages_to_prompt=messages_to_prompt_v3_instruct,
                ...     completion_to_prompt=completion_to_prompt_v3_instruct,
                ... )
                >>> short_count = llm.count_tokens("Hi")
                >>> long_count = llm.count_tokens("Hello, how are you doing today?")
                >>> long_count > short_count  # more text yields a higher count
                True
                >>> short_count == len(llm.tokenize("Hi"))  # consistent with tokenize
                True

                ```

        See Also:
            tokenize: Returns the full token ID list.
            _guard_context: Calls this method to check prompt length.
        """
        return len(self.tokenize(text))

    def _guard_context(self, prompt: str) -> None:
        """Raise ValueError if *prompt* exceeds the model's context window.

        Args:
            prompt: Already-formatted prompt string whose token length is
                checked against :attr:`context_window`.

        Raises:
            ValueError: If the token count of *prompt* exceeds
                :attr:`context_window`, reporting the actual count and the
                configured limit.

        Examples:
            - A prompt within the context window raises no error
                ```python
                >>> import os
                >>> from serapeum.llama_cpp import LlamaCPP
                >>> from serapeum.llama_cpp.formatters.llama3 import (
                ...     messages_to_prompt_v3_instruct,
                ...     completion_to_prompt_v3_instruct,
                ... )
                >>> llm = LlamaCPP(
                ...     model_path=os.environ["LLAMA_MODEL_PATH"],
                ...     temperature=0.1,
                ...     max_new_tokens=256,
                ...     context_window=512,
                ...     messages_to_prompt=messages_to_prompt_v3_instruct,
                ...     completion_to_prompt=completion_to_prompt_v3_instruct,
                ... )
                >>> llm._guard_context("Short prompt.")

                ```
            - A prompt that exceeds the context window raises ValueError
                ```python
                >>> import os
                >>> from serapeum.llama_cpp import LlamaCPP
                >>> from serapeum.llama_cpp.formatters.llama3 import (
                ...     messages_to_prompt_v3_instruct,
                ...     completion_to_prompt_v3_instruct,
                ... )
                >>> llm = LlamaCPP(
                ...     model_path=os.environ["LLAMA_MODEL_PATH"],
                ...     temperature=0.1,
                ...     max_new_tokens=256,
                ...     context_window=512,
                ...     messages_to_prompt=messages_to_prompt_v3_instruct,
                ...     completion_to_prompt=completion_to_prompt_v3_instruct,
                ... )
                >>> llm._guard_context("word " * 10_000)  # doctest: +ELLIPSIS
                Traceback (most recent call last):
                    ...
                ValueError: Prompt is ... tokens but context_window is 512. ...

                ```

        See Also:
            count_tokens: Used internally to measure the prompt length.
        """
        n = self.count_tokens(prompt)
        if n > self.context_window:
            raise ValueError(
                f"Prompt is {n} tokens but context_window is {self.context_window}. "
                "Shorten the prompt or increase context_window."
            )

    @overload  # type: ignore[override]
    def complete(
        self,
        prompt: str,
        formatted: bool = ...,
        *,
        stream: Literal[False] = ...,
        **kwargs: Any,
    ) -> CompletionResponse: ...

    @overload
    def complete(
        self,
        prompt: str,
        formatted: bool = ...,
        *,
        stream: Literal[True],
        **kwargs: Any,
    ) -> CompletionResponseGen: ...

    def complete(
        self,
        prompt: str,
        formatted: bool = False,
        *,
        stream: bool = False,
        **kwargs: Any,
    ) -> CompletionResponse | CompletionResponseGen:
        """Run text completion, optionally streaming token-by-token.

        Args:
            prompt: The input text to complete.
            formatted: When ``True``, *prompt* is passed to the model as-is.
                When ``False`` (default) it is first wrapped by
                :attr:`completion_to_prompt` to apply the model's chat template.
            stream: When ``True`` returns a :class:`CompletionResponseGen`
                generator that yields one :class:`CompletionResponse` per
                token delta.  When ``False`` (default) returns a single
                :class:`CompletionResponse` with the full completion.
            **kwargs: Additional keyword arguments forwarded to the underlying
                ``Llama.__call__`` (e.g. ``top_p``, ``repeat_penalty``).

        Returns:
            A :class:`CompletionResponse` when ``stream=False``, or a
            :class:`CompletionResponseGen` generator when ``stream=True``.

        Raises:
            ValueError: If *prompt* exceeds :attr:`context_window` tokens.

        Examples:
            - Non-streaming completion — explore the response structure
                ```python
                >>> import os
                >>> from serapeum.llama_cpp import LlamaCPP
                >>> from serapeum.llama_cpp.formatters.llama3 import (
                ...     messages_to_prompt_v3_instruct,
                ...     completion_to_prompt_v3_instruct,
                ... )
                >>> llm = LlamaCPP(
                ...     model_path=os.environ["LLAMA_MODEL_PATH"],
                ...     temperature=0.1,
                ...     max_new_tokens=256,
                ...     context_window=512,
                ...     messages_to_prompt=messages_to_prompt_v3_instruct,
                ...     completion_to_prompt=completion_to_prompt_v3_instruct,
                ... )
                >>> response = llm.complete("Once upon a time")
                >>> response.raw["choices"][0]["text"] == response.text
                True
                >>> sorted(response.raw.keys())
                ['choices', 'created', 'id', 'model', 'object', 'usage']
                >>> response.raw["usage"]["prompt_tokens"] > 0
                True

                ```
            - Streaming completion — iterate over token deltas
                ```python
                >>> import os
                >>> from serapeum.llama_cpp import LlamaCPP
                >>> from serapeum.llama_cpp.formatters.llama3 import (
                ...     messages_to_prompt_v3_instruct,
                ...     completion_to_prompt_v3_instruct,
                ... )
                >>> llm = LlamaCPP(
                ...     model_path=os.environ["LLAMA_MODEL_PATH"],
                ...     temperature=0.1,
                ...     max_new_tokens=256,
                ...     context_window=512,
                ...     messages_to_prompt=messages_to_prompt_v3_instruct,
                ...     completion_to_prompt=completion_to_prompt_v3_instruct,
                ... )
                >>> chunks = list(llm.complete("Once upon a time", stream=True))
                >>> chunks[-1].text.startswith(chunks[0].delta)
                True
                >>> chunks[-1].text != chunks[0].delta
                True

                ```

        See Also:
            acomplete: Async variant that offloads inference to a thread pool.
            _complete: Non-streaming implementation.
            _stream_complete: Streaming implementation.
        """
        if not formatted:
            prompt = self.completion_to_prompt(prompt)  # type: ignore[misc]

        result: CompletionResponse | CompletionResponseGen = (
            self._stream_complete(prompt, **kwargs)
            if stream
            else self._complete(prompt, **kwargs)
        )
        return result

    @overload  # type: ignore[override]
    async def acomplete(
        self,
        prompt: str,
        formatted: bool = ...,
        *,
        stream: Literal[False] = ...,
        **kwargs: Any,
    ) -> CompletionResponse: ...

    @overload
    async def acomplete(
        self,
        prompt: str,
        formatted: bool = ...,
        *,
        stream: Literal[True],
        **kwargs: Any,
    ) -> CompletionResponseAsyncGen: ...

    async def acomplete(
        self,
        prompt: str,
        formatted: bool = False,
        *,
        stream: bool = False,
        **kwargs: Any,
    ) -> CompletionResponse | CompletionResponseAsyncGen:
        """Async text completion — offloads CPU-bound inference to a thread pool.

        Wraps :meth:`complete` in :func:`asyncio.to_thread` so that the
        llama-cpp-python C-level inference call never blocks the running event
        loop.  The streaming variant collects all token chunks in the worker
        thread and re-yields them as an async generator once all chunks are
        ready.

        Args:
            prompt: The input text to complete.
            formatted: When ``True``, *prompt* is passed to the model as-is.
                When ``False`` (default) it is first wrapped by
                :attr:`completion_to_prompt` to apply the model's chat template.
            stream: When ``True`` returns a :class:`CompletionResponseAsyncGen`
                async generator that yields one :class:`CompletionResponse` per
                token delta.  When ``False`` (default) returns a single
                :class:`CompletionResponse` with the full completion.
            **kwargs: Additional keyword arguments forwarded to the underlying
                ``Llama.__call__`` (e.g. ``top_p``, ``repeat_penalty``).

        Returns:
            A :class:`CompletionResponse` when ``stream=False``, or a
            :class:`CompletionResponseAsyncGen` async generator when
            ``stream=True``.

        Raises:
            ValueError: If *prompt* exceeds :attr:`context_window` tokens.

        Examples:
            - Non-streaming async completion — explore the response
                ```python
                >>> import os
                >>> import asyncio
                >>> from serapeum.llama_cpp import LlamaCPP
                >>> from serapeum.llama_cpp.formatters.llama3 import (
                ...     messages_to_prompt_v3_instruct,
                ...     completion_to_prompt_v3_instruct,
                ... )
                >>> llm = LlamaCPP(
                ...     model_path=os.environ["LLAMA_MODEL_PATH"],
                ...     temperature=0.1,
                ...     max_new_tokens=256,
                ...     context_window=512,
                ...     messages_to_prompt=messages_to_prompt_v3_instruct,
                ...     completion_to_prompt=completion_to_prompt_v3_instruct,
                ... )
                >>> response = asyncio.run(llm.acomplete("Once upon a time"))
                >>> response.raw["choices"][0]["text"] == response.text
                True
                >>> sorted(response.raw.keys())
                ['choices', 'created', 'id', 'model', 'object', 'usage']

                ```
            - Streaming async completion — collect and inspect chunks
                ```python
                >>> import os
                >>> import asyncio
                >>> from serapeum.llama_cpp import LlamaCPP
                >>> from serapeum.llama_cpp.formatters.llama3 import (
                ...     messages_to_prompt_v3_instruct,
                ...     completion_to_prompt_v3_instruct,
                ... )
                >>> llm = LlamaCPP(
                ...     model_path=os.environ["LLAMA_MODEL_PATH"],
                ...     temperature=0.1,
                ...     max_new_tokens=256,
                ...     context_window=512,
                ...     messages_to_prompt=messages_to_prompt_v3_instruct,
                ...     completion_to_prompt=completion_to_prompt_v3_instruct,
                ... )
                >>> async def _collect():
                ...     return [c async for c in await llm.acomplete("Once upon", stream=True)]
                >>> chunks = asyncio.run(_collect())
                >>> chunks[-1].text.startswith(chunks[0].delta)
                True
                >>> chunks[-1].text != chunks[0].delta
                True

                ```

        See Also:
            complete: Synchronous variant of this method.
            _complete: Non-streaming implementation called in the thread pool.
            _stream_complete: Streaming implementation called in the thread pool.
        """
        if stream:
            chunks: list[CompletionResponse] = await asyncio.to_thread(
                lambda: list(
                    self.complete(prompt, formatted=formatted, stream=True, **kwargs)
                )
            )

            async def gen() -> CompletionResponseAsyncGen:
                for chunk in chunks:
                    yield chunk

            result: CompletionResponse | CompletionResponseAsyncGen = gen()
        else:
            result = await asyncio.to_thread(self.complete, prompt, formatted, stream=False, **kwargs)  # type: ignore[arg-type]
        return result

    @retry(is_retryable, logger)
    def _complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        """Run a single non-streaming inference pass and return the full completion.

        Args:
            prompt: Already-formatted prompt string to send to the model.
            **kwargs: Additional keyword arguments forwarded to the underlying
                ``Llama.__call__`` (e.g. ``top_p``, ``repeat_penalty``).

        Returns:
            :class:`CompletionResponse` with the full generated text in
            ``.text`` and the raw llama-cpp-python response dict in ``.raw``.

        Raises:
            ValueError: If *prompt* exceeds :attr:`context_window` tokens
                (checked via :meth:`_guard_context`).
            RuntimeError: If the model response contains no ``choices``.

        See Also:
            _stream_complete: Streaming variant.
            _guard_context: Context-window overflow check.
        """
        self._guard_context(prompt)
        call_kwargs = {
            **self.generate_kwargs,
            "temperature": self.temperature,
            "max_tokens": self.max_new_tokens,
            "stream": False,
            **kwargs,
        }
        call_kwargs.setdefault("stop", self.stop or None)
        with self._model_lock:
            response = self._model(prompt=prompt, **call_kwargs)
        choices = response.get("choices", [])
        if not choices:
            raise RuntimeError(
                f"Model returned no choices for prompt (length={len(prompt)} chars). "
                f"Raw response: {response!r}"
            )
        return CompletionResponse(text=choices[0]["text"], raw=response)

    @retry(is_retryable, logger)
    def _stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        """Run a streaming inference pass and yield one response per token delta.

        The underlying ``Llama.__call__`` is called with ``stream=True`` inside
        a local generator so the model lock is held for the entire streaming
        session, preventing concurrent calls from corrupting state.

        Args:
            prompt: Already-formatted prompt string to send to the model.
            **kwargs: Additional keyword arguments forwarded to the underlying
                ``Llama.__call__``.

        Yields:
            :class:`CompletionResponse` objects — one per generated token —
            where ``.delta`` contains the incremental text and ``.text`` the
            cumulative completion so far.

        Raises:
            ValueError: If *prompt* exceeds :attr:`context_window` tokens
                (checked via :meth:`_guard_context`).
            RuntimeError: If a streamed response chunk contains no ``choices``.

        See Also:
            _complete: Non-streaming variant.
            _guard_context: Context-window overflow check.
        """
        self._guard_context(prompt)
        call_kwargs = {
            **self.generate_kwargs,
            "temperature": self.temperature,
            "max_tokens": self.max_new_tokens,
            "stream": True,
            **kwargs,
        }
        call_kwargs.setdefault("stop", self.stop or None)

        text = ""
        with self._model_lock:
            try:
                for response in self._model(prompt=prompt, **call_kwargs):
                    choices = response.get("choices", [])
                    if not choices:
                        raise RuntimeError(
                            f"Model returned no choices in streaming response "
                            f"after generating {len(text)} chars. "
                            f"Raw response: {response!r}"
                        )
                    delta = choices[0]["text"]
                    text += delta
                    yield CompletionResponse(delta=delta, text=text, raw=response)
            except Exception:
                logger.exception(
                    f"Streaming inference failed after generating {len(text)} chars",
                )
                raise
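
The keyword merging used by `_complete` and `_stream_complete` can be tried out in isolation. The sketch below is a minimal stand-alone reproduction of that dictionary construction. `Settings` and `build_call_kwargs` are hypothetical names introduced for illustration and are not part of the library. It suggests that per-call keyword arguments override the instance settings, which in turn override `generate_kwargs`, and that `stop` is filled in only when nothing else supplied it.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Settings:
    """Hypothetical stand-in for the relevant LlamaCPP fields."""

    temperature: float = 0.1
    max_new_tokens: int = 256
    stop: list[str] = field(default_factory=list)
    generate_kwargs: dict[str, Any] = field(default_factory=dict)


def build_call_kwargs(settings: Settings, stream: bool, **kwargs: Any) -> dict[str, Any]:
    """Hypothetical helper mirroring the dict merge in the methods above."""
    call_kwargs = {
        **settings.generate_kwargs,             # lowest precedence
        "temperature": settings.temperature,    # instance settings come next
        "max_tokens": settings.max_new_tokens,
        "stream": stream,
        **kwargs,                               # per-call kwargs win
    }
    # An empty stop list becomes None; an already-present "stop" is kept.
    call_kwargs.setdefault("stop", settings.stop or None)
    return call_kwargs


settings = Settings(generate_kwargs={"top_p": 0.9, "temperature": 1.0})
merged = build_call_kwargs(settings, stream=False, top_p=0.5)
print(merged)
```

In this sketch the `temperature` entry inside `generate_kwargs` is silently replaced by the instance-level value, so a per-call keyword argument is the only way to change it for a single request.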

`metadata` `property` #

LLM metadata derived from the loaded model's configuration.

Returns:

Type	Description
`Metadata`	:class:`~serapeum.core.llms.Metadata` instance with `context_window` (effective context size from the loaded model), `num_output` (the :attr:`max_new_tokens` configured for generation), and `model_name` (resolved local path to the GGUF file).

Examples:

Inspect metadata fields of a loaded model

>>> import os
>>> from serapeum.llama_cpp import LlamaCPP
>>> from serapeum.llama_cpp.formatters.llama3 import (
...     messages_to_prompt_v3_instruct,
...     completion_to_prompt_v3_instruct,
... )
>>> llm = LlamaCPP(
...     model_path=os.environ["LLAMA_MODEL_PATH"],
...     temperature=0.1,
...     max_new_tokens=256,
...     context_window=512,
...     messages_to_prompt=messages_to_prompt_v3_instruct,
...     completion_to_prompt=completion_to_prompt_v3_instruct,
... )
>>> meta = llm.metadata
>>> meta.context_window
512
>>> meta.num_output
256
>>> meta.model_name.split(".")[-1]
'gguf'

See Also

class_name: Class identifier used for serialisation.

`acomplete(prompt, formatted=False, *, stream=False, **kwargs)` `async` #

acomplete(prompt: str, formatted: bool = ..., *, stream: Literal[False] = ..., **kwargs: Any) -> CompletionResponse

acomplete(prompt: str, formatted: bool = ..., *, stream: Literal[True], **kwargs: Any) -> CompletionResponseAsyncGen

Async text completion — offloads CPU-bound inference to a thread pool.

Wraps :meth:complete in :func:asyncio.to_thread so that the llama-cpp-python C-level inference call never blocks the running event loop. The streaming variant collects all token chunks in the worker thread and re-yields them as an async generator once all chunks are ready.
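
The buffering behaviour described above can be reproduced without a model. The following is a minimal sketch of the same pattern, assuming a hypothetical `blocking_stream` generator in place of the real inference call: the blocking generator is drained inside `asyncio.to_thread`, and the collected items are then replayed through an async generator.

```python
import asyncio
import time
from collections.abc import AsyncIterator, Iterator


def blocking_stream(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a blocking, token-by-token generator."""
    for token in prompt.split():
        time.sleep(0.01)  # simulate CPU-bound work per token
        yield token


async def astream(prompt: str) -> AsyncIterator[str]:
    # Drain the whole blocking generator in a worker thread first.
    chunks = await asyncio.to_thread(lambda: list(blocking_stream(prompt)))

    # Then replay the buffered chunks as an async generator.
    async def gen() -> AsyncIterator[str]:
        for chunk in chunks:
            yield chunk

    return gen()


async def main() -> list[str]:
    return [tok async for tok in await astream("once upon a time")]


tokens = asyncio.run(main())
print(tokens)
```

With this design the event loop stays free while the worker thread runs, but the consumer receives its first chunk only after generation has finished, so the async stream does not reduce time to first token.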

Parameters:

Name	Type	Description	Default
`prompt`	`str`	The input text to complete.	required
`formatted`	`bool`	When `True`, `prompt` is passed to the model as-is. When `False` (default) it is first wrapped by :attr:`completion_to_prompt` to apply the model's chat template.	`False`
`stream`	`bool`	When `True` returns a :class:`CompletionResponseAsyncGen` async generator that yields one :class:`CompletionResponse` per token delta. When `False` (default) returns a single :class:`CompletionResponse` with the full completion.	`False`
`**kwargs`	`Any`	Additional keyword arguments forwarded to the underlying `Llama.__call__` (e.g. `top_p`, `repeat_penalty`).	`{}`

Returns:

Type	Description
`CompletionResponse \| CompletionResponseAsyncGen`	A :class:`CompletionResponse` when `stream=False`, or a :class:`CompletionResponseAsyncGen` async generator when `stream=True`.

Raises:

Type	Description
`ValueError`	If `prompt` exceeds :attr:`context_window` tokens.

Examples:

Non-streaming async completion — explore the response

>>> import os
>>> import asyncio
>>> from serapeum.llama_cpp import LlamaCPP
>>> from serapeum.llama_cpp.formatters.llama3 import (
...     messages_to_prompt_v3_instruct,
...     completion_to_prompt_v3_instruct,
... )
>>> llm = LlamaCPP(
...     model_path=os.environ["LLAMA_MODEL_PATH"],
...     temperature=0.1,
...     max_new_tokens=256,
...     context_window=512,
...     messages_to_prompt=messages_to_prompt_v3_instruct,
...     completion_to_prompt=completion_to_prompt_v3_instruct,
... )
>>> response = asyncio.run(llm.acomplete("Once upon a time"))
>>> response.raw["choices"][0]["text"] == response.text
True
>>> sorted(response.raw.keys())
['choices', 'created', 'id', 'model', 'object', 'usage']

Streaming async completion — collect and inspect chunks

>>> import os
>>> import asyncio
>>> from serapeum.llama_cpp import LlamaCPP
>>> from serapeum.llama_cpp.formatters.llama3 import (
...     messages_to_prompt_v3_instruct,
...     completion_to_prompt_v3_instruct,
... )
>>> llm = LlamaCPP(
...     model_path=os.environ["LLAMA_MODEL_PATH"],
...     temperature=0.1,
...     max_new_tokens=256,
...     context_window=512,
...     messages_to_prompt=messages_to_prompt_v3_instruct,
...     completion_to_prompt=completion_to_prompt_v3_instruct,
... )
>>> async def _collect():
...     return [c async for c in await llm.acomplete("Once upon", stream=True)]
>>> chunks = asyncio.run(_collect())
>>> chunks[-1].text.startswith(chunks[0].delta)
True
>>> chunks[-1].text != chunks[0].delta
True

See Also

complete: Synchronous variant of this method.
_complete: Non-streaming implementation called in the thread pool.
_stream_complete: Streaming implementation called in the thread pool.

Source code in libs/providers/llama-cpp/src/serapeum/llama_cpp/llm.py

async def acomplete(
    self,
    prompt: str,
    formatted: bool = False,
    *,
    stream: bool = False,
    **kwargs: Any,
) -> CompletionResponse | CompletionResponseAsyncGen:
    """Async text completion — offloads CPU-bound inference to a thread pool.

    Wraps :meth:`complete` in :func:`asyncio.to_thread` so that the
    llama-cpp-python C-level inference call never blocks the running event
    loop.  The streaming variant collects all token chunks in the worker
    thread and re-yields them as an async generator once all chunks are
    ready.

    Args:
        prompt: The input text to complete.
        formatted: When ``True``, *prompt* is passed to the model as-is.
            When ``False`` (default) it is first wrapped by
            :attr:`completion_to_prompt` to apply the model's chat template.
        stream: When ``True`` returns a :class:`CompletionResponseAsyncGen`
            async generator that yields one :class:`CompletionResponse` per
            token delta.  When ``False`` (default) returns a single
            :class:`CompletionResponse` with the full completion.
        **kwargs: Additional keyword arguments forwarded to the underlying
            ``Llama.__call__`` (e.g. ``top_p``, ``repeat_penalty``).

    Returns:
        A :class:`CompletionResponse` when ``stream=False``, or a
        :class:`CompletionResponseAsyncGen` async generator when
        ``stream=True``.

    Raises:
        ValueError: If *prompt* exceeds :attr:`context_window` tokens.

    Examples:
        - Non-streaming async completion — explore the response
            ```python
            >>> import os
            >>> import asyncio
            >>> from serapeum.llama_cpp import LlamaCPP
            >>> from serapeum.llama_cpp.formatters.llama3 import (
            ...     messages_to_prompt_v3_instruct,
            ...     completion_to_prompt_v3_instruct,
            ... )
            >>> llm = LlamaCPP(
            ...     model_path=os.environ["LLAMA_MODEL_PATH"],
            ...     temperature=0.1,
            ...     max_new_tokens=256,
            ...     context_window=512,
            ...     messages_to_prompt=messages_to_prompt_v3_instruct,
            ...     completion_to_prompt=completion_to_prompt_v3_instruct,
            ... )
            >>> response = asyncio.run(llm.acomplete("Once upon a time"))
            >>> response.raw["choices"][0]["text"] == response.text
            True
            >>> sorted(response.raw.keys())
            ['choices', 'created', 'id', 'model', 'object', 'usage']

            ```
        - Streaming async completion — collect and inspect chunks
            ```python
            >>> import os
            >>> import asyncio
            >>> from serapeum.llama_cpp import LlamaCPP
            >>> from serapeum.llama_cpp.formatters.llama3 import (
            ...     messages_to_prompt_v3_instruct,
            ...     completion_to_prompt_v3_instruct,
            ... )
            >>> llm = LlamaCPP(
            ...     model_path=os.environ["LLAMA_MODEL_PATH"],
            ...     temperature=0.1,
            ...     max_new_tokens=256,
            ...     context_window=512,
            ...     messages_to_prompt=messages_to_prompt_v3_instruct,
            ...     completion_to_prompt=completion_to_prompt_v3_instruct,
            ... )
            >>> async def _collect():
            ...     return [c async for c in await llm.acomplete("Once upon", stream=True)]
            >>> chunks = asyncio.run(_collect())
            >>> chunks[-1].text.startswith(chunks[0].delta)
            True
            >>> chunks[-1].text != chunks[0].delta
            True

            ```

    See Also:
        complete: Synchronous variant of this method.
        _complete: Non-streaming implementation called in the thread pool.
        _stream_complete: Streaming implementation called in the thread pool.
    """
    if stream:
        chunks: list[CompletionResponse] = await asyncio.to_thread(
            lambda: list(
                self.complete(prompt, formatted=formatted, stream=True, **kwargs)
            )
        )

        async def gen() -> CompletionResponseAsyncGen:
            for chunk in chunks:
                yield chunk

        result: CompletionResponse | CompletionResponseAsyncGen = gen()
    else:
        result = await asyncio.to_thread(self.complete, prompt, formatted, stream=False, **kwargs)  # type: ignore[arg-type]
    return result

`class_name()` `classmethod` #

Return the canonical class identifier used in serialisation.

Returns:

Type	Description
`str`	The string `"LlamaCPP"`.

Source code in libs/providers/llama-cpp/src/serapeum/llama_cpp/llm.py

@classmethod
def class_name(cls) -> str:
    """Return the canonical class identifier used in serialisation.

    Returns:
        The string ``"LlamaCPP"``.
    """
    return "LlamaCPP"

`complete(prompt, formatted=False, *, stream=False, **kwargs)` #

complete(prompt: str, formatted: bool = ..., *, stream: Literal[False] = ..., **kwargs: Any) -> CompletionResponse

complete(prompt: str, formatted: bool = ..., *, stream: Literal[True], **kwargs: Any) -> CompletionResponseGen

Run text completion, optionally streaming token-by-token.

Parameters:

Name	Type	Description	Default
`prompt`	`str`	The input text to complete.	required
`formatted`	`bool`	When `True`, `prompt` is passed to the model as-is. When `False` (default) it is first wrapped by :attr:`completion_to_prompt` to apply the model's chat template.	`False`
`stream`	`bool`	When `True` returns a :class:`CompletionResponseGen` generator that yields one :class:`CompletionResponse` per token delta. When `False` (default) returns a single :class:`CompletionResponse` with the full completion.	`False`
`**kwargs`	`Any`	Additional keyword arguments forwarded to the underlying `Llama.__call__` (e.g. `top_p`, `repeat_penalty`).	`{}`

Returns:

Type	Description
`CompletionResponse \| CompletionResponseGen`	A :class:`CompletionResponse` when `stream=False`, or a :class:`CompletionResponseGen` generator when `stream=True`.

Raises:

Type	Description
`ValueError`	If `prompt` exceeds :attr:`context_window` tokens.

Examples:

Non-streaming completion — explore the response structure

>>> import os
>>> from serapeum.llama_cpp import LlamaCPP
>>> from serapeum.llama_cpp.formatters.llama3 import (
...     messages_to_prompt_v3_instruct,
...     completion_to_prompt_v3_instruct,
... )
>>> llm = LlamaCPP(
...     model_path=os.environ["LLAMA_MODEL_PATH"],
...     temperature=0.1,
...     max_new_tokens=256,
...     context_window=512,
...     messages_to_prompt=messages_to_prompt_v3_instruct,
...     completion_to_prompt=completion_to_prompt_v3_instruct,
... )
>>> response = llm.complete("Once upon a time")
>>> response.raw["choices"][0]["text"] == response.text
True
>>> sorted(response.raw.keys())
['choices', 'created', 'id', 'model', 'object', 'usage']
>>> response.raw["usage"]["prompt_tokens"] > 0
True

Streaming completion — iterate over token deltas

>>> import os
>>> from serapeum.llama_cpp import LlamaCPP
>>> from serapeum.llama_cpp.formatters.llama3 import (
...     messages_to_prompt_v3_instruct,
...     completion_to_prompt_v3_instruct,
... )
>>> llm = LlamaCPP(
...     model_path=os.environ["LLAMA_MODEL_PATH"],
...     temperature=0.1,
...     max_new_tokens=256,
...     context_window=512,
...     messages_to_prompt=messages_to_prompt_v3_instruct,
...     completion_to_prompt=completion_to_prompt_v3_instruct,
... )
>>> chunks = list(llm.complete("Once upon a time", stream=True))
>>> chunks[-1].text.startswith(chunks[0].delta)
True
>>> chunks[-1].text != chunks[0].delta
True
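
The relationship between `.delta` and `.text` checked in the streaming example can also be shown without loading a model. This is a minimal sketch, assuming a hypothetical `Chunk` dataclass that stands in for `CompletionResponse` and carries only those two fields: each chunk holds the newest piece of text in `delta` and the running total in `text`.

```python
from collections.abc import Iterator
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for CompletionResponse with only two fields."""

    delta: str
    text: str


def stream_chunks(deltas: list[str]) -> Iterator[Chunk]:
    text = ""
    for delta in deltas:
        text += delta  # running total of everything generated so far
        yield Chunk(delta=delta, text=text)


chunks = list(stream_chunks(["Once", " upon", " a", " time"]))
full_text = "".join(c.delta for c in chunks)
print(chunks[-1].text == full_text)
```

Under this scheme a consumer that only wants the final result can read `chunks[-1].text`, while a consumer that renders output incrementally can print each `delta` as it arrives.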

See Also

acomplete: Async variant that offloads inference to a thread pool.
_complete: Non-streaming implementation.
_stream_complete: Streaming implementation.

Source code in libs/providers/llama-cpp/src/serapeum/llama_cpp/llm.py

def complete(
    self,
    prompt: str,
    formatted: bool = False,
    *,
    stream: bool = False,
    **kwargs: Any,
) -> CompletionResponse | CompletionResponseGen:
    """Run text completion, optionally streaming token-by-token.

    Args:
        prompt: The input text to complete.
        formatted: When ``True``, *prompt* is passed to the model as-is.
            When ``False`` (default) it is first wrapped by
            :attr:`completion_to_prompt` to apply the model's chat template.
        stream: When ``True`` returns a :class:`CompletionResponseGen`
            generator that yields one :class:`CompletionResponse` per
            token delta.  When ``False`` (default) returns a single
            :class:`CompletionResponse` with the full completion.
        **kwargs: Additional keyword arguments forwarded to the underlying
            ``Llama.__call__`` (e.g. ``top_p``, ``repeat_penalty``).

    Returns:
        A :class:`CompletionResponse` when ``stream=False``, or a
        :class:`CompletionResponseGen` generator when ``stream=True``.

    Raises:
        ValueError: If *prompt* exceeds :attr:`context_window` tokens.

    Examples:
        - Non-streaming completion — explore the response structure
            ```python
            >>> import os
            >>> from serapeum.llama_cpp import LlamaCPP
            >>> from serapeum.llama_cpp.formatters.llama3 import (
            ...     messages_to_prompt_v3_instruct,
            ...     completion_to_prompt_v3_instruct,
            ... )
            >>> llm = LlamaCPP(
            ...     model_path=os.environ["LLAMA_MODEL_PATH"],
            ...     temperature=0.1,
            ...     max_new_tokens=256,
            ...     context_window=512,
            ...     messages_to_prompt=messages_to_prompt_v3_instruct,
            ...     completion_to_prompt=completion_to_prompt_v3_instruct,
            ... )
            >>> response = llm.complete("Once upon a time")
            >>> response.raw["choices"][0]["text"] == response.text
            True
            >>> sorted(response.raw.keys())
            ['choices', 'created', 'id', 'model', 'object', 'usage']
            >>> response.raw["usage"]["prompt_tokens"] > 0
            True

            ```
        - Streaming completion — iterate over token deltas
            ```python
            >>> import os
            >>> from serapeum.llama_cpp import LlamaCPP
            >>> from serapeum.llama_cpp.formatters.llama3 import (
            ...     messages_to_prompt_v3_instruct,
            ...     completion_to_prompt_v3_instruct,
            ... )
            >>> llm = LlamaCPP(
            ...     model_path=os.environ["LLAMA_MODEL_PATH"],
            ...     temperature=0.1,
            ...     max_new_tokens=256,
            ...     context_window=512,
            ...     messages_to_prompt=messages_to_prompt_v3_instruct,
            ...     completion_to_prompt=completion_to_prompt_v3_instruct,
            ... )
            >>> chunks = list(llm.complete("Once upon a time", stream=True))
            >>> chunks[-1].text.startswith(chunks[0].delta)
            True
            >>> chunks[-1].text != chunks[0].delta
            True

            ```

    See Also:
        acomplete: Async variant that offloads inference to a thread pool.
        _complete: Non-streaming implementation.
        _stream_complete: Streaming implementation.
    """
    if not formatted:
        prompt = self.completion_to_prompt(prompt)  # type: ignore[misc]

    result: CompletionResponse | CompletionResponseGen = (
        self._stream_complete(prompt, **kwargs)
        if stream
        else self._complete(prompt, **kwargs)
    )
    return result

`count_tokens(text)` #

Return the number of tokens `text` encodes to.

Parameters:

Name	Type	Description	Default
`text`	`str`	The input string to count tokens for.	required

Returns:

Type	Description
`int`	Integer token count for `text`.

Examples:

count_tokens is consistent with tokenize

>>> import os
>>> from serapeum.llama_cpp import LlamaCPP
>>> from serapeum.llama_cpp.formatters.llama3 import (
...     messages_to_prompt_v3_instruct,
...     completion_to_prompt_v3_instruct,
... )
>>> llm = LlamaCPP(
...     model_path=os.environ["LLAMA_MODEL_PATH"],
...     temperature=0.1,
...     max_new_tokens=256,
...     context_window=512,
...     messages_to_prompt=messages_to_prompt_v3_instruct,
...     completion_to_prompt=completion_to_prompt_v3_instruct,
... )
>>> llm.count_tokens("Hello!") == len(llm.tokenize("Hello!"))
True

Longer text yields a higher count

>>> import os
>>> from serapeum.llama_cpp import LlamaCPP
>>> from serapeum.llama_cpp.formatters.llama3 import (
...     messages_to_prompt_v3_instruct,
...     completion_to_prompt_v3_instruct,
... )
>>> llm = LlamaCPP(
...     model_path=os.environ["LLAMA_MODEL_PATH"],
...     temperature=0.1,
...     max_new_tokens=256,
...     context_window=512,
...     messages_to_prompt=messages_to_prompt_v3_instruct,
...     completion_to_prompt=completion_to_prompt_v3_instruct,
... )
>>> short_count = llm.count_tokens("Hi")
>>> long_count = llm.count_tokens("Hello, how are you doing today?")
>>> long_count > short_count  # more text yields a higher count
True
>>> short_count == len(llm.tokenize("Hi"))  # consistent with tokenize
True
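
A token count is what makes a context-window check possible. The implementation of `_guard_context` is not shown in this section, so the following is only a hypothetical sketch of such a check: `guard_context` and `whitespace_count` are illustrative names, the whitespace tokenizer is a toy, and whether the real method also reserves room for generated tokens is not something this sketch assumes.

```python
from typing import Callable


def guard_context(
    prompt: str,
    count_tokens: Callable[[str], int],
    context_window: int,
) -> None:
    """Hypothetical check: reject prompts that need more tokens than the window holds."""
    n_tokens = count_tokens(prompt)
    if n_tokens > context_window:
        raise ValueError(
            f"Prompt is {n_tokens} tokens but the context window is {context_window}."
        )


def whitespace_count(text: str) -> int:
    # Toy tokenizer for illustration only; a real model tokenizer differs.
    return len(text.split())


# A prompt that fits passes silently.
guard_context("a short prompt", whitespace_count, context_window=8)

error_message = ""
try:
    guard_context("one two three four", whitespace_count, context_window=3)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```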

See Also

tokenize: Returns the full token ID list.
_guard_context: Calls this method to check prompt length.

Source code in libs/providers/llama-cpp/src/serapeum/llama_cpp/llm.py

def count_tokens(self, text: str) -> int:
    """Return the number of tokens *text* encodes to.

    Args:
        text: The input string to count tokens for.

    Returns:
        Integer token count for *text*.

    Examples:
        - count_tokens is consistent with tokenize
            ```python
            >>> import os
            >>> from serapeum.llama_cpp import LlamaCPP
            >>> from serapeum.llama_cpp.formatters.llama3 import (
            ...     messages_to_prompt_v3_instruct,
            ...     completion_to_prompt_v3_instruct,
            ... )
            >>> llm = LlamaCPP(
            ...     model_path=os.environ["LLAMA_MODEL_PATH"],
            ...     temperature=0.1,
            ...     max_new_tokens=256,
            ...     context_window=512,
            ...     messages_to_prompt=messages_to_prompt_v3_instruct,
            ...     completion_to_prompt=completion_to_prompt_v3_instruct,
            ... )
            >>> llm.count_tokens("Hello!") == len(llm.tokenize("Hello!"))
            True

            ```
        - Longer text yields a higher count
            ```python
            >>> import os
            >>> from serapeum.llama_cpp import LlamaCPP
            >>> from serapeum.llama_cpp.formatters.llama3 import (
            ...     messages_to_prompt_v3_instruct,
            ...     completion_to_prompt_v3_instruct,
            ... )
            >>> llm = LlamaCPP(
            ...     model_path=os.environ["LLAMA_MODEL_PATH"],
            ...     temperature=0.1,
            ...     max_new_tokens=256,
            ...     context_window=512,
            ...     messages_to_prompt=messages_to_prompt_v3_instruct,
            ...     completion_to_prompt=completion_to_prompt_v3_instruct,
            ... )
            >>> short_count = llm.count_tokens("Hi")
            >>> long_count = llm.count_tokens("Hello, how are you doing today?")
            >>> long_count > short_count  # more text yields a higher count
            True
            >>> short_count == len(llm.tokenize("Hi"))  # consistent with tokenize
            True

            ```

    See Also:
        tokenize: Returns the full token ID list.
        _guard_context: Calls this method to check prompt length.
    """
    return len(self.tokenize(text))

`model_post_init(__context)` #

Resolve the model path, download if needed, then load the model.

Called automatically by Pydantic after `__init__`. All validation has already completed before this method runs; it performs only I/O (path resolution, optional download, GGUF loading).

See Also

_resolve_model_path: Locates or downloads the GGUF file.
_load_model: Loads (or retrieves from cache) the Llama instance.

Source code in libs/providers/llama-cpp/src/serapeum/llama_cpp/llm.py

def model_post_init(self, __context: Any) -> None:
    """Resolve the model path, download if needed, then load the model.

    Called automatically by Pydantic after ``__init__``.  All validation
    has already completed before this method runs; it performs only I/O
    (path resolution, optional download, GGUF loading).

    See Also:
        _resolve_model_path: Locates or downloads the GGUF file.
        _load_model: Loads (or retrieves from cache) the Llama instance.
    """
    model_path = self._resolve_model_path()
    self._model = self._load_model(model_path)

`tokenize(text)` #

Return the token IDs for `text` using the loaded model's vocabulary.

Parameters:

Name	Type	Description	Default
`text`	`str`	The input string to tokenize.	required

Returns:

Type	Description
`list[int]`	List of integer token IDs produced by the model's tokenizer.

Examples:

Tokenize a short string and explore the token IDs

>>> import os
>>> from serapeum.llama_cpp import LlamaCPP
>>> from serapeum.llama_cpp.formatters.llama3 import (
...     messages_to_prompt_v3_instruct,
...     completion_to_prompt_v3_instruct,
... )
>>> llm = LlamaCPP(
...     model_path=os.environ["LLAMA_MODEL_PATH"],
...     temperature=0.1,
...     max_new_tokens=256,
...     context_window=512,
...     messages_to_prompt=messages_to_prompt_v3_instruct,
...     completion_to_prompt=completion_to_prompt_v3_instruct,
... )
>>> tokens = llm.tokenize("Hello!")
>>> tokens[0]
1
>>> all(t >= 0 for t in tokens)
True

Longer text produces more tokens

>>> import os
>>> from serapeum.llama_cpp import LlamaCPP
>>> from serapeum.llama_cpp.formatters.llama3 import (
...     messages_to_prompt_v3_instruct,
...     completion_to_prompt_v3_instruct,
... )
>>> llm = LlamaCPP(
...     model_path=os.environ["LLAMA_MODEL_PATH"],
...     temperature=0.1,
...     max_new_tokens=256,
...     context_window=512,
...     messages_to_prompt=messages_to_prompt_v3_instruct,
...     completion_to_prompt=completion_to_prompt_v3_instruct,
... )
>>> short = llm.tokenize("Hi")
>>> long = llm.tokenize("Hello, how are you doing today?")
>>> short[0] == long[0]  # both start with the BOS token
True
>>> len(long) > len(short)  # more text yields more tokens
True

See Also

count_tokens: Returns the token count instead of the full list.
_guard_context: Uses token count to validate prompt length.

Source code in libs/providers/llama-cpp/src/serapeum/llama_cpp/llm.py

def tokenize(self, text: str) -> list[int]:
    """Return the token IDs for *text* using the loaded model's vocabulary.

    Args:
        text: The input string to tokenize.

    Returns:
        List of integer token IDs produced by the model's tokenizer.

    Examples:
        - Tokenize a short string and explore the token IDs
            ```python
            >>> import os
            >>> from serapeum.llama_cpp import LlamaCPP
            >>> from serapeum.llama_cpp.formatters.llama3 import (
            ...     messages_to_prompt_v3_instruct,
            ...     completion_to_prompt_v3_instruct,
            ... )
            >>> llm = LlamaCPP(
            ...     model_path=os.environ["LLAMA_MODEL_PATH"],
            ...     temperature=0.1,
            ...     max_new_tokens=256,
            ...     context_window=512,
            ...     messages_to_prompt=messages_to_prompt_v3_instruct,
            ...     completion_to_prompt=completion_to_prompt_v3_instruct,
            ... )
            >>> tokens = llm.tokenize("Hello!")
            >>> tokens[0]
            1
            >>> all(t >= 0 for t in tokens)
            True

            ```
        - Longer text produces more tokens
            ```python
            >>> import os
            >>> from serapeum.llama_cpp import LlamaCPP
            >>> from serapeum.llama_cpp.formatters.llama3 import (
            ...     messages_to_prompt_v3_instruct,
            ...     completion_to_prompt_v3_instruct,
            ... )
            >>> llm = LlamaCPP(
            ...     model_path=os.environ["LLAMA_MODEL_PATH"],
            ...     temperature=0.1,
            ...     max_new_tokens=256,
            ...     context_window=512,
            ...     messages_to_prompt=messages_to_prompt_v3_instruct,
            ...     completion_to_prompt=completion_to_prompt_v3_instruct,
            ... )
            >>> short = llm.tokenize("Hi")
            >>> long = llm.tokenize("Hello, how are you doing today?")
            >>> short[0] == long[0]  # both start with the BOS token
            True
            >>> len(long) > len(short)  # more text yields more tokens
            True

            ```

    See Also:
        count_tokens: Returns the token count instead of the full list.
        _guard_context: Uses token count to validate prompt length.
    """
    return self._model.tokenize(text.encode())  # type: ignore[no-any-return]

utils module #

`serapeum.llama_cpp.utils` #

Internal utilities for downloading GGUF model files.

This module provides two private helpers used by `serapeum.llama_cpp.LlamaCPP` to resolve a model path before loading:

`_fetch_model_file`: streams a GGUF file from an arbitrary URL with progress reporting and automatic cleanup on failure.
`_fetch_model_file_hf`: downloads from HuggingFace Hub using the huggingface_hub library (optional dependency).

These functions are internal and are not part of the public API. External callers should use `serapeum.llama_cpp.LlamaCPP` directly.
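The "automatic cleanup on failure" behaviour can be illustrated with a small, network-free sketch. It is not the module's implementation: `stream_to_file` and `failing_chunks` are hypothetical names, and an in-memory iterable of byte chunks stands in for an HTTP response body.

```python
import os
import tempfile
from collections.abc import Iterable, Iterator


def stream_to_file(chunks: Iterable[bytes], dest: str) -> int:
    """Write chunks to dest and return bytes written; delete dest on any failure."""
    written = 0
    try:
        with open(dest, "wb") as fh:
            for chunk in chunks:
                fh.write(chunk)
                written += len(chunk)
    except BaseException:
        # The file is already closed here; remove the partial download so a
        # later run never tries to load a truncated model file.
        if os.path.exists(dest):
            os.remove(dest)
        raise
    return written


def failing_chunks() -> Iterator[bytes]:
    """Simulate a connection that drops after the first chunk."""
    yield b"GGUF"
    raise ConnectionError("simulated network drop")


workdir = tempfile.mkdtemp()
ok_path = os.path.join(workdir, "ok.gguf")
bad_path = os.path.join(workdir, "bad.gguf")

print(stream_to_file([b"GGUF", b"\x00" * 12], ok_path))

try:
    stream_to_file(failing_chunks(), bad_path)
except ConnectionError:
    print(os.path.exists(bad_path))
```

Catching `BaseException` in the sketch means an interrupted download (for example Ctrl-C) is cleaned up as well; the exception is always re-raised.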

See Also

serapeum.llama_cpp.llm: The LlamaCPP class that consumes these helpers.

formatters module #

`serapeum.llama_cpp.formatters.llama2` #

Prompt formatters for Llama 2 Chat and Mistral Instruct models.

Implements the [INST] <<SYS>> … <</SYS>> template described in the official Llama 2 blog post: https://huggingface.co/blog/llama2#how-to-prompt-llama-2

This format is compatible with:

Llama 2 Chat (7B, 13B, 70B)
Mistral Instruct v0.1 / v0.2
Any other model trained on the Llama 2 Chat template

Typical usage::

from serapeum.llama_cpp.formatters.llama2 import (
    messages_to_prompt,
    completion_to_prompt,
)

See Also

serapeum.llama_cpp.formatters.llama3: Formatter for Llama 3 Instruct models.

`completion_to_prompt(completion, system_prompt=None)` #

Convert a plain-text completion to Llama 2 Chat single-turn prompt format.

Wraps completion in the [INST] <<SYS>> … <</SYS>> … [/INST] envelope expected by Llama 2 Chat and Mistral Instruct models for single-turn (non-chat) text completion.

Parameters:

Name	Type	Description	Default
`completion`	`str`	The user's instruction or question as plain text.	required
`system_prompt`	`str | None`	System-level instruction inserted inside `<<SYS>>…<</SYS>>`. Defaults to `DEFAULT_SYSTEM_PROMPT` when `None`.	`None`

Returns:

Type	Description
`str`	Prompt string in Llama 2 `<s> [INST] <<SYS>> … <</SYS>> … [/INST]` format, ready to be passed to a Llama 2 / Mistral GGUF model.

Examples:

Build a prompt with a custom system prompt — explore the template structure

>>> from serapeum.llama_cpp.formatters.llama2 import completion_to_prompt
>>> prompt = completion_to_prompt("What is 2+2?", system_prompt="Be brief.")
>>> prompt[:10]
'<s> [INST]'
>>> prompt.rstrip()[-7:]
'[/INST]'
>>> prompt.split("<</SYS>>")[0].split("<<SYS>>")[1].strip()
'Be brief.'
>>> prompt.split("<</SYS>>")[1].split("[/INST]")[0].strip()
'What is 2+2?'

Build a prompt with the default system prompt

>>> from serapeum.llama_cpp.formatters.llama2 import completion_to_prompt, DEFAULT_SYSTEM_PROMPT
>>> prompt = completion_to_prompt("Hello!")
>>> DEFAULT_SYSTEM_PROMPT.strip() in prompt
True
>>> prompt.split("<</SYS>>")[1].split("[/INST]")[0].strip()
'Hello!'

See Also

messages_to_prompt: Multi-turn chat variant for the same model family.
DEFAULT_SYSTEM_PROMPT: Default system instruction used when system_prompt is None.

Source code in libs/providers/llama-cpp/src/serapeum/llama_cpp/formatters/llama2.py

def completion_to_prompt(completion: str, system_prompt: str | None = None) -> str:
    """Convert a plain-text completion to Llama 2 Chat single-turn prompt format.

    Wraps *completion* in the ``[INST] <<SYS>> … <</SYS>> … [/INST]`` envelope
    expected by Llama 2 Chat and Mistral Instruct models for single-turn
    (non-chat) text completion.

    Args:
        completion: The user's instruction or question as plain text.
        system_prompt: System-level instruction inserted inside
            ``<<SYS>>…<</SYS>>``.  Defaults to :data:`DEFAULT_SYSTEM_PROMPT`
            when ``None``.

    Returns:
        Prompt string in Llama 2 ``<s> [INST] <<SYS>> … <</SYS>> … [/INST]``
        format, ready to be passed to a Llama 2 / Mistral GGUF model.

    Examples:
        - Build a prompt with a custom system prompt — explore the template structure
            ```python
            >>> from serapeum.llama_cpp.formatters.llama2 import completion_to_prompt
            >>> prompt = completion_to_prompt("What is 2+2?", system_prompt="Be brief.")
            >>> prompt[:10]
            '<s> [INST]'
            >>> prompt.rstrip()[-7:]
            '[/INST]'
            >>> prompt.split("<</SYS>>")[0].split("<<SYS>>")[1].strip()
            'Be brief.'
            >>> prompt.split("<</SYS>>")[1].split("[/INST]")[0].strip()
            'What is 2+2?'

            ```
        - Build a prompt with the default system prompt
            ```python
            >>> from serapeum.llama_cpp.formatters.llama2 import completion_to_prompt, DEFAULT_SYSTEM_PROMPT
            >>> prompt = completion_to_prompt("Hello!")
            >>> DEFAULT_SYSTEM_PROMPT.strip() in prompt
            True
            >>> prompt.split("<</SYS>>")[1].split("[/INST]")[0].strip()
            'Hello!'

            ```

    See Also:
        messages_to_prompt: Multi-turn chat variant for the same model family.
        DEFAULT_SYSTEM_PROMPT: Default system instruction used when system_prompt is None.
    """
    system_prompt_str = system_prompt or DEFAULT_SYSTEM_PROMPT

    return (
        f"{BOS} {B_INST} {B_SYS} {system_prompt_str.strip()} {E_SYS} "
        f"{completion.strip()} {E_INST}"
    )
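The single-turn template assembled by `completion_to_prompt` can be reproduced in isolation. The sketch below is illustrative only: the constant values are assumptions (the module's real definitions of `BOS`, `B_INST`, `B_SYS`, `E_SYS`, and `E_INST` are not shown on this page), chosen to be consistent with the doctests, and `llama2_single_turn` is a hypothetical name.

```python
# Assumed constant values; the exact whitespace inside B_SYS / E_SYS is a guess.
BOS = "<s>"
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def llama2_single_turn(completion: str, system_prompt: str) -> str:
    """Wrap one instruction in the Llama 2 [INST] <<SYS>> ... [/INST] envelope."""
    return (
        f"{BOS} {B_INST} {B_SYS} {system_prompt.strip()} {E_SYS} "
        f"{completion.strip()} {E_INST}"
    )


prompt = llama2_single_turn("What is 2+2?", "Be brief.")
print(repr(prompt))
```

Printing the `repr` makes the newlines around the system block visible, which helps when comparing against a model card's reference template.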

`messages_to_prompt(messages, system_prompt=None)` #

Convert a sequence of chat messages to Llama 2 Chat prompt format.

Reference: https://huggingface.co/blog/llama2#how-to-prompt-llama-2

Parameters:

Name	Type	Description	Default
`messages`	`Sequence[Message]`	Ordered sequence of chat messages. If the first message has role SYSTEM it is extracted as the system prompt; otherwise system_prompt (or `DEFAULT_SYSTEM_PROMPT`) is used. Remaining messages must alternate USER / ASSISTANT.	required
`system_prompt`	`str | None`	Optional system-level instruction. Ignored when the first message already carries role SYSTEM.	`None`

Returns:

Type	Description
`str`	Prompt string in Llama 2 `[INST] … [/INST]` format.

Raises:

Type	Description
`ValueError`	If `messages` is empty, or if a USER or ASSISTANT message appears in the wrong position in the alternating sequence.

Examples:

Single user message with a custom system prompt — explore the structure

>>> from serapeum.llama_cpp.formatters.llama2 import messages_to_prompt
>>> from serapeum.core.llms import Message, MessageRole, TextChunk
>>> messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Hello!")])]
>>> prompt = messages_to_prompt(messages, system_prompt="Be brief.")
>>> prompt[:10]
'<s> [INST]'
>>> prompt.split("<</SYS>>")[0].split("<<SYS>>")[1].strip()
'Be brief.'
>>> prompt.split("[/INST]")[0].split("<</SYS>>")[1].strip()
'Hello!'

Multi-turn conversation — each turn is wrapped in [INST]...[/INST]

>>> from serapeum.llama_cpp.formatters.llama2 import messages_to_prompt
>>> from serapeum.core.llms import Message, MessageRole, TextChunk
>>> messages = [
...     Message(role=MessageRole.USER, chunks=[TextChunk(content="What is 2+2?")]),
...     Message(role=MessageRole.ASSISTANT, chunks=[TextChunk(content="4")]),
...     Message(role=MessageRole.USER, chunks=[TextChunk(content="And 3+3?")]),
... ]
>>> prompt = messages_to_prompt(messages, system_prompt="Be brief.")
>>> prompt.count("[INST]")
2
>>> prompt.count("[/INST]")
2

Explicit SYSTEM message in the conversation is extracted as system prompt

>>> from serapeum.llama_cpp.formatters.llama2 import messages_to_prompt
>>> from serapeum.core.llms import Message, MessageRole, TextChunk
>>> messages = [
...     Message(role=MessageRole.SYSTEM, chunks=[TextChunk(content="You are terse.")]),
...     Message(role=MessageRole.USER, chunks=[TextChunk(content="Hi!")]),
... ]
>>> prompt = messages_to_prompt(messages)
>>> prompt.split("<</SYS>>")[0].split("<<SYS>>")[1].strip()
'You are terse.'

See Also

completion_to_prompt: Single-turn variant for the same model family.
DEFAULT_SYSTEM_PROMPT: Default system instruction used when system_prompt is None.

Source code in libs/providers/llama-cpp/src/serapeum/llama_cpp/formatters/llama2.py

def messages_to_prompt(
    messages: Sequence[Message], system_prompt: str | None = None
) -> str:
    """Convert a sequence of chat messages to Llama 2 Chat prompt format.

    Reference: https://huggingface.co/blog/llama2#how-to-prompt-llama-2

    Args:
        messages: Ordered sequence of chat messages. If the first message has
            role SYSTEM it is extracted as the system prompt; otherwise
            *system_prompt* (or ``DEFAULT_SYSTEM_PROMPT``) is used. Remaining
            messages must alternate USER / ASSISTANT.
        system_prompt: Optional system-level instruction. Ignored when the
            first message already carries role SYSTEM.

    Returns:
        Prompt string in Llama 2 ``[INST] … [/INST]`` format.

    Raises:
        ValueError: If *messages* is empty, or if a USER or ASSISTANT message
            appears in the wrong position in the alternating sequence.

    Examples:
        - Single user message with a custom system prompt — explore the structure
            ```python
            >>> from serapeum.llama_cpp.formatters.llama2 import messages_to_prompt
            >>> from serapeum.core.llms import Message, MessageRole, TextChunk
            >>> messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Hello!")])]
            >>> prompt = messages_to_prompt(messages, system_prompt="Be brief.")
            >>> prompt[:10]
            '<s> [INST]'
            >>> prompt.split("<</SYS>>")[0].split("<<SYS>>")[1].strip()
            'Be brief.'
            >>> prompt.split("[/INST]")[0].split("<</SYS>>")[1].strip()
            'Hello!'

            ```
        - Multi-turn conversation — each turn is wrapped in [INST]...[/INST]
            ```python
            >>> from serapeum.llama_cpp.formatters.llama2 import messages_to_prompt
            >>> from serapeum.core.llms import Message, MessageRole, TextChunk
            >>> messages = [
            ...     Message(role=MessageRole.USER, chunks=[TextChunk(content="What is 2+2?")]),
            ...     Message(role=MessageRole.ASSISTANT, chunks=[TextChunk(content="4")]),
            ...     Message(role=MessageRole.USER, chunks=[TextChunk(content="And 3+3?")]),
            ... ]
            >>> prompt = messages_to_prompt(messages, system_prompt="Be brief.")
            >>> prompt.count("[INST]")
            2
            >>> prompt.count("[/INST]")
            2

            ```
        - Explicit SYSTEM message in the conversation is extracted as system prompt
            ```python
            >>> from serapeum.llama_cpp.formatters.llama2 import messages_to_prompt
            >>> from serapeum.core.llms import Message, MessageRole, TextChunk
            >>> messages = [
            ...     Message(role=MessageRole.SYSTEM, chunks=[TextChunk(content="You are terse.")]),
            ...     Message(role=MessageRole.USER, chunks=[TextChunk(content="Hi!")]),
            ... ]
            >>> prompt = messages_to_prompt(messages)
            >>> prompt.split("<</SYS>>")[0].split("<<SYS>>")[1].strip()
            'You are terse.'

            ```

    See Also:
        completion_to_prompt: Single-turn variant for the same model family.
        DEFAULT_SYSTEM_PROMPT: Default system instruction used when system_prompt is None.
    """
    if not messages:
        raise ValueError(
            "messages must contain at least one message. "
            "Pass at least a USER message."
        )

    string_messages: list[str] = []
    if messages[0].role == MessageRole.SYSTEM:
        system_message_str = messages[0].content or ""
        remaining = list(messages[1:])
    else:
        system_message_str = system_prompt or DEFAULT_SYSTEM_PROMPT
        remaining = list(messages)

    system_message_str = f"{B_SYS} {system_message_str.strip()} {E_SYS}"

    for i in range(0, len(remaining), 2):
        user_message = remaining[i]
        if user_message.role != MessageRole.USER:
            raise ValueError(
                f"Expected a USER message at position {i}, "
                f"got role {user_message.role!r}."
            )

        if i == 0:
            str_message = f"{BOS} {B_INST} {system_message_str} "
        else:
            string_messages[-1] += f" {EOS}"
            str_message = f"{BOS} {B_INST} "

        str_message += f"{user_message.content} {E_INST}"

        if len(remaining) > (i + 1):
            assistant_message = remaining[i + 1]
            if assistant_message.role != MessageRole.ASSISTANT:
                raise ValueError(
                    f"Expected an ASSISTANT message at position {i + 1}, "
                    f"got role {assistant_message.role!r}."
                )
            str_message += f" {assistant_message.content}"

        string_messages.append(str_message)

    return "".join(string_messages)
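How turn boundaries are laid out in the multi-turn case can be traced with a simplified re-creation of the loop. This is a sketch under stated assumptions, not the library function: turns are plain `(user_text, assistant_text)` tuples instead of `Message` objects, role validation is left out, `llama2_chat` is a hypothetical name, and the constant values (including `EOS = "</s>"`) are assumed.

```python
# Assumed constant values; the module's real definitions are not shown here.
BOS, EOS = "<s>", "</s>"
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def llama2_chat(turns: list[tuple[str, str]], system_prompt: str) -> str:
    """Format (user, assistant) turns; use "" for a reply that is still pending."""
    parts: list[str] = []
    for i, (user, assistant) in enumerate(turns):
        if i == 0:
            # Only the first turn carries the system block.
            part = f"{BOS} {B_INST} {B_SYS} {system_prompt.strip()} {E_SYS} "
        else:
            # Close the previous turn with EOS, then open a new one with BOS.
            parts[-1] += f" {EOS}"
            part = f"{BOS} {B_INST} "
        part += f"{user} {E_INST}"
        if assistant:
            part += f" {assistant}"
        parts.append(part)
    return "".join(parts)


prompt = llama2_chat([("What is 2+2?", "4"), ("And 3+3?", "")], "Be brief.")
print(prompt)
```

The final user turn is left open after `[/INST]`, so the model's next tokens become the assistant reply.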

`serapeum.llama_cpp.formatters.llama3` #

Prompt formatters for Llama 3 Instruct models.

This format is compatible with:

Meta-Llama-3-8B-Instruct
Meta-Llama-3-70B-Instruct
Any other model trained on the Llama 3 chat template

Note

<|begin_of_text|> is intentionally omitted because llama-cpp-python adds it automatically when it tokenizes the prompt.

Typical usage::

from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

See Also

serapeum.llama_cpp.formatters.llama2: Formatter for Llama 2 / Mistral models.

`completion_to_prompt_v3_instruct(completion, system_prompt=None)` #

Convert a plain-text completion to Llama 3 Instruct single-turn prompt format.

Note

<|begin_of_text|> is intentionally omitted; llama-cpp-python adds it automatically when it tokenizes the prompt.

Parameters:

Name	Type	Description	Default
`completion`	`str`	The user's instruction or question as plain text.	required
`system_prompt`	`str | None`	System-level instruction inserted in the system header block. Defaults to `DEFAULT_SYSTEM_PROMPT` when `None`.	`None`

Returns:

Type	Description
`str`	Prompt string ending with the `<|start_header_id|>assistant<|end_header_id|>\n\n` header that prompts the model to generate its reply.

Examples:

Build a prompt with a custom system prompt — explore the sections

>>> from serapeum.llama_cpp.formatters.llama3 import completion_to_prompt_v3_instruct
>>> prompt = completion_to_prompt_v3_instruct("What is 2+2?", "Be brief.")
>>> sections = prompt.split("<|eot_id|>")
>>> sections[0].split("<|end_header_id|>")[1].strip()
'Be brief.'
>>> sections[1].split("<|end_header_id|>")[1].strip()
'What is 2+2?'
>>> prompt.strip().endswith("<|end_header_id|>")
True

Build a prompt with the default system prompt

>>> from serapeum.llama_cpp.formatters.llama3 import completion_to_prompt_v3_instruct, DEFAULT_SYSTEM_PROMPT
>>> prompt = completion_to_prompt_v3_instruct("Hello!")
>>> DEFAULT_SYSTEM_PROMPT.strip() in prompt
True
>>> prompt.split("<|eot_id|>")[1].split("<|end_header_id|>")[1].strip()
'Hello!'

See Also

messages_to_prompt_v3_instruct: Multi-turn chat variant for the same model family.
DEFAULT_SYSTEM_PROMPT: Default system instruction used when system_prompt is None.

Source code in libs/providers/llama-cpp/src/serapeum/llama_cpp/formatters/llama3.py

def completion_to_prompt_v3_instruct(
    completion: str, system_prompt: str | None = None
) -> str:
    r"""Convert a plain-text completion to Llama 3 Instruct single-turn prompt format.

    Wraps *completion* in the ``<|start_header_id|>user<|end_header_id|>`` /
    ``<|eot_id|>`` envelope expected by Llama 3 Instruct models for single-turn
    (non-chat) text completion.

    Note:
        ``<|begin_of_text|>`` is intentionally omitted; llama-cpp-python adds
        it automatically when it tokenizes the prompt.

    Args:
        completion: The user's instruction or question as plain text.
        system_prompt: System-level instruction inserted in the system header
            block.  Defaults to :data:`DEFAULT_SYSTEM_PROMPT` when ``None``.

    Returns:
        Prompt string ending with the
        ``<|start_header_id|>assistant<|end_header_id|>\n\n`` header that
        prompts the model to generate its reply.

    Examples:
        - Build a prompt with a custom system prompt — explore the sections
            ```python
            >>> from serapeum.llama_cpp.formatters.llama3 import completion_to_prompt_v3_instruct
            >>> prompt = completion_to_prompt_v3_instruct("What is 2+2?", "Be brief.")
            >>> sections = prompt.split("<|eot_id|>")
            >>> sections[0].split("<|end_header_id|>")[1].strip()
            'Be brief.'
            >>> sections[1].split("<|end_header_id|>")[1].strip()
            'What is 2+2?'
            >>> prompt.strip().endswith("<|end_header_id|>")
            True

            ```
        - Build a prompt with the default system prompt
            ```python
            >>> from serapeum.llama_cpp.formatters.llama3 import completion_to_prompt_v3_instruct, DEFAULT_SYSTEM_PROMPT
            >>> prompt = completion_to_prompt_v3_instruct("Hello!")
            >>> DEFAULT_SYSTEM_PROMPT.strip() in prompt
            True
            >>> prompt.split("<|eot_id|>")[1].split("<|end_header_id|>")[1].strip()
            'Hello!'

            ```

    See Also:
        messages_to_prompt_v3_instruct: Multi-turn chat variant for the same model family.
        DEFAULT_SYSTEM_PROMPT: Default system instruction used when system_prompt is None.
    """
    system_prompt_str = system_prompt or DEFAULT_SYSTEM_PROMPT

    return (
        f"{HEADER_SYS}{system_prompt_str.strip()}{EOT}"
        f"{HEADER_USER}{completion.strip()}{EOT}"
        f"{HEADER_ASSIST}"
    )
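The Llama 3 single-turn layout can be inspected the same way. In this sketch the header constants are assumptions modelled on the published Llama 3 chat template (the module's actual `HEADER_SYS`, `HEADER_USER`, `HEADER_ASSIST`, and `EOT` values are not shown on this page), and `llama3_single_turn` is a hypothetical name.

```python
# Assumed constant values; each header is followed by a blank line.
HEADER_SYS = "<|start_header_id|>system<|end_header_id|>\n\n"
HEADER_USER = "<|start_header_id|>user<|end_header_id|>\n\n"
HEADER_ASSIST = "<|start_header_id|>assistant<|end_header_id|>\n\n"
EOT = "<|eot_id|>"


def llama3_single_turn(completion: str, system_prompt: str) -> str:
    """System block, user block, then an open assistant header for the reply."""
    return (
        f"{HEADER_SYS}{system_prompt.strip()}{EOT}"
        f"{HEADER_USER}{completion.strip()}{EOT}"
        f"{HEADER_ASSIST}"
    )


prompt = llama3_single_turn("What is 2+2?", "Be brief.")
print(prompt)
```

Splitting on `<|eot_id|>` yields three sections: the system block, the user block, and the trailing assistant header that the model continues from.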

`messages_to_prompt_v3_instruct(messages, system_prompt=None)` #

Convert a sequence of chat messages to Llama 3 Instruct format.

Reference: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/

Note: <|begin_of_text|> is intentionally omitted because llama-cpp-python adds it automatically when it tokenizes the prompt.

Parameters:

Name	Type	Description	Default
`messages`	`Sequence[Message]`	Ordered sequence of chat messages. If the first message has role SYSTEM it is extracted as the system prompt; otherwise system_prompt (or `DEFAULT_SYSTEM_PROMPT`) is used. Remaining messages must alternate USER / ASSISTANT.	required
`system_prompt`	`str | None`	Optional system-level instruction. Ignored when the first message already carries role SYSTEM.	`None`

Returns:

Type	Description
`str`	Prompt string in Llama 3 `<|start_header_id|>…<|eot_id|>` format.

Raises:

Type	Description
`ValueError`	If `messages` is empty, or if a USER or ASSISTANT message appears in the wrong position in the alternating sequence.

Examples:

Single user message with a custom system prompt — explore the structure

>>> from serapeum.llama_cpp.formatters.llama3 import messages_to_prompt_v3_instruct
>>> from serapeum.core.llms import Message, MessageRole, TextChunk
>>> messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Hello!")])]
>>> prompt = messages_to_prompt_v3_instruct(messages, system_prompt="Be brief.")
>>> prompt.split("<|eot_id|>")[0].split("<|end_header_id|>")[1].strip()
'Be brief.'
>>> prompt.split("<|eot_id|>")[1].split("<|end_header_id|>")[1].strip()
'Hello!'

Multi-turn conversation — prompt ends with the assistant header

>>> from serapeum.llama_cpp.formatters.llama3 import messages_to_prompt_v3_instruct
>>> from serapeum.core.llms import Message, MessageRole, TextChunk
>>> messages = [
...     Message(role=MessageRole.USER, chunks=[TextChunk(content="What is 2+2?")]),
...     Message(role=MessageRole.ASSISTANT, chunks=[TextChunk(content="4")]),
...     Message(role=MessageRole.USER, chunks=[TextChunk(content="And 3+3?")]),
... ]
>>> prompt = messages_to_prompt_v3_instruct(messages, system_prompt="Be brief.")
>>> prompt.count("<|eot_id|>")
4
>>> prompt.strip().endswith("<|end_header_id|>")
True

Explicit SYSTEM message is extracted as system prompt

>>> from serapeum.llama_cpp.formatters.llama3 import messages_to_prompt_v3_instruct
>>> from serapeum.core.llms import Message, MessageRole, TextChunk
>>> messages = [
...     Message(role=MessageRole.SYSTEM, chunks=[TextChunk(content="You are terse.")]),
...     Message(role=MessageRole.USER, chunks=[TextChunk(content="Hi!")]),
... ]
>>> prompt = messages_to_prompt_v3_instruct(messages)
>>> prompt.split("<|eot_id|>")[0].split("<|end_header_id|>")[1].strip()
'You are terse.'

See Also

completion_to_prompt_v3_instruct: Single-turn variant for the same model family.
DEFAULT_SYSTEM_PROMPT: Default system instruction used when system_prompt is None.

Source code in libs/providers/llama-cpp/src/serapeum/llama_cpp/formatters/llama3.py

def messages_to_prompt_v3_instruct(
    messages: Sequence[Message], system_prompt: str | None = None
) -> str:
    """Convert a sequence of chat messages to Llama 3 Instruct format.

    Reference: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/

    Note: ``<|begin_of_text|>`` is intentionally omitted because llama-cpp-python
    adds it automatically when it tokenizes the prompt.

    Args:
        messages: Ordered sequence of chat messages. If the first message has
            role SYSTEM it is extracted as the system prompt; otherwise
            *system_prompt* (or ``DEFAULT_SYSTEM_PROMPT``) is used. Remaining
            messages must alternate USER / ASSISTANT.
        system_prompt: Optional system-level instruction. Ignored when the
            first message already carries role SYSTEM.

    Returns:
        Prompt string in Llama 3 ``<|start_header_id|>…<|eot_id|>`` format.

    Raises:
        ValueError: If *messages* is empty, or if a USER or ASSISTANT message
            appears in the wrong position in the alternating sequence.

    Examples:
        - Single user message with a custom system prompt — explore the structure
            ```python
            >>> from serapeum.llama_cpp.formatters.llama3 import messages_to_prompt_v3_instruct
            >>> from serapeum.core.llms import Message, MessageRole, TextChunk
            >>> messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Hello!")])]
            >>> prompt = messages_to_prompt_v3_instruct(messages, system_prompt="Be brief.")
            >>> prompt.split("<|eot_id|>")[0].split("<|end_header_id|>")[1].strip()
            'Be brief.'
            >>> prompt.split("<|eot_id|>")[1].split("<|end_header_id|>")[1].strip()
            'Hello!'

            ```
        - Multi-turn conversation — prompt ends with the assistant header
            ```python
            >>> from serapeum.llama_cpp.formatters.llama3 import messages_to_prompt_v3_instruct
            >>> from serapeum.core.llms import Message, MessageRole, TextChunk
            >>> messages = [
            ...     Message(role=MessageRole.USER, chunks=[TextChunk(content="What is 2+2?")]),
            ...     Message(role=MessageRole.ASSISTANT, chunks=[TextChunk(content="4")]),
            ...     Message(role=MessageRole.USER, chunks=[TextChunk(content="And 3+3?")]),
            ... ]
            >>> prompt = messages_to_prompt_v3_instruct(messages, system_prompt="Be brief.")
            >>> prompt.count("<|eot_id|>")
            4
            >>> prompt.strip().endswith("<|end_header_id|>")
            True

            ```
        - Explicit SYSTEM message is extracted as system prompt
            ```python
            >>> from serapeum.llama_cpp.formatters.llama3 import messages_to_prompt_v3_instruct
            >>> from serapeum.core.llms import Message, MessageRole, TextChunk
            >>> messages = [
            ...     Message(role=MessageRole.SYSTEM, chunks=[TextChunk(content="You are terse.")]),
            ...     Message(role=MessageRole.USER, chunks=[TextChunk(content="Hi!")]),
            ... ]
            >>> prompt = messages_to_prompt_v3_instruct(messages)
            >>> prompt.split("<|eot_id|>")[0].split("<|end_header_id|>")[1].strip()
            'You are terse.'

            ```

    See Also:
        completion_to_prompt_v3_instruct: Single-turn variant for the same model family.
        DEFAULT_SYSTEM_PROMPT: Default system instruction used when system_prompt is None.
    """
    if not messages:
        raise ValueError(
            "messages must contain at least one message. "
            "Pass at least a USER message."
        )

    string_messages: list[str] = []
    if messages[0].role == MessageRole.SYSTEM:
        system_message_str = messages[0].content or ""
        remaining = list(messages[1:])
    else:
        system_message_str = system_prompt or DEFAULT_SYSTEM_PROMPT
        remaining = list(messages)

    string_messages.append(f"{HEADER_SYS}{system_message_str.strip()}{EOT}")

    for i in range(0, len(remaining), 2):
        user_message = remaining[i]
        if user_message.role != MessageRole.USER:
            raise ValueError(
                f"Expected a USER message at position {i}, "
                f"got role {user_message.role!r}."
            )
        str_message = f"{HEADER_USER}{user_message.content}{EOT}"

        if len(remaining) > (i + 1):
            assistant_message = remaining[i + 1]
            if assistant_message.role != MessageRole.ASSISTANT:
                raise ValueError(
                    f"Expected an ASSISTANT message at position {i + 1}, "
                    f"got role {assistant_message.role!r}."
                )
            str_message += f"{HEADER_ASSIST}{assistant_message.content}{EOT}"

        string_messages.append(str_message)

    string_messages.append(HEADER_ASSIST)

    return "".join(string_messages)
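The ordering rule that both chat formatters enforce (an optional leading SYSTEM message, then strictly alternating USER / ASSISTANT) can be isolated into a small check. This is a hedged sketch: `Role` and `check_alternation` are hypothetical stand-ins for `MessageRole` and the validation embedded in the formatter loops, and the error messages are paraphrased.

```python
from enum import Enum


class Role(Enum):
    """Hypothetical stand-in for the library's MessageRole."""

    SYSTEM = "system"
    USER = "user"
    ASSISTANT = "assistant"


def check_alternation(roles: list[Role]) -> None:
    """Raise ValueError unless roles are [SYSTEM,] USER, ASSISTANT, USER, ..."""
    if not roles:
        raise ValueError("messages must contain at least one message.")
    # A leading SYSTEM message is consumed as the system prompt.
    remaining = roles[1:] if roles[0] is Role.SYSTEM else roles
    for i, role in enumerate(remaining):
        expected = Role.USER if i % 2 == 0 else Role.ASSISTANT
        if role is not expected:
            raise ValueError(
                f"Expected {expected.name} at position {i}, got {role.name}."
            )


# A well-formed conversation passes silently.
check_alternation([Role.SYSTEM, Role.USER, Role.ASSISTANT, Role.USER])

# Each malformed sequence below is rejected with a ValueError.
for bad in ([], [Role.ASSISTANT], [Role.USER, Role.USER]):
    try:
        check_alternation(bad)
    except ValueError as exc:
        print(exc)
```

Positions in the error message are counted after the SYSTEM message has been removed, which mirrors the indexing over `remaining` in the source listings.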