LlamaCPP Usage Examples#

This guide provides comprehensive examples of the main ways to use the LlamaCPP class, based on real test cases from the codebase.

Prerequisites: Model File#

LlamaCPP requires a GGUF model file. For a quick test you can download the tiny stories260K model (~500 KB):

curl -L -o stories260K.gguf \
  https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K.gguf

Alternatively, paste the URL into your browser and the download will start.

Set the LLAMA_MODEL_PATH environment variable to the downloaded file:

export LLAMA_MODEL_PATH="/path/to/stories260K.gguf"

Or add it to a .env file in your project root:

# .env
LLAMA_MODEL_PATH=/path/to/stories260K.gguf

If you use a .env file, load it before running:

from dotenv import load_dotenv
load_dotenv()  # loads LLAMA_MODEL_PATH from .env

All examples below read the model path from this environment variable.

You can also let LlamaCPP download a model automatically via model_url or hf_model_id (see Initialization Patterns).
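
To make the examples fail fast when the variable is missing or points at a file that does not exist, a small standard-library check is enough (a minimal sketch; nothing here is LlamaCPP-specific):

import os
from pathlib import Path

model_path = os.environ.get("LLAMA_MODEL_PATH")
if not model_path or not Path(model_path).is_file():
    raise FileNotFoundError(
        "Set LLAMA_MODEL_PATH to a local GGUF model file before running the examples."
    )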


Table of Contents#

  1. Basic Usage
  2. Initialization Patterns
  3. Completion Operations
  4. Chat Operations
  5. Streaming Operations
  6. Async Operations
  7. Integration with Orchestrators
  8. Prompt Formatters
  9. Tokenization
  10. Best Practices

Basic Usage#

Simple Completion#

The most straightforward way to use LlamaCPP:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

model_path = os.environ.get("LLAMA_MODEL_PATH")
llm = LlamaCPP(
    model_path=model_path,
    temperature=0.1,
    max_new_tokens=256,
    context_window=8192,
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

response = llm.complete("Say hello.")
print(response.text)  # "Hello! How are you?"

Simple Chat#

Using the chat API (via CompletionToChatMixin):

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="What is 2+2?")])]
response = llm.chat(messages)
print(response.message.content)  # "4"

Initialization Patterns#

1. From Local Model Path#

Load a GGUF file already on disk:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

2. From URL Download#

Download and cache a GGUF model from a direct URL:

from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama2 import (
    messages_to_prompt,
    completion_to_prompt,
)

llm = LlamaCPP(
    model_url="https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K.gguf",
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)

3. From HuggingFace Hub#

Download from HuggingFace Hub using huggingface_hub:

from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama2 import (
    messages_to_prompt,
    completion_to_prompt,
)

llm = LlamaCPP(
    hf_model_id="ggml-org/models",
    hf_filename="tinyllamas/stories260K.gguf",
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)

4. Full Configuration#

With all common parameters:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    temperature=0.2,
    max_new_tokens=512,
    context_window=8192,
    n_gpu_layers=-1,           # Offload all layers to GPU
    stop=["</s>", "<|eot_id|>"],
    verbose=False,
    generate_kwargs={"top_p": 0.9, "top_k": 40},
    model_kwargs={"n_threads": 8},
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

5. With Llama 2 Formatters#

Using Llama 2 / Mistral-style prompt templates:

from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama2 import (
    messages_to_prompt,
    completion_to_prompt,
)

llm = LlamaCPP(
    model_path="/models/llama-2-13b-chat.Q4_0.gguf",
    temperature=0.1,
    max_new_tokens=256,
    context_window=4096,
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)

Completion Operations#

1. Basic Completion#

Simple text completion:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

response = llm.complete("The capital of France is")
print(response.text)  # "Paris"

2. Completion with Custom Parameters#

Override default generation settings:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

response = llm.complete(
    "Once upon a time",
    temperature=0.8,
    max_tokens=200,
)
print(response.text)

3. Pre-formatted Prompt#

Pass an already-formatted prompt:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

# Build the prompt manually
prompt = completion_to_prompt_v3_instruct("Say hello.")

# Pass formatted=True to skip the automatic formatter
response = llm.complete(prompt, formatted=True)
print(response.text)

4. Completion with Stop Tokens#

Stop generation at specific tokens:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    stop=["</s>", "<|eot_id|>", "<|end|>"],
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

response = llm.complete("Say hello.")
# Stop tokens are stripped from output

5. Raw Response Access#

Access the underlying llama-cpp-python response:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

response = llm.complete("Say hello.")
print(response.raw)  # Full dict from llama-cpp-python
print(response.raw["choices"])  # List of generated choices
print(response.raw["usage"])    # Token usage statistics

Chat Operations#

Chat is provided by CompletionToChatMixin, which formats messages using messages_to_prompt and delegates to complete().
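
The same delegation can be reproduced by hand, which is useful for understanding what chat() sends to the model. The sketch below is illustrative rather than the library's actual implementation; chat_via_completion is a hypothetical helper built only from APIs shown elsewhere in this guide:

from serapeum.llama_cpp.formatters.llama3 import messages_to_prompt_v3_instruct


def chat_via_completion(llm, messages):
    # Flatten the conversation into a single prompt string with the formatter,
    # then reuse the completion endpoint, skipping the automatic formatting step.
    prompt = messages_to_prompt_v3_instruct(messages)
    response = llm.complete(prompt, formatted=True)
    return response.text  # chat() additionally wraps this in an assistant Message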

1. Single Turn Chat#

Basic conversation:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="What is 2+2?")])]
response = llm.chat(messages)
print(response.message.content)  # "4"
print(response.message.role)     # MessageRole.ASSISTANT

2. Multi-turn Conversation#

With conversation history:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

messages = [
    Message(role=MessageRole.USER, chunks=[TextChunk(content="My name is Alex.")]),
    Message(role=MessageRole.ASSISTANT, chunks=[TextChunk(content="Nice to meet you, Alex!")]),
    Message(role=MessageRole.USER, chunks=[TextChunk(content="What is my name?")]),
]

response = llm.chat(messages)
print(response.message.content)  # Should mention "Alex"

3. Chat with System Message#

System message for context:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

messages = [
    Message(role=MessageRole.SYSTEM, chunks=[TextChunk(content="You are a helpful math tutor.")]),
    Message(role=MessageRole.USER, chunks=[TextChunk(content="What is the square root of 144?")]),
]

response = llm.chat(messages)
print(response.message.content)  # "12"

Streaming Operations#

1. Stream Completion#

Real-time streaming completion:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

for chunk in llm.complete("Count from 1 to 5.", stream=True):
    print(chunk.delta, end="", flush=True)
    # "1" ", " "2" ", " "3" ", " "4" ", " "5"

2. Stream Chat#

Real-time streaming chat:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Tell me a joke.")])]
for chunk in llm.chat(messages, stream=True):
    print(chunk.delta, end="", flush=True)

3. Processing Stream with Delta#

Access incremental content:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

full_response = ""
for chunk in llm.complete("Say hello.", stream=True):
    delta = chunk.delta  # Incremental content
    if delta:
        full_response += delta
        print(delta, end="", flush=True)

print(f"\n\nFull response: {full_response}")

# Alternatively, the last chunk's .text contains the full accumulated text:
chunks = list(llm.complete("Say hello.", stream=True))
final_text = chunks[-1].text  # Same as joining all deltas

Async Operations#

LlamaCPP offloads CPU-bound inference to a thread pool via asyncio.to_thread, so async calls do not block the event loop.
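
The same pattern applies if you need to call other blocking LlamaCPP methods from async code. A minimal sketch (run_in_thread is a hypothetical helper, not part of the library):

import asyncio


async def run_in_thread(llm, prompt: str):
    # complete() is synchronous and CPU-bound; asyncio.to_thread runs it on a
    # worker thread so the event loop remains free to serve other tasks.
    return await asyncio.to_thread(llm.complete, prompt)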

1. Async Completion#

Non-blocking completion:

import os
import asyncio
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)


async def main():
    llm = await asyncio.to_thread(
        LlamaCPP,
        model_path=os.environ.get("LLAMA_MODEL_PATH"),
        messages_to_prompt=messages_to_prompt_v3_instruct,
        completion_to_prompt=completion_to_prompt_v3_instruct,
    )

    response = await llm.acomplete("Say hello.")
    print(response.text)


asyncio.run(main())

2. Async Chat#

Non-blocking chat:

import os
import asyncio
from serapeum.llama_cpp import LlamaCPP
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)


async def main():
    llm = await asyncio.to_thread(
        LlamaCPP,
        model_path=os.environ.get("LLAMA_MODEL_PATH"),
        messages_to_prompt=messages_to_prompt_v3_instruct,
        completion_to_prompt=completion_to_prompt_v3_instruct,
    )

    messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Hello!")])]
    response = await llm.achat(messages)
    print(response.message.content)


asyncio.run(main())

3. Async Streaming#

Non-blocking streaming completion:

import os
import asyncio
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)


async def main():
    llm = await asyncio.to_thread(
        LlamaCPP,
        model_path=os.environ.get("LLAMA_MODEL_PATH"),
        messages_to_prompt=messages_to_prompt_v3_instruct,
        completion_to_prompt=completion_to_prompt_v3_instruct,
    )

    gen = await llm.acomplete("Count to 5", stream=True)
    async for chunk in gen:
        print(chunk.delta, end="", flush=True)


asyncio.run(main())

4. Concurrent Async Requests#

Process multiple requests concurrently:

import os
import asyncio
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)


async def main():
    llm = await asyncio.to_thread(
        LlamaCPP,
        model_path=os.environ.get("LLAMA_MODEL_PATH"),
        messages_to_prompt=messages_to_prompt_v3_instruct,
        completion_to_prompt=completion_to_prompt_v3_instruct,
    )

    prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]

    tasks = [llm.acomplete(prompt) for prompt in prompts]
    responses = await asyncio.gather(*tasks)

    for prompt, response in zip(prompts, responses):
        print(f"{prompt} -> {response.text}")


asyncio.run(main())

Integration with Orchestrators#

1. With TextCompletionLLM#

Use LlamaCPP with TextCompletionLLM for structured outputs:

import os
from pydantic import BaseModel
from serapeum.core.output_parsers import PydanticParser
from serapeum.core.llms import TextCompletionLLM
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)


class DummyModel(BaseModel):
    value: str


llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

parser = PydanticParser(output_cls=DummyModel)

text_llm = TextCompletionLLM(
    output_parser=parser,
    prompt="Value: {value}",
    llm=llm,
)

result = text_llm(value="input")
print(result.value)  # "input"

Prompt Formatters#

LlamaCPP requires explicit prompt formatters. Using the wrong formatter for your model produces garbage output.

Available Formatters#

Llama 2 / Mistral#

from serapeum.llama_cpp.formatters.llama2 import (
    messages_to_prompt,
    completion_to_prompt,
)

Uses the [INST]...[/INST] format:

<s> [INST] <<SYS>> system prompt <</SYS>> user message [/INST] assistant reply </s>

Llama 3 Instruct#

from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

Uses the <|start_header_id|>...<|eot_id|> format:

<|start_header_id|>system<|end_header_id|>
system prompt<|eot_id|>
<|start_header_id|>user<|end_header_id|>
user message<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

Formatter Behavior#

messages_to_prompt: Converts a list of Message objects to a single formatted string. Handles system, user, and assistant roles, including multi-turn conversations.

completion_to_prompt: Wraps a plain string in the model's instruct format with a default system prompt.
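
You can also call the formatters directly to inspect the exact prompt string that will be sent to the model (assuming the TextChunk import path used in the examples above):

from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

messages = [
    Message(role=MessageRole.SYSTEM, chunks=[TextChunk(content="You are terse.")]),
    Message(role=MessageRole.USER, chunks=[TextChunk(content="Say hello.")]),
]

# Both functions return plain strings following the Llama 3 template shown above.
print(messages_to_prompt_v3_instruct(messages))
print(completion_to_prompt_v3_instruct("Say hello."))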

Custom Formatters#

You can write your own formatters for other model families:

from collections.abc import Sequence
from serapeum.core.llms import Message, MessageRole
from serapeum.llama_cpp import LlamaCPP


def my_messages_to_prompt(messages: Sequence[Message]) -> str:
    """Custom formatter for your model."""
    parts = []
    for msg in messages:
        if msg.role == MessageRole.SYSTEM:
            parts.append(f"[SYSTEM] {msg.content}")
        elif msg.role == MessageRole.USER:
            parts.append(f"[USER] {msg.content}")
        elif msg.role == MessageRole.ASSISTANT:
            parts.append(f"[ASSISTANT] {msg.content}")
    parts.append("[ASSISTANT]")
    return "\n".join(parts)


def my_completion_to_prompt(completion: str) -> str:
    """Custom formatter for completion."""
    return f"[SYSTEM] You are helpful.\n[USER] {completion}\n[ASSISTANT]"


llm = LlamaCPP(
    model_path="/models/my-model.gguf",
    messages_to_prompt=my_messages_to_prompt,
    completion_to_prompt=my_completion_to_prompt,
)

Tokenization#

LlamaCPP exposes the model's tokenizer for prompt length checking:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

# Get token IDs
tokens = llm.tokenize("Hello, world!")
print(tokens)  # [1, 15043, 29892, 3186, 29991]

# Count tokens
count = llm.count_tokens("Hello, world!")
print(count)  # 5
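
Continuing with the llm instance above, count_tokens can be used to verify that a prompt leaves room for generation within the configured context window (a minimal sketch; 8192 and 256 mirror the context_window and max_new_tokens values used earlier in this guide):

context_window = 8192
max_new_tokens = 256

prompt = "Summarize the plot of a short story about a brave little llama."
if llm.count_tokens(prompt) + max_new_tokens > context_window:
    raise ValueError("Prompt is too long for the configured context window.")

response = llm.complete(prompt)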


Best Practices#

1. Reuse LLM Instances#

Create once, use many times — model loading is expensive:

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

# Good: create the instance once
llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

# Reuse it for multiple calls (the model stays loaded)
response1 = llm.complete("Hello!")
response2 = llm.complete("How are you?")

# Bad: don't recreate the instance for each call; the model reloads every time
def process(prompt):
    llm = LlamaCPP(model_path="...", ...)  # Expensive!
    return llm.complete(prompt)

2. Match Formatter to Model#

Always use the correct formatter for your model family:

from serapeum.llama_cpp import LlamaCPP
# Llama 2 / Mistral models
from serapeum.llama_cpp.formatters.llama2 import messages_to_prompt, completion_to_prompt

# Llama 3 Instruct models
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct, completion_to_prompt_v3_instruct
)

3. Handle Errors Gracefully#

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

try:
    response = llm.complete("Hello")
except ValueError as e:
    print(f"Prompt too long: {e}")
except RuntimeError as e:
    print(f"Model error: {e}")

4. Use GPU for Performance#

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

# Offload all layers to GPU
llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    n_gpu_layers=-1,
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

5. Use Temperature=0 for Deterministic Output#

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

# Greedy decoding for reproducible results
llm = LlamaCPP(
    model_path=os.environ.get("LLAMA_MODEL_PATH"),
    temperature=0.0,
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

# Same prompt always produces same output
r1 = llm.complete("The capital of France is")
r2 = llm.complete("The capital of France is")
assert r1.text == r2.text

See Also#