# LlamaCPP LLM Integration
This directory contains comprehensive documentation explaining the complete workflow of the LlamaCPP class, from initialization to execution across various modes (completion, chat, streaming, async).
## Overview
The LlamaCPP class is a local LLM integration that provides:
1. Local GGUF model inference via llama-cpp-python
2. Chat and completion APIs with sync/async support
3. Streaming responses for real-time output
4. Multiple model sources — local path, URL download, or HuggingFace Hub
5. Model caching — shared memory across instances
6. Thread-safe inference — serialized calls to the C backend
## Example Usage

### Basic Completion
```python notest
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path="/models/llama-3-8b-instruct.Q4_0.gguf",
    temperature=0.1,
    max_new_tokens=256,
    context_window=8192,
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

response = llm.complete("Hello, how are you?")
print(response.text)
```
### With TextCompletionLLM
```python notest
from pydantic import BaseModel
from serapeum.core.output_parsers import PydanticParser
from serapeum.core.llms import TextCompletionLLM
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
messages_to_prompt_v3_instruct,
completion_to_prompt_v3_instruct,
)
class DummyModel(BaseModel):
    value: str


llm = LlamaCPP(
    model_path="/models/llama-3-8b-instruct.Q4_0.gguf",
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

text_llm = TextCompletionLLM(
    output_parser=PydanticParser(output_cls=DummyModel),
    prompt="Generate any value: {value}",
    llm=llm,
)

result = text_llm(value="input")
# Returns: DummyModel(value="input")
```
### Basic Chat
```python notest
from serapeum.llama_cpp import LlamaCPP
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path="/models/llama-3-8b-instruct.Q4_0.gguf",
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Say hello.")])]
response = llm.chat(messages)
print(response.message.content)
```
## Understanding the Workflow
### 1. [Execution Flow and Method Calls](./llama_cpp_sequence.md)
Shows the chronological flow of method calls and interactions across all usage patterns.
**Best for**:
- Understanding the order of operations
- Seeing how LlamaCPP loads and invokes models
- Debugging execution flow
- Understanding the completion-to-chat bridge
**Key Flows**:
- Initialization phase (model download / load)
- Direct completion calls
- Chat via CompletionToChatMixin
- Streaming execution
- Integration with TextCompletionLLM
- Async operations via thread pool
### 2. [Architecture and Class Relationships](./llama_cpp_class.md)
Illustrates the static structure, inheritance hierarchy, and relationships.
**Best for**:
- Understanding the architecture
- Seeing inheritance chain (BaseLLM → LLM → CompletionToChatMixin → LlamaCPP)
- Identifying class responsibilities
- Understanding integration points
**Key Classes**:
- `LlamaCPP`: Main LLM implementation
- `CompletionToChatMixin`: Bridges completion to chat interface
- `LLM`: High-level orchestration
- `BaseLLM`: Core interface
- `Llama`: Underlying llama-cpp-python model
- Response models: `ChatResponse`, `CompletionResponse`, `Message`
### 3. [Data Transformations and Validation](./llama_cpp_dataflow.md)
Tracks how data transforms through the system across different operation modes.
**Best for**:
- Understanding data transformations
- Identifying validation points
- Seeing error handling paths
- Understanding request/response formats
**Key Flows**:
- Initialization, model resolution, and loading
- Completion request building and response parsing
- Chat via prompt formatting bridge
- Streaming chunk processing
- Error handling pipelines
### 4. [Component Boundaries and Interactions](./llama_cpp_components.md)
Shows component boundaries, responsibilities, and interaction patterns.
**Best for**:
- Understanding system architecture
- Seeing component responsibilities
- Identifying interaction patterns
- Understanding the local inference stack
**Key Components**:
- User space (application code)
- LlamaCPP core (prompt formatting, context guard, generation)
- Model layer (Llama instance, model cache)
- Formatter layer (Llama 2 / Llama 3 prompt templates)
- Orchestrator layer (TextCompletionLLM)
### 5. [Lifecycle and State Management](./llama_cpp_state.md)
Depicts the lifecycle states, transitions, and state variables.
**Best for**:
- Understanding instance lifecycle
- Seeing state transitions
- Identifying error states
- Understanding concurrency considerations
**Key States**:
- Uninitialized → ModelLoading (initialization)
- ModelLoading → Ready (model loaded)
- Ready ↔ Processing* (request handling)
- Processing → Error → Ready (error handling)
### 6. [Usage Examples](./examples.md)
Comprehensive examples from real test cases.
**Best for**:
- Learning by example
- Understanding practical usage
- Seeing all API variants
- Integration patterns
**Key Examples**:
- Basic completion and chat
- Streaming operations
- Model loading from different sources
- Async operations
- Error handling
- Prompt formatters
## Core Capabilities

### 1. Completion API

`complete()` (and its async counterpart `acomplete()`) turn a prompt string into a `CompletionResponse`; the generated text is available as `response.text`.

### 2. Chat API

`chat()` accepts a list of `Message` objects and returns a `ChatResponse`. Under the hood, `CompletionToChatMixin` formats the messages into a single prompt and delegates to the completion path.

### 3. Streaming and Async

Pass `stream=True` to receive incremental chunks (`chunk.delta`) as they are generated; async calls are offloaded to a thread pool so the event loop is not blocked.
## Key Design Patterns
### 1. **Eager Initialization**
Model is loaded during `model_post_init`, not on first use:
```python notest
def model_post_init(self, __context):
    model_path = self._resolve_model_path()
    self._model = self._load_model(model_path)
```
### 2. **CompletionToChatMixin (Adapter Pattern)**
Chat is derived from completion, the inverse of the Ollama integration:

```python notest
# CompletionToChatMixin provides chat() by:
# 1. Formatting messages with messages_to_prompt()
# 2. Calling complete() with the formatted string
# 3. Wrapping the result in a ChatResponse
```
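For readers who want to see the shape of the adapter, here is a minimal sketch of that bridge (illustrative only: the `ChatResponse` import location and constructor details are assumptions, not the library's actual source):

```python notest
# Minimal sketch of the completion-to-chat bridge (not the real implementation).
from serapeum.core.llms import ChatResponse, Message, MessageRole, TextChunk


class CompletionToChatSketch:
    def chat(self, messages, **kwargs) -> ChatResponse:
        # 1. Collapse the message list into a single prompt string.
        prompt = self.messages_to_prompt(messages)
        # 2. Run the underlying completion call on the formatted prompt.
        completion = self.complete(prompt, **kwargs)
        # 3. Wrap the raw completion text in a chat-style response.
        return ChatResponse(
            message=Message(
                role=MessageRole.ASSISTANT,
                chunks=[TextChunk(content=completion.text)],
            )
        )
```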
### 3. **Model Cache (Flyweight Pattern)**
Models are cached in a module-level WeakValueDictionary:
```python notest
_MODEL_CACHE: WeakValueDictionary[tuple[str, str], Llama] = WeakValueDictionary()


def _load_model(self, model_path: Path) -> Llama:
    cache_key = (str(model_path), json.dumps(self.model_kwargs, sort_keys=True))
    # Double-checked locking for thread safety
    ...
```
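The elided section is a double-checked locking scheme: check the cache without the lock, then re-check after acquiring it so only one thread pays the load cost. A sketch of that pattern under assumed names (`_CACHE_LOCK` is not necessarily the real lock variable):

```python notest
# Sketch of double-checked locking around the shared model cache.
def _load_model_sketch(self, model_path, cache_key):
    cached = _MODEL_CACHE.get(cache_key)
    if cached is not None:           # Fast path: cache hit without locking.
        return cached
    with _CACHE_LOCK:                # Slow path: serialize model loading.
        cached = _MODEL_CACHE.get(cache_key)
        if cached is not None:       # Another thread may have loaded it
            return cached            # while we waited for the lock.
        model = Llama(model_path=str(model_path), **self.model_kwargs)
        _MODEL_CACHE[cache_key] = model
        return model
```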
### 4. **Thread Safety**
Inference is serialized per instance to prevent C-level races:

```python notest
with self._model_lock:
    response = self._model(prompt=prompt, **call_kwargs)
```
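A slightly fuller sketch of the same idea, showing why the lock is per instance (the class and attribute names here are illustrative, not the actual ones):

```python notest
import threading


class LockedInferenceSketch:
    def __init__(self, model):
        self._model = model
        self._model_lock = threading.Lock()  # One lock per instance.

    def _call_model(self, prompt: str, **call_kwargs):
        # The C backend is not safe for concurrent calls, so simultaneous
        # callers queue here instead of racing inside the native library.
        with self._model_lock:
            return self._model(prompt=prompt, **call_kwargs)
```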
## Integration with TextCompletionLLM

The full call stack for local inference:

```
User Application
    ↓
TextCompletionLLM
    ↓
LlamaCPP
    ↓
Prompt Formatter (Llama 2 / Llama 3)
    ↓
Llama (llama-cpp-python)
    ↓
GGUF Model File (local)
```

When called through `TextCompletionLLM`, the orchestrator:

1. Formats the prompt with variables
2. Checks `is_chat_model` → True
3. Calls `LlamaCPP.chat()`
4. Parses the response with `PydanticParser`
5. Returns a validated model instance

## Performance Considerations
1. **Model Caching**: WeakValueDictionary shares model across instances
2. **Thread Pool**: Async calls offloaded via `asyncio.to_thread`
3. **Context Guard**: Early rejection of oversized prompts before inference
4. **GPU Offloading**: `n_gpu_layers` parameter for GPU acceleration
5. **Streaming**: Reduces perceived latency for long responses
6. **Lock Granularity**: Per-instance lock for concurrent callers
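Items 2 and 6 combine into the usual pattern for async local inference: the blocking, lock-protected call runs in a worker thread. A minimal sketch of that offload (not the library's actual `acomplete` implementation):

```python notest
import asyncio


async def acomplete_sketch(llm, prompt: str):
    # Run the blocking completion in a worker thread so the event loop
    # stays responsive while the model generates tokens.
    return await asyncio.to_thread(llm.complete, prompt)
```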
## Configuration Options
### Model Source (one required)
- `model_path`: Path to a local GGUF file
- `model_url`: URL to download a GGUF file
- `hf_model_id` + `hf_filename`: HuggingFace Hub repository and filename
### Generation
- `temperature`: Sampling temperature (0.0–1.0, default: 0.75)
- `max_new_tokens`: Maximum tokens to generate (default: 256)
- `context_window`: Maximum context tokens (default: 3900)
- `stop`: List of stop token sequences (default: [])
### Model Loading
- `n_gpu_layers`: Layers to offload to GPU (0 = CPU only, -1 = all)
- `verbose`: Print llama.cpp verbose output (default: False)
- `model_kwargs`: Additional kwargs for `Llama()` constructor
- `generate_kwargs`: Additional kwargs for generation calls
### Prompt Formatting (required)
- `messages_to_prompt`: Callable to convert messages to prompt string
- `completion_to_prompt`: Callable to wrap a completion string in instruct format
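For reference, a single constructor call touching each group of options. The values are illustrative only, not recommendations (the stop token and extra kwargs are just plausible Llama 3 / llama-cpp-python settings):

```python notest
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    # Model source (one required): model_path, model_url, or hf_model_id + hf_filename
    model_path="/models/llama-3-8b-instruct.Q4_0.gguf",
    # Generation
    temperature=0.2,
    max_new_tokens=512,
    context_window=8192,
    stop=["<|eot_id|>"],
    # Model loading
    n_gpu_layers=-1,
    verbose=False,
    model_kwargs={"n_threads": 8},
    generate_kwargs={"top_p": 0.9},
    # Prompt formatting (required)
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)
```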
## Error Handling

### Configuration Errors

Raised at construction time, for example when no model source is provided (one of `model_path`, `model_url`, or `hf_model_id` is required) or when the referenced model file cannot be loaded.
## Prerequisites
### System Requirements
```bash
# Install llama-cpp-python (CPU)
pip install llama-cpp-python
# Or with GPU support (CUDA)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
# Or with Metal support (macOS)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```

### Python Requirements

```bash
# Install serapeum-llama-cpp
pip install serapeum-llama-cpp

# Or install from source
uv pip install -e libs/providers/llama-cpp
```

### Model Files

```bash
# Download a GGUF model (e.g., from HuggingFace)
# Option 1: Let LlamaCPP download automatically via model_url or hf_model_id
# Option 2: Download manually
wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf
```
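"Option 1" above pairs with the `hf_model_id` + `hf_filename` settings from the configuration section. A sketch of that route, using the same repository as the manual download (the `completion_to_prompt` name in the llama2 formatters module is an assumption, by analogy with the llama3 module):

```python notest
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama2 import (
    messages_to_prompt,
    completion_to_prompt,  # assumed name, not confirmed by the docs above
)

# Downloads the GGUF file from the HuggingFace Hub on first use, then loads it.
llm = LlamaCPP(
    hf_model_id="TheBloke/Llama-2-13B-chat-GGUF",
    hf_filename="llama-2-13b-chat.Q4_0.gguf",
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)
```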
## Common Patterns

### Pattern 1: Reusable Instance
```python notest
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

# Create once (model loads here)
llm = LlamaCPP(
    model_path="/models/llama-3-8b-instruct.Q4_0.gguf",
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

# Reuse many times (no re-loading)
response1 = llm.complete("Hello!")
response2 = llm.complete("How are you?")
```
### Pattern 2: Streaming for Long Responses
```python notest
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path="/models/llama-3-8b-instruct.Q4_0.gguf",
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

for chunk in llm.complete("Tell me a joke.", stream=True):
    print(chunk.delta, end="", flush=True)
```
### Pattern 3: GPU Acceleration
```python notest
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path="/models/llama-3-8b-instruct.Q4_0.gguf",
    n_gpu_layers=-1,  # Offload all layers to GPU
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)
```
### Pattern 4: Async for Concurrency
```python notest
import asyncio

from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)


async def main():
    # Wrap construction in to_thread to avoid blocking the event loop
    llm = await asyncio.to_thread(
        LlamaCPP,
        model_path="/models/llama-3-8b-instruct.Q4_0.gguf",
        messages_to_prompt=messages_to_prompt_v3_instruct,
        completion_to_prompt=completion_to_prompt_v3_instruct,
    )
    response = await llm.acomplete("Say hello.")
    print(response.text)


asyncio.run(main())
```
## Troubleshooting

### Issue: Model Loading Takes Too Long

**Solution**: Use `n_gpu_layers` to offload layers to the GPU, or use a smaller quantized model.

```python notest
llm = LlamaCPP(model_path="...", n_gpu_layers=-1, ...)
```
### Issue: Garbage Output

**Solution**: Ensure the prompt formatter matches the model family.

- Llama 2 / Mistral → `formatters.llama2.messages_to_prompt`
- Llama 3 → `formatters.llama3.messages_to_prompt_v3_instruct`
### Issue: Prompt Exceeds Context Window

**Solution**: Increase `context_window` or shorten the prompt.

```python notest
llm = LlamaCPP(model_path="...", context_window=8192, ...)
```
### Issue: ImportError for HuggingFace
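**Solution**: Most likely the HuggingFace Hub client library is missing; installing it enables `hf_model_id` / `hf_filename` downloads (an assumption about the missing optional dependency, not a quote from the original docs):

```bash
pip install huggingface-hub
```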
### Issue: Blocking the Event Loop

**Solution**: Wrap construction in `asyncio.to_thread`.

```python notest
llm = await asyncio.to_thread(LlamaCPP, model_path="...", ...)
```
## Next Steps

- Start with [Examples](./examples.md) for practical usage patterns
- Review [Sequence Diagrams](./llama_cpp_sequence.md) to understand execution flow
- Study the [Class Diagram](./llama_cpp_class.md) to understand architecture
- Explore [Data Flow](./llama_cpp_dataflow.md) to understand transformations
- Check [State Management](./llama_cpp_state.md) for lifecycle details
## See Also
- TextCompletionLLM - Structured completion orchestrator
- Ollama Provider - Remote Ollama server integration
- llama-cpp-python Documentation - Backend library