Component Boundaries and Interactions#
This diagram shows how components interact during the complete lifecycle of the LlamaCPP LLM.
Hold "Ctrl" to enable pan & zoom
graph TB
subgraph User Space
UC[User Code]
PM[Pydantic Models]
end
subgraph LlamaCPP Core
LC[LlamaCPP Instance]
subgraph Configuration
CFG[Configuration Fields]
MD[Metadata]
end
subgraph Validation
VMP[_validate_model_path_exists]
VMS[_check_model_source]
VFM[_check_formatters]
VPK[_prepare_kwargs]
end
subgraph Model Resolution
RES[_resolve_model_path]
FURL[_fetch_model_file]
FHF[_fetch_model_file_hf]
end
subgraph Model Loading
LOAD[_load_model]
CACHE[_MODEL_CACHE]
LOCK[_MODEL_CACHE_LOCK]
end
subgraph Context Guard
GUARD[_guard_context]
TOK[tokenize / count_tokens]
end
subgraph Completion Engine
COMP[_complete]
SCOMP[_stream_complete]
KWARGS[generate_kwargs builder]
end
subgraph Chat Bridge
MIXIN[CompletionToChatMixin]
M2P[messages_to_prompt]
C2P[completion_to_prompt]
end
subgraph Async Bridge
ACOMP[acomplete]
TOTHREAD[asyncio.to_thread]
end
end
subgraph Formatter Layer
subgraph Llama 2 Formatter
L2M[messages_to_prompt]
L2C[completion_to_prompt]
end
subgraph Llama 3 Formatter
L3M[messages_to_prompt_v3_instruct]
L3C[completion_to_prompt_v3_instruct]
end
end
subgraph Model Layer
LLAMA[Llama Instance]
subgraph Inference
GEN[__call__ / generate]
VOCAB[Tokenizer / Vocabulary]
CTX[Context Window]
end
end
subgraph External Sources
LOCAL[Local GGUF File]
URL[Remote URL]
HF[HuggingFace Hub]
end
subgraph Orchestrator Layer
TCL[TextCompletionLLM]
subgraph Orchestrator Components
PRS[PydanticParser]
PTMP[PromptTemplate]
end
end
subgraph Response Models
CRESP[ChatResponse]
CORESP[CompletionResponse]
MSG[Message]
end
%% Initialization Flow
UC -->|1. Initialize| LC
LC -->|Validate fields| VMP
VMP -->|Validate cross-field| VMS
VMS -->|Validate formatters| VFM
VFM -->|Merge defaults| VPK
VPK -->|Store config| CFG
CFG -->|Create| MD
%% Model Resolution
LC -->|model_post_init| RES
RES -->|model_path set| LOCAL
RES -->|model_url set| FURL
RES -->|hf_model_id set| FHF
FURL -->|Download from| URL
FHF -->|Download from| HF
FURL -->|Save to| LOCAL
FHF -->|Save to| LOCAL
%% Model Loading
RES -->|Path resolved| LOAD
LOAD -->|Check cache| CACHE
CACHE -.->|Thread-safe| LOCK
LOAD -->|Cache miss| LLAMA
LLAMA -->|Store in cache| CACHE
%% Completion Flow
UC -->|2a. complete(prompt)| LC
LC -->|Format prompt| C2P
C2P -.->|Uses| L2C
C2P -.->|Uses| L3C
LC -->|Check length| GUARD
GUARD -->|Count tokens| TOK
TOK -->|Use vocabulary| VOCAB
LC -->|Build kwargs| KWARGS
KWARGS -->|Non-stream| COMP
KWARGS -->|Stream| SCOMP
COMP -->|Acquire lock| LOCK
COMP -->|Call| GEN
GEN -->|Use| CTX
GEN -->|Return dict| COMP
COMP -->|Create| CORESP
CORESP -->|Return| UC
SCOMP -->|Acquire lock| LOCK
SCOMP -->|Stream call| GEN
GEN -.->|Yield chunks| SCOMP
SCOMP -.->|Yield| CORESP
CORESP -.->|Yield| UC
%% Chat Flow (via mixin)
UC -->|2b. chat(messages)| MIXIN
MIXIN -->|Format messages| M2P
M2P -.->|Uses| L2M
M2P -.->|Uses| L3M
M2P -->|Formatted string| LC
LC -->|Delegate to complete| COMP
COMP -->|CompletionResponse| MIXIN
MIXIN -->|Wrap in| CRESP
CRESP -->|Contains| MSG
CRESP -->|Return| UC
%% Async Flow
UC -->|2c. acomplete(prompt)| ACOMP
ACOMP -->|Offload to| TOTHREAD
TOTHREAD -->|Call sync| COMP
COMP -->|Return| TOTHREAD
TOTHREAD -->|Return| ACOMP
ACOMP -->|Return| UC
%% TextCompletionLLM Integration
UC -->|3. TextCompletionLLM(llm=LlamaCPP)| TCL
TCL -->|Store| LC
TCL -->|Use| PTMP
TCL -->|Use| PRS
UC -->|Call| TCL
TCL -->|Format prompt| PTMP
PTMP -->|Messages| LC
LC -->|chat| CRESP
CRESP -->|message.content| PRS
PRS -->|Parse JSON| PM
PM -->|Return| UC
%% Styling
classDef userClass fill:#e1f5ff,stroke:#01579b
classDef llamaClass fill:#e0f2f1,stroke:#004d40
classDef configClass fill:#fff9c4,stroke:#f57f17
classDef modelClass fill:#f3e5f5,stroke:#4a148c
classDef formatterClass fill:#e8f5e9,stroke:#1b5e20
classDef sourceClass fill:#efebe9,stroke:#3e2723
classDef orchestratorClass fill:#fce4ec,stroke:#880e4f
classDef responseClass fill:#fff3e0,stroke:#e65100
class UC,PM userClass
class LC llamaClass
class CFG,MD,VMP,VMS,VFM,VPK,GUARD,TOK,COMP,SCOMP,KWARGS,MIXIN,M2P,C2P,ACOMP,TOTHREAD configClass
class LLAMA,GEN,VOCAB,CTX modelClass
class RES,LOAD,CACHE,LOCK,FURL,FHF formatterClass
class LOCAL,URL,HF sourceClass
class TCL,PRS,PTMP orchestratorClass
class CRESP,CORESP,MSG responseClass
class L2M,L2C,L3M,L3C formatterClass
Component Interaction Patterns#
1. Initialization Pattern#
User Code
└─→ LlamaCPP.__init__
├─→ Validators (field_validator, model_validator):
│ ├─→ _validate_model_path_exists: Check local file exists
│ ├─→ _check_model_source: Ensure one source is set
│ ├─→ _check_formatters: Ensure formatters are provided
│ └─→ _prepare_kwargs: Merge n_ctx, verbose, n_gpu_layers into model_kwargs
├─→ Store: model_path, model_url, temperature, max_new_tokens, etc.
├─→ Create Metadata: is_chat_model=True, is_function_calling_model=False
└─→ model_post_init:
├─→ _resolve_model_path: Download if needed, return local Path
└─→ _load_model: Load Llama with caching
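Below is a minimal construction sketch based on the fields listed above. Both import paths and the assumption that the formatters are passed as constructor arguments are illustrative guesses rather than the library's documented API; adjust them to your installation.

```python
# Hedged sketch: the import paths below are assumptions.
from serapeum.llama_cpp import LlamaCPP                       # hypothetical module path
from serapeum.llama_cpp.formatters import (                    # hypothetical module path
    completion_to_prompt_v3_instruct,
    messages_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",       # local GGUF file
    messages_to_prompt=messages_to_prompt_v3_instruct,          # assumed constructor fields,
    completion_to_prompt=completion_to_prompt_v3_instruct,      # checked by _check_formatters
    temperature=0.1,
    max_new_tokens=256,
    context_window=4096,
    n_gpu_layers=-1,                                             # offload all layers if built with GPU support
    model_kwargs={"verbose": False},                             # n_ctx / n_gpu_layers merged in by _prepare_kwargs
)
# Validators run during __init__; model_post_init then resolves the path and
# loads (or reuses) the cached Llama instance.
```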
2. Model Resolution Pattern#
_resolve_model_path()
├─→ If model_path is set:
│ └─→ Return Path(model_path)
├─→ If hf_model_id is set:
│ ├─→ Create cache directory
│ ├─→ Call _fetch_model_file_hf(repo_id, filename, cache_dir)
│ │ └─→ hf_hub_download → Path
│ └─→ Update self.model_path, return Path
└─→ Else (model_url or default):
├─→ Compute cache path from URL filename
├─→ If not cached:
│ └─→ Call _fetch_model_file(url, path)
│ └─→ requests.get(stream=True) with progress bar
└─→ Update self.model_path, return Path
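The URL branch reduces to a streamed requests download into the cache directory. The following standalone sketch mirrors the steps described above; the function name, progress reporting, and exact size check are illustrative, not the library's code.

```python
from pathlib import Path

import requests


def fetch_model_file(url: str, dest: Path, chunk_size: int = 1 << 20) -> Path:
    """Illustrative sketch of a streamed GGUF download; not the library's implementation."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        total = int(resp.headers.get("Content-Length", 0))
        if total and total < 1_000_000:                  # mirrors the < 1 MB sanity check
            raise ValueError(f"Suspiciously small file ({total} bytes) at {url}")
        written = 0
        with dest.open("wb") as fh:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fh.write(chunk)
                written += len(chunk)
                print(f"\rDownloaded {written / 1e6:.1f} MB", end="")
    print()
    if not dest.exists():
        raise RuntimeError(f"Download finished but {dest} is missing")
    return dest
```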
3. Model Loading Pattern (with Cache)#
_load_model(model_path)
├─→ Compute cache_key = (str(path), json(model_kwargs))
├─→ With _MODEL_CACHE_LOCK: check cache
│ └─→ result = _MODEL_CACHE.get(cache_key)
├─→ If cache miss:
│ ├─→ loaded = Llama(model_path, **model_kwargs)
│ └─→ With _MODEL_CACHE_LOCK: store if still missing
└─→ Return the cached instance (or the newly loaded one on a miss)
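A self-contained sketch of the same double-checked caching idea, using a stand-in class so it runs without the llama-cpp-python backend:

```python
import json
import threading
import weakref


class FakeLlama:
    """Stand-in for llama_cpp.Llama so the sketch runs anywhere."""

    def __init__(self, model_path: str, **kwargs):
        self.model_path = model_path
        self.kwargs = kwargs


_MODEL_CACHE: weakref.WeakValueDictionary = weakref.WeakValueDictionary()
_MODEL_CACHE_LOCK = threading.Lock()


def load_model(model_path: str, model_kwargs: dict) -> FakeLlama:
    cache_key = (model_path, json.dumps(model_kwargs, sort_keys=True))
    with _MODEL_CACHE_LOCK:                               # cheap lookup under the lock
        cached = _MODEL_CACHE.get(cache_key)
    if cached is not None:
        return cached
    loaded = FakeLlama(model_path, **model_kwargs)        # expensive load outside the lock
    with _MODEL_CACHE_LOCK:                               # store only if no other thread won the race
        return _MODEL_CACHE.setdefault(cache_key, loaded)


llm_a = load_model("model.gguf", {"n_ctx": 4096})
llm_b = load_model("model.gguf", {"n_ctx": 4096})
assert llm_a is llm_b                                     # identical key, shared instance
```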
4. Completion Request Pattern#
User → LlamaCPP.complete(prompt, formatted, stream, **kwargs)
├─→ If not formatted:
│ └─→ prompt = self.completion_to_prompt(prompt)
├─→ If stream:
│ └─→ _stream_complete(prompt, **kwargs)
│ ├─→ _guard_context(prompt) → count tokens, raise if too long
│ ├─→ Build call_kwargs from generate_kwargs + defaults
│ ├─→ Acquire _model_lock
│ └─→ Yield CompletionResponse per chunk
└─→ Else:
└─→ _complete(prompt, **kwargs)
├─→ _guard_context(prompt)
├─→ Build call_kwargs
├─→ Acquire _model_lock
├─→ self._model(prompt=prompt, **call_kwargs)
└─→ Return CompletionResponse(text=choices[0]["text"])
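Assuming an `llm` constructed as in the initialization sketch, the two branches look like this from user code. The chunk field name is taken from the response models above; whether streamed chunks carry deltas or accumulated text is not specified here.

```python
# Non-streaming: the prompt goes through completion_to_prompt, then _complete
# returns a single CompletionResponse.
response = llm.complete("Summarise GGUF quantisation in one sentence.")
print(response.text)

# Streaming: _stream_complete yields CompletionResponse chunks.
for chunk in llm.complete("List three GGUF quantisation levels.", stream=True):
    print(chunk.text, end="", flush=True)
```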
5. Chat via Mixin Pattern#
User → LlamaCPP.chat(messages, stream, **kwargs)
└─→ CompletionToChatMixin.chat(messages, stream, **kwargs)
├─→ prompt = self.messages_to_prompt(messages)
├─→ result = self.complete(prompt, formatted=True, stream=stream, **kwargs)
├─→ If stream:
│ └─→ Convert each CompletionResponse chunk → ChatResponse chunk
└─→ Else:
└─→ Wrap CompletionResponse → ChatResponse
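A stripped-down, standalone sketch of the bridge idea with dummy types (not the serapeum classes): chat() only formats the messages, delegates to complete(), and re-wraps the result.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str


@dataclass
class CompletionResponse:
    text: str


@dataclass
class ChatResponse:
    message: Message


class CompletionToChatBridge:
    """Sketch of the mixin pattern: chat() is built on top of complete()."""

    def messages_to_prompt(self, messages: list[Message]) -> str:
        return "\n".join(f"{m.role}: {m.content}" for m in messages) + "\nassistant:"

    def complete(self, prompt: str, formatted: bool = False) -> CompletionResponse:
        # Stub: a real implementation would call the Llama backend here.
        return CompletionResponse(text=f"[stub reply to {len(prompt)} prompt chars]")

    def chat(self, messages: list[Message]) -> ChatResponse:
        prompt = self.messages_to_prompt(messages)         # format once
        result = self.complete(prompt, formatted=True)     # delegate to the completion path
        return ChatResponse(message=Message(role="assistant", content=result.text))


bridge = CompletionToChatBridge()
print(bridge.chat([Message("user", "Hello!")]).message.content)
```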
6. Async Pattern#
User → LlamaCPP.acomplete(prompt, formatted, stream, **kwargs)
├─→ If stream:
│ ├─→ chunks = await asyncio.to_thread(lambda: list(self.complete(..., stream=True)))
│ └─→ Return async generator yielding chunks
└─→ Else:
└─→ await asyncio.to_thread(self.complete, prompt, formatted, stream=False, **kwargs)
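The async methods do not reimplement inference; they offload the blocking sync path to a worker thread via asyncio.to_thread. A self-contained sketch of both branches, with a stub standing in for the real sync call:

```python
import asyncio
import time
from typing import AsyncIterator


def complete(prompt: str, stream: bool = False):
    """Stand-in for the blocking synchronous completion path."""
    time.sleep(0.1)                                   # pretend inference takes a moment
    if stream:
        return (f"chunk-{i} " for i in range(3))      # pretend streaming generator
    return f"full completion for {prompt!r}"


async def acomplete(prompt: str) -> str:
    # Non-streaming branch: run the sync call on a worker thread.
    return await asyncio.to_thread(complete, prompt, False)


async def astream_complete(prompt: str) -> AsyncIterator[str]:
    # Streaming branch: drain the sync generator in a thread, then re-yield asynchronously.
    chunks = await asyncio.to_thread(lambda: list(complete(prompt, stream=True)))
    for chunk in chunks:
        yield chunk


async def main() -> None:
    print(await acomplete("hello"))
    async for chunk in astream_complete("hello"):
        print(chunk, end="")
    print()


asyncio.run(main())
```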
7. TextCompletionLLM Integration Pattern#
User → TextCompletionLLM(output_parser=parser, prompt=prompt, llm=LlamaCPP(...))
└─→ TextCompletionLLM stores LlamaCPP instance
User → text_llm(key="value")
├─→ Check llm.metadata.is_chat_model → True
├─→ Format prompt with variables → List[Message]
├─→ LlamaCPP.chat(messages) → ChatResponse
├─→ Extract response.message.content
├─→ PydanticParser.parse(content) → Model instance
└─→ Return Model instance
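An end-to-end usage sketch follows. The import paths, the PydanticParser and PromptTemplate constructor signatures, and the omission of formatter arguments are all assumptions made for illustration; only the overall call shape (text_llm(key=...) returning a parsed model) comes from the pattern above.

```python
from pydantic import BaseModel

# Hypothetical import paths; adjust to your installation.
from serapeum.core.llms import PromptTemplate, PydanticParser, TextCompletionLLM
from serapeum.llama_cpp import LlamaCPP


class City(BaseModel):
    name: str
    country: str


text_llm = TextCompletionLLM(
    output_parser=PydanticParser(City),                       # assumed constructor: parses message.content into City
    prompt=PromptTemplate("Return JSON describing the city {key}."),  # assumed template API
    llm=LlamaCPP(model_path="models/llama-3-8b-instruct.Q4_K_M.gguf"),  # formatter args omitted for brevity
)

# is_chat_model=True, so the orchestrator formats messages and calls LlamaCPP.chat().
city = text_llm(key="Cairo")
print(city.name, city.country)
```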
Component State Management#
LlamaCPP Instance State#
Initialization:
- model_path: str (set during init or model_post_init)
- model_url: Optional[str] (immutable after init)
- hf_model_id: Optional[str] (immutable after init)
- temperature: float (immutable after init)
- max_new_tokens: int (immutable after init)
- context_window: int (immutable after init)
- n_gpu_layers: int (immutable after init)
- stop: list[str] (immutable after init)
- generate_kwargs: dict (immutable after init)
- model_kwargs: dict (immutable after init)
- _model: Llama (loaded during model_post_init)
- _model_lock: threading.Lock (per-instance)
Runtime:
- _model: Llama instance (loaded, never changes)
- _model_lock: Acquired during each inference call
Model Cache State (Module-level)#
_MODEL_CACHE: WeakValueDictionary
- Keys: (model_path_str, model_kwargs_json)
- Values: Llama instances (weak references)
- Entries auto-removed when all LlamaCPP refs are GC'd
_MODEL_CACHE_LOCK: threading.Lock
- Protects read/write to _MODEL_CACHE
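The weak-value behaviour is easy to observe in isolation; a tiny sketch with a stand-in class (not the real cache):

```python
import gc
import weakref


class FakeLlama:
    pass


cache: weakref.WeakValueDictionary = weakref.WeakValueDictionary()

model = FakeLlama()
cache["model.gguf"] = model
print(len(cache))   # 1: the strong reference in `model` keeps the entry alive

del model           # drop the last strong reference
gc.collect()        # on CPython the entry usually vanishes even without this
print(len(cache))   # 0: the cache entry disappeared with the instance
```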
Request State (Per Call)#
Input:
- prompt: str
- formatted: bool
- stream: bool
- **kwargs: Additional generation options
Processing:
- formatted_prompt: str (after completion_to_prompt if needed)
- call_kwargs: dict (merged generate_kwargs + defaults + overrides)
- raw_response: dict from Llama backend
Output:
- CompletionResponse or CompletionResponseGen
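A minimal sketch of the merge; the precedence shown here, with per-call overrides winning last, is an assumption consistent with the order listed above.

```python
generate_kwargs = {"top_p": 0.9}                                      # set at construction time
defaults = {"temperature": 0.1, "max_tokens": 256, "stop": ["</s>"]}  # derived from instance fields
per_call_overrides = {"temperature": 0.7}                             # **kwargs passed to complete()

call_kwargs = {**generate_kwargs, **defaults, **per_call_overrides}
print(call_kwargs)   # {'top_p': 0.9, 'temperature': 0.7, 'max_tokens': 256, 'stop': ['</s>']}
```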
Error Boundaries#
1. Configuration Errors (Initialization)#
Validators:
├─→ _validate_model_path_exists: ValueError if path doesn't exist
├─→ _check_model_source: ValueError if no source specified
├─→ _check_model_source: ValueError if hf_model_id without hf_filename
└─→ _check_formatters: ValueError if formatters missing
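A standalone sketch of the cross-field checks using a pydantic v2 model_validator; the field names match the docs, but the class itself is a stand-in rather than the real LlamaCPP model.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, model_validator


class ModelSourceConfig(BaseModel):
    """Stand-in config illustrating the _check_model_source rules."""

    model_config = ConfigDict(protected_namespaces=())   # allow model_* field names

    model_path: Optional[str] = None
    model_url: Optional[str] = None
    hf_model_id: Optional[str] = None
    hf_filename: Optional[str] = None

    @model_validator(mode="after")
    def _check_model_source(self) -> "ModelSourceConfig":
        if not (self.model_path or self.model_url or self.hf_model_id):
            raise ValueError("one of model_path, model_url or hf_model_id must be set")
        if self.hf_model_id and not self.hf_filename:
            raise ValueError("hf_model_id requires hf_filename")
        return self


try:
    ModelSourceConfig(hf_model_id="org/repo")
except ValueError as err:                                 # pydantic's ValidationError is a ValueError
    print(err)
```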
2. Model Resolution Errors (model_post_init)#
_resolve_model_path:
├─→ _fetch_model_file: ValueError (Content-Length < 1 MB)
├─→ _fetch_model_file: requests.ConnectionError
├─→ _fetch_model_file: RuntimeError (download succeeded but file missing)
├─→ _fetch_model_file_hf: ImportError (huggingface-hub not installed)
└─→ _fetch_model_file_hf: Network/auth errors from hf_hub_download
3. Inference Errors (During Call)#
_guard_context:
└─→ ValueError if prompt tokens exceed context_window
_complete / _stream_complete:
└─→ RuntimeError from llama-cpp-python backend
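A standalone sketch of the guard, with a whitespace tokenizer standing in for the real vocabulary:

```python
def guard_context(prompt: str, context_window: int) -> None:
    """Sketch of _guard_context: count tokens and fail fast before inference."""
    token_count = len(prompt.split())          # stand-in for the model's tokenizer
    if token_count > context_window:
        raise ValueError(
            f"Prompt is {token_count} tokens, which exceeds the context window of {context_window}"
        )


guard_context("a short prompt", context_window=8)          # passes silently
try:
    guard_context("one two three four five six seven eight nine", context_window=8)
except ValueError as err:
    print(err)
```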
Component Dependencies#
LlamaCPP Depends On:#
- llama_cpp.Llama (external package, model runtime)
- serapeum.core.llms.LLM (base class)
- serapeum.core.llms.CompletionToChatMixin (chat bridge)
- serapeum.core.llms types (Message, CompletionResponse, ChatResponse, Metadata)
- serapeum.core.utils.base.get_cache_dir (cache directory)
- pydantic (configuration and validation)
- requests (URL downloads)
- huggingface_hub (optional, for HuggingFace Hub downloads)
LlamaCPP Is Used By:#
- TextCompletionLLM (as the LLM engine)
- Direct user code (standalone usage)
External Dependencies:#
- GGUF Model File: Must be available locally or downloadable
- llama-cpp-python: Must be compiled for the target platform (CPU/CUDA/Metal)