# Ollama LLM Integration

This directory contains comprehensive documentation of the complete workflow of the `Ollama` class, from initialization to execution across its various modes (chat, completion, streaming, tool calling, async).
## Overview
The Ollama class is a production-ready LLM integration that provides:
1. Connection to Ollama server (local or remote)
2. Chat and completion APIs with sync/async support
3. Streaming responses for real-time output
4. Tool/function calling for structured interactions
5. Integration with orchestrators (TextCompletionLLM, ToolOrchestratingLLM)
## Example Usage

### Basic Chat
```python
import os

from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama

# Initialize Ollama
llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180,
)

# Send a chat request
messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Say 'pong'.")])]
response = llm.chat(messages)
print(response.message.content)  # "Pong!"
```
### With TextCompletionLLM
```python
import os

from pydantic import BaseModel

from serapeum.core.output_parsers import PydanticParser
from serapeum.core.llms import TextCompletionLLM
from serapeum.ollama import Ollama


class DummyModel(BaseModel):
    value: str


# Initialize Ollama
llm = Ollama(model="qwen3.5:397b", api_key=os.environ.get("OLLAMA_API_KEY"), timeout=180)

# Create a structured completion runner
text_llm = TextCompletionLLM(
    output_parser=PydanticParser(output_cls=DummyModel),
    prompt="Generate any value: {value}",
    llm=llm,
)

# Execute and get structured output
result = text_llm(value="input")
# Returns: DummyModel(value="input")
```
### With ToolOrchestratingLLM
```python
import os

from pydantic import BaseModel

from serapeum.core.llms import ToolOrchestratingLLM
from serapeum.ollama import Ollama


class Album(BaseModel):
    title: str
    artist: str
    songs: list[str]


# Initialize Ollama
llm = Ollama(model="qwen3.5:397b", api_key=os.environ.get("OLLAMA_API_KEY"), timeout=180)

# Create a tool orchestrator
tools_llm = ToolOrchestratingLLM(
    schema=Album,
    prompt="Create an album about {topic} with two random songs",
    llm=llm,
)

# Execute and get structured output via tool calling
result = tools_llm(topic="rock")
# Returns: Album(title="...", artist="...", songs=[...])
```
## Understanding the Workflow

### 1. Execution Flow and Method Calls

Shows the chronological flow of method calls and interactions across all usage patterns.

Best for:

- Understanding the order of operations
- Seeing how Ollama communicates with the server
- Debugging execution flow
- Understanding integration patterns

Key Flows:

- Initialization phase (lazy client creation)
- Direct chat/completion calls
- Tool calling with schema conversion
- Streaming execution
- Integration with TextCompletionLLM and ToolOrchestratingLLM
- Async operations
### 2. Architecture and Class Relationships

Illustrates the static structure, inheritance hierarchy, and relationships.

Best for:

- Understanding the architecture
- Seeing the inheritance chain (BaseLLM → LLM → FunctionCallingLLM → Ollama)
- Identifying class responsibilities
- Understanding integration points
Key Classes:
- Ollama: Main LLM implementation
- FunctionCallingLLM: Tool calling abstraction
- LLM: High-level orchestration
- BaseLLM: Core interface
- Client/AsyncClient: HTTP communication
- Response models: ChatResponse, CompletionResponse, Message
### 3. Data Transformations and Validation

Tracks how data transforms through the system across the different operation modes.

Best for:

- Understanding data transformations
- Identifying validation points
- Seeing error handling paths
- Understanding request/response formats

Key Flows:

- Initialization and configuration
- Chat request building and response parsing
- Completion via decorator pattern
- Tool schema conversion
- Streaming chunk processing
- Error handling pipelines
### 4. Component Boundaries and Interactions

Shows component boundaries, responsibilities, and interaction patterns.

Best for:

- Understanding the system architecture
- Seeing component responsibilities
- Identifying interaction patterns
- Understanding integration layers

Key Components:

- User space (application code)
- Ollama core (request building, response parsing, tool handling)
- Client layer (HTTP communication)
- Ollama server (model runtime, inference)
- Orchestrator layer (TextCompletionLLM, ToolOrchestratingLLM)
### 5. Lifecycle and State Management

Depicts the lifecycle states, transitions, and state variables.

Best for:

- Understanding the instance lifecycle
- Seeing state transitions
- Identifying error states
- Understanding concurrency considerations

Key States:

- Uninitialized → Configured (initialization)
- Configured → ClientInitialized (lazy client creation)
- Idle ↔ Processing* (request handling)
- Processing → Error → Idle (error handling)
### 6. Usage Examples

Comprehensive examples drawn from real test cases.

Best for:

- Learning by example
- Understanding practical usage
- Seeing all API variants
- Integration patterns

Key Examples:

- Basic chat and completion
- Streaming operations
- Tool/function calling
- Integration with orchestrators
- Async operations
- Error handling
## Core Capabilities

### 1. Chat API
Direct conversation with the model:
- Single and multi-turn conversations
- System messages for context
- Image inputs (if model supports)
- Custom parameters (temperature, top_p, etc.), as sketched below
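A hedged sketch of a multi-turn exchange with a system message. It assumes `MessageRole.SYSTEM` exists alongside the `MessageRole.USER` shown in the examples above, and sets sampling parameters at construction time (see Configuration Options below):

```python
import os

from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama

# temperature is a constructor option (see Configuration Options);
# MessageRole.SYSTEM is assumed to exist alongside MessageRole.USER.
llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    temperature=0.2,
    timeout=180,
)

messages = [
    Message(role=MessageRole.SYSTEM, chunks=[TextChunk(content="Answer in one short sentence.")]),
    Message(role=MessageRole.USER, chunks=[TextChunk(content="What is Ollama?")]),
]
response = llm.chat(messages)
print(response.message.content)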
### 2. Completion API
Text completion via decorator pattern:
- Converts prompt to chat message
- Delegates to chat API
- Extracts text from the response (see the sketch below)
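A minimal sketch of that flow, assuming the completion entry point is a `complete()` method returning a `CompletionResponse` with a `text` attribute. Both names are inferred from the response models listed elsewhere in this document, not confirmed:

```python
import os

from serapeum.ollama import Ollama

llm = Ollama(model="qwen3.5:397b", api_key=os.environ.get("OLLAMA_API_KEY"), timeout=180)

# Internally: the prompt is wrapped into a chat message, the chat API is
# called, and the text is extracted into a CompletionResponse.
# The method name `complete` and the `text` attribute are assumptions.
response = llm.complete("The capital of France is")
print(response.text)
```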
### 3. Streaming
Real-time response generation:
- Stream chat responses
- Stream completion responses
- Chunk-by-chunk processing
- Delta content access (see the sketch below)
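Pattern 2 below shows cumulative streaming output. For delta access, a hedged sketch; the `delta` attribute name is an assumption based on the capability list above:

```python
import os

from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama

llm = Ollama(model="qwen3.5:397b", api_key=os.environ.get("OLLAMA_API_KEY"), timeout=180)
messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Tell me a joke.")])]

# Each chunk carries the full content so far (see Pattern 2); `chunk.delta`
# is assumed to expose only the newly generated text.
for chunk in llm.chat(messages, stream=True):
    print(chunk.delta, end="", flush=True)
```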
### 4. Tool/Function Calling
Structured interactions with tools:
- Automatic schema conversion
- Single or parallel tool calls
- Tool call validation
- Streaming tool calls
### 5. Async Operations
Non-blocking execution:
- Async chat and completion
- Async streaming (see the sketch after this list)
- Concurrent request handling
- Separate async client per event loop
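A sketch of async streaming, assuming `stream=True` composes with `achat()` the same way it does with `chat()`; this combination is unverified:

```python
import asyncio
import os

from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama


async def main() -> None:
    llm = Ollama(model="qwen3.5:397b", api_key=os.environ.get("OLLAMA_API_KEY"), timeout=180)
    messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Tell me a joke.")])]
    # Assumption: achat(stream=True) resolves to an async iterator of chunks.
    async for chunk in await llm.achat(messages, stream=True):
        print(chunk.message.content, flush=True)


asyncio.run(main())
```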
## Key Design Patterns

### 1. **Lazy Initialization**

Clients are created on first use, not during `__init__`:
```python notest
@property
def client(self) -> Client:
    if self._client is None:
        self._client = Client(host=self.base_url, timeout=self.timeout)
    return self._client
```
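The async client plausibly mirrors this; a sketch under that assumption (the real implementation may additionally cache one client per running event loop, per the Async Operations notes above):

```python notest
@property
def async_client(self) -> AsyncClient:
    # Assumed mirror of the sync property above; `async_client` is listed
    # under Configuration Options as lazily created by default.
    if self._async_client is None:
        self._async_client = AsyncClient(host=self.base_url, timeout=self.timeout)
    return self._async_client
```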
### 2. **Template Method Pattern**
`FunctionCallingLLM` defines the workflow; `Ollama` implements the specifics:
```python notest
def generate_tool_calls(self, messages, tools, **kwargs):
    prepared = self._prepare_chat_with_tools(messages, tools, **kwargs)  # Subclass
    response = self.chat(prepared)
    validated = self._validate_response(response, tools)  # Subclass
    return validated
```
### 3. **Adapter Pattern**

Ollama adapts between internal types and the Ollama server format:

- Message → Ollama message dict (illustrated below)
- BaseTool → Ollama tool schema
- Raw response dict → ChatResponse/CompletionResponse
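For illustration, the first of these mappings plausibly produces the standard Ollama chat payload. The adapter's internals are not shown here; only the target wire format is the documented Ollama one:

```python
from serapeum.core.llms import Message, MessageRole, TextChunk

# Internal representation
msg = Message(role=MessageRole.USER, chunks=[TextChunk(content="Say 'pong'.")])

# Roughly what the adapter sends to the server: the standard
# Ollama chat message format.
ollama_msg = {"role": "user", "content": "Say 'pong'."}
```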
## Integration Architecture

```text
User Application
        ↓
ToolOrchestratingLLM / TextCompletionLLM
        ↓
      Ollama
        ↓
Client / AsyncClient
        ↓
Ollama Server (HTTP)
        ↓
Model Runtime (llama3.1, etc.)
```
### TextCompletionLLM Integration
1. Formats prompt with variables
2. Checks is_chat_model → True
3. Calls Ollama.chat()
4. Parses response with PydanticParser
5. Returns validated model instance
### ToolOrchestratingLLM Integration
1. Converts output_cls to CallableTool
2. Formats prompt with variables
3. Calls Ollama.generate_tool_calls()
4. Ollama converts tool to schema
5. Server returns tool_calls
6. Executes tool to create instance
7. Returns model instance(s)
## Performance Considerations
- Client Reuse: Client created once and reused for all requests
- Async Support: Separate async client for concurrent operations
- Streaming: Reduces latency for long responses
- Connection Pooling: HTTP client handles connection reuse
- Lazy Initialization: Only create clients when needed
## Configuration Options

### Essential

- `model`: Model name (e.g., `"qwen3.5:397b"`)
- `base_url`: Ollama server URL (default: `"http://localhost:11434"`)
- `timeout`: Request timeout in seconds (default: 60.0)

### Generation

- `temperature`: Sampling temperature (0.0-1.0, default: 0.75)
- `context_window`: Maximum context tokens (default: 3900)
- `json_mode`: Force JSON output (default: False)

### Advanced

- `keep_alive`: Model keep-alive duration (default: None)
- `additional_kwargs`: Additional Ollama options
- `client`: Pre-configured client (default: None, lazy-created)
- `async_client`: Pre-configured async client (default: None, lazy-created)
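A sketch that sets most of these options explicitly. The values are illustrative only; `keep_alive="5m"` assumes the Ollama-style duration string format:

```python
import os

from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    base_url="http://localhost:11434",  # default
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180,
    temperature=0.2,
    context_window=3900,                # default
    json_mode=False,                    # default
    keep_alive="5m",                    # illustrative duration string
    additional_kwargs={"top_p": 0.9},   # passed through to Ollama
)
```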
## Error Handling

### Network Errors

- `TimeoutError`: Request timeout exceeded
- `ConnectionError`: Cannot reach the Ollama server
- `HTTPError`: Server returned an error status

### Parsing Errors

- `JSONDecodeError`: Invalid JSON response
- `KeyError`: Missing required field in the response
- `ValueError`: Invalid response format

### Configuration Errors

- `ValueError`: Invalid model or URL
- `TypeError`: Missing required field
- `AssertionError`: Invalid parameter value
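A hedged sketch of defensive calling around the network errors above; the exact exception types that surface may depend on the underlying HTTP client:

```python
import os

from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama

llm = Ollama(model="qwen3.5:397b", api_key=os.environ.get("OLLAMA_API_KEY"), timeout=180)
messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Hi!")])]

try:
    response = llm.chat(messages)
except TimeoutError:
    # Request exceeded `timeout`; retry, or raise the limit at construction.
    raise
except ConnectionError:
    # Server unreachable; check that `ollama serve` is running at base_url.
    raise
```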
## Prerequisites

### Server Requirements

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.1

# Start the server (listens on port 11434 by default)
ollama serve
```

### Python Requirements

```bash
# Install serapeum-ollama
pip install serapeum-ollama

# Or install from source
uv pip install -e libs/providers/serapeum-ollama
```
## Common Patterns

### Pattern 1: Reusable Instance
```python
# Create once
import os

from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama

llm = Ollama(model="qwen3.5:397b", api_key=os.environ.get("OLLAMA_API_KEY"), timeout=180)

# Reuse many times; the underlying HTTP client is created once and shared
messages1 = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Hi!")])]
messages2 = [Message(role=MessageRole.USER, chunks=[TextChunk(content="How are you?")])]
response1 = llm.chat(messages1)
response2 = llm.chat(messages2)
print(response1.message.content)
print(response2.message.content)
```
### Pattern 2: Streaming for Long Responses
```python
import os

from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180,
)

messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Tell me a joke.")])]

# Each chunk carries the cumulative content generated so far
for chunk in llm.chat(messages, stream=True):
    print(chunk.message.content, flush=True)
# I
# I'm
# I'm reading
# I'm reading a
# I'm reading a book
# I'm reading a book on
# I'm reading a book on anti
# I'm reading a book on anti-gr
# I'm reading a book on anti-gravity
# I'm reading a book on anti-gravity.
# I'm reading a book on anti-gravity. It
# I'm reading a book on anti-gravity. It's
# I'm reading a book on anti-gravity. It's impossible
# I'm reading a book on anti-gravity. It's impossible to
# I'm reading a book on anti-gravity. It's impossible to put
# I'm reading a book on anti-gravity. It's impossible to put down
# I'm reading a book on anti-gravity. It's impossible to put down!
# I'm reading a book on anti-gravity. It's impossible to put down! 📚
```
### Pattern 3: Tool Calling for Structured Outputs
```python
import os

from pydantic import BaseModel

from serapeum.core.tools import CallableTool
from serapeum.ollama import Ollama


# Define your output schema
class Song(BaseModel):
    title: str
    duration: int  # in seconds


class Album(BaseModel):
    title: str
    artist: str
    songs: list[Song]


llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180,
)

# Define a tool from the Pydantic model
tool = CallableTool.from_model(Album)

# Get structured output via tool calling
response = llm.generate_tool_calls(tools=[tool], message="Create an album about rock")
# print(response.message.additional_kwargs["tool_calls"])
# [
#     ToolCall(
#         function=Function(
#             name='Album',
#             arguments={
#                 'title': 'Thunder & Lightning',
#                 'artist': 'The Rock Legends',
#                 'songs': [
#                     {'duration': 245, 'title': 'Electric Storm'},
#                     {'duration': 312, 'title': 'Midnight Rider'},
#                     {'duration': 278, 'title': 'Breaking Chains'},
#                     {'duration': 295, 'title': 'Highway to Glory'},
#                     {'duration': 267, 'title': 'Rebel Heart'},
#                     {'duration': 334, 'title': 'Stone Cold Blues'},
#                     {'duration': 289, 'title': 'Rise Up'},
#                     {'duration': 356, 'title': 'Last Stand'},
#                 ],
#             },
#         )
#     )
# ]
```
### Pattern 4: Async for Concurrency
```python
import asyncio
import os

from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama


async def main():
    llm = Ollama(
        model="qwen3.5:397b",
        api_key=os.environ.get("OLLAMA_API_KEY"),
        timeout=180,
    )

    # Prepare multiple message lists
    message_list = [
        [Message(role=MessageRole.USER, chunks=[TextChunk(content="What is 2+2?")])],
        [Message(role=MessageRole.USER, chunks=[TextChunk(content="What is the capital of France?")])],
        [Message(role=MessageRole.USER, chunks=[TextChunk(content="What is Python?")])],
    ]

    # Process multiple requests concurrently
    tasks = [llm.achat(messages) for messages in message_list]
    responses = await asyncio.gather(*tasks)

    # Print results
    for i, response in enumerate(responses):
        print(f"Response {i+1}: {response.message.content}")


# Run the async function
asyncio.run(main())
# Response 1: 2 + 2 equals **4**.
# Response 2: The capital of France is **Paris**.
# Response 3: **Python** is a high-level, interpreted, general-purpose programming language. It is one of the most popular and widely used languages in the world today.
# Here is a breakdown of what makes Python unique:
# ### 1. Key Characteristics
# * **Readable Syntax:** Python was designed to be easy to read and write. It uses English keywords frequently and uses indentation (whitespace) to define blocks of code, rather than curly braces `{}`.
# * **Interpreted:** Code is executed line-by-line, which makes debugging easier and allows for interactive programming.
# * **Dynamic Typing:** You do not need to declare the type of a variable (e.g., integer, string) explicitly; Python figures it out automatically.
# * **Cross-Platform:** It works on Windows, macOS, Linux, and many other operating systems.
# ### 2. Common Use Cases
# Python is versatile and is used in many different fields:
# * **Web Development:** Building websites and backends (using frameworks like Django and Flask).
# * **Data Science & Analysis:** Analyzing large datasets (using libraries like Pandas and NumPy).
# * **Artificial Intelligence & Machine Learning:** Creating AI models (using TensorFlow, PyTorch, and scikit-learn).
# * **Automation & Scripting:** Automating repetitive tasks, file management, and system administration.
# * **Software Development:** Building desktop applications and testing software.
# ### 3. Why Is It So Popular?
# * **Ease of Learning:** It is often recommended as the first language for beginners because the syntax looks close to plain English.
# * **Huge Ecosystem:** It has a massive standard library and thousands of third-party packages (managed by **pip**) that allow you to do almost anything without starting from scratch.
# * **Strong Community:** There is a vast community of developers who contribute to documentation, tutorials, and support forums.
# ### 4. A Simple Example
# Here is what a simple Python program looks like. It prints text to the screen and performs a calculation:
# ```python
# # This is a comment
# print("Hello, World!")
# name = "Alice"
# age = 30
# print(f"{name} is {age} years old.")
# ```
# ### 5. History
# * **Creator:** Guido van Rossum.
# * **Released:** First released in **1991**.
# * **Name:** It was named after the British comedy troupe *Monty Python*, not the snake.
# In summary, Python is a powerful tool that balances ease of use with robust capabilities, making it suitable for both beginners and expert engineers.
```
## Troubleshooting

### Issue: Connection Refused

Confirm the server is running (`ollama serve`) and that `base_url` points at it (default: `http://localhost:11434`). The health check below also helps here.

### Issue: Model Not Found

Pull the model first (e.g., `ollama pull llama3.1`) and check that the `model` name matches an installed model exactly.

### Issue: Timeout

Raise the `timeout` parameter (the examples above use `timeout=180`); large models can be slow on the first request while the model loads.

### Issue: Invalid JSON Response

Verify the server returned a successful status, and for structured output consider enabling `json_mode` so the model is forced to emit JSON.
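When diagnosing connection problems, a stdlib-only health check can confirm the server is reachable. This assumes the default port; `/api/tags` is the Ollama server endpoint that lists installed models:

```python
import urllib.request

# A connection error here means the server is not reachable at base_url;
# a 200 response lists the models installed on the server.
with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=5) as resp:
    print(resp.status, resp.read()[:200])
```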
## Next Steps
- Start with Examples for practical usage patterns
- Review Sequence Diagrams to understand execution flow
- Study Class Diagram to understand architecture
- Explore Data Flow to understand transformations
- Check State Management for lifecycle details
## See Also
- TextCompletionLLM - Structured completion orchestrator
- ToolOrchestratingLLM - Tool-based orchestrator
- Ollama Official Documentation - Ollama server documentation