
# Ollama Usage Examples

This guide provides comprehensive examples of the main ways to use the `Ollama` LLM class, based on real test cases from the codebase.

## Prerequisites: Ollama Cloud API Key

The examples in this guide use the Ollama Cloud inference API, which requires an API key.

Steps to create your API key:

  1. Create an account at ollama.com (or sign in if you already have one)
  2. Navigate to ollama.com/settings/keys
  3. Click Generate to create a new API key
  4. Copy the key immediately — it will not be shown again

Set the environment variable:

```bash
export OLLAMA_API_KEY=your_api_key_here
```

Or add it to your .env file:

```
OLLAMA_API_KEY=your_api_key_here
```

Loading the .env file in Python:

Install python-dotenv:

```bash
pip install python-dotenv
```

Then load it at the top of your script:

```python notest
from dotenv import load_dotenv

load_dotenv()  # loads variables from .env into os.environ
```

All examples below read the key via `os.environ.get("OLLAMA_API_KEY")`.
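
To fail fast when the key is missing, a minimal check (illustrative only, not required by the library) can run before any of the examples:

```python notest
import os

if not os.environ.get("OLLAMA_API_KEY"):
    raise RuntimeError("OLLAMA_API_KEY is not set; create a key at ollama.com/settings/keys")
```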

---

## Table of Contents

1. [Basic Usage](#basic-usage)
2. [Initialization Patterns](#initialization-patterns)
3. [Chat Operations](#chat-operations)
4. [Completion Operations](#completion-operations)
5. [Streaming Operations](#streaming-operations)
6. [Tool/Function Calling](#toolfunction-calling)
7. [Integration with Orchestrators](#integration-with-orchestrators)
8. [Async Operations](#async-operations)
9. [Best Practices](#best-practices)

---

## Basic Usage

### Simple Chat

The most straightforward way to use `Ollama`:

```python
import os
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama

# Initialize Ollama LLM
llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180,
)

# Create a message
messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Say 'pong'.")])]

# Send chat request
response = llm.chat(messages)
print(response.message.content)  # "Pong!"
```

### Simple Completion

Using the completion API:

```python
import os
from serapeum.ollama import Ollama

# Initialize Ollama LLM
llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180,
)

# Send completion request
response = llm.complete("Say 'pong'.")
print(response.text)  # "Pong!"
```

## Initialization Patterns

### 1. Basic Initialization

Minimal configuration:

```python
import os
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY")
)
```

### 2. Full Configuration

With all common parameters:

```python
import os
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    temperature=0.8,
    context_window=4096,
    timeout=180.0,
    json_mode=True,
    keep_alive="5m",
    additional_kwargs={"top_p": 0.9, "top_k": 40}
)
```

### 3. With Custom Client

Pre-configured Ollama client:

```python
import os
from ollama import Client
from serapeum.ollama import Ollama

# Create custom client
client = Client(host="http://localhost:11434", timeout=300)

# Pass to Ollama
llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    client=client,
)
```

### 4. JSON Mode for Structured Outputs

Enable JSON formatting:

```python
import os
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    json_mode=True,  # Forces JSON output
    timeout=180,
)
```

## Chat Operations

### 1. Single Turn Chat

Basic conversation:

```python
import os
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180
)

messages = [
    Message(role=MessageRole.USER, chunks=[TextChunk(content="What is 2+2?")])
]

response = llm.chat(messages)
print(response.message.content)  # "4"
```

### 2. Multi-turn Conversation

With conversation history:

```python
import os
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180
)

messages = [
    Message(role=MessageRole.SYSTEM, chunks=[TextChunk(content="You are a helpful math tutor.")]),
    Message(role=MessageRole.USER, chunks=[TextChunk(content="What is 2+2?")]),
    Message(role=MessageRole.ASSISTANT, chunks=[TextChunk(content="2+2 equals 4.")]),
    Message(role=MessageRole.USER, chunks=[TextChunk(content="What about 3+3?")]),
]

response = llm.chat(messages)
print(response.message.content)  # "3+3 equals 6."
```

### 3. Chat with Parameters

Passing custom parameters:

```python
import os
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180
)

messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Write a creative story.")])]

# Override default settings
response = llm.chat(
    messages,
    temperature=0.9,      # Higher for creativity
    top_p=0.95,
    max_tokens=500,
)
```

### 4. Chat with Images

Multi-modal input (if supported by the model):

```python
import os
from pathlib import Path
from serapeum.core.llms import Message, MessageRole, Image, TextChunk
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3-vl:235b-instruct",     # Vision model
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180
)

# Create message with image
image = Image(path=Path("docs/reference/providers/ollama/images/baharia-oasis.jpg"))

messages = [
    Message(
        role=MessageRole.USER,
        chunks=[TextChunk(content="What's in this image?")],
    ),
    Message(
        chunks=[image],
    )
]

response = llm.chat(messages)
print(response.message.content)
```

## Completion Operations

### 1. Basic Completion

Simple text completion:

```python
import os
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180
)

prompt = "The capital of France is"
response = llm.complete(prompt)
print(response.text)  # "Paris"
```

### 2. Completion with Parameters

Custom generation settings:

```python
import os
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180
)

response = llm.complete(
    "Once upon a time",
    temperature=0.8,
    max_tokens=200,
)
print(response.text)
```

### 3. JSON Completion

Force JSON output:

```python
import os
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    json_mode=True,
    timeout=180,
)

prompt = 'Return {"name": "John", "age": 30} as JSON'
response = llm.complete(prompt)
print(response.text)  # {"name": "John", "age": 30}
```
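
Because the response text is a JSON string, it can be parsed into a Python object. A minimal sketch continuing from the example above, assuming the model returned valid JSON:

```python notest
import json

data = json.loads(response.text)  # parse the JSON string returned in json_mode
print(data["name"], data["age"])  # John 30
```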

## Streaming Operations

### 1. Stream Chat

Real-time streaming chat:

```python
import os
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180
)

messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Count from 1 to 5.")])]

# Stream responses
for chunk in llm.chat(messages, stream=True):
    print(chunk.message.content, end="", flush=True)
    # Outputs: "1" " 2" " 3" " 4" " 5"
```

### 2. Stream Completion

Real-time streaming completion:

```python
import os
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180
)

prompt = "Write a haiku about coding:"

# Stream completion
for chunk in llm.complete(prompt, stream=True):
    print(chunk.text, end="", flush=True)
```

### 3. Processing Stream with Delta

Access incremental content:

```python
import os
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180
)

messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Tell me a joke.")])]

full_response = ""
for chunk in llm.chat(messages, stream=True):
    delta = chunk.delta  # Incremental content
    if delta:
        full_response += delta
        print(delta, end="", flush=True)

print(f"\n\nFull response: {full_response}")

Tool/Function Calling#

  • For function calling we are going to use llama3.1 model from the local server, so make sure to pull the model first: ollama pull llama3.1, then restart the server.

1. Basic Tool Calling#

Using tools with Ollama:

```python notest
import os
from pydantic import BaseModel
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.core.tools import CallableTool
from serapeum.ollama import Ollama


class Album(BaseModel):
    title: str
    artist: str
    songs: list[str]


def create_album(title: str, artist: str, songs: list[str]) -> Album:
    """Create an album with the given information."""
    return Album(title=title, artist=artist, songs=songs)


llm = Ollama(
    model="llama3.1",
    timeout=180
)

# Create tool from function
tool = CallableTool.from_function(create_album)

message = Message(
    role=MessageRole.USER,
    chunks=[TextChunk(content="Create a rock album with two songs")])

# Call with tools
response = llm.generate_tool_calls(tools=[tool], message=message)

# Extract tool calls
tool_calls = llm.get_tool_calls_from_response(response)
print(tool_calls)
```

### 2. Tool Calling from Pydantic Model

Create tools from Pydantic models:

```python
import os
from pydantic import BaseModel, Field
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.core.tools import CallableTool
from serapeum.ollama import Ollama


class Album(BaseModel):
    """An music album."""
    title: str = Field(description="Album title")
    artist: str = Field(description="Artist name")
    songs: list[str] = Field(description="List of song titles")


llm = Ollama(
    model="llama3.1",
    timeout=180
)

# Create tool from Pydantic model
tool = CallableTool.from_model(Album)

message = Message(
    role=MessageRole.USER,
    chunks=[TextChunk(content="Create a jazz album with title 'Blue Notes' by Miles Davis with 3 songs")])

response = llm.generate_tool_calls(tools=[tool], message=message)

# Extract and execute tool call
tool_calls = llm.get_tool_calls_from_response(response)
for tool_call in tool_calls:
    # Execute tool
    result = tool.call(**tool_call.tool_kwargs)
    print(result)  # Album instance
```

### 3. Single Tool Call Mode

Force single tool call:

```python
import os
from pydantic import BaseModel
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.core.tools import CallableTool
from serapeum.ollama import Ollama


class Album(BaseModel):
    title: str
    artist: str
    songs: list[str]


llm = Ollama(
    model="llama3.1",
    timeout=180
)

tool = CallableTool.from_model(Album)

message = Message(
    role=MessageRole.USER,
    chunks=[TextChunk(content="Create two albums")])

# Force single tool call
response = llm.generate_tool_calls(
    tools=[tool],
    message=message,
    allow_parallel_tool_calls=False,  # Only one tool call allowed
)

tool_calls = llm.get_tool_calls_from_response(response)
print(len(tool_calls))  # 1 (even if model tried to return multiple)
```

### 4. Parallel Tool Calls

Allow multiple tool calls:

```python
import os
from pydantic import BaseModel
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.core.tools import CallableTool
from serapeum.ollama import Ollama


class Album(BaseModel):
    title: str
    artist: str
    songs: list[str]


llm = Ollama(
    model="llama3.1",
    timeout=180
)

tool = CallableTool.from_model(Album)

message = Message(
    role=MessageRole.USER,
    chunks=[TextChunk(content="Create two albums: one rock album and one jazz album")])

# Allow parallel tool calls
response = llm.generate_tool_calls(
    tools=[tool],
    message=message,
    allow_parallel_tool_calls=True,
)

tool_calls = llm.get_tool_calls_from_response(response)
print(len(tool_calls))  # 2 (if model returns multiple)

for tool_call in tool_calls:
    result = tool.call(**tool_call.tool_kwargs)
    print(result)
```

### 5. Streaming with Tools

Stream tool calls:

```python
import os
from pydantic import BaseModel
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.core.tools import CallableTool
from serapeum.ollama import Ollama


class Album(BaseModel):
    title: str
    artist: str
    songs: list[str]


llm = Ollama(
    model="llama3.1",
    timeout=180
)

tool = CallableTool.from_model(Album)

message = Message(
    role=MessageRole.USER,
    chunks=[TextChunk(content="Create a pop album")])

# Stream with tools
for chunk in llm.generate_tool_calls(tools=[tool], message=message, stream=True):
    # Process streaming tool calls
    if chunk.message.additional_kwargs.get("tool_calls"):
        print(f"Tool call chunk: {chunk.message.additional_kwargs['tool_calls']}")

Integration with Orchestrators#

1. With TextCompletionLLM#

Use Ollama with TextCompletionLLM for structured outputs:

import os
from pydantic import BaseModel
from serapeum.core.output_parsers import PydanticParser
from serapeum.core.llms import TextCompletionLLM
from serapeum.ollama import Ollama


class DummyModel(BaseModel):
    value: str


# Initialize Ollama
llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180
)

# Create parser
parser = PydanticParser(output_cls=DummyModel)

# Create TextCompletionLLM
text_llm = TextCompletionLLM(
    output_parser=parser,
    prompt="Value: {value}",
    llm=llm,
)

# Execute
result = text_llm(value="input")
print(result.value)  # "input"
```

### 2. With ToolOrchestratingLLM

Use Ollama with ToolOrchestratingLLM for tool-based workflows:

```python
import os
from pydantic import BaseModel
from serapeum.core.llms import ToolOrchestratingLLM
from serapeum.ollama import Ollama


class Album(BaseModel):
    title: str
    artist: str
    songs: list[str]


# Initialize Ollama
llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180
)

# Create ToolOrchestratingLLM
tools_llm = ToolOrchestratingLLM(
    schema=Album,
    prompt="Create an album about {topic} with two random songs",
    llm=llm,
)

# Execute - returns Album instance
result = tools_llm(topic="rock")
print(result.title)
print(result.artist)
print(result.songs)
```

### 3. Parallel Tool Execution

Using ToolOrchestratingLLM with parallel tools:

```python
import os
from pydantic import BaseModel
from serapeum.core.llms import ToolOrchestratingLLM
from serapeum.ollama import Ollama


class Album(BaseModel):
    title: str
    artist: str
    songs: list[str]


llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180
)

# Enable parallel tool calls
tools_llm = ToolOrchestratingLLM(
    schema=Album,
    prompt="Create albums about {topic}",
    llm=llm,
    allow_parallel_tool_calls=True,
)

# Returns list of Album instances
results = tools_llm(topic="jazz")
print(len(results))  # Potentially multiple albums
for album in results:
    print(f"{album.title} by {album.artist}")

4. Streaming with ToolOrchestratingLLM#

Stream tool execution results:

import os
from pydantic import BaseModel
from serapeum.core.llms import ToolOrchestratingLLM
from serapeum.ollama import Ollama


class Album(BaseModel):
    title: str
    artist: str
    songs: list[str]


llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180
)

tools_llm = ToolOrchestratingLLM(
    schema=Album,
    prompt="Create albums about {topic}",
    llm=llm,
    allow_parallel_tool_calls=False,
)

# Stream results
for album in tools_llm(topic="rock", stream=True):
    print(f"Received: {album.title}")

Async Operations#

1. Async Chat#

Non-blocking chat:

import asyncio
import os
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama


async def async_chat_example():
    llm = Ollama(
        model="qwen3.5:397b",
        api_key=os.environ.get("OLLAMA_API_KEY"),
        timeout=180
    )

    messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Hello!")])]

    response = await llm.achat(messages)
    print(response.message.content)


asyncio.run(async_chat_example())
```

### 2. Async Completion

Non-blocking completion:

```python
import asyncio
import os
from serapeum.ollama import Ollama


async def async_complete_example():
    llm = Ollama(
        model="qwen3.5:397b",
        api_key=os.environ.get("OLLAMA_API_KEY"),
        timeout=180
    )

    response = await llm.acomplete("Say hello")
    print(response.text)


asyncio.run(async_complete_example())
```

### 3. Async Streaming Chat

Non-blocking streaming:

```python
import asyncio
import os
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama


async def async_stream_example():
    llm = Ollama(
        model="qwen3.5:397b",
        api_key=os.environ.get("OLLAMA_API_KEY"),
        timeout=180
    )

    messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Count to 5")])]

    async for chunk in await llm.achat(messages, stream=True):
        print(chunk.message.content, end="", flush=True)


asyncio.run(async_stream_example())
```

### 4. Concurrent Async Requests

Process multiple requests concurrently:

```python
import asyncio
import os
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama


async def process_multiple():
    llm = Ollama(
        model="qwen3.5:397b",
        api_key=os.environ.get("OLLAMA_API_KEY"),
        timeout=180
    )

    prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]

    # Create tasks
    tasks = [
        llm.achat([Message(role=MessageRole.USER, chunks=[TextChunk(content=prompt)])])
        for prompt in prompts
    ]

    # Execute concurrently
    responses = await asyncio.gather(*tasks)

    for prompt, response in zip(prompts, responses):
        print(f"{prompt} -> {response.message.content}")


asyncio.run(process_multiple())
```
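
If you need to cap how many requests are in flight at once, the same pattern can be throttled with `asyncio.Semaphore`. This is an illustrative sketch; the helper `process_with_limit` and its `max_concurrency` parameter are not part of the library:

```python notest
import asyncio
import os
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama


async def process_with_limit(prompts, max_concurrency=2):
    llm = Ollama(
        model="qwen3.5:397b",
        api_key=os.environ.get("OLLAMA_API_KEY"),
        timeout=180,
    )
    semaphore = asyncio.Semaphore(max_concurrency)

    async def ask(prompt):
        # At most `max_concurrency` requests run concurrently
        async with semaphore:
            message = Message(role=MessageRole.USER, chunks=[TextChunk(content=prompt)])
            return await llm.achat([message])

    return await asyncio.gather(*(ask(p) for p in prompts))


responses = asyncio.run(process_with_limit(["What is 2+2?", "What is 3+3?"]))
for response in responses:
    print(response.message.content)
```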

### 5. Async with ToolOrchestratingLLM

Async tool orchestration:

```python
import asyncio
import os
from pydantic import BaseModel
from serapeum.core.llms import ToolOrchestratingLLM
from serapeum.ollama import Ollama


class Album(BaseModel):
    title: str
    artist: str
    songs: list[str]


async def async_tool_example():
    llm = Ollama(
        model="qwen3.5:397b",
        api_key=os.environ.get("OLLAMA_API_KEY"),
        timeout=180
    )

    tools_llm = ToolOrchestratingLLM(
        schema=Album,
        prompt="Create an album about {topic}",
        llm=llm,
    )

    result = await tools_llm.acall(topic="pop")
    print(result.title)


asyncio.run(async_tool_example())
```

### 6. Async Streaming with Tools

Async streaming tool execution:

```python
import asyncio
import os
from pydantic import BaseModel
from serapeum.core.llms import ToolOrchestratingLLM
from serapeum.ollama import Ollama


class Album(BaseModel):
    title: str
    artist: str
    songs: list[str]


async def async_stream_tool_example():
    llm = Ollama(
        model="qwen3.5:397b",
        api_key=os.environ.get("OLLAMA_API_KEY"),
        timeout=180
    )

    tools_llm = ToolOrchestratingLLM(
        schema=Album,
        prompt="Create albums about {topic}",
        llm=llm,
        allow_parallel_tool_calls=False,
    )

    stream = await tools_llm.acall(topic="rock", stream=True)
    async for album in stream:
        print(f"Received: {album.title}")


asyncio.run(async_stream_tool_example())
```

## Best Practices

### 1. Reuse LLM Instances

Create once, use many times:

```python notest
import os
from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, TextChunk

# ✓ Good: Create once
llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180,
)

# Reuse for multiple calls
message_1 = Message(role=MessageRole.USER, chunks=[TextChunk(content="hi")])
message_2 = Message(role=MessageRole.USER, chunks=[TextChunk(content="How are you?")])
response1 = llm.chat([message_1])
response2 = llm.chat([message_2])

# ✗ Bad: Don't recreate for each call
def process(messages):
    llm = Ollama(model="qwen3.5:397b", api_key=os.environ.get("OLLAMA_API_KEY"))  # Inefficient
    return llm.chat(messages)
```

### 2. Use Appropriate Timeout

Set timeout based on expected response time:

```python
import os
from serapeum.ollama import Ollama

# Short timeout for simple queries
quick_llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=30
)

# Longer timeout for complex queries
complex_llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=300
)
```

### 3. Handle Errors Gracefully

Always handle potential errors:

```python
import os
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180
)

try:
    response = llm.chat([Message(role=MessageRole.USER, chunks=[TextChunk(content="Hello")])])
except TimeoutError:
    print("Request timed out")
except ConnectionError:
    print("Could not connect to Ollama server")
except Exception as e:
    print(f"Unexpected error: {e}")

### 4. Use JSON Mode for Structured Outputs

Enable when expecting JSON:

```python
import os
from serapeum.ollama import Ollama

# Enable JSON mode
llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    json_mode=True,
    timeout=180,
)

# LLM will always return valid JSON
```

### 5. Monitor Response Metadata

Use metadata for monitoring:

```python
import os
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=180
)

response = llm.chat([Message(role=MessageRole.USER, chunks=[TextChunk(content="Hello")])])

# Access metadata
print(f"Model: {response.additional_kwargs.get('model')}")
print(f"Tokens: {response.additional_kwargs.get('eval_count')}")
print(f"Duration: {response.additional_kwargs.get('total_duration')}")

See Also#