Ollama Provider#

Local LLM inference for the Serapeum framework

The serapeum-ollama package provides complete Ollama backend support for Serapeum, enabling you to run powerful language models locally on your machine.

  • Local Inference: Run models locally on your machine without external API dependencies
  • Privacy First: All data stays on your machine; no internet connection is required after model download
  • Free & Open Source: No API costs; use any Ollama-compatible model
  • Complete Features: Chat, completion, streaming, tool calling, structured outputs, and embeddings


Features#

  • Chat & Completion: Full-featured LLM interface with multi-turn conversations
  • Streaming: Real-time token streaming for both chat and structured outputs
  • Tool Calling: Function calling with automatic schema generation
  • Structured Outputs: Type-safe extraction using Pydantic models
  • Embeddings: Local embedding generation for RAG and semantic search
  • Async Support: Full async/await support for all operations

Installation#

Install Serapeum-Ollama#

pip install serapeum-ollama

Install Ollama Server#

1. Download and Install Ollama

Visit ollama.com and follow the installation instructions for your platform:

  • macOS: Download the .app or use brew install ollama
  • Linux: curl -fsSL https://ollama.com/install.sh | sh
  • Windows: Download from ollama.com

2. Start the Ollama Server

ollama serve

The server runs on http://localhost:11434 by default.

3. Pull Models

Download the models you want to use:

# Chat/Completion models
ollama pull llama3.1          # Meta Llama 3.1 (recommended)
ollama pull llama3.2          # Latest Llama
ollama pull mistral           # Mistral 7B
ollama pull mixtral           # Mixtral 8x7B
ollama pull codellama         # Code-specialized
ollama pull gemma2            # Google Gemma 2

# Embedding models
ollama pull nomic-embed-text  # General embeddings (768d)
ollama pull mxbai-embed-large # High-quality (1024d)

4. Verify Installation

ollama list

You should see the models you've downloaded.


Quick Start#

import os
from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, TextChunk

# Initialize the LLM
llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    temperature=0.7
)

# Simple chat
messages = [
    Message(role=MessageRole.USER, chunks=[TextChunk(content="Explain quantum computing in one sentence.")])
]
response = llm.chat(messages)
print(response.message.content)

Chat & Completion#

Basic Chat#

import os
from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, MessageList, TextChunk

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    temperature=0.7,
    timeout=120
)

# Single message
response = llm.chat([
    Message(role=MessageRole.USER, chunks=[TextChunk(content="What is the capital of France?")])
])
print(response.message.content)  # "The capital of France is Paris."

# Multi-turn conversation
conversation = [
    Message(role=MessageRole.SYSTEM, chunks=[TextChunk(content="You are a helpful assistant.")]),
    Message(role=MessageRole.USER, chunks=[TextChunk(content="What's 2+2?")]),
    Message(role=MessageRole.ASSISTANT, chunks=[TextChunk(content="4")]),
    Message(role=MessageRole.USER, chunks=[TextChunk(content="And if I add 3?")]),
]

response = llm.chat(MessageList.from_list(conversation))
print(response.message.content)  # "7"

# Access token usage (when the backend reports it in the raw response)
if response.raw and "usage" in response.raw:
    print(f"Tokens used: {response.raw['usage']['total_tokens']}")

Completion Style#

Use prompt templates for completion-style interactions:

import os
from serapeum.ollama import Ollama
from serapeum.core.prompts import PromptTemplate

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    temperature=0.8
)

# Simple template
prompt = PromptTemplate("Write a tagline for a company that makes {product}.")
response = llm.predict(prompt, product="eco-friendly water bottles")
print(response)

# Multi-variable template
prompt = PromptTemplate(
    "Write a {style} poem about {topic} in {lines} lines."
)
response = llm.predict(
    prompt,
    style="haiku",
    topic="artificial intelligence",
    lines="3"
)
print(response)

Streaming#

Stream responses token-by-token for real-time feedback:

Sync Streaming#

import os
from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, TextChunk

llm = Ollama(model="qwen3.5:397b", api_key=os.environ.get("OLLAMA_API_KEY"))
messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Write a haiku about coding.")])]

# Synchronous streaming
print("Streaming response: ", end="")
for chunk in llm.chat(messages, stream=True):
    print(chunk.delta, end="", flush=True)
print()

# Get the complete message from the last chunk
full_response = chunk.message.content
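
If you prefer not to depend on the last chunk object, the streamed deltas can also be accumulated directly. A small sketch using the same chat(stream=True) call; the `or ""` guard assumes a delta may occasionally be empty:

# Alternative: build the full text from the deltas as they arrive
parts = []
for chunk in llm.chat(messages, stream=True):
    parts.append(chunk.delta or "")
full_response = "".join(parts)
print(full_response)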

Async Streaming#

import asyncio
import os
from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, TextChunk

async def stream_example():
    llm = Ollama(model="qwen3.5:397b", api_key=os.environ.get("OLLAMA_API_KEY"))
    messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Count to 5.")])]

    stream = await llm.achat(messages, stream=True)
    async for chunk in stream:
        print(chunk.delta, end="", flush=True)
    print()

asyncio.run(stream_example())

Structured Outputs#

Extract structured data using Pydantic models:

import os
from pydantic import BaseModel, Field
from serapeum.ollama import Ollama
from serapeum.core.prompts import PromptTemplate


class Person(BaseModel):
    name: str = Field(description="Person's full name")
    age: int = Field(description="Person's age in years")
    occupation: str = Field(description="Person's job title")

llm = Ollama(model="qwen3.5:397b", api_key=os.environ.get("OLLAMA_API_KEY"), json_mode=True)

prompt = PromptTemplate(
    "Extract person information from: {text}\n"
    "Return a JSON object with name, age, and occupation."
)

# Synchronous structured prediction
person = llm.parse(
    schema=Person,
    prompt=prompt,
    text="John Doe is a 32-year-old software engineer at Tech Corp."
)

print(f"{person.name}, {person.age}, works as {person.occupation}")
# Output: John Doe, 32, works as software engineer

# Streaming structured outputs
for partial in llm.stream_parse(
        schema=Person,
        prompt=prompt,
        text="Jane Smith, age 28, data scientist"
):
    if isinstance(partial, list):
        partial = partial[0]
    print(f"Partial: {partial}")

# Async structured prediction
import asyncio

async def get_structured():
    person = await llm.aparse(
        schema=Person,
        prompt=prompt,
        text="Alice Johnson is 45 and works as a CEO."
    )
    return person

result = asyncio.run(get_structured())
print(result)
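
The same parse API can also extract several records at once by wrapping the target model in a container. A minimal sketch reusing the imports, Person model, and llm from above; the Team model and its field names are illustrative, not part of the package:

class Team(BaseModel):
    members: list[Person] = Field(description="Every person mentioned in the text")

team_prompt = PromptTemplate(
    "Extract every person from: {text}\n"
    "Return a JSON object with a 'members' array of objects with name, age, and occupation."
)

team = llm.parse(
    schema=Team,
    prompt=team_prompt,
    text="Bob (41, teacher) met Carol, a 29-year-old nurse.",
)
for member in team.members:
    print(member.name, member.age, member.occupation)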

Tool Calling#

Create tools from functions or Pydantic models and let the LLM use them:

import os
from serapeum.ollama import Ollama
from serapeum.core.tools import CallableTool


def search_flights(origin: str, destination: str) -> dict:
    """Return estimated round-trip flight cost between two cities."""
    # Mock data
    prices = {
        ("london", "tokyo"): 850,
        ("new york", "paris"): 620,
        ("sydney", "dubai"): 540,
    }
    key = (origin.lower(), destination.lower())
    cost = prices.get(key, 700)
    return {"origin": origin, "destination": destination, "round_trip_cost_usd": cost}


def search_hotels(city: str, nights: int) -> dict:
    """Return estimated hotel cost for a stay in a city."""
    # Mock data — price per night
    per_night = {
        "tokyo": 180,
        "paris": 210,
        "dubai": 160,
    }
    rate = per_night.get(city.lower(), 150)
    return {"city": city, "nights": nights, "rate_per_night_usd": rate, "total_usd": rate * nights}


search_flight_tool = CallableTool.from_function(search_flights)
search_hotels_tool = CallableTool.from_function(search_hotels)

tools = [
    search_flight_tool,
    search_hotels_tool,
]

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=120,
)

response = llm.invoke_callable(
    tools=tools,
    user_msg="I'm planning a 7-night trip from London to Tokyo. What are the flight and hotel costs?",
    allow_parallel_tool_calls=True,
)
print(response)

Direct Tool Calling#

You can also use tools directly with the base LLM:

import os
from pydantic import BaseModel, Field
from serapeum.core.llms import Message, MessageRole, TextChunk
from serapeum.core.tools import CallableTool
from serapeum.ollama import Ollama


class CalculatorInput(BaseModel):
    """CalculatorInput data(operation, a, b)"""
    operation: str = Field(description="Math operation: add, subtract, multiply, divide")
    a: float = Field(description="First number")
    b: float = Field(description="Second number")

def calculate(operation: str, a: float, b: float) -> float:
    """Perform basic math operations."""
    ops = {
        "add": a + b,
        "subtract": a - b,
        "multiply": a * b,
        "divide": a / b if b != 0 else float('inf')
    }
    return ops.get(operation, 0)


calculator_tool = CallableTool.from_model(CalculatorInput)

messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="What's 25 + 17?")])]

llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=120,
)

response = llm.generate_tool_calls(
    tools=[calculator_tool],
    chat_history=messages,
)

# Check if model wants to call a tool
tool_calls = llm.get_tool_calls_from_response(response, error_on_no_tool_call=False)
if tool_calls:
    for call in tool_calls:
        print(f"Tool: {call.tool_name}")
        print(f"Arguments: {call.tool_kwargs}")

        # Execute the underlying function with the generated arguments
        result = calculate(**call.tool_kwargs)
        print(f"Result: {result}")

Embeddings#

Generate embeddings for RAG and semantic search:

Basic Embedding Generation#

Note: The embedding API is only available on a local Ollama server; it is not offered by the public Ollama cloud. Install and start the server locally (ollama serve) before generating embeddings.

from serapeum.ollama import OllamaEmbedding

# Initialize embedding model
embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
)

# Generate single embedding
text_embedding = embed_model.get_text_embedding("Machine learning is fascinating.")
print(f"Embedding dimension: {len(text_embedding)}")
print(f"First 5 values: {text_embedding[:5]}")

# Query embedding (optimized for retrieval)
query_embedding = embed_model.get_query_embedding("What is machine learning?")

Batch Embeddings#

Generate embeddings for multiple texts efficiently:

from serapeum.ollama import OllamaEmbedding
import numpy as np

embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    batch_size=32,  # Process 32 texts at a time
)

documents = [
    "Python is a high-level programming language.",
    "Machine learning enables computers to learn from data.",
    "Neural networks are inspired by biological neurons.",
    "Deep learning uses multi-layer neural networks.",
    "Natural language processing deals with text and speech.",
]

# Batch embedding generation
embeddings = embed_model.get_text_embedding_batch(documents)
print(f"Generated {len(embeddings)} embeddings")
print(f"Each embedding has {len(embeddings[0])} dimensions")

# Use with similarity search
query = "What is deep learning?"
query_emb = embed_model.get_query_embedding(query)

# Calculate cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarities = [
    (doc, cosine_similarity(query_emb, emb))
    for doc, emb in zip(documents, embeddings)
]

# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
print("\nMost similar documents:")
for doc, score in similarities[:3]:
    print(f"  {score:.3f}: {doc}")

Async Embeddings#

import asyncio
from serapeum.ollama import OllamaEmbedding

embed_model = OllamaEmbedding(model_name="nomic-embed-text")

async def embed_documents():
    # Async single embedding
    embedding = await embed_model.aget_text_embedding("Hello, world!")
    print(f"Embedding generated: {len(embedding)} dimensions")

    # Async batch embeddings
    documents = [
        "Document 1 about AI",
        "Document 2 about ML",
        "Document 3 about NLP",
    ]
    text_embed = await embed_model.aget_text_embedding_batch(documents)
    print(f"Generated {len(text_embed)} embeddings asynchronously")

    return text_embed

asyncio.run(embed_documents())

Advanced Configuration#

from serapeum.ollama import OllamaEmbedding

# Configure with instructions for better retrieval
embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434",
    batch_size=16,
    keep_alive="10m",  # Keep model loaded for 10 minutes
    query_instruction="Represent this query for retrieving relevant documents: ",
    text_instruction="Represent this document for retrieval: ",
)

# Instructions are automatically prepended
documents = "AI is transforming healthcare."
doc_embeddings = embed_model.get_text_embedding(documents)

query = "How is AI used in medicine?"
query_embedding = embed_model.get_query_embedding(query)

RAG Integration#

Combine LLM and embeddings for Retrieval-Augmented Generation:

import os
from serapeum.ollama import Ollama, OllamaEmbedding
from serapeum.core.llms import Message, MessageRole, TextChunk
import numpy as np

# Initialize both components
llm = Ollama(model="qwen3.5:397b", api_key=os.environ.get("OLLAMA_API_KEY"))
embed_model = OllamaEmbedding(model_name="nomic-embed-text")

# Knowledge base
knowledge_base = [
    "The Eiffel Tower is in Paris, France.",
    "The Great Wall of China is in China.",
    "The Statue of Liberty is in New York, USA.",
]

# Generate embeddings for knowledge base
kb_embeddings = embed_model.get_text_embedding_batch(knowledge_base)

# User query
query = "Where is the Eiffel Tower?"
query_emb = embed_model.get_query_embedding(query)

# Similarity search
similarities = [
    (doc, embed_model.similarity(query_emb, emb))
    for doc, emb in zip(knowledge_base, kb_embeddings)
]
similarities.sort(key=lambda x: x[1], reverse=True)
context = similarities[0][0]

# Use LLM with retrieved context
messages = [
    Message(
        role=MessageRole.SYSTEM,
        chunks=[TextChunk(content=f"Answer based on this context: {context}")]),
    Message(role=MessageRole.USER, chunks=[TextChunk(content=query)])
]

response = llm.chat(messages)
print(response.message.content)
# Output: "The Eiffel Tower is in Paris, France."

Configuration#

LLM Configuration#

import os
from serapeum.ollama import Ollama

llm = Ollama(
    model="qwen3.5:397b",                    # Required: Ollama model name
    api_key=os.environ.get("OLLAMA_API_KEY"),
    base_url="https://api.ollama.com",   # Ollama server URL
    temperature=0.75,                    # Sampling temperature (0.0-1.0)
    context_window=3900,                 # Max context tokens
    timeout=60.0,                # Request timeout in seconds
    json_mode=False,                     # Enable JSON formatting
    is_function_calling_model=True,      # Whether model supports tools
    keep_alive="5m",                     # How long to keep model loaded
    additional_kwargs={                  # Provider-specific options
        "num_predict": 100,              # Max tokens to generate
        "top_k": 40,                     # Top-k sampling
        "top_p": 0.9,                    # Top-p (nucleus) sampling
        "repeat_penalty": 1.1,           # Repetition penalty
    }
)

Key Parameters:

  • model: Model identifier (e.g., "qwen3.5:397b", "mistral:latest")
  • base_url: Ollama cloud server endpoint (default: https://api.ollama.com)
  • temperature: Controls randomness (0.0 = deterministic, 1.0 = very random)
  • json_mode: Request JSON-formatted responses when True
  • timeout: Timeout for API calls (increase for slower models)
  • keep_alive: Duration to keep model in memory (e.g., "5m", "1h")
  • additional_kwargs: Pass any Ollama-specific options
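
For example, these parameters can be combined into a configuration tuned for deterministic structured extraction; the values below are illustrative, not required defaults:

import os
from serapeum.ollama import Ollama

# Low temperature and JSON mode for predictable structured output,
# plus a longer timeout and a capped generation length
extraction_llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    temperature=0.0,
    json_mode=True,
    timeout=180,
    additional_kwargs={"num_predict": 512},
)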

Embedding Configuration#

from serapeum.ollama import OllamaEmbedding

embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",         # Required: embedding model name
    base_url="http://localhost:11434",     # Ollama server URL
    batch_size=10,                         # Batch size (1-2048)
    keep_alive="5m",                       # Model keep-alive duration
    query_instruction=None,                # Prefix for queries
    text_instruction=None,                 # Prefix for documents
    ollama_additional_kwargs={},           # Ollama API options
)

Available Models#

Chat/Completion Models#

| Model     | Size    | Description             | Best For                      |
|-----------|---------|-------------------------|-------------------------------|
| llama3.1  | 8B-405B | Meta's Llama 3.1        | General purpose (recommended) |
| llama3.2  | 3B-90B  | Latest Llama model      | Latest features               |
| mistral   | 7B      | Mistral 7B              | Fast inference                |
| mixtral   | 8x7B    | Mixtral MoE             | High quality, efficient       |
| codellama | 7B-70B  | Code-specialized Llama  | Code generation               |
| gemma2    | 9B-27B  | Google Gemma 2          | Google's latest               |

Embedding Models#

| Model                  | Dimensions | Description                |
|------------------------|------------|----------------------------|
| nomic-embed-text       | 768        | General-purpose embeddings |
| mxbai-embed-large      | 1024       | High-quality embeddings    |
| snowflake-arctic-embed | 1024       | Snowflake's model          |

Download models:

ollama pull llama3.1
ollama pull nomic-embed-text

List available models:

ollama list

Async Operations#

Full async support for concurrent operations:

import asyncio
import os
from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, TextChunk

async def main():
    llm = Ollama(model="qwen3.5:397b", api_key=os.environ.get("OLLAMA_API_KEY"))

    # Async chat
    response = await llm.achat([
        Message(role=MessageRole.USER, chunks=[TextChunk(content="Hello!")])
    ])
    print(response.message.content)

    # Async streaming
    messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Count to 5.")])]
    stream = await llm.achat(messages, stream=True)

    async for chunk in stream:
        print(chunk.delta, end="", flush=True)
    print()

asyncio.run(main())
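
The example above awaits each request in sequence. To actually run several requests concurrently, the async methods can be combined with asyncio.gather; a minimal sketch using the same achat API:

import asyncio
import os
from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, TextChunk

async def ask(llm: Ollama, question: str) -> str:
    response = await llm.achat([
        Message(role=MessageRole.USER, chunks=[TextChunk(content=question)])
    ])
    return response.message.content

async def run_concurrently():
    llm = Ollama(model="qwen3.5:397b", api_key=os.environ.get("OLLAMA_API_KEY"))
    questions = ["What is Python?", "What is Rust?", "What is Go?"]
    # Issue all requests at once and wait for every answer
    answers = await asyncio.gather(*(ask(llm, q) for q in questions))
    for question, answer in zip(questions, answers):
        print(f"{question}\n{answer}\n")

asyncio.run(run_concurrently())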

Testing#

Run tests for the Ollama provider:

# All tests
cd libs/providers/ollama
uv run pytest

# Skip tests requiring Ollama server
uv run pytest -m "not e2e"

# Only unit tests
uv run pytest -m unit

# With coverage
uv run pytest --cov=serapeum.ollama

Note: End-to-end tests require a running Ollama server with models available.


Notes & Limitations#

  • Server Required: Ollama must be running (ollama serve) before using this provider
  • Tool Calling: Depends on model capabilities (llama3.1+ recommended for best results)
  • JSON Mode: Improves structured output quality when enabled
  • Timeouts: Increase timeout for larger models or complex tasks
  • Local Only: All inference happens on your machine
  • Model Availability: Only models you've downloaded with ollama pull are available

Troubleshooting#

Connection Issues#

# Check if Ollama is running
import requests
try:
    response = requests.get("http://localhost:11434/api/tags")
    print("Ollama is running")
    print(f"Available models: {response.json()}")
except Exception as e:
    print(f"Ollama is not running: {e}")

Model Not Found#

# List available models
ollama list

# Pull missing model
ollama pull llama3.1

Timeout Issues#

import os
from serapeum.ollama import Ollama
# Increase timeout for slower models
llm = Ollama(
    model="qwen3.5:397b",
    api_key=os.environ.get("OLLAMA_API_KEY"),
    timeout=300  # 5 minutes
)


Support#

For Ollama-specific issues: Ollama GitHub Issues

For Serapeum integration issues: Serapeum GitHub Issues