Core Embeddings #

`serapeum.core.embeddings` #

Embedding module.

`BaseEmbedding` #

Bases: SerializableModel, CallMixin, ABC

Abstract base class for all embedding model implementations.

This class provides the core interface and shared functionality for converting text into dense vector embeddings. It supports both query and document embedding, with optional caching, batching, and async operations.

Subclasses must implement the abstract methods for generating embeddings from text and queries. The class handles caching, batching, and progress tracking automatically.
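
The batching behavior mentioned above can be sketched as follows. This is a minimal illustration, assuming the flush rule used in the source listing (a batch is emitted when it reaches `batch_size` or when the last text is reached); `iter_batches` is a hypothetical helper name and is not part of `serapeum`.

```python
from typing import Iterator


def iter_batches(texts: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most batch_size texts, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    cur_batch: list[str] = []
    for idx, text in enumerate(texts):
        cur_batch.append(text)
        # Flush when the batch is full or the last text has been reached.
        if idx == len(texts) - 1 or len(cur_batch) == batch_size:
            yield cur_batch
            cur_batch = []


batches = list(iter_batches(["a", "b", "c", "d", "e"], batch_size=2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Because batches are produced in input order and concatenated in the same order, the resulting embeddings line up one-to-one with the input texts.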

Attributes:

Name	Type	Description
`model_name`	`str`	Name of the embedding model. Defaults to "unknown".
`batch_size`	`int`	Number of texts to process in each batch. Must be between 1 and 2048. Defaults to 10.
`num_workers`	`int \| None`	Number of worker threads for async operations. If None, uses default async behavior without worker pooling.
`cache_store`	`Any \| None`	Optional key-value store for caching embeddings. Must implement get(), aget(), put(), and aput() methods. When provided, embeddings are cached using a key combining text and model configuration.
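
As a rough sketch of what a compatible `cache_store` might look like, the example below implements the four methods named above on top of a plain dictionary, together with a key in the documented `"{text}::{model_config_json}"` format. `InMemoryCacheStore` and `make_cache_key` are hypothetical names introduced for illustration. The keyword arguments (`key`, `val`, `collection`) follow how the source listing calls the store, but the exact protocol expected by `serapeum` is an assumption here.

```python
import asyncio
import json
from typing import Any, Optional


def make_cache_key(text: str, config: dict[str, Any]) -> str:
    """Build a key in the documented "{text}::{model_config_json}" format."""
    # api_key and cache_store are excluded, as described for _get_cache_key.
    safe = {k: v for k, v in config.items() if k not in {"api_key", "cache_store"}}
    # sort_keys makes the JSON (and therefore the key) deterministic.
    return f"{text}::{json.dumps(safe, sort_keys=True)}"


class InMemoryCacheStore:
    """Hypothetical dict-backed store exposing get/aget/put/aput."""

    def __init__(self) -> None:
        # Layout: collection name -> cache key -> stored value.
        self._data: dict[str, dict[str, dict[str, Any]]] = {}

    def get(self, key: str, collection: str) -> Optional[dict]:
        return self._data.get(collection, {}).get(key)

    def put(self, key: str, val: dict, collection: str) -> None:
        self._data.setdefault(collection, {})[key] = val

    async def aget(self, key: str, collection: str) -> Optional[dict]:
        # No real I/O here, so the async variant simply delegates.
        return self.get(key=key, collection=collection)

    async def aput(self, key: str, val: dict, collection: str) -> None:
        self.put(key=key, val=val, collection=collection)


config = {"model_name": "toy", "batch_size": 10, "api_key": "secret"}
key = make_cache_key("hello", config)

store = InMemoryCacheStore()
store.put(key=key, val={"some-id": [0.1, 0.2]}, collection="embeddings")
hit = store.get(key=key, collection="embeddings")
miss = asyncio.run(store.aget(key="absent", collection="embeddings"))
```

Some methods in the source listing test the store with `if self.cache_store:` rather than `is not None`, so a store class that defines `__len__` or `__bool__` and evaluates as falsy when empty may be treated as if no cache were configured.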

Notes

This is an abstract base class and cannot be instantiated directly. Subclasses must implement _get_query_embedding, _aget_query_embedding, and _get_text_embedding methods.
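
The contract described in this note can be illustrated without importing `serapeum`. The sketch below uses a hypothetical stand-in, `ToyBase`, that declares the same three abstract methods. A real subclass would inherit from `BaseEmbedding` instead, and `LengthEmbedding` is a toy model invented for this example.

```python
import asyncio
from abc import ABC, abstractmethod


class ToyBase(ABC):  # hypothetical stand-in for BaseEmbedding
    @abstractmethod
    def _get_query_embedding(self, query: str) -> list[float]: ...

    @abstractmethod
    async def _aget_query_embedding(self, query: str) -> list[float]: ...

    @abstractmethod
    def _get_text_embedding(self, text: str) -> list[float]: ...

    async def _aget_text_embedding(self, text: str) -> list[float]:
        # Mirrors the documented default: fall back to the sync method.
        return self._get_text_embedding(text)


class LengthEmbedding(ToyBase):
    """Toy model: embeds a string as [character count, word count]."""

    def _get_query_embedding(self, query: str) -> list[float]:
        return self._get_text_embedding(query)

    async def _aget_query_embedding(self, query: str) -> list[float]:
        return self._get_query_embedding(query)

    def _get_text_embedding(self, text: str) -> list[float]:
        return [float(len(text)), float(len(text.split()))]


# The abstract base cannot be instantiated directly.
try:
    ToyBase()
    instantiated = True
except TypeError:
    instantiated = False

model = LengthEmbedding()
vector = asyncio.run(model._aget_text_embedding("hello world"))
print(instantiated, vector)  # False [11.0, 2.0]
```

Only the three abstract methods are required. `_aget_text_embedding` is optional and is worth overriding only when the backend offers a native async call.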

See Also

serapeum.providers.ollama.embeddings.OllamaEmbedding: Concrete implementation for Ollama embedding models.

CallMixin: Mixin providing `__call__` and acall methods.

SerializableModel: Base Pydantic model with serialization support.

Source code in libs/core/src/serapeum/core/base/embeddings/base.py

class BaseEmbedding(SerializableModel, CallMixin, ABC):
    """Abstract base class for all embedding model implementations.

    This class provides the core interface and shared functionality for converting
    text into dense vector embeddings. It supports both query and document embedding,
    with optional caching, batching, and async operations.

    Subclasses must implement the abstract methods for generating embeddings from
    text and queries. The class handles caching, batching, and progress tracking
    automatically.

    Attributes:
        model_name: Name of the embedding model. Defaults to "unknown".
        batch_size: Number of texts to process in each batch. Must be between
            1 and 2048. Defaults to 10.
        num_workers: Number of worker threads for async operations. If None,
            uses default async behavior without worker pooling.
        cache_store: Optional key-value store for caching embeddings. Must implement
            get(), aget(), put(), and aput() methods. When provided, embeddings are
            cached using a key combining text and model configuration.

    Notes:
        This is an abstract base class and cannot be instantiated directly.
        Subclasses must implement _get_query_embedding, _aget_query_embedding,
        and _get_text_embedding methods.

    See Also:
        serapeum.providers.ollama.embeddings.OllamaEmbedding: Concrete implementation
            for Ollama embedding models.
        CallMixin: Mixin providing __call__ and acall methods.
        SerializableModel: Base Pydantic model with serialization support.
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)
    model_name: str = Field(
        default="unknown", description="The name of the embedding model."
    )
    batch_size: int = Field(
        default=DEFAULT_EMBED_BATCH_SIZE,
        description="The batch size for embedding calls.",
        gt=0,
        le=2048,
    )
    num_workers: int | None = Field(
        default=None,
        description="The number of workers to use for async embedding calls.",
    )

    cache_store: Any | None = Field(
        default=None,
        description=(
            "Key-value store for caching embeddings. Must implement get(), aget(), "
            "put(), and aput() methods with signature: get(key: str, collection: str) -> dict | None. "
            "When provided, embeddings are cached using a key that combines the text and model configuration. "
            "If None, embeddings are not cached and will be recomputed on each call."
        ),
    )

    def _get_cache_key(self, text: str) -> str:
        """Generate a unique cache key combining text and model configuration.

        The cache key includes both the input text and a JSON representation of
        the model configuration, ensuring that different models or configurations
        don't share cached embeddings. Sensitive fields like api_key are excluded.

        Args:
            text: Input text to generate a cache key for.

        Returns:
            Cache key string in format "{text}::{model_config_json}".

        Notes:
            The following fields are excluded from the cache key:
                - api_key: Sensitive credential information
                - cache_store: Avoid circular reference in serialization
        """
        model_dict = self.to_dict()
        model_dict.pop("api_key", None)
        model_dict.pop("cache_store", None)  # Avoid circular reference

        # Create a deterministic string representation
        model_str = json.dumps(model_dict, sort_keys=True)
        return f"{text}::{model_str}"

    @abstractmethod
    def _get_query_embedding(self, query: str) -> Embedding:
        """Embed the input query synchronously (internal implementation).

        This is an internal method that subclasses must implement to provide
        the core query embedding functionality. The public method get_query_embedding()
        handles caching and calls this method when needed.

        Query embeddings may use special instructions or prefixes depending on the
        model. For example, some models prepend "Represent the question for
        retrieving supporting documents: " to optimize for retrieval tasks.

        Args:
            query: Query text to embed.

        Returns:
            Embedding vector as a list of floats.

        See Also:
            get_query_embedding: Public method that handles caching and delegates
                to this method.
            _aget_query_embedding: Async version of this method.
        """

    @abstractmethod
    async def _aget_query_embedding(self, query: str) -> Embedding:
        """Embed the input query asynchronously (internal implementation).

        This is an internal async method that subclasses must implement to provide
        the core query embedding functionality. The public method aget_query_embedding()
        handles caching and calls this method when needed.

        Args:
            query: Query text to embed.

        Returns:
            Embedding vector as a list of floats.

        See Also:
            aget_query_embedding: Public async method that handles caching and
                delegates to this method.
            _get_query_embedding: Sync version of this method.
        """

    def get_query_embedding(self, query: str) -> Embedding:
        """Generate an embedding vector for a query string.

        Embeds the input query into a dense vector representation optimized for
        retrieval tasks. When caching is enabled, checks the cache first and stores
        new embeddings automatically.

        Depending on the model, a special instruction may be prepended to the raw
        query string to optimize for specific tasks. For example, some models use
        "Represent the question for retrieving supporting documents: ".

        Args:
            query: Query text to embed.

        Returns:
            Embedding vector as a list of floats.

        See Also:
            aget_query_embedding: Async version of this method.
            get_text_embedding: For embedding document text (not queries).
            _get_query_embedding: Internal implementation method.
        """
        query_embedding = None
        if self.cache_store:
            cache_key = self._get_cache_key(query)
            cached = self.cache_store.get(key=cache_key, collection="embeddings")
            if cached:
                cached_key = next(iter(cached.keys()))
                query_embedding = cached[cached_key]

        if query_embedding is None:
            query_embedding = self._get_query_embedding(query)
            if self.cache_store:
                cache_key = self._get_cache_key(query)
                self.cache_store.put(
                    key=cache_key,
                    val={str(uuid.uuid4()): query_embedding},
                    collection="embeddings",
                )

        return query_embedding

    async def aget_query_embedding(self, query: str) -> Embedding:
        """Asynchronously generate an embedding vector for a query string.

        Async version of get_query_embedding(). Embeds the input query into a dense
        vector representation with cache support.

        Args:
            query: Query text to embed.

        Returns:
            Embedding vector as a list of floats.

        See Also:
            get_query_embedding: Sync version of this method.
            aget_text_embedding: For embedding document text asynchronously.
            _aget_query_embedding: Internal async implementation method.
        """
        query_embedding = None
        if self.cache_store:
            cache_key = self._get_cache_key(query)
            cached = await self.cache_store.aget(key=cache_key, collection="embeddings")
            if cached:
                cached_key = next(iter(cached.keys()))
                query_embedding = cached[cached_key]

        if query_embedding is None:
            query_embedding = await self._aget_query_embedding(query)
            if self.cache_store:
                cache_key = self._get_cache_key(query)
                await self.cache_store.aput(
                    key=cache_key,
                    val={str(uuid.uuid4()): query_embedding},
                    collection="embeddings",
                )

        return query_embedding

    def get_agg_embedding_from_queries(
        self,
        queries: list[str],
        agg_fn: Callable[..., Embedding] | None = None,
    ) -> Embedding:
        """Generate a single aggregated embedding from multiple query strings.

        Embeds each query individually and then combines them using an aggregation
        function. This is useful for creating a unified representation from multiple
        related queries or questions.

        Args:
            queries: List of query strings to embed and aggregate.
            agg_fn: Optional aggregation function that takes a list of embeddings
                and returns a single embedding. Defaults to mean_agg (arithmetic mean).

        Returns:
            Single aggregated embedding vector as a list of floats.

        See Also:
            aget_agg_embedding_from_queries: Async version of this method.
            mean_agg: Default aggregation function.
            get_query_embedding: Used internally to embed each query.
        """
        query_embeddings = [self.get_query_embedding(query) for query in queries]
        agg_fn = agg_fn or mean_agg
        return agg_fn(query_embeddings)

    async def aget_agg_embedding_from_queries(
        self,
        queries: list[str],
        agg_fn: Callable[..., Embedding] | None = None,
    ) -> Embedding:
        """Asynchronously generate an aggregated embedding from multiple queries.

        Async version of get_agg_embedding_from_queries(). Embeds each query
        asynchronously and then combines them using an aggregation function.

        Args:
            queries: List of query strings to embed and aggregate.
            agg_fn: Optional aggregation function that takes a list of embeddings
                and returns a single embedding. Defaults to mean_agg.

        Returns:
            Single aggregated embedding vector as a list of floats.

        See Also:
            get_agg_embedding_from_queries: Sync version of this method.
            aget_query_embedding: Used internally to embed each query.
            mean_agg: Default aggregation function.
        """
        query_embeddings = [await self.aget_query_embedding(query) for query in queries]
        agg_fn = agg_fn or mean_agg
        return agg_fn(query_embeddings)

    @abstractmethod
    def _get_text_embedding(self, text: str) -> Embedding:
        """Embed document text synchronously (internal implementation).

        This is an internal method that subclasses must implement to provide
        the core text embedding functionality. The public method get_text_embedding()
        handles caching and calls this method when needed.

        Text embeddings may use different instructions or prefixes than query
        embeddings. For example, some models prepend "Represent the document for
        retrieval: " to optimize for document representation.

        Args:
            text: Document text to embed.

        Returns:
            Embedding vector as a list of floats.

        See Also:
            get_text_embedding: Public method that handles caching and delegates
                to this method.
            _aget_text_embedding: Async version of this method.
            _get_query_embedding: For embedding queries (not documents).
        """

    async def _aget_text_embedding(self, text: str) -> Embedding:
        """Embed document text asynchronously (internal implementation).

        This is an internal async method that subclasses can override to provide
        true async text embedding. The default implementation falls back to the
        sync method _get_text_embedding().

        Args:
            text: Document text to embed.

        Returns:
            Embedding vector as a list of floats.

        Notes:
            Subclasses should override this method if they have a native async
            implementation. Otherwise, the default fallback to the sync method
            is used.

        See Also:
            aget_text_embedding: Public async method that handles caching.
            _get_text_embedding: Sync version used as fallback.
        """
        # Default implementation just falls back on _get_text_embedding
        return self._get_text_embedding(text)

    def _get_text_embeddings(self, texts: list[str]) -> list[Embedding]:
        """Embed multiple texts synchronously in batch (internal implementation).

        This internal method provides batch embedding functionality. Subclasses
        can override this method to provide more efficient batch processing if
        supported by the underlying model API.

        Args:
            texts: List of document texts to embed.

        Returns:
            List of embedding vectors, one for each input text, in the same order.

        Notes:
            The default implementation simply loops over _get_text_embedding().
            Subclasses should override this if they can process batches more
            efficiently.

        See Also:
            get_text_embedding_batch: Public method for batch embedding with
                batching and progress tracking.
            _aget_text_embeddings: Async version of this method.
        """
        # Default implementation just loops over _get_text_embedding
        return [self._get_text_embedding(text) for text in texts]

    async def _aget_text_embeddings(self, texts: list[str]) -> list[Embedding]:
        """Embed multiple texts asynchronously in batch (internal implementation).

        This internal async method provides batch embedding functionality using
        asyncio.gather for concurrent processing. Subclasses can override this
        for more efficient batch processing.

        Args:
            texts: List of document texts to embed.

        Returns:
            List of embedding vectors, one for each input text, in the same order.

        Notes:
            The default implementation uses asyncio.gather to process all texts
            concurrently via _aget_text_embedding(). Subclasses should override
            this if they can process batches more efficiently.

        See Also:
            aget_text_embedding_batch: Public async method for batch embedding.
            _get_text_embeddings: Sync version of this method.
        """
        return await asyncio.gather(
            *[self._aget_text_embedding(text) for text in texts]
        )

    def _get_text_embeddings_cached(self, texts: list[str]) -> list[Embedding]:
        """Retrieve text embeddings from cache or generate if not cached.

        This internal method checks the cache for each text and only generates
        embeddings for texts not found in the cache. Newly generated embeddings
        are automatically stored in the cache.

        Args:
            texts: List of document texts to embed.

        Returns:
            List of embedding vectors, one for each input text, preserving order.

        Raises:
            ValueError: If cache_store is None when this method is called.

        See Also:
            _aget_text_embeddings_cached: Async version of this method.
            get_text_embedding_batch: Public method that uses this for caching.
        """
        if self.cache_store is None:
            raise ValueError("cache_store must be defined")

        embeddings: list[Embedding | None] = [None] * len(texts)
        # Tuples of (index, text) to be able to keep same order of embeddings
        non_cached_texts: list[tuple[int, str]] = []
        for i, txt in enumerate(texts):
            cache_key = self._get_cache_key(txt)
            cached_emb = self.cache_store.get(key=cache_key, collection="embeddings")
            if cached_emb is not None:
                cached_key = next(iter(cached_emb.keys()))
                embeddings[i] = cached_emb[cached_key]
            else:
                non_cached_texts.append((i, txt))
        if len(non_cached_texts) > 0:
            text_embeddings = self._get_text_embeddings(
                [x[1] for x in non_cached_texts]
            )
            for j, text_embedding in enumerate(text_embeddings):
                orig_i = non_cached_texts[j][0]
                embeddings[orig_i] = text_embedding

                cache_key = self._get_cache_key(texts[orig_i])
                self.cache_store.put(
                    key=cache_key,
                    val={str(uuid.uuid4()): text_embedding},
                    collection="embeddings",
                )
        return embeddings

    async def _aget_text_embeddings_cached(self, texts: list[str]) -> list[Embedding]:
        """Asynchronously retrieve text embeddings from cache or generate them.

        Async version of _get_text_embeddings_cached(). Checks the cache for each
        text and generates embeddings only for texts not found. Newly generated
        embeddings are automatically stored in the cache.

        Args:
            texts: List of document texts to embed.

        Returns:
            List of embedding vectors, one for each input text, preserving order.

        Raises:
            ValueError: If cache_store is None when this method is called.

        See Also:
            _get_text_embeddings_cached: Sync version of this method.
            aget_text_embedding_batch: Public async method that uses this.
        """
        if self.cache_store is None:
            raise ValueError("cache_store must be defined")

        embeddings: list[Embedding | None] = [None] * len(texts)
        # Tuples of (index, text) to be able to keep same order of embeddings
        non_cached_texts: list[tuple[int, str]] = []
        for i, txt in enumerate(texts):
            cache_key = self._get_cache_key(txt)
            cached_emb = await self.cache_store.aget(
                key=cache_key, collection="embeddings"
            )
            if cached_emb is not None:
                cached_key = next(iter(cached_emb.keys()))
                embeddings[i] = cached_emb[cached_key]
            else:
                non_cached_texts.append((i, txt))

        if len(non_cached_texts) > 0:
            text_embeddings = await self._aget_text_embeddings(
                [x[1] for x in non_cached_texts]
            )
            for j, text_embedding in enumerate(text_embeddings):
                orig_i = non_cached_texts[j][0]
                embeddings[orig_i] = text_embedding
                cache_key = self._get_cache_key(texts[orig_i])
                await self.cache_store.aput(
                    key=cache_key,
                    val={str(uuid.uuid4()): text_embedding},
                    collection="embeddings",
                )
        return embeddings

    def get_text_embedding(self, text: str) -> Embedding:
        """Generate an embedding vector for document text.

        Embeds the input text into a dense vector representation optimized for
        document representation tasks. When caching is enabled, checks the cache
        first and stores new embeddings automatically.

        Depending on the model, a special instruction may be prepended to the raw
        text string to optimize for document retrieval. For example, some models
        use "Represent the document for retrieval: ".

        Args:
            text: Document text to embed.

        Returns:
            Embedding vector as a list of floats.

        See Also:
            aget_text_embedding: Async version of this method.
            get_query_embedding: For embedding queries (not documents).
            get_text_embedding_batch: For embedding multiple texts efficiently.
            _get_text_embedding: Internal implementation method.
        """
        if not self.cache_store:
            text_embedding = self._get_text_embedding(text)
        else:
            cache_key = self._get_cache_key(text)
            cached_emb = self.cache_store.get(key=cache_key, collection="embeddings")
            if cached_emb is not None:
                cached_key = next(iter(cached_emb.keys()))
                text_embedding = cached_emb[cached_key]
            else:
                text_embedding = self._get_text_embedding(text)
                cache_key = self._get_cache_key(text)
                self.cache_store.put(
                    key=cache_key,
                    val={str(uuid.uuid4()): text_embedding},
                    collection="embeddings",
                )

        return text_embedding

    async def aget_text_embedding(self, text: str) -> Embedding:
        """Asynchronously generate an embedding vector for document text.

        Async version of get_text_embedding(). Embeds the input text into a dense
        vector representation with cache support.

        Args:
            text: Document text to embed.

        Returns:
            Embedding vector as a list of floats.

        See Also:
            get_text_embedding: Sync version of this method.
            aget_query_embedding: For embedding queries asynchronously.
            aget_text_embedding_batch: For embedding multiple texts efficiently.
            _aget_text_embedding: Internal async implementation method.
        """
        if not self.cache_store:
            text_embedding = await self._aget_text_embedding(text)
        else:
            cache_key = self._get_cache_key(text)
            cached_emb = await self.cache_store.aget(
                key=cache_key, collection="embeddings"
            )
            if cached_emb is not None:
                cached_key = next(iter(cached_emb.keys()))
                text_embedding = cached_emb[cached_key]
            else:
                text_embedding = await self._aget_text_embedding(text)
                cache_key = self._get_cache_key(text)
                await self.cache_store.aput(
                    key=cache_key,
                    val={str(uuid.uuid4()): text_embedding},
                    collection="embeddings",
                )

        return text_embedding

    def get_text_embedding_batch(
        self,
        texts: list[str],
        show_progress: bool = False,
        **kwargs: Any,
    ) -> list[Embedding]:
        """Generate embeddings for multiple texts with automatic batching.

        Processes a list of texts in batches according to self.batch_size. Supports
        optional progress tracking and automatic caching if cache_store is configured.

        Args:
            texts: List of document texts to embed.
            show_progress: Whether to display a progress bar. Defaults to False.
            **kwargs: Additional keyword arguments (reserved for future use).

        Returns:
            List of embedding vectors, one for each input text, in the same order.

        See Also:
            aget_text_embedding_batch: Async version with parallel processing.
            get_text_embedding: For embedding a single text.
            _get_text_embeddings: Internal batch processing method.
            _get_text_embeddings_cached: Internal cached batch processing.
        """
        cur_batch: list[str] = []
        result_embeddings: list[Embedding] = []

        queue_with_progress = enumerate(
            get_tqdm_iterable(texts, show_progress, "Generating embeddings")
        )

        for idx, text in queue_with_progress:
            cur_batch.append(text)
            if idx == len(texts) - 1 or len(cur_batch) == self.batch_size:
                # flush
                if not self.cache_store:
                    embeddings = self._get_text_embeddings(cur_batch)
                else:
                    embeddings = self._get_text_embeddings_cached(cur_batch)

                result_embeddings.extend(embeddings)

                cur_batch = []

        return result_embeddings

    async def aget_text_embedding_batch(
        self,
        texts: list[str],
        show_progress: bool = False,
        **kwargs: Any,
    ) -> list[Embedding]:
        """Asynchronously generate embeddings for multiple texts with batching.

        Async version of get_text_embedding_batch(). Processes texts in batches
        with concurrent execution for improved performance. Supports worker pooling
        if num_workers is set.

        Args:
            texts: List of document texts to embed.
            show_progress: Whether to display a progress bar. Defaults to False.
                Requires tqdm package for progress tracking.
            **kwargs: Additional keyword arguments (reserved for future use).

        Returns:
            List of embedding vectors, one for each input text, in the same order.

        Notes:
            When num_workers > 1, uses worker pooling for concurrent batch processing.
            When show_progress=True, attempts to use tqdm.asyncio for progress tracking.

        See Also:
            get_text_embedding_batch: Sync version of this method.
            aget_text_embedding: For embedding a single text asynchronously.
            _aget_text_embeddings: Internal async batch processing method.
        """
        num_workers = self.num_workers

        cur_batch: list[str] = []
        embeddings_coroutines: list[Coroutine] = []

        # for idx, text in queue_with_progress:
        for idx, text in enumerate(texts):
            cur_batch.append(text)
            if idx == len(texts) - 1 or len(cur_batch) == self.batch_size:
                # flush

                if not self.cache_store:
                    embeddings_coroutines.append(self._aget_text_embeddings(cur_batch))
                else:
                    embeddings_coroutines.append(
                        self._aget_text_embeddings_cached(cur_batch)
                    )

                cur_batch = []

        # flatten the results of asyncio.gather, which is a list of embeddings lists
        if len(embeddings_coroutines) > 0:
            if num_workers and num_workers > 1:
                nested_embeddings = await run_jobs(
                    embeddings_coroutines,
                    show_progress=show_progress,
                    workers=self.num_workers,
                    desc="Generating embeddings",
                )
            elif show_progress:
                try:
                    nested_embeddings = await tqdm_asyncio.gather(
                        *embeddings_coroutines,
                        total=len(embeddings_coroutines),
                        desc="Generating embeddings",
                    )
                except ImportError:
                    nested_embeddings = await asyncio.gather(*embeddings_coroutines)
            else:
                nested_embeddings = await asyncio.gather(*embeddings_coroutines)
        else:
            nested_embeddings = []

        result_embeddings = [
            embedding for embeddings in nested_embeddings for embedding in embeddings
        ]
        return result_embeddings

    @staticmethod
    def similarity(
        embedding1: Embedding,
        embedding2: Embedding,
        mode: SimilarityMode = SimilarityMode.DEFAULT,
    ) -> float:
        """Calculate similarity between two embedding vectors.

        Static method wrapper for the module-level similarity() function. Provides
        a convenient way to compute similarity directly from the class.

        Args:
            embedding1: First embedding vector (list of floats).
            embedding2: Second embedding vector (list of floats).
            mode: Similarity computation mode. Defaults to cosine similarity.

        Returns:
            Similarity score as a float. Interpretation depends on the mode.

        Examples:
            - Computing cosine similarity
                ```python
                >>> from serapeum.core.embeddings import BaseEmbedding  # type: ignore
                >>> emb1 = [1.0, 0.0]
                >>> emb2 = [1.0, 0.0]
                >>> float(BaseEmbedding.similarity(emb1, emb2))
                1.0

                ```

            - Using different similarity modes
                ```python
                >>> emb1 = [3.0, 4.0]
                >>> emb2 = [3.0, 4.0]
                >>> float(BaseEmbedding.similarity(emb1, emb2, mode=SimilarityMode.DOT_PRODUCT))
                25.0

                ```

        See Also:
            similarity: Module-level function that performs the actual calculation.
            SimilarityMode: Enum defining available similarity modes.
        """
        return similarity(embedding1=embedding1, embedding2=embedding2, mode=mode)

    def __call__(self, nodes: Sequence[BaseNode], **kwargs: Any) -> Sequence[BaseNode]:
        """Embed a sequence of nodes by calling the embedding model.

        This makes the embedding model callable, allowing it to be used as a function.
        Extracts text content from each node, generates embeddings, and assigns them
        back to the nodes.

        Args:
            nodes: Sequence of BaseNode objects to embed.
            **kwargs: Additional keyword arguments passed to get_text_embedding_batch.

        Returns:
            The input sequence of nodes with embeddings assigned to each node's
            embedding attribute.

        See Also:
            acall: Async version of this method.
            get_text_embedding_batch: Method used internally for batch embedding.
            MetadataMode.EMBED: Mode used to extract content from nodes.
        """
        embeddings = self.get_text_embedding_batch(
            [node.get_content(metadata_mode=MetadataMode.EMBED) for node in nodes],
            **kwargs,
        )

        for node, embedding in zip(nodes, embeddings):
            node.embedding = embedding

        return nodes

    async def acall(
        self, nodes: Sequence[BaseNode], **kwargs: Any
    ) -> Sequence[BaseNode]:
        """Asynchronously embed a sequence of nodes.

        Async version of __call__(). Extracts text content from each node,
        generates embeddings asynchronously, and assigns them back to the nodes.

        Args:
            nodes: Sequence of BaseNode objects to embed.
            **kwargs: Additional keyword arguments passed to aget_text_embedding_batch.

        Returns:
            The input sequence of nodes with embeddings assigned to each node's
            embedding attribute.

        See Also:
            __call__: Sync version of this method.
            aget_text_embedding_batch: Method used internally for async batch embedding.
            MetadataMode.EMBED: Mode used to extract content from nodes.
        """
        embeddings = await self.aget_text_embedding_batch(
            [node.get_content(metadata_mode=MetadataMode.EMBED) for node in nodes],
            **kwargs,
        )

        for node, embedding in zip(nodes, embeddings):
            node.embedding = embedding

        return nodes

`__call__(nodes, **kwargs)` #

Embed a sequence of nodes by calling the embedding model.

This makes the embedding model callable, allowing it to be used as a function. Extracts text content from each node, generates embeddings, and assigns them back to the nodes.
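A minimal sketch of the calling pattern, assuming hypothetical stand-ins (`FakeNode`, `FakeEmbedder`) in place of the real `BaseNode` and `BaseEmbedding` classes. The length-based "embedding" is invented purely for illustration, and the metadata mode argument is omitted.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence


@dataclass
class FakeNode:
    """Hypothetical stand-in for BaseNode: text plus an embedding slot."""

    text: str
    embedding: list[float] | None = None

    def get_content(self) -> str:
        return self.text


class FakeEmbedder:
    """Hypothetical stand-in for a concrete BaseEmbedding subclass."""

    def get_text_embedding_batch(self, texts: list[str]) -> list[list[float]]:
        # Toy embedding: [character count, word count].
        return [[float(len(t)), float(len(t.split()))] for t in texts]

    def __call__(self, nodes: Sequence[FakeNode]) -> Sequence[FakeNode]:
        # Same shape as the documented method: extract, embed, assign back.
        embeddings = self.get_text_embedding_batch([n.get_content() for n in nodes])
        for node, embedding in zip(nodes, embeddings):
            node.embedding = embedding
        return nodes


nodes = [FakeNode("hello world"), FakeNode("embeddings")]
embedded = FakeEmbedder()(nodes)
print(embedded[0].embedding)  # [11.0, 2.0]
```

In this sketch the nodes are mutated in place and the same sequence object is returned, which matches what the source listing for this method does.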

Parameters:

Name	Type	Description	Default
`nodes`	`Sequence[BaseNode]`	Sequence of BaseNode objects to embed.	required
`**kwargs`	`Any`	Additional keyword arguments passed to get_text_embedding_batch.	`{}`

Returns:

Type	Description
`Sequence[BaseNode]`	The input sequence of nodes with embeddings assigned to each node's embedding attribute.

See Also

acall: Async version of this method. get_text_embedding_batch: Method used internally for batch embedding. MetadataMode.EMBED: Mode used to extract content from nodes.

Source code in libs/core/src/serapeum/core/base/embeddings/base.py

def __call__(self, nodes: Sequence[BaseNode], **kwargs: Any) -> Sequence[BaseNode]:
    """Embed a sequence of nodes by calling the embedding model.

    This makes the embedding model callable, allowing it to be used as a function.
    Extracts text content from each node, generates embeddings, and assigns them
    back to the nodes.

    Args:
        nodes: Sequence of BaseNode objects to embed.
        **kwargs: Additional keyword arguments passed to get_text_embedding_batch.

    Returns:
        The input sequence of nodes with embeddings assigned to each node's
        embedding attribute.

    See Also:
        acall: Async version of this method.
        get_text_embedding_batch: Method used internally for batch embedding.
        MetadataMode.EMBED: Mode used to extract content from nodes.
    """
    embeddings = self.get_text_embedding_batch(
        [node.get_content(metadata_mode=MetadataMode.EMBED) for node in nodes],
        **kwargs,
    )

    for node, embedding in zip(nodes, embeddings):
        node.embedding = embedding

    return nodes

`acall(nodes, **kwargs)` `async` #

Asynchronously embed a sequence of nodes.

Async version of __call__(). Extracts text content from each node, generates embeddings asynchronously, and assigns them back to the nodes.
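The async path can be sketched the same way. `FakeNode` and `FakeAsyncEmbedder` below are hypothetical stand-ins, not the library's classes, and `asyncio.sleep(0)` only stands in for real network I/O.

```python
import asyncio


class FakeNode:
    """Hypothetical stand-in for BaseNode."""

    def __init__(self, text):
        self.text = text
        self.embedding = None


class FakeAsyncEmbedder:
    """Hypothetical stand-in for a concrete BaseEmbedding subclass."""

    async def aget_text_embedding_batch(self, texts):
        await asyncio.sleep(0)  # placeholder for an awaitable model call
        return [[float(len(t))] for t in texts]

    async def acall(self, nodes):
        embeddings = await self.aget_text_embedding_batch([n.text for n in nodes])
        for node, embedding in zip(nodes, embeddings):
            node.embedding = embedding
        return nodes


async def main():
    return await FakeAsyncEmbedder().acall([FakeNode("abc"), FakeNode("hello")])


# From synchronous code, the coroutine has to be driven by an event loop.
result = asyncio.run(main())
print([n.embedding for n in result])  # [[3.0], [5.0]]
```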

Parameters:

Name	Type	Description	Default
`nodes`	`Sequence[BaseNode]`	Sequence of BaseNode objects to embed.	required
`**kwargs`	`Any`	Additional keyword arguments passed to aget_text_embedding_batch.	`{}`

Returns:

Type	Description
`Sequence[BaseNode]`	The input sequence of nodes with embeddings assigned to each node's embedding attribute.

See Also

__call__: Sync version of this method. aget_text_embedding_batch: Method used internally for async batch embedding. MetadataMode.EMBED: Mode used to extract content from nodes.

Source code in libs/core/src/serapeum/core/base/embeddings/base.py

async def acall(
    self, nodes: Sequence[BaseNode], **kwargs: Any
) -> Sequence[BaseNode]:
    """Asynchronously embed a sequence of nodes.

    Async version of __call__(). Extracts text content from each node,
    generates embeddings asynchronously, and assigns them back to the nodes.

    Args:
        nodes: Sequence of BaseNode objects to embed.
        **kwargs: Additional keyword arguments passed to aget_text_embedding_batch.

    Returns:
        The input sequence of nodes with embeddings assigned to each node's
        embedding attribute.

    See Also:
        __call__: Sync version of this method.
        aget_text_embedding_batch: Method used internally for async batch embedding.
        MetadataMode.EMBED: Mode used to extract content from nodes.
    """
    embeddings = await self.aget_text_embedding_batch(
        [node.get_content(metadata_mode=MetadataMode.EMBED) for node in nodes],
        **kwargs,
    )

    for node, embedding in zip(nodes, embeddings):
        node.embedding = embedding

    return nodes

`aget_agg_embedding_from_queries(queries, agg_fn=None)` `async` #

Asynchronously generate an aggregated embedding from multiple queries.

Async version of get_agg_embedding_from_queries(). Embeds each query asynchronously and then combines them using an aggregation function.
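A sketch of passing a custom `agg_fn`, here an element-wise maximum instead of the default mean. `fake_aget_query_embedding` and `max_agg` are hypothetical names introduced for illustration. The queries are awaited one at a time, mirroring the list comprehension in the source listing for this method.

```python
import asyncio


async def fake_aget_query_embedding(query: str) -> list[float]:
    """Hypothetical stand-in for aget_query_embedding()."""
    await asyncio.sleep(0)
    # Toy embedding: [character count, word count].
    return [float(len(query)), float(query.count(" ") + 1)]


def max_agg(embeddings: list[list[float]]) -> list[float]:
    """Custom aggregation: element-wise maximum across embeddings."""
    return [max(values) for values in zip(*embeddings)]


async def aggregate(queries, agg_fn):
    # Each query is embedded sequentially, then the results are combined.
    query_embeddings = [await fake_aget_query_embedding(q) for q in queries]
    return agg_fn(query_embeddings)


agg = asyncio.run(aggregate(["what is rag", "embeddings"], max_agg))
print(agg)  # [11.0, 3.0]
```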

Parameters:

Name	Type	Description	Default
`queries`	`list[str]`	List of query strings to embed and aggregate.	required
`agg_fn`	`Callable[..., Embedding] \| None`	Optional aggregation function that takes a list of embeddings and returns a single embedding. Defaults to mean_agg.	`None`

Returns:

Type	Description
`Embedding`	Single aggregated embedding vector as a list of floats.

See Also

get_agg_embedding_from_queries: Sync version of this method. aget_query_embedding: Used internally to embed each query. mean_agg: Default aggregation function.

Source code in libs/core/src/serapeum/core/base/embeddings/base.py

async def aget_agg_embedding_from_queries(
    self,
    queries: list[str],
    agg_fn: Callable[..., Embedding] | None = None,
) -> Embedding:
    """Asynchronously generate an aggregated embedding from multiple queries.

    Async version of get_agg_embedding_from_queries(). Embeds each query
    asynchronously and then combines them using an aggregation function.

    Args:
        queries: List of query strings to embed and aggregate.
        agg_fn: Optional aggregation function that takes a list of embeddings
            and returns a single embedding. Defaults to mean_agg.

    Returns:
        Single aggregated embedding vector as a list of floats.

    See Also:
        get_agg_embedding_from_queries: Sync version of this method.
        aget_query_embedding: Used internally to embed each query.
        mean_agg: Default aggregation function.
    """
    query_embeddings = [await self.aget_query_embedding(query) for query in queries]
    agg_fn = agg_fn or mean_agg
    return agg_fn(query_embeddings)

`aget_query_embedding(query)` `async` #

Asynchronously generate an embedding vector for a query string.

Async version of get_query_embedding(). Embeds the input query into a dense vector representation with cache support.

Parameters:

Name	Type	Description	Default
`query`	`str`	Query text to embed.	required

Returns:

Type	Description
`Embedding`	Embedding vector as a list of floats.

See Also

get_query_embedding: Sync version of this method. aget_text_embedding: For embedding document text asynchronously. _aget_query_embedding: Internal async implementation method.

Source code in libs/core/src/serapeum/core/base/embeddings/base.py

async def aget_query_embedding(self, query: str) -> Embedding:
    """Asynchronously generate an embedding vector for a query string.

    Async version of get_query_embedding(). Embeds the input query into a dense
    vector representation with cache support.

    Args:
        query: Query text to embed.

    Returns:
        Embedding vector as a list of floats.

    See Also:
        get_query_embedding: Sync version of this method.
        aget_text_embedding: For embedding document text asynchronously.
        _aget_query_embedding: Internal async implementation method.
    """
    query_embedding = None
    if self.cache_store:
        cache_key = self._get_cache_key(query)
        cached = await self.cache_store.aget(key=cache_key, collection="embeddings")
        if cached:
            cached_key = next(iter(cached.keys()))
            query_embedding = cached[cached_key]

    if query_embedding is None:
        query_embedding = await self._aget_query_embedding(query)
        if self.cache_store:
            cache_key = self._get_cache_key(query)
            await self.cache_store.aput(
                key=cache_key,
                val={str(uuid.uuid4()): query_embedding},
                collection="embeddings",
            )

    return query_embedding

`aget_text_embedding(text)` `async` #

Asynchronously generate an embedding vector for document text.

Async version of get_text_embedding(). Embeds the input text into a dense vector representation with cache support.
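A rough sketch of the cache-then-compute flow with an async store. `AsyncDictCacheStore` and `fake_model_embed` are hypothetical; the sketch also uses the raw text as the cache key, whereas the library derives the key from the text and the model configuration.

```python
import asyncio
import uuid


class AsyncDictCacheStore:
    """Hypothetical in-memory store exposing aget()/aput()."""

    def __init__(self):
        self._data = {}

    async def aget(self, key, collection):
        return self._data.get((collection, key))

    async def aput(self, key, val, collection):
        self._data[(collection, key)] = val


model_calls = []  # records every time the "model" is actually invoked


async def fake_model_embed(text):
    model_calls.append(text)
    await asyncio.sleep(0)
    return [float(len(text))]


async def cached_text_embedding(store, text):
    cached = await store.aget(key=text, collection="embeddings")
    if cached is not None:
        # The stored value is a one-entry dict keyed by a random UUID.
        return next(iter(cached.values()))
    embedding = await fake_model_embed(text)
    await store.aput(
        key=text, val={str(uuid.uuid4()): embedding}, collection="embeddings"
    )
    return embedding


async def demo():
    store = AsyncDictCacheStore()
    first = await cached_text_embedding(store, "doc")
    second = await cached_text_embedding(store, "doc")
    return first, second


hit_a, hit_b = asyncio.run(demo())
print(hit_a, hit_b, len(model_calls))  # [3.0] [3.0] 1
```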

Parameters:

Name	Type	Description	Default
`text`	`str`	Document text to embed.	required

Returns:

Type	Description
`Embedding`	Embedding vector as a list of floats.

See Also

get_text_embedding: Sync version of this method. aget_query_embedding: For embedding queries asynchronously. aget_text_embedding_batch: For embedding multiple texts efficiently. _aget_text_embedding: Internal async implementation method.

Source code in libs/core/src/serapeum/core/base/embeddings/base.py

async def aget_text_embedding(self, text: str) -> Embedding:
    """Asynchronously generate an embedding vector for document text.

    Async version of get_text_embedding(). Embeds the input text into a dense
    vector representation with cache support.

    Args:
        text: Document text to embed.

    Returns:
        Embedding vector as a list of floats.

    See Also:
        get_text_embedding: Sync version of this method.
        aget_query_embedding: For embedding queries asynchronously.
        aget_text_embedding_batch: For embedding multiple texts efficiently.
        _aget_text_embedding: Internal async implementation method.
    """
    if not self.cache_store:
        text_embedding = await self._aget_text_embedding(text)
    elif self.cache_store is not None:
        cache_key = self._get_cache_key(text)
        cached_emb = await self.cache_store.aget(
            key=cache_key, collection="embeddings"
        )
        if cached_emb is not None:
            cached_key = next(iter(cached_emb.keys()))
            text_embedding = cached_emb[cached_key]
        else:
            text_embedding = await self._aget_text_embedding(text)
            cache_key = self._get_cache_key(text)
            await self.cache_store.aput(
                key=cache_key,
                val={str(uuid.uuid4()): text_embedding},
                collection="embeddings",
            )

    return text_embedding

`aget_text_embedding_batch(texts, show_progress=False, **kwargs)` `async` #

Asynchronously generate embeddings for multiple texts with batching.

Async version of get_text_embedding_batch(). Processes texts in batches with concurrent execution for improved performance. Supports worker pooling if num_workers is set.

Parameters:

Name	Type	Description	Default
`texts`	`list[str]`	List of document texts to embed.	required
`show_progress`	`bool`	Whether to display a progress bar. Defaults to False. Requires tqdm package for progress tracking.	`False`
`**kwargs`	`Any`	Additional keyword arguments (reserved for future use).	`{}`

Returns:

Type	Description
`list[Embedding]`	List of embedding vectors, one for each input text, in the same order.

Notes

When num_workers > 1, uses worker pooling for concurrent batch processing. When show_progress=True, attempts to use tqdm.asyncio for progress tracking.
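The batching and flattening steps can be sketched with plain `asyncio.gather`, covering only the path without worker pooling or a progress bar. `fake_embed_batch` is a hypothetical stand-in for the internal batch method.

```python
import asyncio


async def fake_embed_batch(batch):
    """Hypothetical stand-in for the internal batch call: one vector per text."""
    await asyncio.sleep(0)
    return [[float(len(text))] for text in batch]


async def embed_in_batches(texts, batch_size):
    coroutines = []
    cur_batch = []
    for idx, text in enumerate(texts):
        cur_batch.append(text)
        # Flush when the batch is full or the input is exhausted.
        if idx == len(texts) - 1 or len(cur_batch) == batch_size:
            coroutines.append(fake_embed_batch(cur_batch))
            cur_batch = []
    # gather() returns results in submission order, so input order is kept.
    nested = await asyncio.gather(*coroutines)
    return [embedding for batch in nested for embedding in batch]


vectors = asyncio.run(
    embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], batch_size=2)
)
print(vectors)  # [[1.0], [2.0], [3.0], [4.0], [5.0]]
```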

See Also

get_text_embedding_batch: Sync version of this method. aget_text_embedding: For embedding a single text asynchronously. _aget_text_embeddings: Internal async batch processing method.

Source code in libs/core/src/serapeum/core/base/embeddings/base.py

async def aget_text_embedding_batch(
    self,
    texts: list[str],
    show_progress: bool = False,
    **kwargs: Any,
) -> list[Embedding]:
    """Asynchronously generate embeddings for multiple texts with batching.

    Async version of get_text_embedding_batch(). Processes texts in batches
    with concurrent execution for improved performance. Supports worker pooling
    if num_workers is set.

    Args:
        texts: List of document texts to embed.
        show_progress: Whether to display a progress bar. Defaults to False.
            Requires tqdm package for progress tracking.
        **kwargs: Additional keyword arguments (reserved for future use).

    Returns:
        List of embedding vectors, one for each input text, in the same order.

    Notes:
        When num_workers > 1, uses worker pooling for concurrent batch processing.
        When show_progress=True, attempts to use tqdm.asyncio for progress tracking.

    See Also:
        get_text_embedding_batch: Sync version of this method.
        aget_text_embedding: For embedding a single text asynchronously.
        _aget_text_embeddings: Internal async batch processing method.
    """
    num_workers = self.num_workers

    cur_batch: list[str] = []
    embeddings_coroutines: list[Coroutine] = []

    # for idx, text in queue_with_progress:
    for idx, text in enumerate(texts):
        cur_batch.append(text)
        if idx == len(texts) - 1 or len(cur_batch) == self.batch_size:
            # flush

            if not self.cache_store:
                embeddings_coroutines.append(self._aget_text_embeddings(cur_batch))
            elif self.cache_store is not None:
                embeddings_coroutines.append(
                    self._aget_text_embeddings_cached(cur_batch)
                )

            cur_batch = []

    # flatten the results of asyncio.gather, which is a list of embeddings lists
    if len(embeddings_coroutines) > 0:
        if num_workers and num_workers > 1:
            nested_embeddings = await run_jobs(
                embeddings_coroutines,
                show_progress=show_progress,
                workers=self.num_workers,
                desc="Generating embeddings",
            )
        elif show_progress:
            try:
                nested_embeddings = await tqdm_asyncio.gather(
                    *embeddings_coroutines,
                    total=len(embeddings_coroutines),
                    desc="Generating embeddings",
                )
            except ImportError:
                nested_embeddings = await asyncio.gather(*embeddings_coroutines)
        else:
            nested_embeddings = await asyncio.gather(*embeddings_coroutines)
    else:
        nested_embeddings = []

    result_embeddings = [
        embedding for embeddings in nested_embeddings for embedding in embeddings
    ]
    return result_embeddings

`get_agg_embedding_from_queries(queries, agg_fn=None)` #

Generate a single aggregated embedding from multiple query strings.

Embeds each query individually and then combines them using an aggregation function. This is useful for creating a unified representation from multiple related queries or questions.
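For intuition, an element-wise arithmetic mean over a list of embeddings might look like the sketch below. `mean_agg_sketch` is a hypothetical name and the body is an assumption about what a mean aggregation does, not a copy of the library's `mean_agg`.

```python
def mean_agg_sketch(embeddings: list[list[float]]) -> list[float]:
    """Element-wise arithmetic mean (assumed behaviour, for illustration)."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    count = len(embeddings)
    # zip(*embeddings) groups the i-th component of every vector together.
    return [sum(values) / count for values in zip(*embeddings)]


query_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mean_vec = mean_agg_sketch(query_embeddings)
print(mean_vec)  # [3.0, 4.0]
```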

Parameters:

Name	Type	Description	Default
`queries`	`list[str]`	List of query strings to embed and aggregate.	required
`agg_fn`	`Callable[..., Embedding] \| None`	Optional aggregation function that takes a list of embeddings and returns a single embedding. Defaults to mean_agg (arithmetic mean).	`None`

Returns:

Type	Description
`Embedding`	Single aggregated embedding vector as a list of floats.

See Also

aget_agg_embedding_from_queries: Async version of this method. mean_agg: Default aggregation function. get_query_embedding: Used internally to embed each query.

Source code in libs/core/src/serapeum/core/base/embeddings/base.py

def get_agg_embedding_from_queries(
    self,
    queries: list[str],
    agg_fn: Callable[..., Embedding] | None = None,
) -> Embedding:
    """Generate a single aggregated embedding from multiple query strings.

    Embeds each query individually and then combines them using an aggregation
    function. This is useful for creating a unified representation from multiple
    related queries or questions.

    Args:
        queries: List of query strings to embed and aggregate.
        agg_fn: Optional aggregation function that takes a list of embeddings
            and returns a single embedding. Defaults to mean_agg (arithmetic mean).

    Returns:
        Single aggregated embedding vector as a list of floats.

    See Also:
        aget_agg_embedding_from_queries: Async version of this method.
        mean_agg: Default aggregation function.
        get_query_embedding: Used internally to embed each query.
    """
    query_embeddings = [self.get_query_embedding(query) for query in queries]
    agg_fn = agg_fn or mean_agg
    return agg_fn(query_embeddings)

`get_query_embedding(query)` #

Generate an embedding vector for a query string.

Embeds the input query into a dense vector representation optimized for retrieval tasks. When caching is enabled, checks the cache first and stores new embeddings automatically.

Depending on the model, a special instruction may be prepended to the raw query string to optimize for specific tasks. For example, some models use "Represent the question for retrieving supporting documents: ".
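The cache check can be sketched as follows. `DictCacheStore` and `compute_embedding` are hypothetical, and the raw query is used as the cache key here, whereas the library builds the key from the text and the model configuration.

```python
import uuid


class DictCacheStore:
    """Hypothetical in-memory store exposing get()/put()."""

    def __init__(self):
        self._data = {}

    def get(self, key, collection):
        return self._data.get((collection, key))

    def put(self, key, val, collection):
        self._data[(collection, key)] = val


calls = []  # records every time the "model" is actually invoked


def compute_embedding(query):
    """Stand-in for the model call."""
    calls.append(query)
    return [float(len(query))]


def cached_query_embedding(store, query):
    cached = store.get(key=query, collection="embeddings")
    if cached:
        # The stored value is a one-entry dict keyed by a random UUID.
        return next(iter(cached.values()))
    embedding = compute_embedding(query)
    store.put(key=query, val={str(uuid.uuid4()): embedding}, collection="embeddings")
    return embedding


store = DictCacheStore()
first = cached_query_embedding(store, "hello")
second = cached_query_embedding(store, "hello")
print(first, second, len(calls))  # [5.0] [5.0] 1
```

The second call is served from the store, so the stand-in model runs only once.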

Parameters:

Name	Type	Description	Default
`query`	`str`	Query text to embed.	required

Returns:

Type	Description
`Embedding`	Embedding vector as a list of floats.

See Also

aget_query_embedding: Async version of this method. get_text_embedding: For embedding document text (not queries). _get_query_embedding: Internal implementation method.

Source code in libs/core/src/serapeum/core/base/embeddings/base.py

def get_query_embedding(self, query: str) -> Embedding:
    """Generate an embedding vector for a query string.

    Embeds the input query into a dense vector representation optimized for
    retrieval tasks. When caching is enabled, checks the cache first and stores
    new embeddings automatically.

    Depending on the model, a special instruction may be prepended to the raw
    query string to optimize for specific tasks. For example, some models use
    "Represent the question for retrieving supporting documents: ".

    Args:
        query: Query text to embed.

    Returns:
        Embedding vector as a list of floats.

    See Also:
        aget_query_embedding: Async version of this method.
        get_text_embedding: For embedding document text (not queries).
        _get_query_embedding: Internal implementation method.
    """
    query_embedding = None
    if self.cache_store:
        cache_key = self._get_cache_key(query)
        cached = self.cache_store.get(key=cache_key, collection="embeddings")
        if cached:
            cached_key = next(iter(cached.keys()))
            query_embedding = cached[cached_key]

    if query_embedding is None:
        query_embedding = self._get_query_embedding(query)
        if self.cache_store:
            cache_key = self._get_cache_key(query)
            self.cache_store.put(
                key=cache_key,
                val={str(uuid.uuid4()): query_embedding},
                collection="embeddings",
            )

    return query_embedding

`get_text_embedding(text)` #

Generate an embedding vector for document text.

Embeds the input text into a dense vector representation optimized for document representation tasks. When caching is enabled, checks the cache first and stores new embeddings automatically.

Depending on the model, a special instruction may be prepended to the raw text string to optimize for document retrieval. For example, some models use "Represent the document for retrieval: ".
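To illustrate why queries and documents go through separate methods, the sketch below prepends the two example instructions quoted in the get_query_embedding and get_text_embedding descriptions. The helper names are hypothetical, and a given model may use different instructions or none at all.

```python
# Example instruction strings quoted from the method descriptions.
QUERY_INSTRUCTION = "Represent the question for retrieving supporting documents: "
TEXT_INSTRUCTION = "Represent the document for retrieval: "


def build_query_input(query: str) -> str:
    """Hypothetical helper: the string a model would receive for a query."""
    return QUERY_INSTRUCTION + query


def build_text_input(text: str) -> str:
    """Hypothetical helper: the string a model would receive for a document."""
    return TEXT_INSTRUCTION + text


same_string = "vector databases"
query_input = build_query_input(same_string)
text_input = build_text_input(same_string)
# The same raw string yields different model inputs, hence different vectors.
print(query_input != text_input)  # True
```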

Parameters:

Name	Type	Description	Default
`text`	`str`	Document text to embed.	required

Returns:

Type	Description
`Embedding`	Embedding vector as a list of floats.

See Also

aget_text_embedding: Async version of this method. get_query_embedding: For embedding queries (not documents). get_text_embedding_batch: For embedding multiple texts efficiently. _get_text_embedding: Internal implementation method.

Source code in libs/core/src/serapeum/core/base/embeddings/base.py

def get_text_embedding(self, text: str) -> Embedding:
    """Generate an embedding vector for document text.

    Embeds the input text into a dense vector representation optimized for
    document representation tasks. When caching is enabled, checks the cache
    first and stores new embeddings automatically.

    Depending on the model, a special instruction may be prepended to the raw
    text string to optimize for document retrieval. For example, some models
    use "Represent the document for retrieval: ".

    Args:
        text: Document text to embed.

    Returns:
        Embedding vector as a list of floats.

    See Also:
        aget_text_embedding: Async version of this method.
        get_query_embedding: For embedding queries (not documents).
        get_text_embedding_batch: For embedding multiple texts efficiently.
        _get_text_embedding: Internal implementation method.
    """
    if not self.cache_store:
        text_embedding = self._get_text_embedding(text)
    elif self.cache_store is not None:
        cache_key = self._get_cache_key(text)
        cached_emb = self.cache_store.get(key=cache_key, collection="embeddings")
        if cached_emb is not None:
            cached_key = next(iter(cached_emb.keys()))
            text_embedding = cached_emb[cached_key]
        else:
            text_embedding = self._get_text_embedding(text)
            cache_key = self._get_cache_key(text)
            self.cache_store.put(
                key=cache_key,
                val={str(uuid.uuid4()): text_embedding},
                collection="embeddings",
            )

    return text_embedding

`get_text_embedding_batch(texts, show_progress=False, **kwargs)` #

Generate embeddings for multiple texts with automatic batching.

Processes a list of texts in batches according to self.batch_size. Supports optional progress tracking and automatic caching if cache_store is configured.
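A small sketch of the flush rule, recording which batches are formed. `fake_embed_batch` is a hypothetical stand-in for the internal batch method, and caching and progress tracking are left out.

```python
def fake_embed_batch(batch):
    """Hypothetical stand-in for the internal batch call."""
    return [[float(len(text))] for text in batch]


def embed_in_batches(texts, batch_size):
    results = []
    batches_seen = []
    cur_batch = []
    for idx, text in enumerate(texts):
        cur_batch.append(text)
        # Flush when the batch is full or this is the last text.
        if idx == len(texts) - 1 or len(cur_batch) == batch_size:
            batches_seen.append(list(cur_batch))
            results.extend(fake_embed_batch(cur_batch))
            cur_batch = []
    return results, batches_seen


embeddings, batches = embed_in_batches(
    ["a", "bb", "ccc", "dddd", "eeeee"], batch_size=2
)
print(batches)  # [['a', 'bb'], ['ccc', 'dddd'], ['eeeee']]
```

The final batch may be smaller than `batch_size`; results are still returned in input order.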

Parameters:

Name	Type	Description	Default
`texts`	`list[str]`	List of document texts to embed.	required
`show_progress`	`bool`	Whether to display a progress bar. Defaults to False.	`False`
`**kwargs`	`Any`	Additional keyword arguments (reserved for future use).	`{}`

Returns:

Type	Description
`list[Embedding]`	List of embedding vectors, one for each input text, in the same order.

See Also

aget_text_embedding_batch: Async version with parallel processing. get_text_embedding: For embedding a single text. _get_text_embeddings: Internal batch processing method. _get_text_embeddings_cached: Internal cached batch processing.

Source code in libs/core/src/serapeum/core/base/embeddings/base.py

def get_text_embedding_batch(
    self,
    texts: list[str],
    show_progress: bool = False,
    **kwargs: Any,
) -> list[Embedding]:
    """Generate embeddings for multiple texts with automatic batching.

    Processes a list of texts in batches according to self.batch_size. Supports
    optional progress tracking and automatic caching if cache_store is configured.

    Args:
        texts: List of document texts to embed.
        show_progress: Whether to display a progress bar. Defaults to False.
        **kwargs: Additional keyword arguments (reserved for future use).

    Returns:
        List of embedding vectors, one for each input text, in the same order.

    See Also:
        aget_text_embedding_batch: Async version with parallel processing.
        get_text_embedding: For embedding a single text.
        _get_text_embeddings: Internal batch processing method.
        _get_text_embeddings_cached: Internal cached batch processing.
    """
    cur_batch: list[str] = []
    result_embeddings: list[Embedding] = []

    queue_with_progress = enumerate(
        get_tqdm_iterable(texts, show_progress, "Generating embeddings")
    )

    for idx, text in queue_with_progress:
        cur_batch.append(text)
        if idx == len(texts) - 1 or len(cur_batch) == self.batch_size:
            # flush
            if not self.cache_store:
                embeddings = self._get_text_embeddings(cur_batch)
            elif self.cache_store is not None:
                embeddings = self._get_text_embeddings_cached(cur_batch)

            result_embeddings.extend(embeddings)

            cur_batch = []

    return result_embeddings

`similarity(embedding1, embedding2, mode=SimilarityMode.DEFAULT)` `staticmethod` #

Calculate similarity between two embedding vectors.

Static method wrapper for the module-level similarity() function. Provides a convenient way to compute similarity directly from the class.

Parameters:

Name	Type	Description	Default
`embedding1`	`Embedding`	First embedding vector (list of floats).	required
`embedding2`	`Embedding`	Second embedding vector (list of floats).	required
`mode`	`SimilarityMode`	Similarity computation mode. Defaults to cosine similarity.	`DEFAULT`

Returns:

Type	Description
`float`	Similarity score as a float. Interpretation depends on the mode.

Examples:

Computing cosine similarity

>>> from serapeum.core.embeddings import BaseEmbedding  # type: ignore
>>> emb1 = [1.0, 0.0]
>>> emb2 = [1.0, 0.0]
>>> float(BaseEmbedding.similarity(emb1, emb2))
1.0

Using different similarity modes

>>> emb1 = [3.0, 4.0]
>>> emb2 = [3.0, 4.0]
>>> float(BaseEmbedding.similarity(emb1, emb2, mode=SimilarityMode.DOT_PRODUCT))
25.0
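
The two results above can be reproduced by hand. The pure-Python functions below are a sketch of the standard dot product and cosine similarity formulas, not the library's implementation.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


emb1 = [3.0, 4.0]
emb2 = [3.0, 4.0]
dot = dot_product(emb1, emb2)        # 3*3 + 4*4
cos = cosine_similarity(emb1, emb2)  # 25 / (5 * 5)
print(dot, cos)  # 25.0 1.0
```

Cosine similarity ignores vector length, while the dot product grows with it, which is why identical vectors score 1.0 in one mode and 25.0 in the other.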

See Also

similarity: Module-level function that performs the actual calculation. SimilarityMode: Enum defining available similarity modes.

Source code in libs/core/src/serapeum/core/base/embeddings/base.py

@staticmethod
def similarity(
    embedding1: Embedding,
    embedding2: Embedding,
    mode: SimilarityMode = SimilarityMode.DEFAULT,
) -> float:
    """Calculate similarity between two embedding vectors.

    Static method wrapper for the module-level similarity() function. Provides
    a convenient way to compute similarity directly from the class.

    Args:
        embedding1: First embedding vector (list of floats).
        embedding2: Second embedding vector (list of floats).
        mode: Similarity computation mode. Defaults to cosine similarity.

    Returns:
        Similarity score as a float. Interpretation depends on the mode.

    Examples:
        - Computing cosine similarity
            ```python
            >>> from serapeum.core.embeddings import BaseEmbedding  # type: ignore
            >>> emb1 = [1.0, 0.0]
            >>> emb2 = [1.0, 0.0]
            >>> float(BaseEmbedding.similarity(emb1, emb2))
            1.0

            ```

        - Using different similarity modes
            ```python
            >>> emb1 = [3.0, 4.0]
            >>> emb2 = [3.0, 4.0]
            >>> float(BaseEmbedding.similarity(emb1, emb2, mode=SimilarityMode.DOT_PRODUCT))
            25.0

            ```

    See Also:
        similarity: Module-level function that performs the actual calculation.
        SimilarityMode: Enum defining available similarity modes.
    """
    return similarity(embedding1=embedding1, embedding2=embedding2, mode=mode)

`BaseNode` #

Bases: SerializableModel, ABC

Abstract base class for document nodes with metadata and relationship management.

BaseNode provides the foundational functionality for representing chunks of documents with rich metadata, embeddings, and hierarchical relationships. It supports selective metadata inclusion for different contexts (LLM vs embeddings), automatic change detection via hashing, and efficient relationship caching.

Key features:

- Automatic UUID generation for node identification
- Metadata management with selective inclusion/exclusion for LLM and embedding contexts
- Relationship tracking (source, parent, children, previous, next)
- Embedding storage and retrieval
- Cached LinkedNodes computation with automatic invalidation
- Customizable metadata formatting and serialization

Attributes:

Name	Type	Description
`id`	`str`	Unique identifier for the node (auto-generated UUID if not provided).
`embedding`	`list[float] \| None`	Optional vector embedding for the node's content.
`metadata`	`dict[str, Any]`	Flat dictionary of metadata fields used for context and filtering.
`excluded_embed_metadata_keys`	`list[str]`	Metadata keys excluded from embedding context.
`excluded_llm_metadata_keys`	`list[str]`	Metadata keys excluded from LLM context.
`links`	`dict[Annotated[NodeType, EnumNameSerializer], NodeInfoType]`	Dictionary mapping NodeType to NodeInfo for relationships.
`metadata_template`	`str`	Template string for formatting metadata (default: "{key}: {value}").
`metadata_separator`	`str`	Separator between metadata fields (default: newline).

Note

This is an abstract base class. Subclasses must implement:

- get_type(): Return the node's content type identifier
- get_content(): Return the node's content with optional metadata
- set_content(): Update the node's content
- hash: Property returning the content hash for change detection

Examples:

Creating a concrete node subclass

>>> from serapeum.core.base.embeddings.types import BaseNode, MetadataMode, NodeType, NodeInfo
>>> import hashlib
>>> from pydantic import Field
>>>
>>> class TextNode(BaseNode):
...     text: str = Field(default="", description="Text content of the node")
...
...     @classmethod
...     def get_type(cls) -> str:
...         return "text"
...
...     def get_content(self, metadata_mode: MetadataMode = MetadataMode.ALL) -> str:
...         metadata_str = self.get_metadata_str(mode=metadata_mode)
...         return f"{metadata_str}\n{self.text}" if metadata_str else self.text
...
...     def set_content(self, value: str) -> None:
...         self.text = value
...
...     @property
...     def hash(self) -> str:
...         return hashlib.sha256(self.text.encode()).hexdigest()

Create a node with metadata

>>> node = TextNode(
...     text="Hello world",
...     metadata={"page": 1, "author": "Alice"}
... )
>>> node.get_type()
'text'
>>> node.get_content(metadata_mode=MetadataMode.NONE)
'Hello world'

Using metadata exclusion for different contexts

>>> node = TextNode(
...     text="Sensitive content",
...     metadata={"public": "yes", "internal_id": "secret123"},
...     excluded_llm_metadata_keys=["internal_id"]
... )

For LLM context (excludes internal_id)

>>> content_for_llm = node.get_content(metadata_mode=MetadataMode.LLM)
>>> "internal_id" in content_for_llm
False
>>> "public" in content_for_llm
True

Setting up node relationships

>>> parent = NodeInfo(id="parent-doc", type="document")
>>> child = NodeInfo(id="child-chunk", type="text")
>>>
>>> node = TextNode(
...     text="Child content",
...     links={NodeType.PARENT: parent, NodeType.SOURCE: parent}
... )
>>> node.linked_nodes.parent.id
'parent-doc'
>>> node.source_id
'parent-doc'

Working with embeddings

>>> node = TextNode(text="Sample text")
>>> node.embedding = [0.1, 0.2, 0.3, 0.4, 0.5]
>>> embedding_vec = node.get_embedding()
>>> len(embedding_vec)
5
>>> embedding_vec[0]
0.1

See Also

NodeInfo: Lightweight reference to a node. LinkedNodes: Container for node relationships. MetadataMode: Controls metadata inclusion in different contexts. SerializableModel: Base class providing serialization capabilities.

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

class BaseNode(SerializableModel, ABC):
    r"""Abstract base class for document nodes with metadata and relationship management.

    BaseNode provides the foundational functionality for representing chunks of
    documents with rich metadata, embeddings, and hierarchical relationships. It
    supports selective metadata inclusion for different contexts (LLM vs embeddings),
    automatic change detection via hashing, and efficient relationship caching.

    Key features:
    - Automatic UUID generation for node identification
    - Metadata management with selective inclusion/exclusion for LLM and embedding contexts
    - Relationship tracking (source, parent, children, previous, next)
    - Embedding storage and retrieval
    - Cached LinkedNodes computation with automatic invalidation
    - Customizable metadata formatting and serialization

    Attributes:
        id: Unique identifier for the node (auto-generated UUID if not provided).
        embedding: Optional vector embedding for the node's content.
        metadata: Flat dictionary of metadata fields used for context and filtering.
        excluded_embed_metadata_keys: Metadata keys excluded from embedding context.
        excluded_llm_metadata_keys: Metadata keys excluded from LLM context.
        links: Dictionary mapping NodeType to NodeInfo for relationships.
        metadata_template: Template string for formatting metadata (default: "{key}: {value}").
        metadata_separator: Separator between metadata fields (default: newline).

    Note:
        This is an abstract base class. Subclasses must implement:
        - get_type(): Return the node's content type identifier
        - get_content(): Return the node's content with optional metadata
        - set_content(): Update the node's content
        - hash: Property returning the content hash for change detection

    Examples:
        - Creating a concrete node subclass
            ```python
            >>> from serapeum.core.base.embeddings.types import BaseNode, MetadataMode, NodeType, NodeInfo
            >>> import hashlib
            >>> from pydantic import Field
            >>>
            >>> class TextNode(BaseNode):
            ...     text: str = Field(default="", description="Text content of the node")
            ...
            ...     @classmethod
            ...     def get_type(cls) -> str:
            ...         return "text"
            ...
            ...     def get_content(self, metadata_mode: MetadataMode = MetadataMode.ALL) -> str:
            ...         metadata_str = self.get_metadata_str(mode=metadata_mode)
            ...         return f"{metadata_str}\n{self.text}" if metadata_str else self.text
            ...
            ...     def set_content(self, value: str) -> None:
            ...         self.text = value
            ...
            ...     @property
            ...     def hash(self) -> str:
            ...         return hashlib.sha256(self.text.encode()).hexdigest()

            ```
            - Create a node with metadata
            ```python
            >>> node = TextNode(
            ...     text="Hello world",
            ...     metadata={"page": 1, "author": "Alice"}
            ... )
            >>> node.get_type()
            'text'
            >>> node.get_content(metadata_mode=MetadataMode.NONE)
            'Hello world'

            ```
        - Using metadata exclusion for different contexts
            ```python
            >>> node = TextNode(
            ...     text="Sensitive content",
            ...     metadata={"public": "yes", "internal_id": "secret123"},
            ...     excluded_llm_metadata_keys=["internal_id"]
            ... )

            ```
            - For LLM context (excludes internal_id)
            ```python
            >>> content_for_llm = node.get_content(metadata_mode=MetadataMode.LLM)
            >>> "internal_id" in content_for_llm
            False
            >>> "public" in content_for_llm
            True

            ```
        - Setting up node relationships
            ```python
            >>> parent = NodeInfo(id="parent-doc", type="document")
            >>> child = NodeInfo(id="child-chunk", type="text")
            >>>
            >>> node = TextNode(
            ...     text="Child content",
            ...     links={NodeType.PARENT: parent, NodeType.SOURCE: parent}
            ... )
            >>> node.linked_nodes.parent.id
            'parent-doc'
            >>> node.source_id
            'parent-doc'

            ```
        - Working with embeddings
            ```python
            >>> node = TextNode(text="Sample text")
            >>> node.embedding = [0.1, 0.2, 0.3, 0.4, 0.5]
            >>> embedding_vec = node.get_embedding()
            >>> len(embedding_vec)
            5
            >>> embedding_vec[0]
            0.1

            ```

    See Also:
        NodeInfo: Lightweight reference to a node.
        LinkedNodes: Container for node relationships.
        MetadataMode: Controls metadata inclusion in different contexts.
        SerializableModel: Base class providing serialization capabilities.
    """

    # hash is computed on a local field, during the validation process
    model_config = ConfigDict(populate_by_name=True, validate_assignment=True)

    id: str = Field(
        default_factory=lambda: str(uuid.uuid4()), description="Unique ID of the node."
    )
    embedding: list[float] | None = Field(
        default=None, description="Embedding of the node."
    )

    metadata: dict[str, Any] = Field(
        default_factory=dict,
        description="A flat dictionary of metadata fields",
    )
    excluded_embed_metadata_keys: list[str] = Field(
        default_factory=list,
        description="Metadata keys that are excluded from text for the embed model.",
    )
    excluded_llm_metadata_keys: list[str] = Field(
        default_factory=list,
        description="Metadata keys that are excluded from text for the LLM.",
    )
    links: dict[
        Annotated[NodeType, EnumNameSerializer],
        NodeInfoType,
    ] = Field(
        default_factory=dict,
        description="A mapping of links to other nodes.",
    )
    metadata_template: str = Field(
        default=DEFAULT_METADATA_TMPL,
        description=(
            "Template for how metadata is formatted, with {key} and "
            "{value} placeholders."
        ),
    )
    metadata_separator: str = Field(
        default="\n",
        description="Separator between metadata fields when converting to string.",
    )

    linked_nodes_cache: LinkedNodes | None = Field(
        default=None,
        exclude=True,
        repr=False,
        description="Cached LinkedNodes object, invalidated when links change.",
    )

    # Track the links dict id to detect changes
    links_dict_id: int | None = Field(
        default=None,
        exclude=True,
        repr=False,
        description="ID of the links dict to detect when it's reassigned.",
    )

    @model_validator(mode="after")
    def _invalidate_linked_nodes_cache_on_links_change(self) -> "BaseNode":
        """Invalidate the linked_nodes cache when links dict is reassigned.

        This validator tracks the id of the links dict. When it changes
        (i.e., links is reassigned), the cache is cleared.

        Uses Pydantic v2's @model_validator with object.__setattr__ to avoid recursion.
        """
        current_links_id = id(self.links)

        # Check if links dict was reassigned (different id)
        if self.links_dict_id is None or self.links_dict_id != current_links_id:
            # Links changed, clear cache and update tracked id
            object.__setattr__(self, "linked_nodes_cache", None)
            object.__setattr__(self, "links_dict_id", current_links_id)

        return self

    @classmethod
    @abstractmethod
    def get_type(cls) -> str:
        """Get Object type."""

    @abstractmethod
    def get_content(self, metadata_mode: MetadataMode = MetadataMode.ALL) -> str:
        """Get object content."""

    def get_metadata_str(self, mode: MetadataMode = MetadataMode.ALL) -> str:
        """Format metadata as a string, omitting keys excluded for the mode."""
        if mode == MetadataMode.NONE:
            return ""

        excluded = set()
        if mode == MetadataMode.LLM:
            excluded = set(self.excluded_llm_metadata_keys)
        elif mode == MetadataMode.EMBED:
            excluded = set(self.excluded_embed_metadata_keys)

        filtered = (
            self.metadata.items()
            if not excluded
            else (
                (key, value)
                for key, value in self.metadata.items()
                if key not in excluded
            )
        )
        return self.metadata_separator.join(
            self.metadata_template.format(key=key, value=str(value))
            for key, value in filtered
        )

    @abstractmethod
    def set_content(self, value: Any) -> None:
        """Set the content of the node."""

    @property
    @abstractmethod
    def hash(self) -> str:
        """Get hash of node."""

    @property
    def source_id(self) -> str | None:
        return self.linked_nodes.source_id

    def _clear_linked_nodes_cache(self) -> None:
        """Manually clear the linked_nodes cache.

        Call this method if you mutate the links dict in-place.
        This is necessary because the model validator that tracks `links`
        only runs on field assignment, not on in-place mutations.

        Examples:
            >>> from serapeum.core.base.embeddings.types import BaseNode, NodeInfo, NodeType, MetadataMode
            >>> import hashlib
            >>> from pydantic import Field
            >>> class TextNode(BaseNode):
            ...     text: str = Field(default="")
            ...     @classmethod
            ...     def get_type(cls) -> str:
            ...         return "text"
            ...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
            ...         return self.text
            ...     def set_content(self, value: str) -> None:
            ...         self.text = value
            ...     @property
            ...     def hash(self) -> str:
            ...         return hashlib.sha256(self.text.encode()).hexdigest()
            >>> node = TextNode(text="Sample", links={})
            >>> new_source = NodeInfo(id="updated-source", type="document")
            >>> node.links[NodeType.SOURCE] = new_source
            >>> node._clear_linked_nodes_cache()
            >>> node.linked_nodes.source.id
            'updated-source'
        """
        self.linked_nodes_cache = None

    @property
    def linked_nodes(self) -> LinkedNodes:
        """Get linked nodes from the links dictionary.

        This property validates and converts the links dictionary into a
        LinkedNodes object. The result is cached and automatically invalidated
        when the links field is reassigned through Pydantic's field validation.

        Returns:
            LinkedNodes: A validated and cached LinkedNodes object.

        Note:
            - Cache is automatically cleared when `links` is reassigned
            - For in-place mutations (e.g., node.links[key] = value), you must
              either reassign the entire dict OR call _clear_linked_nodes_cache()
            - Uses Pydantic's @model_validator (with validate_assignment) to manage cache invalidation

        Examples:
            >>> from serapeum.core.base.embeddings.types import BaseNode, NodeInfo, NodeType, MetadataMode
            >>> import hashlib
            >>> from pydantic import Field
            >>> class TextNode(BaseNode):
            ...     text: str = Field(default="")
            ...     @classmethod
            ...     def get_type(cls) -> str:
            ...         return "text"
            ...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
            ...         return self.text
            ...     def set_content(self, value: str) -> None:
            ...         self.text = value
            ...     @property
            ...     def hash(self) -> str:
            ...         return hashlib.sha256(self.text.encode()).hexdigest()
            >>> node = TextNode(text="Sample")
            >>> source_ref = NodeInfo(id="doc-123", type="document")
            >>> node.links = {NodeType.SOURCE: source_ref}
            >>> node.linked_nodes.source.id
            'doc-123'

            >>> node = TextNode(text="Sample", links={})
            >>> prev_ref = NodeInfo(id="prev-chunk", type="text")
            >>> node.links[NodeType.PREVIOUS] = prev_ref
            >>> node._clear_linked_nodes_cache()
            >>> node.linked_nodes.previous.id
            'prev-chunk'

            >>> node = TextNode(text="Sample")
            >>> parent = NodeInfo(id="parent-1", type="document")
            >>> child1 = NodeInfo(id="child-1", type="text")
            >>> child2 = NodeInfo(id="child-2", type="text")
            >>> node.links = {NodeType.PARENT: parent, NodeType.CHILD: [child1, child2]}
            >>> node.linked_nodes.parent.id
            'parent-1'
            >>> [c.id for c in node.linked_nodes.children]
            ['child-1', 'child-2']
        """
        if self.linked_nodes_cache is None:
            # Compute and cache the LinkedNodes
            self.linked_nodes_cache = LinkedNodes.create(self.links)
        return self.linked_nodes_cache

    def __str__(self) -> str:
        """Return the node ID and a truncated, wrapped content preview."""
        source_text_truncated = truncate_text(
            self.get_content().strip(), TRUNCATE_LENGTH
        )
        source_text_wrapped = textwrap.fill(
            f"Text: {source_text_truncated}\n", width=WRAP_WIDTH
        )
        return f"Node ID: {self.id}\n{source_text_wrapped}"

    def get_embedding(self) -> list[float]:
        """Get embedding.

        Raises:
            ValueError: if embedding is None.
        """
        if self.embedding is None:
            raise ValueError("embedding not set.")
        return self.embedding

    def get_node_info(self) -> NodeInfo:
        """Get node info."""
        return NodeInfo(
            id=self.id,
            type=self.get_type(),
            metadata=self.metadata,
            hash=self.hash,
        )

`hash` `abstractmethod` `property` #

Get hash of node.

`linked_nodes` `property` #

Get linked nodes from the links dictionary.

This property validates and converts the links dictionary into a LinkedNodes object. The result is cached and automatically invalidated when the links field is reassigned through Pydantic's field validation.

Returns:

Name	Type	Description
`LinkedNodes`	`LinkedNodes`	A validated and cached LinkedNodes object.

Note

The cache is automatically cleared when links is reassigned.
For in-place mutations (e.g., node.links[key] = value), you must either reassign the entire dict or call _clear_linked_nodes_cache().
Cache invalidation is handled by a Pydantic @model_validator, which runs on assignment because validate_assignment is enabled.

Examples:

>>> from serapeum.core.base.embeddings.types import BaseNode, NodeInfo, NodeType, MetadataMode
>>> import hashlib
>>> from pydantic import Field
>>> class TextNode(BaseNode):
...     text: str = Field(default="")
...     @classmethod
...     def get_type(cls) -> str:
...         return "text"
...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
...         return self.text
...     def set_content(self, value: str) -> None:
...         self.text = value
...     @property
...     def hash(self) -> str:
...         return hashlib.sha256(self.text.encode()).hexdigest()
>>> node = TextNode(text="Sample")
>>> source_ref = NodeInfo(id="doc-123", type="document")
>>> node.links = {NodeType.SOURCE: source_ref}
>>> node.linked_nodes.source.id
'doc-123'

>>> node = TextNode(text="Sample", links={})
>>> prev_ref = NodeInfo(id="prev-chunk", type="text")
>>> node.links[NodeType.PREVIOUS] = prev_ref
>>> node._clear_linked_nodes_cache()
>>> node.linked_nodes.previous.id
'prev-chunk'

>>> node = TextNode(text="Sample")
>>> parent = NodeInfo(id="parent-1", type="document")
>>> child1 = NodeInfo(id="child-1", type="text")
>>> child2 = NodeInfo(id="child-2", type="text")
>>> node.links = {NodeType.PARENT: parent, NodeType.CHILD: [child1, child2]}
>>> node.linked_nodes.parent.id
'parent-1'
>>> [c.id for c in node.linked_nodes.children]
['child-1', 'child-2']
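
The reassign-versus-mutate distinction in the note above follows from tracking the identity of the links dict. Below is a minimal, stdlib-only sketch of that idea. `LinkCache` and its attribute names are hypothetical stand-ins, not serapeum classes, and the sketch compares `id(links)` on read, whereas the source does the comparison in a validator that runs on assignment.

```python
from typing import Any, Optional, Tuple


class LinkCache:
    """Hypothetical stand-in for the caching behaviour; not the serapeum class."""

    def __init__(self) -> None:
        self.links: dict = {}
        self._cache: Optional[Tuple[Any, ...]] = None
        self._links_id: Optional[int] = None
        self.rebuilds = 0  # counts how often the cached value is recomputed

    def _sync(self) -> None:
        # Drop the cache only when `links` is a different dict object.
        if self._links_id != id(self.links):
            self._cache = None
            self._links_id = id(self.links)

    def clear(self) -> None:
        # Manual invalidation, needed after in-place mutation.
        self._cache = None

    @property
    def linked(self) -> Tuple[Any, ...]:
        self._sync()
        if self._cache is None:
            self.rebuilds += 1
            self._cache = tuple(sorted(self.links))
        return self._cache


holder = LinkCache()
holder.links = {"source": "doc-1"}
first = holder.linked                # computed from the new dict

holder.links["parent"] = "doc-0"     # in-place mutation: same dict object
stale = holder.linked                # cache is not refreshed

holder.clear()
fresh = holder.linked                # recomputed after manual clearing

holder.links = {**holder.links, "next": "doc-2"}  # reassignment: new dict object
reassigned = holder.linked           # recomputed automatically
```

The in-place write leaves the dict's identity unchanged, so nothing signals that the cached value is out of date; only reassignment or an explicit clear does.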

`__str__()` #

Return the node ID and a truncated, wrapped content preview.

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

def __str__(self) -> str:
    """Return the node ID and a truncated, wrapped content preview."""
    source_text_truncated = truncate_text(
        self.get_content().strip(), TRUNCATE_LENGTH
    )
    source_text_wrapped = textwrap.fill(
        f"Text: {source_text_truncated}\n", width=WRAP_WIDTH
    )
    return f"Node ID: {self.id}\n{source_text_wrapped}"
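
To make the output shape concrete, here is a rough standalone sketch of the same truncate-then-wrap steps. The `truncate_text` helper and the values of `TRUNCATE_LENGTH` and `WRAP_WIDTH` are assumptions made for illustration; the listing above does not show their real definitions.

```python
import textwrap

TRUNCATE_LENGTH = 40  # assumed value, for illustration only
WRAP_WIDTH = 30       # assumed value, for illustration only


def truncate_text(text: str, max_length: int) -> str:
    """Hypothetical helper: cut long text and mark the cut with '...'."""
    if len(text) <= max_length:
        return text
    return text[: max_length - 3] + "..."


def describe(node_id: str, content: str) -> str:
    # Same sequence as the listing: strip, truncate, wrap, prefix with the ID.
    truncated = truncate_text(content.strip(), TRUNCATE_LENGTH)
    wrapped = textwrap.fill(f"Text: {truncated}\n", width=WRAP_WIDTH)
    return f"Node ID: {node_id}\n{wrapped}"


summary = describe("node-1", "  " + "word " * 20)
print(summary)
```

Note that `textwrap.fill` collapses the trailing newline in the formatted string, so the result ends with the last wrapped line rather than a blank one.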

`get_content(metadata_mode=MetadataMode.ALL)` `abstractmethod` #

Get object content.

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

@abstractmethod
def get_content(self, metadata_mode: MetadataMode = MetadataMode.ALL) -> str:
    """Get object content."""

`get_embedding()` #

Get embedding.

Raises:

Type	Description
`ValueError`	If the embedding is None.

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

def get_embedding(self) -> list[float]:
    """Get embedding.

    Raises:
        ValueError: if embedding is None.
    """
    if self.embedding is None:
        raise ValueError("embedding not set.")
    return self.embedding
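
Because `get_embedding()` raises instead of returning None, callers that handle a mix of embedded and not-yet-embedded nodes need to catch the error. The sketch below shows one way to do that. `NodeStub` and `collect_vectors` are hypothetical names; the stub copies only the accessor shown above so the example runs without serapeum installed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class NodeStub:
    """Hypothetical stand-in exposing only the embedding accessor."""

    embedding: list[float] | None = None

    def get_embedding(self) -> list[float]:
        if self.embedding is None:
            raise ValueError("embedding not set.")
        return self.embedding


def collect_vectors(nodes):
    """Return (vectors, missing_indexes) instead of failing on the first gap."""
    vectors, missing = [], []
    for index, node in enumerate(nodes):
        try:
            vectors.append(node.get_embedding())
        except ValueError:
            missing.append(index)
    return vectors, missing


nodes = [NodeStub([0.1, 0.2]), NodeStub(), NodeStub([0.3, 0.4])]
vectors, missing = collect_vectors(nodes)
```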

`get_metadata_str(mode=MetadataMode.ALL)` #

Format metadata as a string, omitting keys excluded for the mode.

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

def get_metadata_str(self, mode: MetadataMode = MetadataMode.ALL) -> str:
    """Format metadata as a string, omitting keys excluded for the mode."""
    if mode == MetadataMode.NONE:
        return ""

    excluded = set()
    if mode == MetadataMode.LLM:
        excluded = set(self.excluded_llm_metadata_keys)
    elif mode == MetadataMode.EMBED:
        excluded = set(self.excluded_embed_metadata_keys)

    filtered = (
        self.metadata.items()
        if not excluded
        else (
            (key, value)
            for key, value in self.metadata.items()
            if key not in excluded
        )
    )
    return self.metadata_separator.join(
        self.metadata_template.format(key=key, value=str(value))
        for key, value in filtered
    )
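
The effect of each mode is easiest to see on a small input. The following is a standalone sketch of the same filtering logic as a plain function. `Mode` is a stand-in for `MetadataMode` with assumed member values, and the default template is the "{key}: {value}" form mentioned in the class docstring.

```python
from enum import Enum


class Mode(Enum):
    """Stand-in for MetadataMode; the member values here are assumed."""

    ALL = "all"
    EMBED = "embed"
    LLM = "llm"
    NONE = "none"


def metadata_str(
    metadata,
    mode,
    *,
    excluded_llm=(),
    excluded_embed=(),
    template="{key}: {value}",
    separator="\n",
):
    if mode is Mode.NONE:
        return ""
    # Pick the exclusion list that matches the target context.
    excluded = set()
    if mode is Mode.LLM:
        excluded = set(excluded_llm)
    elif mode is Mode.EMBED:
        excluded = set(excluded_embed)
    return separator.join(
        template.format(key=key, value=str(value))
        for key, value in metadata.items()
        if key not in excluded
    )


meta = {"page": 1, "author": "Alice", "internal_id": "secret123"}
for_all = metadata_str(meta, Mode.ALL)
for_llm = metadata_str(meta, Mode.LLM, excluded_llm=["internal_id"])
for_embed = metadata_str(meta, Mode.EMBED, excluded_embed=["page"])
for_none = metadata_str(meta, Mode.NONE)
```

The two exclusion lists are independent, so a key hidden from the LLM can still contribute to the text that gets embedded, and vice versa.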

`get_node_info()` #

Get node info.

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

def get_node_info(self) -> NodeInfo:
    """Get node info."""
    return NodeInfo(
        id=self.id,
        type=self.get_type(),
        metadata=self.metadata,
        hash=self.hash,
    )

`get_type()` `abstractmethod` `classmethod` #

Get Object type.

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

@classmethod
@abstractmethod
def get_type(cls) -> str:
    """Get Object type."""

`set_content(value)` `abstractmethod` #

Set the content of the node.

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

@abstractmethod
def set_content(self, value: Any) -> None:
    """Set the content of the node."""

`CallMixin` #

Bases: ABC

Base class for node transformation components.

CallMixin defines the interface for components that transform sequences of nodes, such as embedders, parsers, or metadata enrichers. It provides both synchronous and asynchronous calling interfaces.

The mixin uses callable syntax (obj(nodes)) for synchronous transforms and obj.acall(nodes) for asynchronous transforms, enabling composable pipelines.

Attributes:

Name	Type	Description
`model_config`	`ConfigDict`	Pydantic configuration allowing arbitrary types in subclasses.

Examples:

>>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
>>> from typing import Sequence, Any
>>> import hashlib
>>> from pydantic import Field
>>> class TextNode(BaseNode):
...     text: str = Field(default="")
...     @classmethod
...     def get_type(cls) -> str:
...         return "text"
...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
...         return self.text
...     def set_content(self, value: str) -> None:
...         self.text = value
...     @property
...     def hash(self) -> str:
...         return hashlib.sha256(self.text.encode()).hexdigest()
>>> class UppercaseTransform(CallMixin):
...     def __call__(self, nodes: Sequence[BaseNode], **kwargs: Any) -> Sequence[BaseNode]:
...         result = []
...         for node in nodes:
...             node.set_content(node.get_content().upper())
...             result.append(node)
...         return result
>>> transformer = UppercaseTransform()
>>> nodes = [TextNode(text="hello"), TextNode(text="world")]
>>> transformed = transformer(nodes)
>>> transformed[0].get_content()
'HELLO'
>>> transformed[1].get_content()
'WORLD'
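
Since every transform accepts and returns a sequence of nodes, transforms can be chained by feeding each output into the next call. The sketch below illustrates that composition with plain callables. `NodeStub`, `Uppercase`, `AddIndex`, and `run_pipeline` are hypothetical names, and nothing here is an existing serapeum pipeline API.

```python
from dataclasses import dataclass, field


@dataclass
class NodeStub:
    """Hypothetical minimal node: text plus a metadata dict."""

    text: str
    metadata: dict = field(default_factory=dict)


class Uppercase:
    def __call__(self, nodes, **kwargs):
        for node in nodes:
            node.text = node.text.upper()
        return nodes


class AddIndex:
    def __call__(self, nodes, **kwargs):
        for position, node in enumerate(nodes):
            node.metadata["index"] = position
        return nodes


def run_pipeline(nodes, transforms, **kwargs):
    """Apply each transform in order, passing the same kwargs to all of them."""
    for transform in transforms:
        nodes = transform(nodes, **kwargs)
    return nodes


result = run_pipeline(
    [NodeStub("hello"), NodeStub("world")],
    [Uppercase(), AddIndex()],
)
```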

See Also

BaseEmbedding: Uses CallMixin to enable embedding nodes.
BaseNode: The node type that this mixin transforms.

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

class CallMixin(ABC):
    """Base class for node transformation components.

    CallMixin defines the interface for components that transform sequences of nodes,
    such as embedders, parsers, or metadata enrichers. It provides both synchronous
    and asynchronous calling interfaces.

    The mixin uses callable syntax (`obj(nodes)`) for synchronous transforms and
    `obj.acall(nodes)` for asynchronous transforms, enabling composable pipelines.

    Attributes:
        model_config: Pydantic configuration allowing arbitrary types in subclasses.

    Examples:
        >>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
        >>> from typing import Sequence, Any
        >>> import hashlib
        >>> from pydantic import Field
        >>> class TextNode(BaseNode):
        ...     text: str = Field(default="")
        ...     @classmethod
        ...     def get_type(cls) -> str:
        ...         return "text"
        ...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
        ...         return self.text
        ...     def set_content(self, value: str) -> None:
        ...         self.text = value
        ...     @property
        ...     def hash(self) -> str:
        ...         return hashlib.sha256(self.text.encode()).hexdigest()
        >>> class UppercaseTransform(CallMixin):
        ...     def __call__(self, nodes: Sequence[BaseNode], **kwargs: Any) -> Sequence[BaseNode]:
        ...         result = []
        ...         for node in nodes:
        ...             node.set_content(node.get_content().upper())
        ...             result.append(node)
        ...         return result
        >>> transformer = UppercaseTransform()
        >>> nodes = [TextNode(text="hello"), TextNode(text="world")]
        >>> transformed = transformer(nodes)
        >>> transformed[0].get_content()
        'HELLO'
        >>> transformed[1].get_content()
        'WORLD'

    See Also:
        BaseEmbedding: Uses CallMixin to enable embedding nodes.
        BaseNode: The node type that this mixin transforms.
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)

    @abstractmethod
    def __call__(self, nodes: Sequence[BaseNode], **kwargs: Any) -> Sequence[BaseNode]:
        """Transform a sequence of nodes synchronously.

        Subclasses must implement this method to define their transformation logic.
        This method is called when the object is invoked directly: `obj(nodes)`.

        Args:
            nodes: Sequence of BaseNode instances to transform.
            **kwargs: Additional keyword arguments specific to the transformation.

        Returns:
            Transformed sequence of BaseNode instances.

        Examples:
            >>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
            >>> import hashlib
            >>> from pydantic import Field
            >>> class TextNode(BaseNode):
            ...     text: str = Field(default="")
            ...     @classmethod
            ...     def get_type(cls) -> str:
            ...         return "text"
            ...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
            ...         return self.text
            ...     def set_content(self, value: str) -> None:
            ...         self.text = value
            ...     @property
            ...     def hash(self) -> str:
            ...         return hashlib.sha256(self.text.encode()).hexdigest()
            >>> class MetadataAdder(CallMixin):
            ...     def __call__(self, nodes, **kwargs):
            ...         result = []
            ...         for i, node in enumerate(nodes):
            ...             node.metadata["index"] = i
            ...             result.append(node)
            ...         return result
            >>> adder = MetadataAdder()
            >>> nodes = [TextNode(text="first"), TextNode(text="second")]
            >>> processed = adder(nodes)
            >>> processed[0].metadata["index"]
            0
            >>> processed[1].metadata["index"]
            1

        Note:
            Implementations should preserve node identity where possible and
            avoid mutating input nodes unless explicitly documented.
        """

    async def acall(
        self, nodes: Sequence[BaseNode], **kwargs: Any
    ) -> Sequence[BaseNode]:
        """Transform a sequence of nodes asynchronously.

        Default implementation delegates to synchronous `__call__`. Subclasses
        can override this for true async implementations (e.g., async API calls).

        Args:
            nodes: Sequence of BaseNode instances to transform.
            **kwargs: Additional keyword arguments specific to the transformation.

        Returns:
            Transformed sequence of BaseNode instances.

        Examples:
            >>> import asyncio
            >>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
            >>> import hashlib
            >>> from pydantic import Field
            >>> class TextNode(BaseNode):
            ...     text: str = Field(default="")
            ...     @classmethod
            ...     def get_type(cls) -> str:
            ...         return "text"
            ...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
            ...         return self.text
            ...     def set_content(self, value: str) -> None:
            ...         self.text = value
            ...     @property
            ...     def hash(self) -> str:
            ...         return hashlib.sha256(self.text.encode()).hexdigest()
            >>> class AsyncTransform(CallMixin):
            ...     def __call__(self, nodes, **kwargs):
            ...         return nodes
            ...     async def acall(self, nodes, **kwargs):
            ...         await asyncio.sleep(0)
            ...         for node in nodes:
            ...             node.metadata["async_processed"] = True
            ...         return nodes
            >>> transform = AsyncTransform()
            >>> nodes = [TextNode(text="test")]
            >>> result = asyncio.run(transform.acall(nodes))
            >>> result[0].metadata["async_processed"]
            True

            >>> class SyncOnlyTransform(CallMixin):
            ...     def __call__(self, nodes, **kwargs):
            ...         for node in nodes:
            ...             node.metadata["processed"] = True
            ...         return nodes
            >>> sync_transform = SyncOnlyTransform()
            >>> nodes = [TextNode(text="test")]
            >>> result = asyncio.run(sync_transform.acall(nodes))
            >>> result[0].metadata["processed"]
            True

        Note:
            If no true async implementation is needed, the default delegation
            to `__call__` is sufficient. Override only if the transformation
            benefits from async/await (e.g., I/O operations).
        """
        return self.__call__(nodes, **kwargs)

`__call__(nodes, **kwargs)` `abstractmethod` #

Transform a sequence of nodes synchronously.

Subclasses must implement this method to define their transformation logic. This method is called when the object is invoked directly: obj(nodes).

Parameters:

Name	Type	Description	Default
`nodes`	`Sequence[BaseNode]`	Sequence of BaseNode instances to transform.	required
`**kwargs`	`Any`	Additional keyword arguments specific to the transformation.	`{}`

Returns:

Type	Description
`Sequence[BaseNode]`	Transformed sequence of BaseNode instances.

Examples:

>>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
>>> import hashlib
>>> from pydantic import Field
>>> class TextNode(BaseNode):
...     text: str = Field(default="")
...     @classmethod
...     def get_type(cls) -> str:
...         return "text"
...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
...         return self.text
...     def set_content(self, value: str) -> None:
...         self.text = value
...     @property
...     def hash(self) -> str:
...         return hashlib.sha256(self.text.encode()).hexdigest()
>>> class MetadataAdder(CallMixin):
...     def __call__(self, nodes, **kwargs):
...         result = []
...         for i, node in enumerate(nodes):
...             node.metadata["index"] = i
...             result.append(node)
...         return result
>>> adder = MetadataAdder()
>>> nodes = [TextNode(text="first"), TextNode(text="second")]
>>> processed = adder(nodes)
>>> processed[0].metadata["index"]
0
>>> processed[1].metadata["index"]
1

Note

Implementations should preserve node identity where possible and avoid mutating input nodes unless explicitly documented.

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

@abstractmethod
def __call__(self, nodes: Sequence[BaseNode], **kwargs: Any) -> Sequence[BaseNode]:
    """Transform a sequence of nodes synchronously.

    Subclasses must implement this method to define their transformation logic.
    This method is called when the object is invoked directly: `obj(nodes)`.

    Args:
        nodes: Sequence of BaseNode instances to transform.
        **kwargs: Additional keyword arguments specific to the transformation.

    Returns:
        Transformed sequence of BaseNode instances.

    Examples:
        >>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
        >>> import hashlib
        >>> from pydantic import Field
        >>> class TextNode(BaseNode):
        ...     text: str = Field(default="")
        ...     @classmethod
        ...     def get_type(cls) -> str:
        ...         return "text"
        ...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
        ...         return self.text
        ...     def set_content(self, value: str) -> None:
        ...         self.text = value
        ...     @property
        ...     def hash(self) -> str:
        ...         return hashlib.sha256(self.text.encode()).hexdigest()
        >>> class MetadataAdder(CallMixin):
        ...     def __call__(self, nodes, **kwargs):
        ...         result = []
        ...         for i, node in enumerate(nodes):
        ...             node.metadata["index"] = i
        ...             result.append(node)
        ...         return result
        >>> adder = MetadataAdder()
        >>> nodes = [TextNode(text="first"), TextNode(text="second")]
        >>> processed = adder(nodes)
        >>> processed[0].metadata["index"]
        0
        >>> processed[1].metadata["index"]
        1

    Note:
        Implementations should preserve node identity where possible and
        avoid mutating input nodes unless explicitly documented.
    """

`acall(nodes, **kwargs)` `async` #

Transform a sequence of nodes asynchronously.

Default implementation delegates to synchronous __call__. Subclasses can override this for true async implementations (e.g., async API calls).

Parameters:

Name	Type	Description	Default
`nodes`	`Sequence[BaseNode]`	Sequence of BaseNode instances to transform.	required
`**kwargs`	`Any`	Additional keyword arguments specific to the transformation.	`{}`

Returns:

Type	Description
`Sequence[BaseNode]`	Transformed sequence of BaseNode instances.

Examples:

>>> import asyncio
>>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
>>> import hashlib
>>> from pydantic import Field
>>> class TextNode(BaseNode):
...     text: str = Field(default="")
...     @classmethod
...     def get_type(cls) -> str:
...         return "text"
...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
...         return self.text
...     def set_content(self, value: str) -> None:
...         self.text = value
...     @property
...     def hash(self) -> str:
...         return hashlib.sha256(self.text.encode()).hexdigest()
>>> class AsyncTransform(CallMixin):
...     def __call__(self, nodes, **kwargs):
...         return nodes
...     async def acall(self, nodes, **kwargs):
...         await asyncio.sleep(0)
...         for node in nodes:
...             node.metadata["async_processed"] = True
...         return nodes
>>> transform = AsyncTransform()
>>> nodes = [TextNode(text="test")]
>>> result = asyncio.run(transform.acall(nodes))
>>> result[0].metadata["async_processed"]
True

>>> class SyncOnlyTransform(CallMixin):
...     def __call__(self, nodes, **kwargs):
...         for node in nodes:
...             node.metadata["processed"] = True
...         return nodes
>>> sync_transform = SyncOnlyTransform()
>>> nodes = [TextNode(text="test")]
>>> result = asyncio.run(sync_transform.acall(nodes))
>>> result[0].metadata["processed"]
True

Note

If no true async implementation is needed, the default delegation to `__call__` is sufficient. Override only if the transformation benefits from async/await (e.g., I/O operations).
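
Because the default `acall` calls `__call__` inline, a slow synchronous transform occupies the event loop until it returns. One possible override hands the synchronous call to a worker thread with `asyncio.to_thread`. This is a sketch only: `CallMixinSketch` and `ThreadedTransform` are hypothetical stand-ins rather than the library classes, and plain dicts replace `BaseNode` instances.

```python
import asyncio
from typing import Any, Sequence


class CallMixinSketch:
    """Stand-in for CallMixin: acall delegates to __call__ by default."""

    def __call__(self, nodes: Sequence[Any], **kwargs: Any) -> Sequence[Any]:
        raise NotImplementedError

    async def acall(self, nodes: Sequence[Any], **kwargs: Any) -> Sequence[Any]:
        # Mirrors the documented default: run the sync path inline.
        return self.__call__(nodes, **kwargs)


class ThreadedTransform(CallMixinSketch):
    """Hypothetical transform that offloads its sync work to a thread."""

    def __call__(self, nodes: Sequence[Any], **kwargs: Any) -> Sequence[Any]:
        # Plain dicts stand in for BaseNode instances in this sketch.
        return [{**node, "processed": True} for node in nodes]

    async def acall(self, nodes: Sequence[Any], **kwargs: Any) -> Sequence[Any]:
        # asyncio.to_thread (Python 3.9+) runs the synchronous __call__
        # in a worker thread so the event loop stays responsive.
        return await asyncio.to_thread(self.__call__, nodes, **kwargs)


result = asyncio.run(ThreadedTransform().acall([{"text": "test"}]))
```

Offloading to a thread helps when the synchronous work blocks on I/O; a transform that already has an async client available would more naturally await that client directly.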

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

async def acall(
    self, nodes: Sequence[BaseNode], **kwargs: Any
) -> Sequence[BaseNode]:
    """Transform a sequence of nodes asynchronously.

    Default implementation delegates to synchronous `__call__`. Subclasses
    can override this for true async implementations (e.g., async API calls).

    Args:
        nodes: Sequence of BaseNode instances to transform.
        **kwargs: Additional keyword arguments specific to the transformation.

    Returns:
        Transformed sequence of BaseNode instances.

    Examples:
        >>> import asyncio
        >>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
        >>> import hashlib
        >>> from pydantic import Field
        >>> class TextNode(BaseNode):
        ...     text: str = Field(default="")
        ...     @classmethod
        ...     def get_type(cls) -> str:
        ...         return "text"
        ...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
        ...         return self.text
        ...     def set_content(self, value: str) -> None:
        ...         self.text = value
        ...     @property
        ...     def hash(self) -> str:
        ...         return hashlib.sha256(self.text.encode()).hexdigest()
        >>> class AsyncTransform(CallMixin):
        ...     def __call__(self, nodes, **kwargs):
        ...         return nodes
        ...     async def acall(self, nodes, **kwargs):
        ...         await asyncio.sleep(0)
        ...         for node in nodes:
        ...             node.metadata["async_processed"] = True
        ...         return nodes
        >>> transform = AsyncTransform()
        >>> nodes = [TextNode(text="test")]
        >>> result = asyncio.run(transform.acall(nodes))
        >>> result[0].metadata["async_processed"]
        True

        >>> class SyncOnlyTransform(CallMixin):
        ...     def __call__(self, nodes, **kwargs):
        ...         for node in nodes:
        ...             node.metadata["processed"] = True
        ...         return nodes
        >>> sync_transform = SyncOnlyTransform()
        >>> nodes = [TextNode(text="test")]
        >>> result = asyncio.run(sync_transform.acall(nodes))
        >>> result[0].metadata["processed"]
        True

    Note:
        If no true async implementation is needed, the default delegation
        to `__call__` is sufficient. Override only if the transformation
        benefits from async/await (e.g., I/O operations).
    """
    return self.__call__(nodes, **kwargs)

`LinkedNodes` #

Bases: SerializableModel

Immutable container for node relationships in a document hierarchy.

LinkedNodes manages references between nodes in a document structure, supporting linear sequences (previous/next), hierarchical relationships (parent/children), and source document tracking. The model is frozen to prevent accidental mutation of relationship structures.

Attributes:

Name	Type	Description
`source`	`NodeInfo \| None`	Reference to the original source document node.
`previous`	`NodeInfo \| None`	Reference to the previous node in a sequence.
`next`	`NodeInfo \| None`	Reference to the next node in a sequence.
`parent`	`NodeInfo \| None`	Reference to the parent node in a hierarchy.
`children`	`list[NodeInfo] \| None`	List of child node references in a hierarchy.

Examples:

Creating a linear sequence of nodes

>>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
>>> prev_node = NodeInfo(id="chunk-1")
>>> next_node = NodeInfo(id="chunk-3")
>>> links = LinkedNodes(previous=prev_node, next=next_node)
>>> links.previous.id
'chunk-1'

Building hierarchical relationships

>>> parent = NodeInfo(id="section-1")
>>> child1 = NodeInfo(id="para-1")
>>> child2 = NodeInfo(id="para-2")
>>> links = LinkedNodes(parent=parent, children=[child1, child2])
>>> [c.id for c in links.children]
['para-1', 'para-2']

Using factory method with NodeType enum

>>> from serapeum.core.base.embeddings.types import NodeType
>>> source = NodeInfo(id="doc-main")
>>> links_dict = {NodeType.SOURCE: source}
>>> links = LinkedNodes.create(links_dict)
>>> links.source.id
'doc-main'

Accessing source ID property

>>> source = NodeInfo(id="original-doc")
>>> links = LinkedNodes(source=source)
>>> links.source_id
'original-doc'

See Also

NodeType: Enum defining relationship types.
NodeInfo: References stored in relationship fields.
BaseNode.linked_nodes: Property that creates LinkedNodes from links dict.
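
The frozen behaviour described above can be illustrated with a standard-library analogue. This sketch uses a frozen `dataclass` rather than Pydantic, so the exception type differs from what the real model raises, and `NodeInfoSketch` and `LinkedNodesSketch` are hypothetical stand-ins. It shows the usual pattern of deriving a new instance instead of mutating the existing one.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class NodeInfoSketch:
    id: str


@dataclass(frozen=True)
class LinkedNodesSketch:
    previous: Optional[NodeInfoSketch] = None
    next: Optional[NodeInfoSketch] = None


links = LinkedNodesSketch(previous=NodeInfoSketch(id="chunk-1"))

# Assigning to a field of a frozen instance is rejected.
try:
    links.next = NodeInfoSketch(id="chunk-3")
    mutated = True
except FrozenInstanceError:
    mutated = False

# Build an updated copy instead; the original stays unchanged.
updated = replace(links, next=NodeInfoSketch(id="chunk-3"))
```

On a Pydantic v2 model, `model_copy(update={...})` is the analogous way to derive a modified copy.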

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

class LinkedNodes(SerializableModel):
    """Immutable container for node relationships in a document hierarchy.

    LinkedNodes manages references between nodes in a document structure, supporting
    linear sequences (previous/next), hierarchical relationships (parent/children),
    and source document tracking. The model is frozen to prevent accidental mutation
    of relationship structures.

    Attributes:
        source: Reference to the original source document node.
        previous: Reference to the previous node in a sequence.
        next: Reference to the next node in a sequence.
        parent: Reference to the parent node in a hierarchy.
        children: List of child node references in a hierarchy.

    Examples:
        - Creating a linear sequence of nodes
            ```python
            >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
            >>> prev_node = NodeInfo(id="chunk-1")
            >>> next_node = NodeInfo(id="chunk-3")
            >>> links = LinkedNodes(previous=prev_node, next=next_node)
            >>> links.previous.id
            'chunk-1'

            ```
        - Building hierarchical relationships
            ```python
            >>> parent = NodeInfo(id="section-1")
            >>> child1 = NodeInfo(id="para-1")
            >>> child2 = NodeInfo(id="para-2")
            >>> links = LinkedNodes(parent=parent, children=[child1, child2])
            >>> [c.id for c in links.children]
            ['para-1', 'para-2']

            ```
        - Using factory method with NodeType enum
            ```python
            >>> from serapeum.core.base.embeddings.types import NodeType
            >>> source = NodeInfo(id="doc-main")
            >>> links_dict = {NodeType.SOURCE: source}
            >>> links = LinkedNodes.create(links_dict)
            >>> links.source.id
            'doc-main'

            ```
        - Accessing source ID property
            ```python
            >>> source = NodeInfo(id="original-doc")
            >>> links = LinkedNodes(source=source)
            >>> links.source_id
            'original-doc'

            ```

    See Also:
        NodeType: Enum defining relationship types.
        NodeInfo: References stored in relationship fields.
        BaseNode.linked_nodes: Property that creates LinkedNodes from links dict.
    """

    model_config = ConfigDict(frozen=True)

    source: NodeInfo | None = None
    previous: NodeInfo | None = None
    next: NodeInfo | None = None
    parent: NodeInfo | None = None
    children: list[NodeInfo] | None = None

    @field_validator("source", "previous", "next", "parent")
    @classmethod
    def validate_single_node(cls, v: Any) -> NodeInfo | None:
        """Validate that single-node fields contain NodeInfo objects.

        Ensures that source, previous, next, and parent fields contain exactly
        one NodeInfo instance (not a list). Called automatically by Pydantic
        during model instantiation and validation.

        Args:
            v: Value to validate, expected to be NodeInfo or None.

        Returns:
            The validated NodeInfo instance or None.

        Raises:
            ValueError: If v is not None and not a NodeInfo instance.

        Examples:
            - Valid single node assignment
                ```python
                >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
                >>> node = NodeInfo(id="valid")
                >>> links = LinkedNodes(source=node)
                >>> links.source.id
                'valid'

                ```
            - Invalid list assignment to single-node field raises ValidationError
                ```python
                >>> from pydantic import ValidationError
                >>> try:
                ...     LinkedNodes(source=[NodeInfo(id="bad")])
                ... except ValidationError as e:
                ...     'source' in str(e)
                True

                ```

        Note:
            This validator applies to: source, previous, next, parent fields.
            The children field has a separate validator for list validation.
        """
        if v is not None and not isinstance(v, NodeInfo):
            raise ValueError("Must be a NodeInfo object, not a list")
        return v

    @field_validator("children")
    @classmethod
    def validate_children_list(cls, v: Any) -> list[NodeInfo] | None:
        """Validate that children field contains a list of NodeInfo objects.

        Ensures the children field is a list (not a single NodeInfo instance).
        Called automatically by Pydantic during model instantiation and validation.

        Args:
            v: Value to validate, expected to be list[NodeInfo] or None.

        Returns:
            The validated list of NodeInfo instances or None.

        Raises:
            ValueError: If v is not None and not a list.

        Examples:
            - Valid children list
                ```python
                >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
                >>> child1 = NodeInfo(id="child-1")
                >>> child2 = NodeInfo(id="child-2")
                >>> links = LinkedNodes(children=[child1, child2])
                >>> [c.id for c in links.children]
                ['child-1', 'child-2']

                ```
            - Invalid single NodeInfo for children raises ValidationError
                ```python
                >>> from pydantic import ValidationError
                >>> try:
                ...     LinkedNodes(children=NodeInfo(id="bad"))
                ... except ValidationError as e:
                ...     'children' in str(e)
                True

                ```
            - Empty children list is valid
                ```python
                >>> links = LinkedNodes(children=[])
                >>> links.children
                []

                ```

        Note:
            This validator is specific to the children field, which represents
            one-to-many relationships.
        """
        if v is not None and not isinstance(v, list):
            raise ValueError("Children must be a list of NodeInfo objects")
        return v

    @classmethod
    def create(cls, linked_nodes_info: dict[NodeType, NodeInfoType]) -> "LinkedNodes":
        """Create LinkedNodes from a dict mapping NodeType to NodeInfo/list.

        Factory method that converts a dictionary with NodeType keys into a
        validated LinkedNodes instance. Pydantic validators automatically check
        that single-node fields contain NodeInfo and children contains a list.

        Args:
            linked_nodes_info: Dictionary mapping NodeType enum values to either
                NodeInfo (for single relationships) or list[NodeInfo] (for
                children). Missing keys are treated as None.

        Returns:
            A new LinkedNodes instance with validated relationships.

        Raises:
            ValueError: If a single-node field (SOURCE, PREVIOUS, NEXT, PARENT)
                receives a list, or if children receives a non-list value.

        Examples:
            - Creating from a dict with mixed relationships
                ```python
                >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
                >>> source = NodeInfo(id="doc-1")
                >>> parent = NodeInfo(id="section-1")
                >>> children = [NodeInfo(id="para-1"), NodeInfo(id="para-2")]
                >>> links_dict = {
                ...     NodeType.SOURCE: source,
                ...     NodeType.PARENT: parent,
                ...     NodeType.CHILD: children
                ... }
                >>> links = LinkedNodes.create(links_dict)
                >>> links.source.id
                'doc-1'

                ```
            - Creating with only some relationships
                ```python
                >>> prev = NodeInfo(id="chunk-1")
                >>> next_node = NodeInfo(id="chunk-3")
                >>> links = LinkedNodes.create({
                ...     NodeType.PREVIOUS: prev,
                ...     NodeType.NEXT: next_node
                ... })
                >>> links.previous.id
                'chunk-1'

                ```
            - Empty dict creates all-None instance
                ```python
                >>> links = LinkedNodes.create({})
                >>> links.source is None
                True

                ```

        See Also:
            LinkedNodes.as_dict: Inverse operation converting LinkedNodes to dict.
            NodeType: Enum defining valid relationship types.
        """
        return cls(
            source=linked_nodes_info.get(NodeType.SOURCE),
            previous=linked_nodes_info.get(NodeType.PREVIOUS),
            next=linked_nodes_info.get(NodeType.NEXT),
            parent=linked_nodes_info.get(NodeType.PARENT),
            children=linked_nodes_info.get(NodeType.CHILD),
        )

    def as_dict(self) -> dict[NodeType, NodeInfoType | None]:
        """Convert LinkedNodes to a dictionary mapping NodeType to NodeInfo.

        Creates a dictionary representation with NodeType enum keys and NodeInfo
        values. None values are excluded from the result to create a compact
        representation containing only active relationships.

        Returns:
            Dictionary with NodeType keys and NodeInfo/list[NodeInfo] values.
            Only non-None relationships are included.

        Examples:
            - Converting to dict with multiple relationships
                ```python
                >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
                >>> source = NodeInfo(id="doc-1")
                >>> parent = NodeInfo(id="section-1")
                >>> links = LinkedNodes(source=source, parent=parent)
                >>> result = links.as_dict()
                >>> result[NodeType.SOURCE].id
                'doc-1'

                ```
            - None values are excluded
                ```python
                >>> links = LinkedNodes(source=NodeInfo(id="doc-1"))
                >>> result = links.as_dict()
                >>> NodeType.PREVIOUS in result
                False

                ```
            - Round-trip with create method
                ```python
                >>> original = LinkedNodes(
                ...     source=NodeInfo(id="src"),
                ...     children=[NodeInfo(id="child-1")]
                ... )
                >>> as_dict = original.as_dict()
                >>> restored = LinkedNodes.create(as_dict)
                >>> restored.source.id
                'src'

                ```

        See Also:
            LinkedNodes.create: Factory method for creating from dict.
            BaseNode.links: Uses this format for storing relationships.
        """
        linked_nodes = {
            NodeType.SOURCE: self.source,
            NodeType.PREVIOUS: self.previous,
            NodeType.NEXT: self.next,
            NodeType.PARENT: self.parent,
            NodeType.CHILD: self.children,
        }

        linked_nodes = {
            key: value for key, value in linked_nodes.items() if value is not None
        }
        return linked_nodes

    @property
    def source_id(self) -> str | None:
        """Get the ID of the source node if it exists.

        Convenience property for accessing the source node's ID without
        checking if source is None first.

        Returns:
            The source node's ID string, or None if no source is set.

        Examples:
            - Accessing source ID when source exists
                ```python
                >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
                >>> source = NodeInfo(id="document-123")
                >>> links = LinkedNodes(source=source)
                >>> links.source_id
                'document-123'

                ```
            - Accessing when source is None
                ```python
                >>> links = LinkedNodes()
                >>> links.source_id is None
                True

                ```

        See Also:
            BaseNode.source_id: Uses this property for node source tracking.
        """
        source_id = None
        if self.source is not None:
            source_id = self.source.id
        return source_id

`source_id` `property` #

Get the ID of the source node if it exists.

Convenience property for accessing the source node's ID without checking if source is None first.

Returns:

Type	Description
`str \| None`	The source node's ID string, or None if no source is set.

Examples:

Accessing source ID when source exists

>>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
>>> source = NodeInfo(id="document-123")
>>> links = LinkedNodes(source=source)
>>> links.source_id
'document-123'

Accessing when source is None

>>> links = LinkedNodes()
>>> links.source_id is None
True

See Also

BaseNode.source_id: Uses this property for node source tracking.

`as_dict()` #

Convert LinkedNodes to a dictionary mapping NodeType to NodeInfo.

Creates a dictionary representation with NodeType enum keys and NodeInfo values. None values are excluded from the result to create a compact representation containing only active relationships.

Returns:

Type	Description
`dict[NodeType, NodeInfoType \| None]`	Dictionary with NodeType keys and NodeInfo/list[NodeInfo] values. Only non-None relationships are included.

Examples:

Converting to dict with multiple relationships

>>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
>>> source = NodeInfo(id="doc-1")
>>> parent = NodeInfo(id="section-1")
>>> links = LinkedNodes(source=source, parent=parent)
>>> result = links.as_dict()
>>> result[NodeType.SOURCE].id
'doc-1'

None values are excluded

>>> links = LinkedNodes(source=NodeInfo(id="doc-1"))
>>> result = links.as_dict()
>>> NodeType.PREVIOUS in result
False

Round-trip with create method

>>> original = LinkedNodes(
...     source=NodeInfo(id="src"),
...     children=[NodeInfo(id="child-1")]
... )
>>> as_dict = original.as_dict()
>>> restored = LinkedNodes.create(as_dict)
>>> restored.source.id
'src'

See Also

LinkedNodes.create: Factory method for creating from dict.
BaseNode.links: Uses this format for storing relationships.
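
The filtering step in `as_dict` tests `value is not None` rather than truthiness, so an empty `children` list is kept in the result. A minimal sketch of that comprehension follows, using a hypothetical `NodeTypeSketch` enum and plain values in place of `NodeInfo` objects. The member names follow the documented `NodeType`; the enum values are assumptions.

```python
from enum import Enum


class NodeTypeSketch(Enum):
    # Member names follow the documented NodeType; values are assumed.
    SOURCE = "source"
    PREVIOUS = "previous"
    NEXT = "next"
    PARENT = "parent"
    CHILD = "child"


fields = {
    NodeTypeSketch.SOURCE: "doc-1",
    NodeTypeSketch.PREVIOUS: None,
    NodeTypeSketch.NEXT: None,
    NodeTypeSketch.PARENT: None,
    NodeTypeSketch.CHILD: [],  # empty, but not None
}

# Same shape as the comprehension in as_dict: drop only None values.
compact = {key: value for key, value in fields.items() if value is not None}
```

Callers that treat "no children" and "children not set" differently can rely on this distinction: the `CHILD` key is present with an empty list in the first case and absent in the second.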

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

def as_dict(self) -> dict[NodeType, NodeInfoType | None]:
    """Convert LinkedNodes to a dictionary mapping NodeType to NodeInfo.

    Creates a dictionary representation with NodeType enum keys and NodeInfo
    values. None values are excluded from the result to create a compact
    representation containing only active relationships.

    Returns:
        Dictionary with NodeType keys and NodeInfo/list[NodeInfo] values.
        Only non-None relationships are included.

    Examples:
        - Converting to dict with multiple relationships
            ```python
            >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
            >>> source = NodeInfo(id="doc-1")
            >>> parent = NodeInfo(id="section-1")
            >>> links = LinkedNodes(source=source, parent=parent)
            >>> result = links.as_dict()
            >>> result[NodeType.SOURCE].id
            'doc-1'

            ```
        - None values are excluded
            ```python
            >>> links = LinkedNodes(source=NodeInfo(id="doc-1"))
            >>> result = links.as_dict()
            >>> NodeType.PREVIOUS in result
            False

            ```
        - Round-trip with create method
            ```python
            >>> original = LinkedNodes(
            ...     source=NodeInfo(id="src"),
            ...     children=[NodeInfo(id="child-1")]
            ... )
            >>> as_dict = original.as_dict()
            >>> restored = LinkedNodes.create(as_dict)
            >>> restored.source.id
            'src'

            ```

    See Also:
        LinkedNodes.create: Factory method for creating from dict.
        BaseNode.links: Uses this format for storing relationships.
    """
    linked_nodes = {
        NodeType.SOURCE: self.source,
        NodeType.PREVIOUS: self.previous,
        NodeType.NEXT: self.next,
        NodeType.PARENT: self.parent,
        NodeType.CHILD: self.children,
    }

    linked_nodes = {
        key: value for key, value in linked_nodes.items() if value is not None
    }
    return linked_nodes

`create(linked_nodes_info)` `classmethod` #

Create LinkedNodes from a dict mapping NodeType to NodeInfo/list.

Factory method that converts a dictionary with NodeType keys into a validated LinkedNodes instance. Pydantic validators automatically check that single-node fields contain NodeInfo and children contains a list.

Parameters:

Name	Type	Description	Default
`linked_nodes_info`	`dict[NodeType, NodeInfoType]`	Dictionary mapping NodeType enum values to either NodeInfo (for single relationships) or list[NodeInfo] (for children). Missing keys are treated as None.	required

Returns:

Type	Description
`LinkedNodes`	A new LinkedNodes instance with validated relationships.

Raises:

Type	Description
`ValueError`	If a single-node field (SOURCE, PREVIOUS, NEXT, PARENT) receives a list, or if children receives a non-list value. During model construction Pydantic surfaces this as a `ValidationError`.

Examples:

Creating from a dict with mixed relationships

>>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
>>> source = NodeInfo(id="doc-1")
>>> parent = NodeInfo(id="section-1")
>>> children = [NodeInfo(id="para-1"), NodeInfo(id="para-2")]
>>> links_dict = {
...     NodeType.SOURCE: source,
...     NodeType.PARENT: parent,
...     NodeType.CHILD: children
... }
>>> links = LinkedNodes.create(links_dict)
>>> links.source.id
'doc-1'

Creating with only some relationships

>>> prev = NodeInfo(id="chunk-1")
>>> next_node = NodeInfo(id="chunk-3")
>>> links = LinkedNodes.create({
...     NodeType.PREVIOUS: prev,
...     NodeType.NEXT: next_node
... })
>>> links.previous.id
'chunk-1'

Empty dict creates all-None instance

>>> links = LinkedNodes.create({})
>>> links.source is None
True

See Also

LinkedNodes.as_dict: Inverse operation converting LinkedNodes to dict.
NodeType: Enum defining valid relationship types.
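
`create` reads each relationship with `dict.get`, which is why missing keys become `None`. It also means that any key other than the five it looks up is ignored rather than rejected. The sketch below shows that lookup pattern with hypothetical stand-ins (`Kind`, `Links`, and the `OTHER` member are invented for illustration) and strings in place of `NodeInfo`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    SOURCE = "source"
    PARENT = "parent"
    OTHER = "other"  # hypothetical key that the factory never reads


@dataclass(frozen=True)
class Links:
    source: Optional[str] = None
    parent: Optional[str] = None

    @classmethod
    def create(cls, info: dict) -> "Links":
        # dict.get returns None for absent keys, so no KeyError is raised.
        return cls(source=info.get(Kind.SOURCE), parent=info.get(Kind.PARENT))


partial = Links.create({Kind.SOURCE: "doc-1", Kind.OTHER: "ignored"})
empty = Links.create({})
```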

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

@classmethod
def create(cls, linked_nodes_info: dict[NodeType, NodeInfoType]) -> "LinkedNodes":
    """Create LinkedNodes from a dict mapping NodeType to NodeInfo/list.

    Factory method that converts a dictionary with NodeType keys into a
    validated LinkedNodes instance. Pydantic validators automatically check
    that single-node fields contain NodeInfo and children contains a list.

    Args:
        linked_nodes_info: Dictionary mapping NodeType enum values to either
            NodeInfo (for single relationships) or list[NodeInfo] (for
            children). Missing keys are treated as None.

    Returns:
        A new LinkedNodes instance with validated relationships.

    Raises:
        ValueError: If a single-node field (SOURCE, PREVIOUS, NEXT, PARENT)
            receives a list, or if children receives a non-list value.

    Examples:
        - Creating from a dict with mixed relationships
            ```python
            >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
            >>> source = NodeInfo(id="doc-1")
            >>> parent = NodeInfo(id="section-1")
            >>> children = [NodeInfo(id="para-1"), NodeInfo(id="para-2")]
            >>> links_dict = {
            ...     NodeType.SOURCE: source,
            ...     NodeType.PARENT: parent,
            ...     NodeType.CHILD: children
            ... }
            >>> links = LinkedNodes.create(links_dict)
            >>> links.source.id
            'doc-1'

            ```
        - Creating with only some relationships
            ```python
            >>> prev = NodeInfo(id="chunk-1")
            >>> next_node = NodeInfo(id="chunk-3")
            >>> links = LinkedNodes.create({
            ...     NodeType.PREVIOUS: prev,
            ...     NodeType.NEXT: next_node
            ... })
            >>> links.previous.id
            'chunk-1'

            ```
        - Empty dict creates all-None instance
            ```python
            >>> links = LinkedNodes.create({})
            >>> links.source is None
            True

            ```

    See Also:
        LinkedNodes.as_dict: Inverse operation converting LinkedNodes to dict.
        NodeType: Enum defining valid relationship types.
    """
    return cls(
        source=linked_nodes_info.get(NodeType.SOURCE),
        previous=linked_nodes_info.get(NodeType.PREVIOUS),
        next=linked_nodes_info.get(NodeType.NEXT),
        parent=linked_nodes_info.get(NodeType.PARENT),
        children=linked_nodes_info.get(NodeType.CHILD),
    )

`validate_children_list(v)` `classmethod` #

Validate that children field contains a list of NodeInfo objects.

Ensures the children field is a list (not a single NodeInfo instance). Called automatically by Pydantic during model instantiation and validation.

Parameters:

Name	Type	Description	Default
`v`	`Any`	Value to validate, expected to be list[NodeInfo] or None.	required

Returns:

Type	Description
`list[NodeInfo] \| None`	The validated list of NodeInfo instances or None.

Raises:

Type	Description
`ValueError`	If v is not None and not a list. During model construction Pydantic surfaces this as a `ValidationError`.

Examples:

Valid children list

>>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
>>> child1 = NodeInfo(id="child-1")
>>> child2 = NodeInfo(id="child-2")
>>> links = LinkedNodes(children=[child1, child2])
>>> [c.id for c in links.children]
['child-1', 'child-2']

Invalid single NodeInfo for children raises ValidationError

>>> from pydantic import ValidationError
>>> try:
...     LinkedNodes(children=NodeInfo(id="bad"))
... except ValidationError as e:
...     'children' in str(e)
True

Empty children list is valid

>>> links = LinkedNodes(children=[])
>>> links.children
[]

Note

This validator is specific to the children field, which represents one-to-many relationships.
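
The check performed by this validator concerns only the container type. A plain-function sketch of the same `isinstance` test is shown below; the name `check_children` is hypothetical, and strings stand in for `NodeInfo` objects. In the real model, element types are left to the `list[NodeInfo]` field annotation.

```python
from typing import Any, Optional


def check_children(v: Any) -> Optional[list]:
    """Mirror of the validator body: accept None or any list."""
    if v is not None and not isinstance(v, list):
        raise ValueError("Children must be a list of NodeInfo objects")
    return v


accepted = [check_children(None), check_children([]), check_children(["a", "b"])]

# A single item that is not wrapped in a list is rejected.
try:
    check_children("a")
    rejected = False
except ValueError:
    rejected = True
```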

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

@field_validator("children")
@classmethod
def validate_children_list(cls, v: Any) -> list[NodeInfo] | None:
    """Validate that children field contains a list of NodeInfo objects.

    Ensures the children field is a list (not a single NodeInfo instance).
    Called automatically by Pydantic during model instantiation and validation.

    Args:
        v: Value to validate, expected to be list[NodeInfo] or None.

    Returns:
        The validated list of NodeInfo instances or None.

    Raises:
        ValueError: If v is not None and not a list.

    Examples:
        - Valid children list
            ```python
            >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
            >>> child1 = NodeInfo(id="child-1")
            >>> child2 = NodeInfo(id="child-2")
            >>> links = LinkedNodes(children=[child1, child2])
            >>> [c.id for c in links.children]
            ['child-1', 'child-2']

            ```
        - Invalid single NodeInfo for children raises ValidationError
            ```python
            >>> from pydantic import ValidationError
            >>> try:
            ...     LinkedNodes(children=NodeInfo(id="bad"))
            ... except ValidationError as e:
            ...     'children' in str(e)
            True

            ```
        - Empty children list is valid
            ```python
            >>> links = LinkedNodes(children=[])
            >>> links.children
            []

            ```

    Note:
        This validator is specific to the children field, which represents
        one-to-many relationships.
    """
    if v is not None and not isinstance(v, list):
        raise ValueError("Children must be a list of NodeInfo objects")
    return v

`validate_single_node(v)` `classmethod` #

Validate that single-node fields contain NodeInfo objects.

Ensures that source, previous, next, and parent fields contain exactly one NodeInfo instance (not a list). Called automatically by Pydantic during model instantiation and validation.

Parameters:

Name	Type	Description	Default
`v`	`Any`	Value to validate, expected to be NodeInfo or None.	required

Returns:

Type	Description
`NodeInfo \| None`	The validated NodeInfo instance or None.

Raises:

Type	Description
`ValueError`	If v is not None and not a NodeInfo instance. During model construction Pydantic surfaces this as a `ValidationError`.

Examples:

Valid single node assignment

>>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
>>> node = NodeInfo(id="valid")
>>> links = LinkedNodes(source=node)
>>> links.source.id
'valid'

Invalid list assignment to single-node field raises ValidationError

>>> from pydantic import ValidationError
>>> try:
...     LinkedNodes(source=[NodeInfo(id="bad")])
... except ValidationError as e:
...     'source' in str(e)
True

Note

This validator applies to: source, previous, next, parent fields. The children field has a separate validator for list validation.

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

@field_validator("source", "previous", "next", "parent")
@classmethod
def validate_single_node(cls, v: Any) -> NodeInfo | None:
    """Validate that single-node fields contain NodeInfo objects.

    Ensures that source, previous, next, and parent fields contain exactly
    one NodeInfo instance (not a list). Called automatically by Pydantic
    during model instantiation and validation.

    Args:
        v: Value to validate, expected to be NodeInfo or None.

    Returns:
        The validated NodeInfo instance or None.

    Raises:
        ValueError: If v is not None and not a NodeInfo instance.

    Examples:
        - Valid single node assignment
            ```python
            >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
            >>> node = NodeInfo(id="valid")
            >>> links = LinkedNodes(source=node)
            >>> links.source.id
            'valid'

            ```
        - Invalid list assignment to single-node field raises ValidationError
            ```python
            >>> from pydantic import ValidationError
            >>> try:
            ...     LinkedNodes(source=[NodeInfo(id="bad")])
            ... except ValidationError as e:
            ...     'source' in str(e)
            True

            ```

    Note:
        This validator applies to: source, previous, next, parent fields.
        The children field has a separate validator for list validation.
    """
    if v is not None and not isinstance(v, NodeInfo):
        raise ValueError("Must be a NodeInfo object, not a list")
    return v

`MetadataMode` #

Bases: str, Enum

Enumeration for controlling which metadata is included in different contexts.

Different use cases require different metadata visibility. For example, you might exclude certain metadata from embeddings (to avoid semantic pollution) while including it for LLM context (to provide additional information).

Attributes:

Name	Type	Description
`ALL`		Include all metadata fields.
`EMBED`		Include only metadata for embedding generation (excludes fields in excluded_embed_metadata_keys).
`LLM`		Include only metadata for LLM context (excludes fields in excluded_llm_metadata_keys).
`NONE`		Exclude all metadata.

Examples:

Filtering metadata for embeddings

>>> from serapeum.core.base.embeddings.types import MetadataMode
>>> mode = MetadataMode.EMBED
>>> mode.value
'embed'

Comparing modes with plain strings

>>> MetadataMode.LLM == "llm"
True
>>> MetadataMode.NONE == "none"
True

All mode values are plain strings

>>> MetadataMode.ALL.value
'all'

See Also

BaseNode.get_content: Uses this mode to control metadata inclusion. BaseNode.get_metadata_str: Filters metadata based on this mode. BaseNode.excluded_embed_metadata_keys: Metadata excluded for EMBED mode. BaseNode.excluded_llm_metadata_keys: Metadata excluded for LLM mode.
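
How a mode might select metadata can be sketched as follows. This is an assumption-laden illustration: `Mode` mirrors the four values listed above, while `filter_metadata` and its parameters are hypothetical and do not reproduce BaseNode.get_metadata_str.

```python
from enum import Enum
from typing import Any, Iterable


class Mode(str, Enum):
    """Stand-in mirroring the four MetadataMode values."""

    ALL = "all"
    EMBED = "embed"
    LLM = "llm"
    NONE = "none"


def filter_metadata(
    metadata: dict[str, Any],
    mode: Mode,
    excluded_embed_keys: Iterable[str] = (),
    excluded_llm_keys: Iterable[str] = (),
) -> dict[str, Any]:
    """Return the metadata visible under the given mode (hypothetical helper)."""
    if mode == Mode.NONE:
        return {}
    if mode == Mode.EMBED:
        excluded = set(excluded_embed_keys)
    elif mode == Mode.LLM:
        excluded = set(excluded_llm_keys)
    else:  # Mode.ALL keeps every key
        excluded = set()
    return {k: v for k, v in metadata.items() if k not in excluded}


metadata = {"title": "Report", "file_path": "/tmp/report.txt", "page": 3}

# The file path is hidden from the embedding text but kept for the LLM.
embed_view = filter_metadata(metadata, Mode.EMBED, excluded_embed_keys=["file_path"])
llm_view = filter_metadata(metadata, Mode.LLM, excluded_llm_keys=["page"])
none_view = filter_metadata(metadata, Mode.NONE)
```

Keeping separate exclusion lists per mode lets the same node present one view to the embedding model and another to the LLM.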

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

class MetadataMode(str, Enum):
    """Enumeration for controlling which metadata is included in different contexts.

    Different use cases require different metadata visibility. For example, you
    might exclude certain metadata from embeddings (to avoid semantic pollution)
    while including it for LLM context (to provide additional information).

    Attributes:
        ALL: Include all metadata fields.
        EMBED: Include only metadata for embedding generation (excludes fields
            in excluded_embed_metadata_keys).
        LLM: Include only metadata for LLM context (excludes fields in
            excluded_llm_metadata_keys).
        NONE: Exclude all metadata.

    Examples:
        - Filtering metadata for embeddings
            ```python
            >>> from serapeum.core.base.embeddings.types import MetadataMode
            >>> mode = MetadataMode.EMBED
            >>> mode.value
            'embed'

            ```
        - Comparing modes with plain strings
            ```python
            >>> MetadataMode.LLM == "llm"
            True
            >>> MetadataMode.NONE == "none"
            True

            ```
        - All mode values are plain strings
            ```python
            >>> MetadataMode.ALL.value
            'all'

            ```

    See Also:
        BaseNode.get_content: Uses this mode to control metadata inclusion.
        BaseNode.get_metadata_str: Filters metadata based on this mode.
        BaseNode.excluded_embed_metadata_keys: Metadata excluded for EMBED mode.
        BaseNode.excluded_llm_metadata_keys: Metadata excluded for LLM mode.
    """

    ALL = "all"
    EMBED = "embed"
    LLM = "llm"
    NONE = "none"

`MockEmbedding` #

Bases: BaseEmbedding

Mock embedding model for testing purposes.

Returns constant embedding vectors (all 0.5 values) for any input, allowing tests to run without requiring a real embedding model. This is useful for unit testing, integration testing, and development without the overhead of loading actual models or making API calls.

All embeddings returned are deterministic vectors of the specified dimension, filled with 0.5 values. This makes tests reproducible and fast.

Attributes:

Name	Type	Description
`embed_dim`	`int`	Embedding dimension (must be positive).
`model_name`	`str`	Model name identifier (defaults to "mock-embedding").

Examples:

Creating a mock embedding model

>>> from serapeum.core.embeddings import MockEmbedding
>>> emb = MockEmbedding(embed_dim=3)
>>> emb.model_name
'mock-embedding'
>>> emb.embed_dim
3

Getting embeddings returns constant vectors

>>> emb = MockEmbedding(embed_dim=4)
>>> result = emb.get_text_embedding("any text")
>>> result
[0.5, 0.5, 0.5, 0.5]

All inputs produce identical constant vectors

>>> emb = MockEmbedding(embed_dim=2)
>>> emb.get_text_embedding("hello")
[0.5, 0.5]

Validation of embed_dim

>>> MockEmbedding(embed_dim=0)  # doctest: +ELLIPSIS
Traceback (most recent call last):
    ...
pydantic_core._pydantic_core.ValidationError: 1 validation error...

See Also

BaseEmbedding: Abstract base class that MockEmbedding implements.
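
One consequence of constant vectors is worth keeping in mind when writing tests: every pair of embeddings is identical, so similarity scores cannot distinguish inputs. The sketch below shows this with a standard-library stand-in; `ConstantEmbedding` and `cosine_similarity` are hypothetical names, not part of serapeum.

```python
import math


class ConstantEmbedding:
    """Hypothetical stand-in that mimics MockEmbedding's constant output."""

    def __init__(self, embed_dim: int) -> None:
        if embed_dim <= 0:
            raise ValueError(f"embed_dim must be positive, got {embed_dim}")
        self.embed_dim = embed_dim

    def get_text_embedding(self, text: str) -> list[float]:
        # The input text is ignored on purpose.
        return [0.5] * self.embed_dim


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


emb = ConstantEmbedding(embed_dim=4)
vec_cats = emb.get_text_embedding("cats")
vec_tax = emb.get_text_embedding("tax law")

# Unrelated texts still get the same vector, so their similarity is maximal.
similarity = cosine_similarity(vec_cats, vec_tax)
```

A mock of this kind suits tests of plumbing (shapes, batching, caching), while tests of retrieval ranking need vectors that differ between inputs.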

Source code in libs/core/src/serapeum/core/embeddings/types.py

class MockEmbedding(BaseEmbedding):
    """Mock embedding model for testing purposes.

    Returns constant embedding vectors (all 0.5 values) for any input,
    allowing tests to run without requiring a real embedding model. This is
    useful for unit testing, integration testing, and development without
    the overhead of loading actual models or making API calls.

    All embeddings returned are deterministic vectors of the specified dimension,
    filled with 0.5 values. This makes tests reproducible and fast.

    Attributes:
        embed_dim: Embedding dimension (must be positive).
        model_name: Model name identifier (defaults to "mock-embedding").

    Examples:
        - Creating a mock embedding model
            ```python
            >>> from serapeum.core.embeddings import MockEmbedding
            >>> emb = MockEmbedding(embed_dim=3)
            >>> emb.model_name
            'mock-embedding'
            >>> emb.embed_dim
            3

            ```

        - Getting embeddings returns constant vectors
            ```python
            >>> emb = MockEmbedding(embed_dim=4)
            >>> result = emb.get_text_embedding("any text")
            >>> result
            [0.5, 0.5, 0.5, 0.5]

            ```

        - All inputs produce identical constant vectors
            ```python
            >>> emb = MockEmbedding(embed_dim=2)
            >>> emb.get_text_embedding("hello")
            [0.5, 0.5]

            ```

        - Validation of embed_dim
            ```python
            >>> MockEmbedding(embed_dim=0)  # doctest: +ELLIPSIS
            Traceback (most recent call last):
                ...
            pydantic_core._pydantic_core.ValidationError: 1 validation error...

            ```

    See Also:
        BaseEmbedding: Abstract base class that MockEmbedding implements.
    """

    embed_dim: int = Field(
        ..., gt=0, description="Embedding dimension (must be positive)"
    )
    model_name: str = Field(
        default="mock-embedding", description="Model name identifier"
    )

    @field_validator("embed_dim")
    @classmethod
    def validate_embed_dim(cls, v: int) -> int:
        """Validate that embed_dim is positive.

        Args:
            v: The embed_dim value to validate.

        Returns:
            The validated embed_dim.

        Raises:
            ValueError: If embed_dim is not positive.
        """
        if v <= 0:
            raise ValueError(f"embed_dim must be positive, got {v}")
        return v

    @classmethod
    def class_name(cls) -> str:
        """Return the class name identifier.

        Returns:
            String "MockEmbedding" identifying this class.

        Examples:
            - Getting the class name
                ```python
                >>> from serapeum.core.embeddings import MockEmbedding
                >>> MockEmbedding.class_name()
                'MockEmbedding'

                ```
        """
        return "MockEmbedding"

    def _get_mocked_vector(self) -> Embedding:
        """Generate a mock embedding vector of constant values.

        Creates a vector of length embed_dim where all values are 0.5. This is
        the core method that all other embedding methods delegate to.

        Returns:
            List of floats with length equal to embed_dim, all values 0.5.

        Examples:
            - Generating a mock vector
                ```python
                >>> from serapeum.core.embeddings import MockEmbedding
                >>> emb = MockEmbedding(embed_dim=5)
                >>> emb._get_mocked_vector()
                [0.5, 0.5, 0.5, 0.5, 0.5]

                ```

            - Vector values are all 0.5
                ```python
                >>> from serapeum.core.embeddings import MockEmbedding
                >>> emb = MockEmbedding(embed_dim=3)
                >>> emb._get_mocked_vector()
                [0.5, 0.5, 0.5]

                ```
        """
        return [0.5] * self.embed_dim

    def _get_query_embedding(self, query: str) -> Embedding:
        """Get query embedding (returns constant mock vector).

        This method ignores the input query and always returns the same mock
        vector. Implements the abstract method from BaseEmbedding.

        Args:
            query: Query text (unused in mock implementation).

        Returns:
            Mock embedding vector with all values set to 0.5.

        Examples:
            - Query embedding returns mock vector
                ```python
                >>> from serapeum.core.embeddings import MockEmbedding
                >>> emb = MockEmbedding(embed_dim=3)
                >>> emb._get_query_embedding("test query")
                [0.5, 0.5, 0.5]

                ```

            - Different queries return the same constant vector
                ```python
                >>> emb = MockEmbedding(embed_dim=2)
                >>> emb._get_query_embedding("query1")
                [0.5, 0.5]

                ```
        """
        return self._get_mocked_vector()

    def _get_text_embedding(self, text: str) -> Embedding:
        """Get text embedding (returns constant mock vector).

        This method ignores the input text and always returns the same mock
        vector. Implements the abstract method from BaseEmbedding.

        Args:
            text: Input text (unused in mock implementation).

        Returns:
            Mock embedding vector with all values set to 0.5.

        Examples:
            - Text embedding returns mock vector
                ```python
                >>> from serapeum.core.embeddings import MockEmbedding
                >>> emb = MockEmbedding(embed_dim=4)
                >>> emb._get_text_embedding("sample text")
                [0.5, 0.5, 0.5, 0.5]

                ```

            - Different texts return the same constant vector
                ```python
                >>> emb = MockEmbedding(embed_dim=2)
                >>> emb._get_text_embedding("text1")
                [0.5, 0.5]

                ```
        """
        return self._get_mocked_vector()

    async def _aget_query_embedding(self, query: str) -> Embedding:
        """Asynchronously get query embedding (returns constant mock vector).

        Async version of _get_query_embedding. This method ignores the input
        query and always returns the same mock vector. Implements the abstract
        method from BaseEmbedding.

        Args:
            query: Query text (unused in mock implementation).

        Returns:
            Mock embedding vector with all values set to 0.5.

        Examples:
            - Async query embedding
                ```python
                >>> import asyncio
                >>> from serapeum.core.embeddings import MockEmbedding
                >>> emb = MockEmbedding(embed_dim=3)
                >>> asyncio.run(emb._aget_query_embedding("async query"))
                [0.5, 0.5, 0.5]

                ```
        """
        return self._get_mocked_vector()

    async def _aget_text_embedding(self, text: str) -> Embedding:
        """Asynchronously get text embedding (returns constant mock vector).

        Async version of _get_text_embedding. This method ignores the input
        text and always returns the same mock vector. Overrides the
        corresponding BaseEmbedding method.

        Args:
            text: Input text (unused in mock implementation).

        Returns:
            Mock embedding vector with all values set to 0.5.

        Examples:
            - Async text embedding
                ```python
                >>> import asyncio
                >>> from serapeum.core.embeddings import MockEmbedding
                >>> emb = MockEmbedding(embed_dim=3)
                >>> asyncio.run(emb._aget_text_embedding("async text"))
                [0.5, 0.5, 0.5]

                ```
        """
        return self._get_mocked_vector()

`class_name()` `classmethod` #

Return the class name identifier.

Returns:

Type	Description
`str`	String "MockEmbedding" identifying this class.

Examples:

Getting the class name

>>> from serapeum.core.embeddings import MockEmbedding
>>> MockEmbedding.class_name()
'MockEmbedding'

Source code in libs/core/src/serapeum/core/embeddings/types.py

@classmethod
def class_name(cls) -> str:
    """Return the class name identifier.

    Returns:
        String "MockEmbedding" identifying this class.

    Examples:
        - Getting the class name
            ```python
            >>> from serapeum.core.embeddings import MockEmbedding
            >>> MockEmbedding.class_name()
            'MockEmbedding'

            ```
    """
    return "MockEmbedding"

`validate_embed_dim(v)` `classmethod` #

Validate that embed_dim is positive.

Parameters:

Name	Type	Description	Default
`v`	`int`	The embed_dim value to validate.	required

Returns:

Type	Description
`int`	The validated embed_dim.

Raises:

Type	Description
`ValueError`	If embed_dim is not positive.

Source code in libs/core/src/serapeum/core/embeddings/types.py

@field_validator("embed_dim")
@classmethod
def validate_embed_dim(cls, v: int) -> int:
    """Validate that embed_dim is positive.

    Args:
        v: The embed_dim value to validate.

    Returns:
        The validated embed_dim.

    Raises:
        ValueError: If embed_dim is not positive.
    """
    if v <= 0:
        raise ValueError(f"embed_dim must be positive, got {v}")
    return v

`NodeContentType` #

Bases: str, Enum

Enumeration of content types that can be stored in a node.

This enum classifies the type of content a node contains, which helps downstream components (LLMs, embeddings, parsers) handle the content appropriately. String-based enum values enable direct serialization.

Attributes:

Name	Type	Description
`TEXT`		Plain text content, the most common node type.
`IMAGE`		Image data or references to images.
`INDEX`		Index structures or metadata about other nodes.
`DOCUMENT`		Complete document content before chunking.
`MULTIMODAL`		Content combining multiple modalities (text + images).

Examples:

Checking content type

>>> from serapeum.core.base.embeddings.types import NodeContentType
>>> content_type = NodeContentType.TEXT
>>> content_type.value
'text'

Using in node metadata

>>> from serapeum.core.base.embeddings.types import NodeInfo
>>> node = NodeInfo(id="node-1", type=NodeContentType.IMAGE)
>>> node.type
<NodeContentType.IMAGE: 'image'>

String comparison

>>> NodeContentType.TEXT == "text"
True

See Also

NodeInfo: Uses this enum to specify node content type. BaseNode.get_type: Abstract method returning content type string.
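
The remark about direct serialization can be illustrated with the standard `json` module. This is a sketch with a hypothetical `ContentType` stand-in that copies two of the values above; how serapeum's own serializers emit the field is not shown in this section and may differ.

```python
import json
from enum import Enum


class ContentType(str, Enum):
    """Stand-in with two of the NodeContentType values."""

    TEXT = "text"
    IMAGE = "image"


# Because members are str instances, json.dumps needs no custom encoder.
payload = json.dumps({"id": "node-1", "type": ContentType.IMAGE})

# Looking the value up on the enum restores the original member.
restored = ContentType(json.loads(payload)["type"])
```

A plain Enum without the str mixin would raise TypeError in json.dumps, which is the practical difference the mixin makes.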

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

class NodeContentType(str, Enum):
    """Enumeration of content types that can be stored in a node.

    This enum classifies the type of content a node contains, which helps
    downstream components (LLMs, embeddings, parsers) handle the content
    appropriately. String-based enum values enable direct serialization.

    Attributes:
        TEXT: Plain text content, the most common node type.
        IMAGE: Image data or references to images.
        INDEX: Index structures or metadata about other nodes.
        DOCUMENT: Complete document content before chunking.
        MULTIMODAL: Content combining multiple modalities (text + images).

    Examples:
        - Checking content type
            ```python
            >>> from serapeum.core.base.embeddings.types import NodeContentType
            >>> content_type = NodeContentType.TEXT
            >>> content_type.value
            'text'

            ```
        - Using in node metadata
            ```python
            >>> from serapeum.core.base.embeddings.types import NodeInfo
            >>> node = NodeInfo(id="node-1", type=NodeContentType.IMAGE)
            >>> node.type
            <NodeContentType.IMAGE: 'image'>

            ```
        - String comparison
            ```python
            >>> NodeContentType.TEXT == "text"
            True

            ```

    See Also:
        NodeInfo: Uses this enum to specify node content type.
        BaseNode.get_type: Abstract method returning content type string.
    """

    TEXT = "text"
    IMAGE = "image"
    INDEX = "index"
    DOCUMENT = "document"
    MULTIMODAL = "multimodal"

`NodeInfo` #

Bases: SerializableModel

Lightweight reference to a node with essential identification metadata.

NodeInfo provides a compact representation of a node without its full content, useful for creating references and relationships between nodes. It includes the node's ID, content type, metadata, and optional hash for change detection.

Attributes:

Name	Type	Description
`id`	`str`	Unique identifier for the node.
`type`	`Annotated[NodeContentType, EnumNameSerializer] \| str \| None`	Content type classification (NodeContentType enum or string).
`metadata`	`dict[str, Any]`	Arbitrary metadata dictionary for the node.
`hash`	`str \| None`	Optional hash value for detecting content changes.

Examples:

Creating a basic node reference

>>> from serapeum.core.base.embeddings.types import NodeInfo, NodeContentType
>>> ref = NodeInfo(
...     id="doc-456",
...     type=NodeContentType.TEXT,
...     metadata={"page": 1}
... )
>>> ref.id
'doc-456'

Serialization and deserialization

>>> ref = NodeInfo(id="node-1", type=NodeContentType.DOCUMENT)
>>> json_str = ref.to_json()
>>> restored = NodeInfo.from_json(json_str)
>>> restored.id
'node-1'

Using with hash for change detection

>>> import hashlib
>>> content = "Sample text"
>>> content_hash = hashlib.sha256(content.encode()).hexdigest()
>>> ref = NodeInfo(id="node-2", hash=content_hash)
>>> ref.hash[:8]
'3a2c5c49'

See Also

BaseNode: Full node implementation that generates NodeInfo. LinkedNodes: Container for node relationships using NodeInfo. SerializableModel: Base class providing serialization methods.
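
The hash field makes it possible to check whether content changed without storing the content itself. The following is a minimal sketch of that idea, assuming SHA-256 as in the example above; `NodeRef`, `content_hash`, and `has_changed` are hypothetical helpers, not serapeum APIs.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class NodeRef:
    """Hypothetical stand-in for NodeInfo with the same field names."""

    id: str
    metadata: dict[str, Any] = field(default_factory=dict)
    hash: Optional[str] = None


def content_hash(text: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def has_changed(ref: NodeRef, current_text: str) -> bool:
    """True when the stored hash no longer matches the current content."""
    return ref.hash != content_hash(current_text)


ref = NodeRef(id="node-2", hash=content_hash("Sample text"))

unchanged = has_changed(ref, "Sample text")
changed = has_changed(ref, "Sample text, edited")
```

A pipeline could use such a check to skip re-embedding nodes whose content is unchanged.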

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

class NodeInfo(SerializableModel):
    """Lightweight reference to a node with essential identification metadata.

    NodeInfo provides a compact representation of a node without its full content,
    useful for creating references and relationships between nodes. It includes
    the node's ID, content type, metadata, and optional hash for change detection.

    Attributes:
        id: Unique identifier for the node.
        type: Content type classification (NodeContentType enum or string).
        metadata: Arbitrary metadata dictionary for the node.
        hash: Optional hash value for detecting content changes.

    Examples:
        - Creating a basic node reference
            ```python
            >>> from serapeum.core.base.embeddings.types import NodeInfo, NodeContentType
            >>> ref = NodeInfo(
            ...     id="doc-456",
            ...     type=NodeContentType.TEXT,
            ...     metadata={"page": 1}
            ... )
            >>> ref.id
            'doc-456'

            ```
        - Serialization and deserialization
            ```python
            >>> ref = NodeInfo(id="node-1", type=NodeContentType.DOCUMENT)
            >>> json_str = ref.to_json()
            >>> restored = NodeInfo.from_json(json_str)
            >>> restored.id
            'node-1'

            ```
        - Using with hash for change detection
            ```python
            >>> import hashlib
            >>> content = "Sample text"
            >>> content_hash = hashlib.sha256(content.encode()).hexdigest()
            >>> ref = NodeInfo(id="node-2", hash=content_hash)
            >>> ref.hash[:8]
            '3a2c5c49'

            ```

    See Also:
        BaseNode: Full node implementation that generates NodeInfo.
        LinkedNodes: Container for node relationships using NodeInfo.
        SerializableModel: Base class providing serialization methods.
    """

    id: str
    type: Annotated[NodeContentType, EnumNameSerializer] | str | None = None
    metadata: dict[str, Any] = Field(default_factory=dict)
    hash: str | None = None

    @classmethod
    def class_name(cls) -> str:
        """Return the class name identifier for serialization.

        Returns:
            Always returns "NodeInfo" as the stable class identifier.

        Examples:
            - Getting class name
                ```python
                >>> from serapeum.core.base.embeddings.types import NodeInfo
                >>> NodeInfo.class_name()
                'NodeInfo'

                ```
        """
        return "NodeInfo"

`class_name()` `classmethod` #

Return the class name identifier for serialization.

Returns:

Type	Description
`str`	Always returns "NodeInfo" as the stable class identifier.

Examples:

Getting class name

>>> from serapeum.core.base.embeddings.types import NodeInfo
>>> NodeInfo.class_name()
'NodeInfo'

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

@classmethod
def class_name(cls) -> str:
    """Return the class name identifier for serialization.

    Returns:
        Always returns "NodeInfo" as the stable class identifier.

    Examples:
        - Getting class name
            ```python
            >>> from serapeum.core.base.embeddings.types import NodeInfo
            >>> NodeInfo.class_name()
            'NodeInfo'

            ```
    """
    return "NodeInfo"

`NodeType` #

Bases: str, Enum

Node link types used by the BaseNode class.

Attributes:

Name	Type	Description
`SOURCE`		The node is the source document.
`PREVIOUS`		The node is the previous node in the document.
`NEXT`		The node is the next node in the document.
`PARENT`		The node is the parent node in the document.
`CHILD`		The node is a child node in the document.
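
To show how these link types relate nodes to one another, the sketch below stores links in plain dictionaries and follows NEXT links to recover reading order. `LinkType`, the `links` mapping, and `walk_forward` are hypothetical; the library itself stores relationships in LinkedNodes.

```python
from enum import Enum


class LinkType(str, Enum):
    """Stand-in mirroring the NodeType values."""

    SOURCE = "source"
    PREVIOUS = "previous"
    NEXT = "next"
    PARENT = "parent"
    CHILD = "child"


# Each chunk points to its source document and to its neighbours.
links = {
    "chunk-1": {LinkType.SOURCE: "doc-1", LinkType.NEXT: "chunk-2"},
    "chunk-2": {
        LinkType.SOURCE: "doc-1",
        LinkType.PREVIOUS: "chunk-1",
        LinkType.NEXT: "chunk-3",
    },
    "chunk-3": {LinkType.SOURCE: "doc-1", LinkType.PREVIOUS: "chunk-2"},
}


def walk_forward(start: str) -> list[str]:
    """Follow NEXT links from start until a node has no successor."""
    order = [start]
    current = start
    while LinkType.NEXT in links[current]:
        current = links[current][LinkType.NEXT]
        order.append(current)
    return order


reading_order = walk_forward("chunk-1")
```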

Source code in libs/core/src/serapeum/core/base/embeddings/types.py

class NodeType(str, Enum):
    """
    Node link types used by the `BaseNode` class.

    Attributes:
        SOURCE: The node is the source document.
        PREVIOUS: The node is the previous node in the document.
        NEXT: The node is the next node in the document.
        PARENT: The node is the parent node in the document.
        CHILD: The node is a child node in the document.

    """

    SOURCE = "source"
    PREVIOUS = "previous"
    NEXT = "next"
    PARENT = "parent"
    CHILD = "child"