
Core Embeddings

Core Embedding module

serapeum.core.embeddings

Embedding module.

BaseEmbedding

Bases: SerializableModel, CallMixin, ABC

Abstract base class for all embedding model implementations.

This class provides the core interface and shared functionality for converting text into dense vector embeddings. It supports both query and document embedding, with optional caching, batching, and async operations.

Subclasses must implement the abstract methods for generating embeddings from text and queries. The class handles caching, batching, and progress tracking automatically.
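The batching behaviour described above can be pictured with a minimal, self-contained sketch. It assumes the flush rule is "emit a batch when it reaches `batch_size` or when the last text has been consumed"; `iter_batches` is a hypothetical helper written for illustration and is not part of the serapeum API.

```python
from typing import Iterator


def iter_batches(texts: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most ``batch_size`` texts (illustrative only)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    cur_batch: list[str] = []
    for idx, text in enumerate(texts):
        cur_batch.append(text)
        # Flush when the batch is full or the final text has been added.
        if idx == len(texts) - 1 or len(cur_batch) == batch_size:
            yield cur_batch
            cur_batch = []


batches = list(iter_batches(["a", "b", "c", "d", "e"], batch_size=2))
print(batches)
```

Each yielded batch would then be handed to the model in a single call, and the per-batch results concatenated in input order.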

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| model_name | str | Name of the embedding model. Defaults to "unknown". |
| batch_size | int | Number of texts to process in each batch. Must be between 1 and 2048. Defaults to 10. |
| num_workers | int \| None | Number of worker threads for async operations. If None, uses default async behavior without worker pooling. |
| cache_store | Any \| None | Optional key-value store for caching embeddings. Must implement get(), aget(), put(), and aput() methods. When provided, embeddings are cached using a key combining text and model configuration. |
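To make the `cache_store` contract more concrete, here is a minimal sketch of a store exposing `get()`, `aget()`, `put()`, and `aput()`, together with a key built from the text and the model configuration. `InMemoryKVStore` and `make_cache_key` are hypothetical names invented for this example. The keyword arguments (`key`, `val`, `collection`) and the exclusion of `api_key` and `cache_store` from the key follow the source listing further down this page.

```python
import json
from typing import Any, Optional


class InMemoryKVStore:
    """Hypothetical dict-backed store with the four methods the cache expects."""

    def __init__(self) -> None:
        # Layout: collection name -> key -> stored value.
        self._data: dict[str, dict[str, dict]] = {}

    def get(self, key: str, collection: str) -> Optional[dict]:
        return self._data.get(collection, {}).get(key)

    def put(self, key: str, val: dict, collection: str) -> None:
        self._data.setdefault(collection, {})[key] = val

    async def aget(self, key: str, collection: str) -> Optional[dict]:
        return self.get(key=key, collection=collection)

    async def aput(self, key: str, val: dict, collection: str) -> None:
        self.put(key=key, val=val, collection=collection)


def make_cache_key(text: str, model_config: dict[str, Any]) -> str:
    """Combine text and a deterministic JSON dump of the non-sensitive config."""
    config = {
        k: v for k, v in model_config.items() if k not in ("api_key", "cache_store")
    }
    return f"{text}::{json.dumps(config, sort_keys=True)}"


store = InMemoryKVStore()
key = make_cache_key(
    "hello", {"model_name": "toy", "batch_size": 10, "api_key": "secret"}
)
store.put(key=key, val={"some-id": [0.1, 0.2]}, collection="embeddings")
print(key)
```

Because the configuration is part of the key, two models with different settings would not share cached vectors for the same text.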

Notes

This is an abstract base class and cannot be instantiated directly. Subclasses must implement _get_query_embedding, _aget_query_embedding, and _get_text_embedding methods.
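The subclassing contract can be illustrated without installing the library. The sketch below uses a stand-in abstract class, `EmbeddingContract`, that mirrors only the three abstract hooks named above. Both it and `ToyEmbedding` are hypothetical. A real implementation would subclass `BaseEmbedding` and call an actual model instead of counting characters.

```python
import asyncio
from abc import ABC, abstractmethod

Embedding = list[float]


class EmbeddingContract(ABC):
    """Hypothetical stand-in mirroring the three abstract hooks."""

    @abstractmethod
    def _get_query_embedding(self, query: str) -> Embedding: ...

    @abstractmethod
    async def _aget_query_embedding(self, query: str) -> Embedding: ...

    @abstractmethod
    def _get_text_embedding(self, text: str) -> Embedding: ...


class ToyEmbedding(EmbeddingContract):
    """Toy model: embeds a string as [length, vowel count]."""

    def _embed(self, s: str) -> Embedding:
        return [float(len(s)), float(sum(c in "aeiou" for c in s.lower()))]

    def _get_query_embedding(self, query: str) -> Embedding:
        return self._embed(query)

    async def _aget_query_embedding(self, query: str) -> Embedding:
        return self._embed(query)

    def _get_text_embedding(self, text: str) -> Embedding:
        return self._embed(text)


toy = ToyEmbedding()
print(toy._get_text_embedding("hello"))
print(asyncio.run(toy._aget_query_embedding("abc")))
```

Attempting to instantiate `EmbeddingContract` directly raises `TypeError`, which is the standard `abc` behaviour for classes with unimplemented abstract methods.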

See Also

- serapeum.providers.ollama.embeddings.OllamaEmbedding: Concrete implementation for Ollama embedding models.
- CallMixin: Mixin providing `__call__` and acall methods.
- SerializableModel: Base Pydantic model with serialization support.

Source code in libs\core\src\serapeum\core\base\embeddings\base.py
class BaseEmbedding(SerializableModel, CallMixin, ABC):
    """Abstract base class for all embedding model implementations.

    This class provides the core interface and shared functionality for converting
    text into dense vector embeddings. It supports both query and document embedding,
    with optional caching, batching, and async operations.

    Subclasses must implement the abstract methods for generating embeddings from
    text and queries. The class handles caching, batching, and progress tracking
    automatically.

    Attributes:
        model_name: Name of the embedding model. Defaults to "unknown".
        batch_size: Number of texts to process in each batch. Must be between
            1 and 2048. Defaults to 10.
        num_workers: Number of worker threads for async operations. If None,
            uses default async behavior without worker pooling.
        cache_store: Optional key-value store for caching embeddings. Must implement
            get(), aget(), put(), and aput() methods. When provided, embeddings are
            cached using a key combining text and model configuration.

    Notes:
        This is an abstract base class and cannot be instantiated directly.
        Subclasses must implement _get_query_embedding, _aget_query_embedding,
        and _get_text_embedding methods.

    See Also:
        serapeum.providers.ollama.embeddings.OllamaEmbedding: Concrete implementation
            for Ollama embedding models.
        CallMixin: Mixin providing __call__ and acall methods.
        SerializableModel: Base Pydantic model with serialization support.
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)
    model_name: str = Field(
        default="unknown", description="The name of the embedding model."
    )
    batch_size: int = Field(
        default=DEFAULT_EMBED_BATCH_SIZE,
        description="The batch size for embedding calls.",
        gt=0,
        le=2048,
    )
    num_workers: int | None = Field(
        default=None,
        description="The number of workers to use for async embedding calls.",
    )

    cache_store: Any | None = Field(
        default=None,
        description=(
            "Key-value store for caching embeddings. Must implement get(), aget(), "
            "put(), and aput() methods with signature: get(key: str, collection: str) -> dict | None. "
            "When provided, embeddings are cached using a key that combines the text and model configuration. "
            "If None, embeddings are not cached and will be recomputed on each call."
        ),
    )

    def _get_cache_key(self, text: str) -> str:
        """Generate a unique cache key combining text and model configuration.

        The cache key includes both the input text and a JSON representation of
        the model configuration, ensuring that different models or configurations
        don't share cached embeddings. Sensitive fields like api_key are excluded.

        Args:
            text: Input text to generate a cache key for.

        Returns:
            Cache key string in format "{text}::{model_config_json}".

        Notes:
            The following fields are excluded from the cache key:
                - api_key: Sensitive credential information
                - cache_store: Avoid circular reference in serialization
        """
        model_dict = self.to_dict()
        model_dict.pop("api_key", None)
        model_dict.pop("cache_store", None)  # Avoid circular reference

        # Create a deterministic string representation
        model_str = json.dumps(model_dict, sort_keys=True)
        return f"{text}::{model_str}"

    @abstractmethod
    def _get_query_embedding(self, query: str) -> Embedding:
        """Embed the input query synchronously (internal implementation).

        This is an internal method that subclasses must implement to provide
        the core query embedding functionality. The public method get_query_embedding()
        handles caching and calls this method when needed.

        Query embeddings may use special instructions or prefixes depending on the
        model. For example, some models prepend "Represent the question for
        retrieving supporting documents: " to optimize for retrieval tasks.

        Args:
            query: Query text to embed.

        Returns:
            Embedding vector as a list of floats.

        See Also:
            get_query_embedding: Public method that handles caching and delegates
                to this method.
            _aget_query_embedding: Async version of this method.
        """

    @abstractmethod
    async def _aget_query_embedding(self, query: str) -> Embedding:
        """Embed the input query asynchronously (internal implementation).

        This is an internal async method that subclasses must implement to provide
        the core query embedding functionality. The public method aget_query_embedding()
        handles caching and calls this method when needed.

        Args:
            query: Query text to embed.

        Returns:
            Embedding vector as a list of floats.

        See Also:
            aget_query_embedding: Public async method that handles caching and
                delegates to this method.
            _get_query_embedding: Sync version of this method.
        """

    def get_query_embedding(self, query: str) -> Embedding:
        """Generate an embedding vector for a query string.

        Embeds the input query into a dense vector representation optimized for
        retrieval tasks. When caching is enabled, checks the cache first and stores
        new embeddings automatically.

        Depending on the model, a special instruction may be prepended to the raw
        query string to optimize for specific tasks. For example, some models use
        "Represent the question for retrieving supporting documents: ".

        Args:
            query: Query text to embed.

        Returns:
            Embedding vector as a list of floats.

        See Also:
            aget_query_embedding: Async version of this method.
            get_text_embedding: For embedding document text (not queries).
            _get_query_embedding: Internal implementation method.
        """
        query_embedding = None
        if self.cache_store:
            cache_key = self._get_cache_key(query)
            cached = self.cache_store.get(key=cache_key, collection="embeddings")
            if cached:
                cached_key = next(iter(cached.keys()))
                query_embedding = cached[cached_key]

        if query_embedding is None:
            query_embedding = self._get_query_embedding(query)
            if self.cache_store:
                cache_key = self._get_cache_key(query)
                self.cache_store.put(
                    key=cache_key,
                    val={str(uuid.uuid4()): query_embedding},
                    collection="embeddings",
                )

        return query_embedding

    async def aget_query_embedding(self, query: str) -> Embedding:
        """Asynchronously generate an embedding vector for a query string.

        Async version of get_query_embedding(). Embeds the input query into a dense
        vector representation with cache support.

        Args:
            query: Query text to embed.

        Returns:
            Embedding vector as a list of floats.

        See Also:
            get_query_embedding: Sync version of this method.
            aget_text_embedding: For embedding document text asynchronously.
            _aget_query_embedding: Internal async implementation method.
        """
        query_embedding = None
        if self.cache_store:
            cache_key = self._get_cache_key(query)
            cached = await self.cache_store.aget(key=cache_key, collection="embeddings")
            if cached:
                cached_key = next(iter(cached.keys()))
                query_embedding = cached[cached_key]

        if query_embedding is None:
            query_embedding = await self._aget_query_embedding(query)
            if self.cache_store:
                cache_key = self._get_cache_key(query)
                await self.cache_store.aput(
                    key=cache_key,
                    val={str(uuid.uuid4()): query_embedding},
                    collection="embeddings",
                )

        return query_embedding

    def get_agg_embedding_from_queries(
        self,
        queries: list[str],
        agg_fn: Callable[..., Embedding] | None = None,
    ) -> Embedding:
        """Generate a single aggregated embedding from multiple query strings.

        Embeds each query individually and then combines them using an aggregation
        function. This is useful for creating a unified representation from multiple
        related queries or questions.

        Args:
            queries: List of query strings to embed and aggregate.
            agg_fn: Optional aggregation function that takes a list of embeddings
                and returns a single embedding. Defaults to mean_agg (arithmetic mean).

        Returns:
            Single aggregated embedding vector as a list of floats.

        See Also:
            aget_agg_embedding_from_queries: Async version of this method.
            mean_agg: Default aggregation function.
            get_query_embedding: Used internally to embed each query.
        """
        query_embeddings = [self.get_query_embedding(query) for query in queries]
        agg_fn = agg_fn or mean_agg
        return agg_fn(query_embeddings)

    async def aget_agg_embedding_from_queries(
        self,
        queries: list[str],
        agg_fn: Callable[..., Embedding] | None = None,
    ) -> Embedding:
        """Asynchronously generate an aggregated embedding from multiple queries.

        Async version of get_agg_embedding_from_queries(). Embeds each query
        asynchronously and then combines them using an aggregation function.

        Args:
            queries: List of query strings to embed and aggregate.
            agg_fn: Optional aggregation function that takes a list of embeddings
                and returns a single embedding. Defaults to mean_agg.

        Returns:
            Single aggregated embedding vector as a list of floats.

        See Also:
            get_agg_embedding_from_queries: Sync version of this method.
            aget_query_embedding: Used internally to embed each query.
            mean_agg: Default aggregation function.
        """
        query_embeddings = [await self.aget_query_embedding(query) for query in queries]
        agg_fn = agg_fn or mean_agg
        return agg_fn(query_embeddings)

    @abstractmethod
    def _get_text_embedding(self, text: str) -> Embedding:
        """Embed document text synchronously (internal implementation).

        This is an internal method that subclasses must implement to provide
        the core text embedding functionality. The public method get_text_embedding()
        handles caching and calls this method when needed.

        Text embeddings may use different instructions or prefixes than query
        embeddings. For example, some models prepend "Represent the document for
        retrieval: " to optimize for document representation.

        Args:
            text: Document text to embed.

        Returns:
            Embedding vector as a list of floats.

        See Also:
            get_text_embedding: Public method that handles caching and delegates
                to this method.
            _aget_text_embedding: Async version of this method.
            _get_query_embedding: For embedding queries (not documents).
        """

    async def _aget_text_embedding(self, text: str) -> Embedding:
        """Embed document text asynchronously (internal implementation).

        This is an internal async method that subclasses can override to provide
        true async text embedding. The default implementation falls back to the
        sync method _get_text_embedding().

        Args:
            text: Document text to embed.

        Returns:
            Embedding vector as a list of floats.

        Notes:
            Subclasses should override this method if they have a native async
            implementation. Otherwise, the default fallback to the sync method
            is used.

        See Also:
            aget_text_embedding: Public async method that handles caching.
            _get_text_embedding: Sync version used as fallback.
        """
        # Default implementation just falls back on _get_text_embedding
        return self._get_text_embedding(text)

    def _get_text_embeddings(self, texts: list[str]) -> list[Embedding]:
        """Embed multiple texts synchronously in batch (internal implementation).

        This internal method provides batch embedding functionality. Subclasses
        can override this method to provide more efficient batch processing if
        supported by the underlying model API.

        Args:
            texts: List of document texts to embed.

        Returns:
            List of embedding vectors, one for each input text, in the same order.

        Notes:
            The default implementation simply loops over _get_text_embedding().
            Subclasses should override this if they can process batches more
            efficiently.

        See Also:
            get_text_embedding_batch: Public method for batch embedding with
                batching and progress tracking.
            _aget_text_embeddings: Async version of this method.
        """
        # Default implementation just loops over _get_text_embedding
        return [self._get_text_embedding(text) for text in texts]

    async def _aget_text_embeddings(self, texts: list[str]) -> list[Embedding]:
        """Embed multiple texts asynchronously in batch (internal implementation).

        This internal async method provides batch embedding functionality using
        asyncio.gather for concurrent processing. Subclasses can override this
        for more efficient batch processing.

        Args:
            texts: List of document texts to embed.

        Returns:
            List of embedding vectors, one for each input text, in the same order.

        Notes:
            The default implementation uses asyncio.gather to process all texts
            concurrently via _aget_text_embedding(). Subclasses should override
            this if they can process batches more efficiently.

        See Also:
            aget_text_embedding_batch: Public async method for batch embedding.
            _get_text_embeddings: Sync version of this method.
        """
        return await asyncio.gather(
            *[self._aget_text_embedding(text) for text in texts]
        )

    def _get_text_embeddings_cached(self, texts: list[str]) -> list[Embedding]:
        """Retrieve text embeddings from cache or generate if not cached.

        This internal method checks the cache for each text and only generates
        embeddings for texts not found in the cache. Newly generated embeddings
        are automatically stored in the cache.

        Args:
            texts: List of document texts to embed.

        Returns:
            List of embedding vectors, one for each input text, preserving order.

        Raises:
            ValueError: If cache_store is None when this method is called.

        See Also:
            _aget_text_embeddings_cached: Async version of this method.
            get_text_embedding_batch: Public method that uses this for caching.
        """
        if self.cache_store is None:
            raise ValueError("cache_store must be defined")

        embeddings: list[Embedding | None] = [None for i in range(len(texts))]
        # Tuples of (index, text) to be able to keep same order of embeddings
        non_cached_texts: list[tuple[int, str]] = []
        for i, txt in enumerate(texts):
            cache_key = self._get_cache_key(txt)
            cached_emb = self.cache_store.get(key=cache_key, collection="embeddings")
            if cached_emb is not None:
                cached_key = next(iter(cached_emb.keys()))
                embeddings[i] = cached_emb[cached_key]
            else:
                non_cached_texts.append((i, txt))
        if len(non_cached_texts) > 0:
            text_embeddings = self._get_text_embeddings(
                [x[1] for x in non_cached_texts]
            )
            for j, text_embedding in enumerate(text_embeddings):
                orig_i = non_cached_texts[j][0]
                embeddings[orig_i] = text_embedding

                cache_key = self._get_cache_key(texts[orig_i])
                self.cache_store.put(
                    key=cache_key,
                    val={str(uuid.uuid4()): text_embedding},
                    collection="embeddings",
                )
        return embeddings

    async def _aget_text_embeddings_cached(self, texts: list[str]) -> list[Embedding]:
        """Asynchronously retrieve text embeddings from cache or generate them.

        Async version of _get_text_embeddings_cached(). Checks the cache for each
        text and generates embeddings only for texts not found. Newly generated
        embeddings are automatically stored in the cache.

        Args:
            texts: List of document texts to embed.

        Returns:
            List of embedding vectors, one for each input text, preserving order.

        Raises:
            ValueError: If cache_store is None when this method is called.

        See Also:
            _get_text_embeddings_cached: Sync version of this method.
            aget_text_embedding_batch: Public async method that uses this.
        """
        if self.cache_store is None:
            raise ValueError("cache_store must be defined")

        embeddings: list[Embedding | None] = [None for i in range(len(texts))]
        # Tuples of (index, text) to be able to keep same order of embeddings
        non_cached_texts: list[tuple[int, str]] = []
        for i, txt in enumerate(texts):
            cache_key = self._get_cache_key(txt)
            cached_emb = await self.cache_store.aget(
                key=cache_key, collection="embeddings"
            )
            if cached_emb is not None:
                cached_key = next(iter(cached_emb.keys()))
                embeddings[i] = cached_emb[cached_key]
            else:
                non_cached_texts.append((i, txt))

        if len(non_cached_texts) > 0:
            text_embeddings = await self._aget_text_embeddings(
                [x[1] for x in non_cached_texts]
            )
            for j, text_embedding in enumerate(text_embeddings):
                orig_i = non_cached_texts[j][0]
                embeddings[orig_i] = text_embedding
                cache_key = self._get_cache_key(texts[orig_i])
                await self.cache_store.aput(
                    key=cache_key,
                    val={str(uuid.uuid4()): text_embedding},
                    collection="embeddings",
                )
        return embeddings

    def get_text_embedding(self, text: str) -> Embedding:
        """Generate an embedding vector for document text.

        Embeds the input text into a dense vector representation optimized for
        document representation tasks. When caching is enabled, checks the cache
        first and stores new embeddings automatically.

        Depending on the model, a special instruction may be prepended to the raw
        text string to optimize for document retrieval. For example, some models
        use "Represent the document for retrieval: ".

        Args:
            text: Document text to embed.

        Returns:
            Embedding vector as a list of floats.

        See Also:
            aget_text_embedding: Async version of this method.
            get_query_embedding: For embedding queries (not documents).
            get_text_embedding_batch: For embedding multiple texts efficiently.
            _get_text_embedding: Internal implementation method.
        """
        if not self.cache_store:
            text_embedding = self._get_text_embedding(text)
        elif self.cache_store is not None:
            cache_key = self._get_cache_key(text)
            cached_emb = self.cache_store.get(key=cache_key, collection="embeddings")
            if cached_emb is not None:
                cached_key = next(iter(cached_emb.keys()))
                text_embedding = cached_emb[cached_key]
            else:
                text_embedding = self._get_text_embedding(text)
                cache_key = self._get_cache_key(text)
                self.cache_store.put(
                    key=cache_key,
                    val={str(uuid.uuid4()): text_embedding},
                    collection="embeddings",
                )

        return text_embedding

    async def aget_text_embedding(self, text: str) -> Embedding:
        """Asynchronously generate an embedding vector for document text.

        Async version of get_text_embedding(). Embeds the input text into a dense
        vector representation with cache support.

        Args:
            text: Document text to embed.

        Returns:
            Embedding vector as a list of floats.

        See Also:
            get_text_embedding: Sync version of this method.
            aget_query_embedding: For embedding queries asynchronously.
            aget_text_embedding_batch: For embedding multiple texts efficiently.
            _aget_text_embedding: Internal async implementation method.
        """
        if not self.cache_store:
            text_embedding = await self._aget_text_embedding(text)
        elif self.cache_store is not None:
            cache_key = self._get_cache_key(text)
            cached_emb = await self.cache_store.aget(
                key=cache_key, collection="embeddings"
            )
            if cached_emb is not None:
                cached_key = next(iter(cached_emb.keys()))
                text_embedding = cached_emb[cached_key]
            else:
                text_embedding = await self._aget_text_embedding(text)
                cache_key = self._get_cache_key(text)
                await self.cache_store.aput(
                    key=cache_key,
                    val={str(uuid.uuid4()): text_embedding},
                    collection="embeddings",
                )

        return text_embedding

    def get_text_embedding_batch(
        self,
        texts: list[str],
        show_progress: bool = False,
        **kwargs: Any,
    ) -> list[Embedding]:
        """Generate embeddings for multiple texts with automatic batching.

        Processes a list of texts in batches according to self.batch_size. Supports
        optional progress tracking and automatic caching if cache_store is configured.

        Args:
            texts: List of document texts to embed.
            show_progress: Whether to display a progress bar. Defaults to False.
            **kwargs: Additional keyword arguments (reserved for future use).

        Returns:
            List of embedding vectors, one for each input text, in the same order.

        See Also:
            aget_text_embedding_batch: Async version with parallel processing.
            get_text_embedding: For embedding a single text.
            _get_text_embeddings: Internal batch processing method.
            _get_text_embeddings_cached: Internal cached batch processing.
        """
        cur_batch: list[str] = []
        result_embeddings: list[Embedding] = []

        queue_with_progress = enumerate(
            get_tqdm_iterable(texts, show_progress, "Generating embeddings")
        )

        for idx, text in queue_with_progress:
            cur_batch.append(text)
            if idx == len(texts) - 1 or len(cur_batch) == self.batch_size:
                # flush
                if not self.cache_store:
                    embeddings = self._get_text_embeddings(cur_batch)
                elif self.cache_store is not None:
                    embeddings = self._get_text_embeddings_cached(cur_batch)

                result_embeddings.extend(embeddings)

                cur_batch = []

        return result_embeddings

    async def aget_text_embedding_batch(
        self,
        texts: list[str],
        show_progress: bool = False,
        **kwargs: Any,
    ) -> list[Embedding]:
        """Asynchronously generate embeddings for multiple texts with batching.

        Async version of get_text_embedding_batch(). Processes texts in batches
        with concurrent execution for improved performance. Supports worker pooling
        if num_workers is set.

        Args:
            texts: List of document texts to embed.
            show_progress: Whether to display a progress bar. Defaults to False.
                Requires tqdm package for progress tracking.
            **kwargs: Additional keyword arguments (reserved for future use).

        Returns:
            List of embedding vectors, one for each input text, in the same order.

        Notes:
            When num_workers > 1, uses worker pooling for concurrent batch processing.
            When show_progress=True, attempts to use tqdm.asyncio for progress tracking.

        See Also:
            get_text_embedding_batch: Sync version of this method.
            aget_text_embedding: For embedding a single text asynchronously.
            _aget_text_embeddings: Internal async batch processing method.
        """
        num_workers = self.num_workers

        cur_batch: list[str] = []
        embeddings_coroutines: list[Coroutine] = []

        # for idx, text in queue_with_progress:
        for idx, text in enumerate(texts):
            cur_batch.append(text)
            if idx == len(texts) - 1 or len(cur_batch) == self.batch_size:
                # flush

                if not self.cache_store:
                    embeddings_coroutines.append(self._aget_text_embeddings(cur_batch))
                elif self.cache_store is not None:
                    embeddings_coroutines.append(
                        self._aget_text_embeddings_cached(cur_batch)
                    )

                cur_batch = []

        # flatten the results of asyncio.gather, which is a list of embeddings lists
        if len(embeddings_coroutines) > 0:
            if num_workers and num_workers > 1:
                nested_embeddings = await run_jobs(
                    embeddings_coroutines,
                    show_progress=show_progress,
                    workers=self.num_workers,
                    desc="Generating embeddings",
                )
            elif show_progress:
                try:
                    nested_embeddings = await tqdm_asyncio.gather(
                        *embeddings_coroutines,
                        total=len(embeddings_coroutines),
                        desc="Generating embeddings",
                    )
                except ImportError:
                    nested_embeddings = await asyncio.gather(*embeddings_coroutines)
            else:
                nested_embeddings = await asyncio.gather(*embeddings_coroutines)
        else:
            nested_embeddings = []

        result_embeddings = [
            embedding for embeddings in nested_embeddings for embedding in embeddings
        ]
        return result_embeddings

    @staticmethod
    def similarity(
        embedding1: Embedding,
        embedding2: Embedding,
        mode: SimilarityMode = SimilarityMode.DEFAULT,
    ) -> float:
        """Calculate similarity between two embedding vectors.

        Static method wrapper for the module-level similarity() function. Provides
        a convenient way to compute similarity directly from the class.

        Args:
            embedding1: First embedding vector (list of floats).
            embedding2: Second embedding vector (list of floats).
            mode: Similarity computation mode. Defaults to cosine similarity.

        Returns:
            Similarity score as a float. Interpretation depends on the mode.

        Examples:
            - Computing cosine similarity
                ```python
                >>> from serapeum.core.embeddings import BaseEmbedding  # type: ignore
                >>> emb1 = [1.0, 0.0]
                >>> emb2 = [1.0, 0.0]
                >>> float(BaseEmbedding.similarity(emb1, emb2))
                1.0

                ```

            - Using different similarity modes
                ```python
                >>> emb1 = [3.0, 4.0]
                >>> emb2 = [3.0, 4.0]
                >>> float(BaseEmbedding.similarity(emb1, emb2, mode=SimilarityMode.DOT_PRODUCT))
                25.0

                ```

        See Also:
            similarity: Module-level function that performs the actual calculation.
            SimilarityMode: Enum defining available similarity modes.
        """
        return similarity(embedding1=embedding1, embedding2=embedding2, mode=mode)

    def __call__(self, nodes: Sequence[BaseNode], **kwargs: Any) -> Sequence[BaseNode]:
        """Embed a sequence of nodes by calling the embedding model.

        This makes the embedding model callable, allowing it to be used as a function.
        Extracts text content from each node, generates embeddings, and assigns them
        back to the nodes.

        Args:
            nodes: Sequence of BaseNode objects to embed.
            **kwargs: Additional keyword arguments passed to get_text_embedding_batch.

        Returns:
            The input sequence of nodes with embeddings assigned to each node's
            embedding attribute.

        See Also:
            acall: Async version of this method.
            get_text_embedding_batch: Method used internally for batch embedding.
            MetadataMode.EMBED: Mode used to extract content from nodes.
        """
        embeddings = self.get_text_embedding_batch(
            [node.get_content(metadata_mode=MetadataMode.EMBED) for node in nodes],
            **kwargs,
        )

        for node, embedding in zip(nodes, embeddings):
            node.embedding = embedding

        return nodes

    async def acall(
        self, nodes: Sequence[BaseNode], **kwargs: Any
    ) -> Sequence[BaseNode]:
        """Asynchronously embed a sequence of nodes.

        Async version of __call__(). Extracts text content from each node,
        generates embeddings asynchronously, and assigns them back to the nodes.

        Args:
            nodes: Sequence of BaseNode objects to embed.
            **kwargs: Additional keyword arguments passed to aget_text_embedding_batch.

        Returns:
            The input sequence of nodes with embeddings assigned to each node's
            embedding attribute.

        See Also:
            __call__: Sync version of this method.
            aget_text_embedding_batch: Method used internally for async batch embedding.
            MetadataMode.EMBED: Mode used to extract content from nodes.
        """
        embeddings = await self.aget_text_embedding_batch(
            [node.get_content(metadata_mode=MetadataMode.EMBED) for node in nodes],
            **kwargs,
        )

        for node, embedding in zip(nodes, embeddings):
            node.embedding = embedding

        return nodes

__call__(nodes, **kwargs) #

Embed a sequence of nodes by calling the embedding model.

This makes the embedding model callable, allowing it to be used as a function. Extracts text content from each node, generates embeddings, and assigns them back to the nodes.

Parameters:

Name Type Description Default
nodes Sequence[BaseNode]

Sequence of BaseNode objects to embed.

required
**kwargs Any

Additional keyword arguments passed to get_text_embedding_batch.

{}

Returns:

Type Description
Sequence[BaseNode]

The input sequence of nodes with embeddings assigned to each node's embedding attribute.

See Also

acall: Async version of this method. get_text_embedding_batch: Method used internally for batch embedding. MetadataMode.EMBED: Mode used to extract content from nodes.

Source code in libs\core\src\serapeum\core\base\embeddings\base.py
def __call__(self, nodes: Sequence[BaseNode], **kwargs: Any) -> Sequence[BaseNode]:
    """Embed a sequence of nodes by calling the embedding model.

    This makes the embedding model callable, allowing it to be used as a function.
    Extracts text content from each node, generates embeddings, and assigns them
    back to the nodes.

    Args:
        nodes: Sequence of BaseNode objects to embed.
        **kwargs: Additional keyword arguments passed to get_text_embedding_batch.

    Returns:
        The input sequence of nodes with embeddings assigned to each node's
        embedding attribute.

    See Also:
        acall: Async version of this method.
        get_text_embedding_batch: Method used internally for batch embedding.
        MetadataMode.EMBED: Mode used to extract content from nodes.
    """
    embeddings = self.get_text_embedding_batch(
        [node.get_content(metadata_mode=MetadataMode.EMBED) for node in nodes],
        **kwargs,
    )

    for node, embedding in zip(nodes, embeddings):
        node.embedding = embedding

    return nodes
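
The three steps in the listing (extract content, embed in one batch, assign back) can be sketched without the library. `SketchNode` and `LengthEmbedder` below are hypothetical stand-ins, not serapeum classes, and the length-based "embedding" is only there to make the output easy to check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchNode:
    """Hypothetical stand-in for BaseNode: text plus an optional embedding."""

    text: str
    embedding: Optional[list] = None

    def get_content(self) -> str:
        return self.text


class LengthEmbedder:
    """Hypothetical embedder: each text becomes a one-element vector of its length."""

    def get_text_embedding_batch(self, texts: list) -> list:
        return [[float(len(text))] for text in texts]

    def __call__(self, nodes: list) -> list:
        # Extract content, embed everything in one batch call, assign back in order.
        embeddings = self.get_text_embedding_batch(
            [node.get_content() for node in nodes]
        )
        for node, embedding in zip(nodes, embeddings):
            node.embedding = embedding
        return nodes


nodes = LengthEmbedder()([SketchNode("hi"), SketchNode("hello")])
print([node.embedding for node in nodes])  # [[2.0], [5.0]]
```

Because the nodes are mutated in place, the returned sequence is the same object that was passed in.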

acall(nodes, **kwargs) async #

Asynchronously embed a sequence of nodes.

Async version of __call__(). Extracts text content from each node, generates embeddings asynchronously, and assigns them back to the nodes.

Parameters:

Name Type Description Default
nodes Sequence[BaseNode]

Sequence of BaseNode objects to embed.

required
**kwargs Any

Additional keyword arguments passed to aget_text_embedding_batch.

{}

Returns:

Type Description
Sequence[BaseNode]

The input sequence of nodes with embeddings assigned to each node's embedding attribute.

See Also

__call__: Sync version of this method. aget_text_embedding_batch: Method used internally for async batch embedding. MetadataMode.EMBED: Mode used to extract content from nodes.

Source code in libs\core\src\serapeum\core\base\embeddings\base.py
async def acall(
    self, nodes: Sequence[BaseNode], **kwargs: Any
) -> Sequence[BaseNode]:
    """Asynchronously embed a sequence of nodes.

    Async version of __call__(). Extracts text content from each node,
    generates embeddings asynchronously, and assigns them back to the nodes.

    Args:
        nodes: Sequence of BaseNode objects to embed.
        **kwargs: Additional keyword arguments passed to aget_text_embedding_batch.

    Returns:
        The input sequence of nodes with embeddings assigned to each node's
        embedding attribute.

    See Also:
        __call__: Sync version of this method.
        aget_text_embedding_batch: Method used internally for async batch embedding.
        MetadataMode.EMBED: Mode used to extract content from nodes.
    """
    embeddings = await self.aget_text_embedding_batch(
        [node.get_content(metadata_mode=MetadataMode.EMBED) for node in nodes],
        **kwargs,
    )

    for node, embedding in zip(nodes, embeddings):
        node.embedding = embedding

    return nodes

aget_agg_embedding_from_queries(queries, agg_fn=None) async #

Asynchronously generate an aggregated embedding from multiple queries.

Async version of get_agg_embedding_from_queries(). Embeds each query asynchronously and then combines them using an aggregation function.

Parameters:

Name Type Description Default
queries list[str]

List of query strings to embed and aggregate.

required
agg_fn Callable[..., Embedding] | None

Optional aggregation function that takes a list of embeddings and returns a single embedding. Defaults to mean_agg.

None

Returns:

Type Description
Embedding

Single aggregated embedding vector as a list of floats.

See Also

get_agg_embedding_from_queries: Sync version of this method. aget_query_embedding: Used internally to embed each query. mean_agg: Default aggregation function.

Source code in libs\core\src\serapeum\core\base\embeddings\base.py
async def aget_agg_embedding_from_queries(
    self,
    queries: list[str],
    agg_fn: Callable[..., Embedding] | None = None,
) -> Embedding:
    """Asynchronously generate an aggregated embedding from multiple queries.

    Async version of get_agg_embedding_from_queries(). Embeds each query
    asynchronously and then combines them using an aggregation function.

    Args:
        queries: List of query strings to embed and aggregate.
        agg_fn: Optional aggregation function that takes a list of embeddings
            and returns a single embedding. Defaults to mean_agg.

    Returns:
        Single aggregated embedding vector as a list of floats.

    See Also:
        get_agg_embedding_from_queries: Sync version of this method.
        aget_query_embedding: Used internally to embed each query.
        mean_agg: Default aggregation function.
    """
    query_embeddings = [await self.aget_query_embedding(query) for query in queries]
    agg_fn = agg_fn or mean_agg
    return agg_fn(query_embeddings)
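
The list comprehension in the listing awaits each query before starting the next, so the queries are embedded one at a time. The sketch below shows that behaviour next to a concurrent variant built on `asyncio.gather`. `fake_query_embedding` is a hypothetical stand-in for `aget_query_embedding`, and the concurrent version is an illustration only, not something the library does.

```python
import asyncio


async def fake_query_embedding(query: str) -> list:
    """Hypothetical stand-in for aget_query_embedding."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [float(len(query))]


async def embed_sequential(queries: list) -> list:
    # Mirrors the listing: each await completes before the next one starts.
    return [await fake_query_embedding(query) for query in queries]


async def embed_concurrent(queries: list) -> list:
    # Variant: schedule every query at once; gather keeps the input order.
    return list(await asyncio.gather(*(fake_query_embedding(q) for q in queries)))


sequential = asyncio.run(embed_sequential(["ab", "abcd"]))
concurrent = asyncio.run(embed_concurrent(["ab", "abcd"]))
print(sequential == concurrent)  # True
```

Both forms return the embeddings in query order, so either result can be passed to the aggregation function unchanged.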

aget_query_embedding(query) async #

Asynchronously generate an embedding vector for a query string.

Async version of get_query_embedding(). Embeds the input query into a dense vector representation with cache support.

Parameters:

Name Type Description Default
query str

Query text to embed.

required

Returns:

Type Description
Embedding

Embedding vector as a list of floats.

See Also

get_query_embedding: Sync version of this method. aget_text_embedding: For embedding document text asynchronously. _aget_query_embedding: Internal async implementation method.

Source code in libs\core\src\serapeum\core\base\embeddings\base.py
async def aget_query_embedding(self, query: str) -> Embedding:
    """Asynchronously generate an embedding vector for a query string.

    Async version of get_query_embedding(). Embeds the input query into a dense
    vector representation with cache support.

    Args:
        query: Query text to embed.

    Returns:
        Embedding vector as a list of floats.

    See Also:
        get_query_embedding: Sync version of this method.
        aget_text_embedding: For embedding document text asynchronously.
        _aget_query_embedding: Internal async implementation method.
    """
    query_embedding = None
    if self.cache_store:
        cache_key = self._get_cache_key(query)
        cached = await self.cache_store.aget(key=cache_key, collection="embeddings")
        if cached:
            cached_key = next(iter(cached.keys()))
            query_embedding = cached[cached_key]

    if query_embedding is None:
        query_embedding = await self._aget_query_embedding(query)
        if self.cache_store:
            cache_key = self._get_cache_key(query)
            await self.cache_store.aput(
                key=cache_key,
                val={str(uuid.uuid4()): query_embedding},
                collection="embeddings",
            )

    return query_embedding
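
The read-through caching in the listing can be tried in isolation. This is a minimal sketch under stated assumptions: `DictStore` is a hypothetical in-memory store that only mimics the `aget`/`aput` keyword arguments used above, the cache key is a plain string rather than the result of `_get_cache_key`, and `fake_embed` stands in for the model call.

```python
import asyncio
import uuid


class DictStore:
    """Hypothetical in-memory store with the aget/aput calls used in the listing."""

    def __init__(self) -> None:
        self._data = {}

    async def aget(self, key: str, collection: str):
        return self._data.get((collection, key))

    async def aput(self, key: str, val: dict, collection: str) -> None:
        self._data[(collection, key)] = val


stats = {"model_calls": 0}


async def fake_embed(query: str) -> list:
    """Stand-in for _aget_query_embedding; counts how often the model is hit."""
    stats["model_calls"] += 1
    return [float(len(query))]


async def cached_embed(store: DictStore, query: str) -> list:
    cache_key = f"sketch-model:{query}"  # assumption: real keys come from _get_cache_key
    cached = await store.aget(key=cache_key, collection="embeddings")
    if cached:
        # The stored value is a one-entry dict {uuid: embedding}; unwrap it.
        return cached[next(iter(cached))]
    embedding = await fake_embed(query)
    await store.aput(
        key=cache_key, val={str(uuid.uuid4()): embedding}, collection="embeddings"
    )
    return embedding


async def main():
    store = DictStore()
    return await cached_embed(store, "hello"), await cached_embed(store, "hello")


first, second = asyncio.run(main())
print(first, second, stats["model_calls"])  # [5.0] [5.0] 1
```

The second lookup is served from the store, so the model stand-in runs only once.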

aget_text_embedding(text) async #

Asynchronously generate an embedding vector for document text.

Async version of get_text_embedding(). Embeds the input text into a dense vector representation with cache support.

Parameters:

Name Type Description Default
text str

Document text to embed.

required

Returns:

Type Description
Embedding

Embedding vector as a list of floats.

See Also

get_text_embedding: Sync version of this method. aget_query_embedding: For embedding queries asynchronously. aget_text_embedding_batch: For embedding multiple texts efficiently. _aget_text_embedding: Internal async implementation method.

Source code in libs\core\src\serapeum\core\base\embeddings\base.py
async def aget_text_embedding(self, text: str) -> Embedding:
    """Asynchronously generate an embedding vector for document text.

    Async version of get_text_embedding(). Embeds the input text into a dense
    vector representation with cache support.

    Args:
        text: Document text to embed.

    Returns:
        Embedding vector as a list of floats.

    See Also:
        get_text_embedding: Sync version of this method.
        aget_query_embedding: For embedding queries asynchronously.
        aget_text_embedding_batch: For embedding multiple texts efficiently.
        _aget_text_embedding: Internal async implementation method.
    """
    if not self.cache_store:
        text_embedding = await self._aget_text_embedding(text)
    elif self.cache_store is not None:
        cache_key = self._get_cache_key(text)
        cached_emb = await self.cache_store.aget(
            key=cache_key, collection="embeddings"
        )
        if cached_emb is not None:
            cached_key = next(iter(cached_emb.keys()))
            text_embedding = cached_emb[cached_key]
        else:
            text_embedding = await self._aget_text_embedding(text)
            cache_key = self._get_cache_key(text)
            await self.cache_store.aput(
                key=cache_key,
                val={str(uuid.uuid4()): text_embedding},
                collection="embeddings",
            )

    return text_embedding

aget_text_embedding_batch(texts, show_progress=False, **kwargs) async #

Asynchronously generate embeddings for multiple texts with batching.

Async version of get_text_embedding_batch(). Processes texts in batches with concurrent execution for improved performance. Supports worker pooling if num_workers is set.

Parameters:

Name Type Description Default
texts list[str]

List of document texts to embed.

required
show_progress bool

Whether to display a progress bar. Defaults to False. Requires tqdm package for progress tracking.

False
**kwargs Any

Additional keyword arguments (reserved for future use).

{}

Returns:

Type Description
list[Embedding]

List of embedding vectors, one for each input text, in the same order.

Notes

When num_workers > 1, uses worker pooling for concurrent batch processing. When show_progress=True, attempts to use tqdm.asyncio for progress tracking.

See Also

get_text_embedding_batch: Sync version of this method. aget_text_embedding: For embedding a single text asynchronously. _aget_text_embeddings: Internal async batch processing method.

Source code in libs\core\src\serapeum\core\base\embeddings\base.py
async def aget_text_embedding_batch(
    self,
    texts: list[str],
    show_progress: bool = False,
    **kwargs: Any,
) -> list[Embedding]:
    """Asynchronously generate embeddings for multiple texts with batching.

    Async version of get_text_embedding_batch(). Processes texts in batches
    with concurrent execution for improved performance. Supports worker pooling
    if num_workers is set.

    Args:
        texts: List of document texts to embed.
        show_progress: Whether to display a progress bar. Defaults to False.
            Requires tqdm package for progress tracking.
        **kwargs: Additional keyword arguments (reserved for future use).

    Returns:
        List of embedding vectors, one for each input text, in the same order.

    Notes:
        When num_workers > 1, uses worker pooling for concurrent batch processing.
        When show_progress=True, attempts to use tqdm.asyncio for progress tracking.

    See Also:
        get_text_embedding_batch: Sync version of this method.
        aget_text_embedding: For embedding a single text asynchronously.
        _aget_text_embeddings: Internal async batch processing method.
    """
    num_workers = self.num_workers

    cur_batch: list[str] = []
    embeddings_coroutines: list[Coroutine] = []

    # for idx, text in queue_with_progress:
    for idx, text in enumerate(texts):
        cur_batch.append(text)
        if idx == len(texts) - 1 or len(cur_batch) == self.batch_size:
            # flush

            if not self.cache_store:
                embeddings_coroutines.append(self._aget_text_embeddings(cur_batch))
            elif self.cache_store is not None:
                embeddings_coroutines.append(
                    self._aget_text_embeddings_cached(cur_batch)
                )

            cur_batch = []

    # flatten the results of asyncio.gather, which is a list of embeddings lists
    if len(embeddings_coroutines) > 0:
        if num_workers and num_workers > 1:
            nested_embeddings = await run_jobs(
                embeddings_coroutines,
                show_progress=show_progress,
                workers=self.num_workers,
                desc="Generating embeddings",
            )
        elif show_progress:
            try:
                nested_embeddings = await tqdm_asyncio.gather(
                    *embeddings_coroutines,
                    total=len(embeddings_coroutines),
                    desc="Generating embeddings",
                )
            except ImportError:
                nested_embeddings = await asyncio.gather(*embeddings_coroutines)
        else:
            nested_embeddings = await asyncio.gather(*embeddings_coroutines)
    else:
        nested_embeddings = []

    result_embeddings = [
        embedding for embeddings in nested_embeddings for embedding in embeddings
    ]
    return result_embeddings
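
A minimal sketch of the gather-and-flatten step, assuming no cache store, no worker pool, and no progress bar. `fake_embed_batch` is a hypothetical stand-in for `_aget_text_embeddings`, and slicing is used here in place of the flush loop; it yields the same batches.

```python
import asyncio


async def fake_embed_batch(batch: list) -> list:
    """Hypothetical stand-in for _aget_text_embeddings."""
    await asyncio.sleep(0)  # pretend to wait on a model or network call
    return [[float(len(text))] for text in batch]


async def embed_all(texts: list, batch_size: int) -> list:
    # One coroutine per batch, created up front and awaited together.
    coroutines = [
        fake_embed_batch(texts[start:start + batch_size])
        for start in range(0, len(texts), batch_size)
    ]
    # gather returns results in the order the coroutines were passed in,
    # so flattening preserves the order of the input texts.
    nested = await asyncio.gather(*coroutines)
    return [embedding for batch in nested for embedding in batch]


result = asyncio.run(embed_all(["a", "bb", "ccc"], batch_size=2))
print(result)  # [[1.0], [2.0], [3.0]]
```

The ordering guarantee of `asyncio.gather` is what lets the flattened list line up with the input texts.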

get_agg_embedding_from_queries(queries, agg_fn=None) #

Generate a single aggregated embedding from multiple query strings.

Embeds each query individually and then combines them using an aggregation function. This is useful for creating a unified representation from multiple related queries or questions.

Parameters:

Name Type Description Default
queries list[str]

List of query strings to embed and aggregate.

required
agg_fn Callable[..., Embedding] | None

Optional aggregation function that takes a list of embeddings and returns a single embedding. Defaults to mean_agg (arithmetic mean).

None

Returns:

Type Description
Embedding

Single aggregated embedding vector as a list of floats.

See Also

aget_agg_embedding_from_queries: Async version of this method. mean_agg: Default aggregation function. get_query_embedding: Used internally to embed each query.

Source code in libs\core\src\serapeum\core\base\embeddings\base.py
def get_agg_embedding_from_queries(
    self,
    queries: list[str],
    agg_fn: Callable[..., Embedding] | None = None,
) -> Embedding:
    """Generate a single aggregated embedding from multiple query strings.

    Embeds each query individually and then combines them using an aggregation
    function. This is useful for creating a unified representation from multiple
    related queries or questions.

    Args:
        queries: List of query strings to embed and aggregate.
        agg_fn: Optional aggregation function that takes a list of embeddings
            and returns a single embedding. Defaults to mean_agg (arithmetic mean).

    Returns:
        Single aggregated embedding vector as a list of floats.

    See Also:
        aget_agg_embedding_from_queries: Async version of this method.
        mean_agg: Default aggregation function.
        get_query_embedding: Used internally to embed each query.
    """
    query_embeddings = [self.get_query_embedding(query) for query in queries]
    agg_fn = agg_fn or mean_agg
    return agg_fn(query_embeddings)
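
To illustrate the `agg_fn` parameter, here is a sketch that swaps the default mean for a max-pooling function. `mean_agg_sketch` assumes that `mean_agg` is an element-wise arithmetic mean, as the description above states; `max_agg`, `toy_query_embedding`, and `aggregate` are hypothetical names used only for this example.

```python
def mean_agg_sketch(embeddings: list) -> list:
    """Element-wise arithmetic mean (assumed behaviour of mean_agg)."""
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]


def max_agg(embeddings: list) -> list:
    """Hypothetical custom aggregator: element-wise maximum."""
    return [max(column) for column in zip(*embeddings)]


def toy_query_embedding(query: str) -> list:
    """Toy embedding: [character count, space count]."""
    return [float(len(query)), float(query.count(" "))]


def aggregate(queries: list, agg_fn=None) -> list:
    # Same shape as the listing: embed each query, then fall back to the mean.
    query_embeddings = [toy_query_embedding(query) for query in queries]
    agg_fn = agg_fn or mean_agg_sketch
    return agg_fn(query_embeddings)


mean_vec = aggregate(["cat", "a dog"])
max_vec = aggregate(["cat", "a dog"], agg_fn=max_agg)
print(mean_vec, max_vec)  # [4.0, 0.5] [5.0, 1.0]
```

Any callable that takes a list of equal-length vectors and returns one vector fits the same slot.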

get_query_embedding(query) #

Generate an embedding vector for a query string.

Embeds the input query into a dense vector representation optimized for retrieval tasks. When caching is enabled, checks the cache first and stores new embeddings automatically.

Depending on the model, a special instruction may be prepended to the raw query string to optimize for specific tasks. For example, some models use "Represent the question for retrieving supporting documents: ".

Parameters:

Name Type Description Default
query str

Query text to embed.

required

Returns:

Type Description
Embedding

Embedding vector as a list of floats.

See Also

aget_query_embedding: Async version of this method. get_text_embedding: For embedding document text (not queries). _get_query_embedding: Internal implementation method.

Source code in libs\core\src\serapeum\core\base\embeddings\base.py
def get_query_embedding(self, query: str) -> Embedding:
    """Generate an embedding vector for a query string.

    Embeds the input query into a dense vector representation optimized for
    retrieval tasks. When caching is enabled, checks the cache first and stores
    new embeddings automatically.

    Depending on the model, a special instruction may be prepended to the raw
    query string to optimize for specific tasks. For example, some models use
    "Represent the question for retrieving supporting documents: ".

    Args:
        query: Query text to embed.

    Returns:
        Embedding vector as a list of floats.

    See Also:
        aget_query_embedding: Async version of this method.
        get_text_embedding: For embedding document text (not queries).
        _get_query_embedding: Internal implementation method.
    """
    query_embedding = None
    if self.cache_store:
        cache_key = self._get_cache_key(query)
        cached = self.cache_store.get(key=cache_key, collection="embeddings")
        if cached:
            cached_key = next(iter(cached.keys()))
            query_embedding = cached[cached_key]

    if query_embedding is None:
        query_embedding = self._get_query_embedding(query)
        if self.cache_store:
            cache_key = self._get_cache_key(query)
            self.cache_store.put(
                key=cache_key,
                val={str(uuid.uuid4()): query_embedding},
                collection="embeddings",
            )

    return query_embedding

get_text_embedding(text) #

Generate an embedding vector for document text.

Embeds the input text into a dense vector representation optimized for document representation tasks. When caching is enabled, checks the cache first and stores new embeddings automatically.

Depending on the model, a special instruction may be prepended to the raw text string to optimize for document retrieval. For example, some models use "Represent the document for retrieval: ".

Parameters:

Name Type Description Default
text str

Document text to embed.

required

Returns:

Type Description
Embedding

Embedding vector as a list of floats.

See Also

aget_text_embedding: Async version of this method. get_query_embedding: For embedding queries (not documents). get_text_embedding_batch: For embedding multiple texts efficiently. _get_text_embedding: Internal implementation method.

Source code in libs\core\src\serapeum\core\base\embeddings\base.py
def get_text_embedding(self, text: str) -> Embedding:
    """Generate an embedding vector for document text.

    Embeds the input text into a dense vector representation optimized for
    document representation tasks. When caching is enabled, checks the cache
    first and stores new embeddings automatically.

    Depending on the model, a special instruction may be prepended to the raw
    text string to optimize for document retrieval. For example, some models
    use "Represent the document for retrieval: ".

    Args:
        text: Document text to embed.

    Returns:
        Embedding vector as a list of floats.

    See Also:
        aget_text_embedding: Async version of this method.
        get_query_embedding: For embedding queries (not documents).
        get_text_embedding_batch: For embedding multiple texts efficiently.
        _get_text_embedding: Internal implementation method.
    """
    if not self.cache_store:
        text_embedding = self._get_text_embedding(text)
    elif self.cache_store is not None:
        cache_key = self._get_cache_key(text)
        cached_emb = self.cache_store.get(key=cache_key, collection="embeddings")
        if cached_emb is not None:
            cached_key = next(iter(cached_emb.keys()))
            text_embedding = cached_emb[cached_key]
        else:
            text_embedding = self._get_text_embedding(text)
            cache_key = self._get_cache_key(text)
            self.cache_store.put(
                key=cache_key,
                val={str(uuid.uuid4()): text_embedding},
                collection="embeddings",
            )

    return text_embedding

get_text_embedding_batch(texts, show_progress=False, **kwargs) #

Generate embeddings for multiple texts with automatic batching.

Processes a list of texts in batches according to self.batch_size. Supports optional progress tracking and automatic caching if cache_store is configured.

Parameters:

Name Type Description Default
texts list[str]

List of document texts to embed.

required
show_progress bool

Whether to display a progress bar. Defaults to False.

False
**kwargs Any

Additional keyword arguments (reserved for future use).

{}

Returns:

Type Description
list[Embedding]

List of embedding vectors, one for each input text, in the same order.

See Also

aget_text_embedding_batch: Async version with parallel processing. get_text_embedding: For embedding a single text. _get_text_embeddings: Internal batch processing method. _get_text_embeddings_cached: Internal cached batch processing.

Source code in libs\core\src\serapeum\core\base\embeddings\base.py
def get_text_embedding_batch(
    self,
    texts: list[str],
    show_progress: bool = False,
    **kwargs: Any,
) -> list[Embedding]:
    """Generate embeddings for multiple texts with automatic batching.

    Processes a list of texts in batches according to self.batch_size. Supports
    optional progress tracking and automatic caching if cache_store is configured.

    Args:
        texts: List of document texts to embed.
        show_progress: Whether to display a progress bar. Defaults to False.
        **kwargs: Additional keyword arguments (reserved for future use).

    Returns:
        List of embedding vectors, one for each input text, in the same order.

    See Also:
        aget_text_embedding_batch: Async version with parallel processing.
        get_text_embedding: For embedding a single text.
        _get_text_embeddings: Internal batch processing method.
        _get_text_embeddings_cached: Internal cached batch processing.
    """
    cur_batch: list[str] = []
    result_embeddings: list[Embedding] = []

    queue_with_progress = enumerate(
        get_tqdm_iterable(texts, show_progress, "Generating embeddings")
    )

    for idx, text in queue_with_progress:
        cur_batch.append(text)
        if idx == len(texts) - 1 or len(cur_batch) == self.batch_size:
            # flush
            if not self.cache_store:
                embeddings = self._get_text_embeddings(cur_batch)
            elif self.cache_store is not None:
                embeddings = self._get_text_embeddings_cached(cur_batch)

            result_embeddings.extend(embeddings)

            cur_batch = []

    return result_embeddings
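
The flush rule in the loop above (emit a batch when it is full or when the last text has been reached) can be checked on its own. `split_into_batches` is a hypothetical helper written for this sketch; it reuses the same condition but collects the batches instead of embedding them.

```python
def split_into_batches(texts: list, batch_size: int) -> list:
    """Group texts using the same flush condition as the listing."""
    batches = []
    cur_batch = []
    for idx, text in enumerate(texts):
        cur_batch.append(text)
        # Flush when the batch is full or this is the final text.
        if idx == len(texts) - 1 or len(cur_batch) == batch_size:
            batches.append(cur_batch)
            cur_batch = []
    return batches


batches = split_into_batches(["a", "b", "c", "d", "e"], batch_size=2)
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

The final, shorter batch is flushed by the `idx == len(texts) - 1` clause, so no text is left behind.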

similarity(embedding1, embedding2, mode=SimilarityMode.DEFAULT) staticmethod #

Calculate similarity between two embedding vectors.

Static method wrapper for the module-level similarity() function. Provides a convenient way to compute similarity directly from the class.

Parameters:

Name Type Description Default
embedding1 Embedding

First embedding vector (list of floats).

required
embedding2 Embedding

Second embedding vector (list of floats).

required
mode SimilarityMode

Similarity computation mode. Defaults to cosine similarity.

DEFAULT

Returns:

Type Description
float

Similarity score as a float. Interpretation depends on the mode.

Examples:

  • Computing cosine similarity

    >>> from serapeum.core.embeddings import BaseEmbedding  # type: ignore
    >>> emb1 = [1.0, 0.0]
    >>> emb2 = [1.0, 0.0]
    >>> float(BaseEmbedding.similarity(emb1, emb2))
    1.0
    

  • Using different similarity modes

    >>> emb1 = [3.0, 4.0]
    >>> emb2 = [3.0, 4.0]
    >>> float(BaseEmbedding.similarity(emb1, emb2, mode=SimilarityMode.DOT_PRODUCT))
    25.0
    

See Also

similarity: Module-level function that performs the actual calculation. SimilarityMode: Enum defining available similarity modes.

Source code in libs\core\src\serapeum\core\base\embeddings\base.py
@staticmethod
def similarity(
    embedding1: Embedding,
    embedding2: Embedding,
    mode: SimilarityMode = SimilarityMode.DEFAULT,
) -> float:
    """Calculate similarity between two embedding vectors.

    Static method wrapper for the module-level similarity() function. Provides
    a convenient way to compute similarity directly from the class.

    Args:
        embedding1: First embedding vector (list of floats).
        embedding2: Second embedding vector (list of floats).
        mode: Similarity computation mode. Defaults to cosine similarity.

    Returns:
        Similarity score as a float. Interpretation depends on the mode.

    Examples:
        - Computing cosine similarity
            ```python
            >>> from serapeum.core.embeddings import BaseEmbedding  # type: ignore
            >>> emb1 = [1.0, 0.0]
            >>> emb2 = [1.0, 0.0]
            >>> float(BaseEmbedding.similarity(emb1, emb2))
            1.0

            ```

        - Using different similarity modes
            ```python
            >>> emb1 = [3.0, 4.0]
            >>> emb2 = [3.0, 4.0]
            >>> float(BaseEmbedding.similarity(emb1, emb2, mode=SimilarityMode.DOT_PRODUCT))
            25.0

            ```

    See Also:
        similarity: Module-level function that performs the actual calculation.
        SimilarityMode: Enum defining available similarity modes.
    """
    return similarity(embedding1=embedding1, embedding2=embedding2, mode=mode)
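
The module-level similarity() function is not shown in this section. As a rough guide to what the two modes used in the examples compute, here are the textbook definitions of dot product and cosine similarity in plain Python. `dot_product` and `cosine` are names chosen for this sketch and may differ from the library's internals.

```python
import math


def dot_product(a: list, b: list) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list, b: list) -> float:
    """Dot product divided by the product of the two vector norms."""
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    return dot_product(a, b) / norm


print(dot_product([3.0, 4.0], [3.0, 4.0]))  # 25.0
print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Cosine similarity ignores vector length, while the dot product grows with it, which is why the docstring notes that interpretation depends on the mode.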

BaseNode #

Bases: SerializableModel, ABC

Abstract base class for document nodes with metadata and relationship management.

BaseNode provides the foundational functionality for representing chunks of documents with rich metadata, embeddings, and hierarchical relationships. It supports selective metadata inclusion for different contexts (LLM vs embeddings), automatic change detection via hashing, and efficient relationship caching.

Key features:

  • Automatic UUID generation for node identification
  • Metadata management with selective inclusion/exclusion for LLM and embedding contexts
  • Relationship tracking (source, parent, children, previous, next)
  • Embedding storage and retrieval
  • Cached LinkedNodes computation with automatic invalidation
  • Customizable metadata formatting and serialization

Attributes:

Name Type Description
id str

Unique identifier for the node (auto-generated UUID if not provided).

embedding list[float] | None

Optional vector embedding for the node's content.

metadata dict[str, Any]

Flat dictionary of metadata fields used for context and filtering.

excluded_embed_metadata_keys list[str]

Metadata keys excluded from embedding context.

excluded_llm_metadata_keys list[str]

Metadata keys excluded from LLM context.

links dict[Annotated[NodeType, EnumNameSerializer], NodeInfoType]

Dictionary mapping NodeType to NodeInfo for relationships.

metadata_template str

Template string for formatting metadata (default: "{key}: {value}").

metadata_separator str

Separator between metadata fields (default: newline).

Note

This is an abstract base class. Subclasses must implement:

  • get_type(): Return the node's content type identifier
  • get_content(): Return the node's content with optional metadata
  • set_content(): Update the node's content
  • hash: Property returning the content hash for change detection

Examples:

  • Creating a concrete node subclass
    >>> from serapeum.core.base.embeddings.types import BaseNode, MetadataMode, NodeType, NodeInfo
    >>> import hashlib
    >>> from pydantic import Field
    >>>
    >>> class TextNode(BaseNode):
    ...     text: str = Field(default="", description="Text content of the node")
    ...
    ...     @classmethod
    ...     def get_type(cls) -> str:
    ...         return "text"
    ...
    ...     def get_content(self, metadata_mode: MetadataMode = MetadataMode.ALL) -> str:
    ...         metadata_str = self.get_metadata_str(mode=metadata_mode)
    ...         return f"{metadata_str}\n{self.text}" if metadata_str else self.text
    ...
    ...     def set_content(self, value: str) -> None:
    ...         self.text = value
    ...
    ...     @property
    ...     def hash(self) -> str:
    ...         return hashlib.sha256(self.text.encode()).hexdigest()
    
    • Create a node with metadata
      >>> node = TextNode(
      ...     text="Hello world",
      ...     metadata={"page": 1, "author": "Alice"}
      ... )
      >>> node.get_type()
      'text'
      >>> node.get_content(metadata_mode=MetadataMode.NONE)
      'Hello world'
      
  • Using metadata exclusion for different contexts
    >>> node = TextNode(
    ...     text="Sensitive content",
    ...     metadata={"public": "yes", "internal_id": "secret123"},
    ...     excluded_llm_metadata_keys=["internal_id"]
    ... )
    
    • For LLM context (excludes internal_id)
      >>> content_for_llm = node.get_content(metadata_mode=MetadataMode.LLM)
      >>> "internal_id" in content_for_llm
      False
      >>> "public" in content_for_llm
      True
      
  • Setting up node relationships
    >>> parent = NodeInfo(id="parent-doc", type="document")
    >>> child = NodeInfo(id="child-chunk", type="text")
    >>>
    >>> node = TextNode(
    ...     text="Child content",
    ...     links={NodeType.PARENT: parent, NodeType.SOURCE: parent}
    ... )
    >>> node.linked_nodes.parent.id
    'parent-doc'
    >>> node.source_id
    'parent-doc'
    
  • Working with embeddings
    >>> node = TextNode(text="Sample text")
    >>> node.embedding = [0.1, 0.2, 0.3, 0.4, 0.5]
    >>> embedding_vec = node.get_embedding()
    >>> len(embedding_vec)
    5
    >>> embedding_vec[0]
    0.1
    
See Also

NodeInfo: Lightweight reference to a node. LinkedNodes: Container for node relationships. MetadataMode: Controls metadata inclusion in different contexts. SerializableModel: Base class providing serialization capabilities.
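
The change detection mentioned above can be sketched without the library. The example below assumes the hash is a SHA-256 digest of the text, as in the `TextNode` examples; `content_hash` is a hypothetical helper, and real subclasses may include other fields in their hash.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode()).hexdigest()


stored = content_hash("Hello world")

# Identical content yields the same digest, so nothing needs re-processing.
print(content_hash("Hello world") == stored)  # True

# Any edit changes the digest, which signals that the node is stale.
print(content_hash("Hello world!") == stored)  # False
```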

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
class BaseNode(SerializableModel, ABC):
    r"""Abstract base class for document nodes with metadata and relationship management.

    BaseNode provides the foundational functionality for representing chunks of
    documents with rich metadata, embeddings, and hierarchical relationships. It
    supports selective metadata inclusion for different contexts (LLM vs embeddings),
    automatic change detection via hashing, and efficient relationship caching.

    Key features:
    - Automatic UUID generation for node identification
    - Metadata management with selective inclusion/exclusion for LLM and embedding contexts
    - Relationship tracking (source, parent, children, previous, next)
    - Embedding storage and retrieval
    - Cached LinkedNodes computation with automatic invalidation
    - Customizable metadata formatting and serialization

    Attributes:
        id: Unique identifier for the node (auto-generated UUID if not provided).
        embedding: Optional vector embedding for the node's content.
        metadata: Flat dictionary of metadata fields used for context and filtering.
        excluded_embed_metadata_keys: Metadata keys excluded from embedding context.
        excluded_llm_metadata_keys: Metadata keys excluded from LLM context.
        links: Dictionary mapping NodeType to NodeInfo for relationships.
        metadata_template: Template string for formatting metadata (default: "{key}: {value}").
        metadata_separator: Separator between metadata fields (default: newline).

    Note:
        This is an abstract base class. Subclasses must implement:
        - get_type(): Return the node's content type identifier
        - get_content(): Return the node's content with optional metadata
        - set_content(): Update the node's content
        - hash: Property returning the content hash for change detection

    Examples:
        - Creating a concrete node subclass
            ```python
            >>> from serapeum.core.base.embeddings.types import BaseNode, MetadataMode, NodeType, NodeInfo
            >>> import hashlib
            >>> from pydantic import Field
            >>>
            >>> class TextNode(BaseNode):
            ...     text: str = Field(default="", description="Text content of the node")
            ...
            ...     @classmethod
            ...     def get_type(cls) -> str:
            ...         return "text"
            ...
            ...     def get_content(self, metadata_mode: MetadataMode = MetadataMode.ALL) -> str:
            ...         metadata_str = self.get_metadata_str(mode=metadata_mode)
            ...         return f"{metadata_str}\n{self.text}" if metadata_str else self.text
            ...
            ...     def set_content(self, value: str) -> None:
            ...         self.text = value
            ...
            ...     @property
            ...     def hash(self) -> str:
            ...         return hashlib.sha256(self.text.encode()).hexdigest()

            ```
            - Create a node with metadata
            ```python
            >>> node = TextNode(
            ...     text="Hello world",
            ...     metadata={"page": 1, "author": "Alice"}
            ... )
            >>> node.get_type()
            'text'
            >>> node.get_content(metadata_mode=MetadataMode.NONE)
            'Hello world'

            ```
        - Using metadata exclusion for different contexts
            ```python
            >>> node = TextNode(
            ...     text="Sensitive content",
            ...     metadata={"public": "yes", "internal_id": "secret123"},
            ...     excluded_llm_metadata_keys=["internal_id"]
            ... )

            ```
            - For LLM context (excludes internal_id)
            ```python
            >>> content_for_llm = node.get_content(metadata_mode=MetadataMode.LLM)
            >>> "internal_id" in content_for_llm
            False
            >>> "public" in content_for_llm
            True

            ```
        - Setting up node relationships
            ```python
            >>> parent = NodeInfo(id="parent-doc", type="document")
            >>> child = NodeInfo(id="child-chunk", type="text")
            >>>
            >>> node = TextNode(
            ...     text="Child content",
            ...     links={NodeType.PARENT: parent, NodeType.SOURCE: parent}
            ... )
            >>> node.linked_nodes.parent.id
            'parent-doc'
            >>> node.source_id
            'parent-doc'

            ```
        - Working with embeddings
            ```python
            >>> node = TextNode(text="Sample text")
            >>> node.embedding = [0.1, 0.2, 0.3, 0.4, 0.5]
            >>> embedding_vec = node.get_embedding()
            >>> len(embedding_vec)
            5
            >>> embedding_vec[0]
            0.1

            ```

    See Also:
        NodeInfo: Lightweight reference to a node.
        LinkedNodes: Container for node relationships.
        MetadataMode: Controls metadata inclusion in different contexts.
        SerializableModel: Base class providing serialization capabilities.
    """

    # hash is computed on a local field, during the validation process
    model_config = ConfigDict(populate_by_name=True, validate_assignment=True)

    id: str = Field(
        default_factory=lambda: str(uuid.uuid4()), description="Unique ID of the node."
    )
    embedding: list[float] | None = Field(
        default=None, description="Embedding of the node."
    )

    metadata: dict[str, Any] = Field(
        default_factory=dict,
        description="A flat dictionary of metadata fields",
    )
    excluded_embed_metadata_keys: list[str] = Field(
        default_factory=list,
        description="Metadata keys that are excluded from text for the embed model.",
    )
    excluded_llm_metadata_keys: list[str] = Field(
        default_factory=list,
        description="Metadata keys that are excluded from text for the LLM.",
    )
    links: dict[
        Annotated[NodeType, EnumNameSerializer],
        NodeInfoType,
    ] = Field(
        default_factory=dict,
        description="A mapping of links to other nodes.",
    )
    metadata_template: str = Field(
        default=DEFAULT_METADATA_TMPL,
        description=(
            "Template for how metadata is formatted, with {key} and "
            "{value} placeholders."
        ),
    )
    metadata_separator: str = Field(
        default="\n",
        description="Separator between metadata fields when converting to string.",
    )

    linked_nodes_cache: LinkedNodes | None = Field(
        default=None,
        exclude=True,
        repr=False,
        description="Cached LinkedNodes object, invalidated when links change.",
    )

    # Track the links dict id to detect changes
    links_dict_id: int | None = Field(
        default=None,
        exclude=True,
        repr=False,
        description="ID of the links dict to detect when it's reassigned.",
    )

    @model_validator(mode="after")
    def _invalidate_linked_nodes_cache_on_links_change(self) -> "BaseNode":
        """Invalidate the linked_nodes cache when links dict is reassigned.

        This validator tracks the id of the links dict. When it changes
        (i.e., links is reassigned), the cache is cleared.

        Uses Pydantic v2's @model_validator with object.__setattr__ to avoid recursion.
        """
        current_links_id = id(self.links)

        # Check if links dict was reassigned (different id)
        if self.links_dict_id is None or self.links_dict_id != current_links_id:
            # Links changed, clear cache and update tracked id
            object.__setattr__(self, "linked_nodes_cache", None)
            object.__setattr__(self, "links_dict_id", current_links_id)

        return self

    @classmethod
    @abstractmethod
    def get_type(cls) -> str:
        """Get Object type."""

    @abstractmethod
    def get_content(self, metadata_mode: MetadataMode = MetadataMode.ALL) -> str:
        """Get object content."""

    def get_metadata_str(self, mode: MetadataMode = MetadataMode.ALL) -> str:
        """Return metadata formatted as a string, filtered by the given mode."""
        if mode == MetadataMode.NONE:
            return ""

        excluded = set()
        if mode == MetadataMode.LLM:
            excluded = set(self.excluded_llm_metadata_keys)
        elif mode == MetadataMode.EMBED:
            excluded = set(self.excluded_embed_metadata_keys)

        filtered = (
            self.metadata.items()
            if not excluded
            else (
                (key, value)
                for key, value in self.metadata.items()
                if key not in excluded
            )
        )
        return self.metadata_separator.join(
            self.metadata_template.format(key=key, value=str(value))
            for key, value in filtered
        )

    @abstractmethod
    def set_content(self, value: Any) -> None:
        """Set the content of the node."""

    @property
    @abstractmethod
    def hash(self) -> str:
        """Get hash of node."""

    @property
    def source_id(self) -> str | None:
        """Return the source node ID reported by linked_nodes, if any."""
        return self.linked_nodes.source_id

    def _clear_linked_nodes_cache(self) -> None:
        """Manually clear the linked_nodes cache.

        Call this method if you mutate the links dict in-place.
        This is necessary because Pydantic re-runs validation only on field
        assignment, not on in-place mutations.

        Examples:
            >>> from serapeum.core.base.embeddings.types import BaseNode, NodeInfo, NodeType, MetadataMode
            >>> import hashlib
            >>> from pydantic import Field
            >>> class TextNode(BaseNode):
            ...     text: str = Field(default="")
            ...     @classmethod
            ...     def get_type(cls) -> str:
            ...         return "text"
            ...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
            ...         return self.text
            ...     def set_content(self, value: str) -> None:
            ...         self.text = value
            ...     @property
            ...     def hash(self) -> str:
            ...         return hashlib.sha256(self.text.encode()).hexdigest()
            >>> node = TextNode(text="Sample", links={})
            >>> new_source = NodeInfo(id="updated-source", type="document")
            >>> node.links[NodeType.SOURCE] = new_source
            >>> node._clear_linked_nodes_cache()
            >>> node.linked_nodes.source.id
            'updated-source'
        """
        self.linked_nodes_cache = None

    @property
    def linked_nodes(self) -> LinkedNodes:
        """Get linked nodes from the links dictionary.

        This property validates and converts the links dictionary into a
        LinkedNodes object. The result is cached and automatically invalidated
        when the links field is reassigned, because assignment re-runs validation.

        Returns:
            LinkedNodes: A validated and cached LinkedNodes object.

        Note:
            - Cache is automatically cleared when `links` is reassigned
            - For in-place mutations (e.g., node.links[key] = value), you must
              either reassign the entire dict OR call _clear_linked_nodes_cache()
            - Uses a Pydantic @model_validator (with validate_assignment=True) to manage cache invalidation

        Examples:
            >>> from serapeum.core.base.embeddings.types import BaseNode, NodeInfo, NodeType, MetadataMode
            >>> import hashlib
            >>> from pydantic import Field
            >>> class TextNode(BaseNode):
            ...     text: str = Field(default="")
            ...     @classmethod
            ...     def get_type(cls) -> str:
            ...         return "text"
            ...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
            ...         return self.text
            ...     def set_content(self, value: str) -> None:
            ...         self.text = value
            ...     @property
            ...     def hash(self) -> str:
            ...         return hashlib.sha256(self.text.encode()).hexdigest()
            >>> node = TextNode(text="Sample")
            >>> source_ref = NodeInfo(id="doc-123", type="document")
            >>> node.links = {NodeType.SOURCE: source_ref}
            >>> node.linked_nodes.source.id
            'doc-123'

            >>> node = TextNode(text="Sample", links={})
            >>> prev_ref = NodeInfo(id="prev-chunk", type="text")
            >>> node.links[NodeType.PREVIOUS] = prev_ref
            >>> node._clear_linked_nodes_cache()
            >>> node.linked_nodes.previous.id
            'prev-chunk'

            >>> node = TextNode(text="Sample")
            >>> parent = NodeInfo(id="parent-1", type="document")
            >>> child1 = NodeInfo(id="child-1", type="text")
            >>> child2 = NodeInfo(id="child-2", type="text")
            >>> node.links = {NodeType.PARENT: parent, NodeType.CHILD: [child1, child2]}
            >>> node.linked_nodes.parent.id
            'parent-1'
            >>> [c.id for c in node.linked_nodes.children]
            ['child-1', 'child-2']
        """
        if self.linked_nodes_cache is None:
            # Compute and cache the LinkedNodes
            self.linked_nodes_cache = LinkedNodes.create(self.links)
        return self.linked_nodes_cache

    def __str__(self) -> str:
        """Return the node ID followed by the truncated, wrapped content."""
        source_text_truncated = truncate_text(
            self.get_content().strip(), TRUNCATE_LENGTH
        )
        source_text_wrapped = textwrap.fill(
            f"Text: {source_text_truncated}\n", width=WRAP_WIDTH
        )
        return f"Node ID: {self.id}\n{source_text_wrapped}"

    def get_embedding(self) -> list[float]:
        """Get embedding.

        Raises:
            ValueError: if embedding is None.
        """
        if self.embedding is None:
            raise ValueError("embedding not set.")
        return self.embedding

    def get_node_info(self) -> NodeInfo:
        """Get node info."""
        return NodeInfo(
            id=self.id,
            type=self.get_type(),
            metadata=self.metadata,
            hash=self.hash,
        )

linked_nodes property #

Get linked nodes from the links dictionary.

This property validates and converts the links dictionary into a LinkedNodes object. The result is cached and automatically invalidated when the links field is reassigned, because assignment re-runs validation.

Returns:

Name Type Description
LinkedNodes LinkedNodes

A validated and cached LinkedNodes object.

Note
  • Cache is automatically cleared when links is reassigned
  • For in-place mutations (e.g., node.links[key] = value), you must either reassign the entire dict OR call _clear_linked_nodes_cache()
  • Uses a Pydantic @model_validator (with validate_assignment=True) to manage cache invalidation

Examples:

>>> from serapeum.core.base.embeddings.types import BaseNode, NodeInfo, NodeType, MetadataMode
>>> import hashlib
>>> from pydantic import Field
>>> class TextNode(BaseNode):
...     text: str = Field(default="")
...     @classmethod
...     def get_type(cls) -> str:
...         return "text"
...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
...         return self.text
...     def set_content(self, value: str) -> None:
...         self.text = value
...     @property
...     def hash(self) -> str:
...         return hashlib.sha256(self.text.encode()).hexdigest()
>>> node = TextNode(text="Sample")
>>> source_ref = NodeInfo(id="doc-123", type="document")
>>> node.links = {NodeType.SOURCE: source_ref}
>>> node.linked_nodes.source.id
'doc-123'
>>> node = TextNode(text="Sample", links={})
>>> prev_ref = NodeInfo(id="prev-chunk", type="text")
>>> node.links[NodeType.PREVIOUS] = prev_ref
>>> node._clear_linked_nodes_cache()
>>> node.linked_nodes.previous.id
'prev-chunk'
>>> node = TextNode(text="Sample")
>>> parent = NodeInfo(id="parent-1", type="document")
>>> child1 = NodeInfo(id="child-1", type="text")
>>> child2 = NodeInfo(id="child-2", type="text")
>>> node.links = {NodeType.PARENT: parent, NodeType.CHILD: [child1, child2]}
>>> node.linked_nodes.parent.id
'parent-1'
>>> [c.id for c in node.linked_nodes.children]
['child-1', 'child-2']
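
The difference between reassignment and in-place mutation can be reproduced with a small library-free sketch. `LinkHolder` below is a hypothetical stand-in, not the serapeum class: it tracks `id()` of the dict, as the validator in the source does, but checks it lazily in the property instead of in a Pydantic validator.

```python
from typing import Optional


class LinkHolder:
    """Hypothetical stand-in that caches a view derived from a links dict."""

    def __init__(self) -> None:
        self.links: dict[str, str] = {}
        self._cache: Optional[tuple[str, ...]] = None
        self._links_id: Optional[int] = None

    @property
    def linked(self) -> tuple[str, ...]:
        # Rebuild only when the cache is empty or the dict object was replaced.
        if self._cache is None or self._links_id != id(self.links):
            self._cache = tuple(sorted(self.links))
            self._links_id = id(self.links)
        return self._cache

    def clear_cache(self) -> None:
        """Drop the cached view so the next access recomputes it."""
        self._cache = None


holder = LinkHolder()
holder.links = {"source": "doc-1"}
print(holder.linked)  # ('source',)

holder.links["parent"] = "doc-0"  # in-place mutation keeps the same dict object
print(holder.linked)  # ('source',) because the cache is stale

holder.clear_cache()
print(holder.linked)  # ('parent', 'source')
```

An identity check is cheap but cannot see changes inside the dict, which is why the manual clearing step exists.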

__str__() #

Return the node ID followed by the truncated, wrapped content.

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
def __str__(self) -> str:
    """Return the node ID followed by the truncated, wrapped content."""
    source_text_truncated = truncate_text(
        self.get_content().strip(), TRUNCATE_LENGTH
    )
    source_text_wrapped = textwrap.fill(
        f"Text: {source_text_truncated}\n", width=WRAP_WIDTH
    )
    return f"Node ID: {self.id}\n{source_text_wrapped}"
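
The truncate-then-wrap formatting can be tried in isolation with the standard library. This is a sketch under assumptions: the `truncate` helper and the two constant values are hypothetical stand-ins, since the behavior of `truncate_text` and the actual values of `TRUNCATE_LENGTH` and `WRAP_WIDTH` are not shown on this page.

```python
import textwrap

# Assumed values for illustration; the library's constants may differ.
TRUNCATE_LENGTH = 40
WRAP_WIDTH = 30


def truncate(text: str, max_length: int) -> str:
    """Hypothetical stand-in for truncate_text: cut and append an ellipsis."""
    if len(text) <= max_length:
        return text
    return text[: max_length - 3] + "..."


def describe(node_id: str, content: str) -> str:
    """Build the same 'Node ID / Text' layout as the method above."""
    body = truncate(content.strip(), TRUNCATE_LENGTH)
    wrapped = textwrap.fill(f"Text: {body}\n", width=WRAP_WIDTH)
    return f"Node ID: {node_id}\n{wrapped}"


print(describe("node-1", "  Embeddings map text to dense vectors for similarity search.  "))
```

Note that `textwrap.fill` collapses whitespace by default, so the trailing newline inside the f-string does not survive in the wrapped output.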

get_content(metadata_mode=MetadataMode.ALL) abstractmethod #

Get object content.

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
@abstractmethod
def get_content(self, metadata_mode: MetadataMode = MetadataMode.ALL) -> str:
    """Get object content."""

get_embedding() #

Get embedding.

Raises:

Type Description
ValueError

if embedding is None.

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
def get_embedding(self) -> list[float]:
    """Get embedding.

    Raises:
        ValueError: if embedding is None.
    """
    if self.embedding is None:
        raise ValueError("embedding not set.")
    return self.embedding
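
Because `get_embedding()` raises instead of returning `None`, callers that may see nodes without vectors need to handle `ValueError`. The sketch below uses a hypothetical `Item` dataclass with the same guard so that it runs without the library; the zero-vector fallback is only one possible policy, not a recommendation from the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """Hypothetical stand-in with the same guard as get_embedding()."""

    embedding: Optional[list[float]] = None

    def get_embedding(self) -> list[float]:
        if self.embedding is None:
            raise ValueError("embedding not set.")
        return self.embedding


def embedding_or_default(item: Item, dim: int) -> list[float]:
    """Return the stored vector, or a zero vector of size dim when none is set."""
    try:
        return item.get_embedding()
    except ValueError:
        return [0.0] * dim


print(embedding_or_default(Item(), dim=3))  # [0.0, 0.0, 0.0]
print(embedding_or_default(Item([0.1, 0.2]), dim=3))  # [0.1, 0.2]
```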

get_metadata_str(mode=MetadataMode.ALL) #

Return metadata formatted as a string, filtered by the given mode.

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
def get_metadata_str(self, mode: MetadataMode = MetadataMode.ALL) -> str:
    """Return metadata formatted as a string, filtered by the given mode."""
    if mode == MetadataMode.NONE:
        return ""

    excluded = set()
    if mode == MetadataMode.LLM:
        excluded = set(self.excluded_llm_metadata_keys)
    elif mode == MetadataMode.EMBED:
        excluded = set(self.excluded_embed_metadata_keys)

    filtered = (
        self.metadata.items()
        if not excluded
        else (
            (key, value)
            for key, value in self.metadata.items()
            if key not in excluded
        )
    )
    return self.metadata_separator.join(
        self.metadata_template.format(key=key, value=str(value))
        for key, value in filtered
    )
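
The filtering and formatting logic above can be exercised on a plain dictionary. In this sketch `format_metadata` is a hypothetical free function, and its defaults follow the documented defaults for `metadata_template` ("{key}: {value}") and `metadata_separator` (newline).

```python
from typing import Any, Iterable


def format_metadata(
    metadata: dict[str, Any],
    excluded: Iterable[str] = (),
    template: str = "{key}: {value}",
    separator: str = "\n",
) -> str:
    """Render metadata entries, skipping any key listed in excluded."""
    skip = set(excluded)
    return separator.join(
        template.format(key=key, value=str(value))
        for key, value in metadata.items()
        if key not in skip
    )


meta = {"public": "yes", "internal_id": "secret123", "page": 1}
print(format_metadata(meta, excluded=["internal_id"]))
# public: yes
# page: 1
```

Passing a different exclusion list per context is what lets the same node present one view to the LLM and another to the embedding model.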

get_node_info() #

Get node info.

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
def get_node_info(self) -> NodeInfo:
    """Get node info."""
    return NodeInfo(
        id=self.id,
        type=self.get_type(),
        metadata=self.metadata,
        hash=self.hash,
    )

get_type() abstractmethod classmethod #

Get Object type.

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
@classmethod
@abstractmethod
def get_type(cls) -> str:
    """Get Object type."""

set_content(value) abstractmethod #

Set the content of the node.

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
@abstractmethod
def set_content(self, value: Any) -> None:
    """Set the content of the node."""

CallMixin #

Bases: ABC

Base class for node transformation components.

CallMixin defines the interface for components that transform sequences of nodes, such as embedders, parsers, or metadata enrichers. It provides both synchronous and asynchronous calling interfaces.

The mixin uses callable syntax (obj(nodes)) for synchronous transforms and obj.acall(nodes) for asynchronous transforms, enabling composable pipelines.
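
One way to picture the composable pipelines mentioned here is a loop that feeds each callable's output into the next. The sketch below is an assumption-laden illustration: `run_pipeline`, `Strip`, and `Upper` are hypothetical names, and plain strings stand in for nodes so the example runs without the library.

```python
from typing import Any, Callable, Sequence

Transform = Callable[..., Sequence[str]]


def run_pipeline(
    items: Sequence[str], transforms: Sequence[Transform], **kwargs: Any
) -> Sequence[str]:
    """Apply each callable in order, feeding its output to the next one."""
    for transform in transforms:
        items = transform(items, **kwargs)
    return items


class Strip:
    def __call__(self, items: Sequence[str], **kwargs: Any) -> Sequence[str]:
        return [item.strip() for item in items]


class Upper:
    def __call__(self, items: Sequence[str], **kwargs: Any) -> Sequence[str]:
        return [item.upper() for item in items]


print(run_pipeline(["  hello ", "world  "], [Strip(), Upper()]))  # ['HELLO', 'WORLD']
```

Because every step shares the same call signature, steps can be reordered or swapped without changing the loop.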

Attributes:

Name Type Description
model_config

Pydantic configuration allowing arbitrary types in subclasses.

Examples:

>>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
>>> from typing import Sequence, Any
>>> import hashlib
>>> from pydantic import Field
>>> class TextNode(BaseNode):
...     text: str = Field(default="")
...     @classmethod
...     def get_type(cls) -> str:
...         return "text"
...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
...         return self.text
...     def set_content(self, value: str) -> None:
...         self.text = value
...     @property
...     def hash(self) -> str:
...         return hashlib.sha256(self.text.encode()).hexdigest()
>>> class UppercaseTransform(CallMixin):
...     def __call__(self, nodes: Sequence[BaseNode], **kwargs: Any) -> Sequence[BaseNode]:
...         result = []
...         for node in nodes:
...             node.set_content(node.get_content().upper())
...             result.append(node)
...         return result
>>> transformer = UppercaseTransform()
>>> nodes = [TextNode(text="hello"), TextNode(text="world")]
>>> transformed = transformer(nodes)
>>> transformed[0].get_content()
'HELLO'
>>> transformed[1].get_content()
'WORLD'

See Also

BaseEmbedding: Uses CallMixin to enable embedding nodes. BaseNode: The node type that this mixin transforms.

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
class CallMixin(ABC):
    """Base class for node transformation components.

    CallMixin defines the interface for components that transform sequences of nodes,
    such as embedders, parsers, or metadata enrichers. It provides both synchronous
    and asynchronous calling interfaces.

    The mixin uses callable syntax (`obj(nodes)`) for synchronous transforms and
    `obj.acall(nodes)` for asynchronous transforms, enabling composable pipelines.

    Attributes:
        model_config: Pydantic configuration allowing arbitrary types in subclasses.

    Examples:
        >>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
        >>> from typing import Sequence, Any
        >>> import hashlib
        >>> from pydantic import Field
        >>> class TextNode(BaseNode):
        ...     text: str = Field(default="")
        ...     @classmethod
        ...     def get_type(cls) -> str:
        ...         return "text"
        ...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
        ...         return self.text
        ...     def set_content(self, value: str) -> None:
        ...         self.text = value
        ...     @property
        ...     def hash(self) -> str:
        ...         return hashlib.sha256(self.text.encode()).hexdigest()
        >>> class UppercaseTransform(CallMixin):
        ...     def __call__(self, nodes: Sequence[BaseNode], **kwargs: Any) -> Sequence[BaseNode]:
        ...         result = []
        ...         for node in nodes:
        ...             node.set_content(node.get_content().upper())
        ...             result.append(node)
        ...         return result
        >>> transformer = UppercaseTransform()
        >>> nodes = [TextNode(text="hello"), TextNode(text="world")]
        >>> transformed = transformer(nodes)
        >>> transformed[0].get_content()
        'HELLO'
        >>> transformed[1].get_content()
        'WORLD'

    See Also:
        BaseEmbedding: Uses CallMixin to enable embedding nodes.
        BaseNode: The node type that this mixin transforms.
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)

    @abstractmethod
    def __call__(self, nodes: Sequence[BaseNode], **kwargs: Any) -> Sequence[BaseNode]:
        """Transform a sequence of nodes synchronously.

        Subclasses must implement this method to define their transformation logic.
        This method is called when the object is invoked directly: `obj(nodes)`.

        Args:
            nodes: Sequence of BaseNode instances to transform.
            **kwargs: Additional keyword arguments specific to the transformation.

        Returns:
            Transformed sequence of BaseNode instances.

        Examples:
            >>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
            >>> import hashlib
            >>> from pydantic import Field
            >>> class TextNode(BaseNode):
            ...     text: str = Field(default="")
            ...     @classmethod
            ...     def get_type(cls) -> str:
            ...         return "text"
            ...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
            ...         return self.text
            ...     def set_content(self, value: str) -> None:
            ...         self.text = value
            ...     @property
            ...     def hash(self) -> str:
            ...         return hashlib.sha256(self.text.encode()).hexdigest()
            >>> class MetadataAdder(CallMixin):
            ...     def __call__(self, nodes, **kwargs):
            ...         result = []
            ...         for i, node in enumerate(nodes):
            ...             node.metadata["index"] = i
            ...             result.append(node)
            ...         return result
            >>> adder = MetadataAdder()
            >>> nodes = [TextNode(text="first"), TextNode(text="second")]
            >>> processed = adder(nodes)
            >>> processed[0].metadata["index"]
            0
            >>> processed[1].metadata["index"]
            1

        Note:
            Implementations should preserve node identity where possible and
            avoid mutating input nodes unless explicitly documented.
        """

    async def acall(
        self, nodes: Sequence[BaseNode], **kwargs: Any
    ) -> Sequence[BaseNode]:
        """Transform a sequence of nodes asynchronously.

        Default implementation delegates to synchronous `__call__`. Subclasses
        can override this for true async implementations (e.g., async API calls).

        Args:
            nodes: Sequence of BaseNode instances to transform.
            **kwargs: Additional keyword arguments specific to the transformation.

        Returns:
            Transformed sequence of BaseNode instances.

        Examples:
            >>> import asyncio
            >>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
            >>> import hashlib
            >>> from pydantic import Field
            >>> class TextNode(BaseNode):
            ...     text: str = Field(default="")
            ...     @classmethod
            ...     def get_type(cls) -> str:
            ...         return "text"
            ...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
            ...         return self.text
            ...     def set_content(self, value: str) -> None:
            ...         self.text = value
            ...     @property
            ...     def hash(self) -> str:
            ...         return hashlib.sha256(self.text.encode()).hexdigest()
            >>> class AsyncTransform(CallMixin):
            ...     def __call__(self, nodes, **kwargs):
            ...         return nodes
            ...     async def acall(self, nodes, **kwargs):
            ...         await asyncio.sleep(0)
            ...         for node in nodes:
            ...             node.metadata["async_processed"] = True
            ...         return nodes
            >>> transform = AsyncTransform()
            >>> nodes = [TextNode(text="test")]
            >>> result = asyncio.run(transform.acall(nodes))
            >>> result[0].metadata["async_processed"]
            True

            >>> class SyncOnlyTransform(CallMixin):
            ...     def __call__(self, nodes, **kwargs):
            ...         for node in nodes:
            ...             node.metadata["processed"] = True
            ...         return nodes
            >>> sync_transform = SyncOnlyTransform()
            >>> nodes = [TextNode(text="test")]
            >>> result = asyncio.run(sync_transform.acall(nodes))
            >>> result[0].metadata["processed"]
            True

        Note:
            If no true async implementation is needed, the default delegation
            to `__call__` is sufficient. Override only if the transformation
            benefits from async/await (e.g., I/O operations).
        """
        return self.__call__(nodes, **kwargs)
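
The default delegation runs the synchronous code directly inside the coroutine, so a slow `__call__` blocks the event loop while it runs. The sketch below mirrors that default with a hypothetical `Transform` base class over plain strings, and shows one possible override that offloads the work with `asyncio.to_thread` (Python 3.9+). The override is an illustration, not something the library is documented to do.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Sequence


class Transform(ABC):
    """Hypothetical stand-in mirroring the sync/async split of CallMixin."""

    @abstractmethod
    def __call__(self, items: Sequence[str], **kwargs: Any) -> Sequence[str]:
        """Synchronous transform that subclasses must provide."""

    async def acall(self, items: Sequence[str], **kwargs: Any) -> Sequence[str]:
        # Default: run the synchronous path inside the coroutine.
        return self(items, **kwargs)


class Reverse(Transform):
    def __call__(self, items: Sequence[str], **kwargs: Any) -> Sequence[str]:
        return [item[::-1] for item in items]


class ThreadedReverse(Reverse):
    async def acall(self, items: Sequence[str], **kwargs: Any) -> Sequence[str]:
        # Override: run the blocking call in a worker thread.
        return await asyncio.to_thread(self, items, **kwargs)


print(asyncio.run(Reverse().acall(["abc"])))  # ['cba']
print(asyncio.run(ThreadedReverse().acall(["abc"])))  # ['cba']
```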

__call__(nodes, **kwargs) abstractmethod #

Transform a sequence of nodes synchronously.

Subclasses must implement this method to define their transformation logic. This method is called when the object is invoked directly: obj(nodes).

Parameters:

Name Type Description Default
nodes Sequence[BaseNode]

Sequence of BaseNode instances to transform.

required
**kwargs Any

Additional keyword arguments specific to the transformation.

{}

Returns:

Type Description
Sequence[BaseNode]

Transformed sequence of BaseNode instances.

Examples:

>>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
>>> import hashlib
>>> from pydantic import Field
>>> class TextNode(BaseNode):
...     text: str = Field(default="")
...     @classmethod
...     def get_type(cls) -> str:
...         return "text"
...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
...         return self.text
...     def set_content(self, value: str) -> None:
...         self.text = value
...     @property
...     def hash(self) -> str:
...         return hashlib.sha256(self.text.encode()).hexdigest()
>>> class MetadataAdder(CallMixin):
...     def __call__(self, nodes, **kwargs):
...         result = []
...         for i, node in enumerate(nodes):
...             node.metadata["index"] = i
...             result.append(node)
...         return result
>>> adder = MetadataAdder()
>>> nodes = [TextNode(text="first"), TextNode(text="second")]
>>> processed = adder(nodes)
>>> processed[0].metadata["index"]
0
>>> processed[1].metadata["index"]
1

Note

Implementations should preserve node identity where possible and avoid mutating input nodes unless explicitly documented.
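
One reading of the note above is that a transform should write to copies rather than to the nodes it receives. A minimal sketch of that reading follows. `NodeSketch` (with its `node_id` field) and `IndexTagger` are hypothetical names, and the dataclass is only a stand-in for a real `BaseNode` subclass so the snippet runs without serapeum installed.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict, List, Sequence


@dataclass
class NodeSketch:
    """Hypothetical stand-in for a BaseNode subclass (id, text, metadata)."""

    node_id: str
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class IndexTagger:
    """Tags each node with its position, writing to copies instead of inputs."""

    def __call__(self, nodes: Sequence[NodeSketch], **kwargs: Any) -> List[NodeSketch]:
        result = []
        for i, node in enumerate(nodes):
            # Build a fresh metadata dict so the caller's node stays unchanged.
            new_metadata = {**node.metadata, "index": i}
            # replace() returns a new object that keeps node_id and text.
            result.append(replace(node, metadata=new_metadata))
        return result


originals = [NodeSketch("n-1", "first"), NodeSketch("n-2", "second")]
tagged = IndexTagger()(originals)
```

In this sketch the returned nodes carry the new `index` metadata and keep their `node_id`, while `originals` is left untouched. With real Pydantic-based nodes a copy method such as `model_copy` would presumably take the place of `dataclasses.replace`; that is an assumption, not something this page confirms.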

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
@abstractmethod
def __call__(self, nodes: Sequence[BaseNode], **kwargs: Any) -> Sequence[BaseNode]:
    """Transform a sequence of nodes synchronously.

    Subclasses must implement this method to define their transformation logic.
    This method is called when the object is invoked directly: `obj(nodes)`.

    Args:
        nodes: Sequence of BaseNode instances to transform.
        **kwargs: Additional keyword arguments specific to the transformation.

    Returns:
        Transformed sequence of BaseNode instances.

    Examples:
        >>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
        >>> import hashlib
        >>> from pydantic import Field
        >>> class TextNode(BaseNode):
        ...     text: str = Field(default="")
        ...     @classmethod
        ...     def get_type(cls) -> str:
        ...         return "text"
        ...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
        ...         return self.text
        ...     def set_content(self, value: str) -> None:
        ...         self.text = value
        ...     @property
        ...     def hash(self) -> str:
        ...         return hashlib.sha256(self.text.encode()).hexdigest()
        >>> class MetadataAdder(CallMixin):
        ...     def __call__(self, nodes, **kwargs):
        ...         result = []
        ...         for i, node in enumerate(nodes):
        ...             node.metadata["index"] = i
        ...             result.append(node)
        ...         return result
        >>> adder = MetadataAdder()
        >>> nodes = [TextNode(text="first"), TextNode(text="second")]
        >>> processed = adder(nodes)
        >>> processed[0].metadata["index"]
        0
        >>> processed[1].metadata["index"]
        1

    Note:
        Implementations should preserve node identity where possible and
        avoid mutating input nodes unless explicitly documented.
    """

acall(nodes, **kwargs) async #

Transform a sequence of nodes asynchronously.

Default implementation delegates to synchronous __call__. Subclasses can override this for true async implementations (e.g., async API calls).

Parameters:

Name Type Description Default
nodes Sequence[BaseNode]

Sequence of BaseNode instances to transform.

required
**kwargs Any

Additional keyword arguments specific to the transformation.

{}

Returns:

Type Description
Sequence[BaseNode]

Transformed sequence of BaseNode instances.

Examples:

>>> import asyncio
>>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
>>> import hashlib
>>> from pydantic import Field
>>> class TextNode(BaseNode):
...     text: str = Field(default="")
...     @classmethod
...     def get_type(cls) -> str:
...         return "text"
...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
...         return self.text
...     def set_content(self, value: str) -> None:
...         self.text = value
...     @property
...     def hash(self) -> str:
...         return hashlib.sha256(self.text.encode()).hexdigest()
>>> class AsyncTransform(CallMixin):
...     def __call__(self, nodes, **kwargs):
...         return nodes
...     async def acall(self, nodes, **kwargs):
...         await asyncio.sleep(0)
...         for node in nodes:
...             node.metadata["async_processed"] = True
...         return nodes
>>> transform = AsyncTransform()
>>> nodes = [TextNode(text="test")]
>>> result = asyncio.run(transform.acall(nodes))
>>> result[0].metadata["async_processed"]
True
>>> class SyncOnlyTransform(CallMixin):
...     def __call__(self, nodes, **kwargs):
...         for node in nodes:
...             node.metadata["processed"] = True
...         return nodes
>>> sync_transform = SyncOnlyTransform()
>>> nodes = [TextNode(text="test")]
>>> result = asyncio.run(sync_transform.acall(nodes))
>>> result[0].metadata["processed"]
True

Note

If no true async implementation is needed, the default delegation to __call__ is sufficient. Override only if the transformation benefits from async/await (e.g., I/O operations).
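
Because the default `acall` simply calls `__call__`, a blocking transform still blocks the event loop when it is awaited. The sketch below illustrates one possible override that offloads the synchronous path with `asyncio.to_thread`. `CallMixinSketch`, `SlowUpper`, `ThreadedSlowUpper`, and `run_pair` are hypothetical names. The stand-in mixin only mirrors the delegation visible in the `acall` source on this page, and plain strings stand in for nodes.

```python
import asyncio
import time
from abc import ABC, abstractmethod
from typing import Any, Sequence


class CallMixinSketch(ABC):
    """Stand-in that mirrors the delegation shown in the CallMixin source."""

    @abstractmethod
    def __call__(self, nodes: Sequence[Any], **kwargs: Any) -> Sequence[Any]:
        ...

    async def acall(self, nodes: Sequence[Any], **kwargs: Any) -> Sequence[Any]:
        # Default behavior: run the synchronous path inline on the event loop.
        return self.__call__(nodes, **kwargs)


class SlowUpper(CallMixinSketch):
    """Hypothetical transform whose sync path blocks (simulated with sleep)."""

    def __call__(self, nodes, **kwargs):
        time.sleep(0.05)  # stands in for blocking I/O
        return [n.upper() for n in nodes]


class ThreadedSlowUpper(SlowUpper):
    """Same transform, but acall offloads the blocking call to a thread."""

    async def acall(self, nodes, **kwargs):
        return await asyncio.to_thread(self.__call__, nodes, **kwargs)


async def run_pair(transform):
    """Await two acall invocations concurrently and time the pair."""
    start = time.perf_counter()
    results = await asyncio.gather(transform.acall(["a"]), transform.acall(["b"]))
    return results, time.perf_counter() - start


inline_results, inline_elapsed = asyncio.run(run_pair(SlowUpper()))
threaded_results, threaded_elapsed = asyncio.run(run_pair(ThreadedSlowUpper()))
```

Under these assumptions the two inline calls run back to back, while the threaded variant lets them overlap, so `threaded_elapsed` should come out noticeably smaller than `inline_elapsed`. Whether a thread, a native async client, or the default delegation is the right choice depends on the transform.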

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
async def acall(
    self, nodes: Sequence[BaseNode], **kwargs: Any
) -> Sequence[BaseNode]:
    """Transform a sequence of nodes asynchronously.

    Default implementation delegates to synchronous `__call__`. Subclasses
    can override this for true async implementations (e.g., async API calls).

    Args:
        nodes: Sequence of BaseNode instances to transform.
        **kwargs: Additional keyword arguments specific to the transformation.

    Returns:
        Transformed sequence of BaseNode instances.

    Examples:
        >>> import asyncio
        >>> from serapeum.core.base.embeddings.types import CallMixin, BaseNode, MetadataMode
        >>> import hashlib
        >>> from pydantic import Field
        >>> class TextNode(BaseNode):
        ...     text: str = Field(default="")
        ...     @classmethod
        ...     def get_type(cls) -> str:
        ...         return "text"
        ...     def get_content(self, metadata_mode=MetadataMode.ALL) -> str:
        ...         return self.text
        ...     def set_content(self, value: str) -> None:
        ...         self.text = value
        ...     @property
        ...     def hash(self) -> str:
        ...         return hashlib.sha256(self.text.encode()).hexdigest()
        >>> class AsyncTransform(CallMixin):
        ...     def __call__(self, nodes, **kwargs):
        ...         return nodes
        ...     async def acall(self, nodes, **kwargs):
        ...         await asyncio.sleep(0)
        ...         for node in nodes:
        ...             node.metadata["async_processed"] = True
        ...         return nodes
        >>> transform = AsyncTransform()
        >>> nodes = [TextNode(text="test")]
        >>> result = asyncio.run(transform.acall(nodes))
        >>> result[0].metadata["async_processed"]
        True

        >>> class SyncOnlyTransform(CallMixin):
        ...     def __call__(self, nodes, **kwargs):
        ...         for node in nodes:
        ...             node.metadata["processed"] = True
        ...         return nodes
        >>> sync_transform = SyncOnlyTransform()
        >>> nodes = [TextNode(text="test")]
        >>> result = asyncio.run(sync_transform.acall(nodes))
        >>> result[0].metadata["processed"]
        True

    Note:
        If no true async implementation is needed, the default delegation
        to `__call__` is sufficient. Override only if the transformation
        benefits from async/await (e.g., I/O operations).
    """
    return self.__call__(nodes, **kwargs)

LinkedNodes #

Bases: SerializableModel

Immutable container for node relationships in a document hierarchy.

LinkedNodes manages references between nodes in a document structure, supporting linear sequences (previous/next), hierarchical relationships (parent/children), and source document tracking. The model is frozen to prevent accidental mutation of relationship structures.

Attributes:

Name Type Description
source NodeInfo | None

Reference to the original source document node.

previous NodeInfo | None

Reference to the previous node in a sequence.

next NodeInfo | None

Reference to the next node in a sequence.

parent NodeInfo | None

Reference to the parent node in a hierarchy.

children list[NodeInfo] | None

List of child node references in a hierarchy.

Examples:

  • Creating a linear sequence of nodes
    >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
    >>> prev_node = NodeInfo(id="chunk-1")
    >>> next_node = NodeInfo(id="chunk-3")
    >>> links = LinkedNodes(previous=prev_node, next=next_node)
    >>> links.previous.id
    'chunk-1'
    
  • Building hierarchical relationships
    >>> parent = NodeInfo(id="section-1")
    >>> child1 = NodeInfo(id="para-1")
    >>> child2 = NodeInfo(id="para-2")
    >>> links = LinkedNodes(parent=parent, children=[child1, child2])
    >>> [c.id for c in links.children]
    ['para-1', 'para-2']
    
  • Using factory method with NodeType enum
    >>> from serapeum.core.base.embeddings.types import NodeType
    >>> source = NodeInfo(id="doc-main")
    >>> links_dict = {NodeType.SOURCE: source}
    >>> links = LinkedNodes.create(links_dict)
    >>> links.source.id
    'doc-main'
    
  • Accessing source ID property
    >>> source = NodeInfo(id="original-doc")
    >>> links = LinkedNodes(source=source)
    >>> links.source_id
    'original-doc'
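
Since the model is frozen, a change is expressed by building a new instance rather than by assigning to a field. The following sketch illustrates that pattern with stdlib frozen dataclasses as stand-ins. `NodeRefSketch` and `LinkedNodesSketch` are hypothetical names, and the real Pydantic model reports a rejected assignment with its own error type rather than `FrozenInstanceError`.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class NodeRefSketch:
    """Hypothetical stand-in for NodeInfo; only the id field is modeled."""

    id: str


@dataclass(frozen=True)
class LinkedNodesSketch:
    """Hypothetical stand-in for LinkedNodes with the same five fields."""

    source: Optional[NodeRefSketch] = None
    previous: Optional[NodeRefSketch] = None
    next: Optional[NodeRefSketch] = None
    parent: Optional[NodeRefSketch] = None
    children: Optional[Tuple[NodeRefSketch, ...]] = None


links = LinkedNodesSketch(previous=NodeRefSketch("chunk-1"))

# Assigning to a field of a frozen instance is rejected.
try:
    links.next = NodeRefSketch("chunk-3")
    mutated = True
except FrozenInstanceError:
    mutated = False

# Deriving an updated copy leaves the original instance untouched.
updated = replace(links, next=NodeRefSketch("chunk-3"))
```

With the real class, a comparable update could go through the documented `as_dict()` and `create()` round trip: convert to a dict, adjust an entry, and build a new `LinkedNodes` from the result.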
    
See Also

NodeType: Enum defining relationship types. NodeInfo: References stored in relationship fields. BaseNode.linked_nodes: Property that creates LinkedNodes from links dict.

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
class LinkedNodes(SerializableModel):
    """Immutable container for node relationships in a document hierarchy.

    LinkedNodes manages references between nodes in a document structure, supporting
    linear sequences (previous/next), hierarchical relationships (parent/children),
    and source document tracking. The model is frozen to prevent accidental mutation
    of relationship structures.

    Attributes:
        source: Reference to the original source document node.
        previous: Reference to the previous node in a sequence.
        next: Reference to the next node in a sequence.
        parent: Reference to the parent node in a hierarchy.
        children: List of child node references in a hierarchy.

    Examples:
        - Creating a linear sequence of nodes
            ```python
            >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
            >>> prev_node = NodeInfo(id="chunk-1")
            >>> next_node = NodeInfo(id="chunk-3")
            >>> links = LinkedNodes(previous=prev_node, next=next_node)
            >>> links.previous.id
            'chunk-1'

            ```
        - Building hierarchical relationships
            ```python
            >>> parent = NodeInfo(id="section-1")
            >>> child1 = NodeInfo(id="para-1")
            >>> child2 = NodeInfo(id="para-2")
            >>> links = LinkedNodes(parent=parent, children=[child1, child2])
            >>> [c.id for c in links.children]
            ['para-1', 'para-2']

            ```
        - Using factory method with NodeType enum
            ```python
            >>> from serapeum.core.base.embeddings.types import NodeType
            >>> source = NodeInfo(id="doc-main")
            >>> links_dict = {NodeType.SOURCE: source}
            >>> links = LinkedNodes.create(links_dict)
            >>> links.source.id
            'doc-main'

            ```
        - Accessing source ID property
            ```python
            >>> source = NodeInfo(id="original-doc")
            >>> links = LinkedNodes(source=source)
            >>> links.source_id
            'original-doc'

            ```

    See Also:
        NodeType: Enum defining relationship types.
        NodeInfo: References stored in relationship fields.
        BaseNode.linked_nodes: Property that creates LinkedNodes from links dict.
    """

    model_config = ConfigDict(frozen=True)

    source: NodeInfo | None = None
    previous: NodeInfo | None = None
    next: NodeInfo | None = None
    parent: NodeInfo | None = None
    children: list[NodeInfo] | None = None

    @field_validator("source", "previous", "next", "parent")
    @classmethod
    def validate_single_node(cls, v: Any) -> NodeInfo | None:
        """Validate that single-node fields contain NodeInfo objects.

        Ensures that source, previous, next, and parent fields contain exactly
        one NodeInfo instance (not a list). Called automatically by Pydantic
        during model instantiation and validation.

        Args:
            v: Value to validate, expected to be NodeInfo or None.

        Returns:
            The validated NodeInfo instance or None.

        Raises:
            ValueError: If v is not None and not a NodeInfo instance.

        Examples:
            - Valid single node assignment
                ```python
                >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
                >>> node = NodeInfo(id="valid")
                >>> links = LinkedNodes(source=node)
                >>> links.source.id
                'valid'

                ```
            - Invalid list assignment to single-node field raises ValidationError
                ```python
                >>> from pydantic import ValidationError
                >>> try:
                ...     LinkedNodes(source=[NodeInfo(id="bad")])
                ... except ValidationError as e:
                ...     'source' in str(e)
                True

                ```

        Note:
            This validator applies to: source, previous, next, parent fields.
            The children field has a separate validator for list validation.
        """
        if v is not None and not isinstance(v, NodeInfo):
            raise ValueError("Must be a NodeInfo object, not a list")
        return v

    @field_validator("children")
    @classmethod
    def validate_children_list(cls, v: Any) -> list[NodeInfo] | None:
        """Validate that children field contains a list of NodeInfo objects.

        Ensures the children field is a list (not a single NodeInfo instance).
        Called automatically by Pydantic during model instantiation and validation.

        Args:
            v: Value to validate, expected to be list[NodeInfo] or None.

        Returns:
            The validated list of NodeInfo instances or None.

        Raises:
            ValueError: If v is not None and not a list.

        Examples:
            - Valid children list
                ```python
                >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
                >>> child1 = NodeInfo(id="child-1")
                >>> child2 = NodeInfo(id="child-2")
                >>> links = LinkedNodes(children=[child1, child2])
                >>> [c.id for c in links.children]
                ['child-1', 'child-2']

                ```
            - Invalid single NodeInfo for children raises ValidationError
                ```python
                >>> from pydantic import ValidationError
                >>> try:
                ...     LinkedNodes(children=NodeInfo(id="bad"))
                ... except ValidationError as e:
                ...     'children' in str(e)
                True

                ```
            - Empty children list is valid
                ```python
                >>> links = LinkedNodes(children=[])
                >>> links.children
                []

                ```

        Note:
            This validator is specific to the children field, which represents
            one-to-many relationships.
        """
        if v is not None and not isinstance(v, list):
            raise ValueError("Children must be a list of NodeInfo objects")
        return v

    @classmethod
    def create(cls, linked_nodes_info: dict[NodeType, NodeInfoType]) -> "LinkedNodes":
        """Create LinkedNodes from a dict mapping NodeType to NodeInfo/list.

        Factory method that converts a dictionary with NodeType keys into a
        validated LinkedNodes instance. Pydantic validators automatically check
        that single-node fields contain NodeInfo and children contains a list.

        Args:
            linked_nodes_info: Dictionary mapping NodeType enum values to either
                NodeInfo (for single relationships) or list[NodeInfo] (for
                children). Missing keys are treated as None.

        Returns:
            A new LinkedNodes instance with validated relationships.

        Raises:
            ValueError: If a single-node field (SOURCE, PREVIOUS, NEXT, PARENT)
                receives a list, or if children receives a non-list value.

        Examples:
            - Creating from a dict with mixed relationships
                ```python
                >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
                >>> source = NodeInfo(id="doc-1")
                >>> parent = NodeInfo(id="section-1")
                >>> children = [NodeInfo(id="para-1"), NodeInfo(id="para-2")]
                >>> links_dict = {
                ...     NodeType.SOURCE: source,
                ...     NodeType.PARENT: parent,
                ...     NodeType.CHILD: children
                ... }
                >>> links = LinkedNodes.create(links_dict)
                >>> links.source.id
                'doc-1'

                ```
            - Creating with only some relationships
                ```python
                >>> prev = NodeInfo(id="chunk-1")
                >>> next_node = NodeInfo(id="chunk-3")
                >>> links = LinkedNodes.create({
                ...     NodeType.PREVIOUS: prev,
                ...     NodeType.NEXT: next_node
                ... })
                >>> links.previous.id
                'chunk-1'

                ```
            - Empty dict creates all-None instance
                ```python
                >>> links = LinkedNodes.create({})
                >>> links.source is None
                True

                ```

        See Also:
            LinkedNodes.as_dict: Inverse operation converting LinkedNodes to dict.
            NodeType: Enum defining valid relationship types.
        """
        return cls(
            source=linked_nodes_info.get(NodeType.SOURCE),
            previous=linked_nodes_info.get(NodeType.PREVIOUS),
            next=linked_nodes_info.get(NodeType.NEXT),
            parent=linked_nodes_info.get(NodeType.PARENT),
            children=linked_nodes_info.get(NodeType.CHILD),
        )

    def as_dict(self) -> dict[NodeType, NodeInfoType | None]:
        """Convert LinkedNodes to a dictionary mapping NodeType to NodeInfo.

        Creates a dictionary representation with NodeType enum keys and NodeInfo
        values. None values are excluded from the result to create a compact
        representation containing only active relationships.

        Returns:
            Dictionary with NodeType keys and NodeInfo/list[NodeInfo] values.
            Only non-None relationships are included.

        Examples:
            - Converting to dict with multiple relationships
                ```python
                >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
                >>> source = NodeInfo(id="doc-1")
                >>> parent = NodeInfo(id="section-1")
                >>> links = LinkedNodes(source=source, parent=parent)
                >>> result = links.as_dict()
                >>> result[NodeType.SOURCE].id
                'doc-1'

                ```
            - None values are excluded
                ```python
                >>> links = LinkedNodes(source=NodeInfo(id="doc-1"))
                >>> result = links.as_dict()
                >>> NodeType.PREVIOUS in result
                False

                ```
            - Round-trip with create method
                ```python
                >>> original = LinkedNodes(
                ...     source=NodeInfo(id="src"),
                ...     children=[NodeInfo(id="child-1")]
                ... )
                >>> as_dict = original.as_dict()
                >>> restored = LinkedNodes.create(as_dict)
                >>> restored.source.id
                'src'

                ```

        See Also:
            LinkedNodes.create: Factory method for creating from dict.
            BaseNode.links: Uses this format for storing relationships.
        """
        linked_nodes = {
            NodeType.SOURCE: self.source,
            NodeType.PREVIOUS: self.previous,
            NodeType.NEXT: self.next,
            NodeType.PARENT: self.parent,
            NodeType.CHILD: self.children,
        }

        linked_nodes = {
            key: value for key, value in linked_nodes.items() if value is not None
        }
        return linked_nodes

    @property
    def source_id(self) -> str | None:
        """Get the ID of the source node if it exists.

        Convenience property for accessing the source node's ID without
        checking if source is None first.

        Returns:
            The source node's ID string, or None if no source is set.

        Examples:
            - Accessing source ID when source exists
                ```python
                >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
                >>> source = NodeInfo(id="document-123")
                >>> links = LinkedNodes(source=source)
                >>> links.source_id
                'document-123'

                ```
            - Accessing when source is None
                ```python
                >>> links = LinkedNodes()
                >>> links.source_id is None
                True

                ```

        See Also:
            BaseNode.source_id: Uses this property for node source tracking.
        """
        source_id = None
        if self.source is not None:
            source_id = self.source.id
        return source_id

source_id property #

Get the ID of the source node if it exists.

Convenience property for accessing the source node's ID without checking if source is None first.

Returns:

Type Description
str | None

The source node's ID string, or None if no source is set.

Examples:

  • Accessing source ID when source exists
    >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
    >>> source = NodeInfo(id="document-123")
    >>> links = LinkedNodes(source=source)
    >>> links.source_id
    'document-123'
    
  • Accessing when source is None
    >>> links = LinkedNodes()
    >>> links.source_id is None
    True
    
See Also

BaseNode.source_id: Uses this property for node source tracking.

as_dict() #

Convert LinkedNodes to a dictionary mapping NodeType to NodeInfo.

Creates a dictionary representation with NodeType enum keys and NodeInfo values. None values are excluded from the result to create a compact representation containing only active relationships.

Returns:

Type Description
dict[NodeType, NodeInfoType | None]

Dictionary with NodeType keys and NodeInfo/list[NodeInfo] values. Only non-None relationships are included.

Examples:

  • Converting to dict with multiple relationships
    >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
    >>> source = NodeInfo(id="doc-1")
    >>> parent = NodeInfo(id="section-1")
    >>> links = LinkedNodes(source=source, parent=parent)
    >>> result = links.as_dict()
    >>> result[NodeType.SOURCE].id
    'doc-1'
    
  • None values are excluded
    >>> links = LinkedNodes(source=NodeInfo(id="doc-1"))
    >>> result = links.as_dict()
    >>> NodeType.PREVIOUS in result
    False
    
  • Round-trip with create method
    >>> original = LinkedNodes(
    ...     source=NodeInfo(id="src"),
    ...     children=[NodeInfo(id="child-1")]
    ... )
    >>> as_dict = original.as_dict()
    >>> restored = LinkedNodes.create(as_dict)
    >>> restored.source.id
    'src'
    
See Also

LinkedNodes.create: Factory method for creating from dict. BaseNode.links: Uses this format for storing relationships.

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
def as_dict(self) -> dict[NodeType, NodeInfoType | None]:
    """Convert LinkedNodes to a dictionary mapping NodeType to NodeInfo.

    Creates a dictionary representation with NodeType enum keys and NodeInfo
    values. None values are excluded from the result to create a compact
    representation containing only active relationships.

    Returns:
        Dictionary with NodeType keys and NodeInfo/list[NodeInfo] values.
        Only non-None relationships are included.

    Examples:
        - Converting to dict with multiple relationships
            ```python
            >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
            >>> source = NodeInfo(id="doc-1")
            >>> parent = NodeInfo(id="section-1")
            >>> links = LinkedNodes(source=source, parent=parent)
            >>> result = links.as_dict()
            >>> result[NodeType.SOURCE].id
            'doc-1'

            ```
        - None values are excluded
            ```python
            >>> links = LinkedNodes(source=NodeInfo(id="doc-1"))
            >>> result = links.as_dict()
            >>> NodeType.PREVIOUS in result
            False

            ```
        - Round-trip with create method
            ```python
            >>> original = LinkedNodes(
            ...     source=NodeInfo(id="src"),
            ...     children=[NodeInfo(id="child-1")]
            ... )
            >>> as_dict = original.as_dict()
            >>> restored = LinkedNodes.create(as_dict)
            >>> restored.source.id
            'src'

            ```

    See Also:
        LinkedNodes.create: Factory method for creating from dict.
        BaseNode.links: Uses this format for storing relationships.
    """
    linked_nodes = {
        NodeType.SOURCE: self.source,
        NodeType.PREVIOUS: self.previous,
        NodeType.NEXT: self.next,
        NodeType.PARENT: self.parent,
        NodeType.CHILD: self.children,
    }

    linked_nodes = {
        key: value for key, value in linked_nodes.items() if value is not None
    }
    return linked_nodes

create(linked_nodes_info) classmethod #

Create LinkedNodes from a dict mapping NodeType to NodeInfo/list.

Factory method that converts a dictionary with NodeType keys into a validated LinkedNodes instance. Pydantic validators automatically check that single-node fields contain NodeInfo and children contains a list.

Parameters:

Name Type Description Default
linked_nodes_info dict[NodeType, NodeInfoType]

Dictionary mapping NodeType enum values to either NodeInfo (for single relationships) or list[NodeInfo] (for children). Missing keys are treated as None.

required

Returns:

Type Description
LinkedNodes

A new LinkedNodes instance with validated relationships.

Raises:

Type Description
ValueError

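If a single-node field (SOURCE, PREVIOUS, NEXT, PARENT) receives a list, or if children receives a non-list value. Pydantic reports the failure as a ValidationError, which is the exception the validator examples on this page catch.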

Examples:

  • Creating from a dict with mixed relationships
    >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
    >>> source = NodeInfo(id="doc-1")
    >>> parent = NodeInfo(id="section-1")
    >>> children = [NodeInfo(id="para-1"), NodeInfo(id="para-2")]
    >>> links_dict = {
    ...     NodeType.SOURCE: source,
    ...     NodeType.PARENT: parent,
    ...     NodeType.CHILD: children
    ... }
    >>> links = LinkedNodes.create(links_dict)
    >>> links.source.id
    'doc-1'
    
  • Creating with only some relationships
    >>> prev = NodeInfo(id="chunk-1")
    >>> next_node = NodeInfo(id="chunk-3")
    >>> links = LinkedNodes.create({
    ...     NodeType.PREVIOUS: prev,
    ...     NodeType.NEXT: next_node
    ... })
    >>> links.previous.id
    'chunk-1'
    
  • Empty dict creates all-None instance
    >>> links = LinkedNodes.create({})
    >>> links.source is None
    True
    
See Also

LinkedNodes.as_dict: Inverse operation converting LinkedNodes to dict. NodeType: Enum defining valid relationship types.

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
@classmethod
def create(cls, linked_nodes_info: dict[NodeType, NodeInfoType]) -> "LinkedNodes":
    """Create LinkedNodes from a dict mapping NodeType to NodeInfo/list.

    Factory method that converts a dictionary with NodeType keys into a
    validated LinkedNodes instance. Pydantic validators automatically check
    that single-node fields contain NodeInfo and children contains a list.

    Args:
        linked_nodes_info: Dictionary mapping NodeType enum values to either
            NodeInfo (for single relationships) or list[NodeInfo] (for
            children). Missing keys are treated as None.

    Returns:
        A new LinkedNodes instance with validated relationships.

    Raises:
        ValueError: If a single-node field (SOURCE, PREVIOUS, NEXT, PARENT)
            receives a list, or if children receives a non-list value.

    Examples:
        - Creating from a dict with mixed relationships
            ```python
            >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo, NodeType
            >>> source = NodeInfo(id="doc-1")
            >>> parent = NodeInfo(id="section-1")
            >>> children = [NodeInfo(id="para-1"), NodeInfo(id="para-2")]
            >>> links_dict = {
            ...     NodeType.SOURCE: source,
            ...     NodeType.PARENT: parent,
            ...     NodeType.CHILD: children
            ... }
            >>> links = LinkedNodes.create(links_dict)
            >>> links.source.id
            'doc-1'

            ```
        - Creating with only some relationships
            ```python
            >>> prev = NodeInfo(id="chunk-1")
            >>> next_node = NodeInfo(id="chunk-3")
            >>> links = LinkedNodes.create({
            ...     NodeType.PREVIOUS: prev,
            ...     NodeType.NEXT: next_node
            ... })
            >>> links.previous.id
            'chunk-1'

            ```
        - Empty dict creates all-None instance
            ```python
            >>> links = LinkedNodes.create({})
            >>> links.source is None
            True

            ```

    See Also:
        LinkedNodes.as_dict: Inverse operation converting LinkedNodes to dict.
        NodeType: Enum defining valid relationship types.
    """
    return cls(
        source=linked_nodes_info.get(NodeType.SOURCE),
        previous=linked_nodes_info.get(NodeType.PREVIOUS),
        next=linked_nodes_info.get(NodeType.NEXT),
        parent=linked_nodes_info.get(NodeType.PARENT),
        children=linked_nodes_info.get(NodeType.CHILD),
    )

validate_children_list(v) classmethod #

Validate that children field contains a list of NodeInfo objects.

Ensures the children field is a list (not a single NodeInfo instance). Called automatically by Pydantic during model instantiation and validation.

Parameters:

Name Type Description Default
v Any

Value to validate, expected to be list[NodeInfo] or None.

required

Returns:

Type Description
list[NodeInfo] | None

The validated list of NodeInfo instances or None.

Raises:

Type Description
ValueError

If v is not None and not a list.

Examples:

  • Valid children list
    >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
    >>> child1 = NodeInfo(id="child-1")
    >>> child2 = NodeInfo(id="child-2")
    >>> links = LinkedNodes(children=[child1, child2])
    >>> [c.id for c in links.children]
    ['child-1', 'child-2']
    
  • Invalid single NodeInfo for children raises ValidationError
    >>> from pydantic import ValidationError
    >>> try:
    ...     LinkedNodes(children=NodeInfo(id="bad"))
    ... except ValidationError as e:
    ...     'children' in str(e)
    True
    
  • Empty children list is valid
    >>> links = LinkedNodes(children=[])
    >>> links.children
    []
    
Note

This validator is specific to the children field, which represents one-to-many relationships.
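
The rule this validator enforces is a plain `isinstance` test. A minimal standalone sketch of the same rule, without Pydantic, might look like the following. The function name `check_children` is hypothetical and is not part of serapeum; only the error message is copied from the source listing.

```python
from typing import Any


def check_children(value: Any) -> Any:
    """Accept None or a list; reject anything else (hypothetical helper)."""
    if value is not None and not isinstance(value, list):
        raise ValueError("Children must be a list of NodeInfo objects")
    return value


# None and lists (including the empty list) pass through unchanged.
assert check_children(None) is None
assert check_children([]) == []

# A single non-list value is rejected.
try:
    check_children("child-1")
except ValueError as exc:
    print(exc)
```

Note that this check looks only at the container type. It does not inspect the elements inside the list.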

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
@field_validator("children")
@classmethod
def validate_children_list(cls, v: Any) -> list[NodeInfo] | None:
    """Validate that children field contains a list of NodeInfo objects.

    Ensures the children field is either None or a list (not a single
    NodeInfo instance). Only the container type is checked here. Called
    automatically by Pydantic during model instantiation and validation.

    Args:
        v: Value to validate, expected to be list[NodeInfo] or None.

    Returns:
        The validated list of NodeInfo instances or None.

    Raises:
        ValueError: If v is not None and not a list.

    Examples:
        - Valid children list
            ```python
            >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
            >>> child1 = NodeInfo(id="child-1")
            >>> child2 = NodeInfo(id="child-2")
            >>> links = LinkedNodes(children=[child1, child2])
            >>> [c.id for c in links.children]
            ['child-1', 'child-2']

            ```
        - Invalid single NodeInfo for children raises ValidationError
            ```python
            >>> from pydantic import ValidationError
            >>> try:
            ...     LinkedNodes(children=NodeInfo(id="bad"))
            ... except ValidationError as e:
            ...     'children' in str(e)
            True

            ```
        - Empty children list is valid
            ```python
            >>> links = LinkedNodes(children=[])
            >>> links.children
            []

            ```

    Note:
        This validator is specific to the children field, which represents
        one-to-many relationships.
    """
    if v is not None and not isinstance(v, list):
        raise ValueError("Children must be a list of NodeInfo objects")
    return v

validate_single_node(v) classmethod #

Validate that single-node fields contain NodeInfo objects.

Ensures that the source, previous, next, and parent fields each hold either None or a single NodeInfo instance (not a list). Called automatically by Pydantic during model instantiation and validation.

Parameters:

Name Type Description Default
v Any

Value to validate, expected to be NodeInfo or None.

required

Returns:

Type Description
NodeInfo | None

The validated NodeInfo instance or None.

Raises:

Type Description
ValueError

If v is not None and not a NodeInfo instance.

Examples:

  • Valid single node assignment
    >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
    >>> node = NodeInfo(id="valid")
    >>> links = LinkedNodes(source=node)
    >>> links.source.id
    'valid'
    
  • Invalid list assignment to single-node field raises ValidationError
    >>> from pydantic import ValidationError
    >>> try:
    ...     LinkedNodes(source=[NodeInfo(id="bad")])
    ... except ValidationError as e:
    ...     'source' in str(e)
    True
    
Note

This validator applies to: source, previous, next, parent fields. The children field has a separate validator for list validation.
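
One check shared across four fields can be sketched with the standard library alone. The classes `Ref` and `Links` below are hypothetical stand-ins for NodeInfo and LinkedNodes, and the error message is illustrative rather than the one serapeum raises.

```python
from dataclasses import dataclass
from typing import Any, Optional

# The fields that must hold a single reference (or None).
SINGLE_FIELDS = ("source", "previous", "next", "parent")


@dataclass
class Ref:
    """Hypothetical stand-in for NodeInfo."""

    id: str


@dataclass
class Links:
    """Hypothetical stand-in for LinkedNodes, single-node fields only."""

    source: Optional[Any] = None
    previous: Optional[Any] = None
    next: Optional[Any] = None
    parent: Optional[Any] = None

    def __post_init__(self) -> None:
        # Apply the same rule to every single-node field.
        for name in SINGLE_FIELDS:
            value = getattr(self, name)
            if value is not None and not isinstance(value, Ref):
                raise ValueError(
                    f"{name}: must be a single Ref, got {type(value).__name__}"
                )


links = Links(source=Ref("doc-1"))  # valid: one reference
try:
    Links(parent=[Ref("doc-1")])  # invalid: a list in a single-node field
except ValueError as exc:
    print(exc)  # prints: parent: must be a single Ref, got list
```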

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
@field_validator("source", "previous", "next", "parent")
@classmethod
def validate_single_node(cls, v: Any) -> NodeInfo | None:
    """Validate that single-node fields contain NodeInfo objects.

    Ensures that the source, previous, next, and parent fields each hold
    either None or a single NodeInfo instance (not a list). Called
    automatically by Pydantic during model instantiation and validation.

    Args:
        v: Value to validate, expected to be NodeInfo or None.

    Returns:
        The validated NodeInfo instance or None.

    Raises:
        ValueError: If v is not None and not a NodeInfo instance.

    Examples:
        - Valid single node assignment
            ```python
            >>> from serapeum.core.base.embeddings.types import LinkedNodes, NodeInfo
            >>> node = NodeInfo(id="valid")
            >>> links = LinkedNodes(source=node)
            >>> links.source.id
            'valid'

            ```
        - Invalid list assignment to single-node field raises ValidationError
            ```python
            >>> from pydantic import ValidationError
            >>> try:
            ...     LinkedNodes(source=[NodeInfo(id="bad")])
            ... except ValidationError as e:
            ...     'source' in str(e)
            True

            ```

    Note:
        This validator applies to: source, previous, next, parent fields.
        The children field has a separate validator for list validation.
    """
    if v is not None and not isinstance(v, NodeInfo):
        raise ValueError("Must be a NodeInfo object, not a list")
    return v

MetadataMode #

Bases: str, Enum

Enumeration for controlling which metadata is included in different contexts.

Different use cases require different metadata visibility. For example, you might exclude certain metadata from embeddings (to avoid semantic pollution) while including it for LLM context (to provide additional information).

Attributes:

Name Type Description
ALL

Include all metadata fields.

EMBED

Include only metadata for embedding generation (excludes fields in excluded_embed_metadata_keys).

LLM

Include only metadata for LLM context (excludes fields in excluded_llm_metadata_keys).

NONE

Exclude all metadata.

Examples:

  • Filtering metadata for embeddings
    >>> from serapeum.core.base.embeddings.types import MetadataMode
    >>> mode = MetadataMode.EMBED
    >>> mode.value
    'embed'
    
  • Comparing modes with plain strings
    >>> MetadataMode.LLM == "llm"
    True
    >>> MetadataMode.NONE == "none"
    True
    
  • All mode values are plain strings
    >>> MetadataMode.ALL.value
    'all'
    
See Also

BaseNode.get_content: Uses this mode to control metadata inclusion. BaseNode.get_metadata_str: Filters metadata based on this mode. BaseNode.excluded_embed_metadata_keys: Metadata excluded for EMBED mode. BaseNode.excluded_llm_metadata_keys: Metadata excluded for LLM mode.
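
How a mode might translate into a metadata filter can be sketched as below. This is an illustration of the documented semantics only, not the BaseNode implementation: the function `visible_metadata` and its parameter names are assumptions, and the enum is redefined locally so the snippet runs on its own.

```python
from enum import Enum
from typing import Any, Iterable


class MetadataMode(str, Enum):
    """Local copy of the enum so the sketch is self-contained."""

    ALL = "all"
    EMBED = "embed"
    LLM = "llm"
    NONE = "none"


def visible_metadata(
    metadata: dict[str, Any],
    mode: MetadataMode,
    excluded_embed_keys: Iterable[str] = (),
    excluded_llm_keys: Iterable[str] = (),
) -> dict[str, Any]:
    """Return the metadata entries visible under the given mode (hypothetical)."""
    if mode == MetadataMode.NONE:
        return {}
    if mode == MetadataMode.EMBED:
        hidden = set(excluded_embed_keys)
    elif mode == MetadataMode.LLM:
        hidden = set(excluded_llm_keys)
    else:  # MetadataMode.ALL hides nothing
        hidden = set()
    return {key: value for key, value in metadata.items() if key not in hidden}


meta = {"title": "Report", "file_path": "/tmp/report.pdf"}
embed_view = visible_metadata(meta, MetadataMode.EMBED, excluded_embed_keys=["file_path"])
llm_view = visible_metadata(meta, MetadataMode.LLM, excluded_embed_keys=["file_path"])
print(embed_view)  # the file path is hidden from the embedding view
print(llm_view)    # the LLM view is unaffected by the embed exclusions
```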

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
class MetadataMode(str, Enum):
    """Enumeration for controlling which metadata is included in different contexts.

    Different use cases require different metadata visibility. For example, you
    might exclude certain metadata from embeddings (to avoid semantic pollution)
    while including it for LLM context (to provide additional information).

    Attributes:
        ALL: Include all metadata fields.
        EMBED: Include only metadata for embedding generation (excludes fields
            in excluded_embed_metadata_keys).
        LLM: Include only metadata for LLM context (excludes fields in
            excluded_llm_metadata_keys).
        NONE: Exclude all metadata.

    Examples:
        - Filtering metadata for embeddings
            ```python
            >>> from serapeum.core.base.embeddings.types import MetadataMode
            >>> mode = MetadataMode.EMBED
            >>> mode.value
            'embed'

            ```
        - Comparing modes with plain strings
            ```python
            >>> MetadataMode.LLM == "llm"
            True
            >>> MetadataMode.NONE == "none"
            True

            ```
        - All mode values are plain strings
            ```python
            >>> MetadataMode.ALL.value
            'all'

            ```

    See Also:
        BaseNode.get_content: Uses this mode to control metadata inclusion.
        BaseNode.get_metadata_str: Filters metadata based on this mode.
        BaseNode.excluded_embed_metadata_keys: Metadata excluded for EMBED mode.
        BaseNode.excluded_llm_metadata_keys: Metadata excluded for LLM mode.
    """

    ALL = "all"
    EMBED = "embed"
    LLM = "llm"
    NONE = "none"

MockEmbedding #

Bases: BaseEmbedding

Mock embedding model for testing purposes.

Returns constant embedding vectors (all 0.5 values) for any input, allowing tests to run without requiring a real embedding model. This is useful for unit testing, integration testing, and development without the overhead of loading actual models or making API calls.

All embeddings returned are deterministic vectors of the specified dimension, filled with 0.5 values. This makes tests reproducible and fast.

Attributes:

Name Type Description
embed_dim int

Embedding dimension (must be positive).

model_name str

Model name identifier (defaults to "mock-embedding").

Examples:

  • Creating a mock embedding model

    >>> from serapeum.core.embeddings import MockEmbedding
    >>> emb = MockEmbedding(embed_dim=3)
    >>> emb.model_name
    'mock-embedding'
    >>> emb.embed_dim
    3
    

  • Getting embeddings returns constant vectors

    >>> emb = MockEmbedding(embed_dim=4)
    >>> result = emb.get_text_embedding("any text")
    >>> result
    [0.5, 0.5, 0.5, 0.5]
    

  • All inputs produce identical constant vectors

    >>> emb = MockEmbedding(embed_dim=2)
    >>> emb.get_text_embedding("hello")
    [0.5, 0.5]
    >>> emb.get_text_embedding("goodbye")
    [0.5, 0.5]
    

  • Validation of embed_dim

    >>> MockEmbedding(embed_dim=0)  # doctest: +ELLIPSIS
    Traceback (most recent call last):
        ...
    pydantic_core._pydantic_core.ValidationError: 1 validation error...
    

See Also

BaseEmbedding: Abstract base class that MockEmbedding implements.
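
One consequence of constant vectors is worth keeping in mind when writing tests: every pair of embeddings is identical, so similarity scores cannot rank documents. The sketch below reproduces the documented behavior with plain Python. The helpers `constant_vector` and `cosine_similarity` are hypothetical and are not serapeum functions.

```python
import math


def constant_vector(dim: int) -> list[float]:
    """Mirror the documented mock behavior: dim copies of 0.5."""
    if dim <= 0:
        raise ValueError(f"embed_dim must be positive, got {dim}")
    return [0.5] * dim


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query_vec = constant_vector(4)
doc_vecs = [constant_vector(4) for _ in range(3)]

# Every document scores the same against the query, so ranking is arbitrary.
scores = [cosine_similarity(query_vec, doc) for doc in doc_vecs]
print(scores)
```

A mock like this suits tests of plumbing (batching, caching, storage) rather than tests of retrieval quality.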

Source code in libs\core\src\serapeum\core\embeddings\types.py
class MockEmbedding(BaseEmbedding):
    """Mock embedding model for testing purposes.

    Returns constant embedding vectors (all 0.5 values) for any input,
    allowing tests to run without requiring a real embedding model. This is
    useful for unit testing, integration testing, and development without
    the overhead of loading actual models or making API calls.

    All embeddings returned are deterministic vectors of the specified dimension,
    filled with 0.5 values. This makes tests reproducible and fast.

    Attributes:
        embed_dim: Embedding dimension (must be positive).
        model_name: Model name identifier (defaults to "mock-embedding").

    Examples:
        - Creating a mock embedding model
            ```python
            >>> from serapeum.core.embeddings import MockEmbedding
            >>> emb = MockEmbedding(embed_dim=3)
            >>> emb.model_name
            'mock-embedding'
            >>> emb.embed_dim
            3

            ```

        - Getting embeddings returns constant vectors
            ```python
            >>> emb = MockEmbedding(embed_dim=4)
            >>> result = emb.get_text_embedding("any text")
            >>> result
            [0.5, 0.5, 0.5, 0.5]

            ```

        - All inputs produce identical constant vectors
            ```python
            >>> emb = MockEmbedding(embed_dim=2)
            >>> emb.get_text_embedding("hello")
            [0.5, 0.5]
            >>> emb.get_text_embedding("goodbye")
            [0.5, 0.5]

            ```

        - Validation of embed_dim
            ```python
            >>> MockEmbedding(embed_dim=0)  # doctest: +ELLIPSIS
            Traceback (most recent call last):
                ...
            pydantic_core._pydantic_core.ValidationError: 1 validation error...

            ```

    See Also:
        BaseEmbedding: Abstract base class that MockEmbedding implements.
    """

    embed_dim: int = Field(
        ..., gt=0, description="Embedding dimension (must be positive)"
    )
    model_name: str = Field(
        default="mock-embedding", description="Model name identifier"
    )

    @field_validator("embed_dim")
    @classmethod
    def validate_embed_dim(cls, v: int) -> int:
        """Validate that embed_dim is positive.

        Args:
            v: The embed_dim value to validate.

        Returns:
            The validated embed_dim.

        Raises:
            ValueError: If embed_dim is not positive.
        """
        if v <= 0:
            raise ValueError(f"embed_dim must be positive, got {v}")
        return v

    @classmethod
    def class_name(cls) -> str:
        """Return the class name identifier.

        Returns:
            String "MockEmbedding" identifying this class.

        Examples:
            - Getting the class name
                ```python
                >>> from serapeum.core.embeddings import MockEmbedding
                >>> MockEmbedding.class_name()
                'MockEmbedding'

                ```
        """
        return "MockEmbedding"

    def _get_mocked_vector(self) -> Embedding:
        """Generate a mock embedding vector of constant values.

        Creates a vector of length embed_dim where all values are 0.5. This is
        the core method that all other embedding methods delegate to.

        Returns:
            List of floats with length equal to embed_dim, all values 0.5.

        Examples:
            - Generating a mock vector
                ```python
                >>> from serapeum.core.embeddings import MockEmbedding
                >>> emb = MockEmbedding(embed_dim=5)
                >>> emb._get_mocked_vector()
                [0.5, 0.5, 0.5, 0.5, 0.5]

                ```

            - Vector values are all 0.5
                ```python
                >>> from serapeum.core.embeddings import MockEmbedding
                >>> emb = MockEmbedding(embed_dim=3)
                >>> emb._get_mocked_vector()
                [0.5, 0.5, 0.5]

                ```
        """
        return [0.5] * self.embed_dim

    def _get_query_embedding(self, query: str) -> Embedding:
        """Get query embedding (returns constant mock vector).

        This method ignores the input query and always returns the same mock
        vector. Implements the abstract method from BaseEmbedding.

        Args:
            query: Query text (unused in mock implementation).

        Returns:
            Mock embedding vector with all values set to 0.5.

        Examples:
            - Query embedding returns mock vector
                ```python
                >>> from serapeum.core.embeddings import MockEmbedding
                >>> emb = MockEmbedding(embed_dim=3)
                >>> emb._get_query_embedding("test query")
                [0.5, 0.5, 0.5]

                ```

            - Different queries return the same constant vector
                ```python
                >>> emb = MockEmbedding(embed_dim=2)
                >>> emb._get_query_embedding("query1")
                [0.5, 0.5]
                >>> emb._get_query_embedding("query2")
                [0.5, 0.5]

                ```
        """
        return self._get_mocked_vector()

    def _get_text_embedding(self, text: str) -> Embedding:
        """Get text embedding (returns constant mock vector).

        This method ignores the input text and always returns the same mock
        vector. Implements the abstract method from BaseEmbedding.

        Args:
            text: Input text (unused in mock implementation).

        Returns:
            Mock embedding vector with all values set to 0.5.

        Examples:
            - Text embedding returns mock vector
                ```python
                >>> from serapeum.core.embeddings import MockEmbedding
                >>> emb = MockEmbedding(embed_dim=4)
                >>> emb._get_text_embedding("sample text")
                [0.5, 0.5, 0.5, 0.5]

                ```

            - Different texts return the same constant vector
                ```python
                >>> emb = MockEmbedding(embed_dim=2)
                >>> emb._get_text_embedding("text1")
                [0.5, 0.5]
                >>> emb._get_text_embedding("text2")
                [0.5, 0.5]

                ```
        """
        return self._get_mocked_vector()

    async def _aget_query_embedding(self, query: str) -> Embedding:
        """Asynchronously get query embedding (returns constant mock vector).

        Async version of _get_query_embedding. This method ignores the input
        query and always returns the same mock vector. Implements the abstract
        method from BaseEmbedding.

        Args:
            query: Query text (unused in mock implementation).

        Returns:
            Mock embedding vector with all values set to 0.5.

        Examples:
            - Async query embedding
                ```python
                >>> import asyncio
                >>> from serapeum.core.embeddings import MockEmbedding
                >>> emb = MockEmbedding(embed_dim=3)
                >>> asyncio.run(emb._aget_query_embedding("async query"))
                [0.5, 0.5, 0.5]

                ```
        """
        return self._get_mocked_vector()

    async def _aget_text_embedding(self, text: str) -> Embedding:
        """Asynchronously get text embedding (returns constant mock vector).

        Async version of _get_text_embedding. This method ignores the input
        text and always returns the same mock vector. Overrides the
        corresponding async method of BaseEmbedding.

        Args:
            text: Input text (unused in mock implementation).

        Returns:
            Mock embedding vector with all values set to 0.5.

        Examples:
            - Async text embedding
                ```python
                >>> import asyncio
                >>> from serapeum.core.embeddings import MockEmbedding
                >>> emb = MockEmbedding(embed_dim=3)
                >>> asyncio.run(emb._aget_text_embedding("async text"))
                [0.5, 0.5, 0.5]

                ```
        """
        return self._get_mocked_vector()

class_name() classmethod #

Return the class name identifier.

Returns:

Type Description
str

String "MockEmbedding" identifying this class.

Examples:

  • Getting the class name
    >>> from serapeum.core.embeddings import MockEmbedding
    >>> MockEmbedding.class_name()
    'MockEmbedding'
    
Source code in libs\core\src\serapeum\core\embeddings\types.py
@classmethod
def class_name(cls) -> str:
    """Return the class name identifier.

    Returns:
        String "MockEmbedding" identifying this class.

    Examples:
        - Getting the class name
            ```python
            >>> from serapeum.core.embeddings import MockEmbedding
            >>> MockEmbedding.class_name()
            'MockEmbedding'

            ```
    """
    return "MockEmbedding"

validate_embed_dim(v) classmethod #

Validate that embed_dim is positive.

Parameters:

Name Type Description Default
v int

The embed_dim value to validate.

required

Returns:

Type Description
int

The validated embed_dim.

Raises:

Type Description
ValueError

If embed_dim is not positive.

Source code in libs\core\src\serapeum\core\embeddings\types.py
@field_validator("embed_dim")
@classmethod
def validate_embed_dim(cls, v: int) -> int:
    """Validate that embed_dim is positive.

    Args:
        v: The embed_dim value to validate.

    Returns:
        The validated embed_dim.

    Raises:
        ValueError: If embed_dim is not positive.
    """
    if v <= 0:
        raise ValueError(f"embed_dim must be positive, got {v}")
    return v

NodeContentType #

Bases: str, Enum

Enumeration of content types that can be stored in a node.

This enum classifies the type of content a node contains, which helps downstream components (LLMs, embeddings, parsers) handle the content appropriately. String-based enum values enable direct serialization.

Attributes:

Name Type Description
TEXT

Plain text content, the most common node type.

IMAGE

Image data or references to images.

INDEX

Index structures or metadata about other nodes.

DOCUMENT

Complete document content before chunking.

MULTIMODAL

Content combining multiple modalities (text + images).

Examples:

  • Checking content type
    >>> from serapeum.core.base.embeddings.types import NodeContentType
    >>> content_type = NodeContentType.TEXT
    >>> content_type.value
    'text'
    
  • Using in node metadata
    >>> from serapeum.core.base.embeddings.types import NodeInfo
    >>> node = NodeInfo(id="node-1", type=NodeContentType.IMAGE)
    >>> node.type
    <NodeContentType.IMAGE: 'image'>
    
  • String comparison
    >>> NodeContentType.TEXT == "text"
    True
    
See Also

NodeInfo: Uses this enum to specify node content type. BaseNode.get_type: Abstract method returning content type string.
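
Because the enum mixes in `str`, its members behave like their string values in standard-library JSON encoding and can be rebuilt from those values. The sketch below shows plain Python behavior with a local copy of the enum. It does not go through NodeInfo, whose `type` field is annotated with EnumNameSerializer and may therefore serialize differently.

```python
import json
from enum import Enum


class NodeContentType(str, Enum):
    """Local copy of the enum so the sketch is self-contained."""

    TEXT = "text"
    IMAGE = "image"
    INDEX = "index"
    DOCUMENT = "document"
    MULTIMODAL = "multimodal"


# A str-based member is encoded as its string value by the json module.
payload = json.dumps({"type": NodeContentType.IMAGE})

# Looking a member up by value restores the enum from the decoded string.
restored = NodeContentType(json.loads(payload)["type"])
print(restored is NodeContentType.IMAGE)

# An unknown value raises ValueError, which makes bad input easy to detect.
try:
    NodeContentType("audio")
except ValueError as exc:
    print(exc)
```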

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
class NodeContentType(str, Enum):
    """Enumeration of content types that can be stored in a node.

    This enum classifies the type of content a node contains, which helps
    downstream components (LLMs, embeddings, parsers) handle the content
    appropriately. String-based enum values enable direct serialization.

    Attributes:
        TEXT: Plain text content, the most common node type.
        IMAGE: Image data or references to images.
        INDEX: Index structures or metadata about other nodes.
        DOCUMENT: Complete document content before chunking.
        MULTIMODAL: Content combining multiple modalities (text + images).

    Examples:
        - Checking content type
            ```python
            >>> from serapeum.core.base.embeddings.types import NodeContentType
            >>> content_type = NodeContentType.TEXT
            >>> content_type.value
            'text'

            ```
        - Using in node metadata
            ```python
            >>> from serapeum.core.base.embeddings.types import NodeInfo
            >>> node = NodeInfo(id="node-1", type=NodeContentType.IMAGE)
            >>> node.type
            <NodeContentType.IMAGE: 'image'>

            ```
        - String comparison
            ```python
            >>> NodeContentType.TEXT == "text"
            True

            ```

    See Also:
        NodeInfo: Uses this enum to specify node content type.
        BaseNode.get_type: Abstract method returning content type string.
    """

    TEXT = "text"
    IMAGE = "image"
    INDEX = "index"
    DOCUMENT = "document"
    MULTIMODAL = "multimodal"

NodeInfo #

Bases: SerializableModel

Lightweight reference to a node with essential identification metadata.

NodeInfo provides a compact representation of a node without its full content, useful for creating references and relationships between nodes. It includes the node's ID, content type, metadata, and optional hash for change detection.

Attributes:

Name Type Description
id str

Unique identifier for the node.

type Annotated[NodeContentType, EnumNameSerializer] | str | None

Content type classification (NodeContentType enum or string).

metadata dict[str, Any]

Arbitrary metadata dictionary for the node.

hash str | None

Optional hash value for detecting content changes.

Examples:

  • Creating a basic node reference
    >>> from serapeum.core.base.embeddings.types import NodeInfo, NodeContentType
    >>> ref = NodeInfo(
    ...     id="doc-456",
    ...     type=NodeContentType.TEXT,
    ...     metadata={"page": 1}
    ... )
    >>> ref.id
    'doc-456'
    
  • Serialization and deserialization
    >>> ref = NodeInfo(id="node-1", type=NodeContentType.DOCUMENT)
    >>> json_str = ref.to_json()
    >>> restored = NodeInfo.from_json(json_str)
    >>> restored.id
    'node-1'
    
  • Using with hash for change detection
    >>> import hashlib
    >>> content = "Sample text"
    >>> content_hash = hashlib.sha256(content.encode()).hexdigest()
    >>> ref = NodeInfo(id="node-2", hash=content_hash)
    >>> ref.hash[:8]
    '3a2c5c49'
    
See Also

BaseNode: Full node implementation that generates NodeInfo. LinkedNodes: Container for node relationships using NodeInfo. SerializableModel: Base class providing serialization methods.
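
The change-detection idea behind the `hash` field can be sketched with the standard library. The dataclass `Ref` is a hypothetical stand-in for NodeInfo, and the helpers `content_hash` and `has_changed` are assumptions made for illustration, not serapeum functions.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Ref:
    """Hypothetical stand-in for NodeInfo."""

    id: str
    metadata: dict[str, Any] = field(default_factory=dict)
    hash: Optional[str] = None


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def has_changed(ref: Ref, current_text: str) -> bool:
    """True when the stored hash is missing or differs from the current content."""
    return ref.hash != content_hash(current_text)


original = "Sample text"
ref = Ref(id="node-2", hash=content_hash(original))

print(has_changed(ref, original))           # same content, same digest
print(has_changed(ref, "Sample text v2"))   # edited content, different digest
```

Comparing digests lets a pipeline skip re-embedding nodes whose content is unchanged, without keeping the full text alongside every reference.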

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
class NodeInfo(SerializableModel):
    """Lightweight reference to a node with essential identification metadata.

    NodeInfo provides a compact representation of a node without its full content,
    useful for creating references and relationships between nodes. It includes
    the node's ID, content type, metadata, and optional hash for change detection.

    Attributes:
        id: Unique identifier for the node.
        type: Content type classification (NodeContentType enum or string).
        metadata: Arbitrary metadata dictionary for the node.
        hash: Optional hash value for detecting content changes.

    Examples:
        - Creating a basic node reference
            ```python
            >>> from serapeum.core.base.embeddings.types import NodeInfo, NodeContentType
            >>> ref = NodeInfo(
            ...     id="doc-456",
            ...     type=NodeContentType.TEXT,
            ...     metadata={"page": 1}
            ... )
            >>> ref.id
            'doc-456'

            ```
        - Serialization and deserialization
            ```python
            >>> ref = NodeInfo(id="node-1", type=NodeContentType.DOCUMENT)
            >>> json_str = ref.to_json()
            >>> restored = NodeInfo.from_json(json_str)
            >>> restored.id
            'node-1'

            ```
        - Using with hash for change detection
            ```python
            >>> import hashlib
            >>> content = "Sample text"
            >>> content_hash = hashlib.sha256(content.encode()).hexdigest()
            >>> ref = NodeInfo(id="node-2", hash=content_hash)
            >>> ref.hash[:8]
            '3a2c5c49'

            ```

    See Also:
        BaseNode: Full node implementation that generates NodeInfo.
        LinkedNodes: Container for node relationships using NodeInfo.
        SerializableModel: Base class providing serialization methods.
    """

    id: str
    type: Annotated[NodeContentType, EnumNameSerializer] | str | None = None
    metadata: dict[str, Any] = Field(default_factory=dict)
    hash: str | None = None

    @classmethod
    def class_name(cls) -> str:
        """Return the class name identifier for serialization.

        Returns:
            Always returns "NodeInfo" as the stable class identifier.

        Examples:
            - Getting class name
                ```python
                >>> from serapeum.core.base.embeddings.types import NodeInfo
                >>> NodeInfo.class_name()
                'NodeInfo'

                ```
        """
        return "NodeInfo"

class_name() classmethod #

Return the class name identifier for serialization.

Returns:

Type Description
str

Always returns "NodeInfo" as the stable class identifier.

Examples:

  • Getting class name
    >>> from serapeum.core.base.embeddings.types import NodeInfo
    >>> NodeInfo.class_name()
    'NodeInfo'
    
Source code in libs\core\src\serapeum\core\base\embeddings\types.py
@classmethod
def class_name(cls) -> str:
    """Return the class name identifier for serialization.

    Returns:
        Always returns "NodeInfo" as the stable class identifier.

    Examples:
        - Getting class name
            ```python
            >>> from serapeum.core.base.embeddings.types import NodeInfo
            >>> NodeInfo.class_name()
            'NodeInfo'

            ```
    """
    return "NodeInfo"

NodeType #

Bases: str, Enum

Relationship types for the links between nodes, as used by the BaseNode class.

Attributes:

Name Type Description
SOURCE

The node is the source document.

PREVIOUS

The node is the previous node in the document.

NEXT

The node is the next node in the document.

PARENT

The node is the parent node in the document.

CHILD

The node is a child node in the document.
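
As an illustration of how these relationship types might be used, the sketch below builds a per-chunk mapping keyed by the enum, similar in shape to the dict passed to LinkedNodes.create but holding plain ids instead of NodeInfo objects. The function `link_chunks` is hypothetical, and the enum is redefined locally so the snippet runs on its own.

```python
from enum import Enum


class NodeType(str, Enum):
    """Local copy of the enum so the sketch is self-contained."""

    SOURCE = "source"
    PREVIOUS = "previous"
    NEXT = "next"
    PARENT = "parent"
    CHILD = "child"


def link_chunks(source_id: str, chunk_ids: list[str]) -> dict[str, dict[NodeType, str]]:
    """Map each chunk id to its source, previous, and next ids (hypothetical)."""
    links: dict[str, dict[NodeType, str]] = {}
    for i, chunk_id in enumerate(chunk_ids):
        entry = {NodeType.SOURCE: source_id}
        if i > 0:  # the first chunk has no predecessor
            entry[NodeType.PREVIOUS] = chunk_ids[i - 1]
        if i < len(chunk_ids) - 1:  # the last chunk has no successor
            entry[NodeType.NEXT] = chunk_ids[i + 1]
        links[chunk_id] = entry
    return links


links = link_chunks("doc-1", ["chunk-1", "chunk-2", "chunk-3"])
print(links["chunk-2"][NodeType.PREVIOUS])  # prints: chunk-1
print(NodeType.NEXT in links["chunk-3"])    # prints: False
```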

Source code in libs\core\src\serapeum\core\base\embeddings\types.py
class NodeType(str, Enum):
    """
    Relationship types for the links between nodes, as used by the `BaseNode` class.

    Attributes:
        SOURCE: The node is the source document.
        PREVIOUS: The node is the previous node in the document.
        NEXT: The node is the next node in the document.
        PARENT: The node is the parent node in the document.
        CHILD: The node is a child node in the document.

    """

    SOURCE = "source"
    PREVIOUS = "previous"
    NEXT = "next"
    PARENT = "parent"
    CHILD = "child"