Skip to content

clustering ¤

Clustering module for the tree structure.

ClusteringBackendLike ¤

Bases: Protocol

A protocol that defines the interface a clustering backend should implement.

cluster ¤

cluster(
    embeddings: NDArray[float64], *args, **kwargs
) -> tuple[dict[int, list[int]], dict[int, list[int]]]

Cluster the embeddings.

Parameters:

  • embeddings (NDArray[float64]) –

    The embeddings to cluster.

  • *args

    Additional arguments to pass to the clustering function.

  • **kwargs

    Additional keyword arguments to pass to the clustering function.

Returns:

ClusteringFunctionLike ¤

Bases: Protocol

A protocol that defines the interface a clustering function should implement.

__call__ ¤

__call__(
    nodes: list[Node],
    tokenizer: TokenizerLike,
    clustering_backend: ClusteringBackendLike | None = None,
    max_length_in_cluster: int = 3500,
    reduction_dimension: int = 10,
    *args,
    **kwargs
) -> list[list[Node]]

Cluster nodes using the given clustering backend.

Parameters:

  • nodes (list[Node]) –

    The nodes to cluster.

  • tokenizer (TokenizerLike) –

    The tokenizer to use to calculate the total length of text in each cluster.

  • clustering_backend (ClusteringBackendLike | None, default: None ) –

    The clustering backend to use.

  • max_length_in_cluster (int, default: 3500 ) –

    The maximum length of text in a cluster.

  • reduction_dimension (int, default: 10 ) –

    The dimension to reduce the embeddings to before clustering.

  • *args

    Additional arguments to pass to the clustering function.

  • **kwargs

    Additional keyword arguments to pass to the clustering function.

Returns:

GMMClusteringBackend ¤

GMMClusteringBackend(
    reduction_dim: int,
    max_clusters: int = 50,
    random_state: int = 42,
    n_neighbors_global: int | None = None,
    n_neighbors_local: int = 10,
    n_clusters_global: int | None = None,
    n_clusters_local: int | None = None,
    umap_metric: str = "cosine",
    umap_low_memory: bool = True,
)

A Gaussian Mixture Model (GMM) clustering backend.

It implements the ClusteringBackendLike protocol.

Attributes:

  • reduction_dim (int) –

    The dimension to reduce the embeddings to before clustering.

  • max_clusters (int) –

    The maximum number of clusters to use.

  • random_state (int) –

    Random state for reproducibility.

  • n_neighbors_global (int | None) –

    The number of neighbors to use for global clustering.

  • n_neighbors_local (int | None) –

    The number of neighbors to use for local clustering.

  • n_clusters_global (int | None) –

    The number of clusters to use for global clustering.

  • n_clusters_local (int | None) –

    The number of clusters to use for local clustering.

  • umap_metric (str) –

    The metric to use for UMAP.

  • umap_low_memory (bool) –

    Whether to use low memory mode for UMAP.

Parameters:

  • reduction_dim (int) –

    The dimension to reduce the embeddings to before clustering.

  • max_clusters (int, default: 50 ) –

    The maximum number of clusters to use.

  • random_state (int, default: 42 ) –

    Random state for reproducibility.

  • n_neighbors_global (int | None, default: None ) –

    The number of neighbors to use for global clustering. When None, it is calculated using Bayesian Information Criterion (BIC).

  • n_neighbors_local (int, default: 10 ) –

    The number of neighbors to use for local clustering. When None, it is calculated using Bayesian Information Criterion (BIC).

  • n_clusters_global (int | None, default: None ) –

    The number of clusters to use for global clustering.

  • n_clusters_local (int | None, default: None ) –

    The number of clusters to use for local clustering.

  • umap_metric (str, default: 'cosine' ) –

    The metric to use for UMAP.

  • umap_low_memory (bool, default: True ) –

    Whether to use low memory mode for UMAP.

get_optimal_clusters_count ¤

get_optimal_clusters_count(
    embeddings: NDArray[float64],
) -> int

Get the optimal number of clusters using the Bayesian Information Criterion (BIC).

The method fits multiple Gaussian Mixture Models (GMM) to the embeddings and calculates the BIC for each. The number of clusters with the lowest BIC score is selected as the optimal number.

Parameters:

Returns:

  • int

    The optimal number of clusters.

get_clusters ¤

get_clusters(
    embeddings: NDArray[float64], n_clusters: int
) -> NDArray[int64]

Fit the GMM model to embeddings and return cluster assignments.

Parameters:

  • embeddings (NDArray[float64]) –

    The embeddings to cluster.

  • n_clusters (int) –

    The number of clusters to use.

Returns:

reduce_and_cluster_embeddings ¤

reduce_and_cluster_embeddings(
    embeddings: NDArray[float64],
    n_components: int,
    n_neighbors: int,
    n_clusters: int | None = None,
) -> NDArray[int64]

Reduce the dimensionality of the embeddings using UMAP and cluster them using GMM.

It uses umap_reduce_embeddings() to reduce the dimensionality of the embeddings and get_clusters() to cluster them.

Generally speaking, this method should be used instead of calling get_clusters() directly.

Parameters:

  • embeddings (NDArray[float64]) –

    The embeddings to cluster.

  • n_components (int) –

    The number of components in the reduced space.

  • n_neighbors (int) –

    The number of neighbors to use for UMAP.

  • n_clusters (int | None, default: None ) –

    The number of clusters to use. When None, the optimal number of clusters is calculated using get_optimal_clusters_count().

Returns:

cluster_locally ¤

cluster_locally(
    embeddings: NDArray[float64],
    global_cluster_indices: NDArray[int64],
    n_clusters: int | None = None,
    n_neighbors: int = 10,
) -> list[NDArray[int64]]

Cluster the embeddings of a global cluster locally. In other words, create new clusters from the embeddings of a global cluster.

If the number of embeddings in the global cluster is less than or equal to reduction_dim, the global cluster is returned as is in a singleton list.

Parameters:

  • embeddings (NDArray[float64]) –

    The overall embeddings.

  • global_cluster_indices (NDArray[int64]) –

    The indices of the embeddings belonging to the global cluster.

  • n_clusters (int | None, default: None ) –

    The number of clusters to output. When None, the optimal number of clusters is calculated using BIC.

  • n_neighbors (int, default: 10 ) –

    The number of neighbors to use for UMAP.

Returns:

Examples:

import numpy as np
from bookacle.tree.clustering import GMMClusteringBackend

backend = GMMClusteringBackend(reduction_dim=10)
embeddings = np.random.rand(100, 768)
indices = np.random.choice(100, 30)
clusters = backend.cluster_locally(
    embeddings=embeddings,
    global_cluster_indices=indices,
    n_clusters=5,
    n_neighbors=10,
)
print(clusters)
[array([65, 69, 37, 76, 51, 35]), array([95, 66, 68,  9, 83]), array([44, 59, 59, 97,  2, 23, 60, 60, 10]), array([39, 39, 58, 73, 34,  6]), array([14, 77, 61, 89])]

cluster ¤

cluster(
    embeddings: NDArray[float64],
) -> tuple[dict[int, list[int]], dict[int, list[int]]]

Cluster the embeddings.

The clustering is done as follows
  • Global clustering: The embeddings are reduced and clustered globally. That is, the entire set of embeddings is clustered.
  • Local clustering: The embeddings of each global cluster are reduced and clustered.
  • At the end, all local clusters are aggregated into a single result.

Parallel from joblib is used to parallelize the clustering of each global cluster.

Parameters:

Returns:

Examples:

import numpy as np
from bookacle.tree.clustering import GMMClusteringBackend

backend = GMMClusteringBackend(reduction_dim=10, n_clusters_global=3)
embeddings = np.random.rand(10, 768)
emb_to_clusters, clusters_to_emb = backend.cluster(embeddings=embeddings)
print(clusters_to_emb)
defaultdict(<class 'list'>, {0: [3, 8], 1: [1, 4, 7, 9], 2: [0, 2, 5, 6]})

umap_reduce_embeddings ¤

umap_reduce_embeddings(
    embeddings: NDArray[float64],
    n_components: int,
    neighbors: int = 10,
    metric: str = "cosine",
    low_memory: bool = True,
) -> NDArray[float64]

Reduce the dimensionality of the embeddings using UMAP.

Parameters:

  • embeddings (NDArray[float64]) –

    The embeddings to reduce.

  • n_components (int) –

    The number of components in the reduced space.

  • neighbors (int, default: 10 ) –

    The number of neighbors to use for UMAP.

  • metric (str, default: 'cosine' ) –

    The metric to use for UMAP.

  • low_memory (bool, default: True ) –

    Whether to use low memory mode for UMAP.

Returns:

raptor_clustering ¤

raptor_clustering(
    nodes: list[Node],
    tokenizer: TokenizerLike,
    clustering_backend: ClusteringBackendLike | None = None,
    max_length_in_cluster: int = 3500,
    reduction_dimension: int = 10,
) -> list[list[Node]]

Cluster nodes using RAPTOR clustering.

It implements the ClusteringFunctionLike protocol.

To cluster the nodes
  • The nodes are clustered using the given clustering backend.
  • For each cluster:
    • If the cluster has only one node or the total length of text in the cluster is less than the maximum length, the cluster is kept as is.
    • Otherwise, the cluster is recursively clustered.