splitters ¤

This module implements document splitters for chunking documents into smaller pieces.

DocumentSplitterLike ¤

Bases: Protocol

A protocol that defines the methods that a document splitter should implement.

__call__ ¤

__call__(
    documents: list[Document],
    chunk_size: int = 100,
    chunk_overlap: int = 0,
) -> list[Document]

Split a list of documents into smaller chunks.

Parameters:

  • documents (list[Document]) –

    The list of documents to be split.

  • chunk_size (int, default: 100) –

    The maximum size of each chunk.

  • chunk_overlap (int, default: 0) –

    The overlap between consecutive chunks.

Returns:

  • list[Document]

    The list of documents after splitting into chunks.
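
A minimal sketch of a class satisfying this protocol. Because DocumentSplitterLike is a Protocol, structural typing applies and no subclassing is needed. The Document stand-in and its content attribute below are illustrative assumptions, not part of this module's API.

from dataclasses import dataclass


@dataclass
class Document:  # stand-in for the library's Document type (assumed shape)
    content: str


class FixedSizeSplitter:
    """Naive character-window splitter satisfying DocumentSplitterLike."""

    def __call__(
        self,
        documents: list[Document],
        chunk_size: int = 100,
        chunk_overlap: int = 0,
    ) -> list[Document]:
        # Advance by chunk_size minus the overlap, so consecutive
        # windows share chunk_overlap characters.
        step = max(chunk_size - chunk_overlap, 1)
        chunks: list[Document] = []
        for doc in documents:
            for start in range(0, len(doc.content), step):
                chunks.append(Document(content=doc.content[start : start + chunk_size]))
        return chunks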

HuggingFaceTextSplitter ¤

HuggingFaceTextSplitter(
    tokenizer: PreTrainedTokenizerBase,
    separators: list[str] | None = None,
)

Text-based document splitter which uses a HuggingFace tokenizer to calculate the length of chunks when splitting.

It uses LangChain’s RecursiveCharacterTextSplitter and expects the documents to contain plain text.

It implements the DocumentSplitterLike protocol.

Attributes:

  • tokenizer

    The HuggingFace tokenizer to use for calculating length.

  • separators

    The list of separators to use for splitting the document.

Parameters:

  • tokenizer (PreTrainedTokenizerBase) –

    The HuggingFace tokenizer to use for calculating length.

  • separators (list[str] | None, default: None) –

    The list of separators to use. When None, the default separators are used: ["\n\n", "\n", ".", "!", "?"].
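
A hedged usage sketch. The transformers call is standard; the splitter's import path is omitted, and the Document stand-in from the protocol sketch above is an illustrative assumption. Since length is measured with the tokenizer, chunk_size and chunk_overlap count tokens here, not characters.

from transformers import AutoTokenizer

# Any pretrained tokenizer works; the checkpoint name is only an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# separators is optional; when omitted, the defaults listed above apply.
splitter = HuggingFaceTextSplitter(tokenizer, separators=["\n\n", "\n", "."])

# Document: the illustrative stand-in defined in the protocol sketch above.
docs = [Document(content="First paragraph.\n\nSecond paragraph. It has two sentences.")]
chunks = splitter(docs, chunk_size=64, chunk_overlap=8)  # sizes in tokens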

HuggingFaceMarkdownSplitter ¤

HuggingFaceMarkdownSplitter(
    tokenizer: PreTrainedTokenizerBase,
)

Markdown-based document splitter which uses a HuggingFace tokenizer to calculate the length of chunks when splitting.

It uses LangChain’s MarkdownTextSplitter and expects the documents to contain Markdown.

It implements the DocumentSplitterLike protocol.

Attributes:

  • tokenizer

    The HuggingFace tokenizer to use for calculating length.

Parameters:

  • tokenizer (PreTrainedTokenizerBase) –

    The HuggingFace tokenizer to use for calculating length.
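
A usage sketch under the same assumptions as above (Document is the illustrative stand-in; the import path is omitted). LangChain's MarkdownTextSplitter prefers splits along Markdown structure such as headings, so the input should be Markdown text.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
splitter = HuggingFaceMarkdownSplitter(tokenizer)

# Document: the illustrative stand-in defined in the protocol sketch above.
docs = [Document(content="# Title\n\nIntro text.\n\n## Section\n\nBody text.")]
chunks = splitter(docs, chunk_size=64, chunk_overlap=0)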

RaptorSplitter ¤

RaptorSplitter(
    tokenizer: TokenizerLike,
    *,
    separators: list[str] | None = None
)

Document splitter which implements the chunking technique as defined in the RAPTOR paper.

It expects a tokenizer implementing the TokenizerLike protocol, which it uses to calculate the length of chunks.

For more details, see: https://github.com/parthsarthi03/raptor/blob/7da1d48a7e1d7dec61a63c9d9aae84e2dfaa5767/raptor/utils.py#L22.

It implements the DocumentSplitterLike protocol.

Attributes:

  • tokenizer

    Tokenizer to use for calculating chunk lengths.

  • separators

    The list of separators to use for splitting the document.

Parameters:

  • tokenizer (TokenizerLike) –

    Tokenizer to use for calculating chunk lengths.

  • separators (list[str] | None, default: None) –

    The list of separators to use. When None, the default separators are used: [".", "!", "?", "\n"].
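
A hedged sketch. TokenizerLike is assumed here to amount to an object exposing encode(text) -> list[int], mirroring the tiktoken usage in the RAPTOR implementation linked above; that assumption, and the Document stand-in from the protocol sketch, are for illustration only.

import tiktoken

# An assumption: any tokenizer with an encode method should satisfy
# TokenizerLike; tiktoken's cl100k_base is used as a concrete example.
tokenizer = tiktoken.get_encoding("cl100k_base")

# separators is keyword-only; these are the documented defaults.
splitter = RaptorSplitter(tokenizer, separators=[".", "!", "?", "\n"])

docs = [Document(content="One sentence. Another sentence! A third?\nA new line.")]
chunks = splitter(docs, chunk_size=100, chunk_overlap=0)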

split_single_document ¤

split_single_document(
    document: Document, chunk_size: int, chunk_overlap: int
) -> list[Document]

Split a single document into chunks.

Parameters:

  • document (Document) –

    Document to split into chunks.

  • chunk_size (int) –

    Maximum size of each chunk.

  • chunk_overlap (int) –

    Overlap between consecutive chunks.

Returns:

  • list[Document]

    The list of chunks produced from the document.
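
A short call sketch for this module-level helper, reusing the illustrative Document stand-in (and its assumed content attribute) from the protocol sketch above.

# Split one document into windows of at most 50 units with 10 units of overlap.
doc = Document(content="A long passage of text to be cut into overlapping windows for retrieval.")
chunks = split_single_document(doc, chunk_size=50, chunk_overlap=10)

for chunk in chunks:
    print(len(chunk.content), repr(chunk.content))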