splitters ¤

This module implements document splitters for chunking documents into smaller pieces.

DocumentSplitterLike ¤

Bases: Protocol

A protocol that defines the methods that a document splitter should implement.

__call__ ¤

__call__(
    documents: list[Document],
    chunk_size: int = 100,
    chunk_overlap: int = 0,
) -> list[Document]

Split a list of documents into smaller chunks.

Parameters:

  • documents (list[Document]) –

    The list of documents to be split.

  • chunk_size (int, default: 100) –

    The maximum size of each chunk.

  • chunk_overlap (int, default: 0) –

    The overlap between consecutive chunks.

Returns:

  • list[Document]

    The list of documents after splitting into chunks.
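
A minimal sketch of a class satisfying this protocol. Because DocumentSplitterLike is a Protocol, structural typing applies and no subclassing is needed. The Document stand-in and its content attribute below are illustrative assumptions, not part of this module's API.

from dataclasses import dataclass


@dataclass
class Document:  # stand-in for the library's Document type (assumed shape)
    content: str


class FixedSizeSplitter:
    """Naive character-window splitter satisfying DocumentSplitterLike."""

    def __call__(
        self,
        documents: list[Document],
        chunk_size: int = 100,
        chunk_overlap: int = 0,
    ) -> list[Document]:
        # Advance by chunk_size minus the overlap, so consecutive
        # windows share chunk_overlap characters.
        step = max(chunk_size - chunk_overlap, 1)
        chunks: list[Document] = []
        for doc in documents:
            for start in range(0, len(doc.content), step):
                chunks.append(Document(content=doc.content[start : start + chunk_size]))
        return chunks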

HuggingFaceTextSplitter ¤

HuggingFaceTextSplitter(
    tokenizer: PreTrainedTokenizerBase,
    separators: list[str] | None = None,
)

Text-based document splitter which uses a HuggingFace tokenizer to calculate the length of chunks when splitting.

It uses LangChain’s RecursiveCharacterTextSplitter and expects the documents to contain plain text.

It implements the DocumentSplitterLike protocol.

Attributes:

  • tokenizer

    The HuggingFace tokenizer to use for calculating length.

  • separators

    The list of separators to use for splitting the document.

Parameters:

  • tokenizer (PreTrainedTokenizerBase) –

    The HuggingFace tokenizer to use for calculating length.

  • separators (list[str] | None, default: None) –

    The list of separators to use. When None, the default separators are used: ["\n\n", "\n", ".", "!", "?"].
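
A hedged usage sketch. The transformers call is standard; the splitter's import path is omitted, and the Document stand-in from the protocol sketch above is an illustrative assumption. Since length is measured with the tokenizer, chunk_size and chunk_overlap count tokens here, not characters.

from transformers import AutoTokenizer

# Any pretrained tokenizer works; the checkpoint name is only an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# separators is optional; when omitted, the defaults listed above apply.
splitter = HuggingFaceTextSplitter(tokenizer, separators=["\n\n", "\n", "."])

# Document: the illustrative stand-in defined in the protocol sketch above.
docs = [Document(content="First paragraph.\n\nSecond paragraph. It has two sentences.")]
chunks = splitter(docs, chunk_size=64, chunk_overlap=8)  # sizes in tokens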

HuggingFaceMarkdownSplitter ¤

HuggingFaceMarkdownSplitter(
    tokenizer: PreTrainedTokenizerBase,
)

Markdown-based document splitter which uses a HuggingFace tokenizer to calculate the length of chunks when splitting.

It uses LangChain’s MarkdownTextSplitter and expects the documents to contain Markdown.

It implements the DocumentSplitterLike protocol.

Attributes:

  • tokenizer

    The HuggingFace tokenizer to use for calculating length.

Parameters:

  • tokenizer (PreTrainedTokenizerBase) –

    The HuggingFace tokenizer to use for calculating length.
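
A usage sketch under the same assumptions as above (Document is the illustrative stand-in; the import path is omitted). LangChain's MarkdownTextSplitter prefers splits along Markdown structure such as headings, so the input should be Markdown text.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
splitter = HuggingFaceMarkdownSplitter(tokenizer)

# Document: the illustrative stand-in defined in the protocol sketch above.
docs = [Document(content="# Title\n\nIntro text.\n\n## Section\n\nBody text.")]
chunks = splitter(docs, chunk_size=64, chunk_overlap=0)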

RaptorSplitter ¤

RaptorSplitter(
    tokenizer: TokenizerLike,
    *,
    separators: list[str] | None = None
)

Document splitter which implements the chunking technique as defined in the RAPTOR paper.

It expects a tokenizer implementing the TokenizerLike protocol, which it uses to calculate the length of chunks.

For more details, see: https://github.com/parthsarthi03/raptor/blob/7da1d48a7e1d7dec61a63c9d9aae84e2dfaa5767/raptor/utils.py#L22.

It implements the DocumentSplitterLike protocol.

Attributes:

  • tokenizer

    Tokenizer to use for calculating chunk lengths.

  • separators

    The list of separators to use for splitting the document.

Parameters:

  • tokenizer (TokenizerLike) –

    Tokenizer to use for calculating chunk lengths.

  • separators (list[str] | None, default: None) –

    The list of separators to use. When None, the default separators are used: [".", "!", "?", "\n"].
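
A hedged sketch. TokenizerLike is assumed here to amount to an object exposing encode(text) -> list[int], mirroring the tiktoken usage in the RAPTOR implementation linked above; that assumption, and the Document stand-in from the protocol sketch, are for illustration only.

import tiktoken

# An assumption: any tokenizer with an encode method should satisfy
# TokenizerLike; tiktoken's cl100k_base is used as a concrete example.
tokenizer = tiktoken.get_encoding("cl100k_base")

# separators is keyword-only; these are the documented defaults.
splitter = RaptorSplitter(tokenizer, separators=[".", "!", "?", "\n"])

docs = [Document(content="One sentence. Another sentence! A third?\nA new line.")]
chunks = splitter(docs, chunk_size=100, chunk_overlap=0)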

split_single_document ¤

split_single_document(
    document: Document, chunk_size: int, chunk_overlap: int
) -> list[Document]

Split a single document into chunks.

Parameters:

  • document (Document) –

    Document to split into chunks.

  • chunk_size (int) –

    Maximum size of each chunk.

  • chunk_overlap (int) –

    Overlap between consecutive chunks.

Returns:

  • list[Document]

    The list of chunks produced from the document.
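
A short call sketch for this module-level helper, reusing the illustrative Document stand-in (and its assumed content attribute) from the protocol sketch above.

# Split one document into windows of at most 50 units with 10 units of overlap.
doc = Document(content="A long passage of text to be cut into overlapping windows for retrieval.")
chunks = split_single_document(doc, chunk_size=50, chunk_overlap=10)

for chunk in chunks:
    print(len(chunk.content), repr(chunk.content))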