splitters
This module implements document splitters for chunking documents.
DocumentSplitterLike

Bases: Protocol

A protocol that defines the methods a document splitter must implement.
__call__

__call__(
    documents: list[Document],
    chunk_size: int = 100,
    chunk_overlap: int = 0,
) -> list[Document]
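A minimal sketch of a class conforming to this protocol. The `Document` class here is a stand-in assumption (a single `text` field); the splitting logic is a toy word-count strategy, not any of the splitters below.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Stand-in for the library's Document type (assumed to carry plain text)."""
    text: str


class WordCountSplitter:
    """Toy splitter satisfying DocumentSplitterLike: chunks by word count."""

    def __call__(
        self,
        documents: list[Document],
        chunk_size: int = 100,
        chunk_overlap: int = 0,
    ) -> list[Document]:
        chunks: list[Document] = []
        step = max(chunk_size - chunk_overlap, 1)
        for doc in documents:
            words = doc.text.split()
            for start in range(0, len(words), step):
                window = words[start : start + chunk_size]
                if window:
                    chunks.append(Document(text=" ".join(window)))
        return chunks
```

Because the protocol is structural, any callable with this signature is accepted wherever a `DocumentSplitterLike` is expected; no explicit subclassing is required.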
HuggingFaceTextSplitter

HuggingFaceTextSplitter(
    tokenizer: PreTrainedTokenizerBase,
    separators: list[str] | None = None,
)
A text-based document splitter that uses a HuggingFace tokenizer to calculate chunk length when splitting.
It wraps Langchain's RecursiveCharacterTextSplitter and expects the documents to be plain text.
It implements the DocumentSplitterLike protocol.
Attributes:

- tokenizer – The HuggingFace tokenizer to use for calculating length.
- separators – The list of separators to use for splitting the document.

Parameters:

- tokenizer (PreTrainedTokenizerBase) – The HuggingFace tokenizer to use for calculating length.
- separators (list[str] | None, default: None) – The list of separators to use. When None, the default separators are used: ["\n\n", "\n", ".", "!", "?"].
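To illustrate the recursive-splitting idea behind this class, here is a simplified, self-contained sketch. A whitespace word count stands in for the HuggingFace tokenizer's token count, the recursion is hand-rolled rather than delegated to Langchain, and the merge-back of small pieces that RecursiveCharacterTextSplitter performs is omitted.

```python
def recursive_split(
    text: str,
    separators: list[str],
    chunk_size: int,
    length_fn=lambda s: len(s.split()),  # stand-in for tokenizer-based length
) -> list[str]:
    """Recursively split `text` on the first separator that produces
    progress, until every piece is at most `chunk_size` "tokens" long."""
    if length_fn(text) <= chunk_size or not separators:
        return [text] if text.strip() else []
    sep, rest = separators[0], separators[1:]
    parts = [p for p in text.split(sep) if p.strip()]
    if len(parts) <= 1:
        # This separator did not divide the text; fall back to the next one.
        return recursive_split(text, rest, chunk_size, length_fn)
    chunks: list[str] = []
    for part in parts:
        chunks.extend(recursive_split(part, rest, chunk_size, length_fn))
    return chunks
```

The separator order matters: coarse boundaries ("\n\n") are tried first, so chunks follow paragraph structure when possible and only fall back to sentence-level separators when a paragraph is too long.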
HuggingFaceMarkdownSplitter

HuggingFaceMarkdownSplitter(
    tokenizer: PreTrainedTokenizerBase,
)
A Markdown-based document splitter that uses a HuggingFace tokenizer to calculate chunk length when splitting.
It wraps Langchain's MarkdownTextSplitter and expects the documents to be Markdown.
It implements the DocumentSplitterLike protocol.
Attributes:

- tokenizer – The HuggingFace tokenizer to use for calculating length.

Parameters:

- tokenizer (PreTrainedTokenizerBase) – The HuggingFace tokenizer to use to calculate length.
RaptorSplitter

RaptorSplitter(
    tokenizer: TokenizerLike,
    *,
    separators: list[str] | None = None,
)
A document splitter that implements the chunking technique defined in the RAPTOR paper.
It expects a tokenizer implementing the TokenizerLike protocol to calculate the length of chunks.
For more details, see: https://github.com/parthsarthi03/raptor/blob/7da1d48a7e1d7dec61a63c9d9aae84e2dfaa5767/raptor/utils.py#L22.
It implements the DocumentSplitterLike protocol.
Attributes:

- tokenizer – Tokenizer to use for calculating chunk lengths.
- separators – The list of separators to use for splitting the document.

Parameters:

- tokenizer (TokenizerLike) – Tokenizer to use for calculating chunk lengths.
- separators (list[str] | None, default: None) – The list of separators to use. When None, the default separators are used: [".", "!", "?", "\n"].