loaders
This module defines functions for loading PDF documents and some utilities to manage loaders.
LoaderLike
Bases: Protocol
A protocol that all document loaders should follow.
__call__
Load a PDF document.
Parameters:
- file_path (str) – The path to the PDF file.
- start_page (int, default: 0) – The 0-based page number at which to start reading.
- end_page (int | None, default: None) – The 0-based page number at which to stop reading (exclusive). When None, all remaining pages in the PDF are read.
Returns:
- list[Document] – The loaded documents.
LoaderManager
Bases: UserDict[str, LoaderLike]
Manager to maintain registry of all document loaders.
It behaves like a dictionary, where each document loader is registered to a name.
Examples:
from bookacle.loaders import LoaderManager, register_loader
from langchain_core.documents import Document

manager = LoaderManager()

@register_loader(name="custom_loader", manager=manager)
def doc_loader(file_path: str, start_page: int = 0, end_page: int | None = None) -> list[Document]:
    ...

print(manager["custom_loader"] is doc_loader)
True
register_loader
register_loader(
name: str, manager: LoaderManager | None = None
) -> Callable[[LoaderLike], LoaderLike]
A decorator that registers a loader function with the loader manager.
Parameters:
- name (str) – The name to map the loader function to.
- manager (LoaderManager | None, default: None) – The manager to register the function with. If None, LOADER_MANAGER is used.
pymupdf4llm_loader
pymupdf4llm_loader(
file_path: str,
start_page: int = 0,
end_page: int | None = None,
) -> list[Document]
Document loader which uses pymupdf4llm to load the PDF as Markdown.
Can be accessed using the name 'pymupdf4llm' via the default loader manager.
It implements the LoaderLike protocol.