loaders
This module defines functions for loading PDF documents and some utilities to manage loaders.
LoaderLike
Bases: Protocol
A protocol that all document loaders should follow.
__call__
Load a PDF document.
Parameters:
- file_path (str) – The path to the PDF file.
- start_page (int, default: 0) – The starting (0-based) page number in the PDF to begin reading from.
- end_page (int | None, default: None) – The ending (0-based) page number to stop reading at (non-inclusive). When None, all pages in the PDF are read.
Returns:
LoaderManager
Bases: UserDict[str, LoaderLike]
Manager that maintains a registry of all document loaders.
It behaves like a dictionary, where each document loader is registered to a name.
Examples:
from bookacle.loaders import LoaderManager, register_loader
from langchain_core.documents import Document
manager = LoaderManager()
@register_loader(name="custom_loader", manager=manager)
def doc_loader(file_path: str, start_page: int = 0, end_page: int | None = None) -> list[Document]:
    ...
print(manager["custom_loader"] is doc_loader)
True
register_loader
register_loader(
name: str, manager: LoaderManager | None = None
) -> Callable[[LoaderLike], LoaderLike]
A decorator that registers a loader function with the loader manager.
Parameters:
- name (str) – The name to map the loader function to.
- manager (LoaderManager | None, default: None) – The manager to register the function with. If None, LOADER_MANAGER is used.
pymupdf4llm_loader
pymupdf4llm_loader(
file_path: str,
start_page: int = 0,
end_page: int | None = None,
) -> list[Document]
Document loader which uses pymupdf4llm to load the PDF as Markdown.
Can be accessed using the name 'pymupdf4llm' via the default loader manager.
It implements the LoaderLike protocol.