loaders ¤

This module defines functions for loading PDF documents and some utilities to manage loaders.

LOADER_MANAGER module-attribute ¤

LOADER_MANAGER = LoaderManager()

The default loader manager. register_loader registers loaders here when no manager is passed.

LoaderLike ¤

Bases: Protocol

A protocol that all document loaders should follow.

__call__ ¤

__call__(
    file_path: str,
    start_page: int = 0,
    end_page: int | None = None,
) -> list[Document]

Load a PDF document.

Parameters:

  • file_path (str) –

    The path to the PDF file.

  • start_page (int, default: 0 ) –

    The starting (0-based) page number in the PDF to begin reading from.

  • end_page (int | None, default: None ) –

    The (0-based) page number at which to stop reading (exclusive). When None, reading continues through the last page of the PDF.

Returns:

  • list[Document] –

    The documents loaded from the selected page range.
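The page-range semantics above (0-based start_page, exclusive end_page, None meaning "read to the end") can be illustrated with a small sketch. It does not use bookacle or a real PDF: the Document dataclass is a stand-in for langchain_core.documents.Document, and FAKE_PDFS and fake_loader are hypothetical names introduced only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Document:
    """Stand-in for langchain_core.documents.Document (assumption)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class LoaderLike(Protocol):
    """Mirrors the call signature documented above."""

    def __call__(
        self, file_path: str, start_page: int = 0, end_page: int | None = None
    ) -> list[Document]: ...


# Hypothetical in-memory "PDF": one string per page, keyed by a fake path.
FAKE_PDFS = {"book.pdf": ["page zero", "page one", "page two", "page three"]}


def fake_loader(
    file_path: str, start_page: int = 0, end_page: int | None = None
) -> list[Document]:
    """Return one Document per page in the half-open range [start_page, end_page)."""
    pages = FAKE_PDFS[file_path]
    # A slice stop of None reads through the last page.
    selected = pages[start_page:end_page]
    return [
        Document(
            page_content=text,
            metadata={"source": file_path, "page": start_page + offset},
        )
        for offset, text in enumerate(selected)
    ]


# A plain function with a matching signature satisfies the protocol structurally.
loader: LoaderLike = fake_loader

docs = loader("book.pdf", start_page=1, end_page=3)
print([doc.page_content for doc in docs])
```

With start_page=1 and end_page=3, the sketch returns pages 1 and 2 only; page 3 is excluded because end_page is exclusive.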

LoaderManager ¤

Bases: UserDict[str, LoaderLike]

Manager that maintains a registry of all document loaders.

It behaves like a dictionary, where each document loader is registered to a name.

Examples:

from bookacle.loaders import LoaderManager, register_loader
from langchain_core.documents import Document

manager = LoaderManager()

@register_loader(name="custom_loader", manager=manager)
def doc_loader(file_path: str, start_page: int = 0, end_page: int | None = None) -> list[Document]:
    ...

print(manager["custom_loader"] is doc_loader)  # prints: True

enum property ¤

enum: Enum

Obtain the names of the document loaders as an Enum.

Useful in the CLI, where the Enum lets --help list the available loader names.
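One way such a property could be built is with the functional Enum API from the standard library. The sketch below is an assumption about the approach, not bookacle's actual implementation; SketchManager and the Enum name LoaderChoices are hypothetical.

```python
from __future__ import annotations

from collections import UserDict
from enum import Enum


class SketchManager(UserDict):
    """Hypothetical stand-in for LoaderManager, not bookacle's actual code."""

    @property
    def enum(self) -> type[Enum]:
        # Functional Enum API: each registered name becomes a member
        # whose value is that same name.
        return Enum("LoaderChoices", {name: name for name in self.data})


manager = SketchManager()
# Stub loaders; only the registered names matter for this example.
manager["pymupdf"] = lambda file_path, start_page=0, end_page=None: []
manager["pymupdf4llm"] = lambda file_path, start_page=0, end_page=None: []

choices = manager.enum
print([member.value for member in choices])
```

Because the Enum is rebuilt from the registry on each access, loaders registered later show up the next time the property is read.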

register_loader ¤

register_loader(
    name: str, manager: LoaderManager | None = None
) -> Callable[[LoaderLike], LoaderLike]

A decorator that registers a loader function with the loader manager.

Parameters:

  • name (str) –

    The name to map the loader function to.

  • manager (LoaderManager | None, default: None ) –

    The manager to register the function with. If None, LOADER_MANAGER is used.
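The decorator behaviour described above (register under a name, fall back to a default manager when none is given) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not bookacle's source: DEFAULT_MANAGER plays the role of LOADER_MANAGER, and register_loader_sketch and stub_loader are hypothetical names.

```python
from __future__ import annotations

from collections import UserDict
from typing import Callable

Loader = Callable[..., list]

# Plays the role of LOADER_MANAGER in this sketch (assumption).
DEFAULT_MANAGER: UserDict = UserDict()


def register_loader_sketch(
    name: str, manager: UserDict | None = None
) -> Callable[[Loader], Loader]:
    """Return a decorator that stores the decorated loader under `name`."""

    def decorator(func: Loader) -> Loader:
        # Compare against None explicitly: an empty UserDict is falsy,
        # so `manager or DEFAULT_MANAGER` would pick the wrong registry.
        target = DEFAULT_MANAGER if manager is None else manager
        target[name] = func
        return func  # the loader itself is returned unchanged

    return decorator


@register_loader_sketch(name="stub")
def stub_loader(file_path: str, start_page: int = 0, end_page: int | None = None) -> list:
    return []


print(DEFAULT_MANAGER["stub"] is stub_loader)
```

Returning the original function means the decorated name can still be called directly, in addition to being looked up through the manager.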

pymupdf4llm_loader ¤

pymupdf4llm_loader(
    file_path: str,
    start_page: int = 0,
    end_page: int | None = None,
) -> list[Document]

Document loader which uses pymupdf4llm to load the PDF as Markdown.

Registered under the name 'pymupdf4llm' in the default loader manager (LOADER_MANAGER).

It implements the LoaderLike protocol.

pymupdf_loader ¤

pymupdf_loader(
    file_path: str,
    start_page: int = 0,
    end_page: int | None = None,
) -> list[Document]

Document loader which uses pymupdf to load the PDF as text.

Registered under the name 'pymupdf' in the default loader manager (LOADER_MANAGER).

It implements the LoaderLike protocol.
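Since both loaders share one signature and are registered by name, calling code can pick a loader at runtime from any dict-like registry. The helper below is a hypothetical sketch (load_with is not part of bookacle), and stub_manager stands in for LOADER_MANAGER so the example runs without any PDF libraries installed.

```python
from __future__ import annotations

from collections.abc import Mapping
from typing import Any, Callable


def load_with(
    manager: Mapping[str, Callable[..., list[Any]]],
    name: str,
    file_path: str,
    start_page: int = 0,
    end_page: int | None = None,
) -> list[Any]:
    """Look up a loader by name and call it with the shared signature."""
    try:
        loader = manager[name]
    except KeyError:
        available = ", ".join(sorted(manager))
        raise ValueError(f"Unknown loader {name!r}; available: {available}") from None
    return loader(file_path, start_page=start_page, end_page=end_page)


# Stub registry that mimics the two registered names; the lambdas return
# placeholder strings instead of parsing a real PDF.
stub_manager = {
    "pymupdf": lambda file_path, start_page=0, end_page=None: [f"text:{file_path}"],
    "pymupdf4llm": lambda file_path, start_page=0, end_page=None: [f"markdown:{file_path}"],
}

print(load_with(stub_manager, "pymupdf4llm", "book.pdf"))
```

Because LoaderManager is documented as a UserDict, passing LOADER_MANAGER in place of stub_manager should work the same way, assuming bookacle is installed.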