paperap.scripts.describe module

METADATA:

File: describe.py
Project: paperap
Created: 2025-03-18
Version: 0.0.9

Author: Jess Mann
Email: jess@jmann.me

Copyright (c) 2025 Jess Mann

LAST MODIFIED:

2025-03-18 By Jess Mann

class paperap.scripts.describe.ScriptDefaults(*values)[source]

Bases: StrEnum

NEEDS_DESCRIPTION = 'needs-description'

DESCRIBED = 'described'

NEEDS_TITLE = 'needs-title'

NEEDS_DATE = 'needs-date'

MODEL = 'gpt-4o-mini'
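
Because ScriptDefaults derives from StrEnum, its members behave like ordinary strings. The sketch below redeclares the enum locally, so it does not import paperap, to show comparison and lookup by value. The fallback class for interpreters older than Python 3.11 is an assumption added for illustration and is not part of the module.

```python
try:
    from enum import StrEnum  # available in Python 3.11+
except ImportError:  # illustrative fallback for older interpreters
    from enum import Enum

    class StrEnum(str, Enum):
        pass


class ScriptDefaults(StrEnum):
    # Values copied from the reference above.
    NEEDS_DESCRIPTION = "needs-description"
    DESCRIBED = "described"
    NEEDS_TITLE = "needs-title"
    NEEDS_DATE = "needs-date"
    MODEL = "gpt-4o-mini"


# Members compare equal to their plain string values.
tag_matches = ScriptDefaults.NEEDS_DESCRIPTION == "needs-description"

# Looking up by value returns the corresponding member.
member = ScriptDefaults("described")
```

This means a member can be passed wherever a tag name or model name is expected as a plain string.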

class paperap.scripts.describe.DescribePhotos(**data)[source]

Bases: BaseModel

Describes photos in the Paperless NGX instance using an LLM (such as OpenAI’s GPT-4o-mini model).

Parameters:: data (Any)

max_threads: int

paperless_tag: str | None

prompt: str | None

client: PaperlessClient

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

property progress_bar: ProgressBar

property openai_url: str | None

property openai_key: str | None

property openai_model: str

property openai: OpenAI

classmethod validate_max_threads(value)[source]

Parameters:: value (Any)
Return type:: int

property jinja_env: Environment

choose_template(document)[source]

Choose a Jinja template for a document.

Parameters:: document (Document)
Return type:: str

get_prompt(document)[source]

Generate a prompt to send to OpenAI using a Jinja template.

Parameters:: document (Document)
Return type:: str

extract_images_from_pdf(pdf_bytes, max_images=2)[source]

Extract up to max_images images from a PDF file.

Parameters:

pdf_bytes (bytes) – The PDF file content as bytes.
max_images (int) – The maximum number of images to extract.

Returns:

The first max_images images as bytes, or None if no image is found.

Return type:

bytes | None

parse_date(date_str)[source]

Parse a date string.

Parameters:: date_str (str) – The date string to parse.
Returns:: The parsed date.
Return type:: date
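
The reference does not say which input formats parse_date accepts or which library it uses. As a rough illustration of the general idea only, the sketch below tries a short list of formats with the standard library. The function name parse_date_sketch and the CANDIDATE_FORMATS tuple are hypothetical and are not taken from paperap.

```python
from datetime import date, datetime

# Hypothetical list of accepted formats; the real method may accept others.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date_sketch(date_str: str) -> date:
    """Try each candidate format in turn and return the first match."""
    cleaned = date_str.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f"Unrecognized date string: {date_str!r}")


parsed = parse_date_sketch("2025-03-18")
```

Raising ValueError on unrecognized input is a design choice of this sketch; the actual method may behave differently on failure.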

parse_datetime(date_str)[source]

Parse a date string.

Parameters:: date_str (str) – The date string to parse.
Returns:: The parsed date.
Return type:: date

standardize_image_contents(content)[source]

Standardize image contents to base64-encoded PNG format.

Parameters:: content (bytes)
Return type:: list[str]
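
The base64 step described above can be sketched with the standard library alone. This example covers only the encoding of raw image bytes into strings, not any format conversion, and the helper name encode_images_as_base64 is hypothetical. Wrapping the result in a data URL is shown as one common way to embed an image in a request; whether the module does this is an assumption.

```python
import base64


def encode_images_as_base64(images: list[bytes]) -> list[str]:
    """Return one ASCII base64 string per input image."""
    return [base64.b64encode(image).decode("ascii") for image in images]


# The 8-byte PNG file signature stands in for real image data here.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

encoded = encode_images_as_base64([PNG_SIGNATURE])

# Assumption: an encoded image is embedded as a data URL when sent.
data_url = f"data:image/png;base64,{encoded[0]}"
```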

convert_image_to_jpg(bytes_content)[source]

Convert an image to JPEG format.

Parameters:: bytes_content (bytes) – The image content as bytes.
Returns:: The image content as JPEG.
Return type:: bytes

describe_document(document)[source]

Describe a single document using the configured OpenAI model.

The document object passed in will be updated with the description.

Parameters:: document (Document) – The document to describe.
Returns:: True if the document was successfully described
Return type:: bool

process_response(response, document)[source]

Process the response from OpenAI and update the document.

Parameters:

response (str) – The response from OpenAI
document (Document) – The document to update

Returns:

The updated document

Return type:

Document

describe_documents(documents=None)[source]

Describe a list of documents using the configured OpenAI model.

Parameters:: documents (list[Document]) – The documents to describe.
Returns:: The documents with the descriptions added.
Return type:: list[Document]
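
The class exposes a max_threads field, which suggests that documents may be processed concurrently. The reference does not show how, so the following is only a sketch of one way to bound concurrency with a thread pool. The names process_concurrently and worker are hypothetical, and worker stands in for a per-document call such as describe_document.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def process_concurrently(
    items: Iterable[T],
    worker: Callable[[T], bool],
    max_threads: int = 4,
) -> list[T]:
    """Run worker on each item using at most max_threads threads.

    Returns the items for which worker returned True, in input order.
    """
    item_list = list(items)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map yields results in the same order as the inputs.
        outcomes = list(pool.map(worker, item_list))
    return [item for item, ok in zip(item_list, outcomes) if ok]


# Stand-in worker: "succeeds" only for even numbers.
succeeded = process_concurrently([1, 2, 3, 4], lambda n: n % 2 == 0, max_threads=2)
```

A thread pool suits this kind of work because each call spends most of its time waiting on network responses.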

model_post_init(context, /)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:

self (BaseModel) – The BaseModel instance.
context (Any) – The context.

Return type:

None

class paperap.scripts.describe.ArgNamespace(**kwargs)[source]

Bases: Namespace

A custom namespace class for argparse.

url: str

key: str

model: str | None = None

openai_url: str | None = None

tag: str

prompt: str | None = None

verbose: bool = False
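
A typed Namespace subclass like this one is usually passed to parse_args so that the parsed attributes carry type hints. The sketch below mirrors the fields listed above in a local class. The flag names (--url, --key, --openai-url and so on) and the default for --tag are assumptions made for illustration; the script's real command-line flags are not documented here.

```python
import argparse
from typing import Optional


class ArgNamespaceSketch(argparse.Namespace):
    # Field names and defaults mirror the reference above.
    url: str
    key: str
    model: Optional[str] = None
    openai_url: Optional[str] = None
    tag: str
    prompt: Optional[str] = None
    verbose: bool = False


def build_parser() -> argparse.ArgumentParser:
    # All flag names below are hypothetical.
    parser = argparse.ArgumentParser(description="Describe documents with an LLM")
    parser.add_argument("--url", required=True)
    parser.add_argument("--key", required=True)
    parser.add_argument("--model", default=None)
    parser.add_argument("--openai-url", dest="openai_url", default=None)
    parser.add_argument("--tag", default="needs-description")  # assumed default
    parser.add_argument("--prompt", default=None)
    parser.add_argument("--verbose", action="store_true")
    return parser


# Passing an instance as `namespace` makes parse_args populate it in place.
args = build_parser().parse_args(
    ["--url", "http://localhost:8000", "--key", "example-token"],
    namespace=ArgNamespaceSketch(),
)
```

The URL and token in the usage line are placeholder values.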

paperap.scripts.describe.main()[source]

Run the script.

Return type:: None