paperap.scripts.describe module


METADATA:

File: describe.py

Project: paperap

Created: 2025-03-18

Version: 0.0.9

Author: Jess Mann Email: jess@jmann.me

Copyright (c) 2025 Jess Mann


LAST MODIFIED:

2025-03-18 By Jess Mann

class paperap.scripts.describe.ScriptDefaults(*values)[source]

Bases: StrEnum

NEEDS_DESCRIPTION = 'needs-description'
DESCRIBED = 'described'
NEEDS_TITLE = 'needs-title'
NEEDS_DATE = 'needs-date'
MODEL = 'gpt-4o-mini'
class paperap.scripts.describe.DescribePhotos(**data)[source]

Bases: BaseModel

Describes photos in the Paperless NGX instance using an LLM (such as OpenAI’s GPT-4o-mini model).

Parameters:

data (Any)

max_threads: int
paperless_tag: str | None
prompt: str | None
client: PaperlessClient
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

property progress_bar: ProgressBar
property openai_url: str | None
property openai_key: str | None
property openai_model: str
property openai: OpenAI
classmethod validate_max_threads(value)[source]
Parameters:

value (Any)

Return type:

int

property jinja_env: Environment
choose_template(document)[source]

Choose a jinja template for a document

Parameters:

document (Document)

Return type:

str

get_prompt(document)[source]

Generate a prompt to sent to openai using a jinja template.

Parameters:

document (Document)

Return type:

str

extract_images_from_pdf(pdf_bytes, max_images=2)[source]

Extract the first image from a PDF file.

Parameters:
  • pdf_bytes (bytes) – The PDF file content as bytes.

  • max_images (int)

Returns:

The first {max_images} images as bytes or None if no image is found.

Return type:

bytes | None

parse_date(date_str)[source]

Parse a date string.

Parameters:

date_str (str) – The date string to parse.

Returns:

The parsed date.

Return type:

date

parse_datetime(date_str)[source]

Parse a date string.

Parameters:

date_str (str) – The date string to parse.

Returns:

The parsed date.

Return type:

date

standardize_image_contents(content)[source]

Standardize image contents to base64-encoded PNG format.

Parameters:

content (bytes)

Return type:

list[str]

convert_image_to_jpg(bytes_content)[source]

Convert an image to JPEG format.

Parameters:

bytes_content (bytes) – The image content as bytes.

Returns:

The image content as JPEG.

Return type:

bytes

describe_document(document)[source]

Describe a single document using OpenAI’s GPT-4o model.

The document object passed in will be updated with the description.

Parameters:

document (Document) – The document to describe.

Returns:

True if the document was successfully described

Return type:

bool

process_response(response, document)[source]

Process the response from OpenAI and update the document.

Parameters:
  • response (str) – The response from OpenAI

  • document (Document) – The document to update

Returns:

The updated document

Return type:

Document

describe_documents(documents=None)[source]

Describe a list of documents using OpenAI’s GPT-4o model.

Parameters:

documents (list[Document]) – The documents to describe.

Returns:

The documents with the descriptions added.

Return type:

list[Document]

model_post_init(context, /)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self (BaseModel) – The BaseModel instance.

  • context (Any) – The context.

Return type:

None

class paperap.scripts.describe.ArgNamespace(**kwargs)[source]

Bases: Namespace

A custom namespace class for argparse.

url: str
key: str
model: str | None = None
openai_url: str | None = None
tag: str
prompt: str | None = None
verbose: bool = False
paperap.scripts.describe.main()[source]

Run the script.

Return type:

None