I often have to scan invoices with my Canon RS40 (highly recommended!) for bookkeeping, and renaming the files and figuring out what is what drives me crazy. I like to have my files sorted with the filename scheme YYYY-MM-DD-VENDOR, e.g. 2024-02-24-Apple. That's why I wrote two Hazel rules:
The first rule watches the folder where the Canon sends the scans: it checks whether the type is PDF and whether the name matches the name auto-generated by the Canon.
It then first runs an OCR script using ocrmypdf, so that the scans show up in full-text search via the Mac's Spotlight:
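# Put the Homebrew tools (ocrmypdf, exiftool) on the PATH; Hazel passes the file as $1
eval "$(/opt/homebrew/bin/brew shellenv)"
# OCR the file in place (German + English), then strip all metadata and announce success
ocrmypdf -v 1 -l deu+eng --output-type pdf --rotate-pages --continue-on-soft-render-error --redo-ocr "$1" "$1" >/dev/null 2>&1 && exiftool -q -overwrite_original_in_place -all:all= "$1" && say "OCR successful"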
After that, my second script runs, which sends the invoice to AWS Textract using the AWS CLI:
https://gist.github.com/thomaswitt/c165 ... 71cef62a54
Before running the script, you have to create a profile "textract" (in ~/.aws/credentials).
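For anyone who has not set up a named profile before, the entry in ~/.aws/credentials looks roughly like this. The two values are placeholders, not real keys:

```
[textract]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```

A region for the profile can be set in ~/.aws/config or passed to the CLI with --region.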
The script then extracts the first page of the file via qpdf (Textract can only handle a single PDF page when data is streamed to AWS synchronously, which is fine for typical invoices). The JSON result gets parsed with jq, and the results get sorted by the most likely match for the receipt date (some Ruby date parsing involved) and for the vendor name. Then the file is renamed accordingly. The result is spoken via "say", so I can stand beside my scanner and scan one invoice after another without looking at the screen.
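To give an idea of the steps without opening the gist, here is a rough sketch of three of them. It is not the code from the gist: the function names are made up for illustration, the upload to Textract is left out, and the jq filter assumes the response has the shape returned by Textract's AnalyzeExpense operation (the gist may well use a different call). The date is expected to be in ISO form already, so the Ruby date parsing is not reproduced.

```shell
#!/usr/bin/env bash
# Sketch only: all function names are hypothetical, not taken from the gist.

# Write page 1 of the PDF "$1" to the new file "$2" (qpdf page-selection syntax).
first_page() {
  qpdf --empty --pages "$1" 1 -- "$2"
}

# Print the highest-confidence value of one summary field (e.g. VENDOR_NAME
# or INVOICE_RECEIPT_DATE) from a saved response file "$1".
# ASSUMPTION: the JSON has the AnalyzeExpense shape
# (ExpenseDocuments[].SummaryFields[] with Type and ValueDetection).
best_field() {
  jq -r --arg type "$2" '
    [.ExpenseDocuments[].SummaryFields[]
     | select(.Type.Text == $type and .ValueDetection.Text != null)]
    | sort_by(.ValueDetection.Confidence) | last | .ValueDetection.Text // empty
  ' "$1"
}

# Build "YYYY-MM-DD-VENDOR.pdf" from an ISO date "$1" and a vendor name "$2".
# Fails (returns 1) if the date is not ISO-formatted or the vendor is empty.
invoice_filename() {
  local date="$1" vendor="$2"
  [[ "$date" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ ]] || return 1
  # Replace anything outside letters, digits, dot, underscore and hyphen
  vendor="${vendor//[^A-Za-z0-9._-]/_}"
  [[ -n "$vendor" ]] || return 1
  printf '%s-%s.pdf\n' "$date" "$vendor"
}

# Example wiring (commented out because it needs qpdf, jq and a saved response):
#   first_page "$1" /tmp/page1.pdf
#   vendor=$(best_field /tmp/response.json VENDOR_NAME)
#   mv "$1" "$(dirname "$1")/$(invoice_filename 2024-02-24 "$vendor")"
```

Replacing odd characters in the vendor name with underscores is just one possible choice; it keeps names like "Apple Inc" from producing filenames with spaces.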
Hope that helps somebody who has the same problem as I do.