What I would like is for the "Throw away if a duplicate" option not to require an exact file match, but to be able to match on the extracted file contents instead. I expect that there would then be "Throw away if an exact duplicate" and "Throw away if extracted text matches" options and/or a global option to change the "duplicate" behavior.
Use case for this feature
I download my bank statements from the online banking site and use Hazel to rename each file and then file it away. Sometimes I go a few months between downloads and then download 3-5 of them at once when I remember.
Sample rules:
- Match contents based on keywords and account number to identify the rule to apply
- Match and capture statement date to use for the renaming of the file
- Rename the file to "Bank_Name-YYYY-MM-DD" based on the captured date
- Move to my folder for these records, with the "Throw away if a duplicate" option selected
Some banks store the PDF online, so I get the exact same "blob" downloaded each time and the duplicate gets thrown away. For others, the PDF seems to be generated at the time of the request: the contents are all the same, but the files differ in some way, so the checksum is different. (E.g., maybe it has some sort of embedded comment with the date of generation or similar.) For those, I end up with renamed duplicates even though, when I open the files, I can see no difference, and in Hazel's preview of matching contents all of the text shown is identical.
Request
What I would like to request is an option to base the file comparison on the extracted text contents of the PDF, so that duplicate detection is triggered when the contents are the same and not only when the files are 100% identical.
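
To make the request concrete, here is a rough Python sketch of the comparison I have in mind. This is only an illustration, not how Hazel works internally: the function names (`byte_digest`, `text_digest`, `is_duplicate`) and the `match_text` switch are all made up, and I am assuming the text has already been extracted from each PDF by some other step.

```python
import hashlib
import re


def byte_digest(data: bytes) -> str:
    """Checksum of the raw file bytes (the exact-duplicate check)."""
    return hashlib.sha256(data).hexdigest()


def text_digest(extracted_text: str) -> str:
    """Checksum of the extracted text, with runs of whitespace collapsed."""
    normalized = re.sub(r"\s+", " ", extracted_text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_duplicate(new_data: bytes, new_text: str,
                 old_data: bytes, old_text: str,
                 match_text: bool = False) -> bool:
    """Hypothetical duplicate test.

    Identical bytes always count as a duplicate. When match_text is True
    (the requested option), identical extracted text counts as well.
    """
    if byte_digest(new_data) == byte_digest(old_data):
        return True
    return match_text and text_digest(new_text) == text_digest(old_text)
```

With `match_text=False` this behaves like an exact-file check, so two statements that differ only in an embedded generation date are both kept. With `match_text=True` the second one would be treated as a duplicate and thrown away.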