PDF "creation date" differs from spotlight "creation date"

Get help. Get answers. Let others lend you a hand.

Moderator: Mr_Noodle

Hi there!
I have an issue with PDF renaming when using the "date created" variable. I have a bunch of PDF invoices and it seems that PDF metadata information differs from information in spotlight.

In Finder the creation date is:
06.01.2017

I checked Spotlight metadata using mdls command an found:
kMDItemContentCreationDate = 2017-01-06 08:09:09 +0000

When I open the PDF in Preview or Acrobat Reader, both show a different creation date, which is the correct one:
03.01.2017, 00:13

06.01.2017 is the date where I downloaded the file, but 03.01.2017 is the date where it actually has been created. Is it somehow possible to use this second, correct date (03.01.2017) from PDF metadata within Hazel?

Thanks a lot,
Paul
paul
 
Posts: 3
Joined: Fri Jan 06, 2017 4:31 am

You can check the content creation date. If you do "Get Info" in Finder, it should show almost all the info available.
Mr_Noodle
Site Admin
 
Posts: 11226
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Mr_Noodle wrote:You can check the content creation date. If you do "Get Info" in Finder, it should show almost all the info available.

I already tried that but it also returns the wrong date:
kMDItemContentCreationDate = 2017-01-06 08:09:09 +0000
:(

Creation date in PDF metadata seems to be a totally different thing. Any ideas how to get that?
paul
 
Posts: 3
Joined: Fri Jan 06, 2017 4:31 am

If you don't see it in Finder, then the only alternative would be some script that can extract that info and export it to Hazel. You can try searching the forums here but I don't recall off-hand whether someone has done something like that already.
Mr_Noodle
Site Admin
 
Posts: 11226
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

paul wrote:
Mr_Noodle wrote:You can check the content creation date. If you do "Get Info" in Finder, it should show almost all the info available.

I already tried that but it also returns the wrong date:
kMDItemContentCreationDate = 2017-01-06 08:09:09 +0000
:(

Creation date in PDF metadata seems to be a totally different thing. Any ideas how to get that?


This is an old thread but I'm resurrecting it because I've been all over the interwebs and can't find a workable solution to this problem.

PROBLEM: I want to download my bank statements instead of waiting for the USPS delivery and then laboring though the process of scanning them and OCR'ing.

In my current paperless workflow, I manually prepend the statement date in front of the file name and let Hazel change the file creation date using a shell script, to the date in the file name, 2019-12-25--filename.pdf.

But now I want to cut out the US Mail and scanning with ScanSnap, and OCR and all that. My bank now offers a pretty cool download feature for all my documents. What if I could simply download my statements, have them renamed based on the PDF metadata "Creation Date", and import them into EagleFiler? That would be cool and save paper too.

Hazel seems unable to read PDF meta data, correct me if I'm wrong. But my testing shows that the Spotlight Creation Date is NOT the same as the date inside the PDF file that shows when you show the Preview file inspector. There it displays a metadata "Creation Date", that is the date when my bank issued the statement, not the date I downloaded the file.

TASK: I want to read the creation date from within the PDF file, and rename the file in a format like: 2019-12-22--Filename.pdf

I discovered some nifty Applescripts and shell scripts that will rename a file based on Spotlight searched data. But the data do NOT reflect the date in the PDF file metadata. Instead, the Spotlight creation date seems to equal the download date.
Here's an example:

Code: Select all
set theFiles to choose file with multiple selections allowed
repeat with theFile in theFiles
   set filePath to quoted form of POSIX path of theFile
   set theName to do shell script "mdls -name kMDItemFSCreationDate " & filePath & " -raw"
   set theName to theName & ".pdf"
   if theName is not "(null)" then try
      tell application "System Events" to set name of theFile to theName
   end try
end repeat


But this code, as I say, does not equal the date in the PDF meta data.

Then I found that Apple's Automator has a promissing PDF Action called "Get PDF Metadata". And it include the Creation Date. But I was unable to create an Automator action that would rename my file because I'm unfamiliar with Automator and thus inadequate to that task.

I suspect this renaming capability based on PDF metadata is a feature that many would want. Especially as more and more banks and online agencies move towards downloaded documentation.

I want to set Hazel to monitor my Downloads folder and automatically rename every PDF with a prepended Creation Date, extracted from the pdf metadata.

My paperless Hazel workflow rules already change the file creation date to the date prepended in the file name, using a shell script. So this preliminary download - file naming correction would be very helpful in keeping my files chronologically correct.

I could accept an Automator rule that watches a folder (Downloads), and renames all PDFs with a prepended PDF Creation Date. But it would be much more elegant if Hazel could do this for me.

Did I miss something, is this already doable. I've been looking all evening for a way.
Searching for: "How to rename a PDF file on a MAC based on metadata"

Please help if you can...

Cheers,

-DaveB

Any comments or suggestions?
DaveB
 
Posts: 25
Joined: Sat Apr 02, 2016 7:56 pm

Hi DaveB,

I run a number of workflows monthly where I download credit card, bank, utility, and other statements which are automatically entered into Evernote. Some requiring OCR are sent to a folder where that is done using using PDFpenPro; if account number redaction is necessary, the files is moved to a folder where that is accomplished again by PDFpenPro...all controlled by Hazel.

In the final step, Hazel does a date match on the statement or invoice date within the file. I have no interest in file metadata...I key on the statement or invoice date. See my Hazel file below for my bank statement:

Image

See the AppleScript for moving the file to Evernote below. Dates are set to match the invoice or statement date so they sort properly in Evernote.

Code: Select all
tell application "Evernote"
   
   set note1 to create note from file theFile notebook {"XYZ Bank"} tags {"XYZ Bank"}
   set (creation date of note1) to item 1 of inputAttributes
   set (modification date of note1) to (creation date of note1)
   
end tell

NOTE: DateMatch needs to be set in InputAttributes for this script.

Any questions...just ask.

Good luck...

Ed
EdBurke
 
Posts: 28
Joined: Wed Apr 18, 2012 6:58 am

EdBurke wrote:In the final step, Hazel does a date match on the statement or invoice date within the file. I have no interest in file metadata...I key on the statement or invoice date. See my Hazel file below for my bank statement:


Thanks for sharing your process Ed. I have a process for scanning and OCR'ing, using ScanSnap and ABBYY FineReader. I have about 50 Hazel rules set to match contents using "Contains Match", to match account numbers. Then it changes the file system's date created using a shell script, based on a manually prepended date in the filename, and moves the file to a folder in EagleFiler.

Here's that shell script:

Code: Select all
# Take filename, YYYY-MM-DD--New Name.pdf and
# - adjust the creation date using /usr/bin/setfile -d
# - adjust the modification date using /usr/bin/touch
# - rename the file to New Name.pdf using /bin/mv
 
filename_without_path=$(basename "$1")
extension="${filename_without_path##*.}"
filename_without_extension_or_path="${filename_without_path%.*}"

YYYY=$(echo "$filename_without_extension_or_path" | awk -F "-" '{print $1}')
MM=$(echo "$filename_without_extension_or_path" | awk -F "-" '{print $2}')
DD=$(echo "$filename_without_extension_or_path" | awk -F "-" '{print $3}')
NNAME=$(echo "$filename_without_extension_or_path" | awk -F "--" '{print $4}')

# Change creation date
if [[ ${filename_without_path:0:1} == "2" ]] ; then   /usr/bin/setfile -d "$MM/$DD/$YYYY" "$1"
fi
# Change modification date to Today/Now
# /usr/bin/touch "$1"
# Rename file
# /bin/mv "$1" "$NNAME"


This has all been working well for several years. I will look at tweaking that process to include your suggestion to look inside each file for a date rather than do that manually, as I do today.

More recently I've also developed (with help from the forums) an AppleScript to directly import into EagleFiler from Hazel, to leverage its ability to check for duplicates using checksums. The new EagleFiler import AppleScript requires a modification for each destination folder, so is a bit more to maintain as new rules are needed. But it works. Here's that script:

Code: Select all
-- set _folderName to item 1 of _inputAttributes (not useful in my case)
-- Adjust _parentFolderName and _folderName to match the destination folders in EagleFiler

set _parentFolderName to "Documents"
set _folderName to "Banking-Financial"

-- Change these to the library that you want to use
set _libraryName to "Paperless-Library_local.eflibrary"
set _libraryFolder to "/Users/dbartholomew/Documents/Paperless-Library_local"
set _libraryPath to "/Users/dbartholomew/Documents/Paperless-Library_local/Paperless-Library_local.eflibrary"
-- set _libraryPath to _libraryFolder & _libraryName
set _libraryFile to POSIX file _libraryPath

tell application "EagleFiler"
   open _libraryFile
   tell library document _libraryName
      set _folderRecord to library record _folderName of library record _parentFolderName of root folder
      set {_record} to import files {theFile} container _folderRecord
   end tell
end tell


But my original post here is about something different. Since more and more of my documents are being offered online by almost every bank, utility and cell phone provider, I can download them directly, in electronic PDF format, with full text fidelity and no need to OCR them. In every case, these online, downloadable PDF documents already have a date in the PDF metadata, called a "Creation Date", contained within the PDF file and viewable using the "Inspector" in Mac's Preview application. These metadata are consistent among all PDF's, so one rule or script would work for all PDF files.

[EDIT 2020-01-01]. I've found that not all PDFs have a creation date. And Some PDFs have a creation date based on the day you download the file... Oh well.

It would be elegant to be able to leverage these data in a rule or process via Hazel. As I mentioned above, Apple's Automator might solve this problem if I can figure out how to create a folder automation for my Downloads folder, that renames every PDF dropped in that folder, by inspecting the PDF for the "Creation Date" and perhaps the "Title", and prepending the date to the filename in a format like this: 2019-12-26--filename.pdf. That's the only way I've discovered so far, to do this today. (And I haven't yet discovered how to use Automator to do that.) But if Hazel could include a feature to inspect PDF metadata internally (or via a script), that would be even simpler.

I think a lot of folks would want this capability... (Or maybe such an ability already exists? Which is why I'm posting this here.)
Last edited by DaveB on Thu Jan 02, 2020 1:21 am, edited 1 time in total.
DaveB
 
Posts: 25
Joined: Sat Apr 02, 2016 7:56 pm

There are tools which can do this.

However, they are not included in macOS by default. There are two options.

1. You could install one of the tools designed for working with PDFs.
2. You could try to "roll your own" solution

Use A Tool

If you want to use the first option, I would suggest installing https://poppler.freedesktop.org which includes a great collection of PDF tools, including `pdfinfo` which is the one we want in this case.

pdfinfo 'FILENAME.pdf' | awk -F' ' '/^CreationDate:/{print $NF}'

Note that in the awk -F' ' there are 3 spaces between the first ' and the last

That will give you a date such as "Mon Nov 18 22:29:07 2013 EST"

Roll Your Own

The option to roll your own can be attractive because it doesn't require any other software, but then you have do deal with unknowns. For example, in very limited testing I have already found there are at least two ways of showing the Creation Date in the PDF metadata.

Here is an old one:

/CreationDate (D:20131119032907+00'00')

Here is a new one:

<xmp:CreateDate>2019-07-17T14:29:50+04:00</xmp:CreateDate>

Note that "pdfinfo" can deal with both of these variations, and presumably any others that might exist.

Also note that these timestamps are not localized (that is, they are not showing your local time zone), so you'd need to parse them (the formats are different) and then "translate" them to your time zone. Again, "pdfinfo" does this automatically.

So I'd be a strong advocate for not trying to roll your own, but instead use the tool which has already been made for this purpose. If you use Brew you can install poppler with a simple `brew install poppler` and then you'll have this and several other tools at your disposal.

My Solution

I wrote a shell script which will use `pdfinfo` to look for the creation date and then will rename the file accordingly.

https://gist.github.com/tjluoma/205e3d85e46eb6025b87c6db5b77375b

If you use

pdf-rename-by-cdate.sh FILENAME.pdf

then "FILENAME.pdf" will be renamed "FILENAME (YYYY-MM-DD).pdf"

For example, 'ccc5-documentation-en.pdf' became 'ccc5-documentation-en (2019-07-17).pdf'

The script tries to determine if the filename already has the date in it, and if it does, it will not rename it, so you won't end up with "FILENAME (YYYY-MM-DD) (YYYY-MM-DD).pdf"

If "FILENAME (YYYY-MM-DD).pdf" already exists when another file tries to be renamed to that, the script will rename it "FILENAME (YYYY-MM-DD) 1.pdf" instead, and it that exists will try 2, 3, 4, etc. until it find an available filename.

I hope this helps. If there are questions, please feel free to ask (either here or leave a comment on the gist) and I'll do my best to answer.
luomat
 
Posts: 78
Joined: Wed Mar 10, 2010 3:57 pm

Well here I am replying to my own post. Thank you luomat for your suggestions.

I had some time over the holidays and I've developed a solution that works for me.
The goal is to automatically rename PDF files downloaded to the Downloads folder with a prepended reverse date, "2020-01-01--" extracted from the metadata "Creation Date" seen in the PDF inspector. (This is NOT the same as the Spotlight metadata).

Background:
My Hazel rules all process files that contain a prepended date. The Hazel rules use this prepended date to change the file creation date via a shell script, as files are imported by Hazel as part of my paperless workflow, into EagleFiler paperless library.

My manual process has been to scan and OCR, and then manually open each file, review the date, and prepend the filename with that date in that specific format. Most of my documents don't lend themselves to letting Hazel figure out the creation date using content matching.

But lately, I've also begun to download a lot of billing and bank statements online in PDF format. I wanted to save time by programmatically inspecting the newly downloaded PDF files, and conjure up a date created from the metadata within each PDF file.

I found open source, external applications that can do this inspection and extraction of the PDF "creation date". But I was reluctant to install binaries out of an abundance of caution over security. I almost did install one, but my Mac asked me if I was sure I wanted to do that, and warned me against it. So I backed off.

As I've been researching this issue I also came to understand that the Mac includes a native ability to inspect PDF metadata as part of the Automator app. So this week I've spent some considerable quality time learning about Automator, JavaScript for Applications (JXA), AppleScripting and a bit of Shell scripting.

In the end, I now have a reasonable "Proof of Concept" for a Folder Automation that's assigned to my Downloads folder.
Here's what it does:

Watch for new files in Downloads.
Pause for 4 seconds (to allow downloads to complete)
Filter so only PDF files are acted upon
Run a Javascript to parse the metadata "Creation Date", or optionally use today's date if that's empty
Reformat the Creation date from "Jan 1, 2020" to "2020-01-01"
Move the file out of Downloads, to a destination folder
Prepend that date by renaming the file as "2020-01-01--filename.pdf"

All this using Apple's JXA JavaScript for Applications (something completely new to me, and remarkably poorly documented online).

Here's a screenshot from Automator.

Image

And here is the JavaScript code:

Code: Select all
function run(input, parameters) {
   
   var app = Application.currentApplication();
      app.includeStandardAdditions = true;
   var finder = Application("Finder");
   var systemEvents = Application("System Events");
   var filename = input[0];
   var tmpFilename = filename.toLowerCase();
   //app.displayDialog("filename is " + filename);
   
   /* ================================================================
   Input (from Get PDF Metadata) :
      2019-12-25-My Visa Statement.pdf
      Dec 22, 2019 at 9:39:25 PM
   Converts to:
      2019-12-22--2019-12-25-My Visa Statement.pdf
   =================================================================== */

   try {
      var ndex = tmpFilename.lastIndexOf(".pdf");
      var filenameAlone = filename.substring(0, ndex+4) //The fifth character is a tab
      var nat;
      
      cDate = filename.substring((ndex+5), filename.length);
      nat = cDate.lastIndexOf(" at ");
      cDate = cDate.substring(0,(nat)) + " 00:00:00"; // we don't care about the time but the date object requires it.
      
      //app.displayDialog("cDate + input = " + cDate + " | " + input)
   
   
      //we strippped off the filename in front and the tab character after .pdf and the time after the at.
      //var dT = new Date(cDate).toString();  //used while debugging to display JavaScript date object as a string
      var dT
      if (nat == -1) { // Some PDFs do'nt have a creation date in their metadata.
         dT = new Date();
      } else {
         dT = new Date(cDate);
      }
      var aaaa = dT.getFullYear();
      var gg = dT.getDate();
      var mm = (dT.getMonth() + 1);
      if (gg < 10) { gg = "0" + gg;}
      if (mm < 10) { mm = "0" + mm;}   
      var cDate = aaaa + "-" + mm + "-" + gg;
      
      // app.displayDialog("cDate now equals " + cDate);
      // our date now looks like this: 2020-01-01
      
      var newFileName = cDate + "--" + filenameAlone;
      // debugging ...
      // app.displayDialog("cDate is " + cDate + " filenameAlone is " + filenameAlone);      
 
      var sFile= "~/Downloads/" + filenameAlone;
      // BTW, JXA doesn't seem to like paths with ~/ tildes in them

/* ===================================================================
   First we have to move the file out of the watched Downloads folder
   Then rename it, or else looping occurs when the Automator
   Folder Action gets triggered by the continuing name changes.
====================================================================== */
/* it would be nice if we didn't have to hard code the paths...
 
   var homeDirectory = finder.startupDisk.folders["Users"].folders["user"]
   var sourceFile = homeDirectory.folders["Downloads"].files[filenameAlone]
   var destinationFolder = homeDirectory.folders["Destination"]
   app.displayDialog("homeDirectory = " + homeDirectory);
*/
   var sourceFile = "/Users/dbartholomew/Downloads/" + filenameAlone;
   var destinationFolder = "Users/dbartholomew/Desktop/Scan Target/";

// This is the secret sauce... Poorly documented JXA code. It was hard to discover

   finder.move(Path(sourceFile), {
    to: Path(destinationFolder),
    replacing: true
   })

   theFile = systemEvents.aliases.byName(destinationFolder + filenameAlone);   
        theFile.name = newFileName;  // Rename the file
   //app.displayDialog("sFile = " + sFile + "\n : newFileName = " + newFileName);

   } catch (error) {
   //console.log("Error - File Not Found" + error);
   app.displayDialog("Oh Bummer : " + error);
   }
//app.displayDialog("cDate + input = " + cDate + " | " + input)
return input;
}


I don't mind that the JavaScript renames the file to append a reverse date to a filename that already contains a reversed date because hazel Rules will rename the file on import based on content matching.
I.E. 2019-12-22--2019-12-25-My Visa Statement.pdf will become 2019-12-22--Daves Signature Visa Statement.pdf
The rule implements a shell script that only looks for the prepended date "2019-12-22--" and ignores the rest of the filename. Bottom line is the filename will be cleaned up in the next step in the workflow.

More work could be done to inspect the filename, and if a date is present in the filename AND the PDF metadata are empty of a date, or the PDF metadata "Creation Date" is newer than the date contained in the filename... then use the date in the filename instead of today's date, as I'm doing here. I've found that some banks don't create the PDF statement until I ask to download it, providing a PDF creation date of today. While other banks generate the PDF on the statement date, and thus contain a Creation Date equal to the statement date. And still other banks (ok my lame Credit Union), has statements with no date at all in the metadata...but the filename contains the statement date.

This work represents several days of tinkering. I'm sure I will refine it over time. I'm new to the tools that I used. But it works!

Hope others find it useful too.

Happy New Year 2020!

-Dave

[Update Jan 5, 2020]. With some use of this new technique for renaming PDFs I've discovered that many of my online statement providers don't create the PDF statement until the moment I request the download. So the date in the PDF metadata is not always the original statement date. I also discovered that Hazel has a fantastic ability to find dates within my PDFs (Thanks EdBurke). (I can't believe I never knew Hazel can do this "Custom Date" date extraction... so cool.) So I've updated my workflow to use the custom date matching to extract the date, then reformat that date as 2020-01-05 for prepending many of my statements and invoices filenames. Here's an article I found that describes how to use Custom Date match http://www.myproductivemac.com/blog/date-matching-in-hazel2992015

The JavaScript-Automator technique I've illustrated above is still useful for anyone who needs an automator folder action that can rename PDFs to the metadata TITLE, rather than the Creation Date. Many online documents are given machine generated filenames that when downloaded, will be better off renamed to the PDF metadata TITLE. Just adjust Automator to extract the Title from the metadata and my script to rename the PDF file accordingly.
DaveB
 
Posts: 25
Joined: Sat Apr 02, 2016 7:56 pm


Return to Support

cron