Trouble with date matching and variable spaces

I was setting up a rule for some 401k statements in searchable PDFs. I wanted to grab the ending statement date to use in the file name.

The contents of the statement include "Retirement Savings Account" followed by the statement period in a format such as "May 1, 2013 - June 30, 2013".

I have the rule with a condition of "Contents contain match". The match is "(open date token)(anything)-(anything)(close date token)". The dates are in the format of "June 20, 2013". The (anything) tokens handle OCR variability in whether spaces were recognized before and after the dash.
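For anyone who thinks in regular expressions, here is a rough Python analog of that match. It is only a sketch to illustrate the idea, not the application's token syntax: the names DATE, PERIOD and find_period are made up for this example, and the date pattern assumes a fixed "Month D, YYYY" layout with exactly one space after the comma.

```python
import re

# Hypothetical analog of a strict date token: "June 20, 2013"
# (one space after the month, a comma, then exactly one space).
DATE = r"[A-Z][a-z]+ \d{1,2}, \d{4}"

# Analog of "(open date)(anything)-(anything)(close date)".
# The lazy ".*?" on each side of the dash stands in for the (anything) tokens.
PERIOD = re.compile(rf"(?P<open>{DATE}).*?-.*?(?P<close>{DATE})")

def find_period(text):
    """Return (open, close) date strings, or None if nothing matches."""
    m = PERIOD.search(text)
    return (m.group("open"), m.group("close")) if m else None
```

With this pattern, spacing around the dash does not matter, but the spacing inside each date still has to be exact.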

I also have a couple other conditions based on the content containing key phrases not using match that just verify that it is the 401k statement in general.

I got it to successfully match one file, but it stumbled on another scan of the same institution's statement. The problem came down to the OCR text of one date lacking the space after the comma, i.e. "June 20,2013".

If I take the space out of the date token, then it matches that file, but it won't match other files where the date was OCR'd with the space.

Any ideas whether there's a date match rule that can cover such variances or a better way to do what I'm doing? There is no (anything) token in the date match itself, so I can't insert that instead of a space after the comma.

I'm considering adding the "anything" token to the date pattern editor as someone else noted a similar issue with OCRing where dirt and such can result in extra punctuation. In the meantime, you can try using separate date tokens for each part of the date. Like:

(•date month)(...)(•date day)(...)(•date year)

where each date token matches just one part of the date (month, day or year, in this case). When you only specify part of a date in the token, the rest is filled in with the current date. But when you re-use those tokens, you can have it output only the part of the date you read in. So, for (•date month), which only matches a month, you can set its output format to show only the month part. Hopefully that makes sense.
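The same split-token idea can be sketched in Python regex terms. This is an illustration under assumptions, not how the application implements its tokens: the names MONTHS, PARTS and parse_dates are hypothetical, and the separator class (whitespace, commas, periods) is just a guess at the kind of OCR noise that shows up between the parts.

```python
import re
from datetime import date

# Map full English month names to month numbers.
MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# One group per date part, with a tolerant separator between the parts,
# similar in spirit to (month)(...)(day)(...)(year).
PARTS = re.compile(
    r"(?P<month>" + "|".join(MONTHS) + r")"
    r"[\s,.]*(?P<day>\d{1,2})"
    r"[\s,.]*(?P<year>\d{4})"
)

def parse_dates(text):
    """Return every date found in text, in order of appearance."""
    return [
        date(int(m.group("year")), MONTHS[m.group("month")], int(m.group("day")))
        for m in PARTS.finditer(text)
    ]

# The closing date is the last one found; format it for use in a file name.
closing = parse_dates("May 1, 2013 - June 30,2013")[-1]
file_stamp = closing.strftime("%Y-%m-%d")
```

Because each part is captured separately, a missing or extra space after the comma no longer matters, and the output format is chosen independently of how the date was read in.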

Not ideal but just one thing to consider until I can add the proper support in a future release.

I'll have to give that a try. So it would be like (open month), (open day), (open year) and (close month), (close day), (close year), then use whichever set of tokens gives me the date I want in the final product.

BTW, it looks like the "anything" token will make it into the next patch. So, if you are willing to hang on for a few days (still tracking down other bugs), then you'll have that feature available.