How to extract paragraphs from a PDF file?

Posted: Wed Mar 15, 2017 11:56 am
by rbianchi
Hi,
I need to extract the summary of a report.
Is it possible to extract several paragraphs of text from a PDF with the "contents/contents match" rule?
Or in some other way?

Re: How to extract paragraphs from a PDF file?

Posted: Thu Mar 16, 2017 11:14 am
by Mr_Noodle
While Hazel can extract text, I'm not sure if it's the best suited tool for this. There is a built-in "Summarize" service which you can read more about here: http://www.idownloadblog.com/2014/01/11 ... t-for-you/

Also, what do you want to do with the summary? I'm not sure Hazel has any built-in actions that do anything useful with large chunks of text, so you'd probably need scripts/Automator to work with it.

Re: How to extract paragraphs from a PDF file?

Posted: Fri Mar 17, 2017 7:58 am
by rbianchi
Mr_Noodle, thanks for the reply,
I have a lot of technical papers in PDF format, and I want to build a database with Title, Authors and Summary.
I can extract the text with Hazel, but with contents/contents match = "Summary: (anything)" I only get the first paragraph, whereas the summary of the document (which is preceded by the word "Summary:") generally has 3 or 4 paragraphs.

I also tried "Summary: (anything) (anything) (anything)", but I still get only the first paragraph.

Re: How to extract paragraphs from a PDF file?

Posted: Fri Mar 17, 2017 11:08 am
by Mr_Noodle
The problem with (anything) (anything) (anything) is that the spaces in the pattern, while matching newlines, also match ordinary spaces. So it could be grabbing three chunks of text with spaces between them rather than three paragraphs.

There's not really a good consistent way of grabbing multiple paragraphs. The matching feature was meant more for grabbing smaller runs of text since they were going to be used in places like filenames which can't handle text that is too long. You might need to use a script to do that part.
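
For the script route, here is a minimal sketch in Python of what that part might look like. It is not a Hazel feature. It assumes the PDF text has already been extracted into a plain string, that the summary starts right after the literal marker "Summary:", and that paragraphs are separated by blank lines. The function name `extract_summary` and the `max_paragraphs` cutoff are hypothetical, since nothing in the thread says how to detect where the summary ends.

```python
import re


def extract_summary(text, marker="Summary:", max_paragraphs=4):
    """Return up to max_paragraphs paragraphs that follow marker, or None.

    Assumptions (not from Hazel): paragraphs are separated by blank lines,
    and the summary is at most max_paragraphs paragraphs long.
    """
    start = text.find(marker)
    if start == -1:
        return None  # marker not present in this document

    rest = text[start + len(marker):]

    # Split on blank lines (a newline, optional whitespace, another newline).
    paragraphs = [p for p in re.split(r"\n\s*\n", rest) if p.strip()]

    # Keep the first few paragraphs and collapse line wraps inside each one.
    kept = [" ".join(p.split()) for p in paragraphs[:max_paragraphs]]
    return "\n\n".join(kept)


if __name__ == "__main__":
    sample = (
        "Some Paper Title\n\n"
        "Summary: First paragraph of the summary,\n"
        "wrapped over two lines.\n\n"
        "Second paragraph.\n\n"
        "Third paragraph.\n\n"
        "1. Introduction\n\n"
        "Body text."
    )
    print(extract_summary(sample, max_paragraphs=3))
```

A fixed paragraph count is a blunt cutoff. If your papers always follow the summary with a recognizable heading, stopping at that heading would probably be more reliable.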