How to extract paragraphs from a PDF file?

Moderator: Mr_Noodle

How to extract paragraphs from a PDF file? Wed Mar 15, 2017 11:56 am • by rbianchi
Hi,
I need to extract the summary of a report.
Is it possible to extract several paragraphs of text from a PDF with the contents/"contents match" rule?
Or in some other way?
rbianchi
 
Posts: 10
Joined: Fri Jan 23, 2015 6:40 pm

While Hazel can extract text, I'm not sure if it's the best suited tool for this. There is a built-in "Summarize" service which you can read more about here: http://www.idownloadblog.com/2014/01/11 ... t-for-you/

Also, what do you want to do with the summary? Not sure if Hazel has any built-in actions to do anything with large chunks of text that you would find useful so you'd probably need scripts/Automator to do something with it.
Mr_Noodle
Site Admin
 
Posts: 11872
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Mr_Noodle, thanks for the reply,

I have a lot of technical papers in PDF format and want to build a database with Title, Authors and Summary.
I can extract the text with Hazel, but with contents/contents match = "Summary: (anything)" I only get the first paragraph, while the summary of a document (which is preceded by the word "Summary:") generally has 3 or 4 paragraphs.

I also tried "Summary: (anything) (anything) (anything)", but I always get only the first paragraph.
rbianchi
 
Posts: 10
Joined: Fri Jan 23, 2015 6:40 pm

The problem with (anything) (anything) (anything) is that the spaces in the pattern, while they do match newlines, also match ordinary spaces. So it could be grabbing three chunks of text that are only separated by spaces, not by paragraph breaks.

There's not really a good consistent way of grabbing multiple paragraphs. The matching feature was meant more for grabbing smaller runs of text since they were going to be used in places like filenames which can't handle text that is too long. You might need to use a script to do that part.
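
For illustration only, here is a rough Python sketch of what such a script might do. It is not a Hazel feature, and it rests on several assumptions: the PDF text has already been extracted to a plain string by some other tool, paragraphs are separated by blank lines, and the summary starts right after the literal marker "Summary:". The function name extract_summary and the max_paragraphs parameter are made up for this example.

```python
import re


def extract_summary(text, marker="Summary:", max_paragraphs=4):
    """Return up to max_paragraphs paragraphs that follow marker.

    Hypothetical helper: assumes `text` is plain text already pulled
    out of the PDF, with blank lines between paragraphs.
    Returns an empty list if the marker is not found.
    """
    start = text.find(marker)
    if start == -1:
        return []
    rest = text[start + len(marker):]
    # Treat one or more blank lines as a paragraph break.
    chunks = re.split(r"\n\s*\n", rest)
    # Collapse line wraps inside each paragraph and drop empty chunks.
    paragraphs = [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]
    return paragraphs[:max_paragraphs]


if __name__ == "__main__":
    sample = (
        "A Title\nJ. Author\n\n"
        "Summary: First paragraph\nwraps here.\n\n"
        "Second paragraph.\n\n"
        "Third paragraph.\n\n"
        "1. Introduction\n"
    )
    for paragraph in extract_summary(sample, max_paragraphs=3):
        print(paragraph)
```

Note that this takes a fixed number of paragraphs and has no idea where the summary really ends, so a heading such as "1. Introduction" would be picked up if max_paragraphs is set too high. Real PDF text may also lack clean blank lines between paragraphs, in which case the splitting rule would need adjusting.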
Mr_Noodle
Site Admin
 
Posts: 11872
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

