Georg Sauthoff
2017-10-15 19:29:30 UTC
Hello,
for some documents it would make sense to create a text-only PDF with
tesseract (cf. -c textonly_pdf=1) and merge it with an image-only PDF, as
described in
https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#integrate-original-image-file-and-detected-text-into-pdf
and the linked github issue comment.
Use-case: let tesseract do its OCR on very high-quality images but put some
post-processed images into the resulting PDF file. Thus, you get high
quality OCR results and a relatively small PDF file.
So the approach described in the FAQ/issue sounds nice, but how do I actually
merge the 2 PDF files (on Linux)?
When googling for PDF merge tools I only find ones that concatenate PDF
files ...
For the above merge, the 2 PDF files have to be merged 'on top' of each
other, i.e. the number of pages in the resulting PDF doesn't change; it
'just' gets the text layer added.
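
To make the 'on top' merge concrete, here is a minimal sketch of the two steps. It assumes qpdf (8.4 or later, for its --overlay option) is installed; qpdf is an assumption here, not a tool named above, and all file names are hypothetical. Both PDFs should have the same page count, and the page sizes should match so that the invisible text lines up with the image.

```python
import subprocess
from pathlib import Path


def textonly_pdf_cmd(image, out_base):
    """Command that lets tesseract write a text-only PDF.

    tesseract appends '.pdf' to out_base on its own.
    """
    return ["tesseract", str(image), str(out_base),
            "-c", "textonly_pdf=1", "pdf"]


def overlay_cmd(image_pdf, text_pdf, merged_pdf):
    """Command that stamps each page of text_pdf onto the matching page
    of image_pdf (assumption: qpdf >= 8.4). The page count of the result
    equals that of image_pdf; nothing is concatenated.
    """
    return ["qpdf", str(image_pdf), "--overlay", str(text_pdf),
            "--", str(merged_pdf)]


def build_searchable_pdf(hq_image, image_pdf, merged_pdf):
    """OCR the high-quality image, then add the text layer to the small PDF."""
    merged_pdf = Path(merged_pdf)
    # Hypothetical naming scheme for the intermediate text-only PDF.
    text_base = merged_pdf.with_name(merged_pdf.stem + "_text")
    subprocess.run(textonly_pdf_cmd(hq_image, text_base), check=True)
    subprocess.run(
        overlay_cmd(image_pdf, text_base.with_suffix(".pdf"), merged_pdf),
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical inputs: scan_hq.tif is the high-quality scan,
    # scan_small.pdf holds the post-processed, image-only pages.
    build_searchable_pdf("scan_hq.tif", "scan_small.pdf", "scan_merged.pdf")
```

Since the text in a text-only PDF is invisible, it should not matter much whether it ends up above or below the image; overlay is used here only as one plausible choice.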
Best regards
Georg
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+***@googlegroups.com.
To post to this group, send email to tesseract-***@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/db824e06-989a-49b1-bda9-54af546570cb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.