[tesseract-ocr] PDF woes: "pixRead: image file not found error"

Discussion:

John Muccigrosso

2016-06-24 00:12:13 UTC

Recently installed tesseract and am having some trouble with PDFs. The
error is some form of:

Error in fopenReadStream: file not found
%ï¿œï¿œï¿œï¿œ in pixRead: image file not found: %PDF-1.3
%ï¿œï¿œï¿œï¿œ cannot be read!
Error during processing.

where the 1.3 may be 1.4 or 1.6. Things are fine with a jpg or tiff version
of the same PDF (created by exporting from Preview.app).

System: Mac OS X 10.9.5.
"tesseract -v" reports:

tesseract 3.04.01
leptonica-1.72
libjpeg 8d : libpng 1.6.23 : libtiff 4.0.6 : zlib 1.2.5

I installed tesseract and leptonica with homebrew and "brew info
tesseract" reports:

tesseract: stable 3.04.01 (bottled), HEAD
OCR (Optical Character Recognition) engine
https://github.com/tesseract-ocr/
/usr/local/Cellar/tesseract/3.04.01_1 (93 files, 39.5M) *
Poured from bottle on 2016-05-27 at 15:41:15
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/tesseract.rb
==> Dependencies
Required: leptonica ✔
Recommended: libtiff ✔
==> Options
--with-all-languages
Install recognition data for all languages
--with-opencl
Enable OpenCL support
--with-training-tools
Install OCR training tools
--without-libtiff
Build without libtiff support
--HEAD
Install HEAD version

I suspect some missing package or something similar, but don't know what
exactly.

TIA.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+***@googlegroups.com.
To post to this group, send email to tesseract-***@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/de320a67-b788-4263-8486-a522c556051c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Quan Nguyen

2016-06-24 01:09:53 UTC

Tesseract cannot read PDF (which is a document format) directly. You'll
need to convert it to an image format first.
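
For anyone landing here with the same error, the conversion step can be sketched roughly as below. This is only an illustration under assumptions not stated in the thread: it assumes pdftoppm (from the poppler utilities) is installed alongside tesseract, and ocr_pdf is a made-up helper name, not a tesseract command. Any other PDF rasterizer would do just as well.

```shell
#!/bin/sh
# Sketch: rasterize each PDF page to PNG, then OCR each page image.
# Assumes pdftoppm (poppler) and tesseract are both on PATH.
# ocr_pdf is a hypothetical helper name, not part of tesseract.
ocr_pdf() {
    pdf=$1
    prefix=${2:-page}

    # Fail early if either tool is missing.
    for tool in pdftoppm tesseract; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing required tool: $tool" >&2
            return 1
        }
    done

    # Render every page at 300 dpi as PNG files named <prefix>-<n>.png.
    pdftoppm -r 300 -png "$pdf" "$prefix" || return 1

    # OCR each page image; tesseract appends .txt to the output base name.
    for img in "$prefix"-*.png; do
        [ -e "$img" ] || continue
        tesseract "$img" "${img%.png}" || return 1
    done
}
```

Calling `ocr_pdf scan.pdf page` (scan.pdf being a placeholder file name) should leave one text file per page next to the page images. The 300 dpi setting is a common starting point for OCR rather than a requirement.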

John Muccigrosso

2016-06-24 14:27:52 UTC

Post by Quan Nguyen
Tesseract cannot read PDF (which is a document format) directly. You'll
need to convert it to an image format first.

Ugh, of course. Thanks!
