Discussion:
[tesseract-ocr] Fine-tuning LSTM for Japanese
Akira Hayakawa
2017-05-28 11:23:56 UTC
Permalink
I am new to tesseract. My aim is to use this software to analyze Japanese
documents. My idea is to start from an existing model and fine-tune it
with new words that weren't correctly recognized.

I am reading the Wiki and have some questions.

1)

In
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

you pass --training_text to tesstrain.sh:

training/tesstrain.sh \
--fonts_dir /usr/share/fonts \
--training_text ../langdata/ara/ara.training_text \
--langdata_dir ../langdata \
--tessdata_dir ./tessdata \
--lang ara \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--fontlist "Arial" \
--output_dir ~/tesstutorial/aratest
but

In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

You don't. Why?

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng
--linedata_only \
--noextract_font_properties --langdata_dir ../langdata \
--tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

My understanding is

1. tesstrain.sh uses the text2image command internally to generate images
rendered in the various fonts and then degraded (I sketch what I think this
looks like below).
2. --linedata_only splits the training text into lines and makes an image for
each line.
3. --langdata_dir is essential but --training_text isn't. If --training_text
isn't specified, it uses the default $langdata_dir/$lang/$lang.training_text.

Am I correct?
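To illustrate point 1, I imagine tesstrain.sh runs something like the following
for each font (this is only my guess at the flags; I haven't checked what it
actually passes internally):

text2image --text=../langdata/ara/ara.training_text \
  --outputbase=/tmp/ara.Arial.exp0 \
  --font='Arial' \
  --fonts_dir=/usr/share/fonts

and then feeds the rendered pages back through tesseract to produce the
per-line .lstmf training data.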

2)

In the above examples, I don't understand why tesstrain.sh needs --tessdata_dir,
because it seems irrelevant to making training data.

3)

In
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

it says the reader should lay out the projects like this

./langdata
./langdata/eng
./langdata/ara
./tessdata
./tesseract
./tesseract/tessdata
./tesseract/tessdata/configs/
./tesseract/training
etc
and all the following examples are run inside the tesseract directory. In that case I
think the examples should pass ../tessdata as --tessdata_dir, not
./tessdata. In other words, I think the examples should be fixed.

4)

In
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

combine_tessdata -e ../tessdata/ara.traineddata \
~/tesstutorial/aratuned_from_ara/ara.lstm
This is explained as extracting the existing LSTM model for Arabic from
tessdata, but how does that work?
Does combine_tessdata extract the LSTM model because the extension of
the second argument is .lstm?

Another question: why is the LSTM model bundled into the traineddata file? It
seems the traineddata file mixes the legacy model and the LSTM model, and I
am wondering why they aren't separated. Even if the user only uses the LSTM engine,
are both models read? (If reading them is lightweight, that might be fine.)
ShreeDevi Kumar
2017-05-28 18:14:21 UTC
Permalink
Post by Akira Hayakawa
I am new to tesseract. My aim is to use this software to analyze Japanese
documents. My idea is to start from an existing model and fine-tune it
with new words that weren't correctly recognized.
I am reading the Wiki and have some questions.
1)
In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
you pass --training_text to tesstrain.sh:
training/tesstrain.sh \
--fonts_dir /usr/share/fonts \
--training_text ../langdata/ara/ara.training_text \
--langdata_dir ../langdata \
--tessdata_dir ./tessdata \
--lang ara \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--fontlist "Arial" \
--output_dir ~/tesstutorial/aratest
but
In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
You don't. Why?
training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng
--linedata_only \
--noextract_font_properties --langdata_dir ../langdata \
--tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
My understanding is
1. tesstrain.sh uses the text2image command internally to generate images
rendered in the various fonts and then degraded.
2. --linedata_only splits the training text into lines and makes an image for
each line.
3. --langdata_dir is essential but --training_text isn't. If --training_text
isn't specified, it uses the default $langdata_dir/$lang/$lang.training_text.
Am I correct?
Yes, you are correct.
Post by Akira Hayakawa
2)
In the above examples, I don't understand why tesstrain.sh needs
--tessdata_dir, because it seems irrelevant to making training data.
tesseract needs the eng and osd traineddata during initialization. The
location can also be specified via TESSDATA_PREFIX.
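For example, something like this (adjust the path to wherever your traineddata
files live; whether TESSDATA_PREFIX should point to the tessdata directory or
its parent depends on your tesseract version):

# the directory given as --tessdata_dir should already contain these
ls ./tessdata/eng.traineddata ./tessdata/osd.traineddata

# or point tesseract to them via the environment instead
export TESSDATA_PREFIX=/path/to/tessdata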
Post by Akira Hayakawa
3)
In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
it says the reader should lay out the projects like this
./langdata
./langdata/eng
./langdata/ara
./tessdata
./tesseract
./tesseract/tessdata
./tesseract/tessdata/configs/
./tesseract/training
etc
That will be the directory structure if you were to clone the tesseract,
langdata and tessdata repositories.

It is not recommended to clone the whole tessdata repo (over 1 GB); you can
just download the traineddata files for the languages you need.
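For example (the exact download URL and branch are from memory, please check
the tessdata repo on GitHub):

wget -P ./tessdata https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
wget -P ./tessdata https://github.com/tesseract-ocr/tessdata/raw/master/osd.traineddata
wget -P ./tessdata https://github.com/tesseract-ocr/tessdata/raw/master/ara.traineddata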
Post by Akira Hayakawa
and all the following examples are run inside the tesseract directory. In that case I
think the examples should pass ../tessdata as --tessdata_dir, not
./tessdata. In other words, I think the examples should be fixed.
./tessdata (in the tesseract repo) does not have any traineddata files to
begin with.

You can change the directories to match your own configuration.
Post by Akira Hayakawa
4)
In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
combine_tessdata -e ../tessdata/ara.traineddata \
~/tesstutorial/aratuned_from_ara/ara.lstm
This is explained as extracting the existing LSTM model for Arabic from
tessdata, but how does that work?
Does combine_tessdata extract the LSTM model because the extension of
the second argument is .lstm?
Yes.
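The suffix of the output path selects which component -e extracts. A quick way
to see the available component names (as far as I recall the tool's options):

# list the components packed inside the traineddata file
combine_tessdata -d ../tessdata/ara.traineddata

# the .lstm suffix picks out the LSTM model component
combine_tessdata -e ../tessdata/ara.traineddata ./ara.lstm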
Post by Akira Hayakawa
Another question: why is the LSTM model bundled into the traineddata file? It
seems the traineddata file mixes the legacy model and the LSTM model, and I
am wondering why they aren't separated. Even if the user only uses the LSTM engine,
are both models read? (If reading them is lightweight, that might be fine.)
The 4.0 code is in the alpha stage of testing and supports both the legacy engine
and the new LSTM engine, so the traineddata file contains both models.

You can use combine_tessdata to keep only the LSTM model in the traineddata.
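Roughly like this (a sketch; check the actual component names in your file
with combine_tessdata -d first, they can differ between versions):

# unpack everything into individual ara.* files
combine_tessdata -u ara.traineddata ara.
# remove the legacy-engine components, for example
rm ara.inttemp ara.normproto ara.pffmtable ara.shapetable
# rebuild ara.traineddata from the remaining ara.* files
combine_tessdata ara.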
​
Akira Hayakawa
2017-05-29 03:49:35 UTC
Permalink
Thanks for the reply. I understand.

I have a couple of follow-up questions on this topic.

1)

Should training_text only include the text for the next (or new) round of learning?
For example, if the LSTM net has already learned the line "I have a pen" and we now
need it to learn the line "I have a pineapple", should training_text include only
the pineapple line, with the pen line removed?

2)

In
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh

the files in langdata other than training_text are said to be optional.
I suppose these files are internally handled as hints. Am I right?
And what if these files are inconsistent with training_text? For example,
wordlist may contain fairly irrelevant words.
Should I erase the optional files if they are inconsistent?
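For reference, the per-language files I see in langdata look roughly like this
(from a quick look at the repo, so I may be misremembering some names):

ls ../langdata/jpn/
# jpn.training_text  jpn.wordlist  jpn.numbers  jpn.punc  (and a few others)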

3)

Closely related to 2).
When langdata doesn't have these optional files, does Tesseract internally
generate them from training_text?

4)

Is there no way to fine-tune legacy tesseract?

5)

In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
NOTE Tesseract 4.00 will now run happily with a traineddata file that
contains just lang.lstm. The lstm-*-dawgs are optional, and none of the
other files are required or used with OEM_LSTM_ONLY as the OCR engine mode. No
bigrams, unichar ambigs or any of the other files are needed or even have
any effect if present.
Does this mean that if we use LSTM only (the legacy engine is going to be removed
in a future release, right?), the optional files like the wordlist are
entirely unnecessary? This sounds natural to me because, as far as I understand,
the LSTM net only learns a text line from a sequence of bytes or an image.
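To test this, I plan to run recognition with the LSTM engine only, something
like the following (assuming --oem 1 is the LSTM-only engine mode in 4.00):

tesseract page.png out -l jpn --oem 1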
By the way, what does "dawgs" mean?