Akira Hayakawa
2017-05-28 11:23:56 UTC
I am new to Tesseract. My aim is to use this software to analyze Japanese
documents. My plan is to start from an existing model and fine-tune it
on new words that weren't correctly recognized.
I am reading the Wiki and have some questions.
1)
In
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
you pass --training_text to tesstrain.sh:
training/tesstrain.sh \
--fonts_dir /usr/share/fonts \
--training_text ../langdata/ara/ara.training_text \
--langdata_dir ../langdata \
--tessdata_dir ./tessdata \
--lang ara \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--fontlist "Arial" \
--output_dir ~/tesstutorial/aratest
but in
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
you don't. Why?
training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
--noextract_font_properties --langdata_dir ../langdata \
--tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
My understanding is:
1. tesstrain.sh uses the text2image command internally to generate images,
which are rendered in various fonts and reshaped.
2. --linedata_only splits the training text into lines and makes an image
for each line.
3. --langdata_dir is essential but --training_text isn't. If --training_text
isn't given, it uses the default $lang/$lang.training_text.
Am I correct?
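To make point 3 concrete, this is the fallback I have in mind, written as a
small shell sketch. The function name resolve_training_text is hypothetical
and nothing here is taken from the tesstrain.sh source; it only encodes my
assumption about how the training text is chosen.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of tesstrain.sh: it only encodes my
# assumption about which training text ends up being used.
# usage: resolve_training_text LANGDATA_DIR LANG [TRAINING_TEXT]
resolve_training_text() {
    langdata_dir=$1
    lang=$2
    training_text=${3:-}
    if [ -n "$training_text" ]; then
        # An explicit --training_text value wins.
        printf '%s\n' "$training_text"
    else
        # Otherwise fall back to the per-language default under langdata.
        printf '%s\n' "$langdata_dir/$lang/$lang.training_text"
    fi
}

# The eng example gives no training text, the ara example gives one.
resolve_training_text ../langdata eng
resolve_training_text ../langdata ara ../langdata/ara/ara.training_text
```

If the real script behaves differently, that difference is exactly what I
would like to understand.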
2)
In the above example, I don't understand why it takes --tessdata_dir,
because it seems irrelevant to making training data.
3)
In
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
it says the reader should lay out the projects like this:
./langdata
./langdata/eng
./langdata/ara
./tessdata
./tesseract
./tesseract/tessdata
./tesseract/tessdata/configs/
./tesseract/training
etc
and all the following examples are run under the tesseract directory. Then I
think the examples should take ../tessdata as --tessdata_dir, not
./tessdata. I mean the examples should be fixed.
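To show what I mean, the sketch below rebuilds that layout in a temporary
directory and prints where the two candidate paths point when resolved from
inside the tesseract directory. The scratch directory and the variable names
are my own; only the directory names come from the wiki.

```shell
#!/bin/sh
# Recreate the wiki's directory layout in a scratch directory.
# Nothing here touches a real Tesseract checkout.
root=$(mktemp -d)
mkdir -p "$root/langdata/eng" "$root/langdata/ara" "$root/tessdata" \
         "$root/tesseract/tessdata/configs" "$root/tesseract/training"
root_real=$(cd "$root" && pwd -P)

# The wiki examples are run from inside the tesseract directory.
cd "$root/tesseract" || exit 1

# Resolve both candidate --tessdata_dir values to absolute paths.
inner=$(cd ./tessdata && pwd -P)
outer=$(cd ../tessdata && pwd -P)

# Print them relative to the layout root for readability.
echo "./tessdata  -> ${inner#"$root_real"/}"
echo "../tessdata -> ${outer#"$root_real"/}"
```

So ./tessdata is the tessdata directory inside the tesseract tree, while
../tessdata is the sibling directory next to langdata.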
4)
In
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
combine_tessdata -e ../tessdata/ara.traineddata \
~/tesstutorial/aratuned_from_ara/ara.lstm
This is explained as extracting the existing LSTM model for Arabic from
tessdata, but how does that work?
Does the combine_tessdata command extract the LSTM model because the
extension of the second parameter is .lstm?
Another question here is why the LSTM model is bundled into the traineddata
file. I think the traineddata file mixes the legacy trained model and the
LSTM model, and I am wondering why they aren't separated. Even if the user
only uses LSTM, are both trained models read? (Is that lightweight? Then it
might be OK.)
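For the first part of question 4, this is the rule I am guessing at: the
component to extract is chosen purely from the extension of the output path.
The helper component_for_output is hypothetical and is not code from
combine_tessdata; it just spells out my assumption.

```shell
#!/bin/sh
# Hypothetical helper, NOT taken from combine_tessdata: guess which component
# an output path asks for by looking only at its extension.
component_for_output() {
    path=$1
    # Strip everything up to and including the last dot.
    printf '%s\n' "${path##*.}"
}

# The output path from the wiki example would select the "lstm" component.
component_for_output ~/tesstutorial/aratuned_from_ara/ara.lstm
```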
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+***@googlegroups.com.
To post to this group, send email to tesseract-***@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2a55760b-371b-483d-b5e2-731110bc83a4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.