[tesseract-ocr] How can I train Tesseract 4.0 LSTM for receipts?

Ahmad Moawad

2017-06-04 18:36:35 UTC

Hello All,

I want to train Tesseract 4.0 LSTM to recognize receipts. My questions
relate to:

1. Training based on image
2. Image processing
3. Add new words to the dictionary

- I have read the documentation, and I think the best option is
*Finetune*. So I need to provide box/tiff files before training.

- I know the command below creates the box files in a directory under /tmp.
Should I edit the box files there, or edit them separately and pass them to
the command? In the latter case, how do I pass them to the command?

training/tesstrain.sh \
--fonts_dir /usr/share/fonts \
--training_text ../langdata/ara/ara.training_text \
--langdata_dir ../langdata \
--tessdata_dir ./tessdata \
--lang ara \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--fontlist "Arial" \
--output_dir ~/tesstutorial/aratest
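
If I do end up editing box files by hand, I assume a quick sanity check like the one below would catch malformed lines before training. This is only a minimal sketch: check_box is a hypothetical helper, not part of Tesseract, and it assumes the usual box line layout of symbol, left, bottom, right, top, page, with the origin at the bottom-left of the image.

```shell
#!/bin/sh
# check_box: hypothetical helper (not a Tesseract tool) that reports
# malformed lines in a hand-edited box file.
# Assumed line layout: <symbol> <left> <bottom> <right> <top> <page>
check_box() {
    awk '
        NF < 5 { printf "line %d: too few fields\n", NR; bad = 1; next }
        {
            # The last five fields must be non-negative integers; whatever
            # precedes them is the symbol (it may be blank for space/tab).
            for (i = NF - 4; i <= NF; i++) {
                if ($i !~ /^[0-9]+$/) {
                    printf "line %d: field %d is not an integer\n", NR, i
                    bad = 1
                    next
                }
            }
            left = $(NF - 4) + 0; bottom = $(NF - 3) + 0
            right = $(NF - 2) + 0; top = $(NF - 1) + 0
            if (left > right || bottom > top) {
                printf "line %d: box has negative size\n", NR
                bad = 1
            }
        }
        END { exit bad + 0 }   # non-zero exit status if any line was bad
    ' "$1"
}
```

For example, `check_box receipt1.box` (receipt1.box is a placeholder name) would print one message per bad line and return a non-zero status if anything looks wrong.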

- For image processing, I am using the libraries listed in the
documentation:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
If there are other options for image processing, please tell me.
- To add new words to the dictionary, should I add them directly to
ara.wordlist?
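
To make the word list question concrete, this is roughly what I have in mind: a minimal sketch that merges a file of receipt-specific words into a copy of the existing list without duplicates. merge_wordlist and the file names are hypothetical. I assume the merged list would still have to be rebuilt into the traineddata afterwards (for example with wordlist2dawg and combine_tessdata), which this sketch does not do.

```shell
#!/bin/sh
# merge_wordlist: hypothetical helper, not part of Tesseract.
#   $1 = existing word list (one word per line)
#   $2 = file with new words (one word per line)
#   $3 = output file for the merged list
merge_wordlist() {
    # Strip trailing whitespace, skip blank lines, and keep only the first
    # occurrence of each word so the original order is preserved.
    awk '{ sub(/[ \t\r]+$/, "") } NF && !seen[$0]++' "$1" "$2" > "$3"
}
```

Usage would be something like `merge_wordlist ara.wordlist receipt_words.txt ara.wordlist.new`, writing to a new file so the original list stays untouched.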

Any help would be appreciated. Thank you.
