How to use Tesseract OCR (no gImagereader in repo) [Solved]

Postby linuxfluesterer » Fri Jun 16, 2017 14:20

Hello guys.
In a 'Linux User' magazine I read about the Tesseract OCR engine, which is already in the Sabayon repo.
The article discusses the graphical front-end gimagereader. I would like to try gimagereader, so I installed (copied) it from the Debian package, as I often do when a program is not in the Sabayon repo.
But when I start it as a normal user I receive an error:
Code:
gimagereader: error while loading shared libraries: libgtkspellmm-3.0.so.0: cannot open shared object file: No such file or directory

I have searched for the missing libgtkspellmm-3.0, but it is not part of the Sabayon repo.
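
The loader only reports the first library it cannot find, so a binary copied from another distribution may be missing more than one. A minimal sketch for listing all of them with ldd follows; the helper name missing_libs and the example path are illustrative assumptions, not something from this thread.

```shell
# Print every shared library that a binary needs but the dynamic loader
# cannot resolve, one name per line. Prints nothing if all libraries resolve.
missing_libs() {
    ldd "$1" 2>/dev/null | awk '/not found/ { print $1 }'
}

# Example call (the path is an assumption; use wherever the binary was copied):
# missing_libs /usr/bin/gimagereader
```

Each name printed would have to be provided by a package from the distribution's own repositories before the program can start.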
So, two questions:
1. Is there any way to make gimagereader work?
2. If not, which program (LibreOffice or something else) uses tesseract for OCR?

I'm using Sabayon with Kernel 4.10 and Plasma 5.9.5, all with latest updates.
Thank you in advance.

-Linuxfluesterer (I love KDE...)
Last edited by linuxfluesterer on Tue Jul 11, 2017 22:55, edited 1 time in total.
Take away Facebook from me and let there be real people again...
linuxfluesterer
Old Dear Hen
 
Posts: 783
Joined: Thu Sep 20, 2012 19:47
Location: Germany

Re: How to use Tesseract OCR (no gImagereader in repo)

Postby Fitzcarraldo » Sat Jun 17, 2017 13:09

I have an alternative suggestion: Use media-gfx/gscan2pdf instead of gimagereader. Although gscan2pdf is primarily intended for use with a scanner, it can also open image files (PNG, JPG etc.) and can perform OCR (a number of OCR engines are supported, including Tesseract) on scanned images and existing image files. gscan2pdf is in the Sabayon Entropy repositories.

EDIT: BTW, gscan2pdf can use various OCR engines: gocr, tesseract, ocropus and cuneiform. Three of those OCR engines are in the Sabayon Entropy repositories.
Fitzcarraldo
Sagely Hen
 
Posts: 8077
Joined: Sat Mar 10, 2007 5:40
Location: United Kingdom

Re: How to use Tesseract OCR (no gImagereader in repo)

Postby svantoviit » Wed Jul 05, 2017 16:14

You could also give yagf a try.
Working with tesseract from the command line is not that hard either btw.
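
For reference, a minimal sketch of a command-line run. The helper name ocr_image and the file name scan.png are made up for illustration; the underlying call is simply tesseract IMAGE OUTPUTBASE -l LANG, and Tesseract appends .txt to the output base name itself.

```shell
# Run Tesseract on one image and write the recognized text next to it.
# Usage: ocr_image IMAGE [LANG]    (LANG defaults to eng)
ocr_image() {
    image=$1
    lang=${2:-eng}
    outbase=${image%.*}     # strip the extension; tesseract adds .txt itself
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract is not installed" >&2
        return 127
    fi
    tesseract "$image" "$outbase" -l "$lang"
}

# Example call (assumes scan.png exists and the German data file is installed):
# ocr_image scan.png deu    # result ends up in scan.txt
```

The language code has to match a *.traineddata file in the tessdata directory, otherwise Tesseract refuses to start.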

linuxfluesterer wrote:I installed (copied) it from debian package, as I do often

Not the smartest way to install something in Sabayon. I would look to Gentoo first.
Code:
$ eix -R gimagereader
* app-text/gimagereader [1]
     Available versions:  (~)3.2.0 {scanner}
     Homepage:            https://github.com/manisandro/gImageReader
     Description:         A tesseract OCR front-end

[1] "salfter" layman/salfter
svantoviit
Old Dear Hen
 
Posts: 691
Joined: Sun Feb 28, 2010 17:55

Re: How to use Tesseract OCR (no gImagereader in repo)

Postby linuxfluesterer » Mon Jul 10, 2017 18:36

Fitzcarraldo wrote:I have an alternative suggestion: Use media-gfx/gscan2pdf instead of gimagereader. Although gscan2pdf is primarily intended for use with a scanner, it can also open image files (PNG, JPG etc.) and can perform OCR (a number of OCR engines are supported, including Tesseract) on scanned images and existing image files. gscan2pdf is in the Sabayon Entropy repositories.

EDIT: BTW, gscan2pdf can use various OCR engines: gocr, tesseract, ocropus and cuneiform. Three of those OCR engines are in the Sabayon Entropy repositories.


Hello Fitz.
Sorry, I was very busy learning bash, scripting, variables and so on.
Today I installed gscan2pdf, and right after starting the new program I could scan with my Canon MX925. A new window opened, I clicked on 'Scan', and after about 80% of the scan I received an error:
[Image: screenshot of the error message]

I don't get any recognized text from my scanned document. Tesseract is already installed. So what is wrong, please? What can I do to fix it?

Thanks in advance.

-Linuxfluesterer (I love KDE...)

Re: How to use Tesseract OCR (no gImagereader in repo)

Postby Fitzcarraldo » Tue Jul 11, 2017 1:32

Does the mentioned Tesseract data file (ara.cube.lm) exist in your installation? Have a look in the directory /usr/share/tessdata/ to see which Tesseract data files are there. For example, in my case I only have the file 'eng.traineddata' because I installed Tesseract in Gentoo just for English. As you are getting a message that the Cube OCR engine cannot find a Cube data file for Arabic, try downloading all the relevant Tesseract data files and Cube data files from the Tesseract Wiki page https://github.com/tesseract-ocr/tesser ... Data-Files for the version of Tesseract that you have installed (3.05.00 from the Sabayon Weekly repository, I assume?). What you are seeing may be because the SL Entropy Tesseract package was built for all available languages; notice the languages USE flags are all set in the Entropy package:

https://packages.sabayon.org/show/tesse ... -show-what
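
To see at a glance which languages have data files in that directory, something like the following may help. It is only a sketch: the helper name list_tessdata_langs is hypothetical, and the default path is the /usr/share/tessdata/ directory mentioned above.

```shell
# Print the language code of every *.traineddata file in a tessdata directory.
# Usage: list_tessdata_langs [DIR]    (DIR defaults to /usr/share/tessdata)
list_tessdata_langs() {
    dir=${1:-/usr/share/tessdata}
    for f in "$dir"/*.traineddata; do
        [ -e "$f" ] || continue     # the glob matched nothing
        basename "$f" .traineddata
    done
}

# Example call:
# list_tessdata_langs
```

Running tesseract --list-langs should report the same set as seen by Tesseract itself.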

If you have no joy with Tesseract, try GOCR instead. I find it gives me better results than tesseract.

Re: How to use Tesseract OCR (no gImagereader in repo)

Postby linuxfluesterer » Tue Jul 11, 2017 12:42

Hello Fitz.
Thank you again for the quick reply.
I have now analyzed the scanned document with gocr. I have to start text recognition manually, even though I chose gocr in the OCR engine menu, because when I start scanning from the main menu in gscan2pdf it still assumes that tesseract is the default OCR engine, which leads to the 'cube' error.

OK, when I choose the gocr engine, I get more or less badly recognized text.
But how do I save this result as a text file? When I save, I can choose from several image formats and also PDF.
I need the recognized text as a text file. Could you please tell me how to save it?
When I open the PDF file in Okular, the 'select text field' tool doesn't work.
Thank you again.

-Linuxfluesterer (I love KDE...)

Re: How to use Tesseract OCR (no gImagereader in repo)

Postby Fitzcarraldo » Tue Jul 11, 2017 18:10

linuxfluesterer wrote:it still assumes that tesseract is the default OCR engine, which leads to the 'cube' error.

You can change the default OCR engine via the GUI after you have opened an image file: Tools > OCR > OCR Engine. If you cannot get that to work, you can edit the gscan2pdf configuration file instead (look for the line containing "ocr engine"):

Code:
$ grep engine ~/.config/gscan2pdfrc
   "ocr engine" : "gocr",

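If you prefer to change that line from a terminal rather than in an editor, here is a minimal sketch. It assumes the line has the form shown in the grep output above; the helper name set_ocr_engine is hypothetical, and closing gscan2pdf first is a precaution on the assumption that it may rewrite its configuration on exit.

```shell
# Set the "ocr engine" value in a gscan2pdf configuration file.
# A backup of the original is kept as <file>.bak.
# Usage: set_ocr_engine ENGINE [FILE]  (FILE defaults to ~/.config/gscan2pdfrc)
set_ocr_engine() {
    engine=$1
    rc=${2:-$HOME/.config/gscan2pdfrc}
    cp "$rc" "$rc.bak" &&
    sed "s/\(\"ocr engine\" *: *\)\"[^\"]*\"/\1\"$engine\"/" "$rc.bak" > "$rc"
}

# Example call:
# set_ocr_engine gocr
```

Only the quoted value after "ocr engine" is replaced; all other lines in the file are copied through unchanged.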

Do you still get the error message after you have copied the Tesseract V3.05 *.traineddata files for all the languages (including ara.traineddata) and the Tesseract Cube data files listed below to the /usr/share/tessdata/ directory?:

ara.cube.bigrams, ara.cube.fold, ara.cube.lm, ara.cube.nn, ara.cube.params, ara.cube.word-freq, ara.cube.size, ara.tesseract_cube.nn
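
A quick way to check which of the files listed above are still absent is sketched below. The helper name missing_cube_files is hypothetical; the file names and the default directory are the ones given in this post.

```shell
# Print the Arabic Cube data files that are NOT present in a tessdata
# directory. Prints nothing if all eight files are there.
# Usage: missing_cube_files [DIR]    (DIR defaults to /usr/share/tessdata)
missing_cube_files() {
    dir=${1:-/usr/share/tessdata}
    for name in ara.cube.bigrams ara.cube.fold ara.cube.lm ara.cube.nn \
                ara.cube.params ara.cube.word-freq ara.cube.size \
                ara.tesseract_cube.nn; do
        [ -e "$dir/$name" ] || echo "$name"
    done
}

# Example call:
# missing_cube_files
```

Empty output means all eight Cube files are in place, so a remaining error would point to something else.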

linuxfluesterer wrote:But how do I save this result as a text file? When I save, I can choose from several image formats and also PDF. I need the recognized text as a text file. Could you please tell me how to save it? When I open the PDF file in Okular, the 'select text field' tool doesn't work.

Right-click on the text in the OCR pane in gscan2pdf and a window titled 'Editing text...' pops up. You can edit the text in this window and/or copy it and paste it into another application.

EDIT: The command 'man gscan2pdf' has some useful information. Also, the Web page http://gscan2pdf.sourceforge.net/ is worth reading. Notice, for example, "There is an interesting review of OCR software at https://web.archive.org/web/20080529012 ... ate.ca/ocr. An important conclusion was that 400ppi is necessary for decent results."

Re: How to use Tesseract OCR (no gImagereader in repo)

Postby linuxfluesterer » Tue Jul 11, 2017 22:54

Fitzcarraldo wrote:You can change the default OCR engine via the GUI after you have opened an image file: Tools > OCR > OCR Engine. If you cannot get that to work, you can edit the gscan2pdf configuration file instead (look for the line containing "ocr engine"):

Code:
$ grep engine ~/.config/gscan2pdfrc
   "ocr engine" : "gocr",



I checked this file; "ocr engine" is already set to "gocr".
Anyway, there was another error saying that the package unpaper was not installed. After I installed unpaper, the error went away.

Fitzcarraldo wrote:Do you still get the error message after you have copied the Tesseract V3.05 *.traineddata files for all the languages (including ara.traineddata) and the Tesseract Cube data files listed below to the /usr/share/tessdata/ directory?:

ara.cube.bigrams, ara.cube.fold, ara.cube.lm, ara.cube.nn, ara.cube.params, ara.cube.word-freq, ara.cube.size, ara.tesseract_cube.nn


Sorry, I don't get it. When I go to the sourceforge website I can only download ara.bin files for version 4 and version 3.04/3.05. And yes, tesseract is in the Sabayon repo at version 3.05.
So where are all the ara.xxx.xxx files? Too complicated.

Fitzcarraldo wrote:Right-click on the text in the OCR pane in gscan2pdf and a window titled 'Editing text...' pops up. You can edit the text in this window and/or copy it and paste it into another application.

With this I finally had success. I can have the scanned document recognized as text and then copy and paste it.

So, Fitz, thanks to you I got gscan2pdf running and can use OCR. I will also read your recommendations on the OCR topic. I have already found the hint that 400 ppi is necessary.
I will mark this thread as solved.

-Linuxfluesterer (I love KDE...)

Re: How to use Tesseract OCR (no gImagereader in repo)

Postby Fitzcarraldo » Wed Jul 12, 2017 1:00

linuxfluesterer wrote:
Fitzcarraldo wrote:Do you still get the error message after you have copied the Tesseract V3.05 *.traineddata files for all the languages (including ara.traineddata) and the Tesseract Cube data files listed below to the /usr/share/tessdata/ directory?:

ara.cube.bigrams, ara.cube.fold, ara.cube.lm, ara.cube.nn, ara.cube.params, ara.cube.word-freq, ara.cube.size, ara.tesseract_cube.nn

Sorry, I don't get it. When I go to the sourceforge website I can only download ara.bin files for version 4 and version 3.04/3.05. And yes, tesseract is in the Sabayon repo at version 3.05.
So where are all the ara.xxx.xxx files? Too complicated.

The ara.*.* files can be found further down on the Tesseract Wiki page: https://github.com/tesseract-ocr/tesser ... ion-304305

Tesseract V4.*, which is not yet available in SL, no longer uses the Cube OCR engine, so those files would no longer have been needed had you been using Tesseract 4.*. Anyway, GOCR is simpler to use.

Re: How to use Tesseract OCR (no gImagereader in repo)

Postby Fitzcarraldo » Wed Jul 12, 2017 17:18

linuxfluesterer wrote:
Fitzcarraldo wrote:Right-click on the text in the OCR pane in gscan2pdf and a window titled 'Editing text...' pops up. You can edit the text in this window and/or copy it and paste it into another application.

With this I finally had success. I can have the scanned document recognized as text and then copy and paste it.

BTW, you can also click on File > Save in the main gscan2pdf menu, select 'Image type' to be 'Text', then click Save and enter a file name for the text file in the pop-up window.