Tesseract 3.04 PDF Output is Blue - linux

Background
I am using Tesseract-OCR 3.04 on a Linux setup to batch-OCR a bunch of non-searchable PDFs.
My process is this: I take the PDF, convert it to TIFF, and then use Tesseract to convert that TIFF into a searchable PDF.
The issue
The output from the Tesseract 3.04 tiff to pdf conversion always produces a pdf with a blue background. I have checked and the tiff file has a white background.
Here is the output I am getting. Obviously mostly-censored for privacy.
What I have tried
I have created my own "untouched" tiff file with a white background and ran it through Tesseract to a pdf output, and the blue background persists. I did this by typing paragraphs of text into my text editor, screenshotting it, and converting it to tiff.
Searching Google for this issue has turned up absolutely nothing.
--
I do not know what the issue is within the Tesseract process. Does anyone have any information that could help?
Thanks!

Related

Opening an EPS file in Inkscape causes weird line artifacts

I'm trying to edit a vector graphics file from Freepik. The format is EPS and after installing both Inkscape and Ghostscript on Windows, I'm able to open the file with Inkscape. However, Inkscape introduces some weird artifacts (see lines and wrong colors in the picture below).
Side by side comparison, original vector (left) and SVG saved after opening the EPS file in Inkscape (right)
Is there a way to fix this issue?
It's a little difficult to tell, partly because this is a complex illustration and partly because the rendering is a little small. I'd suggest that the circular artefacts are caused by radial fills not being rendered completely.
This could simply be a rendering problem with Inkscape, or it could be that the radial fill has an Extend parameter which isn't being honoured. It could also be a problem calculating a clip.
It's not entirely obvious what you used to render the left-hand image. Is that Ghostscript?
Generally I'd say this looks like an Inkscape bug and you should report it as such.
Edit
Reading through the Inkscape FAQ it seems that Inkscape uses SVG as its native format. That's going to mean that an awful lot of PostScript (and PDF) vector objects aren't going to be represented well. Shadings will either have to be rendered to an image or converted into a complex series of SVG primitives.
Following the link on 'How to open EPS files in Windows' from the FAQ suggests to me that EPS files are either rendered to an image or converted to PDF.
You could use Ghostscript to convert the EPS to PDF yourself, and then try loading the PDF into Inkscape to see if you get a better result. You can also open the PDF in, say, Acrobat to see if it looks OK there.
If the PDF looks fine in Acrobat, but not so good in Inkscape, then I'd say that's an Inkscape problem. If the PDF looks poor in Acrobat then that's a Ghostscript problem.
You can then report the problem as a bug to the appropriate site.
It seems that EPS has more capabilities than SVG and that's why some stuff looks weird when converted to PDF/SVG. Specifically, highlights in an EPS file are not properly rendered in an SVG file.
I checked the conversion from EPS to PDF via Ghostscript and the lines are already there, i.e. it's not an Inkscape bug.
Here's the original file to reproduce the problem:
https://www.freepik.com/free-vector/data-processing-factory-isometric-technology_8625296.htm
And here's what it looks like after converting it to PDF: The artifacts are not as noticeable on the PDF file, possibly because Ghostscript converts it with a higher DPI by default
My workaround to be able to edit the file (remove the background) was to:
open the EPS with Inkscape, ungroup the items
delete the background
export it as PNG
then use the PNG as a "mask" in GIMP to edit the JPG file that came together with the EPS.

PDF images unscaled to PDF document using pdfrw/ReportLab

This question is very similar to PDF image in PDF document using ReportLab (Python), but I seem unable to adapt it to my needs:
I want to add vectorized images (available in SVG or PDF format) to an A4 PDF output. The images must not get scaled! They should simply be placed from top to bottom with some vertical spacing and automatic page breaks.
No text or other content is required. Basically, I'm looking for a pdfnup solution. In the past, I have used pdflatex with a simple input file for the task, but this is not an option for the target system.

flatten images with transparency in PDF

How to flatten images in PDF files with transparency?
convert PDF to PS (postscript)
pdftops input.pdf output.pdf.ps
If a PDF file contains e.g. PNG files with an alpha channel (transparency), the PDF is rendered/rasterized to an image, and that is not a solution because then you lose the plain text in the file.
Is there a tool (linux command line) to flatten images in PDF files with transparency?
It's not clear why you want to do this. If you want PostScript then Ghostscript can produce PostScript for you from a PDF file (use the ps2write device). Obviously transparency will have to be rendered to an image, in which case the resolution is important. The default is 720 dpi, which is probably higher than you need.
Note that a PDF file can't contain a PNG, that's not a possible image type in PDF. A PNG would have to be stored as an image with a separate alpha.
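The ps2write suggestion above could be scripted roughly like this. It is only a sketch: `-o`, `-sDEVICE=ps2write` and `-r` are standard Ghostscript switches, but the helper name `ps2write_command`, the file names and the 300 dpi value are illustrative assumptions, not something taken from the question.

```python
def ps2write_command(pdf_path, ps_path, dpi=300):
    """Build a Ghostscript argument list that converts a PDF to PostScript.

    dpi (an assumed value; choose what suits your output) is the resolution
    used for anything Ghostscript has to render to an image, such as
    transparency.
    """
    return [
        "gs",
        "-o", ps_path,        # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=ps2write",  # PostScript output device
        f"-r{dpi}",           # lower than the 720 dpi default
        pdf_path,
    ]


# Show the command line that would be run (hypothetical file names).
print(" ".join(ps2write_command("input.pdf", "output.ps")))
```

To actually run it, pass the list to `subprocess.run(..., check=True)` on a machine where Ghostscript is installed and on the PATH.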

Extracting Text from a PDF file with embedded font

I have a PDF file containing some tabular data.
http://dl.dropbox.com/u/44235928/sample_rotate-0.pdf
I have to extract the tabular data from it. I have tried the following with no success:
Select the text and paste it to notepad/excel-sheet. (I am getting junk characters)
Used save as text from Acrobat Reader. It is also giving junk characters and not the actual text.
Tried the Apache PDFBox command line utility to extract text from the PDF. It is also giving junk characters instead of real text.
Finally I am trying an OCR solution. I am converting the pdf file into .tif images using ImageMagick and getting those images processed by tesseract OCR.
The OCR solution is not very accurate though (about 80% of words matched).
I tried changing density and geometry of the image created from PDF to get better results from tesseract OCR.
convert -rotate 90 -geometry 10000 -depth 8 -density 800 sample.pdf img_800_10000.tif;
tesseract img_800_10000.tif img_800_10000.tif nobatch letters;
I am not sure what kind of image (density, geometry, monochrome, sharpened edges, etc.) would be best suited for the OCR.
Please suggest the best possible parameters (density, geometry, depth, etc.) for generating images from a PDF file, so that the tesseract accuracy will increase.
I am open to other (non-OCR) solutions as well.
In this case I recommend NOT using ImageMagick for the PDF -> TIFF conversion. Instead, use Ghostscript. Two reasons:
Using Ghostscript directly will give you more control over individual parameters of the conversion.
ImageMagick cannot do that particular conversion itself -- it will call Ghostscript as its 'delegate' anyway, but will not give you the same fine-grained control that your own Ghostscript command will.
Most of the text in the table of your sample PDF is extremely small (I guess, only 4 or 5 pt high). This makes it rather difficult to run a successful OCR unless you increase the resolution considerably.
Ghostscript uses -r72 by default for image format output (such as TIFF). Tesseract works best with r=300 or r=400 -- but only for a font size of 10-12 pt or higher. Therefore, to compensate for the small text size, you should make Ghostscript use a resolution of at least 1200 DPI when it renders the PDF to the image.
Also, you'll have to rotate the image so the text displays in the normal reading direction (not bottom -> top).
This is the command which I would try first:
gs \
-o sample.tif \
-sDEVICE=tiffg4 \
-r1200 \
-dAutoRotatePages=/PageByPage \
sample_rotate-0.pdf
You may need to play with variations of the -r1200 parameter (higher or lower) for best results.
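If you want to try several resolutions without retyping the command, something along these lines might help. It is a sketch only: the switches are the ones from the command above, while the helper names, the candidate resolutions (900, 1200, 1500) and the `_r<dpi>.tif` output naming are my own assumptions.

```python
from pathlib import Path


def gs_tiff_command(pdf_path, tif_path, dpi):
    """Argument list for the Ghostscript PDF -> TIFF G4 command shown above."""
    return [
        "gs",
        "-o", tif_path,
        "-sDEVICE=tiffg4",
        f"-r{dpi}",
        "-dAutoRotatePages=/PageByPage",
        pdf_path,
    ]


def resolution_sweep(pdf_path, resolutions=(900, 1200, 1500)):
    """One command per candidate resolution, each writing its own TIFF.

    The resolutions and the output naming scheme are assumptions made
    for illustration.
    """
    stem = Path(pdf_path).stem
    return [
        gs_tiff_command(pdf_path, f"{stem}_r{dpi}.tif", dpi)
        for dpi in resolutions
    ]


# Print the commands instead of running them.
for cmd in resolution_sweep("sample_rotate-0.pdf"):
    print(" ".join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once Ghostscript is installed; the resulting TIFFs can then be fed to tesseract one by one to compare accuracy.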
Since a comment asked "How to define the geometry of an image when using Ghostscript as we do in convert?", here is an answer:
It does not make sense to define geometry (that is image dimensions) and resolution for a raster image created by Ghostscript at the same time.
Once you convert a vector-based page of a given dimension (such as PDF) into a raster image (such as the TIFF G4 format) at a desired resolution (as done in the other answer), you have already implicitly set the dimensions:
The original PDF dimension of your sample file sample_rotate-0.pdf is 1008x612 points.
At a resolution of 72 DPI (the default Ghostscript uses if not given directly, or -r72 in the Ghostscript command if given directly) the image dimensions will be 1008x612 pixels.
At a resolution of 720 DPI (-r720 in the Ghostscript command) the image dimensions will be 10080x6120 pixels.
At a resolution of 1440 DPI (-r1440 in the Ghostscript command) the image dimensions will be 20160x12240 pixels.
At a resolution of 1200 DPI (-r1200, as in the Ghostscript command of my other answer) the image dimensions will be 16800x10200 pixels.
At a resolution of 1000 DPI (-r1000 in the Ghostscript command) the image dimensions will be 14000x8500 pixels.
At a resolution of 120 DPI (-r120 in the Ghostscript command) the image dimensions will be 1680x1020 pixels.
At a resolution of 100 DPI (-r100 in the Ghostscript command) the image dimensions will be 1400x850 pixels.
If you absolutely insist on specifying the dimension/geometry for the output image on the Ghostscript commandline (rather than the resolution), you can do so by adding -gNNNNxMMMM -dPDFFitPage to the commandline.
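The arithmetic behind the list above is simply points * DPI / 72, since a PDF point is 1/72 inch. A small sketch, assuming the 1008x612 pt page size quoted above; the function names are made up for illustration:

```python
def pixel_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rendered at the given resolution.

    A PDF point is 1/72 inch, so pixels = points * dpi / 72.
    """
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def fixed_size_flags(width_px, height_px):
    """Ghostscript switches that request a fixed pixel geometry instead."""
    return [f"-g{width_px}x{height_px}", "-dPDFFitPage"]


# Page size of sample_rotate-0.pdf as stated above: 1008x612 points.
for dpi in (72, 100, 720, 1200, 1440):
    print(dpi, pixel_size(1008, 612, dpi))
```

Going the other way, `fixed_size_flags(16800, 10200)` gives the two switches that ask for the same pixel size as -r1200 does for this page.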
There you can find decoded content of your file: https://docs.google.com/open?id=0B1YEM-11PerqSHpnb1RQcnJ4cFk
I am absolutely sure that OCR is the best way to read the pdf file, but you can try REGEX-ing the native content. It is going to be the hard and long way.

Wrong colours when converting TIFF image to PNG in ImageMagick

I'm working on a PHP script that automatically converts TIFF images to PNG files.
For that purpose, I use ImageMagick:
$ convert a.tif a.png
It works to some degree; however, the colours are very harsh and differ from the way they appear on my PC. To illustrate the problem, please have a look at the enclosed files. They include:
The Windows Live Foto Gallery output (that's pretty much how I want it to be)
The ImageMagick output (the mess I end up with)
The original TIFF file
Does anyone have an idea whether, and if so how, I can alter the ImageMagick colour interpretation?
Thanks a lot!
Alright,
thanks to ergosys, the problem was easily solved: I needed to apply ICC colour profiles.
The XMP declared ISO 12647-2:2004, which was to be found at http://eci.org.
$ convert -profile ISOcoated_v2_eci.icc -profile eciRGB_v2.icc a.tif c.png
When converting from a CMYK color space to an RGB color space, as you do when going from tiff to png, you have to convert the color spaces along with the image. Try:
convert -colorspace rgb a.tif a.png
I ran this locally and get a better result from this than from the command line in your question, but my color vision sucks, so I can't guarantee that this is what you were after. =] Hope it gets you on the right track, anyway.

Resources