I'm doing pre-processing of images for OCR in Python. I converted the PDF to binary images. The output I get is like this
I want the output to be something like this
Any idea how to go about this?
You have to use the Tesseract library to extract text from a given image.
I am using a Windows system, so I downloaded it from https://sourceforge.net/projects/tesseract-ocr-alt/files/.
Suppose you have installed it at "E:\w\Tesseract-OCR".
Then put your image in the same location. Let's call your image question.png.
Now go to the command prompt and run:
E:\w\Tesseract-OCR>tesseract.exe question.png answer.txt
Here answer.txt is the text file that Tesseract will create (you can use any other name instead of answer.txt), and question.png is your input file.
Once the command has executed successfully, check the output in answer.txt.
In the case of your image, I got the following output:
Investment Type: Customer Owned
System Information
Fire III
Video I]
So in this case, only the text is being recognized correctly.
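The same call can also be scripted from Python via subprocess; here is a minimal sketch, assuming tesseract is on the PATH (the file names are just examples):

```python
import os
import shutil
import subprocess

def tesseract_cmd(image, out_base):
    """Build the Tesseract argv list; Tesseract appends .txt to out_base itself."""
    return ["tesseract", image, out_base]

# Only attempt the call if the binary and the input image actually exist.
if shutil.which("tesseract") and os.path.exists("question.png"):
    subprocess.run(tesseract_cmd("question.png", "answer"), check=True)
```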
I am generating a logfile on one of my servers, storing a lot of data, then sending it to my mail once a month as a PDF file.
The process I am using is to 'cat' a lot of commands to a text file, then convert it and send it.
Are there any Linux programs or an easy way to do something similar and also add an image I have stored on the server to the PDF file?
This answer assumes that you just want to put the image at the end of the PDF.
You could first convert the image to a PDF using ImageMagick like this (it will also work with different file types):
convert image.jpg image.pdf
Then, you can use a tool like stapler or pdftk to combine your generated text PDF and the image.pdf (you can add multiple images):
stapler cat text.pdf image.pdf combined.pdf
pdftk text.pdf image.pdf output combined.pdf
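If you drive this from a script, the two steps above can be chained; here is a sketch assuming ImageMagick's convert and pdftk are installed (all file names are placeholders for your own):

```python
import os
import shutil
import subprocess

def append_image_cmds(text_pdf, image, out_pdf, image_pdf="image.pdf"):
    """Two command lines: convert the image to a one-page PDF, then concatenate."""
    return [
        ["convert", image, image_pdf],
        ["pdftk", text_pdf, image_pdf, "output", out_pdf],
    ]

# Guarded so the sketch is a no-op when the tools or inputs are missing.
if shutil.which("convert") and shutil.which("pdftk") and os.path.exists("text.pdf"):
    for cmd in append_image_cmds("text.pdf", "image.jpg", "combined.pdf"):
        subprocess.run(cmd, check=True)
```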
I am converting SVG to EPS and then EPS to the DXF file format using shell scripting. I convert the EPS file to DXF with the following command:
pstoedit -dt -f dxf:-polyaslines ${epsfile} ${dxffile}
I am facing an issue with converting the text. In the output DXF file, the text is created using polylines, so there is no text property when I open the file in a DXF viewer. I need the text to be created as text in the DXF file. I went through the following link: http://manpages.ubuntu.com/manpages/hardy/man1/pstoedit.1.html, but I didn't find the exact solution I need. Can anyone help me with this?
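One thing worth checking: pstoedit's global -dt option means "draw text", i.e. text is rendered as polygons, which would explain the polylines. Here is a hedged sketch of the same conversion with -dt dropped, so the DXF driver can emit text entities where it supports them (file names are placeholders):

```python
import os
import shutil
import subprocess

def pstoedit_cmd(eps_file, dxf_file):
    # No -dt here: with -dt, pstoedit turns all text into polygon outlines.
    return ["pstoedit", "-f", "dxf:-polyaslines", eps_file, dxf_file]

# Only run when pstoedit and the input file are actually present.
if shutil.which("pstoedit") and os.path.exists("input.eps"):
    subprocess.run(pstoedit_cmd("input.eps", "output.dxf"), check=True)
```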
I have several low-quality PDFs. I would like to use OCR (to be more precise, Ocropus) to get text from them. To do so, I first use ImageMagick, a command-line tool, to convert these PDFs into JPG or PNG images.
However, ImageMagick produces very low-quality images, and Ocropus hardly recognizes anything. I would like to learn the best parameters for handling low-quality PDFs, so that I can provide images of as good a quality as possible to the OCR.
I have found this page, but I do not know where to start.
You can learn about the detailed settings of ImageMagick's "delegates" (the external programs IM uses, such as Ghostscript) by typing
convert -list delegate
(On my system that's a list of 32 different commands.) Now to see which commands are used to convert to PNG, use this:
convert -list delegate | findstr /i png
Ok, this was for Windows. You didn't say which OS you use. [*] If you are on Linux, try this:
convert -list delegate | grep -i png
You'll discover that IM produces PNG only from PS or EPS input. So how does IM get (E)PS from your PDF? Easy:
convert -list delegate | findstr /i PDF
convert -list delegate | grep -i PDF
Ah! It uses Ghostscript to do a PDF => PS conversion, then uses Ghostscript again to do a PS => PNG conversion. This works, but it isn't the most efficient way once you know that Ghostscript can do PDF => PNG in one go. And faster. And in much better quality.
About IM's handling of PDF conversion to images via the Ghostscript delegate you should know two things first and foremost:
By default, if you don't give an extra parameter, Ghostscript will output images with a 72dpi resolution. That's why Karl's answer suggested to add -density 600 which tells Ghostscript to use a 600 dpi resolution for its image output.
The detour of having IM call Ghostscript twice, first for PDF => PS and then for PS => PNG, is a real blunder: you never gain quality in the first step and hardly ever keep it, but very often lose some. Reasons:
PDF can handle transparencies, which PostScript can not.
PDF can embed TrueType fonts, which PostScript cannot, etc.
(Conversion in the opposite direction, PS => PDF, is not that critical.)
That's why I'd suggest you convert your PDFs to PNG (or JPEG) in one go, using Ghostscript directly. And use the most recent version of Ghostscript, 8.71 (soon to be released: 9.01)! Here are example commands:
gswin32c.exe ^
-sDEVICE=pngalpha ^
-o output/page_%03d.png ^
-r600 ^
d:/path/to/your/input.pdf
(This is the command line for Windows. On Linux, use gs instead of gswin32c.exe, and \ instead of ^.) This command expects an output subdirectory to exist, where it will store a separate file for each PDF page. To produce JPEGs of good quality, try:
gs \
-sDEVICE=jpeg \
-o output/page_%03d.jpeg \
-r600 \
-dJPEGQ=95 \
/path/to/your/input.pdf
(Linux command version.) This direct conversion avoids the intermediate PostScript format, which may have lost the TrueType font and transparency information that was in the original PDF file.
[*] D'oh! I missed to see your "linux" tag at first...
-density 600 or so should give you what you need.
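For example, here is a sketch of that call driven from Python, assuming ImageMagick is installed (file names are placeholders; note that -density must precede the input file to affect rasterization):

```python
import os
import shutil
import subprocess

def convert_cmd(pdf_file, png_file, dpi=600):
    # -density goes before the input so Ghostscript rasterizes at that resolution.
    return ["convert", "-density", str(dpi), pdf_file, png_file]

# Guarded so the sketch is a no-op when convert or the input is missing.
if shutil.which("convert") and os.path.exists("input.pdf"):
    subprocess.run(convert_cmd("input.pdf", "page.png"), check=True)
```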
At least two other tools you may want to consider:
pdfimages, which comes with the package poppler-utils, makes it easy to extract the images from a PDF without degrading them.
pdfsandwich, which can give you an OCR'd file by simply running pdfsandwich inputfile.pdf. You may need to tweak the options to get a decent result. See the official page for more info.
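A sketch of the pdfimages call, assuming poppler-utils is installed (the prefix "img" and the file names are placeholders; -png needs a reasonably recent poppler, as older versions emit PPM/PBM):

```python
import os
import shutil
import subprocess

def pdfimages_cmd(pdf_file, prefix):
    # -png writes the extracted images as PNG files named <prefix>-NNN.png.
    return ["pdfimages", "-png", pdf_file, prefix]

# Only run when pdfimages and the input file are actually present.
if shutil.which("pdfimages") and os.path.exists("input.pdf"):
    subprocess.run(pdfimages_cmd("input.pdf", "img"), check=True)
```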
I need a command-line converter from AI or EPS files to the CDR file format.
ImageMagick is very good, specifically the convert program.
Just type:
convert myimage.ai myimage.cdr
convert myimage.eps myimage.cdr
to convert either an .eps or an .ai file to a .cdr.