How to set dpi for Jpeg when converting ghostscript rasterizer


I want to convert a PDF to a JPEG image with a width of 900px at 150 dpi, using the Ghostscript rasterizer.

You can set the size of the image in two ways. First, if you know the size of the PDF media (the MediaBox), which is stored in the PDF file in PostScript units (1/72 of an inch), then a simple calculation will give you the required rendering resolution:
target X resolution = output width in pixels / (Media width / 72)
target Y resolution = output height in pixels / (Media height / 72)
You can then set the resolution using the -r switch, as described in the Ghostscript documentation.
Alternatively you can set the output media size in pixels using the -g switch, and then use the -dPDFFitPage switch to have Ghostscript scale the PDF content so that it fits into the output. Note that this method scales isotropically, that is, the same scale factor is applied to both the x and y directions.
The -g and -dPDFFitPage switches are both described in the Ghostscript documentation.
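As a rough illustration of the two approaches above, here is a minimal Python sketch. The function names, the file names and the 612x792 point (US Letter) page are assumptions made for the example; only the -r, -g, -dPDFFitPage, -sDEVICE and -o switches are Ghostscript's own. It only builds the argument lists and does not run Ghostscript.

```python
def resolution_for_pixels(pixels, media_points):
    """Resolution (dpi) that renders media_points (1/72 inch) to `pixels`."""
    return pixels / (media_points / 72.0)


def gs_args_by_resolution(pdf, out, width_px, media_width_pt):
    # Using one resolution for both axes keeps the page's aspect ratio.
    dpi = resolution_for_pixels(width_px, media_width_pt)
    return ["gs", "-sDEVICE=jpeg", f"-r{dpi:.4f}", "-o", out, pdf]


def gs_args_by_fit(pdf, out, width_px, height_px):
    # Fix the output size in pixels and let Ghostscript scale the page into it.
    return ["gs", "-sDEVICE=jpeg", f"-g{width_px}x{height_px}",
            "-dPDFFitPage", "-o", out, pdf]


if __name__ == "__main__":
    # Assumed example: a US Letter page (612 x 792 points), 900 px wide output.
    print(gs_args_by_resolution("input.pdf", "out.jpg", 900, 612))
    print(gs_args_by_fit("input.pdf", "out.jpg", 900, 1165))
```

Either list could be passed to subprocess.run once the real MediaBox of the file is known. The height of 1165 pixels is just 900 * 792 / 612 rounded for the assumed page.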

Related

Ghostscript : Crop Certain Area?

I am new to ghostscript.
I have a PDF which contains a card. I want to crop that card out.
With my current understanding of the documentation I am only able to convert the PDF to an image, but have had no luck with cropping.
I have looked at every related question, but the answers are not working for me.
This is the code I used in a batch file to convert the PDF to an image:
"C:\Program Files\gs\gs9.50\bin\gswin64c.exe" -sDEVICE=png16m -r300 -o c:\users\jen\desktop\pdf.png -f "c:\users\jen\desktop\pdf.pdf"
pause
Now I don't know how to crop with it as well.
I want to crop at a certain position, like: Left:28 Top:524 Width:492.3 Height:161
EDIT
I will be using this in firebase functions.
Example PDF file: THE_PDF_TO_CROP. I want to cut out the blue area of the PDF as an image.

You need to set several parameters. First, you need to specify the width and height of the output bitmap. You can use either -dDEVICEHEIGHTPOINTS and -dDEVICEWIDTHPOINTS, or alternatively you can specify the output size in pixels using -g<x>x<y>, where <x> and <y> are the number of pixels in the x and y directions. That will vary depending on the resolution, and you can't use fractional pixels.
If you use -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS then you also need to set -dFIXEDMEDIA to tell the interpreter not to use the media size from the PDF file instead.
So that should create an output bitmap of the correct size. If you try rendering your file using just that, you will see that it renders just a portion of the page from the bottom left. So now you need to shift the content around so that the portion you want lies at the bottom left of the media. You can do that by using the PageOffset page device parameter, which is set with the PostScript setpagedevice operator.
You haven't given any numbers, nor supplied an example file, so let's say (for the sake of example) that you want to render a 1 inch by 2 inch portion of the document. Let's further say that the part you want rendered starts 2.5 inches from the left edge, and 1.5 inches from the bottom edge.
A suitable command line would be:
gs -dDEVICEWIDTHPOINTS=72 -dDEVICEHEIGHTPOINTS=144 -dFIXEDMEDIA -r300 -sDEVICE=png16m -o out.png -c "<</PageOffset [-180 -108]>> setpagedevice" -f input.pdf
Note that PDF (and PostScript) units are 1/72 inch so 72 = 1 inch, 144 = 2 inches. You need to shift the origin of the page down and left, which is why the values for PageOffset are negative.
If that doesn't work for you I'll need to see your PDF file and you'll need to tell me which version of Ghostscript you are using.
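The arithmetic behind that command line can be sketched in Python. This is only an illustration under stated assumptions: the function names are made up, the output size is given with -g in pixels rather than -dDEVICEWIDTHPOINTS/-dDEVICEHEIGHTPOINTS (so fractional point sizes get rounded to whole pixels), and the Left/Top/Width/Height numbers from the question are assumed to be in points on a page whose height (792 points in the test below) you would have to read from the actual PDF.

```python
def bottom_from_top(top_pt, height_pt, page_height_pt):
    """Convert a top-left based 'Top' value to a distance from the bottom."""
    return page_height_pt - top_pt - height_pt


def crop_args(pdf, out, left_pt, bottom_pt, width_pt, height_pt, dpi=300):
    """Build a Ghostscript command that renders only the given rectangle.

    All lengths are in points (1/72 inch), measured from the bottom-left
    corner of the page.
    """
    # Pixel size of the output bitmap; -g cannot take fractional pixels.
    width_px = round(width_pt * dpi / 72)
    height_px = round(height_pt * dpi / 72)
    # Shift the page down and left so the wanted area sits at the origin.
    offset = f"<</PageOffset [{-left_pt:g} {-bottom_pt:g}]>> setpagedevice"
    return ["gs", f"-g{width_px}x{height_px}", "-dFIXEDMEDIA", f"-r{dpi}",
            "-sDEVICE=png16m", "-o", out, "-c", offset, "-f", pdf]


if __name__ == "__main__":
    # The 1 x 2 inch example: 2.5 inches from the left, 1.5 from the bottom.
    print(crop_args("input.pdf", "out.png", 180, 108, 72, 144))
```

For a rectangle given from the top-left corner, bottom_from_top converts the Top value first, since PageOffset works from the bottom-left origin.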

jpeg binary file header with inverted width & height

I am computing the dimensions of a JPEG file by parsing it as a binary file with Node.js.
But the computed width x height comes out as 4032x3024, when the actual size is 3024x4032.
Its 0xFFC0 header block is as follows:
ffc0 0011 080b d00f c003
According to JPEG format, sizes should be :
height = 0bd0 (i.e:3024)
width = 0fc0 (i.e:4032)
(Using Imagemagick identify program confirms this calculation:
identify HJ7XFd9le.jpg
HJ7XFd9le.jpg JPEG 4032x3024 4032x3024+0+0 8-bit sRGB 380KB 0.000u 0:00.000)
But when I view the image in my Mac viewer, the inspector indicates a size of 3024x4032!
How can I compute the correct size programmatically by parsing the file as a binary file?
Thanks!

Ah - it's not that the size is inverted, it's that you need to respect the EXIF orientation (rotated 90/270). The sensor has a fixed orientation, and if you capture the image with the camera rotated, the camera software may signal this by storing a non-default orientation value (anything other than 1) in the APP1 EXIF data. – BitBank
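A minimal sketch of that idea follows, in Python rather than the Node.js of the question. The function names are hypothetical, and the parser is deliberately simplified: it assumes well-formed segments, reads the Orientation tag (0x0112) only from IFD0 of an APP1 "Exif" block, and swaps width and height for orientation values 5 to 8, which are the 90/270 degree cases.

```python
import struct

# Start-of-frame markers (all SOFn except DHT 0xC4, JPG 0xC8 and DAC 0xCC).
SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
               0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}


def exif_orientation(tiff):
    """Return the Orientation tag (0x0112) from a TIFF/EXIF block, or 1."""
    endian = "<" if tiff[:2] == b"II" else ">"
    (ifd_offset,) = struct.unpack_from(endian + "I", tiff, 4)
    (count,) = struct.unpack_from(endian + "H", tiff, ifd_offset)
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i
        (tag,) = struct.unpack_from(endian + "H", tiff, entry)
        if tag == 0x0112:
            # A SHORT value is stored in the first 2 bytes of the value field.
            (value,) = struct.unpack_from(endian + "H", tiff, entry + 8)
            return value
    return 1


def jpeg_display_size(data):
    """Return (width, height) after applying the EXIF orientation."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file")
    orientation, pos = 1, 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker")
        marker = data[pos + 1]
        (length,) = struct.unpack_from(">H", data, pos + 2)
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            orientation = exif_orientation(payload[6:])
        elif marker in SOF_MARKERS:
            # Payload layout: precision (1 byte), height (2), width (2), ...
            height, width = struct.unpack_from(">HH", payload, 1)
            if orientation in (5, 6, 7, 8):  # 90/270 degree rotations
                width, height = height, width
            return width, height
        pos += 2 + length
    raise ValueError("no SOF marker found")
```

Called on the bytes of the whole file, this returns the stored 4032x3024 when there is no orientation tag, and the swapped 3024x4032 when the tag says the image is rotated by 90 or 270 degrees.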

Setting the size of a SVG file (via Batik)

If I render to a bitmap then the bitmap has a specific number of pixels and a DPI. That combination makes it easy to draw a square that is 1" x 1" - I render lines for each side that are DPI pixels long.
When I create an SVG, I think it should still be possible to set it up this way, where I set the units per inch and also the size, in those units, of the object as a whole. Yes, you can zoom an SVG file as it's all vectors, but it should still have a 100% zoom size to render to.
In my case I am using EMUs for my units, so 914400 units/inch. So question #1 is: how do I set the scaling using Batik? For a bitmap it's:
AffineTransform scaleToEmus = AffineTransform.getScaleInstance(dpi / (float) DrawingSurface.EPI, dpi / (float) DrawingSurface.EPI);
graphics.transform(scaleToEmus);
But there is no dpi equivalent for SVG.
And then for a given width & height that is in EMUs, do I set the size (or maximum extent) of the image using:
svgGraphics.setSVGCanvasSize(new Dimension(width, height));
I think I'm not fully understanding SVG, or at least Batik, as I don't see how to set the units to render at for a 100% zoom.

How to prevent the white border after convert with ghostscript

I am trying to convert an .eps file to .png with Ghostscript.
The .eps file has a resolution of 1000x1000 px, but the output file has big white borders on the left and bottom sides.
gs -dNOPAUSE -dBATCH -r1000x1000 -q -sDEVICE=png256 -dDEVICEWIDTHPOINTS=880 -dDEVICEHEIGHTPOINTS=720 -sOutputFile=infile.png infile.eps

EPS files don't have a resolution, so yours cannot possibly have a resolution of 1000x1000, and especially not 1000x1000 pixels, because that's not a resolution, it's a size.
I very much doubt you want to set the resolution to 1000 dpi and at the same time set a media size of 880x720 points. That will result in a .png of roughly 12222x10000 pixels. (There are 72 points to the inch, which means you are setting a media of about 12.2x10 inches at 1000 dots per inch.)
The correct way to handle an EPS file (which is slightly but importantly different to a PostScript file) is to arrange the scaling yourself.
If the dimensions of the resulting image are not important to you, then you can use -dEPSCrop which will produce an image where the dimensions of the media are taken from the comments in the EPS file.
If you require that the image has specific dimensions then you should use -g to set the media size (in pixels), set -dFIXEDMEDIA and set -dEPSFitPage which will scale the EPS to fit the dimensions of the media.
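To make the size arithmetic and the two options concrete, here is a small Python sketch. The helper names and file names are assumptions for illustration; the switches are the ones named above plus those from the question's command line. It only builds the argument list.

```python
def pixels(points, dpi):
    """Approximate pixel count for a length in points (1/72 inch) at `dpi`."""
    return round(points / 72 * dpi)


def eps_to_png_args(eps, out, size_px=None):
    """Build a Ghostscript command for EPS to PNG conversion.

    With size_px=None the media size comes from the EPS comments (-dEPSCrop);
    otherwise the EPS is scaled to fit the given (width, height) in pixels.
    """
    args = ["gs", "-dNOPAUSE", "-dBATCH", "-q", "-sDEVICE=png256"]
    if size_px is None:
        args.append("-dEPSCrop")
    else:
        width, height = size_px
        args += [f"-g{width}x{height}", "-dFIXEDMEDIA", "-dEPSFitPage"]
    return args + [f"-sOutputFile={out}", eps]


if __name__ == "__main__":
    # The question's settings: 880 x 720 points rendered at 1000 dpi.
    print(pixels(880, 1000), pixels(720, 1000))
    print(eps_to_png_args("infile.eps", "outfile.png", (1000, 1000)))
```

The first print shows why the original command produced such a large image with empty margins: the media is far bigger than the EPS content.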

I found the solution:
-dEPSCrop

Not sure what is causing that without seeing the eps file, but you can trim it off with ImageMagick like this:
convert SomeFile.png -trim result.png
ImageMagick is installed on most Linux distros and is available for OSX, and Windows.

Extracting Text from a PDF file with embedded font

I have a PDF file containing some tabular data.
http://dl.dropbox.com/u/44235928/sample_rotate-0.pdf
I have to extract the tabular data from it. I have tried the following, with no success:
Selecting the text and pasting it into Notepad or an Excel sheet. (I get junk characters.)
Using "Save as text" from Acrobat Reader. It also gives junk characters and not the actual text.
Using the Apache PDFBox command line utility to extract text from the PDF. It also gives junk characters instead of the real text.
Finally, I am trying an OCR solution. I am converting the PDF file into .tif images using ImageMagick and having those images processed by tesseract OCR.
The OCR solution is not very accurate though (about 80% of words matched).
I tried changing density and geometry of the image created from PDF to get better results from tesseract OCR.
convert -rotate 90 -geometry 10000 -depth 8 -density 800 sample.pdf img_800_10000.tif;
tesseract img_800_10000.tif img_800_10000.tif nobatch letters;
I am not sure what kind of image (density, geometry, monochrome, sharpened boundaries etc.) would be best suited for the OCR.
Please suggest the best possible parameters (density, geometry, depth etc.) for generating images from a PDF file, so that the tesseract accuracy will increase.
I am open to other (non-OCR) solutions as well.

In this case I recommend NOT using ImageMagick for the PDF -> TIFF conversion. Instead, use Ghostscript. Two reasons:
Using Ghostscript directly will give you more control over individual parameters of the conversion.
ImageMagick cannot do that particular conversion itself -- it will call Ghostscript as its 'delegate' anyway, but will not allow you to give all the same fine-grained control that your own Ghostscript command will give you.
Most of the text in the table of your sample PDF is extremely small (I guess only 4 or 5 pt high). This makes it rather difficult to run a successful OCR unless you increase the resolution considerably.
Ghostscript uses -r72 by default for image format output (such as TIFF). Tesseract works best with r=300 or r=400, but only for font sizes of 10-12 pt or higher. Therefore, to compensate for the small text size, you should have Ghostscript use a resolution of at least 1200 DPI when it renders the PDF to the image.
Also, you'll have to rotate the image so the text displays in the normal reading direction (not bottom -> top).
This is the command which I would try first:
gs \
-o sample.tif \
-sDEVICE=tiffg4 \
-r1200 \
-dAutoRotatePages=/PageByPage \
sample_rotate-0.pdf
You may need to play with variations of the -r1200 parameter (higher or lower) for best results.
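One way to pick starting values for -r is to scale the resolution in inverse proportion to the text size. That proportional rule is my own assumption, a rule of thumb rather than anything documented for Tesseract, and the function names below are made up for the sketch. With the figures from above (400 DPI being good for 12 pt text, and table text of about 4 pt) it lands on 1200 DPI.

```python
def scaled_dpi(text_pt, ref_dpi, ref_pt):
    """Resolution at which text_pt text has as many pixels as ref_pt at ref_dpi.

    Assumes OCR quality depends only on glyph height in pixels.
    """
    return ref_dpi * ref_pt / text_pt


def tiff_args(pdf, out, dpi):
    # Same device and switches as the command above, with a variable -r.
    return ["gs", "-o", out, "-sDEVICE=tiffg4", f"-r{dpi:g}", pdf]


if __name__ == "__main__":
    for text_pt in (4, 5):
        dpi = scaled_dpi(text_pt, ref_dpi=400, ref_pt=12)
        print(text_pt, "pt ->", tiff_args("sample_rotate-0.pdf", "sample.tif", dpi))
```

Trying a few of these values and comparing the OCR output is still necessary; the formula only narrows down where to start.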

Since a comment asked "How to define the geometry of an image when using Ghostscript as we do in convert?", here is an answer:
It does not make sense to define geometry (that is image dimensions) and resolution for a raster image created by Ghostscript at the same time.
Once you convert a vector-based page of a given dimension (such as a PDF page) into a raster image (such as the TIFF G4 format) at a desired resolution (as done in the other answer), you have already implicitly set the image dimensions as well:
The original PDF dimension of your sample file sample_rotate-0.pdf is 1008x612 points.
At a resolution of 72 DPI (the Ghostscript default when no -r is given, or -r72 if given explicitly) the image dimensions will be 1008x612 pixels.
At a resolution of 720 DPI (-r720 in the Ghostscript command) the image dimensions will be 10080x6120 pixels.
At a resolution of 1440 DPI (-r1440 in the Ghostscript command) the image dimensions will be 20160x12240 pixels.
At a resolution of 1200 DPI (-r1200 in the Ghostscript command of my other answer) the image dimensions will be 16800x10200 pixels.
At a resolution of 1000 DPI (-r1000 in the Ghostscript command) the image dimensions will be 14000x8500 pixels.
At a resolution of 120 DPI (-r120 in the Ghostscript command) the image dimensions will be 1680x1020 pixels.
At a resolution of 100 DPI (-r100 in the Ghostscript command) the image dimensions will be 1400x850 pixels.
If you absolutely insist on specifying the dimensions/geometry of the output image on the Ghostscript command line (rather than the resolution), you can do so by adding -gNNNNxMMMM -dPDFFitPage to the command line.
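The figures in that list all follow from pixels = points * DPI / 72. A short Python sketch (the function name is just for illustration) reproduces them for the 1008x612 point page:

```python
def pixel_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page given in points (1/72 inch) at `dpi`."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


if __name__ == "__main__":
    for dpi in (72, 100, 120, 720, 1000, 1200, 1440):
        width, height = pixel_size(1008, 612, dpi)
        print(f"-r{dpi}: {width}x{height} pixels")
```

Going the other way, from a wanted pixel width back to a resolution, is the same formula rearranged: DPI = pixels * 72 / points.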

Here you can find the decoded content of your file: https://docs.google.com/open?id=0B1YEM-11PerqSHpnb1RQcnJ4cFk
I am absolutely sure that OCR is the best way to read this PDF file, but you can try regex-ing the native content. It is going to be the long, hard way.
