tesseract 3.03 - input jpg file - linux

I want to create a PDF with selectable/searchable text.
I have source.png, which has gone through some pre-processing before OCR, and view.jpg, which is a compressed version of source.png meant to reduce the size of the output PDF file.
How do I specify view.jpg in the command?
tesseract -l eng source.png out pdf

I'm not sure whether you can specify view.jpg in the command. The out.pdf already contains some sort of compressed version of source.png.

Extract to same directory, media from different .docx files converted with Pandoc

Goal
I'm converting some .docx files to .md with pandoc. These .docx files contain images that, after conversion, are placed in a directory (markdown-repository/media/), and each image's URL is referenced in the resulting .md file.
So the goal is to have the resulting .md files with links pointing to the proper images stored in markdown-repository/media/. For this to happen, all images under markdown-repository/media/ need to have a unique name.
The problem
With each conversion, the images were being overwritten by those of the latest conversion, because pandoc doesn't track image names across runs: it creates image1.png, image2.png, image3.png, etc. for every converted file.
My suggestion
Create a folder to store the media of each file, named after the converted file.
Generate a random, unique name for each image.
Replace the links in the .md file with the generated image names.
Example:
fileA.docx
fileB.docx
Step 1
Convert the .docx to .md:
pandoc --extract-media=/result-media/output-media-for-fileA/ -f docx -t markdown fileA.docx -o fileA.md
pandoc --extract-media=/result-media/output-media-for-fileB/ -f docx -t markdown fileB.docx -o fileB.md
At this point we will have under /result-media/output-media-for-fileA/ 3 images
image1.png
image2.png
image3.png
and in the fileA.md these 3 links pointing to those images:
![](/result-media/output-media-for-fileA/image1.png)
![](/result-media/output-media-for-fileA/image2.png)
![](/result-media/output-media-for-fileA/image3.png)
Note: the same applies to fileB (omitted here for simplicity; just replace fileA with fileB in the links).
Step 2
Then generate unique file names for the images under /result-media/output-media-for-fileA/ and /result-media/output-media-for-fileB/, and somehow save a log of the renames so that the old image names can then be replaced with the new names inside fileA.md and fileB.md.
Note: This step is where I'm having the most difficulty.
Step 3
Then I could just move all the uniquely named images to my main folder, markdown-repository/media/.
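Steps 2 and 3 could be sketched as a small shell function. This is only a sketch under assumptions: the name uniquify_media is made up, it prefixes each image with the base name of the Markdown file instead of generating a random name (the prefix is enough to avoid collisions between documents), and it assumes that the links in the .md file contain exactly the image paths that find reports and that those paths contain no | or & characters.

```shell
# Hypothetical helper: rename every extracted image to <document>-<image>,
# move it into a shared media folder and rewrite the links in the .md file.
# Usage: uniquify_media <markdown-file> <extracted-media-dir> <shared-media-dir>
uniquify_media() {
    local md="$1" src="${2%/}" dest="${3%/}"
    local prefix img new
    prefix="$(basename "$md" .md)"
    mkdir -p "$dest"
    # Search recursively, since pandoc may nest the images in a subfolder.
    find "$src" -type f | while IFS= read -r img; do
        new="${prefix}-$(basename "$img")"
        # Replace the old image path with the new one in the Markdown file.
        sed "s|${img}|${dest}/${new}|g" "$md" > "$md.tmp" && mv "$md.tmp" "$md"
        # Move the image itself under its new, unique name.
        mv "$img" "$dest/$new"
    done
}
```

For the example above this would be called as uniquify_media fileA.md /result-media/output-media-for-fileA markdown-repository/media, and then again for fileB.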
Resources
This problem was already raised on the Pandoc forum, but it seems that Pandoc doesn't have any feature to handle it, so I believe we can work around it with some Linux shell scripting.
https://pandoc.org/MANUAL.html
Get-ChildItem . -Filter *.docx |
Foreach-Object {
pandoc --from docx --to markdown --extract-media= --wrap=none $_ -o $_.Name.Replace('.docx', '.md')
}
This worked for me:
Get-ChildItem . -Recurse -Filter *.docx |
Foreach-Object { pandoc --from docx --to markdown --extract-media=$($_.DirectoryName) --wrap=none $_.FullName -o $_.FullName.Replace('.docx', '.md')}
The script converts all docx files to md and keeps the folder structure. What was missing in the --extract-media option was the directory in which to create the .\media directory. The $_.DirectoryName needs to be expanded, $($_.DirectoryName); otherwise, PowerShell will read it as text rather than as a variable, which leads to unexpected results.
In regards to the --wrap parameter, from pandoc:
--wrap=auto|none|preserve
Determine how text is wrapped in the output (the source code, not the rendered version). With auto (the default), pandoc will attempt to wrap lines to the column width specified by --columns (default 72). With none, pandoc will not wrap lines at all. With preserve, pandoc will attempt to preserve the wrapping from the source document (that is, where there are nonsemantic newlines in the source, there will be nonsemantic newlines in the output as well). Automatic wrapping does not currently work in HTML output. In ipynb output, this option affects wrapping of the contents of markdown cells.
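Since the question asks about Linux shell scripting, the PowerShell loop above might translate to something like the following. This is a sketch, not a tested equivalent: the function name convert_docx_tree is made up, and it assumes pandoc is on the PATH and that file names contain no newlines.

```shell
# Hypothetical helper: convert every .docx below a directory to Markdown,
# writing each .md (and its extracted media) next to the source document.
# Usage: convert_docx_tree [root-directory]
convert_docx_tree() {
    local root="${1:-.}" docx
    find "$root" -type f -name '*.docx' | while IFS= read -r docx; do
        # --extract-media gets the document's own directory, and the output
        # keeps the same base name with an .md extension.
        pandoc --from docx --to markdown \
            --extract-media="$(dirname "$docx")" --wrap=none \
            "$docx" -o "${docx%.docx}.md"
    done
}
```

Called without arguments, it starts from the current directory, like Get-ChildItem . -Recurse.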

How to convert pptx files to jpg or png (for each slide) on linux?

I want to convert a PowerPoint presentation to multiple images. I already installed LibreOffice on my server, and converting docx to pdf is no problem. However, pptx to pdf conversion does not work. I used the following command line:
libreoffice --headless --convert-to pdf filename.pptx
Is there a way to convert pptx to png files directly, or do I have to convert it to pdf first and then use Ghostscript or something?
And what about the quality settings? Is there a way to choose the resolution of the resulting images?
Thanks in advance!
EDIT:
According to this link I was able to convert a pdf to images with the simple command line:
convert <filename>.pdf <filename>.jpg
(I guess you need LibreOffice and ImageMagick for it but not sure about it - worked on my server)
But there are still problems with the pptx-to-pdf conversion.
Thanks to googling and Sebastian Heyn's help I was able to create some high quality images with this line:
convert -density 400 my_filename.pdf -resize 2000x1500 my_filename%d.jpg
Please be patient after running it: you can still type something into the Unix console, but it is still processing. Just wait a few minutes and the jpg files will be created.
For further information about the options check out this link
P.S.: The aspect ratio of a pptx file doesn't seem to be exactly 4:3 because the resulting image size is 1950x1500
After installing unoconv and LibreOffice you can use:
unoconv --export Quality=100 filename.pptx filename.pdf
to convert your presentation to a pdf. For further options look here.
Afterwards you can - as already said above - use:
convert -density 400 my_filename.pdf -resize 2000x1500 my_filename%d.jpg
to receive the images.
Conversion of PPTX to PNG/JPG
This solution requires LibreOffice (soffice) and Ghostscript (gs).
sudo apt install libreoffice ghostscript
Then two steps:
PPTX -> PDF
soffice --headless --convert-to pdf prezentacja.pptx
PDF -> PNG/JPG
gs -sDEVICE=pngalpha -o slajd-%02d.png -r96 prezentacja.pdf
-o slajd-%02d.png - output file name; %02d is the slide number, two digits
-r96 - resolution:
96 -> 1280x720
144 -> 1920x1080
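The two -r values above follow from simple arithmetic: a PDF page is measured in points (72 per inch), so the pixel width is the page width in points times the resolution, divided by 72. Below is a minimal sketch of the reverse calculation. The helper name dpi_for_width is made up, and the 960 pt page width (a 13.33 in widescreen slide) is an assumption inferred from the 96 -> 1280 figure above, so check the actual page size of your PDF.

```shell
# Hypothetical helper: resolution (for gs -r) needed to render a page of the
# given width in points at the given width in pixels, rounded to nearest.
# Usage: dpi_for_width <target-width-px> <page-width-pt>
dpi_for_width() {
    # pixels = points * dpi / 72  =>  dpi = pixels * 72 / points
    echo $(( ($1 * 72 + $2 / 2) / $2 ))
}

dpi_for_width 1280 960   # prints 96
dpi_for_width 1920 960   # prints 144
```

The result can be passed straight to Ghostscript, for example gs -sDEVICE=pngalpha -o slajd-%02d.png -r"$(dpi_for_width 1920 960)" prezentacja.pdf.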
Not sure about LibreOffice, but afaik it's the only program that can deal with pptx files.
I found this http://ask.libreoffice.org/en/question/23851/converting-pptx-to-pdf-issue/
If you have pdfs you can use imagemagick to output any quality pictures

How do we merge 2 pages in a pdf file on linux

I have a PDF file of 10 pages, and I want to merge every two pages of it into a single page, like 1,2 -> 1; 3,4 -> 2; and so on. I learnt about Ghostscript, but its tools seem to be for compressing a PDF. There are also utilities to merge two or more PDFs into one, but unfortunately I could not find any that merge pages within the same PDF. Kindly help!
You can do this with cpdf.
cpdf -twoup in.pdf -o out.pdf
Installing pdfjam adds a command called pdfjoin that can be used to join multiple PDF files into one. If your distribution doesn't come with pdfjam, you could also try pdftk.
pdfunite is part of the poppler package
pdfunite 01.pdf 02.pdf 03.pdf out.pdf
You can do this with pdftk
pdftk test.pdf test1.pdf cat output output.pdf
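Note that pdfjoin, pdfunite and pdftk cat concatenate whole files. For placing two pages side by side on one sheet, as the question asks, pdfjam's --nup option is a possible alternative to cpdf -twoup. The wrapper below is a minimal sketch: the function name two_up is made up, and it assumes pdfjam is installed.

```shell
# Hypothetical helper: put every two consecutive pages of a PDF side by side
# on one landscape page (1,2 -> 1; 3,4 -> 2; ...).
# Usage: two_up <input.pdf> <output.pdf>
two_up() {
    # 2x1 = two columns, one row per output page
    pdfjam --nup 2x1 --landscape --outfile "$2" "$1"
}
```

For the 10-page file from the question, two_up in.pdf out.pdf should give a 5-page out.pdf.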

Insert PDF images into text, from pdftotext and pdfimages?

I was able to install the pdftotext utility (comes with Linux, I guess) to convert PDFs into text, and to extract the images, on a Mac:
# install poppler, xpdf, and imagemagick
brew install imagemagick
brew install poppler # not sure if this worked, had to install `xpdf` from online .dmg
pdftotext sample.pdf output.txt
pdfimages sample.pdf pdf-images
# then convert .ppm to .jpg
# one at a time:
# convert pdf-images-001.ppm pdf-images-001.jpg
# batch:
mogrify -format jpg *.ppm
So now I have an output.txt with the (impressively well formatted) text from the PDF, and a bunch of images which I had to convert from .ppm to .jpg with ImageMagick.
Question is, is there any way to now insert references to these images in the right places in the output.txt document? Or, is there a way to combine those two commands so it extracts both text and images and creates links in the text to the images, all at once? Wondering if I have to manually write the parsing code to insert images into the text myself.
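One possible approach, sketched below, works at page granularity only: pdftotext separates pages with a form feed character, and pdfimages -list reports the page number and image number of every image, so the references can be appended at the end of the page each image belongs to. The function name insert_image_refs and the [image: ...] marker format are made up for illustration, and the sketch assumes the images were converted to .jpg as above and that the listing was saved first with pdfimages -list sample.pdf > list.txt.

```shell
# Hypothetical helper: append "[image: <prefix>-NNN.jpg]" lines to the end of
# the page each image appears on.
# Usage: insert_image_refs <pdftotext-output> <saved pdfimages -list output> <image-prefix>
insert_image_refs() {
    local text="$1" list="$2" prefix="$3"
    awk -v prefix="$prefix" '
        # First file: the pdfimages -list table. Skip its two header lines,
        # then collect the image references per page ($1 = page, $2 = image number).
        NR == FNR {
            if (FNR > 2)
                refs[$1] = refs[$1] sprintf("[image: %s-%03d.jpg]\n", prefix, $2)
            next
        }
        # Second file: read with form feed as record separator, so one record
        # is one page and FNR is the page number.
        {
            printf "%s", $0
            if (FNR in refs) printf "\n%s", refs[FNR]
            printf "\f"
        }
    ' "$list" RS='\f' "$text"
}
```

Usage would be insert_image_refs output.txt list.txt pdf-images > output-with-images.txt. Placing a reference at the exact position inside a page would need layout coordinates, which this sketch does not attempt.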

Repair apparently damaged pdf and reduce file size

I have a PDF file (4.6MB) which was made by combining 6 different PDFs (containing both text and bitmap graphics) using pdftk in Ubuntu 12.04. I wish to compress this file to something close to 2MB without affecting its quality.
I have tried pdftk's "compress" option (it couldn't get the file down to 2 MB) and also tried converting it to PS first and then back to PDF, which gives the following warning:
****Warning: considering '0000000000 XXXXX n' as a free entry.
and then hangs. qpdf also failed saying that the file is damaged.
Could someone help me out?
What result does Ghostscript give you? Try this command:
gs \
-o output.pdf \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/screen \
input.pdf
Does this PDF file contain confidential information? If it has no confidential data, it would be interesting to see it.
Anyway, in many cases where qpdf fails, Multivalent works.
You can try its Compress tool (it also attempts to repair the PDF file).
Multivalent
https://rg.to/file/c6bd7f31bf8885bcaa69b50ffab7e355/Multivalent20060102.jar.html
(the latest free version with the tools included; the current version no longer contains them)
java -cp path....to/Multivalent.jar tool.pdf.Compress file.pdf
This works for me to repair the damaged PDF
sudo apt-get install mupdf-tools
mutool clean input.pdf output.pdf
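The repair and compression steps from the answers above could be chained as sketched below. The function name repair_and_shrink is made up. The default preset is /ebook rather than the /screen used earlier, on the assumption that heavier downsampling would hurt quality too much; whether either preset reaches 2 MB depends on the file.

```shell
# Hypothetical helper: repair a PDF with mutool, recompress it with
# Ghostscript, then print the size before and after.
# Usage: repair_and_shrink <input.pdf> <output.pdf> [screen|ebook|printer|prepress]
repair_and_shrink() {
    local src="$1" out="$2" preset="${3:-ebook}" tmp status
    tmp="$(mktemp)" || return 1
    # 1. Let mutool rebuild the damaged file structure.
    mutool clean "$src" "$tmp" || { rm -f "$tmp"; return 1; }
    # 2. Rewrite the repaired file with the chosen compression preset.
    gs -q -o "$out" -sDEVICE=pdfwrite -dPDFSETTINGS="/$preset" "$tmp"
    status=$?
    rm -f "$tmp"
    [ "$status" -eq 0 ] || return "$status"
    # 3. Report both sizes so the result can be checked against the target.
    printf '%s: %s bytes -> %s: %s bytes\n' \
        "$src" "$(( $(wc -c < "$src") ))" "$out" "$(( $(wc -c < "$out") ))"
}
```

For example, repair_and_shrink input.pdf output.pdf screen uses the same preset as the Ghostscript command above.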
