I have a PDF file of 10 pages, and I want to merge every two pages of it into a single page, like 1,2->1 ; 3,4->2 ; and so on. I learned about Ghostscript, but those are tools for compressing the PDF; there are also utilities to merge two or more PDFs together into one, but unfortunately I could not find any to merge pages within the same PDF. Kindly help!
You can do this with cpdf:
cpdf -twoup in.pdf -o out.pdf
Installing pdfjam adds a command called pdfjoin that can be used to join multiple PDF files into one. If your distribution doesn't come with pdfjam, you could also try pdftk.
pdfunite is part of the poppler package (usually shipped as poppler-utils):
pdfunite 01.pdf 02.pdf 03.pdf out.pdf
You can join several PDFs into one with pdftk:
pdftk test.pdf test1.pdf cat output output.pdf
Goal
I'm converting some .docx files to .md with pandoc. These .docx files have images that, after conversion, are placed in a directory (markdown-repository/media/), and their URLs are referenced in the resulting .md file.
So the goal is to have the resulting .md files with links pointing to the proper images stored in markdown-repository/media/. For this to happen, all images under markdown-repository/media/ need to have unique names.
The problem
For each conversion, the images were overwritten by the last conversion, because pandoc doesn't track image names across runs: it creates image1.png, image2.png, image3.png, etc. for each converted file.
My suggestion
1. Create a folder to store the media for each file; this folder would be named after the converted file.
2. Generate a random, unique name for each image.
3. Replace the links in the .md file with the generated image names.
Example:
fileA.docx
fileB.docx
Step 1
Convert the .docx to .md:
pandoc --extract-media=/result-media/output-media-for-fileA/ -f docx -t markdown fileA.docx -o fileA.md
pandoc --extract-media=/result-media/output-media-for-fileB/ -f docx -t markdown fileB.docx -o fileB.md
At this point we will have 3 images under /result-media/output-media-for-fileA/:
image1.png
image2.png
image3.png
and in the fileA.md these 3 links pointing to those images:
![](/result-media/output-media-for-fileA/image1.png)
![](/result-media/output-media-for-fileA/image2.png)
![](/result-media/output-media-for-fileA/image3.png)
Note: the same applies to fileB (I won't repeat it here, to keep things simple; just replace fileA with fileB in the links).
Step 2
Then generate unique file names for the images under /result-media/output-media-for-fileA/ and /result-media/output-media-for-fileB/, and somehow save "the logs" so that we can replace the old image names with the new ones inside fileA.md and fileB.md.
Note: This step is where I'm having the most difficulty.
Step 3
Then I could just move all the uniquely named images to my main folder markdown-repository/media/.
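The three steps above can be sketched in POSIX shell. This is a minimal sketch, not a tested solution: the file names fileA/fileB and the directory layout are taken from the example, and relative paths are assumed where the example used absolute ones. Hashing each image's contents with md5sum yields a name that is unique per image, so no separate name-mapping log is needed; each link is rewritten with sed as soon as its image is renamed:

```shell
#!/bin/sh
# Sketch of Steps 2 and 3, using the example layout above (relative paths).
mkdir -p markdown-repository/media

for name in fileA fileB; do
    dir="result-media/output-media-for-$name"
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue
        hash=$(md5sum "$img" | cut -d' ' -f1)   # content-based unique name
        mv "$img" "markdown-repository/media/$hash.png"
        # point the link in the converted .md file at the new location
        sed -i "s|$dir/$(basename "$img")|markdown-repository/media/$hash.png|g" "$name.md"
    done
done
```

A side effect of content hashing is that identical images extracted from different documents collapse into one file, which is harmless here since the links are rewritten to match.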
Resources
This problem was already asked on the Pandoc forum, but it seems that Pandoc doesn't have any feature to handle this, so I believe that with the help of Linux shell scripting we can work around it.
https://pandoc.org/MANUAL.html
Get-ChildItem . -Filter *.docx |
Foreach-Object {
pandoc --from docx --to markdown --extract-media= --wrap=none $_ -o $_.Name.Replace('.docx', '.md')
}
This worked for me:
Get-ChildItem . -Recurse -Filter *.docx |
Foreach-Object { pandoc --from docx --to markdown --extract-media=$($_.DirectoryName) --wrap=none $_.FullName -o $_.FullName.Replace('.docx', '.md')}
The script converts all .docx files to .md and keeps the folder structure. What was missing in the --extract-media option was the directory location in which to create the .\media directory. The $_.DirectoryName needs to be expanded as $($_.DirectoryName); otherwise, PowerShell will read it as text rather than as a variable, which leads to unexpected results.
In regards to the --wrap parameter, from pandoc:
--wrap=auto|none|preserve
Determine how text is wrapped in the output (the source code, not the rendered version). With auto (the default), pandoc will attempt to wrap lines to the column width specified by --columns (default 72). With none, pandoc will not wrap lines at all. With preserve, pandoc will attempt to preserve the wrapping from the source document (that is, where there are nonsemantic newlines in the source, there will be nonsemantic newlines in the output as well). Automatic wrapping does not currently work in HTML output. In ipynb output, this option affects wrapping of the contents of markdown cells.
I need to convert all the images in a folder from SVG to PNG. Let's say the images are called test1.svg, test2.svg, ..., testn.svg. Using the following script:
for i in *.svg
do
convert "$i" PNG24:"$i".png
done
does the job correctly, and the images are called test1.svg.png, test2.svg.png, ... , testn.svg.png. My questions are:
1) Is it possible to make the output images be called test1.png, test2.png, ... , testn.png, essentially removing the 'svg' part from the name?
2) Is it possible to send them directly into some other directory?
Thanks!
Yes. You can make another directory and send them there like this:
mkdir other
for i in *.svg; do
convert "$i" PNG24:other/"${i%.svg}.png"
done
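The renaming in that loop is plain POSIX parameter expansion: ${var%pattern} strips the shortest match of pattern from the end of $var, so stripping the old extension before appending the new one answers question 1. A quick illustration:

```shell
f="test1.svg"
# ${f%.svg} strips the trailing ".svg", leaving "test1"
echo "${f%.svg}.png"   # prints: test1.png
```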
If you have lots of images to process, and you are on macOS or Linux, I would recommend GNU Parallel to get the job done faster:
mkdir other
parallel convert {} PNG24:other/{.}.png ::: *.svg
I want to create a PDF with selectable/searchable text.
I have source.png, which has gone through some pre-processing before OCR, and view.jpg, which is a compressed version of source.png to reduce the size of the output PDF.
How do I define the view.jpg in the syntax?
tesseract -l eng source.png out pdf
I'm not sure whether you can specify view.jpg in the command. The out.pdf already contains a compressed version of source.png.
I have a PDF file (4.6 MB) which was made by combining 6 different PDFs (containing both text and bitmap graphics) using pdftk on Ubuntu 12.04. I wish to compress this file to something close to 2 MB without affecting its quality.
I have tried pdftk's "compress" option (it couldn't get the file down to 2 MB), and also tried converting it to PS first and then back to PDF, but that gives the following warning:
****Warning: considering '0000000000 XXXXX n' as a free entry.
and then hangs. qpdf also failed, saying that the file is damaged.
Could someone help me out?
What result does Ghostscript give you? Try this command:
gs \
-o output.pdf \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/screen \
input.pdf
The /screen preset produces the smallest file (it downsamples images to 72 dpi); /ebook, /printer, and /prepress trade progressively more size for quality.
Does this PDF contain private information? If it has no confidential data, it would be interesting to see it.
Anyway, in many cases where qpdf fails, Multivalent works.
You can try its Compress tool (it also attempts to repair the PDF file):
Multivalent
https://rg.to/file/c6bd7f31bf8885bcaa69b50ffab7e355/Multivalent20060102.jar.html
(the latest free version with the tools included; the current version no longer ships them)
java -cp path....to/Multivalent.jar tool.pdf.Compress file.pdf
This worked for me to repair the damaged PDF:
sudo apt-get install mupdf-tools
mutool clean input.pdf output.pdf
I've just very recently switched to Linux, and I want to change a load of files to have different extensions. For example, I want to change .doc/.docx to .txt, images to .jpg, and so on. Is there a csh script that would cover any extension, or would I have to write a new one for each file type?
I have this so far, but I'm not sure whether it will actually work. Any help is much appreciated!
#!/bin/bash
for f in *.$1
do
[ -f "$f" ] && mv -v "$f" "${f%$1}$2"
done
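The script above should work as written. Saved as, say, change-ext.sh (the name is just an example) and invoked as ./change-ext.sh doc txt, it renames every *.doc in the current directory to *.txt. Inlined with the arguments substituted, it behaves like this:

```shell
touch report.doc notes.doc      # sample files for illustration

# the loop from the script, with $1=doc and $2=txt
old=doc new=txt
for f in *."$old"; do
    [ -f "$f" ] && mv -v "$f" "${f%$old}$new"
done
ls report.txt notes.txt         # both files now carry the .txt extension
```

The [ -f "$f" ] guard skips the literal pattern *.doc that the glob leaves behind when no files match.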
No need to reinvent the wheel:
http://linux.die.net/man/1/rename
rename .doc .txt *.doc
Note that this is the util-linux rename syntax; on Debian/Ubuntu the default rename is the Perl version, which takes a regex instead: rename 's/\.doc$/.txt/' *.doc
You need dedicated programs to actually convert between file formats:
Use wvWare to convert doc to html
Use ImageMagick to convert png to jpg
Use html2text to convert html to txt
That would do the rename; keep in mind that renaming a Word document won't cause it to become text, though.