Merging PDF files with similar names using PDFTK and a bash script - linux

I have a directory with a few hundred PDFs in it.
All of the PDFs filenames begin with a 5 digit number (and then have a bunch of other stuff at the end).
What I need to do is merge any PDFs together that start with the same 5 digit number.
Thoughts on how to do this via a shell script? Or other options? I'm using pdftk on Ubuntu.

Try this:
find . -type f -iname "[0-9][0-9][0-9][0-9][0-9]*.pdf" -printf "%.5f\n" \
| sort -u \
| while read -r prefix; do
    # write the merged file under a name the glob cannot match,
    # or a re-run would feed the merged PDF back into itself
    echo pdftk "${prefix}"*.pdf cat output "merged_${prefix}.pdf"
done
If the output looks okay, remove the echo.
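For example, with two hypothetical files 00042_invoice.pdf and 00042_receipt.pdf, the dry run would print the expanded command:
pdftk 00042_invoice.pdf 00042_receipt.pdf cat output merged_00042.pdf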

Related

Select and looping over differences in two directories linux

I have a bash script that loops through files in the raw folder and puts them into the audio folder. This works just fine.
#!/bin/bash
PATH_IN='/nas/data/customers/test2/raw/'
PATH_OUT='/nas/data/customers/test2/audio/'
mkdir -p "$PATH_OUT"
IFS=$'\n'
find "$PATH_IN" -type f -name '*.wav' -exec basename {} \; | while read -r file; do
    sox -S "${PATH_IN}${file}" -e signed-integer "${PATH_OUT}${file}"
done
My issue is that, as the folders grow, I do not want to run the script on files that have already been converted, so I would like to loop over only the files that have not been converted yet, i.e. the files in raw but not in audio.
I found the command
diff audio raw
that does just that, but I cannot find a good way to incorporate it into my bash script. Any help or nudges in the right direction would be highly appreciated.
You could do:
diff <(ls -1a "$PATH_OUT") <(ls -1a "$PATH_IN") | grep '^>' | sed 's/^> //'
The first part diffs the listings of both folders, the second keeps only the additions (lines starting with >), and the third strips the diff markers to leave just the names.
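One way to fold that back into the original script is comm, which compares two sorted listings directly (a sketch; it assumes plain filenames without embedded newlines):
#!/bin/bash
PATH_IN='/nas/data/customers/test2/raw/'
PATH_OUT='/nas/data/customers/test2/audio/'
mkdir -p "$PATH_OUT"
# comm -23 prints lines unique to the first listing,
# i.e. files present in raw/ but not yet in audio/
comm -23 <(ls "$PATH_IN" | sort) <(ls "$PATH_OUT" | sort) \
| while read -r file; do
    sox -S "${PATH_IN}${file}" -e signed-integer "${PATH_OUT}${file}"
done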

Only list files; iconv and directories?

I want to convert the encoding of some csv files with iconv. It has to be a script, so I am working with while; do done. The script lists every item in a specific directory and converts them to another encoding (utf-8).
Currently, my script lists EVERY item, including directories... So here are my questions
Does iconv have a problem with directories, or does it ignore them?
And if there is a problem, how can I list/search only for files?
I tried How to list only files in Bash?, but it puts a ./ at the beginning of every item, and that's kinda annoying (and my program doesn't like it, either).
Another possibility is ls -p | grep -v / but this would also affect files with / in the name, wouldn't it?
I hope you can help me. Thank you.
Here is the code:
for item in $(ls directory/); do
    FileName=$item
    iconv -f "windows-1252" -t "UTF-8" "$FileName" -o "$FileName"
done
Yeah, I know, the input and output file cannot be the same^^
Use find directly:
find . -maxdepth 1 -type f -exec bash -c 'iconv -f "windows-1252" -t "UTF-8" "$1" > "$1.converted" && mv "$1.converted" "$1"' -- {} \;
find . -maxdepth 1 -type f finds all files in the working directory
-exec ... executes a command on each such file (including correct handling of e.g. spaces or newlines in the filename)
bash -c '...' executes the command in '...' in a subshell (easier to do the subsequent steps, involving multiple expansions of the filename, this way)
-- terminates option processing, and treats anything after the -- as arguments to the call.
{} is replaced by find with the file name(s) found
$1 in the bash command is replaced with the first (and only) argument, which is the {} replaced by the filename (see above)
\; tells find where the -exec'ed command ends.
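For comparison, the same per-file conversion can be written as a NUL-delimited read loop instead of -exec (a sketch, assuming bash):
find . -maxdepth 1 -type f -print0 | while IFS= read -r -d '' f; do
    # convert to a temporary file, then replace the original
    iconv -f "windows-1252" -t "UTF-8" "$f" > "$f.converted" && mv "$f.converted" "$f"
done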
Building upon the existing question that you referenced: why don't you just remove the first two characters, i.e. the ./?
find . -maxdepth 1 -type f | cut -c 3-
Edit: I agree with @DevSolar about the word-splitting problem in the for loop. While I think that his solution is better for this problem, I just want to give an alternative way to get around that issue.
OLD_IFS=$IFS
IFS=$'\n'
for item in $(find . -maxdepth 1 -type f | cut -c 3-); do
    FileName=$item
    iconv -f "windows-1252" -t "UTF-8" "$FileName" -o "$FileName.converted" && mv "$FileName.converted" "$FileName"
done
IFS=$OLD_IFS

Merge pdf files with numerical sort

I am trying to write a bash script to merge all pdf files of a directory into one single pdf file. The command pdfunite *.pdf output.pdf successfully achieves this, but it merges the input documents in lexicographic order:
1.pdf
10.pdf
11.pdf
2.pdf
3.pdf
4.pdf
5.pdf
6.pdf
7.pdf
8.pdf
9.pdf
while I'd like the documents to be merged in a numerical order:
1.pdf
2.pdf
3.pdf
4.pdf
5.pdf
6.pdf
7.pdf
8.pdf
9.pdf
10.pdf
11.pdf
I guess a command mixing ls -v or sort -n and pdfunite would do the trick but I don't know how to combine them.
Any idea on how I could merge pdf files with a numerical sort?
You can embed the result of a command using $(),
so you can do the following:
$ pdfunite $(ls -v *.pdf) output.pdf
or
$ pdfunite $(ls *.pdf | sort -n) output.pdf
However, note that this does not work when a filename contains special characters such as whitespace.
In that case you can do the following:
ls -v *.pdf | bash -c 'IFS=$'"'"'\n'"'"' read -d "" -ra x; pdfunite "${x[@]}" output.pdf'
Although it seems a little bit complicated, it's just a combination of:
Bash: Read tab-separated file line into array
build argument lists containing whitespace
How to escape single-quotes within single-quoted strings?
Note that you cannot simply use xargs, since pdfunite requires the input pdfs in the middle of its argument list.
I avoided using readarray since it is not supported in older bash versions, but you can use it instead of IFS=.. read -ra .. if you have a newer bash.
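For instance, a sketch with mapfile (a bash 4+ synonym of readarray):
# read the newline-delimited listing into an array, then pass it whole
mapfile -t files < <(ls -v -- *.pdf)
pdfunite "${files[@]}" output.pdf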
Do it in multiple steps. I am assuming you have files from 1 to 99.
pdfunite $(find ./ -regex ".*[^0-9][0-9][^0-9].*" | sort) out1.pdf
pdfunite out1.pdf $(find ./ -regex ".*[^0-9]1[0-9][^0-9].*" | sort) out2.pdf
pdfunite out2.pdf $(find ./ -regex ".*[^0-9]2[0-9][^0-9].*" | sort) out3.pdf
and so on.
The final file will consist of all your pdfs in numerical order.
!!!
Beware of specifying the output file (out1.pdf etc.): pdfunite treats its last argument as the output, so if you leave it off, the last input file gets overwritten.
!!!
Edit:
Sorry I was missing the [^0-9] in each regex. Corrected it in the above commands.
You can rename your documents, e.g. 001.pdf, 002.pdf and so on.
destfile=combined.pdf
find . -maxdepth 1 -type f -name '*.pdf' -print0 \
| sort -z -t '/' -k2n \
| { cat; printf '%s\0' "$destfile"; } \
| xargs -0 -x pdfunite
Variable destfile holds the name of the destination pdf file.
The find command finds all the pdf files in the current directory and outputs them as a NUL delimited list.
The sort command reads the NUL delimited list of filenames. It specifies a field delimiter of /. It sorts by the 2nd field numerically. (Recall that the output of find looks like ./11.pdf ....)
We append destfile before sending to xargs, being sure to end it with a NUL.
xargs reads the NUL delimited args and supplies them to the pdfunite command. We supplied the -x option so that xargs will exit if the command length is too long. We don't want xargs to execute a partially constructed command.
This solution handles filenames with embedded newlines and spaces.

Move files to directories based on extension

I am new to Linux. I am trying to write a shell script which will move files to certain folders based on their extension; for example, my downloads folder has files of mixed types. I have written the following script:
mv *.mp3 ../Music
mv *.ogg ../Music
mv *.wav ../Music
mv *.mp4 ../Videos
mv *.flv ../Videos
How can I make it run automatically when a file is added to this folder? Now I have to manually run the script each time.
One more question, is there any way of combining these 2 statements
mv *.mp3 ../../Music
mv *.ogg ../../Music
into a single statement? I tried using || (C programming 'or' operator) and comma but they don't seem to work.
There is no trigger for when a file is added to a directory. If the file is uploaded via a webpage, you might be able to make the webpage do it.
You can put a script in crontab to do this on Unix machines (or Task Scheduler on Windows). Google crontab for a how-to.
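For example, a crontab entry that runs a sorting script every five minutes might look like this (the script path is illustrative):
*/5 * * * * /home/user/bin/sort-downloads.sh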
As for combining your commands, use the following:
mv *.mp3 *.ogg ../../Music
You can include as many different "globs" (filenames with wildcards) as you like. The last thing should be the target directory.
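With GNU mv you can also name the target directory first via the -t option, which reads naturally when the list of globs grows long:
mv -t ../../Music *.mp3 *.ogg *.wav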
Two ways:
find . -name '*mp3' -or -name '*ogg' -print | xargs -J% mv % ../../Music
find . -name '*mp3' -or -name '*ogg' -exec mv {} ../../Music \;
The first uses a pipe and may run out of argument space (note that -J is a BSD xargs option; GNU xargs offers -I instead), while the second may use too many forks and be slower. But both will work.
Another way is:
mv -v {*.mp3,*.ogg,*.wav} ../Music
mv -v {*.mp4,*.flv} ../Videos
PS: option -v shows what is going on (verbose).
I like this method:
#!/bin/bash
for filename in *; do
    if [[ -f "$filename" ]]; then
        base=${filename%.*}        # name without its extension
        ext=${filename#"$base".}   # the extension alone
        mkdir -p "${ext}"          # one directory per extension
        mv "$filename" "${ext}"
    fi
done
incron will watch the filesystem and run commands upon certain events.
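A sample incrontab entry (edit it with incrontab -e) that runs a sorter script whenever a file is created in, or moved into, the watched directory; both paths are illustrative:
/home/user/Downloads IN_CREATE,IN_MOVED_TO /home/user/bin/sort-downloads.sh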
You can combine multiple commands on a single line by using a command separator. The unconditional serialized command separator is ;.
command1 ; command2
You can use a for loop to traverse the folders and subfolders inside the source folder.
The following code moves files in pairs from "/source/folder/path/" to "/destination/folder/path/": it matches files by name across different extensions.
for d in /source/folder/path/*; do
    ls -tr "$d" | grep txt | rev | cut -f 2 -d '.' | rev | uniq | head -n 4 \
      | xargs -I % bash -c 'mv -v "$1"/%.{txt,csv} /destination/folder/path/' -- "$d"
    sleep 30
done

Merge multiple JPGs into single PDF in Linux

I used the following command to convert and merge all the JPG files in a directory to a single PDF file:
convert *.jpg file.pdf
The files in the directory are numbered from 1.jpg to 123.jpg. The conversion went fine but after converting, the pages were all mixed up. I wanted the PDF to have pages from 1.jpg to 123.jpg in the same order as they are named. I tried it with the following command as well:
cd 1
FILES=$( find . -type f -name "*jpg" | cut -d/ -f 2)
mkdir temp && cd temp
for file in $FILES; do
BASE=$(echo $file | sed 's/.jpg//g');
convert ../$BASE.jpg $BASE.pdf;
done &&
pdftk *pdf cat output ../1.pdf &&
cd ..
rm -rf temp
But still no luck. Operating system is Linux.
From the manual of ls:
-v natural sort of (version) numbers within text
So, doing what we need in a single command:
convert $(ls -v *.jpg) foobar.pdf
Mind that convert is part of ImageMagick.
The problem is that your shell expands the wildcard in purely alphabetical order, and because the lengths of the numbers differ, the resulting order is incorrect:
$ echo *.jpg
1.jpg 10.jpg 100.jpg 101.jpg 102.jpg ...
The solution is to pad the filenames with zeros as required so they're the same length before running your convert command:
$ for i in *.jpg; do num=`expr match "$i" '\([0-9]\+\).*'`;
> padded=`printf "%03d" $num`; mv -v "$i" "${i/$num/$padded}"; done
Now the files will be matched by the wildcard in the correct order, ready for the convert command:
$ echo *.jpg
001.jpg 002.jpg 003.jpg 004.jpg 005.jpg 006.jpg 007.jpg 008.jpg ...
You could use
convert '%d.jpg[1-123]' file.pdf
via https://www.imagemagick.org/script/command-line-processing.php:
Another method of referring to other image files is by embedding a
formatting character in the filename with a scene range. Consider the
filename image-%d.jpg[1-5]. The command
magick image-%d.jpg[1-5] causes ImageMagick to attempt to read images
with these filenames:
image-1.jpg image-2.jpg image-3.jpg image-4.jpg image-5.jpg
See also https://www.imagemagick.org/script/convert.php
All of the above answers failed for me when I wanted to merge many high-resolution jpeg images (from a scanned book).
ImageMagick tried to load all files into RAM, so I used the following two-step approach:
find -iname "*.JPG" | xargs -I'{}' convert {} {}.pdf
pdfunite *.pdf merged_file.pdf
Note that with this approach, you can also use GNU parallel to speed up the conversion:
find -iname "*.JPG" | parallel -I'{}' convert {} {}.pdf
This is how I do it:
The first line converts all jpg files to pdf, using the convert command.
The second line merges all the pdf files into a single one, one page per input pdf, using gs (the PostScript and PDF language interpreter and previewer).
for i in $(find . -maxdepth 1 -name "*.jpg" -print); do convert "$i" "${i%.jpg}.pdf"; done
gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=merged_file.pdf -dBATCH `find . -maxdepth 1 -name "*.pdf" -print`
https://gitlab.mister-muffin.de/josch/img2pdf
In all of the proposed solutions involving ImageMagick, the JPEG data gets fully decoded and re-encoded. This results in generation loss, as well as performance ten to a hundred times worse than img2pdf.
img2pdf is also available from many Linux distros, as well as via pip3.
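Applied to this question's files, a sketch of its command-line use (keeping the natural numeric order) could be:
img2pdf $(ls -v *.jpg) -o merged.pdf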
Mixing the first idea with that reply, I think this code may be satisfactory:
jpgs2pdf.sh
#!/bin/bash
# Usage: jpgs2pdf.sh /path/to/dir  -> writes /path/to/dir/<dirname>.pdf
cd "$1" || exit 1
FILES=$(find . -type f -name "*jpg" | cut -d/ -f 2)
mkdir -p temp
cd temp
for file in $FILES; do
    BASE=$(echo "$file" | sed 's/\.jpg$//')
    convert "../$BASE.jpg" "$BASE.pdf"
done &&
pdftk `ls -v *pdf` cat output "../$(basename "$1").pdf"
cd ..
rm -rf temp
How to create a PDF document from a list of images
Step 1: Install parallel from your repository. This will speed up the process.
Step 2: Convert each jpg to pdf file
find -iname "*.JPG" | sort -V | parallel -I'{}' convert -compress jpeg -quality 25 {} {}.pdf
The sort -V will sort the file names in natural order.
Step 3: Merge all PDFs into one
pdfunite $(find -iname '*.pdf' | sort -V) output_document.pdf
Credit Gregor Sturm
Combining Felix Defrance's and Delan Azabani's answers (from above):
convert `for file in $FILES; do echo $file; done` test_2.pdf
