I used the following command to convert and merge all the JPG files in a directory to a single PDF file:
convert *.jpg file.pdf
The files in the directory are numbered from 1.jpg to 123.jpg. The conversion went fine, but after converting, the pages were all mixed up. I wanted the PDF to have the pages in the same order as the files are named, from 1.jpg to 123.jpg. I tried the following command as well:
cd 1
FILES=$( find . -type f -name "*jpg" | cut -d/ -f 2)
mkdir temp && cd temp
for file in $FILES; do
BASE=$(echo $file | sed 's/.jpg//g');
convert ../$BASE.jpg $BASE.pdf;
done &&
pdftk *pdf cat output ../1.pdf &&
cd ..
rm -rf temp
But still no luck. The operating system is Linux.
From the manual of ls:
-v natural sort of (version) numbers within text
So, doing what we need in a single command:
convert $(ls -v *.jpg) foobar.pdf
Mind that convert is part of ImageMagick.
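To see the difference, compare the default (alphabetical) ordering with the natural sort; sample output, assuming the files 1.jpg through 123.jpg from the question:
$ ls *.jpg | head -4
1.jpg
10.jpg
100.jpg
101.jpg
$ ls -v *.jpg | head -4
1.jpg
2.jpg
3.jpg
4.jpg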
The problem is that your shell expands the wildcard in purely alphabetical order, and because the numbers have different lengths, the resulting order is incorrect:
$ echo *.jpg
1.jpg 10.jpg 100.jpg 101.jpg 102.jpg ...
The solution is to pad the filenames with zeros as required so they're the same length before running your convert command:
$ for i in *.jpg; do num=`expr match "$i" '\([0-9]\+\).*'`;
> padded=`printf "%03d" $num`; mv -v "$i" "${i/$num/$padded}"; done
Now the files will be matched by the wildcard in the correct order, ready for the convert command:
$ echo *.jpg
001.jpg 002.jpg 003.jpg 004.jpg 005.jpg 006.jpg 007.jpg 008.jpg ...
You could use
convert '%d.jpg[1-123]' file.pdf
via https://www.imagemagick.org/script/command-line-processing.php:
Another method of referring to other image files is by embedding a
formatting character in the filename with a scene range. Consider the
filename image-%d.jpg[1-5]. The command
magick image-%d.jpg[1-5] causes ImageMagick to attempt to read images
with these filenames:
image-1.jpg image-2.jpg image-3.jpg image-4.jpg image-5.jpg
See also https://www.imagemagick.org/script/convert.php
All of the above answers failed for me when I wanted to merge many high-resolution JPEG images (from a scanned book).
ImageMagick tried to load all files into RAM, so I used the following two-step approach:
find -iname "*.JPG" | xargs -I'{}' convert {} {}.pdf
pdfunite *.pdf merged_file.pdf
Note that with this approach, you can also use GNU parallel to speed up the conversion:
find -iname "*.JPG" | parallel -I'{}' convert {} {}.pdf
This is how I do it:
The first line converts all the JPG files to PDF using the convert command.
The second line merges all the PDF files into a single PDF, one page per file, using gs (the PostScript and PDF language interpreter and previewer).
for i in $(find . -maxdepth 1 -name "*.jpg" -print); do convert $i ${i//jpg/pdf}; done
gs -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=merged_file.pdf -dBATCH `find . -maxdepth 1 -name "*.pdf" -print`
https://gitlab.mister-muffin.de/josch/img2pdf
In all of the proposed solutions involving ImageMagick, the JPEG data gets fully decoded and re-encoded. This results in generation loss, as well as performance ten to a hundred times worse than img2pdf.
img2pdf is also available from many Linux distros, as well as via pip3.
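For example, a minimal sketch (the output name file.pdf is just a placeholder; img2pdf's -o option names the merged PDF, and ls -v keeps the pages in natural order):
pip3 install img2pdf
img2pdf $(ls -v *.jpg) -o file.pdf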
Mixing the first idea with their reply, I think this code may be satisfactory:
jpgs2pdf.sh
#!/bin/bash
cd "$1" || exit 1
FILES=$(find . -type f -name "*jpg" | cut -d/ -f 2)
mkdir -p temp
cd temp
for file in $FILES; do
BASE=$(basename "$file" .jpg)
convert "../$BASE.jpg" "$BASE.pdf"
done &&
pdftk $(ls -v *.pdf) cat output "../$(basename "$1").pdf"
cd ..
rm -rf temp
How to create a PDF document from a list of images
Step 1: Install parallel from your distribution's repository. This will speed up the process.
Step 2: Convert each jpg to pdf file
find -iname "*.JPG" | sort -V | parallel -I'{}' convert -compress jpeg -quality 25 {} {}.pdf
The sort -V will sort the file names in natural order.
Step 3: Merge all PDFs into one
pdfunite $(find -iname '*.pdf' | sort -V) output_document.pdf
Credit Gregor Sturm
Combining Felix Defrance's and Delan Azabani's answers (from above):
convert `for file in $FILES; do echo $file; done` test_2.pdf
Related
I have the following cmd that fetches all .pdf files with an STP pattern in the filename and places them into a folder:
find /home/OurFiles/Images/ -name '*.pdf' |grep "STP*" | xargs cp -t /home/OurFiles/ImageConvert/STP/
I have another cmd that converts pdf to jpg.
find /home/OurFiles/ImageConvert/STP/ -type f -name '*.pdf' -print0 |
while IFS= read -r -d '' file
do convert -verbose -density 500 -resize 800 "${file}" "${file%.*}.jpg"
done
Is it possible to combine these commands into one? Also, I would like to prepend a prefix to the converted image file name in the single command, if possible. Example: STP_OCTOBER.jpg to MSP-STP_OCTOBER.jpg. Any feedback is much appreciated.
find /home/OurFiles/Images/ -type f -name '*STP*.pdf' -exec sh -c '
destination=$1; shift # get the first argument
for file do # loop over the remaining arguments
fname=${file##*/} # get the filename part
cp "$file" "$destination" &&
convert -verbose -density 500 -resize 800 "$destination/$fname" "$destination/MSP-${fname%pdf}jpg"
done
' sh /home/OurFiles/ImageConvert/STP {} +
You could pass the destination directory and all PDFs found to find's -exec option to execute a small script.
The script removes the first argument and saves it to variable destination and then loops over the given PDF paths. For each filepath, extract the filename, copy the file to the destination directory and run the convert command if the copy operation was successful.
Maybe something like:
find /home/OurFiles/Images -type f -name 'STP*.pdf' -print0 |
while IFS= read -r -d '' file; do
destfile="/home/OurFiles/ImageConvert/STP/MSP-$(basename "$file" .pdf).jpg"
convert -verbose -density 500 -resize 800 "$file" "$destfile"
done
The only really new thing in this merged one compared to your two separate commands is using basename(1) to strip off the directories and extension from the filename in order to create the output filename.
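For reference, basename strips the directory and an optional suffix in one call; for example, with a hypothetical path:
$ basename /home/OurFiles/ImageConvert/STP/STP_OCTOBER.pdf .pdf
STP_OCTOBER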
I have a bash script that loops through the files in the raw folder and puts them into the audio folder. This works just fine.
#!/bin/bash
PATH_IN=('/nas/data/customers/test2/raw/')
PATH_OUT=('/nas/data/customers/test2/audio/')
mkdir -p /nas/data/customers/test2/audio
IFS=$'\n'
find $PATH_IN -type f -name '*.wav' -exec basename {} \; | while read -r file; do
sox -S ${PATH_IN}${file} -e signed-integer ${PATH_OUT}${file}
done
My issue is that, as the folders grow, I do not want to run the script on files that have already been converted, so I would like to loop over only the files that have not been converted yet, i.e. the files that are in raw but not in audio.
I found the command
diff audio raw
which can do just that, but I cannot find a good way to incorporate it into my bash script. Any help or nudges in the right direction would be highly appreciated.
You could do:
diff <(ls -1a $PATH_OUT) <(ls -1a $PATH_IN) | grep -E ">" | sed -E 's/> //'
The first part diffs the file listings of the two folders, the second part filters the output to keep only the additions, and the third strips the diff markers to leave just the names.
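A sketch of how this could be folded into the script, using a simple existence test instead of diff (which achieves the same goal); paths and the sox invocation are taken from the question:
#!/bin/bash
PATH_IN='/nas/data/customers/test2/raw/'
PATH_OUT='/nas/data/customers/test2/audio/'
mkdir -p "$PATH_OUT"
find "$PATH_IN" -type f -name '*.wav' -exec basename {} \; | while read -r file; do
    # only convert files that do not yet exist in the audio folder
    if [ ! -f "${PATH_OUT}${file}" ]; then
        sox -S "${PATH_IN}${file}" -e signed-integer "${PATH_OUT}${file}"
    fi
done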
I am faced with a challenge that requires multiple aspects of bash. I work on Linux (specifically Debian Stretch). Here is the situation (for each point/problem I also note the solution I have considered so far, but I'm open to other ideas):
I have videos of various types (and various upper/lower case extensions), such as .mp4, .mov, .MOV, .MP4, .avi, ..., located in a directory (and spread across an almost unstructured tree of directories). To find them all, I tried to use the find command.
For each video, I need to extract some metadata (i.e. the name of the file, duration of video, size of file and date of creation/last modification). The package mediainfo yields (among a lot of other things) the required fields.
The output of mediainfo is a long list of fields with format : <Tag>\t : <value>. I need to extract values for fields Complete name, Duration, File size and Encoded date.
So with all this information, I must filter out the required field values and put them in a CSV file. I considered using sed.
My goal is to achieve all these tasks either in a script or a small amount of separate commands.
The idea in code (this code is hideously wrong, but you get the idea):
find . -type f -name "*.[mp4|MP4|mov|MOV|avi|AVI]" -exec mediainfo {} | sed '/Complete name|Duration|File size|Encoded date/p' > myfile.csv \;
Would you have any idea how to perform this task? I feel terribly lost combining find, exec, and sed and outputting to a CSV...
Thanks in advance for your help !
So I finally managed to write a script doing that. Probably not the best way to do it, but here it is:
resFile="myresult.csv"
dstDir="./destination/"
srcDir="./source/"
#first copy all files at same level in dstDir (with preserve and update)
#this is somehow necessary, relative name for MOV files and mediainfo
#do not seem to work together.
find $srcDir -type f \( -name "*.mp4" -o -name "*.mov" -o -name "*.MOV" -o -name "*.avi" \) -exec cp -up {} $dstDir \;
#then for each file, output mediainfo of file and keep only interesting tags. add ### between each file.
find $dstDir -type f \( -name "*.mp4" -o -name "*.mov" -o -name "*.MOV" -o -name "*.avi" \) \
-exec sh -c "mediainfo --Output=XML {} | sed '1,15!d;/Duration\|Complete\|File_size\|Encoded_date/!d' >> $resFile && echo '########' >> $resFile" \;
#removes tags : <Duration>42s 15ms</Duration> -> 42s 15ms
sed -i 's/^<.*>\(.*\)<.*>/\1/I' $resFile
#Extract exact filename (and not relative)
sed -i 's/^\.\/.*\/\(.*\)\.[mp4|MOV|mov|avi|MP4]/\1/' $resFile
#Puts fields for a file on a unique line separated with commas
sed -i 'N;s/\n/,/;N;s/\n/,/;N;s/\n/,/;N;s/\n/,/' $resFile
#remove all trailing ###
sed -i 's/,#*$//' $resFile
I would still be interested if anyone has ideas to improve the code.
I "minimized" it a little bit; my actual code is a bit more modular and performs a few checks.
Try this. Due to lack of time, I was not able to complete it; you just have to send the output to CSV.
for c in $(locate --basename .mp4 .mkv .wmv .flv .webm .mov .avi)
do
Complete_name=$(mediainfo --Output=XML "$c" | xml_grep 'Complete_name' --text_only | awk 'BEGIN{FS="/"}{print $NF}')
echo "$Complete_name"
Duration=$(mediainfo --Output=XML "$c" | xml_grep 'Duration' --text_only --nb_result 1)
echo "$Duration"
File_size=$(mediainfo --Output=XML "$c" | xml_grep 'File_size' --text_only)
echo "$File_size"
Encoded_date=$(mediainfo --Output=XML "$c" | xml_grep 'Encoded_date' --text_only --nb_result 1 | awk '{print $2}')
echo "$Encoded_date"
done
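To finish it, the four values could be appended as one CSV row per file inside the loop (just before done); a minimal sketch, with myfile.csv as a placeholder name:
echo "$Complete_name,$Duration,$File_size,$Encoded_date" >> myfile.csv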
I have a directory with a few hundred PDFs in it.
All of the PDFs filenames begin with a 5 digit number (and then have a bunch of other stuff at the end).
What I need to do is merge any PDFs together that start with the same 5 digit number.
Thoughts on how to do this via a shell script? Or other options? I'm using pdftk on Ubuntu.
Try this:
find . -type f -iname "[0-9][0-9][0-9][0-9][0-9]*.pdf" -printf "%.5f\n" \
| sort -u \
| while read -r file; do
echo pdftk ${file}*.pdf cat output $file.pdf ;
done
If the output looks okay, remove the echo.
Alright, so simple problem here. I'm working on a simple back up code. It works fine except if the files have spaces in them. This is how I'm finding files and adding them to a tar archive:
find . -type f | xargs tar -czvf backup.tar.gz
The problem is when the file has a space in the name because tar thinks that it's a folder. Basically is there a way I can add quotes around the results from find? Or a different way to fix this?
Use this:
find . -type f -print0 | tar -czvf backup.tar.gz --null -T -
It will:
deal with files with spaces, newlines, leading dashes, and other funniness
handle an unlimited number of files
won't repeatedly overwrite your backup.tar.gz like using tar -c with xargs will do when you have a large number of files
Also see:
GNU tar manual
How can I build a tar from stdin?, search for null
There could be another way to achieve what you want. Basically,
Use the find command to output the paths of whatever files you're looking for, and redirect stdout to a filename of your choosing.
Then run tar with the -T option, which allows it to take a list of file locations (the one you just created with find!):
find . -name "*.whatever" > yourListOfFiles
tar -cvf yourfile.tar -T yourListOfFiles
Try running:
find . -type f | xargs -d "\n" tar -czvf backup.tar.gz
Why not:
tar czvf backup.tar.gz *
Sure it's clever to use find and then xargs, but you're doing it the hard way.
Update: Porges has commented with a find-option that I think is a better answer than my answer, or the other one: find -print0 ... | xargs -0 ....
If you have multiple files or directories and you want to compress each of them into an independent *.gz file, you can do this. The -type f and -mtime tests are optional.
find -name "httpd-log*.txt" -type f -mtime +1 -exec tar -vzcf {}.gz {} \;
This will compress
httpd-log01.txt
httpd-log02.txt
to
httpd-log01.txt.gz
httpd-log02.txt.gz
Would add a comment to #Steve Kehlet post but need 50 rep (RIP).
For anyone that has found this post through numerous googling, I found a way to not only find specific files given a time range, but also NOT include the relative paths OR whitespaces that would cause tarring errors. (THANK YOU SO MUCH STEVE.)
find . -name "*.pdf" -type f -mtime 0 -printf "%f\0" | tar -czvf /dir/zip.tar.gz --null -T -
. relative directory
-name "*.pdf" look for pdfs (or any file type)
-type f type to look for is a file
-mtime 0 look for files modified in the last 24 hours
-printf "%f\0" Regular -print0 OR -printf "%f" did NOT work for me. From man pages:
This quoting is performed in the same way as for GNU ls. This is not the same quoting mechanism as the one used for -ls and -fls. If you are able to decide what format to use for the output of find then it is normally better to use '\0' as a terminator than to use newline, as file names can contain white space and newline characters.
-czvf create archive, filter the archive through gzip, verbosely list files processed, archive name
Edit 2019-08-14:
I would like to add that I was also able to use essentially the same command as in my comment, just using tar itself:
tar -czvf /archiveDir/test.tar.gz --newer-mtime=0 --ignore-failed-read *.pdf
Needed --ignore-failed-read in-case there were no new PDFs for today.
Why not give something like this a try: tar cvf scala.tar `find src -name '*.scala'`
Another solution as seen here:
find var/log/ -iname "anaconda.*" -exec tar -cvzf file.tar.gz {} +
The best solution seems to be to create a file list and then archive the files, because you can use other sources and do something else with the list.
For example this allows using the list to calculate size of the files being archived:
#!/bin/sh
backupFileName="backup-big-$(date +"%Y%m%d-%H%M")"
backupRoot="/var/www"
backupOutPath=""
archivePath=$backupOutPath$backupFileName.tar.gz
listOfFilesPath=$backupOutPath$backupFileName.filelist
#
# Make a list of files/directories to archive
#
echo "" > $listOfFilesPath
echo "${backupRoot}/uploads" >> $listOfFilesPath
echo "${backupRoot}/extra/user/data" >> $listOfFilesPath
find "${backupRoot}/drupal_root/sites/" -name "files" -type d >> $listOfFilesPath
#
# Size calculation
#
sizeForProgress=`
cat $listOfFilesPath | while read nextFile;do
if [ ! -z "$nextFile" ]; then
du -sb "$nextFile"
fi
done | awk '{size+=$1} END {print size}'
`
#
# Archive with progress
#
## simple with dump of all files currently archived
#tar -czvf $archivePath -T $listOfFilesPath
## progress bar
sizeForShow=$(($sizeForProgress/1024/1024))
echo -e "\nRunning backup [source files are $sizeForShow MiB]\n"
tar -cPp -T $listOfFilesPath | pv -s $sizeForProgress | gzip > $archivePath
Big warning on several of the solutions (and your own test):
When you do: anything | xargs something
xargs will try to fit "as many arguments as possible" after "something", but then you may end up with multiple invocations of "something".
So your attempt: find ... | xargs tar czvf file.tgz
may end up overwriting "file.tgz" at each invocation of "tar" by xargs, and you end up with only the last invocation! (the chosen solution uses a GNU -T special parameter to avoid the problem, but not everyone has that GNU tar available)
You could do instead:
find . -type f -print0 | xargs -0 tar -rvf backup.tar
gzip backup.tar
Proof of the problem on cygwin:
$ mkdir test
$ cd test
$ seq 1 10000 | sed -e "s/^/long_filename_/" | xargs touch
# create the files
$ seq 1 10000 | sed -e "s/^/long_filename_/" | xargs tar czvf archive.tgz
# will invoke tar several times, as it can't fit 10000 long filenames into one invocation
$ tar tzvf archive.tgz | wc -l
60
# on my machine, I end up with only the last 60 filenames,
# as the last invocation of tar by xargs overwrote the previous one(s)
# proper way to invoke tar: with -r (which appends to an existing tar file, whereas c would overwrite it)
# caveat: you can't have it compressed (you can't add to a compressed archive)
$ seq 1 10000 | sed -e "s/^/long_filename_/" | xargs tar rvf archive.tar #-r, and without z
$ gzip archive.tar
$ tar tzvf archive.tar.gz | wc -l
10000
# we have all our files, despite xargs making several invocations of the tar command
Note: that behavior of xargs is a well-known difficulty, and it is also why, when someone wants to do:
find .... | xargs grep "regex"
they instead have to write it:
find ..... | xargs grep "regex" /dev/null
That way, even if the last invocation of grep by xargs appends only one filename, grep sees at least two filenames (each time it has /dev/null, where it won't find anything, plus the filename(s) appended by xargs after it) and thus will always display the file names when something matches "regex". Otherwise you may end up with the last results showing matches without a filename in front.
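With GNU grep, passing -H (--with-filename) achieves the same effect without the /dev/null trick; a minimal sketch (the *.txt pattern is just an example):
find . -name '*.txt' -print0 | xargs -0 grep -H "regex"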