Combine multiple PDFs in Linux using a script?

I want to save/download PDFs from website X and then combine all of them into one, so that it is easy for me to see all of them at once.
What I did:
1. Get the PDFs from the website:
wget -r -l1 -A.pdf --no-parent http://linktoX
2. Combine the PDFs into one:
gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=Combined_$(date +%F).pdf -dBATCH file1.pdf file2.pdf file3.pdf
My question/problem is that I thought of automating all of this in one script, so that I don't have to do it every day; new PDFs are added to X daily.
So, how can I do step 2 above without giving the full list of all the PDFs? I tried file*.pdf in step 2, but it combined the PDFs in random order.
The next problem is that the total number of file*.pdf files is not the same every day, sometimes 5 PDFs, sometimes 10... but the nice thing is that they are named in order: file1.pdf, file2.pdf, ...
So I need some help with step 2 above, such that all PDFs are combined in order and I don't have to give the name of each PDF explicitly.
Thanks.
UPDATE:
This solved the problem
pdftk `ls -rt kanti*.pdf` cat output Kanti.pdf
I used ls -rt since file1.pdf was downloaded first, then file2.pdf, and so on... just doing ls -t put file20.pdf at the start and file1.pdf last...
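Putting the two steps together, a minimal daily script might look like this (a sketch only: the URL is the placeholder from the question, the filename prefix is the one from the update, and the dated output name mirrors the gs command in the question; note that wget -r may place the files in a subdirectory named after the host):
#!/bin/sh
# Fetch the new PDFs, then merge them in download order (oldest first).
wget -r -l1 -A.pdf --no-parent http://linktoX
pdftk `ls -rt kanti*.pdf` cat output "Kanti_$(date +%F).pdf"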

I've also used pdftk in the past with good results.
For listing the files in numeric order, you can instruct sort to ignore the first $n - 1 characters of the filename by doing this:
ls | sort -n -k 1.$n
So if you had file*.pdf:
$ ls | sort -n -k 1.5
file1.pdf
file2.pdf
file3.pdf
file4.pdf
file10.pdf
file11.pdf
file20.pdf
file21.pdf
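For example, the sorted list can be fed straight to pdftk (a sketch; it assumes the directory contains only the fileN.pdf files, that the names contain no spaces, and combined.pdf is just an example output name):
pdftk $(ls | sort -n -k 1.5) cat output combined.pdf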

I have used pdftk before for such concatenations, as pdftk happens to be readily available on Debian/Ubuntu.

You could do something like:
GSCOMMAND="gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=Combined_$(date +%F).pdf -dBATCH"
FILES=`ls file*.pdf | sort -n -k 1.5`
$GSCOMMAND $FILES
This assumes the files are named file1.pdf, file2.pdf, and so on (hence the 1.5 in the sort). See also the answer by alberge.
It will do strange things to files with spaces in their names, so you'll need to add quoting/escaping if you need to handle those; a space-safe sketch follows below.
I'm really curious what other people will come up with, as this seems like quite a quick and dirty solution to me, but it's getting better thanks to the other answers. :)
EDIT
Used the numerical sort command for FILES as suggested by alberge.
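If names with spaces are a real concern, a bash array sidesteps most of the quoting trouble (a sketch; it still breaks on newlines in filenames):
# bash only: collect the sorted names into an array, preserving spaces
FILES=()
while IFS= read -r f; do FILES+=("$f"); done < <(ls file*.pdf | sort -n -k 1.5)
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOUTPUTFILE="Combined_$(date +%F).pdf" "${FILES[@]}"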

Related

Get wget to download only new items from a list

I've got a file that contains a list of file paths. I’m downloading them like this with wget:
wget -i cram_download_list.txt
However, the list is long and my session gets interrupted. I'd like to look at the directory to see which files already exist, and only download the outstanding ones.
I've been trying to come up with an option involving comm, but can't work out how to loop it in with wget.
File contents look like this:
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239280/NA07037.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239286/NA11829.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239293/NA11918.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239298/NA11994.final.cram
I’m currently trying to do something like this:
ls *.cram | sed 's/^/ftp:\/\/ftp.sra.ebi.ac.uk\/vol1\/run\/ERR323\/ERR3239480\//' > downloaded.txt
comm -3 <(sort cram_download_list.txt) <(sort downloaded.txt) | tr -d " \t" > to_download.txt
wget -i to_download.txt
I'd like to look at the directory to see which files already exist, and only download the outstanding ones.
To get that behavior you can use the -nc (--no-clobber) flag. It skips downloads that would overwrite existing files. So in your case:
wget -nc -i cram_download_list.txt
Beware that this solution does not handle partially downloaded files.
wget -c -i <(find . -type f -name '*.cram' -printf '%f$\n' | grep -vf - cram_download_list.txt)
The find command lists files ending in .cram and prints each basename followed by a $ and a newline. That output is used as an inverted regex match list (grep -vf) against your download list, i.e. it removes any lines ending in the names of files you already have.
Added:
-c for finalizing incomplete files (i.e. resume download)
Note: does not handle spaces or newlines in file names well, but these are ftp-URLs so that should not be a problem in the first place.
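Spelled out step by step, the same idea might look like this (a sketch using intermediate files; have.txt and to_download.txt are just example names):
find . -type f -name '*.cram' -printf '%f$\n' > have.txt
grep -vf have.txt cram_download_list.txt > to_download.txt
wget -c -i to_download.txt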
If you also want to handle partially transferred files, you need to pass in the complete set of filenames so that wget is able to check their lengths. Which means that for this scenario the only way is:
wget -c -i cram_download_list.txt
Files that are already complete will only be checked and then skipped.

Using wget to download images and saving with specified filename

I'm using wget in the macOS Terminal to download images from a file where each image URL is on its own line, and that works perfectly with this command:
cut -f1 -d, images.txt | while read url; do wget ${url} -O $(basename ${url}); done
However, I want to specify the output filename it's saved as instead of using the basename. The filename is given in the next column, separated by either a space or a comma, and I can't quite figure out how to tell wget to use the 2nd column as the -O name.
I'm sure it's a simple change to my above command but after reading dozens of different posts on here and other sites I can't figure it out. Any help would be appreciated.
If you use whitespace as the separator it's very easy:
cat images.txt | while read url name; do wget ${url} -O ${name}; done
Explanation: instead of reading just one variable per line (${url}) as in your example, you read two (${url} and ${name}). The second one is your local filename. I assumed your images.txt file looks something like this:
http://cwsmgmt.corsair.com/newscripts/landing-pages/wallpaper/v3/Wallpaper-v3-2560x1440.jpg test.jpg
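If some lines use a comma instead of a space, you could let read split on either by setting IFS for the loop (a sketch; it assumes one url,name pair per line in images.txt):
while IFS=', ' read -r url name; do wget "$url" -O "$name"; done < images.txt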

How to rename files without changing extension in Linux 102221.pdf to 102221_name.pdf

How to rename files without changing the extension in Linux: 102221.pdf to 102221_name.pdf
This is what you want I think:
for x in *; do mv "$x" "${x%.*}_name.${x##*.}"; done
${x%.*} gives the name of the file without the extension
${x##*.} extracts the extension
ls * | sed -r 'p;s/\.pdf$/_name\.pdf/g' | xargs -n2 mv
List all the files with ls and pipe the output to sed. sed replaces .pdf with _name.pdf and outputs both the original file name and the new file name to xargs, which calls mv with the two parameters.
You can also use the rename command, which is simpler:
rename 's/\.pdf$/_name\.pdf/g' ./*
The regex pattern remains the same, though.
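If your rename is the Perl version (the one normally packaged on Debian/Ubuntu), a dry run with -n first shows what would be renamed without touching anything:
rename -n 's/\.pdf$/_name.pdf/' ./*.pdf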
Well, I am not so good with Linux, but I still found a working answer for you; hope it will solve your purpose.
Check the given link. You might need a lightweight tool called jhead; it is mainly for getting header information about a file, such as the created date and time, and you can find the information that suits you there.
Answer
https://superuser.com/questions/90057/linux-rename-file-but-keep-extension
jhead
http://www.sentex.net/~mwandel/jhead/

Specify extensions when using split (Linux)

I have a pretty simple question:
exec('split -d -l 10 _.txt part');
This splits my _.txt file into chunks part00, part01, etc.
Can I set a file extension for these chunks somehow?
Thank you,
It is possible by using the --filter option as documented in info coreutils 'split invocation':
split -d -l 10 _.txt part --filter='cat > $FILE.txt'
This will create part00.txt, part01.txt and so on. Also seems to work for binary files (with -b instead of -l).
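For binary input the same pattern should work with -b (a sketch; the 10k chunk size is only an example):
split -d -b 10k _.txt part --filter='cat > $FILE.txt'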
Another option is to rename the chunks after the split. A quick demo (touch just creates dummy files; drop the echo to perform the actual renames):
touch xaa xab xac; for f in xa{a..c}; do echo mv -- "$f" "$f.txt"; done

Question on grep

Out of the many results returned by grepping a particular pattern, if I want to use all the results one after the other in my script, how can I go about it? For example, I grep for .der in a certificate folder, which returns many results. I want to use each and every .der certificate listed by the grep command. How can I use one file after the other out of the grep results?
Are you actually grepping content, or just filenames? If it's file names, you'd be better off using the find command:
find /path/to/folder -name "*.der" -exec some other commands {} ";"
It should be quicker in general.
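As a concrete example, this prints a long listing of every matching certificate; find substitutes each file name for {} (the folder path is just a placeholder):
find /path/to/folder -name "*.der" -exec ls -l {} ";"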
One way is to use grep -l, which ensures you get each file only once: -l prints just the name of each matching file, not the matches themselves.
Then, you can loop on the results:
for file in `grep ....`
do
# work on $file
done
Also note that if you have spaces in your filenames, there are a ton of possible issues. See "Looping through files with spaces in the names?" on the Unix & Linux Stack Exchange.
You can use the output as part of a for loop, something like:
for cert in $(grep '\.der' *) ; do
echo ${cert} # or something else
done
Of course, if those der things are actually files (and you're using ls | grep to get them), you can directly use the files:
for cert in *.der ; do
echo ${cert} # or something else
done
In both cases, you may need to watch out for arguments with embedded spaces.
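If embedded spaces are a real possibility, one space-safe variant is to have find emit NUL-delimited names and read them back the same way (a sketch; requires bash):
find /path/to/folder -name '*.der' -print0 | while IFS= read -r -d '' cert; do
    echo "$cert" # or something else
done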
