Saving multiple URLs at once using Linux CentOS

So I have a list of about 1000 URLs in a txt file, one per line. I wish to save the contents of every page to a file; how can I automate this?

You can use wget with the -i option to let it download a list of URLs. Assuming your URLs are stored in a file called urls.txt:
wget -i urls.txt
The problem here might be that the filenames are the same for multiple websites (e.g. index.html); in that case wget appends a number, which makes it hard or impossible to connect a file to the original URL just by looking at the filename.
The solution to that would be to use a loop like this:
while read -r line
do
wget "$line" -O <...>
done < urls.txt
You can specify a custom filename with the -O option.
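For example, a minimal sketch that just numbers the downloads in the order they appear in urls.txt (the page_N.html naming is purely illustrative):
n=0
while read -r line
do
n=$((n + 1))
wget "$line" -O "page_${n}.html"   # illustrative naming scheme; pick whatever suits you
done < urls.txt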
Or you can "build" the file name from the URL you are processing.
while read -r line
do
fname=$(echo "$line" | sed -e 's~http[s]*://~~g' -e 's~[^A-Za-z0-9]~-~g')
fname=${fname}.html
wget "$line" -O "$fname"
done < urls.txt

Related

Linux: batch filename change adding creation date

I have a directory with a lot of sub-directories containing files.
For each WAV file I would like to rename it by adding the creation date (the date when the WAV file was first created) at the beginning of the filename (without changing the timestamps of the file itself).
The next step would be to convert the WAV files to MP3, so I will save hard drive space.
For that purpose, I'm trying to create a bash script, but I'm having some issues.
I want to keep the same structure as the original directory, and therefore I was thinking of something like:
for file in `ls -1 *.wav`
do name=`stat -c %y $file | awk -F"." '{ print $1 }' | sed -e "s/\-//g" -e "s/\://g" -e "s/[ ]/_/g"`.wav
cp -r --preserve=timestampcp $dir_original/$file $dir_converted/$name
done
Don't use ls to generate a list of file names, just let the shell glob them (that's what ls *.wav does anyway):
for file in ./*.wav ; do
I think you want the timestamp in the format YYYYMMDD_HHMMSS?
You could use GNU date with stat to have a somewhat neater control of the output format:
epochtime=$(stat -c %Y "$file" )
name=$(date -d "#$epochtime" +%Y%m%d_%H%M%S).wav
stat -c %Y (or %y) gives the last modification time; you can't reliably get the file creation date on Linux systems.
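That said, on filesystems and kernels that actually record a birth time, a reasonably recent GNU stat may be able to report it; treat this as a sketch, since support varies:
birth=$(stat -c %W "$file")   # seconds since epoch; prints 0 when the birth time is unknown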
That cp looks OK, except for the stray cp at the end of timestampcp, but that must be a typo. If you use ./*.wav, the file names will be relative to the current directory anyway, so there is no need to prefix them with $dir_original/.
If you want to walk through a whole subdirectory, use Bash's globstar feature, or find. Something like this:
shopt -s globstar
cd "$sourcedir"
for file in ./**/*.wav ; do
epochtime=$(stat -c %Y "$file" )
name=$(date -d "#$epochtime" +%Y%m%d_%H%M%S).wav
dir=$(dirname "$file")
mkdir -p "$target/$dir"
cp -r --preserve=timestamp "$file" "$target/$dir/$name"
done
The slight inconvenience here is that cp can't create the directories in the path, so we need to use mkdir there. Also, I'm not sure if you wanted to keep the original filename as part of the resulting one; this would remove it and just replace the file names with the timestamp.
I did some experimenting with the calculation of name to see if I could get it more succinctly, and came up with this:
name=$(date "+%Y%m%d_%H%M%S" -r "$file")
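If you do want to keep the original filename as part of the result, a sketch along the same lines (reusing the $target/$dir layout from above) would prefix the timestamp instead of replacing the name:
name=$(date "+%Y%m%d_%H%M%S" -r "$file")_$(basename "$file")   # e.g. 20240101_120000_recording.wav
cp --preserve=timestamps "$file" "$target/$dir/$name"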
I wanted to add the creation date to all file names in that folder, and the script below works perfectly.
#############################
#!/bin/sh
for file in *.JPG
do
mv -f "$file" "$(date -r "$file" +"%Y%m%d_%H_%M_%S")_${file}.jpg"
done
##############################

Linux bash: output directory files to a text file with xargs and add a new line

I want to generate a text file with the list of files present in the folder:
ls | xargs echo > text.txt
I want to prepend the IP address to each file so that I can run parallel wget as per this post: Parallel wget in Bash
So my text.txt file content will have these lines :
123.123.123.123/file1
123.123.123.123/file2
123.123.123.123/file3
How can I prepend a string to each name as ls feeds xargs (and also add a line break at the end)?
Thank you
Simply use printf and globbing to get the filenames:
printf '123.123.123.123/%s\n' * >file.txt
Or, as a longer approach, use a for loop together with globbing:
for f in *; do echo "123.123.123.123/$f"; done >file.txt
Both assume that no filename contains a newline.
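If you specifically want to keep the ls | xargs pipeline from the question, a sketch with the same caveat about newlines in filenames (spaces are fine, since -I splits on lines) could look like:
ls | xargs -I{} echo "123.123.123.123/{}" > text.txt   # one echo per filename, each on its own line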

Bash loop to gunzip file and remove file extension and file prefixes

I have several .vcf.gz files:
subset_file1.vcf.vcf.gz
subset_file2.vcf.vcf.gz
subset_file3.vcf.vcf.gz
I want to gunzip these files and rename them (remove subset_ and the redundant .vcf extension) in one go, to get these files:
file1.vcf
file2.vcf
file3.vcf
This is the script I have tried:
iFILES=/file/path/*.gz
for i in $iFILES;
do gunzip -k $i > /get/in/this/dir/"${i##*/}"
done
Since you have to do three operations on the output path name:
1. remove the directory part
2. remove the prefix subset_
3. remove the redundant extension .vcf
it's hard to accomplish with only one command.
The following is a modified version. Be CAREFUL when trying it; I didn't test it thoroughly on my computer.
for i in /file/path/*.gz;
do
# get the output file name
o=$(echo "${i##*/}" | sed 's/.*_\(.*\)\(\.[a-z]\{3\}\)\{2\}.*/\1\2/g')
gunzip -c "$i" > /get/in/this/dir/"$o"
done
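An alternative sketch that avoids sed entirely, assuming the filenames all follow the subset_*.vcf.vcf.gz pattern shown above, is to peel the name apart with Bash parameter expansion:
for i in /file/path/*.gz
do
o=${i##*/}          # strip the directory part
o=${o#subset_}      # strip the "subset_" prefix
o=${o%.vcf.gz}      # strip the trailing ".vcf.gz", leaving e.g. "file1.vcf"
gunzip -c "$i" > /get/in/this/dir/"$o"
done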

Open multiple webpages, use wget and merge the output

I have a text file containing a bunch of webpages:
http://rest.kegg.jp/link/pathway/7603
http://rest.kegg.jp/link/pathway/5620
…
My aim is to download all info on these pages to a single text file.
The following works perfectly, but it gives me 3000+ text files. How could I simply merge all the output files during the loop?
while read i; do wget $i; done < urls.txt
Thanks a lot
Use the -O file option. When a single wget invocation downloads several documents, they are all concatenated into the specified file; in a loop, however, each wget call truncates the file, so write to standard output and append instead:
while read -r i; do wget -O - "$i" >> outputFile; done < urls.txt
outputFile will then contain the contents of all the pages.
Also, you can skip the while loop entirely by specifying the input file with -i:
wget -O outputFile -i urls.txt

How can I download and set the filenames using wget -i?

I wanted to know how to define the input file for wget in order to download several files and set a custom name for each of them, using wget -i filename.
An analogous example using -O:
wget -O customname url
-O filename works as intended only when you give it a single URL.
With multiple URLs, all downloaded content ends up concatenated in filename.
You can use a while loop instead:
while read -r url
do
wget "$url" -O "${url##*/}" # <-- use a custom name here; this takes the last component of the URL
done < urls.txt
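If you want full control over each name, one option is a sketch that reads URL/filename pairs from a hypothetical urls_and_names.txt (one URL and one filename per line, separated by whitespace):
while read -r url name
do
wget "$url" -O "$name"
done < urls_and_names.txt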
