I have a text file containing a bunch of webpages:
http://rest.kegg.jp/link/pathway/7603
http://rest.kegg.jp/link/pathway/5620
…
My aim is to download all info on these pages to a single text file.
The following works, but it gives me 3000+ text files. How could I simply merge all the output into one file during the loop?
while read i; do wget $i; done < urls.txt
Thanks a lot
Use the -O file option, which writes all downloaded documents to the single file specified. Note that each separate wget invocation truncates that file, so inside a loop you should append via stdout instead:
while read -r i; do wget -qO - "$i" >> outputFile; done < urls.txt
outputFile will then contain the contents of all the pages.
You can also skip the while loop entirely by giving wget the input file with -i file; in a single invocation, -O concatenates everything into one file:
wget -O outputFile -i urls.txt
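If you want to tell the pages apart in the merged file, a small variation on the loop (a sketch; merged.txt is an assumed output name) prints each URL before its page contents:
while read -r i; do
    printf '=== %s ===\n' "$i" >> merged.txt   # header line marking the source URL
    wget -qO - "$i" >> merged.txt              # -O - sends the page to stdout, appended to the file
done < urls.txt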
I've got a file that contains a list of file paths. I’m downloading them like this with wget:
wget -i cram_download_list.txt
However the list is long and my session gets interrupted. I’d like to look at the directory for which files already exist, and only download the outstanding ones.
I’ve been trying to come up with an option involving comm, but can’t work out how to loop it in with wget.
File contents look like this:
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239280/NA07037.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239286/NA11829.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239293/NA11918.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239298/NA11994.final.cram
I’m currently trying to do something like this:
ls *.cram | sed 's/^/ftp:\/\/ftp.sra.ebi.ac.uk\/vol1\/run\/ERR323\/ERR3239480\//' > downloaded.txt
comm -3 <(sort cram_download_list.txt) <(sort downloaded.txt) | tr -d " \t" > to_download.txt
wget -i to_download.txt
I’d like to look at the directory for which files already exist, and only download the outstanding ones.
To get that behavior you can use the -nc (--no-clobber) flag, which skips downloads that would overwrite existing files. So in your case:
wget -nc -i cram_download_list.txt
Beware that this solution does not handle partially downloaded files.
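As a quick illustration (a toy run; it assumes NA07037.final.cram from the list has already been downloaded into the current directory):
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239280/NA07037.final.cram
# wget reports something like: File 'NA07037.final.cram' already there; not retrieving.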
wget -c -i <(find -type f -name '*.cram' -printf '%f$\n' |\
grep -vf - cram_download_list.txt )
This finds files ending in .cram and prints each name followed by a $ and a newline. That output is used as an inverted regex match list against your download list, i.e. it removes any lines ending in the names of files you already have.
Added: -c to finish incomplete files (i.e. resume downloads).
Note: this does not handle spaces or newlines in file names well, but these are FTP URLs, so that should not be a problem in the first place.
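To see what the filter does, here is a toy run (hypothetical state: only NA07037.final.cram has been downloaded so far):
find -type f -name '*.cram' -printf '%f$\n'
# NA07037.final.cram$
find -type f -name '*.cram' -printf '%f$\n' | grep -vf - cram_download_list.txt
# ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239286/NA11829.final.cram
# ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239293/NA11918.final.cram
# ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239298/NA11994.final.cram
The $ anchors each pattern to the end of a line, so only URLs ending in an already-downloaded file name are dropped from the list.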
If you also want to handle partially transferred files, you need to pass wget the complete set of file names so that it can check each file's length. That means for this scenario the only way is:
wget -c -i cram_download_list.txt
Files that are already complete will only be checked and skipped; partially downloaded ones are resumed (this relies on the server reporting file sizes).
I have a directory containing gzipped data files. I want to run each file through the script est_abundance.py, but first I need to unzip them. So I have this bash script:
for file in /home/doy.user/scratch1/Secoutput/; do
    cd "$file"
    gunzip *kren.gz
    python analysis1.py -i /Secoutput/*kren -k gkd_output -o /bracken_output/$(basename *kren).txt
    wait
done
The problem is that the script keeps unzipping all of the data files; it does not continue to the next command after unzipping one file.
Can you help me correct this? I just want every command to be run for every file.
You need a * in the glob so the loop iterates over the files rather than over the directory itself, and the loop body should use the $file variable; you can get the name of each file after unzipping by stripping the .gz suffix with ${file%.gz}:
for file in /home/doy.user/scratch1/Secoutput/*kren.gz; do
    gunzip "$file"
    python analysis1.py -i "${file%.gz}" -k gkd_output -o /bracken_output/"$(basename "${file%.gz}")".txt
done
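For reference, ${file%.gz} is ordinary parameter expansion: it strips the shortest match of .gz from the end of the value. A quick sketch with a hypothetical file name:
file=/home/doy.user/scratch1/Secoutput/sample1.kren.gz
echo "${file%.gz}"       # /home/doy.user/scratch1/Secoutput/sample1.kren
basename "${file%.gz}"   # sample1.kren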
I have several .vcf.gz files:
subset_file1.vcf.vcf.gz
subset_file2.vcf.vcf.gz
subset_file3.vcf.vcf.gz
I want to gunzip these files and rename them (removing subset_ and the redundant .vcf extension) in one go, to get these files:
file1.vcf
file2.vcf
file3.vcf
This is the script I have tried:
iFILES=/file/path/*.gz
for i in $iFILES;
do gunzip -k $i > /get/in/this/dir/"${i##*/}"
done
Since you have to do three operations on your output path name:
1. remove the directory part
2. remove the prefix subset_
3. remove the redundant extension .vcf
it's hard to accomplish with only one command.
The following is a modified version. Be CAREFUL when trying it; I didn't test it thoroughly on my machine.
for i in /file/path/*.gz; do
    # build the output name: keep what follows the last _ plus a single .vcf
    o=$(echo "${i##*/}" | sed 's/.*_\(.*\)\(\.[a-z]\{3\}\)\{2\}.*/\1\2/g')
    # -c decompresses to stdout (keeping the .gz), so the redirect receives the data
    gunzip -c "$i" > /get/in/this/dir/"$o"
done
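If you would rather avoid sed, here is a sketch using only parameter expansions (same assumed paths; subset_file1.vcf.vcf.gz stands in for one of your files), doing the three steps explicitly:
for i in /file/path/*.gz; do
    b=${i##*/}        # 1. strip the directory part:      subset_file1.vcf.vcf.gz
    b=${b#subset_}    # 2. strip the prefix:              file1.vcf.vcf.gz
    b=${b%.vcf.gz}    # 3. strip the redundant extension: file1.vcf
    gunzip -c "$i" > /get/in/this/dir/"$b"
done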
So I have a list of about 1000 URLs in a txt file, one per line. I wish to save the contents of every page to a file. How can I automate this?
You can use wget with the -i option to let it download a list of URLs. Assuming your URLs are stored in a file called urls.txt:
wget -i urls.txt
The problem here might be that the filenames can be the same for multiple websites (e.g. index.html), in which case wget appends a number, making it hard or impossible to connect a file to the original URL just by looking at the filename.
The solution to that would be to use a loop like this:
while read -r line
do
    wget "$line" -O <...>
done < urls.txt
You can specify a custom filename with the -O option.
Or you can build the file name from the URL you are processing.
while read -r line
do
    fname=$(echo "$line" | sed -e 's~http[s]*://~~g' -e 's~[^A-Za-z0-9]~-~g')
    fname=${fname}.html
    wget "$line" -O "$fname"
done < urls.txt
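For example (with example.com as a stand-in URL):
line='https://example.com/page?id=1'
echo "$line" | sed -e 's~http[s]*://~~g' -e 's~[^A-Za-z0-9]~-~g'
# example-com-page-id-1   -> saved as example-com-page-id-1.html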
I wanted to know how to define the input file for wget, in order to download several files and set a custom name for them, using wget -i filename
An analogous example using -O:
wget -O customname url
-O filename works only when you give it a single URL.
With multiple URLs, all downloaded content ends up in filename.
You can use a while...loop:
while read -r url
do
    wget "$url" -O "${url##*/}"    # <-- use custom name here
done < urls.txt
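Here ${url##*/} strips everything up to the last slash, so each file is named after the last path segment of its URL. For instance, with the KEGG list from the first question:
url=http://rest.kegg.jp/link/pathway/7603
echo "${url##*/}"   # 7603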