I'm using wget in the macOS Terminal to download images from a file where each image URL is on its own line, and that works perfectly with this command:
cut -f1 -d, images.txt | while read url; do wget ${url} -O $(basename ${url}); done
However, I want to specify the output filename instead of using the basename. The filename is in the next column, separated by either a space or a comma, and I can't quite figure out how to tell wget to use that 2nd column as the -O name.
I'm sure it's a simple change to the command above, but after reading dozens of posts here and on other sites I can't figure it out. Any help would be appreciated.
If you use whitespace as the separator it's very easy:
cat images.txt | while read -r url name; do wget "${url}" -O "${name}"; done
Explanation: instead of reading just one variable per line (${url}) as in your example, you read two (${url} and ${name}). The second one is your local filename. I assumed your images.txt file looks something like this:
http://cwsmgmt.corsair.com/newscripts/landing-pages/wallpaper/v3/Wallpaper-v3-2560x1440.jpg test.jpg
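Since your second column may be separated by either a space or a comma, one option (a rough sketch, assuming neither character appears inside the filenames themselves) is to normalize commas to spaces before the read:
tr ',' ' ' < images.txt | while read -r url name; do wget "${url}" -O "${name}"; done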
When I download a file using wget without the -O argument, it saves the file with a default name. I would like to append some text to that default name, but the -O option completely overrides the default name.
For example:
wget www.google.com --> saves as index.html (this is the default name)
wget -O foo.html www.google.com --> saves as foo.html
I would like to save the file as ${default_name}_sometext.html. For the above example, it would look something like index_sometext.html. Any ideas appreciated!
Edit: For those who are wondering why I might need something like this, I have scraped thousands of URLs for a personal project, and I need to preserve the default name along with some attributes for each file.
#!/bin/bash
domain=google.com
some_text=something

# wget reports its output file in a "Saving to: 'name'" line on stderr;
# grab that field and strip the quote characters around it.
default_name=$(wget "https://${domain}" 2>&1 | grep Saving | cut -d ' ' -f 3 | sed -e 's/[^A-Za-z0-9._-]//g')
# Prepend the extra text to the default name.
mv "$default_name" "${some_text}_${default_name}"
I think it will work for you.
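Since the question mentions thousands of scraped URLs, here is a rough sketch of the same idea wrapped in a loop; it assumes the URLs sit one per line in a hypothetical urls.txt and, like the script above, it relies on parsing wget's English "Saving to:" message:
#!/bin/bash
some_text=something

while IFS= read -r url; do
    # Pull the default filename out of wget's "Saving to: 'name'" line.
    default_name=$(wget "$url" 2>&1 | grep Saving | cut -d ' ' -f 3 | sed -e 's/[^A-Za-z0-9._-]//g')
    [ -n "$default_name" ] && mv "$default_name" "${some_text}_${default_name}"
done < urls.txt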
I've got a file that contains a list of file paths. I’m downloading them like this with wget:
wget -i cram_download_list.txt
However, the list is long and my session gets interrupted. I'd like to check which files already exist in the directory and only download the outstanding ones.
I've been trying to come up with an option involving comm, but can't work out how to loop it in with wget.
File contents look like this:
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239280/NA07037.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239286/NA11829.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239293/NA11918.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239298/NA11994.final.cram
I’m currently trying to do something like this:
ls *.cram | sed 's/^/ftp:\/\/ftp.sra.ebi.ac.uk\/vol1\/run\/ERR323\/ERR3239480\//' > downloaded.txt
comm -3 <(sort cram_download_list.txt) <(sort downloaded.txt) | tr -d " \t" > to_download.txt
wget -i to_download.txt
"I'd like to check which files already exist in the directory and only download the outstanding ones."
To get such behavior you might use the -nc (alias --no-clobber) flag. It skips any download whose target file already exists, instead of overwriting it. So in your case:
wget -nc -i cram_download_list.txt
Beware that this solution does not handle partially downloaded files.
wget -c -i <(find . -type f -name '*.cram' -printf '%f$\n' |
             grep -vf - cram_download_list.txt)
This finds files ending in .cram and prints each basename followed by a $ and a newline. That output is used as an inverted regex pattern list against your download list, i.e. it removes any line that ends in the name of a file you already have.
Added: -c for finishing incomplete files (i.e. resuming downloads).
Note: this does not handle spaces or newlines in file names well, but these are FTP URLs, so that should not be a problem in the first place.
If you also want to handle partially transferred files, you need to pass in the complete list of URLs so that wget can check each file's length. Which means that for this scenario the only way is:
wget -c -i cram_download_list.txt
Files that are already complete will just be checked and skipped.
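For completeness, the comm idea from the question can also be made to work by comparing basenames rather than full URLs. A sketch, assuming the downloaded .cram files sit in the current directory and every basename in the list is unique:
# Basenames already on disk vs. basenames we want.
ls *.cram 2>/dev/null | sort > have.txt
sed 's|.*/||' cram_download_list.txt | sort > want.txt

# Lines only in want.txt are still missing; map them back to full URLs.
comm -23 want.txt have.txt > missing.txt
grep -Ff missing.txt cram_download_list.txt > to_download.txt

wget -c -i to_download.txt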
I'm trying to write a script that will do a few things in the following order:
cURL each website from a list of URLs contained in a newline-delimited "url_list.txt" file.
For each website in the list, grep its content for keywords contained in a newline-delimited "keywords.txt" file.
I want to finish by printing to the terminal in the following format (or something similar):
$URL (that contained match) : $keyword (that made the match)
It needs to be able to run in Ubuntu (GNU grep, etc.)
It does not need to be cURL and grep; as long as the functionality is there.
So far I've got:
#!/bin/bash
keywords=$(cat ./keywords.txt)
urllist=$(cat ./url_list.txt)
for url in $urllist; do
    content="$(curl -L -s "$url" | grep -iF "$keywords" /dev/null)"
    echo "$content"
done
But for some reason, no matter what I try to tweak or change, it keeps failing to one degree or another.
How can I go about accomplishing this task?
Thanks
Here's how I would do it:
#!/bin/bash
keywords="$(<./keywords.txt)"
while IFS= read -r url; do
    curl -L -s "$url" | grep -ioF "$keywords" |
        while IFS= read -r keyword; do
            echo "$url: $keyword"
        done
done < ./url_list.txt
What did I change:
I used $(<./keywords.txt) to read the keywords.txt. This does not rely on an external program (cat in your original script).
I changed the for loop that loops over the url list into a while loop. This guarantees that we use Θ(1) memory (i.e. we don't have to load the entire url list into memory).
I removed /dev/null from grep. Grepping /dev/null alone is meaningless, since it will find nothing there. Instead, I invoke grep with no file arguments so that it filters its stdin (which happens to be the output of curl in this case).
I added the -o flag for grep so that it outputs only the matched keyword.
I removed the subshell where you were capturing the output of curl. Instead, I run the command directly and feed its output to a while loop. This is necessary because we might get more than one keyword match per URL.
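As a side note, grep can also read the patterns straight from the keyword file with -f, which drops the intermediate variable entirely; a minimal variant sketch:
#!/bin/bash
# Same loop as above, but grep pulls the fixed-string patterns
# directly from keywords.txt via -f.
while IFS= read -r url; do
    curl -L -s "$url" | grep -ioFf ./keywords.txt |
        while IFS= read -r keyword; do
            echo "$url: $keyword"
        done
done < ./url_list.txt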
I'd like to remove all bad TIFFs from a very large directory. The command-line tool "tiffinfo" makes it easy to identify them:
tiffinfo -D *
This will have an output like this:
00074000/74986.TIF: Cannot read TIFF header.
if the TIFF file is corrupt. If this happens, I'd like to move the file to a different directory: bad_images. I tried using awk on this, but it hasn't worked so far...
Thanks!
Assuming the "Cannot read TIFF header" error comes on standard error, and assuming tiffinfo outputs other data on standard out which you don't want, then:
cd /path/to/tiffs
for file in $(tiffinfo -D * 2>&1 >/dev/null | cut -f1 -d:)
do
    echo mv "$file" /path/to/bad_images
done
Remove the echo to actually move the files, once satisfied that the script will work as expected.
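If any of the file names contain spaces, the word splitting in that loop will break them apart; here is a sketch of a whitespace-safer variant that pipes the same output into a while read loop (it still assumes no colons or newlines in the names):
cd /path/to/tiffs
tiffinfo -D * 2>&1 >/dev/null | cut -f1 -d: | sort -u |
while IFS= read -r file; do
    echo mv "$file" /path/to/bad_images   # remove the echo to really move them
done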
I wanted to know how to define the input file for wget so that it downloads several files and sets a custom name for each of them, using wget -i filename.
An analogous example using -O:
wget -O customname url
-O filename works only when you give it a single URL.
With multiple URLs, all downloaded content ends up in filename.
You can use a while loop:
while read -r url
do
    wget "$url" -O "${url##*/}"   # <-- ${url##*/} is just the basename; put your custom name here
done < urls.txt
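If the custom names live next to the URLs in the same file (an assumed layout: URL, whitespace, desired filename, one pair per line), the loop can simply read both fields:
# urls.txt is assumed to look like:
#   http://example.com/image.jpg  my-custom-name.jpg
while read -r url name
do
    wget "$url" -O "$name"
done < urls.txt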