Get Content-Disposition filename without downloading - linux

I'm currently using
curl -w "%%{filename_effective}" "http://example.com/file.jpg" -L -J -O -s
to print the server-imposed filename from a download address but it downloads the file to the directory. How can I get the filename only without downloading the file?

Do a HEAD request in your script:
curl -w "%{filename_effective}" "http://example.com/file.jpg" -I -L -J -O -s
This way, the requested file itself will not be downloaded, as per RFC2616:
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. (...) This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself.
Note that this will create two files nonetheless, albeit only with the header data curl has received. I understand that your main interest is not downloading (potentially many and/or big) files, so I guess that should be OK.
Edit: On the follow-up question:
Is there really no way to get only and only the content-disposition filename without downloading anything at all?
It doesn't seem possible with curl alone, I'm afraid. The only way to have curl not output a file is the -o /dev/null (or -o NUL on Windows) routine, and if you do that, %{filename_effective} becomes /dev/null as well.

curl gives an error when trying to combine -J and -I, so the only solution I found is to parse the header output with grep and sed:
curl "http://example.com/file.jpg" -LIs | grep -i ^content-disposition | sed -r 's/.*"(.*)".*/\1/'

Related

Get wget to download only new items from a list

I've got a file that contains a list of file paths. I’m downloading them like this with wget:
wget -i cram_download_list.txt
However the list is long and my session gets interrupted. I’d like to look at the directory for which files already exist, and only download the outstanding ones.
I’ve been trying to come up with an option involving comm, but can’t work out how to loop it in with wget.
File contents look like this:
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239280/NA07037.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239286/NA11829.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239293/NA11918.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239298/NA11994.final.cram
I’m currently trying to do something like this:
ls *.cram | sed 's/^/ftp:\/\/ftp.sra.ebi.ac.uk\/vol1\/run\/ERR323\/ERR3239480\//' > downloaded.txt
comm -3 <(sort cram_download_list.txt) <(sort downloaded.txt) | tr -d " \t" > to_download.txt
wget -i to_download.txt
I’d like to look at the directory for which files already exist, and only download the outstanding ones.
To get that behavior you can use the -nc (--no-clobber) flag. It skips downloads that would overwrite existing files. So in your case:
wget -nc -i cram_download_list.txt
Beware that this solution does not handle partially downloaded files.
wget -c -i <(find -type f -name '*.cram' -printf '%f$\n' |
             grep -vf - cram_download_list.txt)
This finds files ending in .cram and prints each basename followed by a $ and a newline. That output is used as an inverted regex match list against your download list, i.e. it removes any line ending in one of the existing file names from your download list.
Added:
-c for finalizing incomplete files (i.e. resume download)
Note: does not handle spaces or newlines in file names well, but these are ftp-URLs so that should not be a problem in the first place.
If you also want to handle partially transferred files, you need to pass in the complete set of file names so that wget can check their lengths. That means for this scenario the only way is:
wget -c -i cram_download_list.txt
Files that are already complete will only be checked and then skipped.
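If you prefer the comm-based idea from the question, it can be made to work by comparing on basenames instead of a hard-coded directory prefix. A sketch, assuming GNU coreutils and the file names from the question; the intermediate file names are my own:

# Basenames listed in cram_download_list.txt that are not present locally.
comm -23 <(sed 's|.*/||' cram_download_list.txt | sort) <(ls *.cram 2>/dev/null | sort) > missing_names.txt
# Keep only the URLs whose basenames are still missing, then download them.
grep -Ff missing_names.txt cram_download_list.txt > to_download.txt
wget -c -i to_download.txt

Note that, like -nc, this treats a partially downloaded file as done; if resuming partial files matters, fall back to wget -c -i cram_download_list.txt as above.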

Getting the size of a remote file in bytes? (without content-size)

I'd like to get the byte size of a remote file as simply as possible.
The issue is that many servers don't send a Content-Length header in their responses these days.
curl -I, wget --spider and wget --server-response all give me detailed headers, but nothing about the size of the content.
curl -Is https://wordpress.com | grep content
only returns this:
content-type: text/html; charset=utf-8
So I guess I could work around it like so:
curl -s https://wordpress.com/ > /tmp/foo; du /tmp/foo | awk '{print $1}'
And it does work. But I think it's a little ridiculous that I'm downloading the file itself and writing to my machine.
I imagine there's a better way, where something like curl can get the file size directly from memory, or just get the length of the output to bash in bytes.
The Content-Length header is not guaranteed in every response. So the best you can do might be to use Content-Length when it is available and, if not, have curl download the content and report its size:
curl -sL 'https://wordpress.com' --write-out '%{size_download}\n' --output /dev/null
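If you want a single snippet that prefers the header and only downloads as a last resort, something like this might do (a sketch; the URL is just the one from the question, and the header match is case-insensitive to cover HTTP/2 responses):

url='https://wordpress.com'
# Try Content-Length from the (redirect-following) HEAD request first.
size=$(curl -sIL "$url" | tr -d '\r' | grep -i '^content-length:' | tail -n 1 | awk '{print $2}')
if [ -z "$size" ]; then
    # No usable header: download to /dev/null and let curl report the size.
    size=$(curl -sL "$url" --write-out '%{size_download}' --output /dev/null)
fi
echo "$size"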

Using wget to download images and saving with specified filename

I'm using wget in the macOS terminal to download images from a file where each image URL is on its own line, and that works perfectly with this command:
cut -f1 -d, images.txt | while read url; do wget ${url} -O $(basename ${url}); done
However, I want to specify the output filename it's saved as instead of using the basename. The file name is specified in the next column, separated by either a space or a comma, and I can't quite figure out how to tell wget to use the 2nd column as the -O name.
I'm sure it's a simple change to my above command but after reading dozens of different posts on here and other sites I can't figure it out. Any help would be appreciated.
If you use whitespace as the separator it's very easy:
cat images.txt | while read -r url name; do wget "${url}" -O "${name}"; done
Explanation: instead of reading just one variable per line (${url}) as in your example, you read two (${url} and ${name}). The second one is your local filename. I assumed your images.txt file looks something like this:
http://cwsmgmt.corsair.com/newscripts/landing-pages/wallpaper/v3/Wallpaper-v3-2560x1440.jpg test.jpg
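If some of your lines use a comma instead of whitespace (the question mentions both), one way (my own sketch) is to normalise the separator first:

# Turn commas into spaces so a single read can split URL and name,
# then download each URL under its chosen name.
tr ',' ' ' < images.txt | while read -r url name; do
    wget "$url" -O "$name"
done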

Using Bash to cURL a website and grep for keywords

I'm trying to write a script that will do a few things in the following order:
cURL websites from a list of URLs contained within a "url_list.txt" (newline-delimited) file.
For each website in the list, I want to grep that website's content for keywords contained within a "keywords.txt" (newline-delimited) file.
I want to finish by printing to the terminal in the following format (or something similar):
$URL (that contained match) : $keyword (that made the match)
It needs to be able to run in Ubuntu (GNU grep, etc.)
It does not need to use cURL and grep, as long as the functionality is there.
So far I've got:
#!/bin/bash
keywords=$(cat ./keywords.txt)
urllist=$(cat ./url_list.txt)
for url in $urllist; do
    content="$(curl -L -s "$url" | grep -iF "$keywords" /dev/null)"
    echo "$content"
done
But for some reason, no matter what I try to tweak or change, it keeps failing to one degree or another.
How can I go about accomplishing this task?
Thanks
Here's how I would do it:
#!/bin/bash
keywords="$(<./keywords.txt)"
while IFS= read -r url; do
    curl -L -s "$url" | grep -ioF "$keywords" |
        while IFS= read -r keyword; do
            echo "$url: $keyword"
        done
done < ./url_list.txt
What did I change:
I used $(<./keywords.txt) to read the keywords.txt. This does not rely on an external program (cat in your original script).
I changed the for loop that loops over the url list into a while loop. This guarantees that we use Θ(1) memory (i.e. we don't have to load the entire url list into memory).
I removed /dev/null from grep. grep'ing from /dev/null alone is meaningless, since it will find nothing there. Instead, I invoke grep with no file arguments so that it filters its stdin (which happens to be the output of curl in this case).
I added the -o flag for grep so that it outputs only the matched keyword.
I removed the subshell where you were capturing the output of curl. Instead, I run the command directly and feed its output to a while loop. This is necessary because we might get more than one keyword match per URL.
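To try it out, here is a hypothetical pair of input files (the names match the script, the contents are made up) and the kind of output to expect:

# Hypothetical fixtures; adjust to your own lists.
printf '%s\n' 'wordpress' 'privacy' > keywords.txt
printf '%s\n' 'https://wordpress.com' > url_list.txt
# Running the script above would then print lines such as:
#   https://wordpress.com: WordPress
# i.e. one line per keyword that grep found on that page.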

How can I download and set the filenames using wget -i?

I want to know how to define the input file for wget so that it downloads several files and sets a custom name for each of them, using wget -i filename.
Analog example using -O
wget -O customname url
-O filename works only when you give it a single URL.
With multiple URLs, all downloaded content ends up in filename.
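For example (the URLs here are just placeholders), this writes both responses, concatenated, into the single file out.html:

wget -O out.html "http://example.com/a.jpg" "http://example.com/b.jpg"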
You can use a while...loop:
cat urls.txt | while read -r url
do
    wget "$url" -O "${url##*/}" # <-- use custom name here
done
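If you would rather keep the custom names in the input file itself, one option (an assumption about the file format, since wget -i has no such feature) is to list "URL name" pairs and read them in pairs:

# urls.txt is assumed to hold lines of the form: URL NAME
while read -r url name; do
    wget "$url" -O "$name"
done < urls.txt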
