Getting the size of a remote file in bytes? (without Content-Length) - linux

I'd like to get the byte size of a remote file as simply as possible.
The issue is that many servers don't send the Content-Length header in their responses these days.
curl -I, wget --spider and wget --server-response all give me detailed headers, but nothing about the size of the content.
curl -Is https://wordpress.com | grep content
only returns this:
content-type: text/html; charset=utf-8
So I guess I could work around it like so:
curl -s https://wordpress.com/ > /tmp/foo; du -b /tmp/foo | awk '{print $1}'
And it does work, but it seems a little ridiculous to download the file and write it to disk just to measure it.
I imagine there's a better way, where something like curl can report the file size directly from memory, or where I can simply measure the length of the output in bytes.

The Content-Length header is not guaranteed in every response, so the best you can do may be to use Content-Length when it is available and, if not, have curl download the content and report its size:
curl -sL 'https://wordpress.com' --write-out '%{size_download}\n' --output /dev/null
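If you want to prefer the Content-Length header whenever the server does send it, a minimal sketch of that fallback logic (the header parsing below is an assumption about typical responses) could be:
#!/bin/bash
# Print the byte size of a remote URL: use Content-Length if the server sends
# it, otherwise let curl download the body and report how many bytes it received.
url="https://wordpress.com"
size=$(curl -sIL "$url" | grep -i '^content-length:' | tail -1 | tr -d '\r' | awk '{print $2}')
if [ -z "$size" ]; then
    size=$(curl -sL "$url" --write-out '%{size_download}' --output /dev/null)
fi
echo "$size"
Note that in the fallback case curl still transfers the whole body; it just never touches the disk, which avoids the /tmp workaround from the question.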

Related

Linux print file and use it in for example a curl request

This might be a complete noob question, but I would like to know a simple one-liner to solve my problem.
Imagine the following: I have a few files; they could be .txt, .sql, anything.
root#DESKTOP:/$ ls
test.sql test1.sql test2.sql test3.sql
Imagine I would like to use for example test.sql in a curl request as part of a json parameter.
root#DESKTOP:/$ cat test.sql |
curl -H "Accept: application/json" \
     -H "Content-type: application/json" \
     -X POST -d "{ \"sqlcommand\" : \"??????\" }" \
     http://localhost:9999/api/testsql
How can I put the output of the cat command into this request in place of the question marks? $0, $1, etc. are not sufficient. I know how to do it with a for loop, and I could write a shell script that takes an input parameter, but I would like to do it in a simple one-liner. More generally, I'd like to learn how to use the output of a previous command combined with other data, or, if that is bad practice, what the best practice is.
Thank you in advance!
This should do what you need:
curl -X POST -d "{ \"sqlcommand\" :\"$(cat test.sql)\" }"
$(cmd) substitutes the result of cmd as a string
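Putting that substitution back into the full request from the question: note that a raw $(cat test.sql) will break if the file contains double quotes or newlines, so a hedged alternative sketch lets jq build the JSON body instead (this assumes jq is installed; the endpoint is the one from the question):
curl -H "Accept: application/json" \
     -H "Content-type: application/json" \
     -X POST \
     -d "$(jq -Rs '{sqlcommand: .}' test.sql)" \
     http://localhost:9999/api/testsql
Here jq -Rs reads the whole file as a single raw string and emits it as a properly escaped JSON object.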

Failsafe wget HTML script? (when the user uses a custom ~/.wgetrc)

I just ran into an issue with a code snippet from one of alienbob's scripts, which he uses to check the most recent Adobe Flash version on adobe.com:
# Determine the latest version by checking the web page:
VERSION=${VERSION:-"$(wget -O - http://www.adobe.com/software/flash/about/ 2>/dev/null | sed -n "/Firefox - NPAPI/{N;p}" | tr -d ' '| tail -1 | tr '<>' ' ' | cut -f3 -d ' ')"}
echo "Latest version = "$VERSION
That code in itself usually works like a charm, but not for me. I use a custom ~/.wgetrc, because I ran into issues with some pages that refuse wget even a single download. I usually don't do mass downloads on any site unless the site allows such things, or I set a reasonable pause in my wget script or one-liner.
Now, among other things, my ~/.wgetrc setup masks wget as Firefox on Windows, and also includes this line:
header = Accept-Encoding: gzip,deflate
And that means that when I use wget to download an HTML file, it downloads that file as gzipped HTML.
Now I wonder: is there a trick to make a script like alienbob's still work with such a user setup, or did the user mess up his own system with that setup and have to figure out by himself why the script malfunctions?
(In my case, I could just remove the header = Accept-Encoding line and everything works as it should, since when using wget one usually does not want HTML files to be gzipped.)
Use
wget --header=Accept-Encoding:identity -O - ....
as the --header option on the command line takes precedence over the .wgetrc option of the same name.
Maybe the target page was redesigned; this is the wget part that works for me now:
wget --header=Accept-Encoding:identity -O - http://www.adobe.com/software/flash/about/ 2>/dev/null | fgrep -m 1 -A 2 "Firefox - NPAPI" | tail -1 | sed s/\</\>/g | cut -d\> -f3
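Folding that header override back into alienbob's original snippet, a hedged sketch of the updated check (assuming the page still contains the "Firefox - NPAPI" marker) could look like this:
# Determine the latest version, forcing an uncompressed response so that a
# gzip Accept-Encoding header in ~/.wgetrc cannot break the text processing:
VERSION=${VERSION:-"$(wget --header=Accept-Encoding:identity -O - http://www.adobe.com/software/flash/about/ 2>/dev/null | fgrep -m 1 -A 2 "Firefox - NPAPI" | tail -1 | sed s/\</\>/g | cut -d\> -f3)"}
echo "Latest version = $VERSION"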

Get Content-Disposition filename without downloading

I'm currently using
curl -w "%%{filename_effective}" "http://example.com/file.jpg" -L -J -O -s
to print the server-imposed filename from a download address, but it also downloads the file to the current directory. How can I get the filename only, without downloading the file?
Do a HEAD request in your batch file:
curl -w "%%{filename_effective}" "http://example.com/file.jpg" -I -X HEAD -L -J -O -s
This way, the requested file itself will not be downloaded, as per RFC2616:
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. (...) This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself.
Note that this will create two files nonetheless, albeit only with the header data curl has received. I understand that your main interest is not downloading (potentially many and/or big) files, so I guess that should be OK.
Edit: On the follow-up question:
Is there really no way to get only and only the content-disposition filename without downloading anything at all?
It doesn't seem possible with curl alone, I'm afraid. The only way to have curl not output a file is the -o NUL (Windows) routine, and if you do that, %{filename_effective} becomes NUL as well.
curl gives an error when trying to combine -J and -I, so the only solution that I found is to parse the header output with grep and sed:
curl "http://example.com/file.jpg" -LIs | grep ^Content-Disposition | sed -r 's/.*"(.*)".*/\1/'

Is there a way to automatically and programmatically download the latest IP ranges used by Microsoft Azure?

Microsoft has provided a URL where you can download the public IP ranges used by Microsoft Azure.
https://www.microsoft.com/en-us/download/confirmation.aspx?id=41653
The question is, is there a way to download the XML file from that site automatically using Python or other scripts? I am trying to schedule a task to grab the new IP range file and process it for my firewall policies.
Yes! There is!
But of course you need to do a bit more work than expected to achieve this.
I discovered this question when I realized Microsoft's page only works via JavaScript when loaded in a browser, which makes automation via curl practically unusable; I'm not going to download this file manually every time it is updated. Instead, I came up with this Bash "one-liner" that works well for me.
First, let’s define that base URL like this:
MICROSOFT_IP_RANGES_URL="https://www.microsoft.com/en-us/download/confirmation.aspx?id=41653"
Now run this curl command, which parses the returned web page HTML with a few chained greps, like this:
curl -Lfs "${MICROSOFT_IP_RANGES_URL}" | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep "download.microsoft.com/download/" | grep -m 1 -Eo '(http|https)://[^"]+'
The returned output is a nice and clean final URL like this:
https://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_20181224.xml
This is exactly what one needs to automate the direct download of that XML file. If you need to assign that value to a variable, just wrap the command in $() like this:
MICROSOFT_IP_RANGES_URL_FINAL=$(curl -Lfs "${MICROSOFT_IP_RANGES_URL}" | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep "download.microsoft.com/download/" | grep -m 1 -Eo '(http|https)://[^"]+')
And just access that URL via $MICROSOFT_IP_RANGES_URL_FINAL and there you go.
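From there, a minimal sketch of actually fetching the file with the resolved URL (the local filename is an arbitrary choice):
curl -Lfs "$MICROSOFT_IP_RANGES_URL_FINAL" -o azure-public-ips.xml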
If you can use wget, writing the following in a text editor and saving it as scriptNameYouWant.sh will give you a bash script that downloads the XML file you want.
#! /bin/bash
wget http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_21050223.xml
Of course, you can run the command directly from a terminal and get the same result.
If you want to use Python, one way is:
Again, write the following in a text editor and save it as scriptNameYouWant.py (this is Python 2; urllib2 no longer exists in Python 3):
import urllib2
page = urllib2.urlopen("http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_21050223.xml").read()
f = open("xmlNameYouWant.xml", "w")
f.write(str(page))
f.close()
I'm not that good with Python or bash, though, so I guess there is a more elegant way in both.
EDIT
You can get the XML despite the dynamic URL by using curl. What worked for me is this:
#! /bin/bash
FIRSTLONGPART="http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_"
INITIALURL="http://www.microsoft.com/EN-US/DOWNLOAD/confirmation.aspx?id=41653"
OUT="$(curl -s $INITIALURL | grep -o -P '(?<='$FIRSTLONGPART').*(?=.xml")'|tail -1)"
wget -nv $FIRSTLONGPART$OUT".xml"
Again, I'm sure that there is a more elegant way of doing this.
Here's a one-liner to get the URL of the JSON file:
$ curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?\.json' | uniq
I enhanced the previous answer a bit so that it downloads the JSON file:
#! /bin/bash
download_link=$(curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?\.json' | uniq | grep -v refresh)
if [ $? -eq 0 ]
then
wget "$download_link" && echo "Latest file downloaded"
else
echo "Download failed"
fi
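Since the original question is about scheduling the download, a hedged example of running a script like the one above once a day via cron (the script path and log path are assumptions):
0 3 * * * /usr/local/bin/fetch-azure-ip-ranges.sh >> /var/log/azure-ip-ranges.log 2>&1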
A slight mod to ushuz's script from 2020, as that was producing two different entries and uniq doesn't work for me on Windows:
$list=@(curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?.json')
$list.Get(2)
That outputs the valid path to the JSON, which you can put into another variable for grepping or whatever.

Surf all pages of a web link with curl

I use:
curl http://www.alibaba.com/corporations/Electrical_Plugs_%2526_Sockets/CID13--CN------------------50--OR------------BIZ1,BIZ2/30.html | iconv -f windows-1251 | grep -o -h 'data' >>out
to filter data and save it to out, but the link has 67 pages. How can I go through all the pages of that link and save the results to out?
Thanks for any help!
You can use HTTrack to download an entire website and then use command-line tools to search for specific content locally:
http://www.nightbluefruit.com/blog/2010/03/copying-an-entire-website-with-httrack/
Alternatively, you could use the -r recursive switch in wget
http://www.gnu.org/software/wget/manual/html_node/Recursive-Retrieval-Options.html
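A hedged example of that recursive approach, where the depth limit and the HTML-only filter are assumptions about what you actually need:
wget -r -l 1 -np -A '*.html' 'http://www.alibaba.com/corporations/Electrical_Plugs_%2526_Sockets/CID13--CN------------------50--OR------------BIZ1,BIZ2/'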
Try it with a for loop:
#!/usr/bin/env bash
url="http://www.alibaba.com/corporations/Electrical_Plugs_%2526_Sockets/CID13--CN------------------50--OR------------BIZ1,BIZ2"
for i in {1..67}
do
curl $url/${i}.html | iconv -f windows-1251 >> out.$i
done
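If you would rather keep the original grep filter and collect everything into a single out file, a hedged variant of the same loop could be:
#!/usr/bin/env bash
url="http://www.alibaba.com/corporations/Electrical_Plugs_%2526_Sockets/CID13--CN------------------50--OR------------BIZ1,BIZ2"
for i in {1..67}
do
    # Fetch each page, convert it from windows-1251, keep only the matching
    # lines, and append them all to a single file.
    curl -s "$url/${i}.html" | iconv -f windows-1251 | grep -o -h 'data' >> out
done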
