Surf all pages of a web link with curl - linux

I use:
curl http://www.alibaba.com/corporations/Electrical_Plugs_%2526_Sockets/CID13--CN------------------50--OR------------BIZ1,BIZ2/30.html | iconv -f windows-1251 | grep -o -h 'data' >>out
to filter the data and save it to out, but the link has 67 pages. How can I walk through all the pages of that link and save them to out?
Thanks very much for any help!

You can use HTTrack to download an entire website and then use command-line tools to search for specific content locally:
http://www.nightbluefruit.com/blog/2010/03/copying-an-entire-website-with-httrack/
Alternatively, you could use the -r recursive switch in wget:
http://www.gnu.org/software/wget/manual/html_node/Recursive-Retrieval-Options.html
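A minimal sketch of the wget route (the depth, --no-parent and accept pattern are assumptions you would tune for this particular site):
wget -r -l 1 -np -A '*.html' "http://www.alibaba.com/corporations/Electrical_Plugs_%2526_Sockets/CID13--CN------------------50--OR------------BIZ1,BIZ2/"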

Try it with a for loop:
#!/usr/bin/env bash
url="http://www.alibaba.com/corporations/Electrical_Plugs_%2526_Sockets/CID13--CN------------------50--OR------------BIZ1,BIZ2"
for i in {1..67}
do
    # fetch each page, convert its encoding, and save it to its own file
    curl "$url/${i}.html" | iconv -f windows-1251 >> "out.$i"
done
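Note that curl can also expand a numeric range itself, so a single invocation can fetch every page and, with the original grep filter, append everything to one file as asked (this assumes the pages really do run from 1.html to 67.html):
curl "$url/[1-67].html" | iconv -f windows-1251 | grep -o -h 'data' >> out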

Related

failsafe wget html script? (when user uses ~\.wgetrc)

I just ran into an issue. It was about a code snippet from one of alienbob's scripts that he uses to check the most recent Adobe Flash version on adobe.com:
# Determine the latest version by checking the web page:
VERSION=${VERSION:-"$(wget -O - http://www.adobe.com/software/flash/about/ 2>/dev/null | sed -n "/Firefox - NPAPI/{N;p}" | tr -d ' '| tail -1 | tr '<>' ' ' | cut -f3 -d ' ')"}
echo "Latest version = "$VERSION
That code in itself usually works like a charm, but not for me. I use a custom ~/.wgetrc, because I ran into issues with some pages that wouldn't let wget make even a single download. Usually I make no mass downloads on any site, unless the site allows such things, or I set a reasonable pause in my wget script or one-liner.
Now, among other things, the ~/.wgetrc setup masks my wget as a Windows Firefox, and also includes this line:
header = Accept-Encoding: gzip,deflate
And that means that when I use wget to download an HTML file, it downloads that file as gzipped HTML.
Now I wonder, is there a trick to still make a script like the alienbob one work on such a user setup, or did the user mess up his own system with that setup and has to figure out by himself why the script malfunctions?
(In my case, I could just remove the header = Accept-Encoding line and everything works as it should, since when using wget one usually does not want HTML files to be gzipped.)
Use
wget --header=Accept-Encoding:identity -O - ....
as the command-line header option takes precedence over the .wgetrc option of the same name.
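Applied to the snippet above, that would look something like this (just the original pipeline with the extra header, a sketch rather than something re-tested against the current page):
VERSION=${VERSION:-"$(wget --header='Accept-Encoding: identity' -O - http://www.adobe.com/software/flash/about/ 2>/dev/null | sed -n "/Firefox - NPAPI/{N;p}" | tr -d ' ' | tail -1 | tr '<>' ' ' | cut -f3 -d ' ')"}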
Maybe the target page was redesigned; this is the wget part that works for me now:
wget --header=Accept-Encoding:identity -O - http://www.adobe.com/software/flash/about/ 2>/dev/null | fgrep -m 1 -A 2 "Firefox - NPAPI" | tail -1 | sed s/\</\>/g | cut -d\> -f3

Get Content-Disposition filename without downloading

I'm currently using
curl -w "%%{filename_effective}" "http://example.com/file.jpg" -L -J -O -s
to print the server-imposed filename from a download address, but it also downloads the file to the current directory. How can I get the filename only, without downloading the file?
Do a HEAD request in your batch file:
curl -w "%%{filename_effective}" "http://example.com/file.jpg" -I -X HEAD -L -J -O -s
This way, the requested file itself will not be downloaded, as per RFC2616:
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. (...) This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself.
Note that this will create two files nonetheless, albeit only with the header data curl has received. I understand that your main interest is not downloading (potentially many and/or big) files, so I guess that should be OK.
Edit: On the follow-up question:
Is there really no way to get only and only the content-disposition filename without downloading anything at all?
It doesn't seem possible with curl alone, I'm afraid. The only way to have curl not output a file is the -o NUL (Windows) routine, and if you do that, %{filename_effective} becomes NUL as well.
curl gives an error when trying to combine -J and -I, so the only solution that I found is to parse the header output with grep and sed:
curl "http://example.com/file.jpg" -LIs | grep ^Content-Disposition | sed -r 's/.*"(.*)".*/\1/'

Is there a way to automatically and programmatically download the latest IP ranges used by Microsoft Azure?

Microsoft has provided a URL where you can download the public IP ranges used by Microsoft Azure.
https://www.microsoft.com/en-us/download/confirmation.aspx?id=41653
The question is, is there a way to download the XML file from that site automatically using Python or other scripts? I am trying to schedule a task to grab the new IP range file and process it for my firewall policies.
Yes! There is!
But of course you need to do a bit more work than expected to achieve this.
I discovered this question when I realized Microsoft's page only works via JavaScript when loaded in a browser, which makes automation via curl practically unusable; I'm not manually downloading this file every time it updates. Instead I came up with this Bash "one-liner" that works well for me.
First, let’s define that base URL like this:
MICROSOFT_IP_RANGES_URL="https://www.microsoft.com/en-us/download/confirmation.aspx?id=41653"
Now just run this curl command, which parses the returned page HTML with a few chained greps, like this:
curl -Lfs "${MICROSOFT_IP_RANGES_URL}" | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep "download.microsoft.com/download/" | grep -m 1 -Eo '(http|https)://[^"]+'
The returned output is a nice and clean final URL like this:
https://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_20181224.xml
Which is exactly what one needs to automate the direct download of that XML file. If you then need to assign that value to a variable, just wrap the command in $() like this:
MICROSOFT_IP_RANGES_URL_FINAL=$(curl -Lfs "${MICROSOFT_IP_RANGES_URL}" | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep "download.microsoft.com/download/" | grep -m 1 -Eo '(http|https)://[^"]+')
And just access that URL via $MICROSOFT_IP_RANGES_URL_FINAL and there you go.
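From there, fetching the file itself is just one more curl call (the local filename here is an arbitrary choice, not something the page dictates):
curl -Lfs -o PublicIPs.xml "$MICROSOFT_IP_RANGES_URL_FINAL"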
If you can use wget, writing the following in a text editor and saving it as scriptNameYouWant.sh will give you a bash script that downloads the XML file you want.
#! /bin/bash
wget http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_21050223.xml
Of course, you can run the command directly from a terminal and get the exact same result.
If you want to use Python, one way is:
Again, write in a text editor and save as scriptNameYouWant.py the following:
# Python 2 (on Python 3, use urllib.request instead of urllib2)
import urllib2
# fetch the XML file and write it out locally
page = urllib2.urlopen("http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_21050223.xml").read()
f = open("xmlNameYouWant.xml", "w")
f.write(str(page))
f.close()
I'm not that good with python or bash though, so I guess there is a more elegant way in both python or bash.
EDIT
You can get the XML despite the dynamic URL by using curl. What worked for me is this:
#! /bin/bash
FIRSTLONGPART="http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_"
INITIALURL="http://www.microsoft.com/EN-US/DOWNLOAD/confirmation.aspx?id=41653"
OUT="$(curl -s $INITIALURL | grep -o -P '(?<='$FIRSTLONGPART').*(?=.xml")'|tail -1)"
wget -nv $FIRSTLONGPART$OUT".xml"
Again, I'm sure that there is a more elegant way of doing this.
Here's a one-liner to get the URL of the JSON file:
$ curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?\.json' | uniq
I enhanced the last response a bit and I'm able to download the JSON file:
#! /bin/bash
download_link=$(curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?\.json' | uniq | grep -v refresh)
if [ $? -eq 0 ]
then
    wget "$download_link" ; echo "Latest file downloaded"
else
    echo "Download failed"
fi
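If the goal is to feed the ranges into firewall rules, a JSON-aware tool such as jq can pull the address prefixes out of the downloaded file; the field names and the ServiceTags_Public filename pattern below are assumptions about the current schema, so check them against the file you actually receive:
jq -r '.values[].properties.addressPrefixes[]' ServiceTags_Public_*.json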
A slight mod to the ushuz script from 2020, as that was producing two different entries and uniq doesn't work for me on Windows:
$list = @(curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?.json')
$list.Get(2)
That outputs the valid path to the JSON, which you can put into another variable for grepping or whatever.

Unix lynx in shell script to input data into website and grep result

Pretty new to Unix shell scripting here, and I have some other examples to look at, but I'm still working almost from scratch. I'm trying to track deliveries for our company, and I have a script I want to run that will enter the tracking number into the website and then grep the result (delivered/not delivered) to a file. I can use the lynx command to get to the website at the command line and see the results, but in the script it just returns the web page and doesn't enter the tracking number.
Here's the code I have tried that works up to this point:
#$1 = 1034548607
FNAME=`date +%y%m%d%H%M%S`
echo requiredmcpartno=$1 | lynx -accept_all_cookies -nolist -dump -post_data http://apps.yrcregional.com/shipmentStatus/track.do 2>&1 | tee $FNAME >/home/jschroff/log.lg
DLV=`grep "PRO" $FNAME | cut --delimiter=: --fields=2 | awk '{print $DLV}'`
echo $1 $DLV > log.txt
rm $FNAME
I'm trying to get the results for the tracking number(PRO number as they call it) 1034548607.
Try doing this with curl:
trackNumber=1234
curl -A Mozilla/5.0 -b cookies -c cookies -kLd "proNumber=$trackNumber" http://apps.yrcregional.com/shipmentStatus/track.do
But verify the TOS to make sure you are authorized to scrape this website.
If you want to parse the output, give us a sample of the HTML output.
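In the meantime, a rough sketch of how the asker's logging could sit on top of that curl call; the "delivered" pattern is purely an assumption about what the results page contains:
status=$(curl -A Mozilla/5.0 -b cookies -c cookies -kLsd "proNumber=$trackNumber" http://apps.yrcregional.com/shipmentStatus/track.do | grep -io 'not delivered\|delivered' | head -1)
echo "$trackNumber ${status:-unknown}" >> log.txt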

How can I download and set the filenames using wget -i?

I wanted to know how to define the input file for wget, in order to download several files and set a custom name for each of them, using wget -i filename.
An analogous example using -O:
wget -O customname url
-O filename works only when you give it a single URL.
With multiple URLs, all downloaded content ends up in filename.
You can use a while ... loop:
cat urls.txt | while read -r url
do
    wget "$url" -O "${url##*/}" # <-- use custom name here
done
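If the names should be genuinely custom rather than just the URL's basename, one option is a two-column input file (URL, then desired filename) read in the same kind of loop; note that this urls.txt layout is an assumption of the sketch, not something wget -i itself understands:
# urls.txt: each line is "<url> <custom filename>"
while read -r url name
do
    wget -O "$name" "$url"
done < urls.txt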
