Unix lynx in shell script to input data into website and grep result - linux

Pretty new to Unix shell scripting here; I have some other examples to look at but am still working almost from scratch. I'm trying to track deliveries for our company, and I want a script that will input the tracking number into the website and then grep the result (delivered/not delivered) into a file. I can use the lynx command to get to the website from the command line and see the results, but in the script it just returns the web page and doesn't enter the tracking number.
Here's the code I have tried that works up to this point:
# $1 = 1034548607 (the PRO/tracking number passed as the first argument)
FNAME=`date +%y%m%d%H%M%S`
# POST the tracking number and dump the resulting page into a temp file
echo requiredmcpartno=$1 | lynx -accept_all_cookies -nolist -dump -post_data http://apps.yrcregional.com/shipmentStatus/track.do 2>&1 | tee $FNAME >/home/jschroff/log.lg
# Pull the delivery status out of the captured page and log it
DLV=`grep "PRO" $FNAME | cut --delimiter=: --fields=2 | awk '{print $DLV}'`
echo $1 $DLV > log.txt
rm $FNAME
I'm trying to get the results for the tracking number (PRO number, as they call it) 1034548607.

Try doing this with curl:
trackNumber=1234
curl -A Mozilla/5.0 -b cookies -c cookies -kLd "proNumber=$trackNumber" http://apps.yrcregional.com/shipmentStatus/track.do
But check the site's terms of service (TOS) to make sure you are authorized to scrape it.
If you want to parse the output, give us a sample of the HTML output.
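For the delivered/not-delivered log from the original question, a minimal sketch along these lines may be enough once the form field and page wording are confirmed; the proNumber field name and the status keywords below are assumptions, not taken from the real page:
#!/bin/bash
# Sketch only: adjust the field name and keywords after inspecting the actual HTML.
trackNumber="$1"
status=$(curl -s -A "Mozilla/5.0" -b cookies -c cookies -kLd "proNumber=$trackNumber" \
    http://apps.yrcregional.com/shipmentStatus/track.do | grep -Eio 'delivered|in transit' | head -1)
echo "$trackNumber ${status:-unknown}" >> log.txt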

How to download URLs from the website and save them in a file (wget, curl)?

How to use WGET to separate the marked links from this page?
Can this be done with CURL?
I want to download URLs from this page and save them in a file.
I tried this:
wget -r -p -k https://polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2984/585ddf5a3dde69cb58c7f42ba52790a4
Link Gopher separated the addresses.
EDIT:
How can I download the addresses to a file from the terminal?
Can it be done with the help of WGET?
Can it be done with the help of CURL?
I want to download the addresses from this page and save them to a file.
I want to save these links:
`
https://polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2984/585ddf5a3dde69cb58c7f42ba52790a4
https://polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2985/e15e664718ef6c0dba471d59c4a1928a
https://polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2986/58edb8e0f06dc3da40c255e50b3839cf
`
Edit 1.
You will need to use something like
Download Serialized DOM
I added that to my Firefox browser and it works, although it is a bit slow; the only way you know it has completed is when the *.html.part file disappears, leaving the corresponding *.html file that you save using the add-on button.
Basically, that will save the complete web page (excluding binaries, i.e. images, videos, etc.) as a single text file.
Also, the developer indicates there is a bug that affects saving these files, and you MUST allow "Use in private mode" to circumvent it.
(A screenshot showing a fragment of the full season 44 index page, with the URL visible in the address bar, accompanied the original answer but is not reproduced here.)
Since I don't have your access I can't reproduce this fully: the service hides the individual video page (what you get when you click on a thumbnail) from me because I have no login access, and serves the index page instead of the address shown in the address bar (their security processes at work). However, the individual pages should show something different after the ".../sezon-44/5027472/" part of the URL.
Using that saved DOM file as input, the following will extract the necessary references:
#!/bin/sh
###
### LOGIC FLOW => CONFIRMED VALID
###
DBG=1
#URL="https://polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2984/585ddf5a3dde69cb58c7f42ba52790a4"
###
### Completely expanded and populated DOM file,
### as captured by Firefox extension "Download Serialized DOM"
###
### Extension is slow but apparently very functional.
###
INPUT="test_77_Serialized.html"
BASE=$(basename "$0" ".sh")
TMP="${BASE}.tmp"
HARVESTED="${BASE}.harvest"
DISTILLED="${BASE}.urls"
#if [ ! -s "${TMP}" ]
#then
# ### Non-serialized
# wget -O "${TMP}" "${URL}"
#fi
### Each 'more' step is to allow review of outputs to identify patterns which are to be used for the next step.
cp -p ${INPUT} "${TMP}"
test ${DBG} -eq 1 && more ${TMP}
sed 's+\<a\ +\n\<a\ +g' "${TMP}" >"${TMP}.2"
URL_BASE=$( grep 'tiba=' "${TMP}.2" |
    sed 's+tiba=+\ntiba=+' |
    grep -v 'viewport' |
    cut -f1 -d\; |
    cut -f2 -d\= |
    cut -f1 -d\% )
printf "\n=======================\n%s\n=======================\n" "${URL_BASE}"
sed 's+\<a\ +\n\<a\ +g' "${TMP}" | grep '<a ' >"${TMP}.2"
test ${DBG} -eq 1 && more ${TMP}.2
grep 'title="Pierwsza Miłość - Odcinek' "${TMP}.2" >"${TMP}.3"
test ${DBG} -eq 1 && more ${TMP}.3
### FORMAT: Typical entry identified for video files
#<a data-testing="list.item.0" title="Pierwsza Miłość - Odcinek 2984" href="/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2984/585ddf5a3dde69cb58c7f42ba52790a4" class="ifodj3-0 yIiYl"></a><div class="sc-1vdpbg2-2 hKhMfx"><img data-src="https://ipla.pluscdn.pl/p/vm2images/9x/9xzfengehrm1rm8ukf7cvzvypv175iin.jpg" alt="Pierwsza Miłość - Odcinek 2984" class="rebvib-0 hIBTLi" src="https://ipla.pluscdn.pl/p/vm2images/9x/9xzfengehrm1rm8ukf7cvzvypv175iin.jpg"></div><div class="sc-1i4o84g-2 iDuLtn"><div class="orrg5d-0 gBnmbk"><span class="orrg5d-1 AjaSg">Odcinek 2984</span></div></div></div></div><div class="sc-1vdpbg2-1 bBDzBS"><div class="sc-1vdpbg2-0 hWnUTt"><
sed 's+href=+\nhref=+' "${TMP}.3" |
    sed 's+class=+\nclass=+' |
    grep '^href=' >"${TMP}.4"
test ${DBG} -eq 1 && more ${TMP}.4
awk -v base="${URL_BASE}" -v splitter=\" '{
    printf( "https://%s", base ) ;
    pos = index( $0, "href=" ) ;
    if ( pos != 0 ) {
        rem = substr( $0, pos + 6 ) ;       ### skip past href=" to the start of the path
        n = split( rem, var, splitter ) ;   ### the path ends at the closing quote
        printf( "%s\n", var[1] ) ;
    } ;
}' "${TMP}.4" >"${TMP}.5"
more ${TMP}.5
exit
That will give you a report in ${TMP}.5 like this:
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2984/585ddf5a3dde69cb58c7f42ba52790a4
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2985/e15e664718ef6c0dba471d59c4a1928a
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2986/58edb8e0f06dc3da40c255e50b3839cf
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2987/2ebc2e7b13268e74d90cc64c898530ee
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2988/2031529377d3be27402f61f07c1cd4f4
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2989/eaceb96a0368da10fb64e1383f93f513
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2990/4974094499083a8d67158d51c5df2fcb
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2991/4c79d87656dcafcccd4dfd9349ca7c23
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2992/26b4d8808ef4851640b9a2dfa8499a6d
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2993/930aaa5b2b3d52e2367dd4f533728020
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2994/fa78c186bc9414f844f197fd2d673da3
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2995/c059c7b2b54c3c25996c02992228e46b
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2996/4a016aeed0ee5b7ed5ae1c6117347e6a
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2997/1e3dca41d84471d5d95579afee66c6cf
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2998/440d069159114621939d1627eda37aec
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2999/f54381d4b61f76bb83f072059c15ea84
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3000/b272901a616147cd9f570750aa450f99
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3001/3aca6bd8e81962dc4a45fcc586cdcc7f
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3002/c6500c6e261bd5d65d0bd3a57cd36288
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3003/35a13bc5e5570ed223c5a0221a8d13f3
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3004/a5cfb71ed30e704730b8891323ff7d92
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3005/d86c1308029d78a6b7090503f8bab88e
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3006/54bba327bc7a1ae7b9b609e7ee11c07c
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3007/17d199a0523df8430bcb1f21d4a5b573
NOTE: In the browser screenshot (not reproduced here), the icon between the "folder" and the "star" in the address bar is the button for the Download Serialized DOM extension, which captures the currently displayed page as a fully-instantiated DOM file.
To save the output of the wget command that you provided above, add the following at the end of your command line:
-O ${vidfileUniqueName}.${fileTypeSuffix}
Before that wget, you will need to define something like the following:
vidfileUniqueName=$(echo "${URL}" | cut -f10 -d\/ )
fileTypeSuffix="mp4|avi|mkv"
You need to choose only one of the suffix types from that list and remove the others.
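Putting those pieces together, a minimal sketch of the download loop could look like the following; it assumes the harvested list (the ${TMP}.5 file produced above) is passed as the first argument, that mp4 is the suffix you kept, and that the links are reachable without a login session, which may not hold for this site:
#!/bin/sh
### Sketch only: reads one URL per line from the file named in $1.
fileTypeSuffix="mp4"
while read -r URL
do
    ### 10th "/"-separated field of the URL = the episode slug, used as the file name
    vidfileUniqueName=$(echo "${URL}" | cut -f10 -d\/ )
    wget -O "${vidfileUniqueName}.${fileTypeSuffix}" "${URL}"
done < "${1}"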

failsafe wget html script? (when user uses ~/.wgetrc)

I just ran into an issue. It was about a code snippet from one of alienbob's scripts, which he uses to check the most recent Adobe Flash version on adobe.com:
# Determine the latest version by checking the web page:
VERSION=${VERSION:-"$(wget -O - http://www.adobe.com/software/flash/about/ 2>/dev/null | sed -n "/Firefox - NPAPI/{N;p}" | tr -d ' '| tail -1 | tr '<>' ' ' | cut -f3 -d ' ')"}
echo "Latest version = "$VERSION
That code usually works like a charm in itself, but not for me. I use a custom ~/.wgetrc because I ran into issues with some pages that refused to let wget make even a single download. I usually don't do mass downloads on any site unless the site allows such things, or I set a reasonable pause in my wget script or one-liner.
Now, among other things, the ~/.wgetrc setup masks my wget as Firefox on Windows, and also includes this line:
header = Accept-Encoding: gzip,deflate
And that means that when I use wget to download an HTML file, it downloads that file as gzipped HTML.
Now I wonder: is there a trick to still make a script like alienbob's work with such a user setup, or did the user mess up his own system with that setup and have to figure out for himself why the script malfunctions?
(In my case, I could just remove the header = Accept-Encoding line and all works as it should, since when using wget one usually does not want HTML files to be gzipped.)
Use
wget --header=Accept-Encoding:identity -O - ....
as the --header option given on the command line takes precedence over the .wgetrc option of the same name.
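Applied to the snippet from the question, that would look something like this (assuming the Adobe page still matches the original sed pattern):
VERSION=${VERSION:-"$(wget --header=Accept-Encoding:identity -O - http://www.adobe.com/software/flash/about/ 2>/dev/null | sed -n "/Firefox - NPAPI/{N;p}" | tr -d ' ' | tail -1 | tr '<>' ' ' | cut -f3 -d ' ')"}
echo "Latest version = $VERSION"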
Maybe the target page was redesigned; this is the wget part that works for me now:
wget --header=Accept-Encoding:identity -O - http://www.adobe.com/software/flash/about/ 2>/dev/null | fgrep -m 1 -A 2 "Firefox - NPAPI" | tail -1 | sed s/\</\>/g | cut -d\> -f3

Using Bash to cURL a website and grep for keywords

I'm trying to write a script that will do a few things in the following order:
cURL websites from a list of URLs contained within a "url_list.txt" (newline-delimited) file.
For each website in the list, I want to grep that website's content for keywords contained within a "keywords.txt" (newline-delimited) file.
I want to finish by printing to the terminal in the following format (or something similar):
$URL (that contained match) : $keyword (that made the match)
It needs to be able to run in Ubuntu (GNU grep, etc.)
It does not need to be cURL and grep; as long as the functionality is there.
So far I've got:
#!/bin/bash
keywords=$(cat ./keywords.txt)
urllist=$(cat ./url_list.txt)
for url in $urllist; do
    content="$(curl -L -s "$url" | grep -iF "$keywords" /dev/null)"
    echo "$content"
done
But for some reason, no matter what I try to tweak or change, it keeps failing to one degree or another.
How can I go about accomplishing this task?
Thanks
Here's how I would do it:
#!/bin/bash
keywords="$(<./keywords.txt)"
while IFS= read -r url; do
    curl -L -s "$url" | grep -ioF "$keywords" |
        while IFS= read -r keyword; do
            echo "$url: $keyword"
        done
done < ./url_list.txt
What did I change:
I used $(<./keywords.txt) to read the keywords.txt. This does not rely on an external program (cat in your original script).
I changed the for loop that loops over the URL list into a while loop. This guarantees that we use Θ(1) memory (i.e. we don't have to load the entire URL list into memory).
I removed /dev/null from grep. grepping /dev/null alone is meaningless, since grep will find nothing there. Instead, I invoke grep with no file arguments so that it filters its stdin (which happens to be the output of curl in this case).
I added the -o flag for grep so that it outputs only the matched keyword.
I removed the subshell where you were capturing the output of curl. Instead I run the command directly and feed its output to a while loop. This is necessary because we might get more than one keyword match per URL.
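If you only want each keyword reported once per URL (grep -o prints every occurrence), one option, not part of the original answer, is to deduplicate the matches case-insensitively before the inner loop:
#!/bin/bash
keywords="$(<./keywords.txt)"
while IFS= read -r url; do
    # sort -uf collapses repeated matches of the same keyword (ignoring case)
    curl -L -s "$url" | grep -ioF "$keywords" | sort -uf |
        while IFS= read -r keyword; do
            echo "$url: $keyword"
        done
done < ./url_list.txt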

Is there a way to automatically and programmatically download the latest IP ranges used by Microsoft Azure?

Microsoft has provided a URL where you can download the public IP ranges used by Microsoft Azure.
https://www.microsoft.com/en-us/download/confirmation.aspx?id=41653
The question is, is there a way to download the XML file from that site automatically using Python or other scripts? I am trying to schedule a task to grab the new IP range file and process it for my firewall policies.
Yes! There is!
But of course you need to do a bit more work than expected to achieve this.
I discovered this question when I realized Microsoft's page only works via JavaScript when loaded in a browser, which makes automation via curl practically unusable; and I'm not going to download this file manually every time it is updated. Instead I came up with this Bash "one-liner" that works well for me.
First, let’s define that base URL like this:
MICROSOFT_IP_RANGES_URL="https://www.microsoft.com/en-us/download/confirmation.aspx?id=41653"
Now just run this curl command, which parses the returned web page HTML with a few chained greps, like this:
curl -Lfs "${MICROSOFT_IP_RANGES_URL}" | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep "download.microsoft.com/download/" | grep -m 1 -Eo '(http|https)://[^"]+'
The returned output is a nice and clean final URL like this:
https://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_20181224.xml
Which is exactly what one needs to automate the direct download of that XML file. If you then need to assign that value to a variable, just wrap the command in $() like this:
MICROSOFT_IP_RANGES_URL_FINAL=$(curl -Lfs "${MICROSOFT_IP_RANGES_URL}" | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep "download.microsoft.com/download/" | grep -m 1 -Eo '(http|https)://[^"]+')
And just access that URL via $MICROSOFT_IP_RANGES_URL_FINAL and there you go.
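From there, actually fetching the ranges file is one more step; a minimal sketch, where the PublicIPs.xml output name is just a placeholder of my choosing:
# Assumes MICROSOFT_IP_RANGES_URL_FINAL was set as shown above.
curl -Lfs -o PublicIPs.xml "${MICROSOFT_IP_RANGES_URL_FINAL}"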
If you can use wget, writing the following in a text editor and saving it as scriptNameYouWant.sh will give you a bash script that downloads the XML file you want.
#! /bin/bash
wget http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_21050223.xml
Of course, you can run the command directly from a terminal and get the exact same result.
If you want to use Python, one way is:
Again, write the following in a text editor and save it as scriptNameYouWant.py:
import urllib2
page = urllib2.urlopen("http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_21050223.xml").read()
f = open("xmlNameYouWant.xml", "w")
f.write(str(page))
f.close()
I'm not that good with Python or Bash, though, so I guess there is a more elegant way in either.
EDIT
You can get the XML despite the dynamic URL by using curl. What worked for me is this:
#! /bin/bash
FIRSTLONGPART="http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_"
INITIALURL="http://www.microsoft.com/EN-US/DOWNLOAD/confirmation.aspx?id=41653"
OUT="$(curl -s $INITIALURL | grep -o -P '(?<='$FIRSTLONGPART').*(?=.xml")'|tail -1)"
wget -nv $FIRSTLONGPART$OUT".xml"
Again, I'm sure that there is a more elegant way of doing this.
Here's a one-liner to get the URL of the JSON file:
$ curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?\.json' | uniq
I enhanced the last response a bit and I'm able to download the JSON file:
#! /bin/bash
download_link=$(curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?\.json' | uniq | grep -v refresh)
if [ $? -eq 0 ]
then
    wget "$download_link" ; echo "Latest file downloaded"
else
    echo "Download failed"
fi
Slight mod to ushuz's script from 2020, as that was producing 2 different entries and uniq doesn't work for me on Windows:
$list=@(curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?.json')
$list.Get(2)
That outputs the valid path to the JSON, which you can put into another variable for grepping or whatever.

How can I extract the text of multiple URLs with lynx/w3m in Linux

I have made a list of 50-odd URLs in one text file (one URL per line). Now, for each URL, I want to extract the text of the website and save it. This sounds like a job for a shell script in Linux.
At the moment I am putting things together:
with, say, sed -n 1p listofurls.txt I can read the first line of my URL file, listofurls.txt;
with lynx -dump www.firsturl... I can pipe the output through various commands to tidy and clean it up. Done, that works.
Before automating, I am struggling to pipe the URLs into lynx: say
sed -n 1p listofurls.txt | lynx -dump -stdin
does not work.
How can I do that say for one URL, and more importantly for each URL I have in listofurls.txt?
You can write a script like this:
vi script.sh
#content of script.sh#
while read -r line
do
    name=$line
    wget "$name"
    echo "Downloaded content from - $name"
done < "$1"
#end#
chmod +x script.sh
./script.sh listofurls.txt
To pipe one URL to lynx, you can use xargs:
sed -n 1p listofurls.txt | xargs lynx -dump
To download all URLs from the file (have lynx parse them and just print the output), you can do:
while read url; do lynx -dump "$url"; done < listofurls.txt
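If you also want to save each dump to its own file rather than just print it, a minimal sketch could look like this; the page_N.txt naming scheme is my own choice, not part of the question:
#!/bin/bash
# Dump the text of every URL in listofurls.txt into page_1.txt, page_2.txt, ...
n=0
while IFS= read -r url; do
    n=$((n + 1))
    lynx -dump "$url" > "page_${n}.txt"
    echo "Saved $url as page_${n}.txt"
done < listofurls.txt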
