Failsafe wget HTML script? (when the user has a ~/.wgetrc) - linux

I just ran into an issue with a code snippet from one of alienbob's scripts, which he uses to check the most recent Adobe Flash version on adobe.com:
# Determine the latest version by checking the web page:
VERSION=${VERSION:-"$(wget -O - http://www.adobe.com/software/flash/about/ 2>/dev/null | sed -n "/Firefox - NPAPI/{N;p}" | tr -d ' '| tail -1 | tr '<>' ' ' | cut -f3 -d ' ')"}
echo "Latest version = "$VERSION
That code usually works like a charm, but not for me. I use a custom ~/.wgetrc because I ran into issues with some pages that refused to let wget make even a single download. I normally don't mass-download from any site unless the site allows it, or I set a reasonable pause in my wget script or one-liner.
Now, among other things, my ~/.wgetrc makes wget identify itself as Firefox on Windows, and it also includes this line:
header = Accept-Encoding: gzip,deflate
That means that when I use wget to download an HTML file, the file arrives as gzipped HTML.
Now I wonder: is there a trick to make a script like alienbob's still work on such a user setup, or has the user messed up his own system with that setup and has to figure out for himself why the script malfunctions?
(In my case I could just remove the header = Accept-Encoding line and everything works as it should, since one usually doesn't want HTML files gzipped when using wget.)

Use
wget --header=Accept-Encoding:identity -O - ....
as the --header option on the command line takes precedence over the .wgetrc setting of the same name.
The target page may have been redesigned; this is the wget pipeline that works for me now:
wget --header=Accept-Encoding:identity -O - http://www.adobe.com/software/flash/about/ 2>/dev/null | fgrep -m 1 -A 2 "Firefox - NPAPI" | tail -1 | sed s/\</\>/g | cut -d\> -f3
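If a script cannot control which headers the user's ~/.wgetrc sends, another option is to make the pipeline tolerant of gzipped input. The following is only a sketch, assuming GNU gzip, whose -f option combined with -c copies unrecognized (non-gzip) input to stdout unchanged; the function name maybe_gunzip is made up for this example.

```shell
#!/bin/sh
# Pass stdin through unchanged if it is plain text, decompress it if it is gzip.
# GNU gzip: -d decompress, -c write to stdout, -f copy unrecognized input as-is.
maybe_gunzip() {
    gzip -cdf
}

# Offline demonstration: a plain and a gzipped copy of the same text
# both come out as readable text.
printf 'version 1.2.3\n' | maybe_gunzip
printf 'version 1.2.3\n' | gzip -c | maybe_gunzip
```

In a script like the one in the question, it would sit directly after the download, e.g. wget -O - URL 2>/dev/null | maybe_gunzip | sed ..., so the rest of the pipeline sees plain HTML either way.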

Related

Trimming string up to certain characters in Bash

I'm trying to make a bash script that will tell me the latest stable version of the Linux kernel.
The problem is that, while I can remove everything after certain characters, I don't seem to be able to delete everything prior to certain characters.
#!/bin/bash
wget=$(wget --output-document - --quiet www.kernel.org | \grep -A 1 "latest_link")
wget=${wget##.tar.xz\">}
wget=${wget%</a>}
echo "${wget}"
Somehow the output "ignores" the wget=${wget##.tar.xz\">} line.
You're trying to remove the longest match of the pattern .tar.xz\"> from the beginning of the string, but your string doesn't start with .tar.xz, so there is no match.
You have to use
wget=${wget##*.tar.xz\">}
Then, because you're in a script and not an interactive shell, there shouldn't be any need to escape \grep (presumably to prevent usage of an alias), as aliases are disabled in non-interactive shells.
And, as pointed out, naming a variable the same as an existing command (often found: test) is bound to lead to confusion.
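To see the difference the leading * makes without hitting the network, here is a small sketch. The sample string is an assumption about what the two lines returned by grep -A 1 "latest_link" roughly look like; the variable names are made up.

```shell
#!/bin/bash
# Hypothetical sample of the two lines that grep -A 1 "latest_link" returns.
html='<td id="latest_link">
<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.10.8.tar.xz">4.10.8</a>'

# Without *, the pattern must match at the very start of the string,
# which it does not, so nothing is removed.
no_star=${html##.tar.xz\">}

# With a leading *, everything up to and including .tar.xz"> is removed.
version=${html##*.tar.xz\">}

# % removes the shortest match from the end of the string.
version=${version%</a>}

echo "$version"
```

The same idea applies to % and %%: a trailing * is needed if the text to strip does not sit exactly at the end.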
If you want to use command line tools designed to deal with HTML, you could have a look at the W3C HTML-XML-utils (Ubuntu: apt install html-xml-utils). Using them, you could get the info you want as follows:
$ curl -sL www.kernel.org | hxselect 'td#latest_link' | hxextract a -
4.10.8
Or, in detail:
curl -sL www.kernel.org | # Fetch page
hxselect 'td#latest_link' | # Select td element with ID "latest_link"
hxextract a - # Extract link text ("-" for standard input)
Whenever I need to extract a substring in bash I always see if I can brute force it in a couple of cut(1) commands. In your case, the following appears to work:
wget=$(wget --output-document - --quiet www.kernel.org | \grep -A 1 "latest_link")
echo $wget | cut -d'>' -f3 | cut -d'<' -f1
I'm certain there's a more elegant way, but this has simple syntax that I never forget. Note that it will break if 'wget' gets extra ">" or "<" characters in the future.
It is not recommended to use shell tools such as grep, awk, or sed to parse HTML files.
However if you want a quick one liner then this awk should do the job:
wget --output-document - --quiet www.kernel.org |
awk '/"latest_link"/ { getline; n=split($0, a, /[<>]/); print a[n-2] }'
4.10.8
sed method:
wget --output-document - --quiet www.kernel.org | \
sed -n '/latest_link/{n;s/^.*">//;s/<.*//p}'
Output:
4.10.8

Using Bash to cURL a website and grep for keywords

I'm trying to write a script that will do a few things in the following order:
cURL websites from a list of urls contained within a "url_list.txt" (new-line delineated) file.
For each website in the list, I want to grep that website looking for keywords contained within a "keywords.txt" (new-line delineated) file.
I want to finish by printing to the terminal in the following format (or something similar):
$URL (that contained match) : $keyword (that made the match)
It needs to be able to run in Ubuntu (GNU grep, etc.)
It does not need to be cURL and grep; as long as the functionality is there.
So far I've got:
#!/bin/bash
keywords=$(cat ./keywords.txt)
urllist=$(cat ./url_list.txt)
for url in $urllist; do
content="$(curl -L -s "$url" | grep -iF "$keywords" /dev/null)"
echo "$content"
done
But for some reason, no matter what I try to tweak or change, it keeps failing to one degree or another.
How can I go about accomplishing this task?
Thanks
Here's how I would do it:
#!/bin/bash
keywords="$(<./keywords.txt)"
while IFS= read -r url; do
    curl -L -s "$url" | grep -ioF "$keywords" |
    while IFS= read -r keyword; do
        echo "$url: $keyword"
    done
done < ./url_list.txt
What did I change:
I used $(<./keywords.txt) to read the keywords.txt. This does not rely on an external program (cat in your original script).
I changed the for loop that iterates over the URL list into a while loop. This guarantees that we use Θ(1) memory (i.e. we don't have to load the entire URL list into memory).
I removed /dev/null from the grep command. Grepping /dev/null alone is meaningless, since grep will find nothing there. Instead, I invoke grep with no file arguments so that it filters its stdin (which happens to be the output of curl in this case).
I added the -o flag for grep so that it outputs only the matched keyword.
I removed the subshell where you were capturing the output of curl. Instead I run the command directly and feed its output to a while loop. This is necessary because we might get more than one keyword match per URL.
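The matching part can be tried without any network access. This is just a sketch: the function name report_matches, the temporary files and the sample page are made up, and it passes the keyword list with grep's -f option instead of a variable, which behaves the same way for a newline-separated list.

```shell
#!/bin/bash
# report_matches URL KEYWORD_FILE
# Reads page content on stdin and prints "URL: keyword" for every match.
report_matches() {
    local url=$1 keyword_file=$2
    grep -ioF -f "$keyword_file" |
    while IFS= read -r keyword; do
        echo "$url: $keyword"
    done
}

# Offline demonstration with a made-up keyword list and page.
tmpdir=$(mktemp -d)
printf 'kernel\nbash\n' > "$tmpdir/keywords.txt"

result=$(printf '<p>The Bash shell and the Linux kernel.</p>\n<p>bash again</p>\n' |
    report_matches "http://example.com" "$tmpdir/keywords.txt")
echo "$result"

rm -r "$tmpdir"
```

Because of -i, matches are printed with the capitalization found in the page, and a keyword that occurs several times is reported once per occurrence.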

Is there a way to automatically and programmatically download the latest IP ranges used by Microsoft Azure?

Microsoft has provided a URL where you can download the public IP ranges used by Microsoft Azure.
https://www.microsoft.com/en-us/download/confirmation.aspx?id=41653
The question is, is there a way to download the XML file from that site automatically using Python or other scripts? I am trying to schedule a task to grab the new IP range file and process it for my firewall policies.
Yes! There is!
But of course you need to do a bit more work than expected to achieve this.
I discovered this question when I realized that Microsoft’s page only works via JavaScript when loaded in a browser, which makes automation via curl practically unusable; I’m not going to download this file manually every time it is updated. Instead I came up with this Bash “one-liner” that works well for me.
First, let’s define that base URL like this:
MICROSOFT_IP_RANGES_URL="https://www.microsoft.com/en-us/download/confirmation.aspx?id=41653"
Now just run this Curl command—that parses the returned web page HTML with a few chained Greps—like this:
curl -Lfs "${MICROSOFT_IP_RANGES_URL}" | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep "download.microsoft.com/download/" | grep -m 1 -Eo '(http|https)://[^"]+'
The returned output is a nice and clean final URL like this:
https://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_20181224.xml
That is exactly what one needs to automate the direct download of that XML file. If you then need to assign that value to a variable, just wrap the command in $() like this:
MICROSOFT_IP_RANGES_URL_FINAL=$(curl -Lfs "${MICROSOFT_IP_RANGES_URL}" | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep "download.microsoft.com/download/" | grep -m 1 -Eo '(http|https)://[^"]+')
And just access that URL via $MICROSOFT_IP_RANGES_URL_FINAL and there you go.
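To see what each grep in that chain contributes, the same filters can be run against a small piece of sample HTML. This is only a sketch: the function name extract_download_url and the sample markup are made up, and the real page may be structured differently.

```shell
#!/bin/bash
# Read HTML on stdin and print the first download.microsoft.com link found.
extract_download_url() {
    grep -Eoi '<a [^>]+>' |                     # keep only opening <a ...> tags
    grep -Eo 'href="[^"]+"' |                   # keep only the href attribute
    grep 'download.microsoft.com/download/' |   # keep only direct download links
    grep -m 1 -Eo 'https?://[^"]+'              # strip the href="" wrapper, first hit only
}

# Made-up sample page with one unrelated link and one download link.
sample_html='<p>Your download should start shortly.</p>
<a href="https://www.microsoft.com/en-us/download/">Download Center</a>
<a class="dl" href="https://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_20181224.xml">click here</a>'

url=$(printf '%s\n' "$sample_html" | extract_download_url)
echo "$url"
```

With the live page you would feed the function from curl -Lfs "${MICROSOFT_IP_RANGES_URL}" instead of the sample string.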
If you can use wget, put the following in a text editor and save it as scriptNameYouWant.sh; that gives you a Bash script that downloads the XML file you want.
#! /bin/bash
wget http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_21050223.xml
Of course, you can run the command directly from a terminal and get exactly the same result.
If you want to use Python, one way is the following.
Again, write it in a text editor and save it as scriptNameYouWant.py:
# Python 3: urllib2 from Python 2 became urllib.request
import urllib.request
url = "http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_21050223.xml"
with urllib.request.urlopen(url) as response:
    page = response.read()
# read() returns bytes, so write the file in binary mode
with open("xmlNameYouWant.xml", "wb") as f:
    f.write(page)
I'm not that good with Python or Bash though, so I guess there is a more elegant way in both.
EDIT
You can get the XML despite the dynamic URL by using curl. What worked for me is this:
#! /bin/bash
FIRSTLONGPART="http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_"
INITIALURL="http://www.microsoft.com/EN-US/DOWNLOAD/confirmation.aspx?id=41653"
OUT="$(curl -s $INITIALURL | grep -o -P '(?<='$FIRSTLONGPART').*(?=.xml")'|tail -1)"
wget -nv $FIRSTLONGPART$OUT".xml"
Again, I'm sure that there is a more elegant way of doing this.
Here's a one-liner to get the URL of the JSON file:
$ curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?\.json' | uniq
I enhanced the last response a bit and I'm able to download the JSON file:
#! /bin/bash
download_link=$(curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?\.json' | uniq | grep -v refresh)
if [ $? -eq 0 ]
then
wget $download_link ; echo "Latest file downloaded"
else
echo "Download failed"
fi
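A variation on the same idea, as a sketch only: check whether the captured link is non-empty instead of relying on $?, and use sort -u so that duplicates are removed even when they are not on adjacent lines (uniq only collapses adjacent duplicates). The function name pick_json_url, the example.com URL and the sample markup are made up, and the pattern uses [^"]* so a match cannot run past the end of the attribute.

```shell
#!/bin/bash
# Read HTML on stdin and print one unique https://download....json link.
pick_json_url() {
    grep -Eo 'https://download[^"]*\.json' | sort -u | head -n 1
}

# Made-up sample page in which the same link appears twice.
sample_html='<a href="https://download.example.com/files/ServiceTags_Public.json">click here</a>
<meta http-equiv="refresh" content="0;url=https://download.example.com/files/ServiceTags_Public.json">'

link=$(printf '%s\n' "$sample_html" | pick_json_url)

if [ -n "$link" ]; then
    echo "Would download: $link"   # replace with: wget "$link"
else
    echo "No download link found" >&2
fi
```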
Slight mod to ushuz's script from 2020, as that was producing two different entries and uniq doesn't work for me on Windows:
$list=@(curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?.json')
$list.Get(2)
That outputs the valid path to the json, which you can punch into another variable for use for grepping or whatever.

Unix lynx in shell script to input data into website and grep result

Pretty new to unix shell scripting here, and I have some other examples to look at but still trying from almost scratch. I'm trying to track deliveries for our company, and I have a script I want to run that will input the tracking number into the website, and then grep the result to a file (delivered/not delivered). I can use the lynx command to get to the website at the command line and see the results, but in the script it just returns the webpage, and doesn't enter the tracking number.
Here's the code I have tried that works up to this point:
#$1 = 1034548607
FNAME=`date +%y%m%d%H%M%S`
echo requiredmcpartno=$1 | lynx -accept_all_cookies -nolist -dump -post_data http://apps.yrcregional.com/shipmentStatus/track.do 2>&1 | tee $FNAME >/home/jschroff/log.lg
DLV=`grep "PRO" $FNAME | cut --delimiter=: --fields=2 | awk '{print $DLV}'`
echo $1 $DLV > log.txt
rm $FNAME
I'm trying to get the results for the tracking number(PRO number as they call it) 1034548607.
Try doing this with curl :
trackNumber=1234
curl -A Mozilla/5.0 -b cookies -c cookies -kLd "proNumber=$trackNumber" http://apps.yrcregional.com/shipmentStatus/track.do
But check the site's terms of service to see whether you are allowed to scrape it.
If you want to parse the output, give us a sample HTML output.

less-style markdown viewer for UNIX systems

I have a Markdown string in JavaScript, and I'd like to display it (with bolding, etc) in a less (or, I suppose, more)-style viewer for the command line.
For example, with a string
"hello\n" +
"_____\n" +
"*world*!"
I would like to have output pop up with scrollable content that looks like
hello
world
Is this possible, and if so how?
Pandoc can convert Markdown to groff man pages.
This (thanks to nenopera's comment):
pandoc -s -f markdown -t man foo.md | man -l -
should do the trick. The -s option tells it to generate proper headers and footers.
There may be other markdown-to-*roff converters out there; Pandoc just happens to be the first one I found.
Another alternative is the markdown command (apt-get install markdown on Debian systems), which converts Markdown to HTML. For example:
markdown README.md | lynx -stdin
(assuming you have the lynx terminal-based web browser).
Or (thanks to Danny's suggestion) you can do something like this:
markdown README.md > README.html && xdg-open README.html
where xdg-open (on some systems) opens the specified file or URL in the preferred application. This will probably open README.html in your preferred GUI web browser (which isn't exactly "less-style", but it might be useful).
I tried to write this in a comment above, but I couldn't format my code block correctly. To write a 'less filter', try, for example, saving the following as ~/.lessfilter:
#!/bin/sh
case "$1" in
    *.md)
        # Render Markdown as a man page for less
        pandoc -s -f markdown -t man "$1" | groff -T utf8 -man -
        ;;
    *)
        # We don't handle this format.
        exit 1
        ;;
esac
# No further processing by lesspipe necessary
exit 0
Then, you can type less FILENAME.md and it will be formatted like a manpage.
If you are into colors then maybe this is worth checking as well:
terminal_markdown_viewer
It can also be used directly from within other programs or Python modules.
It has a lot of styles, over 200, for Markdown and code, and they can be combined.
Disclaimer
It is pretty alpha, so there may still be bugs.
I'm the author of it, maybe some people like it ;-)
A totally different alternative is mad. It is a shell script I've just discovered. It's very easy to install and it does render markdown in a console pretty well.
I wrote a couple functions based on Keith's answer:
mdt() {
markdown "$*" | lynx -stdin
}
mdb() {
local TMPFILE=$(mktemp)
markdown "$*" > $TMPFILE && ( xdg-open $TMPFILE > /dev/null 2>&1 & )
}
If you're using zsh, just place those two functions in ~/.zshrc and then call them from your terminal like
mdt README.md
mdb README.md
"t" is for "terminal", "b" is for browser.
On OS X I prefer to use these commands:
brew install pandoc
pandoc -s -f markdown -t man README.md | groff -T utf8 -man | less
They convert the markup, format the document with groff, and pipe it into less.
credit: http://blog.metamatt.com/blog/2013/01/09/previewing-markdown-files-from-the-terminal/
This is an alias that encapsulates a function:
alias mdless='_mdless() { if [ -n "$1" ] ; then if [ -f "$1" ] ; then cat <(echo ".TH $1 7 `date --iso-8601` Dr.Beco Markdown") <(pandoc -t man $1) | groff -K utf8 -t -T utf8 -man 2>/dev/null | less ; fi ; fi ;}; _mdless '
Explanation
alias mdless='...' : creates an alias for mdless
_mdless() {...}; : creates a temporary function to be called afterwards
_mdless : at the end, call it (the function above)
Inside the function:
if [ -n "$1" ] ; then : if the first argument is not null then...
if [ -f "$1" ] ; then : also, if the file exists and is regular then...
cat arg1 arg2 | groff ... : cat sends these two arguments, concatenated, to groff; the arguments being:
arg1: <(echo ".TH $1 7 `date --iso-8601` Dr.Beco Markdown") : a line that starts the file and that groff understands as the header and footer notes. It replaces the empty header that pandoc's -s option would generate.
arg2: <(pandoc -t man $1) : the file itself, filtered by pandoc, which outputs file $1 in man format
| groff -K utf8 -t -T utf8 -man 2>/dev/null : piping the resulting concatenated file to groff:
-K utf8 so groff understands the encoding of the input file
-t so it displays tables in the file correctly
-T utf8 so it outputs in the correct format
-man so it uses the man macro package to output the file in man format
2>/dev/null to ignore errors (after all, it's a raw file being transformed into man format by hand; we don't care about the errors as long as we can see the file in a not-so-ugly format).
| less : finally, shows the file, paginating it with less (I tried to avoid this pipe by using groffer instead of groff, but groffer is not as robust as less, and some files hang it or do not show at all. So let it go through one more pipe, what the heck!)
Add it to your ~/.bash_aliases (or alike)
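For comparison, the same pipeline can also be written as a plain shell function, which avoids the alias-wraps-a-function trick. This is only a sketch under a few assumptions: pandoc, groff and less are installed, the footer text "Markdown" is arbitrary, and date +%Y-%m-%d is used instead of --iso-8601 because the latter is a GNU date option.

```shell
#!/bin/bash
# mdless FILE: render a Markdown file as a man page and page it with less.
mdless() {
    if [ ! -f "$1" ]; then
        echo "mdless: no such file: $1" >&2
        return 1
    fi
    {
        # .TH line: title, section, date and footer text for the man page
        printf '.TH "%s" 7 "%s" "Markdown"\n' "$1" "$(date +%Y-%m-%d)"
        pandoc -t man "$1"
    } | groff -K utf8 -t -T utf8 -man 2>/dev/null | less
}
```

Put it in ~/.bashrc (or similar) and call it as mdless README.md.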
I personally use this script:
#!/bin/bash
id=$(uuidgen | cut -c -8)
markdown $1 > /tmp/md-$id
google-chrome --app=file:///tmp/md-$id
It renders the Markdown into HTML, puts it into a file in /tmp/md-... and opens that in a kiosk Chrome session with no URI bar etc. You just pass the md file as an argument or pipe it into stdin. Requires markdown and Google Chrome. Chromium should also work, but you need to replace the last line with
chromium-browser --app=file:///tmp/md-$id
If you wanna get fancy about it, you can use some css to make it look nice, I edited the script and made it use Bootstrap3 (overkill) from a CDN.
#!/bin/bash
id=$(uuidgen | cut -c -8)
markdown $1 > /tmp/md-$id
sed -i "1i <html><head><style>body{padding:24px;}</style><link rel=\"stylesheet\" type=\"text/css\" href=\"http://maxcdn.bootstrapcdn.com/bootstrap/3.2.0/css/bootstrap.min.css\"></head><body>" /tmp/md-$id
echo "</body>" >> /tmp/md-$id
google-chrome --app=file:///tmp/md-$id > /dev/null 2>&1 &
I'll post my unix page answer here, too:
An IMHO heavily underestimated command line markdown viewer is the markdown-cli.
Installation
npm install markdown-cli --global
Usage
markdown-cli <file>
Features
It probably doesn't get noticed much because it lacks any documentation...
But as far as I could figure out from some example Markdown files, these are the things that convinced me:
handles ill formatted files much better (similarly to atom, github, etc.; eg. when blank lines are missing before lists)
more stable with formatting in headers or lists (bold text in lists breaks sublists in some other viewers)
proper table formatting
syntax highlighting
resolves footnote links to show the link instead of the footnote number (not everyone might want this)
Drawbacks
I have realized the following issues
code blocks are flattened (all leading spaces disappear)
two blank lines appear before lists

Resources