Using Bash to cURL a website and grep for keywords

Using Bash to cURL a website and grep for keywords - linux

I'm trying to write a script that will do a few things in the following order:
cURL websites from a list of urls contained within a "url_list.txt" (new-line delineated) file.
For each website in the list, I want to grep that website looking for keywords contained within a "keywords.txt" (new-line delineated) file.
I want to finish by printing to the terminal in the following format (or something similar):
$URL (that contained match) : $keyword (that made the match)
It needs to be able to run in Ubuntu (GNU grep, etc.)
It does not need to be cURL and grep; as long as the functionality is there.
So far I've got:
#!/bin/bash
keywords=$(cat ./keywords.txt)
urllist=$(cat ./url_list.txt)
for url in $urllist; do
content="$(curl -L -s "$url" | grep -iF "$keywords" /dev/null)"
echo "$content"
done
But for some reason, no matter what I try to tweak or change, it keeps failing to one degree or another.
How can I go about accomplishing this task?
Thanks

Here's how I would do it:
#!/bin/bash
keywords="$(<./keywords.txt)"
while IFS= read -r url; do
curl -L -s "$url" | grep -ioF "$keywords" |
while IFS= read -r keyword; do
echo "$url: $keyword"
done
done < ./url_list.txt
What did I change:
I used $(<./keywords.txt) to read the keywords.txt. This does not rely on an external program (cat in your original script).
I changed the for loop that loops over the url list, into a while loop. This guarentees that we use Θ(1) memory (i.e. we don't have to load the entire url list in memory).
I remove /dev/null from grep. greping from /dev/null alone is meaningless, since it will find nothing there. Instead, I invoke grep with no arguments so that it filters its stdin (which happens to be the output of curl in this case).
I added the -o flag for grep so that it outputs only the matched keyword.
I removed the subshell where you were capturing the output of curl. Instead I run the command directly and feed its output to a while loop. This is necessary because we might get more than keyword match per url.

Related

Retrieve URL components using bash

I have a massive list of URLs in a text file, which I'd like to download using wget. This seems simple enough:
#!/bin/bash
cat list.txt | \
while read CMD; do
wget $CMD; done;
However, wget uses the basename of the file as the download location, which results in renaming schemes, such as file.txt.1, file.txt.2 and so on.
An $URL can look like this:
http://sub.domain.com/some/folder/to/file.txt
Where http://sub.domain.com/some/ is always the same. Now, in JS I would do $URL.split("http://sub.domain.com/some/")[1], but this doesn't quite seem to work in Bash:
IFS="http://sub.domain.com/some/" read -a url <<< "http://sub.domain.com/some/folder/to/file.txt"
echo "${url[1]}"; // always empty.

Use the shell's parameter expansion operator to remove the prefix:
base=${CMD#http://sub.domain.com/some/}
BTW, you should get out of the habit of using all-uppercase variable names in shell scripts. These are conventionally used for environment variables.

If the length of the prefix is static you could do the following:
#!/bin/bash
while read line
do
suffix=${line:${#line} - LENGTH}
wget $line -O $suffix
done < "list.txt"

Is there a way to automatically and programmatically download the latest IP ranges used by Microsoft Azure?

Microsoft has provided a URL where you can download the public IP ranges used by Microsoft Azure.
https://www.microsoft.com/en-us/download/confirmation.aspx?id=41653
The question is, is there a way to download the XML file from that site automatically using Python or other scripts? I am trying to schedule a task to grab the new IP range file and process it for my firewall policies.

Yes! There is!
But of course you need to do a bit more work than expected to achieve this.
I discovered this question when I realized Microsoft’s page only works via JavaScript when loaded in a browser. Which makes automation via Curl practically unusable; I’m not manually downloading this file when updates happen. Instead IO came up with this Bash “one-liner” that works well for me.
First, let’s define that base URL like this:
MICROSOFT_IP_RANGES_URL="https://www.microsoft.com/en-us/download/confirmation.aspx?id=41653"
Now just run this Curl command—that parses the returned web page HTML with a few chained Greps—like this:
curl -Lfs "${MICROSOFT_IP_RANGES_URL}" | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep "download.microsoft.com/download/" | grep -m 1 -Eo '(http|https)://[^"]+'
The returned output is a nice and clean final URL like this:
https://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_20181224.xml
Which is exactly what one needs to automatic the direct download of that XML file. If you need to then assign that value to a URL just wrap it in $() like this:
MICROSOFT_IP_RANGES_URL_FINAL=$(curl -Lfs "${MICROSOFT_IP_RANGES_URL}" | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep "download.microsoft.com/download/" | grep -m 1 -Eo '(http|https)://[^"]+')
And just access that URL via $MICROSOFT_IP_RANGES_URL_FINAL and there you go.

if you can use wget writing in a text editor and saving it as scriptNameYouWant.sh will give you a bash script that will download the xml file you want.
#! /bin/bash
wget http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_21050223.xml
Of course, you can run the command directly from a terminal and have the exact result.
If you want to use Python, one way is:
Again, write in a text editor and save as scriptNameYouWant.py the folllowing:
import urllib2
page = urllib2.urlopen("http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_21050223.xml").read()
f = open("xmlNameYouWant.xml", "w")
f.write(str(page))
f.close()
I'm not that good with python or bash though, so I guess there is a more elegant way in both python or bash.
EDIT
You get the xml despite the dynamic url using curl. What worked for me is this:
#! /bin/bash
FIRSTLONGPART="http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_"
INITIALURL="http://www.microsoft.com/EN-US/DOWNLOAD/confirmation.aspx?id=41653"
OUT="$(curl -s $INITIALURL | grep -o -P '(?<='$FIRSTLONGPART').*(?=.xml")'|tail -1)"
wget -nv $FIRSTLONGPART$OUT".xml"
Again, I'm sure that there is a more elegant way of doing this.

Here's an one-liner to get the URL of the JSON file:
$ curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?\.json' | uniq

I enhanced the last response a bit and I'm able to download the JSON file:
#! /bin/bash
download_link=$(curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?\.json' | uniq | grep -v refresh)
if [ $? -eq 0 ]
then
wget $download_link ; echo "Latest file downloaded"
else
echo "Download failed"
fi

Slight mod to ushuz script from 2020 - as that was producing 2 different entries and unique doesn't work for me in Windows -
$list=#(curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?.json')
$list.Get(2)
That outputs the valid path to the json, which you can punch into another variable for use for grepping or whatever.

Redirecting linux cout to a variable and the screen in a script

I am currently trying to make a script file that runs multiple other script files on a server. I would like to display the output of these script to the screen IN ADDITION to passing it into grep so I can do error testing. currently I have written this:
status=$(SOMEPROCESS | grep -i "SOMEPROCESS started completed correctly")
I do further error handling below this using the variable status, so I would like to display SOMEPROCESS's output to the screen for error reference. This is a read only server and I can not save the output to a log file.

You need to use the tee command. It will be slightly fiddly, since tee outputs to a file handle. However you could create a file descriptor using pipe.
Or (simpler) for your use case.
Start the script without grep and pipe it through tee SOMEPROCESS | tee /my/safely/generated/filename. Then use tail -f /my/safely/generated/filename | grep -i "my grep pattern separately.

You can use process substituion together with tee:
SOMEPROCESS | tee >(grep ...)
This will use an anonymous pipe and pass /dev/fd/... as file name to tee (or a named pipe on platforms that don't support /dev/fd/...).
Because SOMEPROCESS is likely to buffer its output when not talking to a terminal, you might see significant lag in screen output.

I'm not sure whether I understood your question exactly.
I think you want to get the output of SOMEPROCESS, test it, print it out when there are errors. If it is, I think the code bellow may help you:
s=$(SOMEPROCESS)
grep -q 'SOMEPROCESS started completed correctly' <<< $s
if [[ $? -ne 0 ]];then
# specified string not found in the output, it means SOMEPROCESS started failed
echo $s
fi
But in this code, it will store the all output in the memory, if the output is big enough, there will be a OOM risk.

Grep filtering output from a process after it has already started?

Normally when one wants to look at specific output lines from running something, one can do something like:
./a.out | grep IHaveThisString
but what if IHaveThisString is something which changes every time so you need to first run it, watch the output to catch what IHaveThisString is on that particular run, and then grep it out? I can just dump to file and later grep but is it possible to do something like background it and then bring it to foreground and bringing it back but now piped to some grep? Something akin to:
./a.out
Ctrl-Z
fg | grep NowIKnowThisString
just wondering..

No, it is only in your screen buffer if you didn't save it in some other way.

Short form: You can do this, but you need to know that you need to do it ahead-of-time; it's not something that can be put into place interactively after-the-fact.
Write your script to determine what the string is. We'd need a more detailed example of the output format to give a better example of usage, but here's one for the trivial case where the entire first line is the filter target:
run_my_command | { read string_to_filter_for; fgrep -e "$string_to_filter_for" }
Replace the read string_to_filter_for with as many commands as necessary to read enough input to determine what the target string is; this could be a loop if necessary.
For instance, let's say that the output contains the following:
Session id: foobar
and thereafter, you want to grep for lines containing foobar.
...then you can pipe through the following script:
re='Session id: (.*)'
while read; do
if [[ $REPLY =~ $re ]] ; then
target=${BASH_REMATCH[1]}
break
else
# if you want to print the preamble; leave this out otherwise
printf '%s\n' "$REPLY"
fi
done
[[ $target ]] && grep -F -e "$target"
If you want to manually specify the filter target, this can be done by having the loop check for a file being created with filter contents, and using that when starting up grep afterwards.

That is a little bit strange what you need, but you can do it tis way:
you must go into script session first;
then you use shell how usually;
then you start and interrupt you program;
then run grep over typescript file.
Example:
$ script
$ ./a.out
Ctrl-Z
$ fg
$ grep NowIKnowThisString typescript

You could use a stream editor such as sed instead of grep. Here's an example of what I mean:
$ cat list
Name to look for: Mike
Dora 1
John 2
Mike 3
Helen 4
Here we find the name to look for in the fist line and want to grep for it. Now piping the command to sed:
$ cat list | sed -ne '1{s/Name to look for: //;h}' \
> -e ':r;n;G;/^.*\(.\+\).*\n\1$/P;s/\n.*//;br'
Mike 3
Note: sed itself can take file as a parameter, but you're not working with text files, so that's how you'd use it.
Of course, you'd need to modify the command for your case.

less-style markdown viewer for UNIX systems

I have a Markdown string in JavaScript, and I'd like to display it (with bolding, etc) in a less (or, I suppose, more)-style viewer for the command line.
For example, with a string
"hello\n" +
"_____\n" +
"*world*!"
I would like to have output pop up with scrollable content that looks like
hello
world
Is this possible, and if so how?

Pandoc can convert Markdown to groff man pages.
This (thanks to nenopera's comment):
pandoc -s -f markdown -t man foo.md | man -l -
should do the trick. The -s option tells it to generate proper headers and footers.
There may be other markdown-to-*roff converters out there; Pandoc just happens to be the first one I found.
Another alternative is the markdown command (apt-get install markdown on Debian systems), which converts Markdown to HTML. For example:
markdown README.md | lynx -stdin
(assuming you have the lynx terminal-based web browser).
Or (thanks to Danny's suggestion) you can do something like this:
markdown README.md > README.html && xdg-open README.html
where xdg-open (on some systems) opens the specified file or URL in the preferred application. This will probably open README.html in your preferred GUI web browser (which isn't exactly "less-style", but it might be useful).

I tried to write this in a comment above, but I couldn't format my code block correctly. To write a 'less filter', try, for example, saving the following as ~/.lessfilter:
#!/bin/sh
case "$1" in
*.md)
extension-handler "$1"
pandoc -s -f markdown -t man "$1"|groff -T utf8 -man -
;;
*)
# We don't handle this format.
exit 1
esac
# No further processing by lesspipe necessary
exit 0
Then, you can type less FILENAME.md and it will be formatted like a manpage.

If you are into colors then maybe this is worth checking as well:
terminal_markdown_viewer
It can be used straightforward also from within other programs, or python modules.
And it has a lot of styles, like over 200 for markdown and code which can be combined.
Disclaimer
It is pretty alpha there may be still bugs
I'm the author of it, maybe some people like it ;-)

A totally different alternative is mad. It is a shell script I've just discovered. It's very easy to install and it does render markdown in a console pretty well.

I wrote a couple functions based on Keith's answer:
mdt() {
markdown "$*" | lynx -stdin
}
mdb() {
local TMPFILE=$(mktemp)
markdown "$*" > $TMPFILE && ( xdg-open $TMPFILE > /dev/null 2>&1 & )
}
If you're using zsh, just place those two functions in ~/.zshrc and then call them from your terminal like
mdt README.md
mdb README.md
"t" is for "terminal", "b" is for browser.

Using OSX I prefer to use this command
brew install pandoc
pandoc -s -f markdown -t man README.md | groff -T utf8 -man | less
Convert markupm, format document with groff, and pipe into less
credit: http://blog.metamatt.com/blog/2013/01/09/previewing-markdown-files-from-the-terminal/

This is an alias that encapsulates a function:
alias mdless='_mdless() { if [ -n "$1" ] ; then if [ -f "$1" ] ; then cat <(echo ".TH $1 7 `date --iso-8601` Dr.Beco Markdown") <(pandoc -t man $1) | groff -K utf8 -t -T utf8 -man 2>/dev/null | less ; fi ; fi ;}; _mdless '
Explanation
alias mdless='...' : creates an alias for mdless
_mdless() {...}; : creates a temporary function to be called afterwards
_mdless : at the end, call it (the function above)
Inside the function:
if [ -n "$1" ] ; then : if the first argument is not null then...
if [ -f "$1" ] ; then : also, if the file exists and is regular then...
cat arg1 arg2 | groff ... : cat sends this two arguments concatenated to groff; the arguments being:
arg1: <(echo ".TH $1 7date --iso-8601Dr.Beco Markdown") : something that starts the file and groff will understand as the header and footer notes. This substitutes the empty header from -s key on pandoc.
arg2: <(pandoc -t man $1) : the file itself, filtered by pandoc, outputing the man style of file $1
| groff -K utf8 -t -T utf8 -man 2>/dev/null : piping the resulting concatenated file to groff:
-K utf8 so groff understands the input file code
-t so it displays correctly tables in the file
-T utf8 so it output in the correct format
-man so it uses the MACRO package to outputs the file in man format
2>/dev/null to ignore errors (after all, its a raw file being transformed in man by hand, we don't care the errors as long as we can see the file in a not-so-much-ugly format).
| less : finally, shows the file paginating it with less (I've tried to avoid this pipe by using groffer instead of groff, but groffer is not as robust as less and some files hangs it or do not show at all. So, let it go through one more pipe, what the heck!
Add it to your ~/.bash_aliases (or alike)

I personally use this script:
#!/bin/bash
id=$(uuidgen | cut -c -8)
markdown $1 > /tmp/md-$id
google-chrome --app=file:///tmp/md-$id
It renders the markdown into HTML, puts it into a file in /tmp/md-... and opens that in a kiosk chrome session with no URI bar etc.. You just pass the md file as an argument or pipe it into stdin. Requires markdown and Google Chrome. Chromium should also work but you need to replace the last line with
chromium-browser --app=file:///tmp/md-$id
If you wanna get fancy about it, you can use some css to make it look nice, I edited the script and made it use Bootstrap3 (overkill) from a CDN.
#!/bin/bash
id=$(uuidgen | cut -c -8)
markdown $1 > /tmp/md-$id
sed -i "1i <html><head><style>body{padding:24px;}</style><link rel=\"stylesheet\" type=\"text/css\" href=\"http://maxcdn.bootstrapcdn.com/bootstrap/3.2.0/css/bootstrap.min.css\"></head><body>" /tmp/md-$id
echo "</body>" >> /tmp/md-$id
google-chrome --app=file:///tmp/md-$id > /dev/null 2>&1 &

I'll post my unix page answer here, too:
An IMHO heavily underestimated command line markdown viewer is the markdown-cli.
Installation
npm install markdown-cli --global
Usage
markdown-cli <file>
Features
Probably not noticed much, because it misses any documentation...
But as far as I could figure out by some example markdown files, some things that convinced me:
handles ill formatted files much better (similarly to atom, github, etc.; eg. when blank lines are missing before lists)
more stable with formatting in headers or lists (bold text in lists breaks sublists in some other viewers)
proper table formatting
syntax highlightning
resolves footnote links to show the link instead of the footnote number (not everyone might want this)
Screenshot
Drawbacks
I have realized the following issues
code blocks are flattened (all leading spaces disappear)
two blank lines appear before lists

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Using Bash to cURL a website and grep for keywords - linux

Related

Retrieve URL components using bash

Is there a way to automatically and programmatically download the latest IP ranges used by Microsoft Azure?

Redirecting linux cout to a variable and the screen in a script

Grep filtering output from a process after it has already started?

less-style markdown viewer for UNIX systems

Categories

Resources