Determine which browser made the request from the user-agent string - linux

I am trying to write a shell script that reads a log file and extracts the user agent from each request, which I am able to do with the following code. However, I also need to produce a numeric count of the requests made by each browser and, using gnuplot, plot a bar chart of the number of requests per browser. I am a bit stuck on the per-browser counts; any help or direction would be appreciated.
Cheers.
#!/bin/bash
# All we're doing here is extracting the user agent field from the log file and
# piping it through some other commands. The first sort enables uniq to properly
# identify and count unique user agents. The final sort orders the result by
# count and name (both descending).
awk -F\" '{print $6}' access.log | sort | uniq -c | sort -fr > extracteduseragents.txt

To get friendlier names for the browsers, you can use e.g. pybrowscap together with the browscap.csv file from http://tempdownloads.browserscap.com/index.php.
Then you could use a script like the following sanitize_ua.py:
#!/usr/bin/env python
import sys
from pybrowscap.loader.csv import load_file
browscap = load_file('browscap.csv')
for ua in sys.stdin:
    browser = browscap.search(ua.strip())
    if browser:
        print("'{} {}'".format(browser.name(), browser.version()))
And run it from the command line like this:
awk -F\" '{print $6}' access.log | sort | python sanitize_ua.py | uniq -c | sort -fr
Of course, searching all user agents before uniq is very inefficient, but it should show the working principle. And, of course, you could also write a single Python script to do all the processing.
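For the gnuplot half of the question, one possible approach is sketched below; the file names are assumptions, and it expects a data file with the count in column 1 and a double-quoted browser name in column 2, so you may need to swap the single quotes emitted by sanitize_ua.py for double quotes:
#!/bin/bash
# plot_browsers.sh -- a sketch: bar chart of requests per browser
# assumes browser_counts.txt lines look like:  1042 "Firefox 45.0"
gnuplot <<'EOF'
set terminal png size 1024,480
set output 'browsers.png'
set boxwidth 0.8
set style fill solid
set yrange [0:*]
set xtics rotate by -45
plot 'browser_counts.txt' using 0:1:xtic(2) with boxes notitle
EOF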

Related

Recursively grep unique pattern in different files

Sorry, the title is not very clear.
So let's say I'm grepping recursively for urls like this:
grep -ERo '(http|https)://[^/"]+' /folder
and in the folder there are several files containing the same URL. My goal is to output this URL only once. I tried to pipe the grep output to | uniq or sort -u but that doesn't help.
example result:
/www/tmpl/button.tpl.php:http://www.w3.org
/www/tmpl/header.tpl.php:http://www.w3.org
/www/tmpl/main.tpl.php:http://www.w3.org
/www/tmpl/master.tpl.php:http://www.w3.org
/www/tmpl/progress.tpl.php:http://www.w3.org
If you only want the address and not the file it was found in, grep has the option -h to suppress file names in the output; the list can then be piped to sort -u to make sure every address appears only once:
$ grep -hERo 'https?://[^/"]+' folder/ | sort -u
http://www.w3.org
If you don't want the https?:// part, you can use Perl regular expressions (-P instead of -E) with variable length look-behind (\K):
$ grep -hPRo 'https?://\K[^/"]+' folder/ | sort -u
www.w3.org
If the structure of the output is always:
/some/path/to/file.php:http://www.someurl.org
you can use the cut command:
cut -d ':' -f 2- should work. Basically, it cuts each line into fields separated by a delimiter (here ':') and you select the 2nd and following fields (-f 2-).
After that, pipe through sort -u to filter out duplicates (plain uniq only removes adjacent duplicate lines).
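Putting it together with the grep command from the question (a sketch):
grep -ERo '(http|https)://[^/"]+' /folder | cut -d ':' -f 2- | sort -u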
Pipe to Awk:
grep -ERo 'https?://[^/"]+' /folder |
awk -F: '!a[substr($0,length($1)+2)]++'
The basic Awk idiom !a[key]++ is true the first time we see key, and forever false after that. Extracting the URL into the key requires a bit of additional trickery: substr($0, length($1)+2) skips past the file name and the colon that follows it, so the key is just the URL.
This prints the whole input line if the key is one we have not seen before, i.e. it will print the file name and the URL for the first occurrence of each URL from the grep output.
Doing the whole thing in Awk should not be too hard, either.
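For example, here is a sketch that assumes GNU Awk (for match() with an array argument) and uses find for the recursion; note that, unlike grep -o, it only picks up the first URL on each line:
find /folder -type f -exec gawk '
    # print file:URL for the first occurrence of each URL across all files
    match($0, /https?:\/\/[^\/"]+/, m) && !seen[m[0]]++ { print FILENAME ":" m[0] }
' {} +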

Bash input the data from a text file and process it

I want to create a bash script which will take entries from a text file and process them based on the requirement.
I am trying to find out the last login time for each of the mail accounts. I have a text file email.txt that contains all the email addresses line by line, and I need to check the last login time for each of these accounts using the command below:
cat /var/log/maillog | grep 'mail1#address.com' | grep 'Login' | tail -1 > result.txt
So the result would be inside result.txt
Thanks,
I think you are looking for something like this.
while read email_id ; do grep "$email_id" /var/log/maillog | grep 'Login' | tail -1 >> result.txt ; done < email.txt
To get all the email addresses in one pass (rather than repeatedly scanning what is probably a very large log file):
grep -f file_with_email_addrs /var/log/maillog
Whether this is worthwhile depends on the size of the log file and the number of e-mail addresses you are processing. A small log file or a small number of users will probably render this moot, but a big log file and a lot of users make it a good first step. It will then require some extra processing, though.
I am unsure about the format of the log lines or the relative distribution of Logins in the file, so what happens next depends on more information.
# If logins are rare?
grep Login /var/log/maillog | grep -f file_with_email_addrs | sort -u -k <column of email>
# If user count is small
grep -f file_with_email_addrs /var/log/maillog | awk '/Login/ {S[$<column of email>] = $0;} END {for (loop in S) {printf("%s\n",S[loop]);}}'
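As a concrete sketch of the second variant (the field number 7 for the e-mail address is a made-up assumption; adjust it to your actual maillog layout):
# keep only the last 'Login' line seen for each address listed in email.txt
grep -f email.txt /var/log/maillog |
    awk '/Login/ { last[$7] = $0 } END { for (addr in last) print last[addr] }' > result.txt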
Your question is unclear. Do you want to look up the last login for each 'mail_address1' read from a file?
If so:
#!/bin/bash
while read mail_address1
do
    # Your command line here
    grep "$mail_address1" /var/log/maillog | grep 'Login' | tail -1 >> result.txt
done < file_with_email_addrs
PS: I've changed the answer to reflect the suggestions (extremely valid ones) in the comments that advise against using a for i in $(cat file) style loop to loop through a file.

Extract multiple substrings in bash

I have a page exported from a wiki and I would like to find all the links on that page using bash. All the links on that page are in the form [wiki:<page_name>]. I have a script that does:
...
# First search for the links to the pages
search=`grep '\[wiki:' pages/*`
# Check if our search turned up anything
if [ -n "$search" ]; then
    # Now, we want to cut out the page name and find unique listings
    uniquePages=`echo "$search" | cut -d'[' -f 2 | cut -d']' -f 1 | cut -d':' -f2 | cut -d' ' -f 1 | sort -u`
....
However, when presented with a grep result that has multiple [wiki: entries in it, it only pulls the last one and not any of the others. For example, if $search is:
Before starting the configuration, all the required libraries must be installed to be detected by Cmake. If you have missed this step, see the [wiki:CT/Checklist/Libraries "Libr By pressing [t] you can switch to advanced mode screen with more details. The 5 pages are available [wiki:CT/Checklist/Cmake/advanced_mode here]. To obtain information about ea - '''Installation of Cantera''': If Cantera has not been correctly installed or if you do not have sourced the setup file '''~/setup_cantera''' you should receive the following message. Refer to the [wiki:CT/FormulationCantera "Cantera installation"] page to fix this problem. You can set the Cantera options to OFF if you plan to use built-in transport, thermodynamics and chemistry.
then it only returns CT/FormulationCantera and it doesn't give me any of the other links. I know this is due to using cut, so I need a replacement for the $uniquePages line.
Does anybody have any suggestions in bash? It can use sed or perl if needed, but I'm hoping for a one-liner to extract a list of page names if at all possible.
egrep -oh '\[wiki:[^]]*]' pages/* | sed 's/\[wiki://;s/]//' | sort -u
Update: to remove everything after a space without using cut:
egrep -oh '\[wiki:[^]]*]' pages/* | sed 's/\[wiki://;s/]//;s/ .*//' | sort -u
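If you want to keep the structure of your original script, the same pipeline can simply replace the chain of cut commands (a sketch):
uniquePages=$(egrep -oh '\[wiki:[^]]*]' pages/* | sed 's/\[wiki://;s/]//;s/ .*//' | sort -u)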

Looping through a text file containing domains using bash script

I have written a script that reads the href tags of a webpage, fetches the links on that page, and writes them to a text file. Now I have a text file containing links such as these, for example:
http://news.bbc.co.uk/2/hi/health/default.stm
http://news.bbc.co.uk/weather/
http://news.bbc.co.uk/weather/forecast/8?area=London
http://newsvote.bbc.co.uk/1/shared/fds/hi/business/market_data/overview/default.stm
http://purl.org/dc/terms/
http://static.bbci.co.uk/bbcdotcom/0.3.131/style/3pt_ads.css
http://static.bbci.co.uk/frameworks/barlesque/2.8.7/desktop/3.5/style/main.css
http://static.bbci.co.uk/frameworks/pulsesurvey/0.7.0/style/pulse.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/ie6.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/ie7.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/ie8.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/main.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/img/iphone.png
http://www.bbcamerica.com/
http://www.bbc.com/future
http://www.bbc.com/future/
http://www.bbc.com/future/story/20120719-how-to-land-on-mars
http://www.bbc.com/future/story/20120719-road-opens-for-connected-cars
http://www.bbc.com/future/story/20120724-in-search-of-aliens
http://www.bbc.com/news/
I would like to be able to filter them such that I return something like:
http://www.bbc.com : 6
http://static.bbci.co.uk: 15
The values on the side indicate the number of times the domain appears in the file. How can I achieve this in bash, considering I would have a loop going through the file? I am a newbie to bash shell scripting.
$ cut -d/ -f-3 urls.txt | sort | uniq -c
3 http://news.bbc.co.uk
1 http://newsvote.bbc.co.uk
1 http://purl.org
8 http://static.bbci.co.uk
1 http://www.bbcamerica.com
6 http://www.bbc.com
Just like this
egrep -o '^http://[^/]+' domain.txt | sort | uniq -c
Output of this on your example data:
3 http://news.bbc.co.uk
1 http://newsvote.bbc.co.uk
1 http://purl.org
8 http://static.bbci.co.uk
6 http://www.bbc.com
1 http://www.bbcamerica.com
This solution works even if your line is made up of a simple url without a trailing slash, so
http://www.bbc.com/news
http://www.bbc.com/
http://www.bbc.com
will all be in the same group.
If you want to allow https, then you can write:
egrep -o '^https?://[^/]+' domain.txt | sort | uniq -c
If other protocols are possible, such as ftp, mailto, etc., you can be even more permissive and write:
egrep -o '^[^:]+://[^/]+' domain.txt | sort | uniq -c
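Since the question mentions looping through the file, here is a pure-bash sketch using an associative array (requires bash 4+; the file name urls.txt is an assumption):
#!/bin/bash
# Count how often each protocol://host prefix appears in urls.txt
declare -A count
while IFS= read -r url; do
    [ -z "$url" ] && continue       # skip blank lines
    proto=${url%%://*}              # e.g. http
    rest=${url#*://}                # e.g. www.bbc.com/news/
    host=${rest%%/*}                # e.g. www.bbc.com
    key="$proto://$host"
    count["$key"]=$(( ${count["$key"]:-0} + 1 ))
done < urls.txt
for domain in "${!count[@]}"; do
    printf '%s : %d\n' "$domain" "${count[$domain]}"
done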

Bash script to get server health

I'm looking to monitor some aspects of a farm of servers that are necessary for the application that runs on them.
Basically, I'm looking to have a file on each machine which, when accessed via HTTP (on a VLAN) with curl, will spit out the information I'm looking for, which I can then log into the database with a daemon that sits in a loop and checks the health of all the servers one by one.
The info I'm looking to get is:
<load>server load</load>
<free>md0 free space in MB</free>
<total>md0 total space in MB</total>
<processes># of nginx processes</processes>
<time>timestamp</time>
What's the best way of doing that?
EDIT: We are using Cacti and OpenNMS; however, what I'm looking for here is data that is necessary for the application that runs on these servers. I don't want to complicate it by making it rely on any 3rd-party software to fetch this basic data, which can be gotten with a few Linux commands.
Make a cron entry that:
executes a shell script every few minutes (or whatever frequency you want)
saves the output in a directory that's published by the web server
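For example, a crontab entry along these lines (the script path, output location, and 5-minute interval are just assumptions):
# run the health script every 5 minutes and publish the result via the web server
*/5 * * * * /usr/local/bin/server-health.sh > /var/www/html/health.xml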
Assuming your text is literally what you want, this will get you 90% of the way there:
#!/usr/bin/env bash
LOAD=$(uptime | cut -d: -f5 | cut -d, -f1)       # 1-minute load average (field positions depend on uptime's output format)
FREE=$(df -m / | tail -1 | awk '{ print $4 }')   # free MB on the root filesystem
TOTAL=$(df -m / | tail -1 | awk '{ print $2 }')  # total MB on the root filesystem
PROCESSES=$(ps aux | grep '[n]ginx' | wc -l)     # [n]ginx so the grep itself is not counted
TIME=$(date)                                     # timestamp
cat <<-EOF
<load>$LOAD</load>
<free>$FREE</free>
<total>$TOTAL</total>
<processes>$PROCESSES</processes>
<time>$TIME</time>
EOF
Sample output:
<load> 0.05</load>
<free>9988</free>
<total>13845</total>
<processes>6</processes>
<time>Wed Apr 18 22:14:35 CDT 2012</time>
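On the monitoring side, the polling daemon described in the question can be little more than a loop around curl (a sketch; servers.txt, the URL path, and the 60-second interval are assumptions, and the database insert is left as a stub):
#!/bin/bash
# Poll each server's health file over HTTP; hand the XML to whatever loads it into the database.
while true; do
    while IFS= read -r host; do
        xml=$(curl -s --max-time 5 "http://$host/health.xml") || continue
        printf '%s %s %s\n' "$(date +%s)" "$host" "$xml"   # stub: replace with your DB insert
    done < servers.txt
    sleep 60
done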
