I have the command below, which prints out hits, host IP (local server/load balancer) and external IP (the one causing the hit). I would also like to print the User-Agent information alongside. How can this be achieved, please?
cat access.log | sed -e 's/^\([[:digit:]\.]*\).*"\(.*\)"$/\1 \2/' | sort -n | uniq -c | sort -nr | head -20
What I get is below...
Hits, Host IP, External IP
What I'd like if possible...
Hits, IP (host example), External IP (causing the hit), User Agent
10000 192.168.1.1 148.285.xx.xx Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/98 Safari/537.4
Attached below is an excerpt from the log
192.168.xxx.x - - [10/Jun/2019:12:40:15 +0100] "GET /company-publications/152005 HTTP/1.1" 200 55848 "google.com" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6" "xx.xx.xx.xx"
If GNU AWK (gawk) is available, please try the following:
awk -v FPAT='(\"[^"]+\")|(\\[[^]]+])|([^ ]+)' '
{ gsub("\"", "", $9); gsub("\"", "", $10); print $1 " " $10 " " $9 }
' access.log | sort -n | uniq -c | sort -nr | head -20
The value of FPAT is a regex describing each field in access.log, i.e. "a string surrounded by double quotes", "a string surrounded by square brackets" or "a string delimited by whitespace".
Each line of access.log is then split into fields: $1 for the host IP, $10 for the external IP, and $9 for the user agent.
Related
I want to parse an Apache log file such as:
1.1.1.1 - - [12/Dec/2019:18:25:11 +0100] "GET /endpoint1/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
1.1.1.1 - - [13/Dec/2019:18:25:11 +0100] "GET /endpoint1/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
2.2.2.2 - - [13/Dec/2019:18:27:11 +0100] "GET /endpoint1/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
2.2.2.2 - - [13/Jan/2020:17:15:13 +0100] "GET /endpoint2/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
3.3.3.3 - - [13/Jan/2020:17:15:13 +0100] "GET /endpoint2/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
1.1.1.1 - - [13/Feb/2020:17:15:13 +0100] "GET /endpoint2/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
4.4.4.4 - - [13/Feb/2020:17:15:13 +0100] "GET /endpoint2/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
4.4.4.4 - - [13/Feb/2020:17:15:13 +0100] "GET /endpoint2/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
4.4.4.4 - - [13/Feb/2020:17:15:13 +0100] "GET /endpoint2/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
I need to get the list of client IPs that visited, per month. I have something like this:
awk '{print $1,$4}' access.log | grep Dec | cut -d" " -f1 | uniq -c
but this is wrong, because it counts visits per IP per day.
The expected result should be like (indentation doesn't matter):
Dec 2019
1.1.1.1 2
2.2.2.2 1
Jan 2020
2.2.2.2 1
3.3.3.3 1
Feb 2020
4.4.4.4 3
1.1.1.1 1
where 2 is the total number of visits from the IP 1.1.1.1 during Dec 2019.
Could you suggest an approach for how to do it?
Here is one for GNU awk that outputs in the order the data was fed in (i.e. chronological data such as log records is output in that order):
$ gawk '                                 # using GNU awk
BEGIN {
    a[""][""]                            # initialize a as a 2D array
}
{
    split($4, t, /[/:]/)                 # split the datetime field
    my = t[2] OFS t[3]                   # my = "month year"
    if (!(my in mye)) {                  # if this month-year is unseen
        mye[my] = ++myi                  # record a new index for it
        mya[myi] = my                    # index -> month-year, preserving chronology
    }
    a[mye[my]][$1]++                     # count this hit for the month-year and IP
}
END {                                    # in the end
    # PROCINFO["sorted_in"] = "@val_num_desc"  # this may work for ordering visits
    for (i = 1; i <= myi; i++) {         # in fed order
        print mya[i]                     # print month year
        for (j in a[i])                  # then related IPs in no particular order
            print j, a[i][j]             # output IP and count
    }
}' file
Output:
Dec 2019
1.1.1.1 2
2.2.2.2 1
Jan 2020
2.2.2.2 1
3.3.3.3 1
Feb 2020
1.1.1.1 1
4.4.4.4 3
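If GNU awk isn't available, here is a hedged portable sketch of the same grouping using any POSIX awk plus sort and uniq; the synthetic yyyymm key exists only so the months sort chronologically, and IPs within a month come out in lexical rather than count order:

```shell
# Emit "yyyymm month year ip" per hit, let sort/uniq do the counting,
# then print a month header whenever the yyyymm key changes.
awk 'BEGIN { split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m)
             for (i in m) num[m[i]] = sprintf("%02d", i) }
     { split($4, t, /[\/:]/)                  # t[2]=month name, t[3]=year
       print t[3] num[t[2]], t[2], t[3], $1 }' access.log |
sort | uniq -c |
awk '{ if ($2 != prev) { print $3, $4; prev = $2 }   # new month -> header
       print $5, $1 }'                               # ip count
```

The final awk reshapes the `count yyyymm month year ip` lines from uniq -c back into the month/IP/count layout of the expected output.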
Though your sample expected output doesn't seem to match your shown samples, based on your sample output and description, could you please try the following. Since this is a log file with a fixed pattern, I will go with awk's field-separator approach.
awk -F':| |-|/+|]' '
{
    ind[$7 OFS $8 OFS $1]++
    value[$7 OFS $8 OFS $1]=$1
}
END{
    for(i in value){
        split(i,arr," ")
        print arr[1],arr[2] ORS value[i],ind[i]
    }
}' Input_file
Explanation: Adding a detailed explanation for the above.
awk -F':| |-|/+|]' '   ##Start the awk program, setting the field separators to : space - / ] here.
{
    ind[$7 OFS $8 OFS $1]++          ##Create array ind whose index is the 7th, 8th and 1st fields; increment its value by 1 on each occurrence.
    value[$7 OFS $8 OFS $1]=$1       ##Create array value with the same index; its value is the 1st field.
}
END{                                 ##Start the END block of this program here.
    for(i in value){                 ##Traverse the elements of value here.
        split(i,arr," ")             ##Split i into array arr with space as the delimiter.
        print arr[1],arr[2] ORS value[i],ind[i]   ##Print the 1st and 2nd elements of arr, then ORS (a newline), then the value and ind entries.
    }
}' Input_file                        ##Mention the Input_file name here.
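To see why $7, $8 and $1 land on the month, year and IP with that separator set, a hypothetical one-line check helps (the adjacent separators each produce an empty field in between):

```shell
printf '%s\n' '1.1.1.1 - - [12/Dec/2019:18:25:11 +0100] "GET / HTTP/1.1" 200 1 "-" "UA" "-"' |
awk -F':| |-|/+|]' '{ print $1, $7, $8 }'
# prints: 1.1.1.1 Dec 2019
```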
Try this shell script:
#!/usr/bin/env bash
LOG_FILE=$1

# regex to find mmm/yyyy
dateUniq=$(grep -oP '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\/\d{4}' "$LOG_FILE" | sort | uniq)

for i in $dateUniq
do
    # output mmm yyyy
    echo "$i" | sed 's/\// /g'

    # regex to find ip
    ipUniq=$(grep "$i" "$LOG_FILE" | grep -oP '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)' | sort | uniq)

    for x in $ipUniq
    do
        count=$(grep "$i" "$LOG_FILE" | grep -c "$x")
        # output count ip
        echo "$count $x"
    done
    echo
done
output:
Dec 2019
2 1.1.1.1
1 2.2.2.2
Feb 2020
1 1.1.1.1
3 4.4.4.4
Jan 2020
1 2.2.2.2
1 3.3.3.3
For a quick summary of the access log, just run the command below:
cat /var/log/apache2/access.log|awk '{print $1}'|sort -nr |uniq -c |sort -nr |head -n 25
I have a pattern as below in a Unix file:
{1.11.111.111 - 2017-10-06T00:00:00+00:00 111111 1 302 "GET /abcd/z1/bcdfgggg?values" uri="/abcd/v2/nano" 111 111 0 "-" "abcd/2.1.0 (Linux; U; Android 8.1.0; Redmi Note 6 Pro MIUI/V10.2.2.0.bcdwvc)" "1111:1111:111:1111:11:d11e:c11c:111a" cu=0.011 nano=0.011 var="-12345" "1111:1111:111:1111:11:d11e:c11c:111a, 11.111.111.111"}
I am trying to print the below result but the result is not printed as expected.
Code:
cat test.txt | awk -F'"' '{ print $1,$9}' | awk -F' ' '{ print $3,$6,$24}'
Actual Result: 2017-10-06T00:00:00+00:00 302
Expected Result: 2017-10-06T00:00:00+00:00 302 cu=0.011
With GNU sed and a regex with three backreferences:
sed -r 's/.* ([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:+]{14}) [0-9]+ [0-9]+ ([0-9]{3}) .*(cu=[0-9.]+).*/\1 \2 \3/' file
Output:
2017-10-06T00:00:00+00:00 302 cu=0.011
See: The Stack Overflow Regular Expressions FAQ
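To verify, the OP's sample line can be fed through the command with a here-string (GNU sed assumed for -r):

```shell
line='{1.11.111.111 - 2017-10-06T00:00:00+00:00 111111 1 302 "GET /abcd/z1/bcdfgggg?values" uri="/abcd/v2/nano" 111 111 0 "-" "abcd/2.1.0 (Linux; U; Android 8.1.0; Redmi Note 6 Pro MIUI/V10.2.2.0.bcdwvc)" "1111:1111:111:1111:11:d11e:c11c:111a" cu=0.011 nano=0.011 var="-12345" "1111:1111:111:1111:11:d11e:c11c:111a, 11.111.111.111"}'
sed -r 's/.* ([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:+]{14}) [0-9]+ [0-9]+ ([0-9]{3}) .*(cu=[0-9.]+).*/\1 \2 \3/' <<< "$line"
# prints: 2017-10-06T00:00:00+00:00 302 cu=0.011
```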
Assuming the log entry will always look as presented by the OP:
pattern='{1.11.111.111 - 2017-10-06T00:00:00+00:00 111111 1 302 "GET /abcd/z1/bcdfgggg?values" uri="/abcd/v2/nano" 111 111 0 "-" "abcd/2.1.0 (Linux; U; Android 8.1.0; Redmi Note 6 Pro MIUI/V10.2.2.0.bcdwvc)" "1111:1111:111:1111:11:d11e:c11c:111a" cu=0.011 nano=0.011 var="-12345" "1111:1111:111:1111:11:d11e:c11c:111a, 11.111.111.111"}'
awk -F ' ' '{print $3,$6,$25}' <<< "$pattern"
Output: 2017-10-06T00:00:00+00:00 302 cu=0.011
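When in doubt about a field number, a small throwaway loop prints the index of the field you are after, which is how $25 can be confirmed here (cu=0.011 is taken from the sample line):

```shell
pattern='{1.11.111.111 - 2017-10-06T00:00:00+00:00 111111 1 302 "GET /abcd/z1/bcdfgggg?values" uri="/abcd/v2/nano" 111 111 0 "-" "abcd/2.1.0 (Linux; U; Android 8.1.0; Redmi Note 6 Pro MIUI/V10.2.2.0.bcdwvc)" "1111:1111:111:1111:11:d11e:c11c:111a" cu=0.011 nano=0.011 var="-12345" "1111:1111:111:1111:11:d11e:c11c:111a, 11.111.111.111"}'
awk '{ for (i = 1; i <= NF; i++) if ($i == "cu=0.011") print i }' <<< "$pattern"
# prints: 25
```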
I have a file named transfer.log that has a few Apache logs. I need to count how many GET requests each IP address has logged. I know how to access the file and loop through its lines, but I am having trouble comparing the 6th field of each line with "GET".
#!/bin/bash
while read p;
do
    name=( $(awk '{print $6}' p))
    echo $name
    if [ "$name" == "GET" ]
    then
        echo "yes"
    else
        echo "no"
    fi
done < transfer.log
Currently, when I run the script, "no" is printed 5 times and I receive an error that awk cannot open the file "p". When I change p to transfer.log in the variable assignment, echo $name outputs "GET (with the leading quotation mark), but it obviously never changes, because awk is reading the entire file rather than the current line p.
I need to know how to assign the 6th column of p to my variable name on each iteration of the while loop. Also, I am confused as to why my loop only iterates 5 times and not 6.
My transfer.log looks like this:
140.211.167.27 - - [15/Oct/2012:23:11:38 +0000] "GET / HTTP/1.1" 200 2963 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4"
140.211.167.27 - - [15/Oct/2012:23:11:46 +0000] "GET /systems/ganeti/index HTTP/1.1" 200 5918 "https://wiki.osuosl.org/systems/index" "Mozilla/5.0(X11; Linux x86_64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4"
140.211.167.9 - - [15/Oct/2012:23:17:33 +0000] "GET /resources/index HTTP/1.1" 200 3411 "https://wiki.osuosl.org/index" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1"
140.211.167.25 - - [15/Oct/2012:16:02:07 +0000] "GET /index HTTP/1.1" 200 2673 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1"
66.249.74.101 - - [15/Oct/2012:02:20:14 +0000] "GET /robots.txt HTTP/1.1" 404 2458 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
128.193.164.34 - - [15/Oct/2012:12:41:18 +0000] "POST /rpc/xmlrpc HTTP/1.0" 200 8328 "-" "PHP XMLRPC 1.0"
Ultimately, I need to count how many GET requests each specific IP address has logged and sort the addresses by least to greatest GET requests.
You can use the following awk command to do the trick:
$ awk '{if($6=="\"GET")ip[$1]++; else ip[$1]+=0}END{for(elem in ip){print elem, ip[elem]}}' input.log | sort -k2nr
140.211.167.27 2
140.211.167.25 1
140.211.167.9 1
66.249.74.101 1
128.193.164.34 0
Explanations:
{if($6=="\"GET")ip[$1]++; else ip[$1]+=0} — on each line, this checks the 6th field: if it equals "GET (the opening quote is part of the field), it increments an array indexed by the IP; otherwise it adds 0 to the array, so that IPs which only made POST requests are still listed. You can remove this logic if you do not need it.
Then, at the end of the file, it prints each IP together with its number of GETs.
Everything is piped to a sort command that orders the output by the second field in reverse numerical order.
The following line is wrong:
name=( $(awk '{print $6}' p))
You should replace it with:
name=$(echo "$p" | awk '{print $6}')
You passed p, the literal variable name, to awk, where a file name was expected. Also, the outer parentheses were redundant.
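A minimal check of the corrected line on one hypothetical log line:

```shell
p='140.211.167.27 - - [15/Oct/2012:23:11:38 +0000] "GET / HTTP/1.1" 200 2963 "-" "UA"'
name=$(echo "$p" | awk '{print $6}')
echo "$name"
# prints: "GET   (the opening double quote is still part of the field)
```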
I tried to parse the log file; for your reference:
#!/bin/bash
howmanyGET=0
loopcounter=0
while read line; do
    #echo "Line # $loopcounter: $line"
    ((loopcounter++))
    name=$(echo $line | awk '{print $6}')
    #name=( $(awk '{print $6}' p))
    #echo $name
    name=${name:1:3}
    echo $name
    if [ "$name" == "GET" ]
    then
        echo "yes"
        ((howmanyGET++))
    else
        echo "no"
    fi
done < transfer.log

echo "GET: $howmanyGET"
echo "loop: $loopcounter"
output here:
$ bash counter.sh
GET
yes
GET
yes
GET
yes
GET
yes
GET
yes
POS
no
GET: 5
loop: 6
Hope this helps.
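For the final requirement (GET count per IP, sorted least to greatest), the loop isn't strictly needed; a hedged awk one-liner over transfer.log does it in one pass (note that $6 still carries the opening double quote, and IPs with no GETs at all are simply omitted here, unlike the earlier answer that prints them with 0):

```shell
awk '$6 == "\"GET" { c[$1]++ } END { for (ip in c) print c[ip], ip }' transfer.log |
sort -n
```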
Hoping someone can help me with a bash Linux script to generate a report from HTTP logs.
Logs format:
domain.com 101.100.144.34 - r.c.bob [14/Feb/2017:11:31:20 +1100] "POST /webmail/json HTTP/1.1" 200 1883 "https://example.domain.com/webmail/index-rui.jsp?v=1479958955287" "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko" 1588 2566 "110.100.34.39" 9FC1CC8A6735D43EF75892667C08F9CE 84670 - - - -
Output required:
time in epoch,host,Resp Code,count
1485129842,101.100.144.34,200,4000
1485129842,101.101.144.34,404,1889
Here is what I have so far, but it is nothing near what I am trying to achieve:
tail -100 httpd_access_*.log | awk '{print $5 " " $2 " " $10}' | sort | uniq
awk 'BEGIN{
    # print header
    print "time in epoch,host,Resp Code,count"
    # prepare month conversion array
    split( "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", tmp)
    for (i in tmp) M[tmp[i]]=i
}
{
    # prepare the time for mktime():
    #   from "[14/Feb/2017:11:31:20"
    #   to   "YYYY MM DD HH MM SS"
    split( $5, aT, /[:\/[:blank:]]/)   # aT: "[14" "Feb" "2017" "11" "31" "20"
    t = aT[3] " " M[aT[2]] " " substr( aT[1], 2) " " aT[4] " " aT[5] " " aT[6]
    # count (not clear if this is the right thing to count, due to the time changing)
    Count[ sprintf( "%s, %s, %s", mktime( t), $2, $10)]++
}
END{
    # display the counted result
    for( e in Count) printf( "%s, %d\n", e, Count[e])
}
' httpd_access_*.log
count needs to be described more specifically, to be sure about the criteria to count
needs GNU awk for the mktime() function
assumes the time is always in this format
no security checks nor filtering (not the purpose of this)
Sure, the pure awk-based solution above would be much faster and more complete.
But it can also be done in smaller steps:
First get the date and convert it to epoch:
$ dt=$(awk '{print $5,$6}' file.log)
$ ep=$(date -d "$(sed -e 's,/,-,g' -e 's,:, ,' <<<"${dt:1:-1}")" +"%s")
$ echo "$ep"
1487032280
Now that you have the epoch date in the bash variable $ep, you can continue with your initial awk like this:
$ awk -v edt=$ep '{print edt","$2","$10}' file.log
1487032280,101.100.144.34,200
If you want a header, you can just print one with a simple echo before the last awk.
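Note that the two steps above read the whole file at once, so they only work as-is for a single-line log; here is a hedged per-line sketch of the same idea (GNU date assumed for -d):

```shell
while read -r line; do
    dt=$(awk '{print $5, $6}' <<< "$line")      # e.g. [14/Feb/2017:11:31:20 +1100]
    ep=$(date -d "$(sed -e 's,/,-,g' -e 's,:, ,' <<< "${dt:1:-1}")" +"%s")
    awk -v edt="$ep" '{print edt","$2","$10}' <<< "$line"
done < file.log
```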
my httpd log has the following format
123.251.0.000 - - [05/Sep/2014:18:19:24 -0700] "GET /myapp/MyService?param1=value1&param2=value2&param3=value3 HTTP/1.1" 200 15138 "-" "-"
I need to extract the following fields and display on a line:
IP value1 httpResponseCode (e.g. 200), dataLength
What's the most efficient way to do this in bash?
As you're using Linux, chances are that you also have GNU awk installed. If so:
$ awk 'match ($7, /param1=([^& ]*)/, m) { print $1, m[1], $9",", $10 }' http.log
gives:
123.251.0.000 value1 200, 15138
This works as long as value1 hasn't got an ampersand or space in it, which they shouldn't if the request has been escaped correctly.
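If gawk's three-argument match() isn't available, a hedged portable sketch splits $7 on the URL delimiters instead and looks the parameter up by name:

```shell
printf '%s\n' '123.251.0.000 - - [05/Sep/2014:18:19:24 -0700] "GET /myapp/MyService?param1=value1&param2=value2 HTTP/1.1" 200 15138 "-" "-"' |
awk '{ n = split($7, kv, /[?&=]/)              # kv: path, name, value, name, value, ...
       for (i = 2; i < n; i += 2) if (kv[i] == "param1") v = kv[i+1]
       print $1, v, $9",", $10 }'
# prints: 123.251.0.000 value1 200, 15138
```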
$ cat tmp.txt
123.251.0.000 - - [05/Sep/2014:18:19:24 -0700] "GET /myapp/MyService?param1=value1&param2=value2&param3=value3 HTTP/1.1" 200 15138 "-" "-"
$ awk '{ print "IP", $1, $9, $10 }' tmp.txt
IP 123.251.0.000 200 15138