Ignoring requests to images etc. when grepping server logs - linux

I'm looking to pull out various metrics from some server logs. The first is the total number of requests to just pages, not images, CSS files etc.
So I want to include requests like:
140.77.167.177 - - [01/Apr/2016:22:40:09 +1100] "GET /bad-credit-loans/abc/ HTTP/1.1" 200 7532 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
but ignore requests like:
158.165.213.180 - - [01/Apr/2016:23:00:55 +1100] "GET /assets/img/lenders/png/insurance.png HTTP/1.1" 200 17866 "https://www.example.au/lp/tradie-loans/?utm_source=facebook&utm_medium=cpc&utm_content=mobilead&utm_campaign=abcs/" "Mozilla/5.0 (Linux; Android 5.1.1; SM-G920I Build/LMY47X; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/48.0.2564.106 Mobile Safari/537.36 [FB_IAB/FB4A;FBAV/70.0.0.22.83;]"
grep "GET " | wc -l will get me all requests; how to I disregard those that are in a range (*.png, .css, .jpg and .js), and how do I extend this to ignore any file?

You can do:
grep -Ev '\.(png|jpg|css|js)' file.log
-E enables extended regular expressions (for the alternation) and -v inverts the match, so only lines without those extensions survive; use grep -c instead of piping to wc -l if you just want the count.
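To extend this to ignore requests for any file, one approach is to anchor the match to the request line itself and exclude any path ending in a dot-extension. A minimal sketch, assuming file.log is your log and paths carry no query strings:
# Count page requests only: drop any GET whose path ends in ".<ext>"
# right before the protocol (query strings would need extra handling)
grep '"GET ' file.log | grep -Ev '"GET [^" ]*\.[A-Za-z0-9]+ HTTP' | wc -l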

Why would \u0000 (null characters) end up in an HTTP response

I'm using cURL (or Node request) and see \u0000 scattered throughout the response and don't understand why they appear.
I figured out I can remove them with response.body.replace(/\u0000/g, ''), but I'd like to understand the source to see if there's a better way.
I've played around with request headers but don't understand where these characters come from.
Furthermore, when I browse the site in my browser I don't see them, yet when I copy the request (Chrome's "Copy as cURL" option) into a terminal I do see them.
Is there some request header or some other way I should be removing/detecting these unicode characters?
Example request headers using Node request:
{ 'Content-Type': '*/*; charset=utf-8',
accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'
}
Example cURL (and my ~/.curlrc is empty):
curl 'http://example.com' -H 'Pragma: no-cache' -H 'DNT: 1' \
-H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: en-US,en;q=0.8,la;q=0.6' \
-H 'Upgrade-Insecure-Requests: 1' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' \
-H 'Cache-Control: no-cache' -H 'Connection: keep-alive' --compressed
The response headers:
HTTP/1.1 200 OK
Server: nginx
Date: Sat, 04 Nov 2017 14:50:13 GMT
Content-Type: text/plain
Last-Modified: Thu, 02 Nov 2017 19:02:13 GMT
Transfer-Encoding: chunked
Connection: close
ETag: W/"59fc6be1-41e2"
Content-Encoding: gzip
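One way to check whether the NUL bytes are really present in the decoded body (rather than introduced by your client) is to inspect the bytes directly. A minimal diagnostic sketch, with example.com standing in for the real host:
# Show each NUL in context (od -c prints NUL bytes as \0)
curl -s --compressed 'http://example.com' | od -c | grep '\\0'
# Or just count the NUL bytes in the decompressed body
curl -s --compressed 'http://example.com' | tr -cd '\0' | wc -c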

Sum up values in first column for all identical strings in the second column

Let's say I have the following 2 files with entries such as these (number, IP and User-agent):
30000 11.11.11.11 Dalvik/2.1.0 Linux
10000 22.22.22.22 GetintentCrawler getintent.com
5000 33.33.33.33 Mozilla/5.0 X11; Linux i686 AppleWebKit/537.36 KHTML, like Gecko Chrome/43.0.2357.130 Safari/537.36
3000 44.44.44.44 Mozilla/5.0 Macintosh; Intel Mac OS X 10_6_8 AppleWebKit/534.59.10 KHTML, like Gecko Version/5.1.9 Safari/534.59.10
1000 55.55.55.55 Dalvik/1.6.0 Linux; U; Android 4.1.2; Orange Yumo Build/OrangeYumo
and
6000 44.44.44.44 Mozilla/5.0 Macintosh; Intel Mac OS X 10_6_8 AppleWebKit/534.59.10 KHTML, like Gecko Version/5.1.9 Safari/534.59.10
3000 33.33.33.33 Mozilla/5.0 X11; Linux i686 AppleWebKit/537.36 KHTML, like Gecko Chrome/43.0.2357.130 Safari/537.36
2000 11.11.11.11 Dalvik/2.1.0 Linux
600 55.55.55.55 Dalvik/1.6.0 Linux; U; Android 4.1.2; Orange Yumo Build/OrangeYumo
500 22.22.22.22 GetintentCrawler getintent.com
I want to be able to sum up the first column for all identical IPs (the second column), while also keeping all the subsequent columns with the user-agent. Also, the final output should be sorted by first column.
So the result should basically look like this:
32000 11.11.11.11 Dalvik/2.1.0 Linux
10500 22.22.22.22 GetintentCrawler getintent.com
9000 44.44.44.44 Mozilla/5.0 Macintosh; Intel Mac OS X 10_6_8 AppleWebKit/534.59.10 KHTML, like Gecko Version/5.1.9 Safari/534.59.10
8000 33.33.33.33 Mozilla/5.0 X11; Linux i686 AppleWebKit/537.36 KHTML, like Gecko Chrome/43.0.2357.130 Safari/537.36
1600 55.55.55.55 Dalvik/1.6.0 Linux; U; Android 4.1.2; Orange Yumo Build/OrangeYumo
So far I came up with this, but I lose the whole user-agent string and I also feel that I'm overcomplicating things:
cat file1.txt file2.txt file3.txt | awk '{arr[$2]+=$1;} END {for (i in arr) print i, arr[i]}' | awk '{ print $2" "$1 }' | sort -rn
You can use this GNU awk command:
awk 'BEGIN{PROCINFO["sorted_in"]="#ind_num_asc"} {
p=$1; $1=""; a[$0]+=p} END{for (i in a) print a[i] i}' file1 file2
BEGIN{PROCINFO["sorted_in"]="#ind_num_asc"} makes the for (i in a) loop visit the array keys in ascending numeric order (here, by IP). Blanking $1 turns the rest of the record (IP plus user-agent) into the array key, so identical lines accumulate into one sum.
Output:
32000 11.11.11.11 Dalvik/2.1.0 Linux
10500 22.22.22.22 GetintentCrawler getintent.com
8000 33.33.33.33 Mozilla/5.0 X11; Linux i686 AppleWebKit/537.36 KHTML, like Gecko Chrome/43.0.2357.130 Safari/537.36
9000 44.44.44.44 Mozilla/5.0 Macintosh; Intel Mac OS X 10_6_8 AppleWebKit/534.59.10 KHTML, like Gecko Version/5.1.9 Safari/534.59.10
1600 55.55.55.55 Dalvik/1.6.0 Linux; U; Android 4.1.2; Orange Yumo Build/OrangeYumo
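Note that this output is grouped by IP, not sorted by the summed first column as the question asks. To sort by the sum instead, you can drop the sorted_in setting and pipe through sort; a minimal sketch:
awk '{count=$1; $1=""; sum[$0]+=count}   # key = IP + user-agent (everything after the count)
END{for (k in sum) print sum[k] k}' file1 file2 | sort -rn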

Block User Agent when it is a number - htaccess

I've been receiving a lot of visits to my site from bad bots.
The pattern is this:
190.204.58.162 - - [20/Oct/2014:16:46:54 +0200] "GET / HTTP/1.0" 200 318 mysite.com "-" "881087" "-"
201.243.204.1 - - [20/Oct/2014:16:46:54 +0200] "GET / HTTP/1.0" 200 318 mysite.com "-" "442762" "-"
200.109.59.218 - - [20/Oct/2014:16:46:54 +0200] "GET / HTTP/1.0" 200 318 mysite.com "-" "717724" "-"
113.140.25.4 - - [20/Oct/2014:16:46:54 +0200] "GET / HTTP/1.1" 200 318 mysite.com "-" "360319" "-"
183.136.221.6 - - [20/Oct/2014:16:46:54 +0200] "GET / HTTP/1.1" 200 318 mysite.com "-" "989851" "-"
195.154.78.122 - - [20/Oct/2014:16:46:54 +0200] "GET / HTTP/1.0" 200 318 mysite.com "-" "122984" "-"
59.151.103.52 - - [20/Oct/2014:16:46:54 +0200] "GET / HTTP/1.1" 200 318 mysite.com "-" "375843" "-"
Different IP and different user-agent.
However, the user-agent is always numeric, normally six digits long.
For example, on the first line the user-agent is "881087" instead of something like "Chrome", "Opera" or "Safari".
Does anyone know how to block it via .htaccess?
You can block this at the application level, depending on your platform (PHP, .NET, etc.): check whether the User-Agent is numeric and, if so, stop the response, e.g. is_numeric() plus die() in PHP, or Response.End() in .NET.
For .htaccess itself, a regex on the User-Agent does the job; see the sketch below.
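A minimal mod_rewrite sketch (assumes mod_rewrite is enabled; returns 403 Forbidden when the User-Agent is all digits):
RewriteEngine On
# Block requests whose User-Agent consists solely of digits
RewriteCond %{HTTP_USER_AGENT} ^[0-9]+$
RewriteRule .* - [F,L]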

apache web server logging - custom log files vs. general log file

When I don't specify a log file in the virtual host sections of my conf file, the logs are written to the file specified in httpd.conf (access_log).
A log-entry would look like this:
SOMEIP - - [22/Jan/2013:18:34:08 +0100] "GET / HTTP/1.1" 200 1752 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/SOMEIP Safari/537.17"
SOMEIP - - [22/Jan/2013:18:34:08 +0100] "GET /img/homepage_bg.png HTTP/1.1" 304 - "http://DOMAIN/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/SOMEIP Safari/537.17"
But when I define a log file in the virtual host section the new log file contains different information:
SOMEIP - - [22/Jan/2013:18:33:34 +0100] "GET / HTTP/1.1" 200 1752
SOMEIP - - [22/Jan/2013:18:33:34 +0100] "GET /img/homepage_bg.png HTTP/1.1" 304 -
I define the log file like this:
CustomLog logs/DOMAIN-access_log common
Why does a custom log contain less information than the general log where all virtual hosts log in by default?
You need to redefine the "common" nickname with a log format that includes the user-agent:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{User-agent}i\"" common
You didn't say which flavour of Linux you're using. Any decently configured Apache (for example on Debian-based distributions like Ubuntu or Mint) already ships a suitable LogFormat containing the user-agent. Look for all the lines matching LogFormat; you should find something like this:
LogFormat "%v:%p %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%{Referer}i -> %U" referer
LogFormat "%{User-agent}i" agent
Just use the combined (or even vhost_combined) nickname for your log file:
CustomLog logs/DOMAIN-access_log combined
You should also have a look at the Apache documentation on custom log formats.
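Putting it together, a minimal virtual host sketch using the combined nickname (DOMAIN is a placeholder):
<VirtualHost *:80>
    ServerName DOMAIN
    # combined logs referrer and user-agent, which common omits
    CustomLog logs/DOMAIN-access_log combined
</VirtualHost>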

Question about how to make a filter using a script

I'm trying to write a script that filters log lines like this:
Before:
123.125.66.126 - - [05/Apr/2010:09:18:12 -0300] "GET / HTTP/1.1" 302 290
66.249.71.167 - - [05/Apr/2010:09:18:13 -0300] "GET /robots.txt HTTP/1.1" 404 290
66.249.71.167 - - [05/Apr/2010:09:18:13 -0300] "GET /~leonardo_campos/IFBA/Web_Design_Aula_17.pdf HTTP/1.1" 404 324
After:
[05/Apr/2010:09:18:12 -0300] / 302 290
[05/Apr/2010:09:18:13 -0300] /robots.txt 404 290
[05/Apr/2010:09:18:13 -0300] /~leonardo_campos/IFBA/Web_Design_Aula_17.pdf 404 324
If someone could help, it would be great.
Thanks in advance!
Supporting all HTTP methods:
sed 's#.*\(\[[^]]*\]\).*"[A-Z]* \(.*\) HTTP/[0-9.]*" \(.*\)#\1 \2 \3#'
This seems like a perfect job for sed.
You can easily construct a pair of s/// replacement patterns to remove the unwanted pieces of each line.
sed is your friend here, with regexps (the leading [^[]* skips past the client IP):
sed 's/^[^[]*\(\[.*\]\) "GET \(.*\) .*" \(.*\)$/\1 \2 \3/'
If your file structure is always like that, you can just use fields; there's no need for a complex regex:
$ awk '{print $4,$5,$7,$9,$10}' file
[05/Apr/2010:09:18:12 -0300] / 302 290
[05/Apr/2010:09:18:13 -0300] /robots.txt 404 290
[05/Apr/2010:09:18:13 -0300] /~leonardo_campos/IFBA/Web_Design_Aula_17.pdf 404 324
