Counting requests and status codes per URI in a webserver log - linux

Given a typical webserver log file that contains a mixture of absolute URLs, relative URLs, human requests and bots (some sample lines):
112.77.167.177 - - [01/Apr/2016:22:40:09 +1100] "GET /bad-credit-loans/abc/ HTTP/1.1" 200 7532 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
189.181.124.177 - - [31/Mar/2016:23:10:47 +1100] "GET /build/assets/css/styles-1a879e1b.css HTTP/1.1" 200 31654 "https://www.abc.com.au/customer-reviews/" "Mozilla/5.0 (iPhone; CPU iPhone OS 9_2_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13D15 Safari/601.1"
110.76.15.146 - - [01/Apr/2016:00:25:09 +1100] "GET http://www.abc.com.au/car-loans/low-doc-car-loans/ HTTP/1.1" 301 528 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
I'm looking to list all the URIs requested, each with its status code (200, 302 etc.) and the total count of requests, i.e.
http://www.abc.com.au 301 3,900
/bad-credit-loans/abc/ 200 123
/bad-credit-loans/abc/ 302 7
Were it not for the presence of the varying IP addresses, timestamps, referring URLs, and user agents, I would be able to combine uniq and sort in the standard fashion. Or if I knew all the URLs in advance, I could simply loop over each URL-status code combo with grep in its simplest form.
How do we disregard the varying items (user agents, timestamps etc.) and extract just the URLs and their frequency of status code?

You should just recognize that the interesting parts are always at constant field positions (with respect to space-separated fields).
URL is at position 7 and status code is at position 9.
The rest is trivial. You may e.g. use:
awk '{sum[$7 " " $9]++;tot++;} END { for (i in sum) { printf "%s %d\n", i, sum[i];} printf "TOTAL %d\n", tot;}' LOGFILES
Then pipe the result through sort if you need the output sorted.
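For comparison, here is a rough Python sketch of the same field-position idea. It is only an illustration: the function name count_uri_status and the shortened sample lines are made up for this example, and it assumes the same whitespace-separated layout as the log lines in the question.

```python
from collections import Counter

def count_uri_status(lines):
    """Count (URI, status) pairs; awk fields 7 and 9 are indexes 6 and 8 here."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 9:
            continue  # skip malformed or truncated lines
        counts[(fields[6], fields[8])] += 1
    return counts

# Shortened, made-up sample lines in the same layout as the question's log
sample = [
    '112.77.167.177 - - [01/Apr/2016:22:40:09 +1100] "GET /bad-credit-loans/abc/ HTTP/1.1" 200 7532 "-" "bingbot"',
    '112.77.167.178 - - [01/Apr/2016:22:41:09 +1100] "GET /bad-credit-loans/abc/ HTTP/1.1" 200 7532 "-" "bingbot"',
    '110.76.15.146 - - [01/Apr/2016:00:25:09 +1100] "GET http://www.abc.com.au/car-loans/low-doc-car-loans/ HTTP/1.1" 301 528 "-" "Baiduspider"',
]

# most_common() already returns the pairs sorted by descending count
for (uri, status), n in count_uri_status(sample).most_common():
    print(uri, status, n)
```

Like the awk version, this breaks if a field before the URL ever contains a space (for example an authenticated user name with spaces), so treat it as a starting point.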


Can Microsoft Translator API translate text with special characters?

I am trying to use Microsoft Translator API to translate text from Polish to any other language. In Polish, there are a couple of special characters like "ą", "ś", "ż" etc. When I send the HTTP request with no special characters:
POST /translate?api-version=3.0&from=pl&to=en HTTP/1.1
Ocp-Apim-Subscription-Key: ********
Ocp-Apim-Subscription-Region: ******
Content-Length: 21
Host: api.cognitive.microsofttranslator.com
Connection: close
User-Agent: Apache-HttpClient/4.5.10 (Java/15.0.2)
Accept-Encoding: gzip, deflate
[{"Text": "Gramatyka"}]
I receive a correct translation:
[{"translations":[{"text":"grammar","to":"en"}]}]
However, it is likely that a Polish word or sentence contains special characters:
POST /translate?api-version=3.0&from=pl&to=en HTTP/1.1
Ocp-Apim-Subscription-Key: ********
Ocp-Apim-Subscription-Region: ********
Content-Length: 21
Host: api.cognitive.microsofttranslator.com
Connection: close
User-Agent: Apache-HttpClient/4.5.10 (Java/15.0.2)
Accept-Encoding: gzip, deflate
[{"Text": "Roślina"}]
This request results in error code 400000:
{"error":{"code":400000,"message":"One of the request inputs is not valid."}}
If I change the special characters to standard ones (like change "ś" into "s"), the API does not give a proper translation. For example:
[{"Text": "Roslina"}]
results in:
[{"translations":[{"text":"Roslina","to":"en"}]}]
Whereas "roślina" should translate to "plant".
This problem applies to other languages too. For example German:
[{"Text": "Wörterbuch"}]
results in an 400000 error as well.
Has anyone found a solution to this?
Did you try checking the language detection score, just to see whether the text is being recognized as Polish? Can you try without the "from" parameter? Also make sure you send all the headers:
curl -X POST "https://api.cognitive.microsofttranslator.com/translate?api-version=3.0&to=zh-Hans" -H "Ocp-Apim-Subscription-Key: " -H "Content-Type: application/json; charset=UTF-8" -d "[{'Text':'Hello, what is your name?'}]"
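One more thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: the failing request in the question declares Content-Length: 21, but "ś" takes two bytes in UTF-8, so a length counted in characters comes out one byte short for that body. The Python sketch below only builds the headers and body (it does not call the API); build_translate_body is a made-up helper name.

```python
import json

def build_translate_body(text):
    """Return (headers, body) with the body encoded as UTF-8 bytes."""
    # ensure_ascii=False keeps "ś" as a real character instead of a \u escape
    payload = json.dumps([{"Text": text}], ensure_ascii=False)
    body = payload.encode("utf-8")
    headers = {
        "Content-Type": "application/json; charset=UTF-8",
        # Content-Length must count bytes, not characters
        "Content-Length": str(len(body)),
    }
    return headers, body

headers, body = build_translate_body("Roślina")
print(len(body.decode("utf-8")), "characters,", len(body), "bytes")
```

For "Roślina" this gives 21 characters but 22 bytes. If the HTTP client computes the length from the string instead of the encoded bytes, the server receives a truncated JSON document, which could explain a generic "request inputs not valid" error.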

Read the log files and get the entries between two dates

I want to extract entries from the access log files that match a keyword and fall between two dates. For example, I want to find log entries between two dates that contain the text "passwd". For now, I am using the following command, but it is not giving the correct results:
fgrep "passwd" * | awk '$4 >= "[20/Aug/2017" && $4 <= "[22/Aug/2017"'
Date format is [22/Feb/2017:17:28:42 +0000].
I have searched and also looked at the post "extract data from log file in specified range of time", but I don't fully understand how to use it.
Edit:
The following are example entries from the access log files:
xxx-access_log:xx.xx.xx.xx - - [22/Feb/2017:17:30:02 +0000] "GET /cms/usr/extensions/get_tree.inc.php?GLOBALS[root_path]=/etc/passwd%00 HTTP/1.1" 404 39798
xxx-access_log:xx.xx.xx.xx - - [22/Feb/2017:17:31:12 +0000] "GET /cgi-bin/libs/smarty_ajax/index.php?_=&f=update_intro&page=../../../../../../../../../../../../../../../../../../etc/passwd%00 HTTP/1.1" 404 30083
xxx-access_log:xx.xx.xx.xx - - [22/Feb/2017:17:31:19 +0000] "GET /download/libs/smarty_ajax/index.php?_=&f=update_intro&page=../../../../../../../../../../../../../../../../../../etc/passwd%00 HTTP/1.1" 404 27982
xxx-access_log:xx.xx.xx.xx - - [22/Feb/2017:17:31:24 +0000] "GET /sites/libs/smarty_ajax/index.php?_=&f=update_intro&page=../../../../../../../../../../../../../../../../../../etc/passwd%00 HTTP/1.1" 404 35256
xxx-access_log:xx.xx.xx.xx - - [22/Feb/2017:17:28:32 +0000] "GET /modx/manager/media/browser/mcpuk/connectors/php/Commands/Thumbnail.php?base_path=/etc/passwd%00 HTTP/1.1" 404 6956
xxx-access_log:xx.xx.xx.xx - - [22/Feb/2017:17:28:42 +0000] "GET /modx/manager/media/browser/mcpuk/connectors/php/Commands/Thumbnail.php?base_path=/etc/passwd%00 HTTP/1.1" 404 6956
Thanks for help in advance!
The link you quoted would be used if you know 2 specific strings that appear in your log file. That command will search for the first string and display all lines until it finds the second string and then stops.
In your case, if you want generic date manipulation, you might be better off with perl and one of the date/time modules. Most (if not all) of those have built-in date comparison routines, and many of them will take the date in almost any format imaginable ... and the ones that don't typically provide the ability to specify the date format.
(If you're just using dates and not using times, then Date::EzDate is my favorite, and probably the easiest to learn and implement quickly.)
Shell commands are probably not going to do a good job of date manipulation.
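The same idea works in any language with a real date parser. As a sketch in Python rather than Perl (so not the module named above), the lines can be filtered by parsing the bracketed timestamp and comparing actual datetime values instead of strings. The function name entries_between and the sample lines are invented for illustration, and the %b month parsing assumes English month names (the default C locale).

```python
import re
from datetime import datetime, timezone

# Matches the bracketed timestamp, e.g. [22/Feb/2017:17:28:42 +0000]
TIMESTAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")

def entries_between(lines, keyword, start, end):
    """Yield lines containing keyword whose timestamp satisfies start <= t < end."""
    for line in lines:
        if keyword not in line:
            continue
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # no recognizable timestamp on this line
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        if start <= stamp < end:
            yield line

# Made-up sample lines in the same layout as the question's log
sample = [
    'xx.xx.xx.xx - - [19/Aug/2017:23:59:59 +0000] "GET /a?f=/etc/passwd%00 HTTP/1.1" 404 100',
    'xx.xx.xx.xx - - [21/Aug/2017:10:00:00 +0000] "GET /b?f=/etc/passwd%00 HTTP/1.1" 404 100',
    'xx.xx.xx.xx - - [21/Aug/2017:11:00:00 +0000] "GET /index.html HTTP/1.1" 200 100',
]
start = datetime(2017, 8, 20, tzinfo=timezone.utc)
end = datetime(2017, 8, 23, tzinfo=timezone.utc)  # exclusive, so all of 22 Aug is included

for line in entries_between(sample, "passwd", start, end):
    print(line)
```

Comparing parsed dates avoids the problem with comparing strings such as "[20/Aug/2017": day/month-name/year text does not sort chronologically across months, and a full timestamp on the 22nd compares greater than the bare prefix "[22/Aug/2017".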

CUPS uses the wrong DISPLAY parameter

We are using the Xerox Linux drivers to print on our multifunction printer. Basically, when you print the driver opens a pop-up window to let you choose different printing options, then calls lp for printing.
This works pretty well on a single-user computer, but when several users are logged in at the same time on the machine, the driver doesn't know which DISPLAY to use (:0, :1, :2, etc.). Thus when printing, the pop-up appears on :0 even though the user may be on :1 or :2.
When it comes to printing, the printing subsystem runs as an OS user (lp on Debian). This OS user doesn't have an X session and thus no DISPLAY value. Since DISPLAY isn't set, the driver assumes :0, the typical single-user client display. So with the User Switching mechanism, CUPS does not forward the requesting user's DISPLAY, and the driver falls back to :0. This causes user2's driver interface to be sent to user1's display.
Here is a snippet of the log when printing. You can see I called the process with tech, but lp is the one printing:
localhost - tech [06/May/2016:15:06:42 -0400] "POST / HTTP/1.1" 200 362 Create-Printer-Subscriptions successful-ok
localhost - lp [06/May/2016:15:06:55 -0400] "POST /printers/xeroxtq1 HTTP/1.1" 200 346 Create-Job successful-ok
localhost - lp [06/May/2016:15:06:55 -0400] "POST /printer /xeroxtq1 HTTP/1.1" 200 33861 Send-Document successful-ok
I am not looking for a full walkthrough solution here (if you have one I won't spit on it though), but for some hints on what I should try to do. I've thought of:
1 - Disabling user switching in GNOME3, but this is a last resort solution since it is quite useful for users
2 - Forcing CUPS to call lp with the -o DISPLAY option, grepped from the user that called the process. If this were feasible, it would be quite nice.
3 - Forcing GNOME3 to show the currently active user on :0 and move idle ones to other displays.
I have no idea how #2 could be done and I'm not sure if #3 is feasible.
I've already tweaked GNOME3 to log off users that are idle more than 30min, but it's not enough to solve the problem.
Any help?

Danish date is not formatted correctly in Xpages

In our application, all dates and values should be presented using the browser's locale.
However, when selecting Danish as the locale/language in any web browser, the date formatting is wrong.
We see no errors for English, Swedish or Norwegian formatting, only for Danish.
The dates are formatted as "20/08/15" but should be "20-08-2015".
The server is a Domino 9.0.1 version using Server Locale, and when testing the locale output I see that it is serving "da". Changing the server setting to Browser Locale does not change the date formatting.
This issue has been reported on our servers in different countries.
I have tried to locate an explanation and/or answer to our problem but failed.
The application has no locale specific formatting on any fields, view columns… and we'd like to keep it that way. Our application is run in different countries so not controlling the locale formatting is our preferred way. However we'd like to present the dates and numbers in the language specific correct way.
We do not explicitly use any Dojo components, only plain date fields and view columns in a view panel. We do not have any International Options set.
I have tried to set the locale as in @Sven Hasselbach's answer to another question, but failed. I haven't tried his XSnippet…
An example of the headers:
GET /demo/tradesec.nsf HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate, sdch
Accept-Language: da,sv;q=0.8,no;q=0.6,en-US;q=0.4,en;q=0.2,nl;q=0.2
DNT: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.37 Safari/537.36
X-Chrome-UMA-Enabled: 1
X-Client-Data: CJe2yQEIo7bJAQicksoBCOeUygEI/ZXKAQi8mMoB
HTTP/1.1 200 OK
Connection: Keep-Alive
Content-Encoding: gzip
Content-Length: 8956
Content-Type: text/html;charset=UTF-8
Date: Mon, 17 Aug 2015 07:59:43 GMT
Expires: -1
Keep-Alive: timeout=10, max=100
Tradechannel: Work_and_fun_professionally_done
X-Pad: avoid browser bug
Please advise, thanks!
/M
XPages is using the ICU4J library for date formatting. That library is using '/' as the separator for the Danish short date format.
So code like this:
com.ibm.icu.text.DateFormat.getDateInstance(
com.ibm.icu.text.DateFormat.SHORT,
new java.util.Locale("da")).toPattern()
gives date patterns like:
en: M/d/yy
da: dd/MM/yy
sv: yyyy-MM-dd
nb: dd.MM.yy
You might try using the long or medium date format instead, by setting dateStyle="long" (or "medium") on the converter:
da (long): d. MMM yyyy
output: 17. aug 2015
da (medium): dd/MM/yyyy
output: 17/08/2015
Or if you do need to override the language-specific pattern for Danish then the code would be like:
<xp:viewColumn columnName="_MainTopicsDate" id="viewColumn3">
<xp:viewColumnHeader value="Date" id="viewColumnHeader3"></xp:viewColumnHeader>
<xp:this.converter>
<xp:convertDateTime dateStyle="short"
pattern="${javascript: ('da' == context.getLocale().getLanguage())?
'd-MM-yyyy': null}">
</xp:convertDateTime>
</xp:this.converter>
</xp:viewColumn>
Just a quick thing to check - are there any settings in the browser that you are using that could make any trouble here? Can you confirm that it works correctly using a non-XPages webpage with the correct locale?
I know I have had issues with browsers trying to be "clever" in other situations - so I think it would be a good idea to try and establish if the browser or the XPage is the culprit ;-)
/John

binary protocols v. text protocols

does anyone have a good definition for what a binary protocol is? and what is a text protocol actually? how do these compare to each other in terms of bits sent on the wire?
here's what wikipedia says about binary protocols:
A binary protocol is a protocol which is intended or expected to be read by a machine rather than a human being (http://en.wikipedia.org/wiki/Binary_protocol)
oh come on!
to be more clear, if I have jpg file how would that be sent through a binary protocol and how through a text one? in terms of bits/bytes sent on the wire of course.
at the end of the day if you look at a string it is itself an array of bytes so the distinction between the 2 protocols should rest on what actual data is being sent on the wire. in other words, on how the initial data (jpg file) is encoded before being sent.
Binary protocol versus text protocol isn't really about how binary blobs are encoded. The difference is really whether the protocol is oriented around data structures or around text strings. Let me give an example: HTTP. HTTP is a text protocol, even though when it sends a jpeg image, it just sends the raw bytes, not a text encoding of them.
But what makes HTTP a text protocol is that the exchange to get the jpg looks like this:
Request:
GET /files/image.jpg HTTP/1.0
Connection: Keep-Alive
User-Agent: Mozilla/4.01 [en] (Win95; I)
Host: hal.etc.com.au
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8
Response:
HTTP/1.1 200 OK
Date: Mon, 19 Jan 1998 03:52:51 GMT
Server: Apache/1.2.4
Last-Modified: Wed, 08 Oct 1997 04:15:24 GMT
ETag: "61a85-17c3-343b08dc"
Content-Length: 60830
Accept-Ranges: bytes
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: image/jpeg
<binary data goes here>
Note that this could very easily have been packed much more tightly into a structure that would look (in C) something like
Request:
struct request {
    int requestType;            /* numeric code instead of the text "GET" */
    int protocolVersion;
    char path[1024];
    char user_agent[1024];
    char host[1024];
    long int accept_bitmask;    /* one bit per accepted media type */
    long int language_bitmask;
    long int charset_bitmask;
};
Response:
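#include <stddef.h>  /* size_t */
#include <time.h>    /* time_t */

struct response {
    int responseType;           /* e.g. the integer 200 */
    int protocolVersion;
    time_t date;
    char host[1024];
    time_t modification_date;
    char etag[1024];
    size_t content_length;
    int keepalive_timeout;
    int keepalive_max;
    int connection_type;
    char content_type[1024];
    char data[];                /* flexible array member holding the payload */
};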
Where the field names would not have to be transmitted at all, and where, for example, the responseType in the response structure is an int with the value 200 instead of three characters '2' '0' '0'. That's what a text based protocol is: one that is designed to be communicated as a flat stream of (usually human-readable) lines of text, rather than as structured data of many different types.
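To put numbers on that difference, here is a small Python sketch that encodes the same "200" response status both ways. The binary layout (two unsigned 16-bit integers in network byte order, with 11 standing for version 1.1) is invented for this illustration and is not any real protocol.

```python
import struct

# Text encoding: the status line as HTTP/1.1 sends it
text_status = b"HTTP/1.1 200 OK\r\n"

# Hypothetical binary encoding: protocol version and response type packed
# as two unsigned 16-bit integers in network (big-endian) byte order
binary_status = struct.pack("!HH", 11, 200)

print(len(text_status), text_status)
print(len(binary_status), binary_status.hex())

# The receiver needs the agreed layout to make sense of the four bytes
version, response_type = struct.unpack("!HH", binary_status)
print(version, response_type)
```

The text form takes 17 bytes and can be read without a spec; the packed form takes 4 bytes (000b00c8 in hex) and means nothing until both sides agree on the field order and widths.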
Here's a kind-of cop-out definition:
You'll know it when you see it.
This is one of those cases where it is very hard to find a concise definition that covers all corner cases. But it is also one of those cases where the corner cases are completely irrelevant, because they simply do not occur in real life.
Pretty much all protocols that you will encounter in real life will either look like this:
> fg,m4wr76389b zhjsfg gsidf7t5e89wriuotu nbsdfgizs89567sfghlkf
> b9er t8ß03q+459tw4t3490ß´5´3w459t srt üßodfasdfäasefsadfaüdfzjhzuk78987342
< mvclkdsfu93q45324äö53q4lötüpq34tasä#etr0 awe+s byf eart
[Imagine a ton of other non-printable crap there. One of the challenges in conveying the difference between text and binary is that you have to do the conveying in text :-)]
Or like this:
< HELLO server.example.com
> HELLO client.example.com
< GO
> GETFILE /foo.jpg
< Length: 3726
< Type: image/jpeg
< READY?
> GO
< ... server sends 3726 bytes of binary data ...
> ACK
> BYE
[I just made this up on the spot.]
There's simply not that much ambiguity there.
Another definition that I have sometimes heard is
a text protocol is one that you can debug using telnet
Maybe I am showing my nerdiness here, but I have actually written and read e-mails via SMTP and POP3, read usenet articles via NNTP and viewed web pages via HTTP using telnet, for no other reason than to see whether it would actually work.
Actually, while writing this, I kinda caught the fever again:
bash-4.0$ telnet smtp.googlemail.com 25
Trying 74.125.77.16...
Connected to googlemail-smtp.l.google.com.
Escape character is '^]'.
< 220 googlemail-smtp.l.google.com ESMTP Thu, 15 Apr 2010 19:19:39 +0200
> HELO
< 501 Syntactically invalid HELO argument(s)
> HELO client.example.com
< 250 googlemail-smtp.l.google.com Hello client.example.com [666.666.666.666]
> RCPT TO:Me <Me@Example.Com>
< 503 sender not yet given
> SENDER:Me <Me@Example.Com>
< 500 unrecognized command
> RCPT FROM:Me <Me@Example.Com>
< 500 unrecognized command
> FROM:Me <Me@Example.Com>
< 500-unrecognized command
> HELP
< 214-Commands supported:
< 214 AUTH HELO EHLO MAIL RCPT DATA NOOP QUIT RSET HELP ETRN
> MAIL FROM:Me <Me@Example.Com>
< 250 OK
> RCPT TO:You <You@SomewhereElse.Example.Com>
< 250 Accepted
> DATA
< 354 Enter message, ending with "." on a line by itself
> From: Me <Me@Example.Com>
> To: You <You@SomewhereElse.Example.Com>
> Subject: Testmail
>
> This is a test.
> .
< 250 OK id=1O2Sjq-0000c4-Qv
> QUIT
< 221 googlemail-smtp.l.google.com closing connection
Connection closed by foreign host.
Damn, it's been quite a while since I've done this. Quite a few errors in there :-)
Examples of binary protocols: RTP, TCP, IP.
Examples of text protocols: SMTP, HTTP, SIP.
This should allow you to generalise to a reasonable definition of binary vs text protocols.
Hint: just skip to the example sections, or the diagrams. They serve to illustrate Tyler's rocking answer.
As most of you suggested, we can't tell whether a protocol is binary or text simply by looking at the content on the wire.
AFAIK:
Binary protocol - bits are the boundary
Order is very critical
E.g., RTP
First two bits are the version
Next bit is the padding bit
Text protocol - delimiters specific to the protocol
Order of the fields is not important
E.g., SIP
One more difference: in a binary protocol we can split a byte, i.e., a single bit might have a specific individual meaning, while in a text protocol the minimum meaningful unit is a byte. You can't split a byte.
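As a concrete illustration of bits being the boundary, here is a short Python sketch that splits the first byte of an RTP header into its fields. The layout follows RFC 3550: version (2 bits), padding (1 bit), extension (1 bit), CSRC count (4 bits); the marker bit is the first bit of the second byte. The function name parse_rtp_first_byte is made up for the example.

```python
def parse_rtp_first_byte(first_byte):
    """Split the first byte of an RTP header into its bit fields (RFC 3550)."""
    return {
        "version": (first_byte >> 6) & 0b11,    # top two bits
        "padding": (first_byte >> 5) & 0b1,     # next bit
        "extension": (first_byte >> 4) & 0b1,   # next bit
        "csrc_count": first_byte & 0b1111,      # low four bits
    }

# 0x80 is binary 1000 0000: version 2, nothing else set
fields = parse_rtp_first_byte(0x80)
print(fields)
```

Four separate fields live inside one byte here, and swapping their order would change the meaning completely, which is the "order is very critical" point above.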
The two use different character sets: the text one uses a reduced character set, while the binary one uses everything it can, not only "letters" and "numbers" (that's why Wikipedia says "human being").
to be more clear, if I have jpg file how would that be sent through a binary protocol and how through a text one? in terms of bits/bytes sent on the wire of course.
You should read about Base64.
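A quick Python sketch of what Base64 does to binary data on its way through a text-only channel. The six bytes used here are the usual opening bytes of a JPEG file (SOI marker followed by an APP0 marker and its length), picked only as a small example.

```python
import base64

# Opening bytes of a typical JPEG file: SOI marker, APP0 marker, segment length
raw = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])

encoded = base64.b64encode(raw)
print(encoded)
print(len(raw), "raw bytes ->", len(encoded), "Base64 characters")

# Decoding gives back exactly the original bytes
assert base64.b64decode(encoded) == raw
```

Every 3 raw bytes become 4 printable ASCII characters, so the payload grows by about a third. That is the cost of squeezing arbitrary bytes into a reduced character set.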
any comments are appreciated, I am trying to get to the essence of things here.
I think the point of narrowing the character set is to reduce complexity and to achieve portability and compatibility. It's harder to get many parties to agree on and respect a wide character set (or a wide anything). The Latin/Roman alphabet and the Arabic numerals are known worldwide. (There are of course other considerations for reducing the code, but that's a main one.)
Let's say that in binary protocols the "contract" between the parties is about bits: the first bit means this, the second means that, etc., or even about bytes (but with the freedom to use the whole character set without thinking about portability), for example in private closed systems or in standards close to the hardware. However, if you design an open system, you have to take into account how your codes will be represented in a wide range of situations; for example, how will they be represented on a machine on the other side of the world? This is where text protocols come in, where the contract is as standard as possible. I have designed both, and those were the reasons: binary for very custom solutions and text for open and/or portable systems.
Consider how an image file is sent in SOAP.
The binary data is attached as such [ATTACHMENT], and a reference to it is stored in the SOAP message.
So the protocol is text-based, and the data [image] is a binary attachment whose encoding is not relevant.
Thus, SOAP is a text protocol because of the way we specify the SOAP headers, not because of the actual data carried in it.
I think you got it wrong.
It's not the protocol that determines how data looks on the "wire"; it's the data type that determines which protocol to use to transmit it.
Take a TCP socket, for instance: a jpeg file will be sent and received with a binary protocol because it's binary data (not human readable, with bytes that fall outside the 32-126 ASCII range), but you can send/receive a text file with both protocols and you wouldn't notice the difference.
A text protocol can be self-explanatory and extensible.
It's self-explanatory because the message includes the field names in the message itself. With a binary protocol, you cannot tell what each value in a message means without referring to the protocol specification.
It's extensible in the sense that HTTP, as a text protocol, defines only simple rules, but you can extend the data structure by freely adding new headers or by changing the content type to transport different payloads. The headers are the metadata, and they allow negotiation and automatic adaptation.
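A tiny Python sketch of that extensibility, assuming simple "Name: value" lines separated by CRLF as in the HTTP examples earlier in the thread. The parser parse_headers and the X-Example-Extension header are made up for this illustration.

```python
def parse_headers(block):
    """Parse 'Name: value' lines into a dict keyed by lower-cased header name."""
    headers = {}
    for line in block.split("\r\n"):
        if not line:
            continue  # ignore empty lines
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers

message = (
    "Content-Type: image/jpeg\r\n"
    "Content-Length: 60830\r\n"
    "X-Example-Extension: anything\r\n"  # hypothetical header added later
)

headers = parse_headers(message)
print(headers["content-length"])
```

A receiver that only cares about Content-Length keeps working when the extra header appears, because each field carries its own name. With a fixed binary layout, adding a field usually means changing the structure on both sides.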
