How to parse a comma-separated log in which one of the fields contains a comma with Grok - logstash-grok

I have some logs which look like this:
2019-10-24 15:14:46,183 [http-nio-8080-exec-2] [bf7ccfa6-e24f-4854-9b1f-753a7886e351, 10.0.22.13, /project/api/traductions/ng/fr, GET, , Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36, 10.128.22.82, 10.255.0.3, 10.0.10.56, 10.0.22.27] [] INFO ContextUserRepositorySpring - This is my message, with any characters in it. (like those $ - )
2019-10-24 15:21:51,699 [http-nio-8080-exec-8] [, , , , , , ] [] INFO SecurityAuditEventListener - AuditEvent onApplicationEvent: org.springframework.boot.actuate.audit.listener.AuditApplicationEvent[source=AuditEvent [timestamp=2019-10-24T13:21:51.699813Z, principal=EOFK, type=AUTHENTIFICATION_SUCCESS, data={isauthenticated=true, authTpye=Authentification JWT, requestPath=/api/efsef/subscribe}]]
2019-10-24 15:23:22,578 [http-nio-8080-exec-2] [32d1189f-eac5-47ad-b52e-33323e07c4d7, 10.0.22.13, /webai/api/environnement/, GET, , Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0, 10.128.20.145, 10.255.0.3, 10.0.10.56, 10.0.22.27] [] INFO SecurityAuditEventListener - AuditEvent onApplicationEvent: ServletRequestHandledEvent: url=[/project/api/environnement/]; client=[10.0.22.13]; method=[GET]; servlet=[dispatcherServlet]; session=[null]; user=[null]; time=[2ms]; status=[OK]
I have a grok pattern like this:
"^%{TIMESTAMP_ISO8601:date}\s+\[%{EMAILLOCALPART:thread}\]\s\[(?:|%{UUID:ident}),\s(?:|%{IP:remoteHost}),\s(?:|%{URIPATH:uri}),\s(?:|%{WORD:verb}),\s(?:|%{NOTSPACE:query}),\s(?<userAgent>[^,]*),\s(?:|%{IP:ip1},\s%{IP:ip2},\s%{IP:ip3},\s%{IP:ip4})\]\s\[\]\s%{LOGLEVEL:loglevel}\s+%{NOTSPACE:logger}\s\-\s%{GREEDYDATA:logmessage}"
With this pattern, the second and third lines parse correctly, but the first one does not, because of the comma inside (KHTML, like Gecko). The regex fails at this part: (?<userAgent>[^,]*),\s
Now I'm wondering if there is a way to get all of these logs parsed correctly, keeping the logs as they are, just by changing the grok matching pattern.
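One option that seems worth trying: since the four trailing IPs and the closing ] are mandatory whenever the user agent is present, the userAgent group can be made lazy ((?<userAgent>.*?) instead of [^,]*), and the anchors after it pin down where the field ends. Below is a sketch of the idea using Python's re module (Grok's Oniguruma engine supports the same lazy quantifier); the pattern is a simplified stand-in for the bracketed section of the full grok expression, not the grok itself:

```python
import re

# Simplified stand-in for the bracketed section of the grok pattern.
# The key change: userAgent is lazy (.*?), so the mandatory IP list and
# closing "]" after it decide where the field ends, even when the agent
# itself contains commas like "(KHTML, like Gecko)".
IP = r'\d{1,3}(?:\.\d{1,3}){3}'
pattern = re.compile(
    r'\[(?P<ident>[^,]*),\s(?P<remoteHost>[^,]*),\s(?P<uri>[^,]*),\s'
    r'(?P<verb>[^,]*),\s(?P<query>[^,]*),\s'
    r'(?P<userAgent>.*?),\s'   # lazy: commas inside the agent are fine
    rf'(?P<ip1>{IP}),\s(?P<ip2>{IP}),\s(?P<ip3>{IP}),\s(?P<ip4>{IP})\]'
)

line = ('[bf7ccfa6-e24f-4854-9b1f-753a7886e351, 10.0.22.13, '
        '/project/api/traductions/ng/fr, GET, , '
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36, '
        '10.128.22.82, 10.255.0.3, 10.0.10.56, 10.0.22.27]')

m = pattern.search(line)
print(m.group('userAgent'))
```

In the grok itself, the equivalent change would be replacing (?<userAgent>[^,]*) with (?<userAgent>.*?), leaving the rest of the expression alone.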

Related

How to compile several regex into one

Good morning, I need to combine several regular expressions into one pattern.
Regular expressions are like this:
reg_ip = r'(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
reg_meth = r'(?P<METHOD>GET|POST|PUT|DELETE|HEAD)'
reg_status = r'\s(?P<STATUS>20[0-9]|30[0-9]|40[0-9]|50[0-9])\s'
reg_400 = r'\s(?P<STATUS_400>40[0-9])\s'
reg_500 = r'\s(?P<STATUS_500>50[0-9])\s'
reg_url = r'"(?P<URL>htt[p|ps]:.*?)"'
reg_rt = r'\s(?P<REQ_TIME>\d{4})$'
The regular expressions are written for lines from an Apache access.log:
109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 4374
I tried to compile it with code like this:
some_pattern = re.compile(reg_ip.join(reg_meth).join(reg_status))
Obviously it doesn't work that way. How do I do it right?
You need some glue between regexes.
You have two options:
join regexes via alternation: regex1|regex2|regex3|... and use global search
add missing glue between regexes: for example, between reg_status and reg_url you may need to add r'[^"]+' to skip the next number
The problem with alternation is that each regex could match anywhere, so you could find, for example, the word POST (or a number) inside a URL.
So for me, the second option is better.
This is the glue I would use:
import re
reg_ip = r'(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
reg_meth = r'(?P<METHOD>GET|POST|PUT|DELETE|HEAD)'
reg_status = r'\s(?P<STATUS>20[0-9]|30[0-9]|40[0-9]|50[0-9])\s'
#reg_400 = r'\s(?P<STATUS_400>40[0-9])\s'
#reg_500 = r'\s(?P<STATUS_500>50[0-9])\s'
reg_url = r'"(?P<URL>https?:[^"]+)"'
reg_rt = r'\s(?P<REQ_TIME>\d{4})$'
some_pattern = re.compile(reg_meth + r'\s+[^]]+\s*"' + reg_status + r'[^"]+' + reg_url + r'\s*"[^"]+"\s*' + reg_rt)
print(some_pattern)
line = '109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 4374'
print(some_pattern.search(line))
For the glue, these are the pieces I used:
\s* : matches any whitespace character zero or more times
\s+ : matches any whitespace character one or more times
[^X]+ : where 'X' is some character; matches any non-X character one or more times
By the way:
This htt[p|ps] is not correct. You can simply use https? instead. Or if you want to do it with groups: htt(p|ps) or htt(?:p|ps) (the last one is a non-capturing group, which is preferred if you don't want to capture its content).
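To see what the combined pattern actually yields, the named groups can be read back with groupdict(); this just reassembles the answer's pattern and runs it on the sample line:

```python
import re

# Reassemble the answer's pattern and inspect the named groups it captures.
reg_meth = r'(?P<METHOD>GET|POST|PUT|DELETE|HEAD)'
reg_status = r'\s(?P<STATUS>20[0-9]|30[0-9]|40[0-9]|50[0-9])\s'
reg_url = r'"(?P<URL>https?:[^"]+)"'
reg_rt = r'\s(?P<REQ_TIME>\d{4})$'
some_pattern = re.compile(reg_meth + r'\s+[^]]+\s*"' + reg_status
                          + r'[^"]+' + reg_url + r'\s*"[^"]+"\s*' + reg_rt)

line = ('109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] '
        '"POST /administrator/index.php HTTP/1.1" 200 4494 '
        '"http://almhuette-raith.at/administrator/" '
        '"Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 4374')

m = some_pattern.search(line)
print(m.groupdict())
```

The groupdict gives METHOD, STATUS, URL and REQ_TIME in one dictionary, which is usually more convenient than numbered groups when the pattern is assembled from pieces.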

Is there a way to replace all spaces outside of a pair of two characters in a string?

I have a log file that I need to turn into a CSV. For that I need to replace all spaces with the | character.
My code so far:
import re

with open('Log_jeden_den.log', 'r') as f:
    for line in f:
        line = re.sub(r'[ ]+(?![^[]*\])', '|', line)
An example line of this file looks like this:
123.456.789.10 - - [20/Feb/2020:06:25:16 +0100] "GET /android-icon-192x192.png HTTP/1.1" 200 4026 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
As you can see, there are spaces inside the [] and "" pairs. Those I do not want to replace. Only the spaces outside of them.
I can do this for [], with this regex [ ]+(?![^[]*\]), but if I do the same for "" with similar regex [ ]+(?![^"]*\"), it does not work. I tried multiple variations of this regex, none of them worked. What am I missing?
If I work this out, then I would also need to combine those regexes, so I only replace the spaces outside of both character pairs. That would be my second question.
EDIT: Output of my example line as requested:
123.456.789.10|-|-|[20/Feb/2020:06:25:16 +0100]|"GET|/android-icon-192x192.png|HTTP/1.1"|200|4026|"-"|"Mozilla/5.0|(Windows|NT|6.1;|WOW64;|Trident/7.0;|rv:11.0)|like|Gecko"
EDIT2: This would be my desired output:
123.456.789.10|-|-|[20/Feb/2020:06:25:16 +0100]|"GET /android-icon-192x192.png HTTP/1.1"|200|4026|"-"|"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
You may use
import re

with open('Log_jeden_den_out.log', 'w') as fw:
    with open('Log_jeden_den.log', 'r') as fr:
        for line in fr:
            fw.write(re.sub(r'(\[[^][]*]|"[^"]*")|\s', lambda x: x.group(1) if x.group(1) else "|", line))
Details
(\[[^][]*]|"[^"]*") - Matches and captures into Group 1 any substring between the closest [ and ] or " and "
| - or
\s - just matches any whitespace char in any other context
The lambda x: x.group(1) if x.group(1) else "|" replacement puts back Group 1 if it matched, else, replaces with a pipe.
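Running that substitution on the example line (minus the file I/O) produces exactly the desired output from EDIT2:

```python
import re

line = ('123.456.789.10 - - [20/Feb/2020:06:25:16 +0100] '
        '"GET /android-icon-192x192.png HTTP/1.1" 200 4026 "-" '
        '"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"')

# Keep bracketed/quoted chunks (group 1) verbatim; turn every other
# whitespace character into a pipe.
result = re.sub(r'(\[[^][]*]|"[^"]*")|\s',
                lambda x: x.group(1) if x.group(1) else "|", line)
print(result)
```

Note that because \s also matches newlines, each line should be processed individually (as in the loop above) rather than on the whole file at once.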

How do I scrape web pages where the page number is not shown in the URL and no link is provided?

Hi, I need to scrape the following web pages, but when it comes to pagination I am not able to build a URL with the page number.
https://www.nasdaq.com/market-activity/commodities/ho%3Anmx/historical
You could go through the API and iterate through each page (or, in the case of this API, an offset number, since the response tells you how many total records there are): take the total record count, divide it by the limit, round up with math.ceil, then iterate over that range, passing each multiple of the limit as the offset parameter.
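The paging arithmetic described above could be sketched like this (no network call here; the idea that the endpoint accepts an offset parameter and reports a total record count is an assumption about this particular API):

```python
import math

def page_offsets(total_records, limit):
    # Number of pages, rounding up, then one offset per page.
    pages = math.ceil(total_records / limit)
    return [page * limit for page in range(pages)]

# e.g. 4543 records fetched 800 at a time -> 6 requests
print(page_offsets(4543, 800))
# [0, 800, 1600, 2400, 3200, 4000]
```

Each offset would then be sent alongside the limit as query parameters on successive requests, and the returned rows concatenated.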
Or, even easier, raise the limit and get it all in one request:
import requests
from pandas.io.json import json_normalize  # in pandas >= 1.0, use pandas.json_normalize

url = 'https://api.nasdaq.com/api/quote/HO%3ANMX/historical'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
payload = {
    'assetclass': 'commodities',
    'fromdate': '2020-01-05',
    'limit': '9999',
    'todate': '2020-02-05'}

data = requests.get(url, headers=headers, params=payload).json()
df = json_normalize(data['data']['tradesTable']['rows'])
Output:
print (df.to_string())
close date high low open volume
0 1.5839 02/04/2020 1.6179 1.5697 1.5699 66,881
1 1.5779 02/03/2020 1.6273 1.5707 1.6188 62,146
2 1.6284 01/31/2020 1.6786 1.6181 1.6677 68,513
3 1.642 01/30/2020 1.699 1.6305 1.6952 70,173
4 1.7043 01/29/2020 1.7355 1.6933 1.7261 69,082
5 1.7162 01/28/2020 1.7303 1.66 1.674 79,852
6 1.6829 01/27/2020 1.7305 1.6598 1.7279 97,184
7 1.7374 01/24/2020 1.7441 1.7369 1.7394 80,351
8 1.7943 01/23/2020 1.7981 1.7558 1.7919 89,084
9 1.8048 01/22/2020 1.811 1.7838 1.7929 90,311
10 1.8292 01/21/2020 1.8859 1.8242 1.8782 53,130
11 1.8637 01/17/2020 1.875 1.8472 1.8669 79,766
12 1.8647 01/16/2020 1.8926 1.8615 1.8866 99,020
13 1.8822 01/15/2020 1.9168 1.8797 1.9043 92,401
14 1.9103 01/14/2020 1.9224 1.8848 1.898 62,254
15 1.898 01/13/2020 1.94 1.8941 1.9366 61,328
16 1.9284 01/10/2020 1.96 1.9262 1.9522 67,329
17 1.9501 01/09/2020 1.9722 1.9282 1.9665 73,527
18 1.9582 01/08/2020 1.9776 1.9648 1.9759 110,514
19 2.0324 01/07/2020 2.0392 2.0065 2.0274 72,421
20 2.0339 01/06/2020 2.103 2.0193 2.0755 87,832

Sending data from Hono to Ditto

Eclipse Hono and Eclipse Ditto have been connected successfully, and when I send data through Hono I get a 202 Accepted response, as shown below.
(base) vignesh@nb907:~$ curl -X POST -i -u sensor9200@tenantSensorAdaptersss:mylittle -H 'Content-Type: application/json' -d '{"temp": 23.07, "hum": 45.85122}' http://localhost:8080/telemetry
HTTP/1.1 202 Accepted
content-length: 0
But when I check the digital twin value using localhost:8080/api/2/testing.ditto:9200, it is not getting updated.
I came across this error while going through the logs:
connectivity_1 | 2019-10-14 15:18:26,273 INFO [ID:AMQP_NO_PREFIX:TelemetrySenderImpl-7] o.e.d.s.c.m.a.AmqpPublisherActor akka://ditto-cluster/system/sharding/connection/27/Amma123465/pa/$a/c1/amqpPublisherActor2 - Response dropped, missing replyTo address: UnmodifiableExternalMessage [headers={content-type=application/vnd.eclipse.ditto+json, orig_adapter=hono-http, orig_address=/telemetry, device_id=9200, correlation-id=ID:AMQP_NO_PREFIX:TelemetrySenderImpl-7}, response=true, error=true, authorizationContext=null, topicPath=ImmutableTopicPath [namespace=unknown, id=unknown, group=things, channel=twin, criterion=errors, action=null, subject=null, path=unknown/unknown/things/twin/errors], enforcement=null, headerMapping=null, sourceAddress=null, payloadType=TEXT, textPayload={"topic":"unknown/unknown/things/twin/errors","headers":{"content-type":"application/vnd.eclipse.ditto+json","orig_adapter":"hono-http","orig_address":"/telemetry","device_id":"9200","correlation-id":"ID:AMQP_NO_PREFIX:TelemetrySenderImpl-7"},"path":"/","value":{"status":400,"error":"json.field.missing","message":"JSON did not include required </path> field!","description":"Check if all required JSON fields were set."},"status":400}, bytePayload=null']
gateway_1 | 2019-10-14 15:19:47,927 WARN [b9774050-48ae-45c4-a937-68a70f8defe5] o.e.d.s.g.s.a.d.DummyAuthenticationProvider - Dummy authentication has been applied for the following subjects: nginx:ditto
gateway_1 | 2019-10-14 15:19:47,949 INFO [b9774050-48ae-45c4-a937-68a70f8defe5] o.e.d.s.m.c.a.ConciergeForwarderActor akka://ditto-cluster/user/gatewayRoot/conciergeForwarder - Sending signal with ID <testing.ditto:9200> and type <things.commands:retrieveThing> to concierge-shard-region
gateway_1 | 2019-10-14 15:19:48,044 INFO [b9774050-48ae-45c4-a937-68a70f8defe5] o.e.d.s.g.e.HttpRequestActor akka://ditto-cluster/user/$C - DittoRuntimeException <things:precondition.notmodified>: <The comparison of precondition header 'if-none-match' for the requested Thing resource evaluated to false. Expected: '"rev:1"' not to match actual: '"rev:1"'.>.
I have set all the JSON fields, but I'm not sure what I am missing.
I can also see this in the log:
nginx_1 | 172.18.0.1 - ditto [14/Oct/2019:13:19:48 +0000] "GET /api/2/things/testing.ditto:9200 HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
Please let me know if I am missing something.
Did you send the message in Ditto Protocol, or did you apply a payload transformation?
Looks like a duplicate of Connecting Eclipse Hono to Ditto - "description":"Check if all required JSON fields were set."},"status":400}" Error where you had the same problem and error before.
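For reference, the json.field.missing error complains about the Ditto Protocol /path field, which suggests the raw telemetry JSON was forwarded without being wrapped in a Ditto Protocol envelope (and without a payload mapping on the connection). A sketch of what such an envelope could look like for the thing testing.ditto:9200 — the feature name "environment" is a made-up placeholder, not something from the question:

```python
import json

# Hypothetical Ditto Protocol envelope carrying the question's telemetry
# values. "environment" is an assumed feature name; the topic follows the
# <namespace>/<name>/things/twin/commands/modify shape.
ditto_message = {
    "topic": "testing.ditto/9200/things/twin/commands/modify",
    "headers": {"content-type": "application/vnd.eclipse.ditto+json"},
    "path": "/features/environment/properties",
    "value": {"temp": 23.07, "hum": 45.85122},
}
print(json.dumps(ditto_message, indent=2))
```

Either the device sends this envelope directly, or a payload mapping configured on the Hono connection builds it from the plain {"temp": ..., "hum": ...} body.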

How to read all IP addresses from a xxx.log file and print their counts?

I am a JasperReports developer, but my manager moved me to a Python 3 project to read the IP addresses from a 'fileName.log' file, and wants me to print the count for each IP address that watched my video more than once.
I am very new to Python 3. Please help me with this problem.
My file as below:
66.23.64.12 - - [06/Nov/2014:19:10:38 +0600] "GET /news/53f8d72920ba2744fe873ebc.html HTTP/1.1" 404 177 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
64.24.65.93 - - [06/Nov/2014:19:11:24 +0600] "GET /?q=%E0%A6%AB%E0%A6%BE%E0%A7%9F%E0%A6%BE%E0%A6%B0 HTTP/1.1" 200 4223 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
78.849.65.62 - - [06/Nov/2014:19:12:14 +0600] "GET /?q=%E0%A6%A6%E0%A7%8B%E0%A7%9F%E0%A6%BE HTTP/1.1" 200 4356 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
78.849.65.62 - - [06/Nov/2014:19:12:14 +0600] "GET /?q=%E0%A6%A6%E0%A7%8B%E0%A7%9F%E0%A6%BE HTTP/1.1" 200 4356 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
98.449.65.19 - - [06/Nov/2014:19:10:38 +0600] "GET /news/53f8d72920ba2744fe873ebc.html HTTP/1.1" 404 177 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
54.49.65.03 - - [06/Nov/2014:19:11:24 +0600] "GET /?q=%E0%A6%AB%E0%A6%BE%E0%A7%9F%E0%A6%BE%E0%A6%B0 HTTP/1.1" 200 4223 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
54.49.65.03 - - [06/Nov/2014:19:11:24 +0600] "GET /?q=%E0%A6%AB%E0%A6%BE%E0%A7%9F%E0%A6%BE%E0%A6%B0 HTTP/1.1" 200 4223 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
45.79.65.62 - - [06/Nov/2014:19:12:14 +0600] "GET /?q=%E0%A6%A6%E0%A7%8B%E0%A7%9F%E0%A6%BE HTTP/1.1" 200 4356 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Output as below:
IP Count
98.449.65.19 2
54.49.65.03 4
Here is one method; it involves storing the IP addresses in a dictionary, which may be useful depending on what else you would like to do with the data.
# Read in the text file
with open('fileName.log', 'r') as f:
    lines = f.readlines()

data = {}
for line in lines:
    # Split the line each time a space appears, and take the first element (the IP address)
    ipAddr = line.split()[0]
    if ipAddr in data:
        data[ipAddr] += 1
    else:
        data[ipAddr] = 1

# Print counts of each IP address
print(' IP Count')
for key, val in data.items():
    print(key, val)
Output:
IP Count
66.23.64.12 1
64.24.65.93 1
78.849.65.62 2
98.449.65.19 1
54.49.65.03 2
45.79.65.62 1
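The same tally can also be written with the standard library's collections.Counter; shown here on a few of the sample lines (truncated with "..." for brevity):

```python
from collections import Counter

# Count the first whitespace-separated token (the IP) on each line.
log_lines = [
    '66.23.64.12 - - [06/Nov/2014:19:10:38 +0600] "GET /news/... HTTP/1.1" 404 177',
    '78.849.65.62 - - [06/Nov/2014:19:12:14 +0600] "GET /?q=... HTTP/1.1" 200 4356',
    '78.849.65.62 - - [06/Nov/2014:19:12:14 +0600] "GET /?q=... HTTP/1.1" 200 4356',
]

counts = Counter(line.split()[0] for line in log_lines)
print(' IP Count')
for ip, n in counts.items():
    print(ip, n)
```

Counter also offers most_common() if you only want the IPs that appear more than once, e.g. [ip for ip, n in counts.items() if n > 1].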
