How to compile several regex into one - python-3.x

Good morning, I need to compile several regular expressions into one pattern.
The regular expressions look like this:
reg_ip = r'(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
reg_meth = r'(?P<METHOD>GET|POST|PUT|DELETE|HEAD)'
reg_status = r'\s(?P<STATUS>20[0-9]|30[0-9]|40[0-9]|50[0-9])\s'
reg_400 = r'\s(?P<STATUS_400>40[0-9])\s'
reg_500 = r'\s(?P<STATUS_500>50[0-9])\s'
reg_url = r'"(?P<URL>htt[p|ps]:.*?)"'
reg_rt = r'\s(?P<REQ_TIME>\d{4})$'
The regular expressions are written for lines from an Apache access.log, such as:
109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 4374
I tried to compile them with code like this:
some_pattern = re.compile(reg_ip.join(reg_meth).join(reg_status))
Obviously it doesn't work that way. How do I do it right?

You need some glue between regexes.
You have two options:
join regexes via alternation: regex1|regex2|regex3|... and use global search
add missing glue between regexes: for example, between reg_status and reg_url you may need to add r'[^"]+' to skip the next number
The problem with alternation is that each sub-pattern can match anywhere in the line, so you could find, for example, the word POST (or a number) inside a URL.
So for me, the second option is better.
This is the glue I would use:
import re
reg_ip = r'(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
reg_meth = r'(?P<METHOD>GET|POST|PUT|DELETE|HEAD)'
reg_status = r'\s(?P<STATUS>20[0-9]|30[0-9]|40[0-9]|50[0-9])\s'
#reg_400 = r'\s(?P<STATUS_400>40[0-9])\s'
#reg_500 = r'\s(?P<STATUS_500>50[0-9])\s'
reg_url = r'"(?P<URL>https?:[^"]+)"'
reg_rt = r'\s(?P<REQ_TIME>\d{4})$'
some_pattern = re.compile(reg_meth + r'\s+[^]]+\s*"' + reg_status + r'[^"]+' + reg_url + r'\s*"[^"]+"\s*' + reg_rt)
print(some_pattern)
line = '109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 4374'
print(some_pattern.search(line))
For the glue, these are the pieces I used:
\s* : Match any whitespace character zero or more times
\s+ : Match any whitespace character one or more times
[^X]+ : Where 'X' is some character; match any non-X character one or more times
By the way:
By the way, htt[p|ps] is not correct. You can simply use https? instead. Or, if you want to do it with groups: htt(p|ps) or htt(?:p|ps) (the last one is a non-capturing group, which is preferred if you don't want to capture its content).
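To check the captured fields, you can continue the snippet above and print the named groups (output shown as a comment):
m = some_pattern.search(line)
if m:
    print(m.groupdict())
    # {'METHOD': 'POST', 'STATUS': '200',
    #  'URL': 'http://almhuette-raith.at/administrator/', 'REQ_TIME': '4374'}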

Related

My grok pattern is still slow, how do I optimise it further?

I'm curious what the most optimal solution for my pattern would be, because this one is still not the fastest, I guess.
These are my log lines:
2021-07-09T11:48:32.328+0700 7fed98b56700 1 beast: 0x7fedfbac36b0: 10.111.111.111 - - [2021-07-09T11:48:32.328210+0700] "GET /glab-reg/docker/registry/v2/blobs/sha256/3f/3fe01ae49e6c42751859c7a8f8a0b5ab4362b215d07d8a0beaa802113dd8d9b8/data HTTP/1.1" 206 4339 - "docker-distribution/v3.0.0-gitlab (go1.14.7) aws-sdk-go/1.27.0 (go1.14.7; linux; amd64)" bytes=0-
2021-07-09T12:11:45.252+0700 7f36b0dd8700 1 beast: 0x7f374adb36b0: 10.111.111.111 - - [2021-07-09T12:11:45.252941+0700] "GET /glab-reg?list-type=2&max-keys=1&prefix= HTTP/1.1" 200 723 - "docker-distribution/v3.0.0-gitlab (go1.14.7) aws-sdk-go/1.27.0 (go1.14.7; linux; amd64)" -
2021-07-09T12:11:45.431+0700 7f360fc96700 1 beast: 0x7f374ad326b0: 10.111.111.111 - - [2021-07-09T12:11:45.431942+0700] "GET /streag/?list-type=2&delimiter=%2F&max-keys=5000&prefix=logs%2F&fetch-owner=false HTTP/1.1" 200 497 - "Hadoop 3.2.2, aws-sdk-java/1.11.563 Linux/5.4.0-70-generic OpenJDK_64-Bit_Server_VM/25.252-b09 java/1.8.0_252 scala/2.12.10 vendor/Oracle_Corporation" -
2021-07-09T12:12:00.738+0700 7fafc968d700 1 beast: 0x7fb0b5f0d6b0: 10.111.111.111 - - [2021-07-09T12:12:00.738060+0700] "GET /csder-prd-cae?list-type=2&max-keys=1000 HTTP/1.1" 200 279469 - "aws-sdk-java/2.16.50 Linux/3.10.0-1160.31.1.el7.x86_64 OpenJDK_64-Bit_Server_VM/25.292-b10 Java/1.8.0_292 scala/2.11.10 vendor/Red_Hat__Inc. io/async http/NettyNio cfg/retry-mode/legacy" -
2021-07-09T12:55:43.573+0700 7fa5329e3700 1 beast: 0x7fa4499846b0: 10.111.111.111 - - [2021-07-09T12:55:43.573351+0700] "PUT /s..prr//WHITELABEL-1/PAGETPYE-7/DEVICE-1/LANGUAGE-18/SUBTYPE-0/10236929 HTTP/1.1" 200 34982 - "aws-sdk-dotnet-coreclr/3.5.10.1 aws-sdk-dotnet-core/3.5.3.8 .NET_Core/4.6.26328.01 OS/Microsoft_Windows_6.3.9600 ClientAsync" -
2021-07-09T12:55:43.587+0700 7fa4e9951700 1 beast: 0x7fa4490f36b0: 10.111.111.111 - - [2021-07-09T12:55:43.587351+0700] "GET /admin/log/?type=data&id=22&marker=1_1625810142.071426_1063846896.1&extra-info=true&rgwx-zonegroup=31a5ea05-c87a-436d-9ca0-ccfcbad481e3 HTTP/1.1" 200 44 - - -
This is my filter:
%{TIMESTAMP_ISO8601:LogTimestamp}\] \"%{WORD:request_method} (?<swift_v1>(/swift/v1){0,1})/(?<bucketname>(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.{1,})*([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9]))(\?|/\?|/)(?<list_type=2>(list-type=2){0,1})%{GREEDYDATA}%{SPACE}HTTP/1.1\" %{NUMBER:httprespcode:int}
I just read some exciting articles here about improving grok. Here are some of the points:
According to the source, the time spent checking that a line doesn't match can be up to 6 times slower than a regular (successful) match, especially when there is no match at the start and no match at the end. So you can improve grok patterns by making failed matches fail faster: you might want to use the anchors ^ and $ to help grok decide faster based on the beginning or the end of the line.
Sample:
%{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "%{WORD:verb} %{DATA:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:response:int} (?:-|%{NUMBER:bytes:int}) %{QS:referrer} %{QS:agent}
Then:
^%{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "%{WORD:verb} %{DATA:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:response:int} (?:-|%{NUMBER:bytes:int}) %{QS:referrer} %{QS:agent}$
Result:
made the initial match failure detection around 10 times faster.
All rights belong to the respective writer; they are amazing articles, and you should check the link out.
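Grok compiles to Oniguruma regexes rather than Python's re, but the anchoring effect is easy to illustrate in plain Python (a contrived sketch, not a grok benchmark):
import re
import timeit

# A long "almost matching" line: many date-like prefixes, but the
# pattern can never complete, so every attempt is a failed match.
line = '2021-07-09 ' * 200 + 'tail'

unanchored = re.compile(r'\d{4}-\d{2}-\d{2} .* HTTP/1\.1" \d+')
anchored = re.compile(r'^\d{4}-\d{2}-\d{2} .* HTTP/1\.1" \d+$')

print(timeit.timeit(lambda: unanchored.search(line), number=50))
print(timeit.timeit(lambda: anchored.search(line), number=50))
# The anchored pattern can only succeed at position 0, so every other
# starting position is rejected immediately; the unanchored one
# re-attempts a full (failing) match at every date in the line,
# backtracking through the rest of it each time.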

Is there a way to replace all spaces outside of a pair of two characters in a string?

I have a log file that I need to make into a CSV. For that I need to replace all spaces with the | character.
My code so far:
import re

with open('Log_jeden_den.log', 'r') as f:
    for line in f:
        line = re.sub(r'[ ]+(?![^[]*\])', '|', line)
An example line of this file looks like this:
123.456.789.10 - - [20/Feb/2020:06:25:16 +0100] "GET /android-icon-192x192.png HTTP/1.1" 200 4026 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
As you can see, there are spaces inside the [] and "" pairs. Those I do not want to replace. Only the spaces outside of them.
I can do this for [] with the regex [ ]+(?![^[]*\]), but if I do the same for "" with the similar regex [ ]+(?![^"]*\"), it does not work. I tried multiple variations of this regex, and none of them worked. What am I missing?
If I work this out, then I would also need to combine those regexes, so I only replace the spaces outside of both character pairs. That would be my second question.
EDIT: Output of my example line as requested:
123.456.789.10|-|-|[20/Feb/2020:06:25:16 +0100]|"GET|/android-icon-192x192.png|HTTP/1.1"|200|4026|"-"|"Mozilla/5.0|(Windows|NT|6.1;|WOW64;|Trident/7.0;|rv:11.0)|like|Gecko"
EDIT2: This would be my desired output:
123.456.789.10|-|-|[20/Feb/2020:06:25:16 +0100]|"GET /android-icon-192x192.png HTTP/1.1"|200|4026|"-"|"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
You may use
import re

with open('Log_jeden_den_out.log', 'w') as fw:
    with open('Log_jeden_den.log', 'r') as fr:
        for line in fr:
            fw.write(re.sub(r'(\[[^][]*]|"[^"]*")|\s',
                            lambda x: x.group(1) if x.group(1) else "|", line))
Details
(\[[^][]*]|"[^"]*") - Matches and captures into Group 1 any substring between the closest [ and ] or " and "
| - or
\s - just matches any whitespace char in any other context
The lambda x: x.group(1) if x.group(1) else "|" replacement puts back Group 1 if it matched, else, replaces with a pipe.
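Running the same substitution on the example line from the question produces exactly the desired output; note that this single pattern already handles both the [] and "" pairs, which covers the second question as well:
line = '123.456.789.10 - - [20/Feb/2020:06:25:16 +0100] "GET /android-icon-192x192.png HTTP/1.1" 200 4026 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"'
print(re.sub(r'(\[[^][]*]|"[^"]*")|\s', lambda x: x.group(1) if x.group(1) else "|", line))
# 123.456.789.10|-|-|[20/Feb/2020:06:25:16 +0100]|"GET /android-icon-192x192.png HTTP/1.1"|200|4026|"-"|"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"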

How do I scrape web pages where the page number is not shown in the URL, or no link is provided?

Hi, I need to scrape the following web pages, but when it comes to pagination I am not able to fetch a URL with a page number.
https://www.nasdaq.com/market-activity/commodities/ho%3Anmx/historical
You could go through the API and iterate through each page (or, in the case of this API, an offset number, since the response tells you how many total records there are): take the total records, divide by the limit you set, round up with math.ceil, then iterate over the range from 1 to that number, passing each multiple of the limit as the offset parameter.
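A sketch of that offset approach (the offset parameter and the totalRecords field are assumptions about this API, so verify them against a real response):
import math
import requests

url = 'https://api.nasdaq.com/api/quote/HO%3ANMX/historical'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
limit = 50
payload = {'assetclass': 'commodities', 'fromdate': '2020-01-05',
           'todate': '2020-02-05', 'limit': limit, 'offset': 0}

# The first request reports how many records exist in total
data = requests.get(url, headers=headers, params=payload).json()
rows = data['data']['tradesTable']['rows']
total = int(data['data']['totalRecords'])  # assumed field name

# Round up to the number of pages, then fetch the remaining offsets
for page in range(1, math.ceil(total / limit)):
    payload['offset'] = page * limit
    data = requests.get(url, headers=headers, params=payload).json()
    rows.extend(data['data']['tradesTable']['rows'])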
Or, just easier, adjust the limit to something higher, and get it in one request:
import requests
from pandas.io.json import json_normalize
url = 'https://api.nasdaq.com/api/quote/HO%3ANMX/historical'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
payload = {
    'assetclass': 'commodities',
    'fromdate': '2020-01-05',
    'limit': '9999',
    'todate': '2020-02-05'}
data = requests.get(url, headers=headers, params=payload).json()
df = json_normalize(data['data']['tradesTable']['rows'])
Output:
print (df.to_string())
close date high low open volume
0 1.5839 02/04/2020 1.6179 1.5697 1.5699 66,881
1 1.5779 02/03/2020 1.6273 1.5707 1.6188 62,146
2 1.6284 01/31/2020 1.6786 1.6181 1.6677 68,513
3 1.642 01/30/2020 1.699 1.6305 1.6952 70,173
4 1.7043 01/29/2020 1.7355 1.6933 1.7261 69,082
5 1.7162 01/28/2020 1.7303 1.66 1.674 79,852
6 1.6829 01/27/2020 1.7305 1.6598 1.7279 97,184
7 1.7374 01/24/2020 1.7441 1.7369 1.7394 80,351
8 1.7943 01/23/2020 1.7981 1.7558 1.7919 89,084
9 1.8048 01/22/2020 1.811 1.7838 1.7929 90,311
10 1.8292 01/21/2020 1.8859 1.8242 1.8782 53,130
11 1.8637 01/17/2020 1.875 1.8472 1.8669 79,766
12 1.8647 01/16/2020 1.8926 1.8615 1.8866 99,020
13 1.8822 01/15/2020 1.9168 1.8797 1.9043 92,401
14 1.9103 01/14/2020 1.9224 1.8848 1.898 62,254
15 1.898 01/13/2020 1.94 1.8941 1.9366 61,328
16 1.9284 01/10/2020 1.96 1.9262 1.9522 67,329
17 1.9501 01/09/2020 1.9722 1.9282 1.9665 73,527
18 1.9582 01/08/2020 1.9776 1.9648 1.9759 110,514
19 2.0324 01/07/2020 2.0392 2.0065 2.0274 72,421
20 2.0339 01/06/2020 2.103 2.0193 2.0755 87,832

How to parse a comma-separated log in which one of the fields contains a comma with Grok

I have some logs which look like this:
2019-10-24 15:14:46,183 [http-nio-8080-exec-2] [bf7ccfa6-e24f-4854-9b1f-753a7886e351, 10.0.22.13, /project/api/traductions/ng/fr, GET, , Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36, 10.128.22.82, 10.255.0.3, 10.0.10.56, 10.0.22.27] [] INFO ContextUserRepositorySpring - This is my message, with any characters in it. (like those $ - )
2019-10-24 15:21:51,699 [http-nio-8080-exec-8] [, , , , , , ] [] INFO SecurityAuditEventListener - AuditEvent onApplicationEvent: org.springframework.boot.actuate.audit.listener.AuditApplicationEvent[source=AuditEvent [timestamp=2019-10-24T13:21:51.699813Z, principal=EOFK, type=AUTHENTIFICATION_SUCCESS, data={isauthenticated=true, authTpye=Authentification JWT, requestPath=/api/efsef/subscribe}]]
2019-10-24 15:23:22,578 [http-nio-8080-exec-2] [32d1189f-eac5-47ad-b52e-33323e07c4d7, 10.0.22.13, /webai/api/environnement/, GET, , Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0, 10.128.20.145, 10.255.0.3, 10.0.10.56, 10.0.22.27] [] INFO SecurityAuditEventListener - AuditEvent onApplicationEvent: ServletRequestHandledEvent: url=[/project/api/environnement/]; client=[10.0.22.13]; method=[GET]; servlet=[dispatcherServlet]; session=[null]; user=[null]; time=[2ms]; status=[OK]
I have a grok pattern like this:
"^%{TIMESTAMP_ISO8601:date}\s+\[%{EMAILLOCALPART:thread}\]\s\[(?:|%{UUID:ident}),\s(?:|%{IP:remoteHost}),\s(?:|%{URIPATH:uri}),\s(?:|%{WORD:verb}),\s(?:|%{NOTSPACE:query}),\s(?<userAgent>[^,]*),\s(?:|%{IP:ip1},\s%{IP:ip2},\s%{IP:ip3},\s%{IP:ip4})\]\s\[\]\s%{LOGLEVEL:loglevel}\s+%{NOTSPACE:logger}\s\-\s%{GREEDYDATA:logmessage}"
In this case, the second and third lines are parsed correctly, but not the first one, because of the comma in (KHTML, like Gecko). The regex fails at this part: (?<userAgent>[^,]*),\s
Now I'm wondering if there is a way to get all the logs parsed correctly, with the current logs, just by changing the grok matching regex...
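The failure mode is easy to reproduce in plain Python (grok's (?<userAgent>...) is the same named-group idea as Python's (?P<userAgent>...); this only illustrates where [^,]* stops, not a fix):
import re

ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36, 10.128.22.82'
m = re.match(r'(?P<userAgent>[^,]*),\s', ua)
print(m.group('userAgent'))
# Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML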

How to read all IP addresses from a xxx.log file and print their count?

I am a JasperReports developer, but my manager moved me to work on a Python 3 project: read the IP addresses from a 'fileName.log' file and print the count for an IP address if that IP watched my video more than one time.
I am very new to Python 3. Please help me with this problem.
My file is as below:
66.23.64.12 - - [06/Nov/2014:19:10:38 +0600] "GET /news/53f8d72920ba2744fe873ebc.html HTTP/1.1" 404 177 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
64.24.65.93 - - [06/Nov/2014:19:11:24 +0600] "GET /?q=%E0%A6%AB%E0%A6%BE%E0%A7%9F%E0%A6%BE%E0%A6%B0 HTTP/1.1" 200 4223 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
78.849.65.62 - - [06/Nov/2014:19:12:14 +0600] "GET /?q=%E0%A6%A6%E0%A7%8B%E0%A7%9F%E0%A6%BE HTTP/1.1" 200 4356 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
78.849.65.62 - - [06/Nov/2014:19:12:14 +0600] "GET /?q=%E0%A6%A6%E0%A7%8B%E0%A7%9F%E0%A6%BE HTTP/1.1" 200 4356 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
98.449.65.19 - - [06/Nov/2014:19:10:38 +0600] "GET /news/53f8d72920ba2744fe873ebc.html HTTP/1.1" 404 177 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
54.49.65.03 - - [06/Nov/2014:19:11:24 +0600] "GET /?q=%E0%A6%AB%E0%A6%BE%E0%A7%9F%E0%A6%BE%E0%A6%B0 HTTP/1.1" 200 4223 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
54.49.65.03 - - [06/Nov/2014:19:11:24 +0600] "GET /?q=%E0%A6%AB%E0%A6%BE%E0%A7%9F%E0%A6%BE%E0%A6%B0 HTTP/1.1" 200 4223 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
45.79.65.62 - - [06/Nov/2014:19:12:14 +0600] "GET /?q=%E0%A6%A6%E0%A7%8B%E0%A7%9F%E0%A6%BE HTTP/1.1" 200 4356 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
The desired output is as below:
IP Count
98.449.65.19 2
54.49.65.03 4
Here is one method; it involves storing the IP addresses in a dictionary, which may be useful depending on what else you would like to do with the data.
# Read in the text file
with open('fileName.log', 'r') as f:
    lines = f.readlines()

data = {}
for line in lines:
    # Split the line on whitespace and take the first element (the IP address)
    ipAddr = line.split()[0]
    if ipAddr in data:
        data[ipAddr] += 1
    else:
        data[ipAddr] = 1

# Print counts of each IP address
print(' IP Count')
for key, val in data.items():
    print(key, val)
Output:
IP Count
66.23.64.12 1
64.24.65.93 1
78.849.65.62 2
98.449.65.19 1
54.49.65.03 2
45.79.65.62 1
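A more compact variant uses collections.Counter; to match the desired output, you could also print only the IPs that appear more than once (a sketch):
from collections import Counter

with open('fileName.log') as f:
    counts = Counter(line.split()[0] for line in f if line.strip())

print('IP Count')
for ip, n in counts.items():
    if n > 1:  # only IPs seen more than once
        print(ip, n)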
