Extracting string from Regex in Pandas for large dataset - python-3.x

We have a CSV file that contains one log entry per row.
We need to extract the thread name from each log entry into a separate column.
What would be the fastest way to do this?
The approach below (string functions) takes a lot of time on large datasets.
Each CSV file has at least 100K entries.
This is the code that extracts the thread name:
df['thread'] = df.message.str.extract(pat = '(\[(\w+.)+?\]|$)')[0]
From the sample log entry below, the regex above picks out:
[c.a.j.sprint_planning_resources.listener.RunAsyncEvent]
2020-12-01 05:07:36,485-0500 ForkJoinPool.commonPool-worker-30 WARN Ives_Chen 245x27568399x23 oxk7fv 10.97.200.99,127.0.0.1 /browse/MDT-206838 [c.a.j.sprint_planning_resources.listener.RunAsyncEvent] Event processed: com.atlassian.jira.event.issue.IssueEvent#5c8703d0[issue=ABC-61381,comment=<null>,worklog=<null>,changelog=[GenericEntity:ChangeGroup][issue,1443521][author,JIRAUSER39166][created,2020-12-01 05:07:36.377][id,15932782],eventTypeId=2,sendMail=true,params={eventsource=action, baseurl=https://min.com},subtasksUpdated=true,spanningOperation=Optional.empty]
Does anyone know a better or faster way to do this?

The \[(\w+.)+?\] is a very inefficient pattern that may cause catastrophic backtracking due to the nested quantifiers with an unescaped . that matches any char, and thus also matches what \w does.
You can use
df['thread'] = df['message'].str.extract(r'\[(\w+(?:\.\w+)*)]', expand=False).fillna("")
See this regex demo. Note there is no need to add $ as an alternative, since .fillna("") will replace the NA with an empty string.
The regex matches
\[ - a [ char
(\w+(?:\.\w+)*) - Capturing group 1: one or more word chars followed with zero or more sequences of a . and one or more word chars
] - a ] char.
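As a rough illustration, the same pattern can be tried with the standard re module on a shortened copy of the sample log entry. The helper name extract_thread is made up for this sketch; in pandas, the str.extract call above applies the pattern to the whole column.

```python
import re

# Same pattern as in the answer: one capturing group, no nested quantifiers.
THREAD_RE = re.compile(r'\[(\w+(?:\.\w+)*)]')

def extract_thread(message: str) -> str:
    """Return the first dotted name found in square brackets, or '' if none."""
    match = THREAD_RE.search(message)
    return match.group(1) if match else ""

# Shortened copy of the sample log entry from the question.
sample = (
    "2020-12-01 05:07:36,485-0500 ForkJoinPool.commonPool-worker-30 WARN "
    "Ives_Chen 245x27568399x23 oxk7fv 10.97.200.99,127.0.0.1 /browse/MDT-206838 "
    "[c.a.j.sprint_planning_resources.listener.RunAsyncEvent] Event processed"
)

print(extract_thread(sample))
# c.a.j.sprint_planning_resources.listener.RunAsyncEvent
```

Note that group 1 excludes the surrounding brackets. If the brackets should be kept, the capturing parentheses would have to wrap the whole pattern.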

Your regex takes a whopping 8,572 steps to complete, see https://regex101.com/r/5c3vi7/1
You can use this regex to significantly cut down the regex processing to 4 steps:
\[[^\]]+\]
Do notice the absence of the /g modifier
https://regex101.com/r/6522P8/1
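A small sketch of why the first match matters with this looser pattern, using a fragment of the sample log entry. re.search stops at the first bracketed group, which corresponds to running without /g, while re.findall also picks up the later bracketed groups in the message body. The variable names are illustrative only.

```python
import re

# Negated character class: everything between "[" and the next "]".
BRACKETED = re.compile(r'\[[^\]]+\]')

# Fragment of the sample log entry from the question.
fragment = (
    "/browse/MDT-206838 [c.a.j.sprint_planning_resources.listener.RunAsyncEvent] "
    "Event processed: com.atlassian.jira.event.issue.IssueEvent#5c8703d0"
    "[issue=ABC-61381,comment=<null>,worklog=<null>,"
    "changelog=[GenericEntity:ChangeGroup][issue,1443521]"
)

# First match only: this is the thread name, brackets included.
first = BRACKETED.search(fragment).group()
print(first)

# All matches: later bracketed groups are returned too, which is not wanted here.
all_hits = BRACKETED.findall(fragment)
print(len(all_hits))
```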


Regex for specific permutations of a word

I am working on a wordle bot and I am trying to match words using regex. I am stuck at a problem where I need to look for specific permutations of a given word.
For example, if the word is "steal" these are all the permutations:
'tesla', 'stale', 'steal', 'taels', 'leats', 'setal', 'tales', 'slate', 'teals', 'stela', 'least', 'salet'.
I had some trouble creating a regex for this, but eventually stumbled on positive lookaheads, which solved the issue. The regex:
'(?=.*[s])(?=.*[l])(?=.*[a])(?=.*[t])(?=.*[e])'
But, if we are looking for specific permutations, how do we go about it?
For example, words that look like 's[lt]a[lt]e'. The matching words are 'slate', 'stale', 'state'. But I want to limit the count of l and t in the matched word, which means the output should be 'slate' and 'stale'. One obvious solution is the regex r'slate|stale', but this is not a general solution. I am trying to arrive at a general solution for any scenario, and the positive lookaheads above seemed like a starting point, but I have been unable to get there.
Do we combine positive lookaheads with normal regex?
s(?=.*[lt])a(?=.*[lt])e (Did not work)
Or do we write nested lookaheads or something?
A few more regexes that did not work:
s(?=.*[lt]a[tl]e)
s(?=.*[lt])(?=.*[a])(?=.*[lt])(?=.*[e])
I tried to look through the available posts on SO, but could not find anything that would help me understand this. Any help is appreciated.
You could append the regex which matches the permutations of interest to your existing regex. In your sample case, you would use:
(?=.*s)(?=.*l)(?=.*a)(?=.*t)(?=.*e)s[lt]a[lt]e
This will match only stale and slate; it won't match state because it fails the lookahead that requires an l in the word.
Note that you don't need the (?=.*s)(?=.*a)(?=.*e) in the above regex as they are required by the part that matches the permutations of interest. I've left them in to keep that part of the regex generic and not dependent on what follows it.
Demo on regex101
Note that to allow for duplicated characters you might want to change your lookaheads to something in this form:
(?=(?:[^s]*s){1}[^s]*)
You would change the quantifier on the group to match the number of occurrences of that character which are required.
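A minimal sketch of the combined approach in Python, assuming one lookahead per required letter followed by the positional pattern. The helper name build_pattern and the candidate list are made up for illustration.

```python
import re

def build_pattern(letters, shape):
    """Prefix `shape` with one positive lookahead per required letter."""
    lookaheads = "".join(f"(?=.*{re.escape(ch)})" for ch in letters)
    return re.compile(lookaheads + shape)

# Requires every letter of "slate" to be present, in the shape s[lt]a[lt]e.
pattern = build_pattern("slate", "s[lt]a[lt]e")

candidates = ["slate", "stale", "state", "steal"]
matches = [word for word in candidates if pattern.fullmatch(word)]
print(matches)
# ['slate', 'stale']
```

Here 'state' is rejected by the lookahead for l, and 'steal' is rejected by the positional part because its third letter is not a.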

How can I remove all characters inside angle brackets python?

How can I remove all characters inside angle brackets, including the brackets, in a string? How can I also remove all the text between ("\r\n") and ("."+"any 3 characters")? Is this possible? I am currently using the solution by @xkcdjerry,
e.g.
body = """Dear Students roads etc. you place a tree take a snapshot, then when you place a\r\nbuilding, take a snapshot. Place at least 5-6 objects and then have 5-6\r\nsnapshots. Please keep these snapshots with you as everyone will be asked\r\nto share them during the class.\r\n\r\nI am attaching one PowerPoint containing instructions and one video of\r\nexplanation for your reference.\r\n\r\nKind regards,\r\nTeacher Name\r\n zoom_0.mp4\r\n<https://drive.google.com/file/d/1UX-klOfVhbefvbhZvIWijaBdQuLgh_-Uru4_1QTkth/view?usp=drive_web>"""
d = re.compile("\r\n.+?\\....")
body = d.sub('', body)
a = re.compile("<.*?>")
body = a.sub('', body)
print(body)
For some reason the output is fine except that it has:
gle.com/file/d/1UX-klOfVhbefvbhZvIWijaBdQuLgh_-Uru4_1QTkth/view?usp=drive_web>
randomly attached to the end. How can I fix it?
Answer
Your problem can be solved by a regex:
Put this into the shell:
import re
a=re.compile("<.*?>")
a.sub('',"Keep this part of the string< Remove this part>Keep This part as well")
Output:
'Keep this part of the stringKeep This part as well'
Second question:
import re
a=re.compile("\r\n.*?\\..{3}")
a.sub('',"Hello\r\nFilename.png")
Output:
'Hello'
Breakdown
Regex is a robust way of finding, replacing, and mutating small strings inside bigger ones. For further reading, consult https://docs.python.org/3/library/re.html. Meanwhile, here are breakdowns of the regex constructs used in this answer:
. means any char (except a newline).
*? means as many of the preceding item as needed, but as few as possible (non-greedy match).
So .*? means any number of characters, but as few as possible.
Note: The reason there is a \\. in the second regex is that a literal . in the match needs to be escaped by a \, which in turn needs to be escaped as \\ inside a regular (non-raw) string.
The methods:
re.compile(pattern:str) compiles a regex for further use.
regex.sub(repl:str,string:str) replaces every match of regex in string with repl.
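Putting the two substitutions together, a minimal sketch on a shortened piece of the message from the question (the helper name clean_body is hypothetical). It strips the angle-bracket parts first, so that the second pattern cannot consume the opening < of the URL, which appears to be what left the stray fragment in the question's output.

```python
import re

ANGLE = re.compile(r"<.*?>")            # anything inside angle brackets
FILE_LINE = re.compile(r"\r\n.*?\..{3}")  # "\r\n", then text up to "." plus 3 chars

def clean_body(body: str) -> str:
    """Remove <...> parts first, then the "\\r\\n ... .ext" parts."""
    body = ANGLE.sub("", body)
    return FILE_LINE.sub("", body)

# Shortened tail of the message from the question.
body = (
    "Kind regards,\r\nTeacher Name\r\n zoom_0.mp4\r\n"
    "<https://drive.google.com/file/d/1UX-klOfVhbefvbhZvIWijaBdQuLgh_-Uru4_1QTkth/view?usp=drive_web>"
)

cleaned = clean_body(body)
print(repr(cleaned))
# 'Kind regards,\r\nTeacher Name\r\n'
```

The second pattern is still fairly broad: on longer text it can also remove ordinary sentences that follow a line break and contain a period, so it may need tightening for real messages.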
Hope it helps.

Find the maximal input string matching a regular expression

Given a regular expression re and an input string str, I want to find the maximal substring of str, which starts at the minimal position, which matches re.
Special case:
re = Regex("a+|[ax](bc)*"); str = "yyabcbcb"
Matching re with str should return the matching string "abcbc" (and not "a", as PCRE does). I am also aware that I get the result I want if the order of the alternatives is changed.
The options I found were:
POSIX extended RE - probably outdated, used by egrep ...
RE2 by Google - open source RE2 - C++ - also C-wrapper available
From my point of view, there are two problems with your question.
First, changing the order of the alternatives changes the result.
Each single 'a' in the string can match either 'a+' or '[ax](bc)*'.
So it is ambiguous which alternative an 'a' is matched against in your regular expression.
Second, finding the maximal substring requires longest-match semantics. As far as I know, only RE2 provides such a feature, as mentioned by @Cosinus.
So my recommendation is to separate "a+|[ax](bc)*" into two regexes, find the maximal substring for each of them, and then compare the positions and lengths of both substrings.
To find the longest match, you can also refer to a previous regex post description here. The main idea is to search for substrings starting from string position 0 up to len(str), and to keep track of the length and position whenever a matching substring is found.
P.S. Some languages provide regex functions similar to "findall()". Be careful of using them since the returns may be non-overlapping matches. And non-overlapping matches do not necessarily contain the longest matching substring.
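The brute-force idea described above can be sketched in Python as follows. This is only an illustration under the assumption that quadratic work is acceptable: for each start position it tries the longest candidate substring first and returns on the first full match. The function name leftmost_longest is made up for this sketch.

```python
import re

def leftmost_longest(pattern: str, text: str):
    """Return (start, substring) for the longest match at the smallest start.

    Tries every start position from left to right and, for each one, every
    end position from longest to shortest. Returns None if nothing matches.
    """
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        for end in range(len(text), start - 1, -1):
            # fullmatch with pos/endpos checks exactly text[start:end].
            if compiled.fullmatch(text, start, end):
                return start, text[start:end]
    return None

result = leftmost_longest("a+|[ax](bc)*", "yyabcbcb")
print(result)
# (2, 'abcbc')

# For comparison, the ordinary backtracking search stops at the first alternative.
print(re.search("a+|[ax](bc)*", "yyabcbcb").group())
# a
```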

Substitute string by index without using regular expressions

It should be very easy, but I am looking for an efficient way to perform it.
I know that I could split the string into two parts and insert the new value, but I have tried to substitute each line between the indexes 22-26 as follows:
line.replace(line[22:26],new_value)
The Problem
However, that function substitutes everything in the line that is similar to the pattern in line[22:26].
In the example below, I want to replace the marked number 1 with number 17:
Here are the results. Note the replacement of 1 with 17 in several places:
Thus I don't understand the behavior of replace command. Is there a simple explanation of what I'm doing wrong?
Why I don't want RE
The values between index 22-26 are not unified in form.
Note: I am using python 3.5 on Unix/Linux machines.
str.replace replaces 1 sub-string pattern with another everywhere in the string.
e.g.
'ab cd ab ab'.replace('ab', 'xy')
# produces output 'xy cd xy xy'
similarly,
mystr = 'ab cd ab ab'
mystr.replace(mystr[0:2], 'xy')
# also produces output 'xy cd xy xy'
What you could do instead, to replace just the characters in positions 22-26, is:
line = line[0:22] + new_value + line[26:]
Also, looking at your data, it seems to me to be a fixed-width text file. While my suggestion will work, a more robust way to process this data would be to read it & separate the different fields in the record first, before processing the data.
If you have access to the pandas library, it provides a useful function just for reading fixed-width files
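The difference between the two approaches can be sketched with the short string from the example above. The helper name replace_span is hypothetical; it simply wraps the slicing expression.

```python
def replace_span(line: str, start: int, end: int, new_value: str) -> str:
    """Replace line[start:end] by position, leaving the rest untouched."""
    return line[:start] + new_value + line[end:]

mystr = 'ab cd ab ab'

# str.replace works on content, so every 'ab' is replaced.
by_content = mystr.replace(mystr[0:2], 'xy')
print(by_content)
# xy cd xy xy

# Slicing works on position, so only the first two characters change.
by_position = replace_span(mystr, 0, 2, 'xy')
print(by_position)
# xy cd ab ab
```

Because strings are immutable, both forms build a new string, so the result has to be assigned back to the variable.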

Using sed to drop strings with repeated and incremental characters?

I'm trying to use sed to drop strings containing repeated characters before appending them to a file.
So far I have this, to drop strings with consecutive repetition like 'AA' or '22', but I'm struggling with full string repetition and incremental characters.
generic string generator | sed '/\([^A-Za-z0-9_]\|[A-Za-z0-9]\)\1\{1,\}/d' >> parsed sting to file
I also want to drop strings containing any repetition, like 'ABA',
as well as strings containing any ascending or descending characters, like 'AEF' or 'AFE'.
I'm assuming it would be easier to use multiple passes of sed to drop the unwanted strings.
** A little more information to try to avoid the XY problem mentioned. **
The character strings could be from 8 to 64 in length, but in this instance I'm focusing on 8. While at the same time I've restricted the string generation to only output an upper-case alpha string (A-Z). This is for a few reasons, but mainly that I don't want the generated file to have a ridiculously huge footprint.
With the first pass of sed dropping unnecessary outputs like 'AAAAAAAA' and 'AAAAAAAB' from the stream. This results in the file starting with strings 'ABABABAB' and 'ABABABAC'.
Next pass I want to check that from one character to the next doesn't increase or decrease by a value of one. So strings like 'ABABABAB' would be dropped, but 'ACACACAC' would parse to the stream.
Next pass I want to drop strings that contain any repeated characters in the whole string. So strings like 'ACACACAC' would be dropped, but 'ACEBDFHJ' would parse to the file.
Hope that helps.
In order to do what you're describing with sed, you'd need to run it many times. Since sed doesn't understand the concept of "this character is incremental from this other character", you need to run it across all possible combinations:
sed '/AB/d'
sed '/BC/d'
sed '/CD/d'
sed '/DE/d'
etc.
For descending characters, the same thing:
sed '/BA/d'
sed '/CB/d'
In order to then drop strings with repeated characters, you can do something like this:
sed '/\(.\).*\1/d'
The following should do the trick:
generic string generator |sed '/\(.\).*\1/d'|sed /BA/d|sed /AB/d|sed /CB/d|sed /BC/d|sed /DC/d|sed /CD/d|sed /ED/d|sed /DE/d|sed /FE/d|sed /EF/d|sed /GF/d|sed /FG/d|sed /HG/d|sed /GH/d|sed /IH/d|sed /HI/d|sed /JI/d|sed /IJ/d|sed /KJ/d|sed /JK/d|sed /LK/d|sed /KL/d|sed /ML/d|sed /LM/d|sed /NM/d|sed /MN/d|sed /ON/d|sed /NO/d|sed /PO/d|sed /OP/d|sed /QP/d|sed /PQ/d|sed /RQ/d|sed /QR/d|sed /SR/d|sed /RS/d|sed /TS/d|sed /ST/d|sed /UT/d|sed /TU/d|sed /VU/d|sed /UV/d|sed /WV/d|sed /VW/d|sed /XW/d|sed /WX/d|sed /YX/d|sed /XY/d|sed /ZY/d|sed /YZ/d
I only tested this on a few input samples, but they all seemed to work.
Note that this is quite ungainly, and would be better done by something a little more sophisticated than sed. Here's a sample in Python:
    def isvalid(x):
        # Reject strings that contain any repeated character.
        if len(set(x)) < len(x):
            return False
        # Reject strings where neighbouring characters differ by exactly one.
        for a in range(1, len(x)):
            if abs(ord(x[a]) - ord(x[a-1])) == 1:
                return False
        return True
This is much more readable than the giant set of sed calls, and has the same functionality.
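For comparison, the chain of sed passes could also be collapsed into two regular expressions in Python. This is only a sketch that assumes upper-case A-Z input as described in the question; the names ADJACENT, REPEATED and keep are made up for illustration.

```python
import re
import string

letters = string.ascii_uppercase

# Ascending neighbours: AB, BC, ..., YZ; then the descending ones: BA, CB, ..., ZY.
pairs = [a + b for a, b in zip(letters, letters[1:])]
pairs += [pair[::-1] for pair in pairs]

ADJACENT = re.compile("|".join(pairs))  # same job as the sed /AB/d, /BA/d, ... passes
REPEATED = re.compile(r"(.).*\1")       # same job as sed '/\(.\).*\1/d'

def keep(candidate: str) -> bool:
    """True if the string has no repeated and no incremental characters."""
    return not ADJACENT.search(candidate) and not REPEATED.search(candidate)

# Example strings taken from the question.
for candidate in ["AAAAAAAA", "ABABABAB", "ACACACAC", "ACEBDFHJ"]:
    print(candidate, keep(candidate))
```

Of the four example strings, only 'ACEBDFHJ' is kept, which agrees with the behaviour described in the question.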
