Regex Replacements of Gibberish in Python Pandas - python-3.x

I have some strings, some of which are gibberish: a mixture of digits and letters. I would like to remove the gibberish but keep the strings that follow a pattern.
I am providing an example for illustrative purposes.
strings = ["1Z83E0590391137855",
"55t5555t5t5tttt5t5555tttttttgggggggggggggggsss",
"1st", "2nd", "3rd", "4th", "5th"
]
import pandas as pd
df = pd.DataFrame(strings, columns=['strs'])
df
I would like to remove strings that look like
1Z83E0590391137855
55t5555t5t5tttt5t5555tttttttgggggggggggggggsss
and keep strings that look like ones below
1st
2nd
3rd
4th
5th
Given my limited regex and Python experience, I am having some difficulty coming up with the right formulation. What I have tried removed everything except the first row:
df['strs'] = df['strs'].str.replace(r'(?=.*[a-z])(?=.*[\d])[a-z\d]+', '', regex=True)

I suggest matching only alphanumeric strings that contain both letters and digits and are at least a certain length.
In the example below, I set the threshold to 18: strings shorter than 18 chars won't be matched and thus will remain in the column, while matching strings of 18 or more chars are replaced with an empty string:
df['strs'] = df['strs'].str.replace(r'^(?=.{18})(?:[a-zA-Z]+\d|\d+[a-zA-Z])[a-zA-Z\d]*$', '', regex=True)
Details:
^ - start of string
(?=.{18}) - the string must start with at least 18 chars other than line break chars
(?:[a-zA-Z]+\d|\d+[a-zA-Z]) - one or more letters and then a digit, or one or more digits and then a letter
[a-zA-Z\d]* - zero or more alphanumeric chars
$ - end of string.
See the regex demo.
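As a quick check of the pattern above, here is a minimal sketch that applies it with the standard re module to the sample strings from the question; with pandas, the same pattern is passed to Series.str.replace as shown in the answer. The names GIBBERISH, cleaned and kept are only illustrative. Matched strings are replaced with an empty string rather than dropped, so a separate filtering step is assumed if the entries themselves should disappear.

```python
import re

# Pattern from the answer above; the length threshold of 18 is that answer's choice.
GIBBERISH = re.compile(r'^(?=.{18})(?:[a-zA-Z]+\d|\d+[a-zA-Z])[a-zA-Z\d]*$')

strings = [
    "1Z83E0590391137855",
    "55t5555t5t5tttt5t5555tttttttgggggggggggggggsss",
    "1st", "2nd", "3rd", "4th", "5th",
]

# Long letter/digit mixes become empty strings; short strings are untouched.
cleaned = [GIBBERISH.sub('', s) for s in strings]
print(cleaned)  # ['', '', '1st', '2nd', '3rd', '4th', '5th']

# Optional extra step (an assumption, not part of the answer): drop the blanks.
kept = [s for s in cleaned if s]
print(kept)  # ['1st', '2nd', '3rd', '4th', '5th']
```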

You could check that the line is not an ordinal such as 1st or 2nd, and match (and so remove) only the lines that are not:
^(?!\d+(?:st|nd|rd|th)$).*$
Regex demo
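A minimal sketch of this alternative, again with plain re on the question's sample data (the names NOT_ORDINAL and cleaned are illustrative). It assumes every string that is not digits followed by st, nd, rd or th should be blanked, so it would also keep oddities such as 1th.

```python
import re

# Negative lookahead: match the whole string unless it is digits + st/nd/rd/th.
NOT_ORDINAL = re.compile(r'^(?!\d+(?:st|nd|rd|th)$).*$')

strings = [
    "1Z83E0590391137855",
    "55t5555t5t5tttt5t5555tttttttgggggggggggggggsss",
    "1st", "2nd", "3rd", "4th", "5th",
]

cleaned = [NOT_ORDINAL.sub('', s) for s in strings]
print(cleaned)  # ['', '', '1st', '2nd', '3rd', '4th', '5th']

# The suffix is not validated against the number, so "1th" is kept as well.
print(NOT_ORDINAL.sub('', "1th"))  # 1th
```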

Related

How to substitute a repeating character with the same number of a different character in regex python?

Assume there's a string
"An example striiiiiing with other words"
I need to replace the 'i's with '*'s like 'str******ng'. The number of '*' must be same as 'i'. This replacement should happen only if there are consecutive 'i' greater than or equal to 3. If the number of 'i' is less than 3 then there is a different rule for that. I can hard code it:
import re
text = "An example striiiiing with other words"
out_put = re.sub(re.compile(r'i{3}', re.I), r'*'*3, text)
print(out_put)
# An example str***iing with other words
But number of i could be any number greater than 3. How can we do that using regex?
The i{3} pattern only matches exactly three consecutive i characters anywhere in the string. You need i{3,} to match three or more of them. However, to make it all work, you need to pass a callable as the replacement argument to re.sub; it receives the match object, so you can take the length of the matched text and repeat the asterisk that many times.
Also, it is advisable to declare the regex outside of re.sub, or just use a string pattern since patterns are cached.
Here is the code that fixes the issue:
import re
text = "An example striiiiing with other words"
rx = re.compile(r'i{3,}', re.I)
out_put = rx.sub(lambda x: r'*'*len(x.group()), text)
print(out_put)
# => An example str*****ng with other words

Removing Characters With Regular Expression in List Comprehension in Python

I am learning python and I am trying to do some text preprocessing and I have been reading and borrowing ideas from Stackoverflow. I was able to come up with the following formulations below, but they don't appear to do what I was expecting, and they don't throw any errors either, so I'm stumped.
First, in a Pandas dataframe column, I am trying to remove the third consecutive character in a word; it's kind of like running a spell check on words that are supposed to have two consecutive characters instead of three:
buttter = butter
bettter = better
ladder = ladder
The code I used is below:
import re
docs['Comments'] = [c for c in docs['Comments'] if re.sub(r'(\w)\1{2,}', r'\1', c)]
In the second instance, I just want to replace multiple punctuation characters with the last one.
????? = ?
..... = .
!!!!! = !
---- = -
***** = *
And the code I have for that is:
docs['Comments'] = [i for i in docs['Comments'] if re.sub(r'[\?\.\!\*]+(?=[\?\.\!\*])', '', i)]
It looks like you want to use
docs['Comments'] = (docs['Comments'].str.replace(r'(\w)\1{2,}', r'\1\1', regex=True)
    .str.replace(r'([^\w\s]|_)(\1)+', r'\2', regex=True))
The r'(\w)\1{2,}' regex finds three or more repetitions of the same word char, and \1\1 replaces them with two occurrences of that char. See this regex demo.
The r'([^\w\s]|_)(\1)+' regex matches repeated punctuation chars and captures the last into Group 2, so \2 replaces the match with the last punctuation char. See this regex demo.
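The list comprehensions in the question leave the data unchanged because the re.sub result is only used as the if condition: any non-empty result is truthy, so the original element is kept as-is. A minimal sketch with plain re that shows the difference (the sample comments list and the normalize helper are made up for illustration):

```python
import re

comments = ["buttter", "bettter", "ladder", "Really?????", "Wait.....", "so ---- what"]

# The question's approach: re.sub only acts as a filter, so c is emitted unmodified.
unchanged = [c for c in comments if re.sub(r'(\w)\1{2,}', r'\1', c)]
print(unchanged == comments)  # True

def normalize(text):
    """Hypothetical helper applying both substitutions from the answer."""
    # Three or more of the same word char -> two of that char.
    text = re.sub(r'(\w)\1{2,}', r'\1\1', text)
    # A run of the same punctuation char -> a single one.
    return re.sub(r'([^\w\s]|_)(\1)+', r'\2', text)

fixed = [normalize(c) for c in comments]
print(fixed)  # ['butter', 'better', 'ladder', 'Really?', 'Wait.', 'so - what']
```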

python converting strings into three blocks and if not two blocks

I want to write a function that takes the given string T and groups its digits into blocks of three.
However, when the last block can't be broken down into groups of three digits, I want it split into two blocks instead.
For example, this is my code
import re
def num_format(T):
    clean_number = re.sub('[^0-9]+', '', T)
    formatted_number = re.sub(r"(\d{3})(?=(\d{3})+(?!\d{3}))", r"\1-", clean_number)
    return formatted_number
num_format("05553--70002654")
This returns '055-537-000-2654'.
However, I want it to be '055-537-000-26-54'.
I used the regular expression, but have no idea how to split the last remaining numbers into two blocks!
I would really appreciate helping me to figure this problem out!!
Thanks in advance.
You can use
def num_format(T):
    clean_number = ''.join(c for c in T if c.isdigit())
    return re.sub(r'(\d{3})(?=\d{2})|(?<=\d{2})(?=\d{2}$)', r'\1-', clean_number)
See the regex demo.
Note that you can get rid of all non-numeric chars with a plain Python generator expression; that part of the solution is borrowed from Removing all non-numeric characters from string in Python.
The regex matches
(\d{3}) - Group 1 (\1): three digits...
(?=\d{2}) - followed with two digits
| - or
(?<=\d{2})(?=\d{2}$) - a location between any two digit sequence and two digits that are at the end of string.
See the Python demo:
import re
def num_format(T):
    clean_number = ''.join(c for c in T if c.isdigit())
    return re.sub(r'(\d{3})(?=\d{2})|(?<=\d{2})(?=\d{2}$)', r'\1-', clean_number)
print(num_format("05553--70002654"))
# => 055-537-000-26-54

The correct way to identify a regular expression of the sort [variableName].add(

I'm looking for a clean way to identify occurrences of [variableName] followed by the exact string .add(.
A variable name is a string which contains one or more characters from a-z, A-Z, 0-9 and an underscore.
One more thing is that it cannot start with any of the characters from 0-9, but I don't mind ignoring this condition because there are no such cases in the text that I need to parse anyway.
I've been following several tutorials, but the farthest I got was finding all occurrences of what I've referred to above as "variableName":
import re
txt = "The _rain() in+ Spain5"
x = re.split("[^a-zA-Z0-9_]+", txt)
print(x)
What is the right way to do it?
You may use
re.findall(r'\w+(?=\.add\()', txt, flags=re.ASCII)
The regex matches:
\w+ - 1+ word chars (due to re.ASCII, it only matches [A-Za-z0-9_] chars)
(?=\.add\() - a positive lookahead that matches a location immediately followed with .add( substring.
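A minimal runnable sketch of the lookahead pattern above. The sample txt string is made up for illustration, and the stricter variant that rejects names starting with a digit is an optional assumption, since the question says that condition can be ignored.

```python
import re

# Hypothetical input containing several ".add(" calls.
txt = "items.add(1); my_set.add(x); total = a + b; _cache.add(key); 9lives.add(2)"

# Pattern from the answer: word chars immediately followed by ".add(".
names = re.findall(r'\w+(?=\.add\()', txt, flags=re.ASCII)
print(names)  # ['items', 'my_set', '_cache', '9lives']

# Optional stricter variant: the name must start with a letter or underscore.
strict = re.findall(r'\b[A-Za-z_]\w*(?=\.add\()', txt, flags=re.ASCII)
print(strict)  # ['items', 'my_set', '_cache']
```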

How to find arabic character all occurrences with its dialect?

I am trying to find the occurrences of an Arabic character together with its harakat in a string, such as "رَّ" in "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ".
Arabic characters can take harakat: for example, "ر" is the base Arabic character, but with harakat it can look like "رَّ". I am using Python 3 to count the occurrences of the character with a specific harakat but could not do that. I have tried a for loop and tried converting the string to Unicode, without success.
str = "مرة رجل حكيم قال بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
i=0
for s in str:
    if s == "رَّ":
        i = i + 1
print(i)
Expected output is 2 but 0 is what I get.
len("رَّ") returns 3, which means the glyph is represented by three characters. Your loop checks a single character at a time and so never finds a match.
You need to be looking for substrings, which is exactly what .count() is for.
i = str.count('رَّ')
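A minimal sketch of why the loop fails and how substring counting behaves, written with Unicode escapes so the code points are explicit. The sample text is made up, and the normalization step is an added assumption: it only matters when the same marks may be stored in a different order in different places.

```python
import unicodedata

# Code points: the letter RA plus two combining marks (harakat).
RA, FATHA, SHADDA = "\u0631", "\u064E", "\u0651"

target = RA + SHADDA + FATHA
print(len(target))  # 3, so a one-character loop variable can never equal it

# Hypothetical text: the same visible sequence stored with both mark orders,
# followed by a bare RA that should not be counted.
text = "a" + RA + SHADDA + FATHA + "b" + RA + FATHA + SHADDA + RA

# Plain substring counting only finds the identical code point sequence.
print(text.count(target))  # 1

def nfc(s):
    """Canonical normalization puts combining marks into one fixed order."""
    return unicodedata.normalize("NFC", s)

print(nfc(text).count(nfc(target)))  # 2
```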

Resources