Replace $$ or more with single spaceusing Regex in python - python-3.x

In the following list of string i want to remove $$ or more with only one space.
eg- if i have $$ then one space character or if there are $$$$ or more then also only 1 space is to be replaced.
I am using the following regex but i'm not sure if it serves the purpose
regex_pattern = r"['$$']{2,}?"
Following is the test string list:
['1', 'Patna City $$$$ $$$$$$$$View Details', 'Serial No:$$$$5$$$$ $$$$Deed No:$$$$5$$$$ $$$$Token No:$$$$7$$$$ $$$$Reg Year:2020', 'Anil Kumar Singh Alias Anil Kumar$$$$$$$$Executant$$$$$$$$Late. Harinandan Singh$$$$$$$$$$$$Md. Shahzad Ahmad$$$$$$$$Claimant$$$$$$$$Late. Md. Serajuddin', 'Anil Kumar Singh Alias Anil Kumar', 'Executant', 'Late. Harinandan Singh', 'Md. Shahzad Ahmad', 'Claimant', 'Late. Md. Serajuddin', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000']

About
I am using the following regex but i'm not sure if it serves the
purpose
The pattern ['$$']{2,}? can be written as ['$]{2,}? and matches 2 or more chars being either ' or $ in a non greedy way.
Your pattern currently get the right matches, as there are no parts present like '' or $'
As the pattern is non greedy, it will only match 2 chars and will not match all 3 characters in $$$
You could write the pattern matching 2 or more dollar signs without making it non greedy so the odd number of $ will also be matched:
regex_pattern = r"\${2,}"
In the replacement use a space.

Is this what you need?:
import re
for d in data:
d = re.sub(r'\${2,}', ' ', d)

Related

Split a big text file into multiple smaller one on set parameter of regex

I have a large text file looking like:
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
......
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
.....
xxxx
.......
asdfghjkl
I want to split the text files into multiple small text files and save them as .txt in my system on occurences of ..... [multiple period markers] saved like
group1_sdsdsd.txt
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
group1_ddss.txt
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
and
group1_xxxx.txt
.....
xxxx
.......
asdfghjkl
I have figured that by usinf regex of sort of following can be done
txt =re.sub(r'(([^\w\s])\2+)', r' ', txt).strip() #for letters more than 2 times
but not able to figure out completely.
The saved text files should be named as group1_sdsdsd.txt , group1_ddss.txt and group1_xxxx.txt [group1 being identifier for the specific big text file as I have multiple bigger text files and need to do same on all to know which big text file i am splitting.
If you want to get the parts with multiple dots only on the same line, you can use and get the separate parts, you might use a pattern like:
^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*
Explanation
^ Start of string
\.{3,}\n Match 3 or more dots and a newline
(\S+)\n Capture 1+ non whitespace chars in group 1 for the filename and match a newline
\.{3,} Match 3 or more dots
(?: Non capture group to repeat as a whole part
\n Match a newline
(?!\.{3,}\n\S+\n\.{3,}) Negative lookahead, assert that from the current position we are not looking at a pattern that matches the dots with a filename in between
.* Match the whole line
)* Close the non capture group and optionally repeat it
Then you can use re.finditer to loop the matches, and use the group 1 value as part of the filename.
See a regex demo and a Python demo with the separate parts.
Example code
import re
pattern = r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*"
s = ("....your data here")
matches = re.finditer(pattern, s, re.MULTILINE)
your_path = "/your/path/"
for matchNum, match in enumerate(matches, start=1):
f = open(your_path + "group1_{}".format(match.group(1)), 'w')
f.write(match.group())
f.close()

Remove leading dollar sign from data and improve current solution

I have string like so:
"Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
I'd like to retieve the datte from the table name, so far I use:
"\$[0-9]+"
but this yields $20220316. How do I get only the date, without $?
I'd also like to get the table name: n_Cars_12345678$20220316
So far I have this:
pattern_table_info = "\(([^\)]+)\)"
pattern_table_name = "(?<=table ).*"
table_info = re.search(pattern_table_info, message).group(1)
table = re.search(pattern_table_name, table_info).group(0)
However I'd like to have a more simpler solution, how can I improve this?
EDIT:
Actually the table name should be:
n_Cars_12345678
So everything before the "$" sign and after "table"...how can this part of the string be retrieved?
You can use a regex with two capturing groups:
table\s+([^()]*)\$([0-9]+)
See the regex demo. Details:
table - a word
\s+ - one or more whitespaces
([^()]*) - Group 1: zero or more chars other than ( and )
\$ - a $ char
([0-9]+) - Group 2: one or more digits.
See the Python demo:
import re
text = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
rx = r"table\s+([^()]*)\$([0-9]+)"
m = re.search(rx, text)
if m:
print(m.group(1))
print(m.group(2))
Output:
n_Cars_1234567
20220316
You can write a single pattern with 2 capture groups:
\(table (\w+\$(\d+))\)
The pattern matches:
\(table
( Capture group 1
\w+\$ match 1+ word characters and $
(\d+) Capture group 2, match 1+ digits
) Close group 1
\) Match )
See a Regex demo and a Python demo.
import re
s = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
m = re.search(r"\(table (\w+\$(\d+))\)", s)
if m:
print(m.group(1))
print(m.group(2))
Output
n_Cars_1234567$20220316
20220316

How to identify number near word using regex

Need to identify numbers near keyword number:, no:, etc..
Tried:
import re
matchstring="Sales Quote"
string_lst = ['number:', 'No:','no:','number','No : ']
x=""" Sentence1: Sales Quote number 36886DJ9 is entered
Sentence2: SALES QUOTE No: 89745DFD is entered
Sentence3: Sales Quote No : 7964KL is entered
Sentence4: SALES QUOTE NUMBER:879654DF is entered
Sentence5: salesquote no: 9874656LD is entered"""
documentnumber= re.findall(r"(?:(?<="+matchstring+ '|'.join(string_lst)+r')) [\w\d-]',x,flags=re.IGNORECASE)
print(documentnumber)
Required soln:36886DJ9,89745DFD,7964KL,879654DF,9874656LD
Is there any solution?
Actually your solution is very close. You just need some missing parenthesis and check for optional whitespace:
documentnumber = re.findall(r"(?:(?<="+matchstring + ").*?(?:" + '|'.join(string_lst) + ')\s?)([\w\d-]*)', x, re.IGNORECASE)
However this won't match with the last one (9874656LD) because of the missing whitespace between "Sales" and "quote". If you want to build it in the same way than the rest of the pattern, replace the lookbehind by a non capturing group and join words with \s?:
documentnumber= re.findall(r"(?:(?:" + "\s?".join(matchstring.split()) + ").*?(?:" + '|'.join(string_lst) + ')\s?)([\w\d-]*)', x, re.IGNORECASE)
Output:
['36886DJ9', '89745DFD', '7964KL', '879654DF', '9874656LD']

Regex pattern captures address well but fails to detect the zipcode and returns only the first letter of the city

In this case the pattern captures the address well but fails when it comes to the zip code and the city:
The text fragment is:
'''
Kοινοποίηση: • Εταιρεία: «ICON FITNESS ΕΚΜΕΤΑΛΛΕΥΣΗ ΓΥΜΝΑΣΤΗΡΙΩΝ ΑΝΩΝΥΜΗ ΕΤΑΙΡΕΙΑ» Δ/νση: ΣΕΚΕΡΗ 5 Τ.Κ.: 10671, ΑΘΗΝΑ • Περιφέρεια Αττικής Γεν. Δ/νση Ανάπτυξης Δ/νση Ανάπτυξης Π.Ε. Κεντρικού Τομέα Αθηνών Τμήμα Ανωνύμω
'''
The regex pattern is:
dieythinsi_pattern1 = re.compile(r'Kοινοποίηση:?.*«[^«»]*»\s+.+?:?(\w+)\s(\d+)\s*(?:,?\s*(\d{5}))?(?:,?\s*([\w]+))?', re.I)
This returns the street and the number but fails to detect the zipcoe and returns only the initial letter of the city.
The address is the following: ΣΕΚΕΡΗ 5 Τ.Κ.: 10671, ΑΘΗΝΑ where 10671 is the zip code.

How to match optional Number along with alphanumeric in Ruta Script

I am working on entity extraction in Pega. I have requirement to match a policy number which has 3 parts:
1) Optionally 1 would be first character in policy. It is optional
2) alphanumeric of length 2 followed by optionally Hyphen or Space
3) alphanumeric of length 3
So some examples of formats are:
AB-CDE, AB CDE, ABCDE, 1AB-CDE
23-456, 23 456, 23456, 123456
AB-2B4, AB-B2C, A1-2B4, 2A-34B, 12A-34B, 123-45C etc.
I am facing problem whenever policy number is starting with 2 or 3 digits or it don't have any space or hyphen.
For example 12A-34B, 123-45C, 23456, 123456.
I have written below script:
PACKAGE uima.ruta.example;
Document{-> RETAINTYPE(SPACE)};
("1")+? ((NUM* W*)|(W* NUM*)){REGEXP(".{2}")} ("-"|SPACE)? ((NUM* W* NUM*)|(W* NUM* W*)){REGEXP(".{3}")->MARK(EntityType,1,4)};
((NUM* W*)|(W* NUM*)){REGEXP(".{2}")} ("-"|SPACE)? ((NUM* W* NUM*)|(W* NUM* W*)){REGEXP(".{3}")->MARK(EntityType,1,3)};
This code is working fine for patterns having space/hyphen like:
AB-CDE, AB CDE, 1AB-CDE. But not working if don't have space and hyphen or pattern starts with 2 or 3 digits.
Please help to write correct pattern.
Thanks in advance.
The UIMA Ruta seed annotation NUM, covers the whole number. Therefore, examples like 23456, 123456 cannot be split in subannotations by Ruta.
A solution would be to use pure regexp to annotate all the mentioned examples:
"\\w{2,3}[\\-|\\s]?\\w{2,3}" -> EntityType;

Resources