Regex to extract lines that starts with any character but not with characters /words like Rs or SBC using Python - python-3.x

Below is the input file:
8% OFF
Sugar Free Gold Sweetener
500 tablets
Rs230
Rs 250
SBC IconClub Price: Rs227
43% OFF
Palm Oil (Bottle)
1 l
Rs186
Rs 330
SBC IconClub Price: Rs185
I am trying to extract those lines that are not starting with either Rs or SBC or any number. They are actually product names which are starting with words only. I used below Regex:
^[A-Za-z].*$
This removed the lines starting with numbers only.
However, I am unable to remove the below lines that are starting with RS, and SBC.
Rs230
Rs 250
SBC IconClub Price: Rs227
Rs186
Rs 330
SBC IconClub Price: Rs185
Could anyone please help me with the regex in Python which can give the product name only? Products are on line 2 and 8.

You can exclude matching SBC RS or Rs using a negative lookahead:
^(?!SBC|R[sS])[A-Za-z].*$
Regex demo
If the strings all start with an uppercase char, you can also use
^(?!SBC|R[sS])[A-Z].*$

Related

Split a big text file into multiple smaller one on set parameter of regex

I have a large text file looking like:
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
......
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
.....
xxxx
.......
asdfghjkl
I want to split the text files into multiple small text files and save them as .txt in my system on occurences of ..... [multiple period markers] saved like
group1_sdsdsd.txt
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
group1_ddss.txt
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
and
group1_xxxx.txt
.....
xxxx
.......
asdfghjkl
I have figured that by usinf regex of sort of following can be done
txt =re.sub(r'(([^\w\s])\2+)', r' ', txt).strip() #for letters more than 2 times
but not able to figure out completely.
The saved text files should be named as group1_sdsdsd.txt , group1_ddss.txt and group1_xxxx.txt [group1 being identifier for the specific big text file as I have multiple bigger text files and need to do same on all to know which big text file i am splitting.
If you want to get the parts with multiple dots only on the same line, you can use and get the separate parts, you might use a pattern like:
^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*
Explanation
^ Start of string
\.{3,}\n Match 3 or more dots and a newline
(\S+)\n Capture 1+ non whitespace chars in group 1 for the filename and match a newline
\.{3,} Match 3 or more dots
(?: Non capture group to repeat as a whole part
\n Match a newline
(?!\.{3,}\n\S+\n\.{3,}) Negative lookahead, assert that from the current position we are not looking at a pattern that matches the dots with a filename in between
.* Match the whole line
)* Close the non capture group and optionally repeat it
Then you can use re.finditer to loop the matches, and use the group 1 value as part of the filename.
See a regex demo and a Python demo with the separate parts.
Example code
import re
pattern = r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*"
s = ("....your data here")
matches = re.finditer(pattern, s, re.MULTILINE)
your_path = "/your/path/"
for matchNum, match in enumerate(matches, start=1):
f = open(your_path + "group1_{}".format(match.group(1)), 'w')
f.write(match.group())
f.close()

Replace $$ or more with single spaceusing Regex in python

In the following list of string i want to remove $$ or more with only one space.
eg- if i have $$ then one space character or if there are $$$$ or more then also only 1 space is to be replaced.
I am using the following regex but i'm not sure if it serves the purpose
regex_pattern = r"['$$']{2,}?"
Following is the test string list:
['1', 'Patna City $$$$ $$$$$$$$View Details', 'Serial No:$$$$5$$$$ $$$$Deed No:$$$$5$$$$ $$$$Token No:$$$$7$$$$ $$$$Reg Year:2020', 'Anil Kumar Singh Alias Anil Kumar$$$$$$$$Executant$$$$$$$$Late. Harinandan Singh$$$$$$$$$$$$Md. Shahzad Ahmad$$$$$$$$Claimant$$$$$$$$Late. Md. Serajuddin', 'Anil Kumar Singh Alias Anil Kumar', 'Executant', 'Late. Harinandan Singh', 'Md. Shahzad Ahmad', 'Claimant', 'Late. Md. Serajuddin', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000']
About
I am using the following regex but i'm not sure if it serves the
purpose
The pattern ['$$']{2,}? can be written as ['$]{2,}? and matches 2 or more chars being either ' or $ in a non greedy way.
Your pattern currently get the right matches, as there are no parts present like '' or $'
As the pattern is non greedy, it will only match 2 chars and will not match all 3 characters in $$$
You could write the pattern matching 2 or more dollar signs without making it non greedy so the odd number of $ will also be matched:
regex_pattern = r"\${2,}"
In the replacement use a space.
Is this what you need?:
import re
for d in data:
d = re.sub(r'\${2,}', ' ', d)

How to match optional Number along with alphanumeric in Ruta Script

I am working on entity extraction in Pega. I have requirement to match a policy number which has 3 parts:
1) Optionally 1 would be first character in policy. It is optional
2) alphanumeric of length 2 followed by optionally Hyphen or Space
3) alphanumeric of length 3
So some examples of formats are:
AB-CDE, AB CDE, ABCDE, 1AB-CDE
23-456, 23 456, 23456, 123456
AB-2B4, AB-B2C, A1-2B4, 2A-34B, 12A-34B, 123-45C etc.
I am facing problem whenever policy number is starting with 2 or 3 digits or it don't have any space or hyphen.
For example 12A-34B, 123-45C, 23456, 123456.
I have written below script:
PACKAGE uima.ruta.example;
Document{-> RETAINTYPE(SPACE)};
("1")+? ((NUM* W*)|(W* NUM*)){REGEXP(".{2}")} ("-"|SPACE)? ((NUM* W* NUM*)|(W* NUM* W*)){REGEXP(".{3}")->MARK(EntityType,1,4)};
((NUM* W*)|(W* NUM*)){REGEXP(".{2}")} ("-"|SPACE)? ((NUM* W* NUM*)|(W* NUM* W*)){REGEXP(".{3}")->MARK(EntityType,1,3)};
This code is working fine for patterns having space/hyphen like:
AB-CDE, AB CDE, 1AB-CDE. But not working if don't have space and hyphen or pattern starts with 2 or 3 digits.
Please help to write correct pattern.
Thanks in advance.
The UIMA Ruta seed annotation NUM, covers the whole number. Therefore, examples like 23456, 123456 cannot be split in subannotations by Ruta.
A solution would be to use pure regexp to annotate all the mentioned examples:
"\\w{2,3}[\\-|\\s]?\\w{2,3}" -> EntityType;

Separate complex text from number patterns from a single string

I have a text file that looks something like this (but with hundreds of lines):
1147-1 SYRUP: DR.PEPPER 5GALLON/BOX
1653-1 SYRUP: DIET DR.PEPPER 5GAL/BOX
2011-2 WATER DISTILLED 6 / 1 GA
1217-2 ALL PURPOSE RASPBERRY FIL 40#
1273-1 STRAWBERRY PIE FILLING 38#
2893-1 BREAD: SOURDOUGH 12/1# OVAL
2287-1 BREAD SQUAW: 8/2.25#LF
1929-1 VINEGAR HERB CONT GRDN 12/12.7
1949-2 KETCHUP: 16/14OZ-PLASTIC BTLS
2408-1 CONE 3 NAB SAMPLER 28/45
2939-1 DULCE LECH FLVR PKT 3/12 EA CS
3017-1 GINGRBRD FLVR PKT 3/12 EA CS
3055-2 EGGNOG FLVR PKT 3/12 EA CS
3192-1 ORIGINAL MRS. DASH SEASONING
I've created the code to pull everything in from the text file line by line and strip out the numbers at the beginning and save the next portion (ie SYRUP: DR.PEPPER 5GALLON/BOX, ALL PURPOSE RASPBERRY FIL 40#) to Mid(nextLine, 10, 30). I want to take that portion and split it up by pulling the name (SYRUP: DR.PEPPER, ALL PURPOSE RASPBERRY FIL), the size (5GALLON, 40#) the number of that size (if it is 12/1# = 12x 1LB) and the unit (BOX, LB) out. As you can see almost every line is different but with many similarities. Not really sure what to do next. I have been trying to use:
re.Pattern = "GALLON|BOX|#|LF|GAL|EA|CS|BTLS"
to pull out the unit portion but I don't know what else to do.
Here is the code I have so far for this portion:
Function NumericOnly(s As String) As String
Dim StrUnit As String
Static re As RegExp
If re Is Nothing Then Set re = New RegExp
re.IgnoreCase = True
re.Global = True
re.Pattern = "GALLON|BOX|#|LF|GAL|EA|CS|BTLS"
StrUnit = re.Replace(s, "")
End Function

Reading a specific txt file and re-arrange it to a given format

Below is an output of Chemichal analysis instrument. I need to rearrange the format and sort it in a way that percentage figure for each element goes below its name. My question is how to read this file word by word? how can I choose, for instance word number 12?
txt file format:
Header_1 Date Time Method_Name (Filter_Name) Calc_Mode Heat No. Quality Anal. Code Sample ID C Si Mn P S Cr Mo Ni Al Co Cu Nb Ti V W Pb Sn As Bi Ca Sb Se B Zn N Fe Place Code Work Phase
Single 13.01.13 09:51:10 Fe-10 Test AutoResult 12A 00001.040 00000.437 00000.292 00000.023 00000.007 00001.505 00000.263 00000.081 00000.012 00000.014 00000.110 00000.155 00000.040 00000.098 00000.015 00000.014 00000.013 00000.012 00000.002 00000.001 00000.016 00000.014 00000.005 00000.001 00000.016 00095.813
To find word 12, read the line character by character until you have seen 11 instances of whatever is being used to separate words (which you have not specified); what follows, until the next such separator, will be the 12th word.

Resources