How to insert a list of words inside a particular regex - python-3.x

import re
text = """STAR PLUS LIMITED Unit B & C, 15/F, Casey Aberdeen House, 38 Heung Yip Road, Wong Chuk Hang, Hong Kong. Tel: (852)2511 0112 Fax: 2507 4300 Email: info#starplushk.com Ref No: LSM25781 SALES Sales Quote No: SP21-SQ10452 Buyer's Ref: LSM-021042-5 Messers JSC "Tander" Russian Federation 350002 Krasnodar"""
ref_no = re.findall(r"(?:(?<=Buyer's Ref: )|(?<=Ref No: ))[\w\d-]+",text)
print(ref_no)
Required solution: ['LSM25781', 'LSM-021042-5']
The script above produces the required output, but I have many keywords, so I want to generate the regex dynamically. How can I do that?
Tried:
ref_keywords = ["Buyer's Ref:","Ref No:","Reference number:"]
b = r"(?:(?<=" + '|'.join(ref_keyword)+ r" ))[\w\d-]+"
ref_no = re.findall(b, text)
print(ref_no)
This results in the following error
Traceback (most recent call last):
File "/home/v/.config/JetBrains/PyCharm2021.3/scratches/scratch_2.py", line 7, in <module>
ref_no = re.findall(regex, text)
File "/home/v/.pyenv/versions/3.9.5/lib/python3.9/re.py", line 241, in findall
return _compile(pattern, flags).findall(string)
File "/home/v/.pyenv/versions/3.9.5/lib/python3.9/re.py", line 304, in _compile
p = sre_compile.compile(pattern, flags)
File "/home/v/.pyenv/versions/3.9.5/lib/python3.9/sre_compile.py", line 768, in compile
code = _code(p, flags)
File "/home/v/.pyenv/versions/3.9.5/lib/python3.9/sre_compile.py", line 607, in _code
_compile(code, p.data, flags)
File "/home/v/.pyenv/versions/3.9.5/lib/python3.9/sre_compile.py", line 182, in _compile
raise error("look-behind requires fixed-width pattern")
re.error: look-behind requires fixed-width pattern
Process finished with exit code 1
Is there a way to insert a list of keywords into the regex? I cannot join them with "|" by hand because I have many keywords.
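For context on that traceback: the re module only accepts look-behinds in which every alternative has the same length, so joining keywords of different lengths inside a single look-behind makes it variable-width. A minimal reproduction of the failure (illustrative pattern only):

import re

# Two alternatives of different lengths inside ONE look-behind group:
# re raises "look-behind requires fixed-width pattern" at compile time.
re.compile(r"(?<=Ref No: |Buyer's Ref: )[\w-]+")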

import re

key_words = ['key1', 'key2', 'key3']
combined_pattern = "|".join(key_words)  # avoids a leading "|" which would match the empty string
sentence = "I wanna delete key1 and key2 but also key3."
print(re.split(combined_pattern, sentence))

You can do it the following way (you almost had it, you just need to create a separate look-behind for each keyword):
import re
text = """STAR PLUS LIMITED Unit B & C, 15/F, Casey Aberdeen House, 38 Heung Yip Road, Wong Chuk Hang, Hong Kong. Tel: (852)2511 0112 Fax: 2507 4300 Email: info#starplushk.com Ref No: LSM25781 SALES Sales Quote No: SP21-SQ10452 Buyer's Ref: LSM-021042-5 Messers JSC "Tander" Russian Federation 350002 Krasnodar"""
ref_keywords = ["Buyer's Ref:", "Ref No:", "Reference number:"]
def keyword_to_regex(keyword: str) -> str:
    # you missed creating these for each keyword
    return f"(?<={keyword} )"

regex_for_all_keywords = r"(?:" + "|".join(map(keyword_to_regex, ref_keywords)) + r")[\w\d-]+"
ref_no = re.findall(regex_for_all_keywords, text)
print(ref_no)  # ['LSM25781', 'LSM-021042-5']
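A small, hedged addition on top of that answer: if a keyword ever contains regex metacharacters (a dot, parentheses, etc.), wrapping it in re.escape keeps the generated look-behind a plain literal match:

import re

def keyword_to_regex(keyword: str) -> str:
    # re.escape neutralises any regex metacharacters inside the keyword
    return f"(?<={re.escape(keyword)} )"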

Related

Best way to handle element of dict that has multiple key/value pairs inside it

[{'id': 2, 'Registered Address': 'Line 1: 1 Any Street Line 2: Any locale City: Any City Region / State: Any Region Postcode / Zip code: BA2 2SA Country: GB Jurisdiction: Any Jurisdiction'}]
I have the above read into a dataframe, and that is the output so far. The issue is that I need to break out the individual elements; because of place names etc. the values may or may not contain spaces. Looking at the above, my keys are Line 1, Line 2, City, Region / State, Postcode / Zip, Country, and Jurisdiction.
The output required for the "Registered Address" key is the keys and values:
"Line 1": "1 Any Street"
"Line 2": "Any locale"
"City": "Any City"
"Region / State": "Any Region"
"Postcode / Zip code": "BA2 2SA"
"Country": "GB"
"Jurisdiction": "Any Jurisdiction"
Just struggling to find a way to get to the end result. I have tried to pop out and use urllib.parse but fell short - is anyone able to point me in the best direction please?
I tried to write code that generalizes your question, but there were some limitations regarding your data format. Anyway, I would do this:
def address_spliter(my_data, my_keys):
    address_data = my_data[0]['Registered Address']
    key_address = {}
    for i, k in enumerate(my_keys):
        if k == 'Jurisdiction:':  # last key: take everything after it
            key_address[k] = address_data.split(k)[1].strip()
        else:
            key_address[k] = address_data.split(k)[1].split(my_keys[i + 1])[0].strip()
    return key_address
where you can call this function like this:
my_data = [{'id': 2, 'Registered Address': 'Line 1: 1 Any Street Line 2: Any locale City: Any City Region / State: Any Region Postcode / Zip code: BA2 2SA Country: GB Jurisdiction: Any Jurisdiction'}]
and
my_keys = ['Line 1:', 'Line 2:', 'City:', 'Region / State:', 'Postcode / Zip code:', 'Country:', 'Jurisdiction:']
As you can see, it will only work if the sequence of keys is not changed. But you can work around this idea and adapt it to your problem if it doesn't go as expected.
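Putting it together, a call would look roughly like this (expected output based on the function above and the keys list shown):

print(address_spliter(my_data, my_keys))
# {'Line 1:': '1 Any Street', 'Line 2:': 'Any locale', 'City:': 'Any City',
#  'Region / State:': 'Any Region', 'Postcode / Zip code:': 'BA2 2SA',
#  'Country:': 'GB', 'Jurisdiction:': 'Any Jurisdiction'}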

After a separator there is a key in a loop. How to keep it?

---------------------------
CompanyID: 000000000000
Pizza: 2 3.15 6.30
spaghetti: 1 7 7
ribye: 2 40 80
---------------------------
CompanyID: 000000000001
burger: 1 3.15 6.30
spaghetti: 1 7 7
ribye: 2 40 80
--------------------------
I'm doing a for loop over a list of lines; every line is an item of the list. I need to keep the companyID while looking for a user-input product.
While this sets the variable x=True, I can't capture the company ID to print it.
a = '-'
for line in lines:
    if a in line:
        companyID = next(line)
    if product in line:
        x = True
TypeError: 'str' object is not an iterator
You can use your line separator to identify when new data starts. Once you see the line with "----" you can start collecting info in a new dictionary. For each line, take its key and value by splitting on ":" and create the entry in the dictionary.
When you see the next "----" line you know that's the end of the data for this company, so then do your check to see if they have the product and, if so, print the company id from the dictionary.
line_seperator_char = '-'
company_data = {}
product = 'burger'

with open('data.dat') as lines:
    for line in lines:
        line = line.rstrip()
        if line.startswith(line_seperator_char):
            if product in company_data:
                print(f'{company_data["CompanyID"]} contains the product {product}')
            company_data = {}
        else:
            key, value = line.split(':')
            company_data[key] = value
OUTPUT
000000000001 contains the product burger
No, it doesn't run. Could you explain what the "[1]" means near split()[1]?
Another try that doesn't run is:
y = []
y = lines[1].split(' ')
for line in lines:
    y = line.split(' ')
    if len(y[1]) == 10:
        companyID = y[1]
    if product in line:
        x = True
Thanks for the answers. Something that finally worked in my case was this:
for line in lines:
    if line.startswith("CompanyID:"):
        y = line.split(' ')
        companyID = y[1]
    if product in line:
        x = True
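For completeness, a hedged sketch of how the remembered companyID and the product check can be combined so the ID actually gets printed (lines and product are assumed to be defined as in the question):

companyID = None
for line in lines:
    if line.startswith("CompanyID:"):
        companyID = line.split(' ')[1]  # remember the most recent company id
    elif product in line and companyID is not None:
        print(f'{companyID} contains the product {product}')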

Text file to CSV conversion

I have a text file which has content like:
Name: Aar saa
Last Name: sh
DOB: 1997-03-22
Phone: 1212222
Graduation: B.Tech
Specialization: CSE
Graduation Pass Out: 2019
Graduation Percentage: 60
Higher Secondary Percentage: 65
Higher Secondary School Name: Guru Nanak Dev University,amritsar
City: hyd
Venue Details: CMR College of Engineering & Technology (CMRCET) Medchal Road, TS - 501401
Name: bfdg df
Last Name: df
DOB: 2005-12-16
Phone: 2222222
Graduation: B.Tech
Specialization: EEE
Graduation Pass Out: 2018
Graduation Percentage: 45
Higher Secondary Percentage: 45
Higher Secondary School Name: asddasd
City: vjd
Venue Details: Prasad V. Potluri Siddhartha Institute Of Technology, Kanuru, AP - 520007
Name: cc dd ee
Last Name: ee
DOB: 1995-07-28
Phone: 444444444
Graduation: B.Tech
Specialization: ECE
Graduation Pass Out: 2019
Graduation Percentage: 75
Higher Secondary Percentage: 93
Higher Secondary School Name: Sasi institute of technology and engineering
City: hyd
Venue Details: CMR College of Engineering & Technology (CMRCET) Medchal Road, TS - 501401
I want to convert it to a CSV file with headers
['Name', 'Last Name','DOB', 'Phone', 'Graduation','Specialization','Graduation Pass Out','Higher Secondary School Name','City','Venue Details']
with the values being everything after ':'.
I have done something like this:
import csv

def parselines(lines, writer):
    data = []
    for line in lines.split('\n'):
        Name = line.split(": ", 1)[1]
        data.append(Name)
    writer.writerow(data)

writer = csv.writer(open('result.csv', 'a'))
writer.writerow(['Name', 'Last Name', 'DOB', 'Phone', 'Graduation', 'Specialization', 'Graduation Pass Out', 'Graduation Percentage', 'Higher Secondary Percentage', 'Higher Secondary School Name', 'City', 'Venue Details'])

with open('Name2.txt') as f:
    text = f.read()

myarray = text.split("\n\n")
for text1 in myarray:
    parselines(text1, writer)
It worked, but a more efficient way would be much appreciated.
This algorithm works (kind of a state machine):
If the line is blank, start a new row.
Otherwise, add to the current row, collecting all headers and fields.
def parselines(lines):
    header = []
    csvrows = [{}]
    for line in lines:
        line = line.strip()
        if not line:
            csvrows.append({})  # new row, in dict form
        else:
            field, data = line.split(":", 1)
            csvrows[-1][field] = data
            if field not in header:
                header.append(field)
    # format CSV
    print(",".join(header))
    for row in csvrows:
        print(",".join(row.get(h, "") for h in header))

In a comma delimited String, keep all but second part

I have a bunch of addresses:
123 Main Street, PO Box 345, Chicago, IL 92921
1992 Super Way, Bakersfield, CA
234 Wonderland Lane, Attn: Daffy Duck, Orlando, FL 09922
How could I cut out the second string in there, when I do myStr.split(',') on each?
The idea is that I want to return:
123 Main Street, Chicago, IL 92921
1992 Super Way, CA
234 Wonderland Lane, Orlando, FL 09922
I could loop through each part, and build yet another string, skipping the second index, but was wondering if there's a better way to do so.
What I have now:
def filter_address(address):
    print("Filtering address on", address)
    updated_addr = ""
    indx = 0
    for section in address.split(","):
        if indx != 1:
            updated_addr = updated_addr + "," + section
        indx += 1
    updated_addr = updated_addr[1:]  # This is to remove the leading `,`
    return updated_addr

new_address = filter_address("123 Main Street, Chicago, IL 92921")
new_address = filter_address("123 Main Street, Chicago, IL 92921")
You could use del in Python and glue the components of the string back together with ", " after splitting them.
For example:
address = "123 Main Street, PO Box 345, Chicago, IL 92921".split(",")
del address[1]
pretty_address = ", ".join(address)
print(pretty_address) # Gives 123 Main Street, Chicago, IL 92921
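An equivalent slicing-based alternative, offered only as a sketch (not part of the original answer), which skips the second element without mutating the list:

parts = "123 Main Street, PO Box 345, Chicago, IL 92921".split(", ")
print(", ".join(parts[:1] + parts[2:]))  # 123 Main Street, Chicago, IL 92921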

Dictionary text file Python

text
Donald Trump:
791697302519947264,1477604720,Ohio USA,Twitter for iPhone,5251,1895
Join me live in Springfield, Ohio!
Lit
<<<EOT
781619038699094016,1475201875,United States,Twitter for iPhone,31968,17246
While Hillary profits off the rigged system, I am fighting for you! Remember the simple phrase: #FollowTheMoney...
<<<EOT
def read(text):
    with open(text, 'r') as f:
        for line in f:
Is there a way that I can separate the information for each candidate? For example, for Donald Trump it should be:
[
[Donald Trump],
[[791697302519947264, 1477604720, 'Ohio USA', 'Twitter for iPhone', 5251, 1895], ['Join me live in Springfield, Ohio! Lit']],
[[781619038699094016, 1475201875, 'United States', 'Twitter for iPhone', 31968, 17246], ['While Hillary profits off the rigged system, I am fighting for you! Remember the simple phrase: #FollowTheMoney...']]
]
The format of the file is the following:
ID,DATE,LOCATION,SOURCE,FAVORITE_COUNT,RETWEET_COUNT text(the tweet)
So basically, after the 6 headings everything that follows is the tweet, up until '<<<EOT'.
Also, is there a way I can do this for every candidate in the file?
I'm not sure why you need a multi-dimensional list (I would pick tuples and dictionaries if possible) but this seems to produce the output you asked for:
>>> txt = """Donald Trump:
... 791697302519947264,1477604720,Ohio USA,Twitter for iPhone,5251,1895
... Join me live in Springfield, Ohio!
... Lit
... <<<EOT
... 781619038699094016,1475201875,United States,Twitter for iPhone,31968,17246
... While Hillary profits off the rigged system, I am fighting for you! Remember the simple phrase: #FollowTheMoney...
... <<<EOT
... Another Candidate Name:
... 12312321,123123213,New York USA, Twitter for iPhone,123,123
... This is the tweet text!
... <<<EOT"""
>>>
>>>
>>> buffer = []
>>> tweets = []
>>>
>>> for line in txt.split("\n"):
...     if not line.startswith("<<<EOT"):
...         buffer.append(line)
...     else:
...         if buffer[0].strip().endswith(":"):
...             tweets.append([buffer.pop(0).rstrip().replace(":", "")])
...         metadata = buffer.pop(0).split(",")
...         tweet = [" ".join(line for line in buffer).replace("\n", " ")]
...         tweets.append([metadata, tweet])
...         buffer = []
...
>>>
>>> from pprint import pprint
>>>
>>> pprint(tweets)
[['Donald Trump'],
[['791697302519947264',
'1477604720',
'Ohio USA',
'Twitter for iPhone',
'5251',
'1895'],
['Join me live in Springfield, Ohio! Lit']],
[['781619038699094016',
'1475201875',
'United States',
'Twitter for iPhone',
'31968',
'17246'],
['While Hillary profits off the rigged system, I am fighting for you! Remember the simple phrase: #FollowTheMoney... ']],
['Another Candidate Name'],
[['12312321',
'123123213',
'New York USA',
' Twitter for iPhone',
'123',
'123'],
['This is the tweet text!']]]
>>>
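Since the answer already suggests dictionaries, here is a hedged sketch of the same parsing loop that groups tweets per candidate in a dict instead of a flat list (it reuses the txt string defined above and assumes tweet lines never end with ':'):

candidates = {}
current = None
buffer = []
for line in txt.split("\n"):
    if line.strip().endswith(":"):       # candidate header line, e.g. "Donald Trump:"
        current = line.strip().rstrip(":")
        candidates[current] = []
    elif line.startswith("<<<EOT"):      # end of one tweet block
        metadata = buffer.pop(0).split(",")
        candidates[current].append((metadata, " ".join(buffer)))
        buffer = []
    else:
        buffer.append(line)

# candidates["Donald Trump"] -> list of (metadata_fields, tweet_text) tuples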
I am not quite understanding... but here is my example of reading a file line by line, then adding that line to a string of text to post to Twitter.
candidates = open("FILEPATH WITH DOUBLE BACKSLASHES")  # example: "C:\\users\\fox\\desktop\\candidates.txt"
for candidate in candidates:
    candidate = candidate.rstrip('\n')  # removes the newline (this is mandatory)
    # the next line (post) means post to twitter
    post("propaganda here " + candidate + " more propaganda")
Note that for every line in that file this code will post to Twitter, e.g. 20 lines means twenty Twitter posts.
