How to capture words spread across multiple lines separated by any whitespace (newline, space, tab) - python-3.x

import re
c = """
class_monitor std4:
Name: xyz
Roll number: 123
Age: 9
Badge: Blue
class_monitor std5:
Name: abc
Roll number: 456
Age: 10
Badge: Red
"""
I want to print Name, Roll number and age for std4 and Name, roll number and badge for std5.
pat = (class_monitor)(.*4:)(\n|\s|\t)*(Name:)(.*)(\s|\n|\t)*(Roll number:)(.*)(\s|\n|\t)*(Age:)(.*)(\s|\n|\t)*(Badge:)(.*)
It matches the respective std if I toggle the second group (.*4:) to (.*5:) in pythex.
However, in script mode it does not work. Am I missing something here?
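The question does not show the script itself, so the cause of the failure is not known. As a minimal sketch of one way to run such a pattern in a script, the example below assumes `re.search` with raw-string pieces and named groups; `monitor_fields` is a hypothetical helper name, not something from the original post.

```python
import re

c = """
class_monitor std4:
Name: xyz
Roll number: 123
Age: 9
Badge: Blue
class_monitor std5:
Name: abc
Roll number: 456
Age: 10
Badge: Red
"""

def monitor_fields(text, std):
    """Return the four fields of one class_monitor block as a dict, or None."""
    pattern = re.compile(
        r"class_monitor\s+" + re.escape(std) + r":\s*"
        r"Name:\s*(?P<name>.*)\s*"         # .* stops at the end of the line
        r"Roll number:\s*(?P<roll>.*)\s*"  # \s* covers newline, space and tab
        r"Age:\s*(?P<age>.*)\s*"
        r"Badge:\s*(?P<badge>.*)"
    )
    match = pattern.search(text)
    return match.groupdict() if match else None

std4 = monitor_fields(c, "std4")
std5 = monitor_fields(c, "std5")
print(std4["name"], std4["roll"], std4["age"])    # Name, Roll number, Age
print(std5["name"], std5["roll"], std5["badge"])  # Name, Roll number, Badge
```

Since `\s` already matches newlines and tabs, the `(\n|\s|\t)*` groups of the original pattern can be reduced to `\s*`.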

Related

Best way to handle element of dict that has multiple key/value pairs inside it

[{'id': 2, 'Registered Address': 'Line 1: 1 Any Street Line 2: Any locale City: Any City Region / State: Any Region Postcode / Zip code: BA2 2SA Country: GB Jurisdiction: Any Jurisdiction'}]
I have the above read into a dataframe, and that is the output so far. I need to break out the individual elements. Because of place names and the like, the values may or may not contain spaces. Looking at the above, my keys are Line 1, Line 2, City, Region / State, Postcode / Zip code, Country, Jurisdiction.
The required output for the "Registered Address" key is these keys and values:
"Line 1": "1 Any Street"
"Line 2": "Any locale"
"City": "Any City"
"Region / State": "Any Region"
"Postcode / Zip code": "BA2 2SA"
"Country": "GB"
"Jurisdiction": "Any Jurisdiction"
I'm just struggling to find a way to get to the end result. I have tried to pop the value out and use urllib.parse but fell short. Is anyone able to point me in the best direction, please?
I tried to write code that generalizes your question, but there are some limitations because of your data format. Anyway, I would do this:
def address_spliter(my_data, my_keys):
    # Split the flat address string into a dict keyed by the labels in my_keys.
    address_data = my_data[0]['Registered Address']
    key_address = {}
    for i, k in enumerate(my_keys):
        # Take everything after the current label...
        value = address_data.split(k, 1)[1]
        if i + 1 < len(my_keys):
            # ...up to the next label; the last label runs to the end of the string.
            value = value.split(my_keys[i + 1], 1)[0]
        key_address[k] = value.strip()
    return key_address
You can call this function with data like this:
my_data = [{'id': 2, 'Registered Address': 'Line 1: 1 Any Street Line 2: Any locale City: Any City Region / State: Any Region Postcode / Zip code: BA2 2SA Country: GB Jurisdiction: Any Jurisdiction'}]
and
my_keys = ['Line 1:','Line 2:','City:', 'Region / State:', 'Postcode / Zip code:', 'Country:', 'Jurisdiction:']
As you can see, it only works if the order of the keys does not change. Still, you can build on this idea and adapt it to your problem if it doesn't behave as expected.
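If the order of the keys may vary, one possible alternative is to split on the labels themselves with a regular expression. The sketch below is an illustration, not the answer's method; `split_address` is a hypothetical name, and it assumes every label in the string is followed by a colon and never appears inside a value followed by a colon.

```python
import re

def split_address(address, keys):
    # Try longer labels first so a label that is a prefix of another cannot shadow it.
    labels = sorted(keys, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(k) for k in labels) + "):")
    # With one capturing group, re.split returns ['', label1, value1, label2, value2, ...]
    parts = pattern.split(address)
    return {label: value.strip() for label, value in zip(parts[1::2], parts[2::2])}

address = ('Line 1: 1 Any Street Line 2: Any locale City: Any City '
           'Region / State: Any Region Postcode / Zip code: BA2 2SA '
           'Country: GB Jurisdiction: Any Jurisdiction')
keys = ['Line 1', 'Line 2', 'City', 'Region / State',
        'Postcode / Zip code', 'Country', 'Jurisdiction']

for key, value in split_address(address, keys).items():
    print(f'"{key}": "{value}"')
```

Here the keys are passed without the trailing colon, so the resulting dictionary uses the labels exactly as listed in the question.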

After a separator there is a key in the loop. How to keep it?

---------------------------
CompanyID: 000000000000
Pizza: 2 3.15 6.30
spaghetti: 1 7 7
ribye: 2 40 80
---------------------------
CompanyID: 000000000001
burger: 1 3.15 6.30
spaghetti: 1 7 7
ribye: 2 40 80
--------------------------
I am looping over a list of lines; each line is one item of the list. I need to keep the CompanyID while searching for a product entered by the user.
The code below sets x=True when the product is found, but I can't capture the company ID to print it.
a='-'
for line in lines:
    if a in line:
        companyID= next(line)
    if product in line:
        x=True
TypeError: 'str' object is not an iterator
You can use your line separator to identify where new data starts. Once you see a line with "----", you can start collecting info in a new dictionary. For each line, take its key and value by splitting on ":" and create the entry in the dictionary.
When you see the next "----" line, you know that is the end of the data for this company, so check whether it has the product and, if so, print the company ID from the dictionary.
line_seperator_char = '-'
company_data = {}
product = 'burger'
with open('data.dat') as lines:
    for line in lines:
        line = line.rstrip()
        if line.startswith(line_seperator_char):
            # A separator closes the previous company's block: check it, then reset.
            if product in company_data:
                print(f'{company_data["CompanyID"]} contains the product {product}')
            company_data = {}
        else:
            # Split "key: value" once and drop the space after the colon.
            key, value = line.split(':', 1)
            company_data[key] = value.strip()
OUTPUT
000000000001 contains the product burger
No, it doesn't run. Could you explain what "[1]" means in split()[1]?
Another attempt that doesn't run is:
y=[]
y=lines[1].split(' ')
for line in lines:
    y=line.split(' ')
    if len(y[1])==10:
        companyID=y[1]
    if product in line:
        x=True
Thanks for the answers. What finally worked in my case was this:
y=[]
y=lines[1].split(' ')
a='-'
for line in lines:
    if line.startswith("CompanyID:"):
        y=line.split(' ')
        companyID=y[1]
    if product in line:
        x=True
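The idea of remembering the most recent CompanyID while scanning can be sketched as a small function. This is an illustration under assumptions: `companies_with_product` is a hypothetical name, and the lines are assumed to follow the "label: values" layout shown in the question. It also shows what `[1]` does: `split` returns a list, and `[1]` selects its second item.

```python
def companies_with_product(lines, product):
    company_id = None  # most recent CompanyID seen so far
    found = []
    for line in lines:
        line = line.strip()
        if line.startswith("CompanyID:"):
            # split(":", 1) returns [label, rest]; [1] picks the rest.
            company_id = line.split(":", 1)[1].strip()
        elif ":" in line:
            # [0] picks the label, i.e. the product name before the colon.
            name = line.split(":", 1)[0].strip()
            if name == product and company_id is not None:
                found.append(company_id)
    return found

lines = [
    "---------------------------",
    "CompanyID: 000000000000",
    "Pizza: 2 3.15 6.30",
    "spaghetti: 1 7 7",
    "ribye: 2 40 80",
    "---------------------------",
    "CompanyID: 000000000001",
    "burger: 1 3.15 6.30",
    "spaghetti: 1 7 7",
    "ribye: 2 40 80",
    "--------------------------",
]
print(companies_with_product(lines, "burger"))
```

Comparing the product against the label, rather than testing `product in line`, avoids accidental matches inside other words.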

Python get first and last value from string using dictionary key values

I have some very strange data. I have a dictionary of keys and values, and I want to use it to check whether these keywords appear ONLY at the start and/or end of the text, not in the middle of a sentence. I created the simple data frame below to show the problem, along with the Python code I have tried so far. How do I get it to search only the start or end of the sentence? The current code matches substrings anywhere in the text.
Code:
import pandas as pd

d = {'apple corp':'Company','app':'Application'} #dictionary
l1 = [1, 2, 3,4]
l2 = [
    "The word Apple is commonly confused with Apple Corp which is a business",
    "Apple Corp is a business they make computers",
    "Apple Corp also writes App",
    "The Apple Corp also writes App"
]
df = pd.DataFrame({'id':l1,'text':l2})
df['text'] = df['text'].str.lower()
df
Original Dataframe:
id text
1 The word Apple is commonly confused with Apple Corp which is a business
2 Apple Corp is a business they make computers
3 Apple Corp also writes App
4 The Apple Corp also writes App
Code Tried out:
def matcher(k):
    x = (i for i in d if i in k)
    # i.startswith(k) getting error
    return ';'.join(map(d.get, x))
df['text_value'] = df['text'].map(matcher)
df
Error:
TypeError: 'in <string>' requires string as left operand, not bool
when I use this: x = (i for i in d if i.startswith(k) in k)
I get empty values if I try this: x = (i for i in d if i.startswith(k) == True in k)
TypeError: sequence item 0: expected str instance, NoneType found
when I use this: x = (i.startswith(k) for i in d if i in k)
Results from Code above ... Create new field 'text_value':
id text text_value
1 The word Apple is commonly confused with Apple Corp which is a business Company;Application
2 Apple Corp is a business they make computers Company;Application
3 Apple Corp also writes App Company;Application
4 The Apple Corp also writes App Company;Application
I am trying to get a final output like this:
id text text_value
1 The word Apple is commonly confused with Apple Corp which is a business NaN
2 Apple Corp is a business they make computers Company
3 Apple Corp also writes App Company;Application
4 The Apple Corp also writes App Application
You need a matcher function that accepts a flag, and then call it twice to get the results for startswith and endswith.
def matcher(s, flag="start"):
    if flag=="start":
        for i in d:
            if s.startswith(i):
                return d[i]
    else:
        for i in d:
            if s.endswith(i):
                return d[i]
    return None
df['st'] = df['text'].apply(matcher)
df['ed'] = df['text'].apply(matcher, flag="end")
df['text_value'] = df[['st', 'ed']].apply(lambda x: ';'.join(x.dropna()), axis=1)
df = df[['id','text', 'text_value']]
The text_value column looks like:
0
1 Company
2 Company;Application
3 Application
Name: text_value, dtype: object
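Alternatively, a single regular expression can do both checks:
joined = "|".join(d.keys())
pat = '(?i)^(?:the\\s*)?(' + joined + ')\\b.*?|.*\\b(' + joined + ')$'+'|.*'
get = lambda x: d.get(x.group(1),"") + (';' +d.get(x.group(2),"") if x.group(2) else '')
# regex=True is needed for a callable replacement (the default is regex=False since pandas 2.0)
df.text.str.replace(pat, get, regex=True)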
0
1 Company
2 Company;Application
3 Company;Application
Name: text, dtype: object
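The flag-based matcher above returns the first dictionary key that matches, so its result depends on the order of the keys ('app' is also a prefix of 'apple corp'). A minimal sketch of the same start/end check in plain Python, trying longer keys first, might look like this; `edge_labels` is a hypothetical name and the function is an illustration rather than either answer's code.

```python
def edge_labels(text, mapping):
    text = text.lower()
    # Longest keys first, so 'apple corp' wins over its prefix 'app'.
    keys = sorted(mapping, key=len, reverse=True)
    start = next((mapping[k] for k in keys if text.startswith(k)), None)
    end = next((mapping[k] for k in keys if text.endswith(k)), None)
    # Join whichever of the two labels were found.
    return ";".join(label for label in (start, end) if label)

d = {'apple corp': 'Company', 'app': 'Application'}
texts = [
    "The word Apple is commonly confused with Apple Corp which is a business",
    "Apple Corp is a business they make computers",
    "Apple Corp also writes App",
    "The Apple Corp also writes App",
]
for t in texts:
    print(repr(edge_labels(t, d)))
```

With pandas, such a function could be applied per row, for example with `df['text'].apply(edge_labels, mapping=d)`.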

Text file to CSV conversion

I have a text file whose content looks like this:
Name: Aar saa
Last Name: sh
DOB: 1997-03-22
Phone: 1212222
Graduation: B.Tech
Specialization: CSE
Graduation Pass Out: 2019
Graduation Percentage: 60
Higher Secondary Percentage: 65
Higher Secondary School Name: Guru Nanak Dev University,amritsar
City: hyd
Venue Details: CMR College of Engineering & Technology (CMRCET) Medchal Road, TS - 501401
Name: bfdg df
Last Name: df
DOB: 2005-12-16
Phone: 2222222
Graduation: B.Tech
Specialization: EEE
Graduation Pass Out: 2018
Graduation Percentage: 45
Higher Secondary Percentage: 45
Higher Secondary School Name: asddasd
City: vjd
Venue Details: Prasad V. Potluri Siddhartha Institute Of Technology, Kanuru, AP - 520007
Name: cc dd ee
Last Name: ee
DOB: 1995-07-28
Phone: 444444444
Graduation: B.Tech
Specialization: ECE
Graduation Pass Out: 2019
Graduation Percentage: 75
Higher Secondary Percentage: 93
Higher Secondary School Name: Sasi institute of technology and engineering
City: hyd
Venue Details: CMR College of Engineering & Technology (CMRCET) Medchal Road, TS - 501401
I want to convert it to a CSV file with these headers:
['Name', 'Last Name','DOB', 'Phone', 'Graduation','Specialization','Graduation Pass Out','Higher Secondary School Name','City','Venue Details']
and, as values, everything after the ':' on each line.
I have done something like this:
import csv

def parselines(lines, writer):
    # One record per block: keep only the text after the first ": " on each line.
    data = []
    for line in lines.split('\n'):
        if ": " in line:
            data.append(line.split(": ", 1)[1])
    writer.writerow(data)

with open('result.csv', 'a', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['Name', 'Last Name','DOB', 'Phone', 'Graduation','Specialization','Graduation Pass Out','Graduation Percentage','Higher Secondary Percentage','Higher Secondary School Name','City','Venue Details'])
    with open('Name2.txt') as f:
        text = f.read()
    # Records are separated by a blank line.
    for text1 in text.split("\n\n"):
        parselines(text1, writer)
It worked, but a more efficient way would be much appreciated.
This algorithm works (kind of a state machine):
If the line is blank, start a new row.
Otherwise, add to the current row and collect all headers and fields.
def parselines(lines):
    header = []
    csvrows = [{}]
    for line in lines:
        line = line.strip()
        if not line:
            csvrows.append({})  # new row, in dict form
        else:
            field, data = line.split(":", 1)
            csvrows[-1][field] = data.strip()  # drop the space after the colon
            if field not in header:
                header.append(field)
    # format CSV
    print(",".join(header))
    for row in csvrows:
        print(",".join(row.get(h, "") for h in header))
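Joining fields with "," by hand does not quote values that themselves contain commas, such as "Guru Nanak Dev University,amritsar". As a hedged sketch of the same state machine, the version below hands the output to `csv.DictWriter`, which quotes such values; `records_to_csv` is a hypothetical name, and `out` can be any writable text file object.

```python
import csv
import io

def records_to_csv(lines, out):
    rows, header = [{}], []
    for line in lines:
        line = line.strip()
        if not line:
            if rows[-1]:          # a blank line closes the current record
                rows.append({})
            continue
        field, _, value = line.partition(":")
        field = field.strip()
        rows[-1][field] = value.strip()
        if field not in header:   # collect headers in first-seen order
            header.append(field)
    writer = csv.DictWriter(out, fieldnames=header, restval="")
    writer.writeheader()
    writer.writerows(row for row in rows if row)

sample = [
    "Name: Aar saa",
    "Higher Secondary School Name: Guru Nanak Dev University,amritsar",
    "City: hyd",
    "",
    "Name: bfdg df",
    "City: vjd",
]
buf = io.StringIO()
records_to_csv(sample, buf)
print(buf.getvalue())
```

To write a real file, open it with `newline=''` (as the `csv` module documentation recommends) and pass the file object as `out`.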

edit two parts of text document python

Similar to a question I asked recently:
I have a text file containing some data, and I search it using this piece of code:
def Add_score():
    with open("users.txt") as myFile:
        for num, line in enumerate(myFile, 1):
            if name in line:
                line_found = num
                break
It finds the line that has a specific name. The line would look like this.
Name: whatever Username: whatever password: whatever score: 25 goes: 3
I need to be able to add a number to score as well as to goes:
change 3 to 4 and change 25 to 26.
Here you are:
line = 'Name: Username: password: whatever score: 25 goes: 3'
print(line)
lineSplitted = line.split()
print(lineSplitted)
updatedLine = " ".join(lineSplitted[0:5] + [str(int(lineSplitted[5])+1)] + [lineSplitted[6]] + [str(int(lineSplitted[7])+1)])
print(updatedLine)
prints:
Name: Username: password: whatever score: 25 goes: 3
['Name:', 'Username:', 'password:', 'whatever', 'score:', '25', 'goes:', '3']
Name: Username: password: whatever score: 26 goes: 4
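The index-based version above depends on the exact number of words in the line. A sketch that does not depend on word positions, assuming the counters always appear as "score: N" and "goes: N", could use `re.sub` with a replacement function; `bump_counters` is a hypothetical name, and writing the updated line back to users.txt is not covered here.

```python
import re

def bump_counters(line, fields=("score", "goes")):
    names = "|".join(re.escape(f) for f in fields)
    pattern = re.compile(r"\b(" + names + r"):\s*(\d+)")
    # Replace each "label: N" with "label: N+1".
    return pattern.sub(lambda m: f"{m.group(1)}: {int(m.group(2)) + 1}", line)

line = "Name: whatever Username: whatever password: whatever score: 25 goes: 3"
print(bump_counters(line))
```

Passing `fields=("score",)` would update only the score and leave goes unchanged.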
