Python get first and last value from string using dictionary key values


I have some unusual data. I have a dictionary of keywords and labels, and I want to check whether each keyword appears ONLY at the start and/or end of a text, not in the middle of a sentence. Below is a simple data frame that shows the problem, along with the Python code I have tried so far. How do I make it match only at the start or end of the sentence? My current code matches substrings anywhere in the text.
Code:
import pandas as pd

d = {'apple corp': 'Company', 'app': 'Application'}  # keyword -> label
l1 = [1, 2, 3, 4]
l2 = [
    "The word Apple is commonly confused with Apple Corp which is a business",
    "Apple Corp is a business they make computers",
    "Apple Corp also writes App",
    "The Apple Corp also writes App"
]
df = pd.DataFrame({'id': l1, 'text': l2})
df['text'] = df['text'].str.lower()
df
Original Dataframe:
id text
1 The word Apple is commonly confused with Apple Corp which is a business
2 Apple Corp is a business they make computers
3 Apple Corp also writes App
4 The Apple Corp also writes App
Code Tried out:
def matcher(k):
    x = (i for i in d if i in k)
    # i.startswith(k) getting error
    return ';'.join(map(d.get, x))

df['text_value'] = df['text'].map(matcher)
df
Error:
TypeError: 'in <string>' requires string as left operand, not bool
when I use x = (i for i in d if i.startswith(k) in k)
Empty values when I try x = (i for i in d if i.startswith(k) == True in k)
TypeError: sequence item 0: expected str instance, NoneType found
when I use x = (i.startswith(k) for i in d if i in k)
Results from Code above ... Create new field 'text_value':
id text text_value
1 The word Apple is commonly confused with Apple Corp which is a business Company;Application
2 Apple Corp is a business they make computers Company;Application
3 Apple Corp also writes App Company;Application
4 The Apple Corp also writes App Company;Application
Desired final output:
id text text_value
1 The word Apple is commonly confused with Apple Corp which is a business NaN
2 Apple Corp is a business they make computers Company
3 Apple Corp also writes App Company;Application
4 The Apple Corp also writes App Application

You need a matcher function that accepts a flag, then call it twice: once for startswith and once for endswith.
def matcher(s, flag="start"):
    # Return the label of the first key that starts (or ends) the string
    if flag == "start":
        for i in d:
            if s.startswith(i):
                return d[i]
    else:
        for i in d:
            if s.endswith(i):
                return d[i]
    return None

df['st'] = df['text'].apply(matcher)
df['ed'] = df['text'].apply(matcher, flag="end")
# Join the start and end labels, skipping missing values
df['text_value'] = df[['st', 'ed']].apply(lambda x: ';'.join(x.dropna()), axis=1)
df = df[['id', 'text', 'text_value']]
The text_value column looks like:
0
1 Company
2 Company;Application
3 Application
Name: text_value, dtype: object
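
Note that "apple corp ..." also starts with "app", so the start match above depends on the order of the dictionary keys. A minimal sketch of a whole-word variant follows, assuming regular-expression word boundaries are acceptable for this data. The function name edge_labels is hypothetical and not part of the answer above.

```python
import re

d = {'apple corp': 'Company', 'app': 'Application'}

def edge_labels(text, mapping):
    """Join the labels of keys that start or end `text` as whole words."""
    text = text.lower()
    labels = []
    for key, label in mapping.items():
        # Key at the very start, followed by a word boundary
        starts = re.match(re.escape(key) + r"\b", text)
        # Key at the very end, preceded by a word boundary
        ends = re.search(r"\b" + re.escape(key) + r"$", text)
        if (starts or ends) and label not in labels:
            labels.append(label)
    # None when nothing matched, so pandas shows a missing value
    return ";".join(labels) or None

print(edge_labels("Apple Corp also writes App", d))      # Company;Application
print(edge_labels("The Apple Corp also writes App", d))  # Application
```

With the data frame from the question, this could be applied as df['text'].map(lambda s: edge_labels(s, d)).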

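Alternatively, a single regular expression can do the matching. Note that this pattern also skips a leading "the", so the last row yields Company;Application:
joined = "|".join(d.keys())
pat = '(?i)^(?:the\\s*)?(' + joined + ')\\b.*?|.*\\b(' + joined + ')$'+'|.*'
get = lambda x: d.get(x.group(1),"") + (';' +d.get(x.group(2),"") if x.group(2) else '')
df.text.str.replace(pat, get, regex=True)  # a callable replacement requires regex=True in pandas 2.x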
0
1 Company
2 Company;Application
3 Company;Application
Name: text, dtype: object

Related

(Neo4j / py2neo) Update relationship after it's been created

I am having trouble updating a relationship property.
My goal is to map a dataset like the following into a Neo4j graph:
PersonName IllnessType
0 A 1
1 A 2
2 A 3
3 B 1
4 B 2
5 B 1
I basically cycle over the lines of this dataset, creating a Node for each Person and each Illness found on the line, and merging to avoid duplicates:
from py2neo import *
graph = Graph()
person_node= Node("Person", **kwargs)
graph.merge(person_node, "Person", "Name")
illness_node = Node("Illness", **kwargs)
graph.merge(illness_node, "Illness", "IllnessType")
edge = Relationship.type("SUFFERS_FROM")
rel = sfEdge(person_node, illness_node)
self.graph.merge(rel)
What I would like to do now is add a weight to the "SUFFERS_FROM" edge that counts how many times a person has suffered from a certain illness. What I tried was:
rm = RelationshipMatcher()
edge_to_increment = rm.match(nodes=(None, patNode), r_type=None).first()
if edge_to_increment is None:
    edge_to_increment = edge(person_node, illness_node)
    edge_to_increment["COUNT"]=1
    self.graph.merge(edge_to_increment)
else:
    edge_to_increment["COUNT"] += 1
    c = e2r["COUNT"]
But when I visualize the result, all edges have weight 1, even though the edge B-->1 should have weight 2.
Thanks in advance
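
One way to sidestep updating an existing relationship, sketched here as an assumption and not as py2neo's prescribed approach, is to count each (person, illness) pair before writing to the graph, so that every edge is created once with its final weight. The graph calls are left as a comment because they depend on the setup shown in the question.

```python
from collections import Counter

# Rows of the example dataset as (PersonName, IllnessType) pairs.
rows = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2), ("B", 1)]

# Count how often each pair occurs; each distinct pair becomes one edge.
weights = Counter(rows)

for (person, illness), count in weights.items():
    # Here one would create the SUFFERS_FROM relationship once and set
    # its COUNT property to `count` before merging it into the graph.
    print(f"{person} -[SUFFERS_FROM COUNT={count}]-> {illness}")
```

With this ordering the pair (B, 1) is seen twice and ends up with a weight of 2.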

running for loop until arbitrary index (python 3.x)

So I have these strings that I split by spaces (' ') and I just rolled them into a single list I called 'keyLabelRun'
so it looks like this:
keyLabelRun[0-12]:
0 OS=Dengue
1 virus
2 3
3 PE=4
4 SV=1
5 Split=0
6
7 OS=Bacillus
8 subtilis
9 XF-1
10 GN=opuBA
11 PE=4
12 SV=1
I only want the elements that contain "OS=" and the elements that follow them. Anything else, such as "SV=" or "PE=", I want to skip until I reach the next "OS=".
The number of elements to the next "OS=" is arbitrary so that's where I'm having the problem.
This is what I'm currently trying:
OSarr = []
for i in range(len(keyLabelrun)):
    if keyLabelrun[i].count('OS='):
        OSarr.append(keyLabelrun[i])
        if keyLabelrun[i+1].count('=') != 1:
            continue
I think the elements that do not contain "OS=" are what is tripping me up.
At the end I am going to join them all back together into their own elements, but I feel I can handle that after this.
In my attempt, I am trying to append all the elements I'm looking for, in order, to a new list 'OSarr'.
If anyone can lend a hand, it would be much appreciated.
Thank you.
This list of strings came from a dataset that is a text file of the form:
>tr|W0FSK4|W0FSK4_9FLAV Genome polyprotein (Fragment) OS=Dengue virus 3 PE=4 SV=1 Split=0
MNNQRKKTGKPSINMLKRVRNRVSTGSQLAKRFSKGLLNGQGPMKLVMAFIAFLRFLAIPPTAGVLARWGTFKKSGAIKVLKGFKKEISNMLSIINKRKKTSLCLMMILPAALAFHLTSRDGEPRMIVGKNERGKSLLFKTASGINMCTLIAMDLGEMCDDTVTYKCPHITEVEPEDIDCWCNLTSTWVTYGTCNQAGEHRRDKRSVALAPHVGMGLDTRTQTWMSAEGAWRQVEKVETWALRHPGFTILALFLAHYIGTSLTQKVVIFILLMLVTPSMTMRCVGVGNRDFVEGLSGATWVDVVLEHGGCVTTMAKNKPTLDIELQKTEATQLATLRKLCIEGKITNITTDSRCPTQGEATLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQCLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGLECSPRTGLDFNEMILLTMKNKAWMVHRQWFFDLPLPWTSGATTETPTWNRKELLVTFKNAHAKKQEVVVLGSQEGAMHTALTGATEIQNSGGTSIFAGHLKCRLKMDKLELKGMSYAMCTNTFVLKKEVSETQHGTILIKVEYKGEDVPCKIPFSTEDGQGKAHNGRLITANPVVTKKEEPVNIEAEPPFGESNIVIGIGDNALKINWYKKGSSIGKMFEATARGARRMAILGDTAWDFGSVGGVLNSLGKMVHQIFGSAYTALFSGVSWVMKIGIGVLLTWIGLNSKNTSMSFSCIAIGIITLYLGAVVQADMGCVINWKGKELKCGSGIFVTNEVHTWTEQYKFQADSPKRLATAIAGAWENGVCGIRSTTRMENLLWKQIANELNYILWENNIKLTVVVGDIIGVLEQGKRTLTPQPMELKYSWKTWGKAKIVTAETQNSSFIIDGPNTPECPSVSRAWNVWEVEDYGFGVFTTNIWLKLREVYTQLCDHRLMSAAVKDERAVHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTLWSNGVLESDMIIPKSLAGPISQHNHRPGYHTQTAGPWHLGKLELDFNYCEGTTVVITENCGTRGPSLRTTTVSGKLIHEWCCRSCTLPPLRYMGEDGCWYGMEIRPISEKEENMVKSLVSAGSGKVDNFTMGVLCLAILFEEVMRGKFGKKHMIAGVFFTFVLLLSGQITWRDMAHTLIMIGSNASDRMGMGVTYLALIATFKIQPFLALGFFLRKLTSRENLLLGVGLAMATTLQLPEDIEQMANGIALGLMALKLITQFETYQLWTALISLTCSNTIFTLTVAWRTATLILAGVSLLPVCQSSSMRKTDWLPMAVAAMGVPPLPLFIFGLKDTLKRRSWPLNEGVMAVGLVSILASSLLRNDVPMAGPLVAGGLLIACYVITGTSADLTVEKAADITWEEEAEQTGVSHNLMITVDDDGTMRIKDDETENILTVLLKTALLIVSGIFPYSIPATLLVWHTWQKQTQRSGVLWDVPSPPETQKAELEEGVYRIKQQGIFGKTQVGVGVQKEGVFHTMWHVTRGAVLTYNGKRLEPNWASVKKDLISYGGGWRLSAQWQKGEEVQVIAVEPGKNPKNFQTMPGTFQTTTGEIGAIALDFKPGTSGSPIINREGKVVGLYGNGVVTKNGGYVSGIAQTNAEPDGPTPELEEEMFKKRNLTIMDLHPGSGKTRKYLPAIVREAIKRRLRTLILAPTRVVAAEMEEALKGLPIRYQTTATKSEHTGREIVDLMCHATFTMRLLSPVRVPNYNLIIMDEAHFTDPASIAARGYISTRVGMGEAAAIFMTATPPGTADAFPQSNAPIQDEERDIPERSWNSGNEWITDFAGKTVWFVPSIKAGNDIANCLRKNGKKVIQLSRKTFDTEYQKTKLNDWDFVV
>tr|M4KW32|M4KW32_BACIU Choline ABC transporter (ATP-binding protein) OS=Bacillus subtilis XF-1 GN=opuBA PE=4 SV=1 Split=0
MLTLENVSKTYKGGKKAVNNVNLKIAKGEFICFIGPSGCGKTTTMKMINRLIEPSAGKIFIDGENIMDQDPVELRRKIGYVIQQIGLFPHMTIQQNISLVPKLLKWPEQQRKERARELLKLVDMGPEYVDRYPHELSGGQQQRIGVLRALAAEPPLILMDEPFGALDPITRDSLQEEFKKLQKTLHKTIVFVTHDMDEAIKLADRIVILKAGEIVQVGTPDDILRNPADEFVEEFIGKERLIQSSSPDVERVDQIMNTQPVTITADKTLSEAIQLMRQERVDSLLVVDDEHVLQGYVDVEIIDQCRKKANLIGEVLHEDIYTVLGGTLLRDTVRKILKRGVKYVPVVDEDRRLIGIVTRASLVDIVYDSLWGEEKQLAALS
>sp|Q8AWH3|SX17A_XENTR Transcription factor Sox-17-alpha OS=Xenopus tropicalis GN=sox17a PE=2 SV=1 Split=0
MSSPDGGYASDDQNQGKCSVPIMMTGLGQCQWAEPMNSLGEGKLKSDAGSANSRGKAEARIRRPMNAFMVWAKDERKRLAQQNPDLHNAELSKMLGKSWKALTLAEKRPFVEEAERLRVQHMQDHPNYKYRPRRRKQVKRMKRADTGFMHMAEPPESAVLGTDGRMCLESFSLGYHEQTYPHSQLPQGSHYREPQAMAPHYDGYSLPTPESSPLDLAEADPVFFTSPPQDECQMMPYSYNASYTHQQNSGASMLVRQMPQAEQMGQGSPVQGMMGCQSSPQMYYGQMYLPGSARHHQLPQAGQNSPPPEAQQMGRADHIQQVDMLAEVDRTEFEQYLSYVAKSDLGMHYHGQESVVPTADNGPISSVLSDASTAVYYCNYPSA

I got it! :D
OSarr = []
G = 0
for i in range(len(keyLabelrun)):
    OSarr.append(keyLabelrun[G])
    G += 1
    if keyLabelrun[G].count('='):
        while keyLabelrun[G].count('OS=') != 1:
            G+=1
Maybe next time everyone, thank you!

Due to the syntax, you have to keep track of which part (OS, PE, etc) you're currently parsing. Here's a function to extract the species name from the FASTA header:
def extract_species(description):
    species_parts = []
    is_os = False
    for word in description.split():
        if word[:3] == 'OS=':
            is_os = True
            species_parts.append(word[3:])
        elif '=' in word:
            is_os = False
        elif is_os:
            species_parts.append(word)
    return ' '.join(species_parts)
You can call it when processing your input file, e.g.:
from Bio import SeqIO
for record in SeqIO.parse('input.fa', 'fasta'):
    species = extract_species(record.description)
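
The same idea can also be sketched with a regular expression, which avoids tracking state by hand. This is an alternative under the assumption that every field after the species name has the form KEY=value; the names OS_PATTERN and species_from_header are hypothetical.

```python
import re

# Capture everything after "OS=" up to the next "KEY=" field or end of line.
OS_PATTERN = re.compile(r"OS=(.*?)(?=\s+\w+=|$)")

def species_from_header(header):
    """Return the species name from a FASTA header, or None without an OS= field."""
    match = OS_PATTERN.search(header)
    return match.group(1) if match else None

headers = [
    ">tr|W0FSK4|W0FSK4_9FLAV Genome polyprotein (Fragment) OS=Dengue virus 3 PE=4 SV=1 Split=0",
    ">tr|M4KW32|M4KW32_BACIU Choline ABC transporter (ATP-binding protein) OS=Bacillus subtilis XF-1 GN=opuBA PE=4 SV=1 Split=0",
    ">sp|Q8AWH3|SX17A_XENTR Transcription factor Sox-17-alpha OS=Xenopus tropicalis GN=sox17a PE=2 SV=1 Split=0",
]
for header in headers:
    print(species_from_header(header))
# Dengue virus 3
# Bacillus subtilis XF-1
# Xenopus tropicalis
```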

Normalising units/Replace substrings based on lists using Python

I am trying to normalize weight units in a string.
Eg:
1.SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre - SUCO MARACUJA COM GENGIBRE PCS 300 ML
2. OVOS CAIPIRAS ANA MARIA BRAGA 10UN - OVOS CAIPIRAS ANA MARIA BRAGA 10U
3. SUCO MARACUJA MAMAO PCS 300 Gram - SUCO MARACUJA MAMAO PCS 300 G
4. SUCO ABACAXI COM MACA PCS 300Milli litre - SUCO ABACAXI COM MACA PCS 300ML
The keyword table is:
unit = ['Kilo','Kilogram','Gram','Milligram','Millilitre','Milli litre','Dozen','Litre','Un','Und','Unid','Unidad','Unidade','Unidades']
norm_unit = ['KG','KG','G','MG','ML','ML','DZ','L','U','U','U','U','U','U']
I tried to take up these lists as a table but am having difficulty in comparing two dataframes or tables in python.
I tried the below code.
import re

unit = ['Kilo','Kilogram','Gram','Milligram','Millilitre','Milli litre','Dozen','Litre','Un','Und','Unid','Unidad','Unidade','Unidades']
norm_unit = ['KG','KG','G','MG','ML','ML','DZ','L','U','U','U','U','U','U']
z='SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre'
#for row in mongo_docs:
#z = row['clean_hntproductname']
for x in unit:
    for y in norm_unit:
        if (re.search(r'\s'+x+r'$',z,re.I)):
            pass  # the update below is still commented out
            # clean_hntproductname = t.lower().replace(x.lower(),y.lower())
            # myquery3 = { "_id" : row['_id']}
            # newvalues3 = { "$set": {"clean_hntproductname" : 'clean_hntproductname'} }
            # ds_hnt_prod_data.update_one(myquery3, newvalues3)
I'm using Python(Jupyter) with MongoDb(Compass). Fetching data from Mongo and writing back to it.

From my understanding, you want to update every row whose text contains one of the words in the unit array, replacing that word with its counterpart in norm_unit.
(Disclaimer: I'm not familiar with MongoDB or Python.)
What you want is to create a mapping (using a hash) of the words you want to change.
Here's a trivial solution (i.e. not the best solution, but it should point you in the right direction):
unit_conversions = {
    'Kilo': 'KG',
    'Kilogram': 'KG',
    'Gram': 'G'
}
# pseudo-code
for each row that you want to update
    item_description = get the value of the string in the column
    for each key in unit_conversions (e.g. 'Kilo')
        see if the item_description contains the key
        if it does, replace it with unit_conversions[key] (e.g. 'KG')
    update the row
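
The mapping idea above can be sketched in runnable Python as follows. This is a minimal sketch, assuming a unit only counts when it directly follows a number (with or without a space), as in the examples from the question. The names unit_map, pattern and normalize_units are hypothetical, and the MongoDB read and write steps are left out.

```python
import re

unit = ['Kilo', 'Kilogram', 'Gram', 'Milligram', 'Millilitre', 'Milli litre',
        'Dozen', 'Litre', 'Un', 'Und', 'Unid', 'Unidad', 'Unidade', 'Unidades']
norm_unit = ['KG', 'KG', 'G', 'MG', 'ML', 'ML', 'DZ', 'L', 'U', 'U', 'U', 'U', 'U', 'U']

# Map each lower-cased unit name to its normalized form.
unit_map = {u.lower(): n for u, n in zip(unit, norm_unit)}

# Longest names first; a space inside a name matches any run of whitespace.
alternatives = "|".join(
    r"\s+".join(re.escape(part) for part in u.split())
    for u in sorted(unit, key=len, reverse=True)
)
# A unit is replaced only if it follows a digit, optionally after spaces.
pattern = re.compile(r"(?<=\d)(\s*)(" + alternatives + r")\b", re.IGNORECASE)

def normalize_units(text):
    def repl(match):
        # Collapse whitespace so "Milli  litre" still finds "milli litre".
        key = " ".join(match.group(2).lower().split())
        return match.group(1) + unit_map[key]
    return pattern.sub(repl, text)

print(normalize_units("SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre"))
# SUCO MARACUJA COM GENGIBRE PCS 300 ML
print(normalize_units("SUCO ABACAXI COM MACA PCS 300Milli litre"))
# SUCO ABACAXI COM MACA PCS 300ML
```

Requiring a preceding digit keeps short keys such as "Un" from matching inside ordinary words in the product name.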

Create multiple possible email addresses based on names in Python

Given a dataframe as follows:
firstname lastname email_address \
0 Doug Watson douglas.watson#dignityhealth.org
1 Nick Holekamp nick.holekamp#rankenjordan.org
2 Rob Schreiner rob.schriener#wellstar.org
3 Austin Phillips austin.phillips#precmed.com
4 Elise Geiger egeiger#puracap.com
5 Paul Urick purick#diplomatpharmacy.com
6 Michael Obringer michael.obringer#lashgroup.com
7 Craig Heneghan cheneghan#west-ward.com
8 Kathy Hirst kathleen.hirst#sunovion.com
9 Stefan Bluemmers stefan.bluemmers#grunenthal.com
companyname
0 Dignity Health
1 Ranken Jordan Pediatric Bridge Hospital
2 WellStar Health System
3 Precision Medical Products, Inc.
4 puracap.com
5 Diplomat Specialty Pharmacy
6 Lash Group
7 West-Ward Pharmaceuticals
8 Sunovion Pharmaceuticals
9 Grünenthal Group
How could I create possible email addresses using common email patterns such as firstlast#example.com, first.last#example.com, f.last#example.com, lastF#example.com, first_last#example.com, firstL#example.com, etc.?
df['email1'] = df.firstname.str.lower() + '.' + df.lastname.str.lower() + '#' + df.companyname.str.replace(r'\s+', '', regex=True).str.lower() + '.com'
print(df['email1'])
Out:
0 doug.watson#dignityhealth.com
1 nick.holekamp#rankenjordanpediatricbridgehospi... --->problematic
2 rob.schreiner#wellstarhealthsystem.com
3 austin.phillips#precisionmedicalproducts,inc..com --->problematic
4 elise.geiger#puracap.com.com --->problematic
...
9995 terry.hanley#kempersportsmanagement.com
9996 christine.marks#geocomp.com
9997 darryl.rickner#doe.com
9998 lalit.sharma#lovelylifestyle.com
9999 parul.dutt#infibeam.com
Some of them seem quite problematic. Could anyone help me solve this issue? Thanks a lot.
EDITED:
print(df) after applying @Sajith Herath's solution:
Out:
firstname lastname companyname \
0 Nick Holekamp Ranken ...
email
0 nick. ...

You can use a method to create permutations of the username with different separators, and define a maximum length beyond which the domain is abbreviated from the company name, as follows:
import pandas as pd
import random

data = {"firstname": ["Nick"], "lastname": ["Holekamp"],
        "companyname": ["Ranken Jordan Pediatric Bridge Hospital"]}
df = pd.DataFrame(data=data)
max_char = 5
emails = []

def simplify_domain(text):
    # Abbreviate long company names to their capital letters, e.g. "rjpbh"
    if len(text) > max_char:
        text = ''.join([c for c in text if c.isupper()])
        return text.lower()
    # str.replace does not accept a regex, so strip whitespace with split/join
    return ''.join(text.split()).lower()

def username_permutations(first_name, last_name):
    # define separators
    separators = [".", "_", "-"]
    # lower case
    combinations = list(map(lambda x: f"{first_name.lower()}{x}{last_name.lower()}", separators))
    # append a random number to tail
    n = random.randint(1, 100)
    combinations.extend(list(map(lambda x: f"{x}{n}", combinations)))
    return combinations

for index, row in df.iterrows():
    usernames = username_permutations(row["firstname"], row["lastname"])
    email_permutations = list(map(lambda x: f"{x}#{simplify_domain(row['companyname'])}.com", usernames))
    emails.append(','.join(email_permutations))
df["email"] = emails
The final result will be nick.holekamp#rjpbh.com,nick_holekamp#rjpbh.com,nick-holekamp#rjpbh.com,nick.holekamp66#rjpbh.com,nick_holekamp66#rjpbh.com,nick-holekamp66#rjpbh.com (the numeric suffix is random).
You can modify the simplify_domain method to clean the given string further, for example by removing "inc" or ".com".
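
The patterns listed in the question (firstlast, first.last, f.last, lastF, first_last, firstL) can be sketched as a small helper that builds only the part before the domain. The function name username_patterns is hypothetical, and lower-casing every pattern is an assumption.

```python
def username_patterns(first, last):
    """Return common username patterns for a first and last name."""
    first_l, last_l = first.lower(), last.lower()
    return [
        first_l + last_l,           # firstlast
        first_l + "." + last_l,     # first.last
        first_l[0] + "." + last_l,  # f.last
        last_l + first_l[0],        # lastF
        first_l + "_" + last_l,     # first_last
        first_l + last_l[0],        # firstL
    ]

print(username_patterns("Doug", "Watson"))
# ['dougwatson', 'doug.watson', 'd.watson', 'watsond', 'doug_watson', 'dougw']
```

Each username can then be combined with the domain produced by simplify_domain, in the same way as in the loop above.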

Add a space in lambda function

I have two columns in my data frame, ASIN and keywords. I am trying to group by ASIN, and the groupby itself is working fine.
ASIN keywords
0 B07GFGXMZZ mangalagiri dress materials
1 B07GFGXMZZ pure cotton dress materials for women
2 B07GFGXMZZ suit material party wear for women
3 B076BL4CWB dhakai jamdani
4 B076BL4CWB jamdani
Groupby code:
df.groupby('ASIN').apply(lambda x: x.sum())
But how do I add a space between the strings being concatenated? As it stands, the keywords are joined without any separator.
I tried:
df.groupby('ASIN').apply(lambda x:" ".join(x.sum()))
But it didn't work:
ASIN
9801321261 98013212619801321261 cane mat with runnercane ...
B008YLNICE B008YLNICEB008YLNICEB008YLNICEB008YLNICEB008YL...
B00P81OJ26 B00P81OJ26B00P81OJ26B00P81OJ26B00P81OJ26B00P81...
B010SZBHEE B010SZBHEEB010SZBHEEB010SZBHEEB010SZBHEEB010SZ...
B01143KAY2 B01143KAY2B01143KAY2B01143KAY2B01143KAY2B01143...
B0157XMD4A B0157XMD4A elephant painted box
B0157XMRJ6 B0157XMRJ6B0157XMRJ6B0157XMRJ6B0157XMRJ6B0157X...
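
In the attempt above, x.sum() concatenates the strings of every column in the group, ASIN included, before the join runs, which would explain the repeated ASINs and the keywords running together. A minimal sketch of one possible fix, assuming only the keywords column needs to be joined:

```python
import pandas as pd

df = pd.DataFrame({
    "ASIN": ["B07GFGXMZZ", "B07GFGXMZZ", "B07GFGXMZZ", "B076BL4CWB", "B076BL4CWB"],
    "keywords": [
        "mangalagiri dress materials",
        "pure cotton dress materials for women",
        "suit material party wear for women",
        "dhakai jamdani",
        "jamdani",
    ],
})

# Select the keywords column first, then join each group's strings with a space.
joined = df.groupby("ASIN")["keywords"].agg(" ".join)
print(joined["B076BL4CWB"])  # dhakai jamdani jamdani
```

Selecting the column before aggregating keeps the ASIN values out of the joined string.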
