Extracting countries from string - python-3.x

I am trying to go through a column of a data frame in python 3. What I need to do is take from each row the countries that are mentioned and the number of times each country is mentioned.
i.e. if I have this row:
['[Aydemir, Deniz', ' Gunduz, Gokhan', ' Asik, Nejla] Bartin Univ, Fac Forestry, Dept Forest Ind Engn, TR-74100 Bartin, Turkey', ' [Wang, Alice] Lulea Univ Technol, Wood Technol, Skelleftea, Sweden']
it needs to output a list: ['Turkey', 'Sweden']
and if I have this row:
['[Fang, Qun', ' Cui, Hui-Wang] Zhejiang A&F Univ, Sch Engn, Linan 311300, Peoples R China', ' [Du, Guan-Ben] Southwest Forestry Univ, Kunming 650224, Yunnan, Peoples R China']
the output should be: ['China', 'China'].
I have written this code but it is not working as I want it to:
from geotext import GeoText
sentence = df.iloc[0,0]
places = GeoText(sentence)
print(places.countries)
It prints each country only once, and when the country is USA it doesn't recognize the abbreviation. Can you help me figure out what to do?
import pandas as pd

l = [['[Aydemir, Deniz\', \' Gunduz, Gokhan\', \' Asik, Nejla] Bartin Univ, Fac Forestry, Dept Forest Ind Engn, TR-74100 Bartin, Turkey\', \' [Wang, Alice] Lulea Univ Technol, Wood Technol, Skelleftea, Sweden',1990],
['[Fang, Qun\', \' Cui, Hui-Wang] Zhejiang A&F Univ, Sch Engn, Linan 311300, Peoples R China\', \' [Du, Guan-Ben] Southwest Forestry Univ, Kunming 650224, Yunnan, Peoples R China',2005],
['[Blumentritt, Melanie\', \' Gardner, Douglas J.\', \' Shaler, Stephen M.] Univ Maine, Sch Resources, Orono, ME USA\', \' [Cole, Barbara J. W.] Univ Maine, Dept Chem, Orono, ME 04469 USA',2012]]
dataf = pd.DataFrame(l, columns = ['Authors', 'Year'])
I tried this code but I have the same problem: it doesn't give all the countries, only one per row:
import pycountry

def find_country(n):
    for c in pycountry.countries:
        if str(c.name).lower() in n.lower():
            return c.name

country1 = (dataf['Authors']
            .replace(r"\bUSA\b", "United States", regex=True)
            .apply(lambda x: find_country(x)))

USA does not seem to be detected correctly by geotext - it's worth trying to raise an issue with that package. As a workaround here, I replace USA with United States, which is correctly detected.
import geotext

df = (dataf['Authors']
      .replace(r"\bUSA\b", "United States", regex=True)
      .apply(lambda x: geotext.GeoText(x).countries)
      )
I'm not sure what you were doing before, but this will get the list of countries for each of the rows in Authors, including duplicates.
0 [Turkey, Sweden]
1 [China, China]
2 [United States, United States]
Name: Authors, dtype: object
As mentioned in the comment, if you want to have an actual list of lists, just add tolist() to the end.
df.tolist()
[['Turkey', 'Sweden'], ['China', 'China'], ['United States', 'United States']]
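If you also need the number of times each country is mentioned in a row, as the question asks, you can run a Counter over each list. A minimal sketch, assuming df is the series of per-row country lists produced above:

from collections import Counter

counts = df.apply(Counter)
print(counts.tolist())
# e.g. [Counter({'Turkey': 1, 'Sweden': 1}), Counter({'China': 2}), Counter({'United States': 2})]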

Related

How to search for specific text in csv within a Pandas, python

Hello, I want to find the account handle (the text starting with #) in the title column and save it in a new csv. Pandas can do it; I tried to make it work but it didn't.
This is my csv http://www.sharecsv.com/s/c1ed9790f481a8d452049be439f4e3d8/Newnormal.csv
this is my code:
import pandas as pd
data = pd.read_csv("Newnormal.csv")
data.dropna(inplace = True)
sub ='#'
data["Indexes"]= data["title"].str.find(sub)
print(data)
I want results like this:
From, to, title
Xavier5501, KudiiThaufeeq, RT #KudiiThaufeeq: Royal Rape, Royal Harassment, Royal Cocktail Party, Royal Pedo, Royal Bidding, Royal Maalee Bayaan, Royal Slavery..et
Thank you.
reduce records to only those that have a "#" in title
define a new column which is the text between "#" and ":"
this leaves NaN in the to column for some records; I've just filtered these out
df = pd.read_csv("Newnormal.csv")
df = df[df["title"].str.contains("#")==True]
df["to"] = df["title"].str.extract(r".*([#][A-Z,a-z,0-9,_]+[:])")
df = df[["from","to","title"]]
df[~df["to"].isna()].to_csv("ToNewNormal.csv", index=False)
df[~df["to"].isna()]
output
from to title
1 Xavier5501 #KudiiThaufeeq: RT #KudiiThaufeeq: Royal Rape, Royal Harassmen...
2 Suzane24979006 #USAID_NISHTHA: RT #USAID_NISHTHA: Don't step outside your hou...
3 sandeep_sprabhu #USAID_NISHTHA: RT #USAID_NISHTHA: Don't step outside your hou...
4 oliLince #Timothy_Hughes: RT #Timothy_Hughes: How to Get a Salesforce Th...
7 rismadwip #danielepermana: RT #danielepermana: Pak kasus covid per hari s...
... ... ... ...
992 Reptoid_Hunter #sapiofoxy: RT #sapiofoxy: I literally can't believe we ha...
994 KPCResearch #sapiofoxy: RT #sapiofoxy: I literally can't believe we ha...
995 GreySparkUK #VoxSmartGlobal: RT #VoxSmartGlobal: The #newnormal will see mo...
997 Gabboa10 #HuShameem: RT #HuShameem: One of #PGO_MV admin staff test...
999 wanjirunjendu #ntvkenya: RT #ntvkenya: AAK's Mugure Njendu shares insig...

Create multiple possible email addresses based on names in Python

Given a dataframe as follows:
firstname lastname email_address \
0 Doug Watson douglas.watson#dignityhealth.org
1 Nick Holekamp nick.holekamp#rankenjordan.org
2 Rob Schreiner rob.schriener#wellstar.org
3 Austin Phillips austin.phillips#precmed.com
4 Elise Geiger egeiger#puracap.com
5 Paul Urick purick#diplomatpharmacy.com
6 Michael Obringer michael.obringer#lashgroup.com
7 Craig Heneghan cheneghan#west-ward.com
8 Kathy Hirst kathleen.hirst#sunovion.com
9 Stefan Bluemmers stefan.bluemmers#grunenthal.com
companyname
0 Dignity Health
1 Ranken Jordan Pediatric Bridge Hospital
2 WellStar Health System
3 Precision Medical Products, Inc.
4 puracap.com
5 Diplomat Specialty Pharmacy
6 Lash Group
7 West-Ward Pharmaceuticals
8 Sunovion Pharmaceuticals
9 Grünenthal Group
How could I create possible email addresses using common email patterns such as: firstlast#example.com, first.last#example.com, f.last#example.com, lastF#example.com, first_last#example.com, firstL#example.com, etc.
df['email1'] = df.firstname.str.lower() + '.' + df.lastname.str.lower() + '#' + df.companyname.str.replace(r'\s+', '', regex=True).str.lower() + '.com'
print(df['email1'])
Out:
0 doug.watson#dignityhealth.com
1 nick.holekamp#rankenjordanpediatricbridgehospi... --->problematic
2 rob.schreiner#wellstarhealthsystem.com
3 austin.phillips#precisionmedicalproducts,inc..com --->problematic
4 elise.geiger#puracap.com.com --->problematic
...
9995 terry.hanley#kempersportsmanagement.com
9996 christine.marks#geocomp.com
9997 darryl.rickner#doe.com
9998 lalit.sharma#lovelylifestyle.com
9999 parul.dutt#infibeam.com
Some of them seem quite problematic; could anyone help solve this issue? Thanks a lot.
EDITED:
print(df) after applying #Sajith Herath's solution:
Out:
firstname lastname companyname \
0 Nick Holekamp Ranken ...
email
0 nick. ...
You can use a method to create permutations of the username with different separators, and define a max length used to simplify the domain built from the company name, as follows:
import pandas as pd
import random
import re

data = {"firstname": ["Nick"], "lastname": ["Holekamp"],
        "companyname": ["Ranken Jordan Pediatric Bridge Hospital"]}
df = pd.DataFrame(data=data)
max_char = 5
emails = []

def simplify_domain(text):
    # long company names are abbreviated to their capital letters
    if len(text) > max_char:
        text = ''.join([c for c in text if c.isupper()])
        return text.lower()
    return re.sub(r"\s+", "", text).lower()

def username_permutations(first_name, last_name):
    # define separators
    separators = [".", "_", "-"]
    # lower case
    combinations = list(map(lambda x: f"{first_name.lower()}{x}{last_name.lower()}", separators))
    # append a random number to the tail
    n = random.randint(1, 100)
    combinations.extend(list(map(lambda x: f"{x}{n}", combinations)))
    return combinations

for index, row in df.iterrows():
    usernames = username_permutations(row["firstname"], row["lastname"])
    email_permutations = list(map(lambda x: f"{x}#{simplify_domain(row['companyname'])}.com", usernames))
    emails.append(','.join(email_permutations))

df["email"] = emails
Final result will be nick.holekamp#rjpbh.com,nick_holekamp#rjpbh.com,nick-holekamp#rjpbh.com,nick.holekamp66#rjpbh.com,nick_holekamp66#rjpbh.com,nick-holekamp66#rjpbh.com
You can modify the simplify_domain method to validate the given string, for example removing "Inc" or ".com" values.
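For instance, a minimal sketch of such a modification, building on the simplify_domain defined above (the stop-word list and the helper name clean_company_name are assumptions, not part of the original answer):

import re

STOP_TOKENS = {"inc", "inc.", "llc", "ltd", "group", "corp"}  # assumed list of suffixes to drop

def clean_company_name(text):
    # drop a trailing ".com" and common legal suffixes before simplifying
    text = re.sub(r"\.com$", "", text.strip(), flags=re.IGNORECASE)
    words = [w for w in text.split() if w.rstrip(",.").lower() not in STOP_TOKENS]
    return simplify_domain(" ".join(words))

print(clean_company_name("Precision Medical Products, Inc."))  # e.g. pmp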

Python get first and last value from string using dictionary key values

I have some very strange data. I have a dictionary of keys and values, and I want to use it to check whether these keywords appear ONLY at the start and/or end of the text, not in the middle of the sentence. I created the simple data frame below to show the problem case and the python code I have tried so far. How do I get it to search only the start or end of the sentence? This one searches the whole text for sub-strings.
Code:
import pandas as pd

d = {'apple corp':'Company','app':'Application'} #dictionary
l1 = [1, 2, 3,4]
l2 = [
"The word Apple is commonly confused with Apple Corp which is a business",
"Apple Corp is a business they make computers",
"Apple Corp also writes App",
"The Apple Corp also writes App"
]
df = pd.DataFrame({'id':l1,'text':l2})
df['text'] = df['text'].str.lower()
df
Original Dataframe:
id text
1 The word Apple is commonly confused with Apple Corp which is a business
2 Apple Corp is a business they make computers
3 Apple Corp also writes App
4 The Apple Corp also writes App
Code Tried out:
def matcher(k):
    x = (i for i in d if i in k)
    # i.startswith(k) getting error
    return ';'.join(map(d.get, x))
df['text_value'] = df['text'].map(matcher)
df
Error:
TypeError: 'in <string>' requires string as left operand, not bool
when I use x = (i for i in d if i.startswith(k) in k)
Empty values when I try x = (i for i in d if i.startswith(k) == True in k)
TypeError: sequence item 0: expected str instance, NoneType found
when I use x = (i.startswith(k) for i in d if i in k)
Results from Code above ... Create new field 'text_value':
id text text_value
1 The word Apple is commonly confused with Apple Corp which is a business Company;Application
2 Apple Corp is a business they make computers Company;Application
3 Apple Corp also writes App Company;Application
4 The Apple Corp also writes App Company;Application
Trying to get a FINAL output like this:
id text text_value
1 The word Apple is commonly confused with Apple Corp which is a business NaN
2 Apple Corp is a business they make computers Company
3 Apple Corp also writes App Company;Application
4 The Apple Corp also writes App Application
You need a matcher function which can accept a flag, and then call it twice to get the results for startswith and endswith.
def matcher(s, flag="start"):
    if flag == "start":
        for i in d:
            if s.startswith(i):
                return d[i]
    else:
        for i in d:
            if s.endswith(i):
                return d[i]
    return None

df['st'] = df['text'].apply(matcher)
df['ed'] = df['text'].apply(matcher, flag="end")
df['text_value'] = df[['st', 'ed']].apply(lambda x: ';'.join(x.dropna()), axis=1)
df = df[['id', 'text', 'text_value']]
The text_value column looks like:
0
1 Company
2 Company;Application
3 Application
Name: text_value, dtype: object
Alternatively, you can build a single regex from the dictionary keys and use str.replace with a callable replacement that looks up the keywords matched at the start and end:
joined = "|".join(d.keys())
pat = '(?i)^(?:the\\s*)?(' + joined + ')\\b.*?|.*\\b(' + joined + ')$' + '|.*'
get = lambda x: d.get(x.group(1), "") + (';' + d.get(x.group(2), "") if x.group(2) else '')
df.text.str.replace(pat, get, regex=True)
0
1 Company
2 Company;Application
3 Company;Application
Name: text, dtype: object

Creating a table using a list of strings

I need to convert a list of lists of strings into a three-column table where the first column is 1 space longer than the longest string. I have figured out how to identify the longest string and its length, but getting the table to form has been quite tricky. Here is the program with the lists in it; it shows you that the longest one is 26 characters long.
def main():
    mycities = [['Cape Girardeau', 'MO', '63780'], ['Columbia', 'MO', '65201'],
                ['Kansas City', 'MO', '64108'], ['Rolla', 'MO', '65402'],
                ['Springfield', 'MO', '65897'], ['St Joseph', 'MO', '64504'],
                ['St Louis', 'MO', '63111'], ['Ames', 'IA', '50010'],
                ['Enid', 'OK', '73773'], ['West Palm Beach', 'FL', '33412'],
                ['International Falls', 'MN', '56649'], ['Frostbite Falls', 'MN', '56650']]
    col_width = max(len(item) for sub in mycities for item in sub)
    print(col_width)

main()
Now I am just needing to get it to print off like this:
Cape Girardeau MO 63780
Columbia MO 65201
Kansas City MO 64108
Springfield MO 65897
St Joseph MO 64504
St Louis MO 63111
Ames IA 50010
Enid OK 73773
West Palm Beach FL 33412
International Falls MN 56649
Frostbite Falls MN 56650
You're off to the right start. As an example, given the specific structure of the lists you have, you can use the col_width you calculated to determine the number of spaces to append after each city name:
for city in mycities:
    # append the string with the number of spaces required
    city_padded = city[0] + " " + " " * (col_width - len(city[0]))
    print(city_padded + city[1] + " " + city[2])
Given your example, this will produce:
Cape Girardeau MO 63780
Columbia MO 65201
Kansas City MO 64108
Rolla MO 65402
Springfield MO 65897
St Joseph MO 64504
St Louis MO 63111
Ames IA 50010
Enid OK 73773
West Palm Beach FL 33412
International Falls MN 56649
Frostbite Falls MN 56650
Note that in the original version of your question, you were missing commas in the sublists of your mycities variable, which I've added in an edit.
As a side note, it is convention in Python that words be separated by underscores in variable names for readability, so you might rename mycities to my_cities.
pep8 ref: (https://www.python.org/dev/peps/pep-0008/#function-and-variable-names)
"String name".ljust(26) will add spaces to the end of your string. For example,
Ames.ljust(26) will result in 'Ames (22 spaces here)', and then the next column will print after. If you are not sure what the longest city will be, you could replace the 26 with len(cities[-1]) after ordering the cities in a list by length. To do this, you can do sortedCities = sorted(cityListVariable, key=len)
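A minimal sketch of that approach applied to a shortened copy of the mycities list from the question (the width variable name is mine):

def main():
    mycities = [['Cape Girardeau', 'MO', '63780'], ['Columbia', 'MO', '65201'],
                ['Kansas City', 'MO', '64108'], ['International Falls', 'MN', '56649']]
    # one space longer than the longest city name
    width = max(len(city[0]) for city in mycities) + 1
    for city, state, zipcode in mycities:
        print(city.ljust(width) + state + " " + zipcode)

main()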
def main():
    cities = ['Cape Girardeau, MO 63780', 'Columbia, MO 65201', 'Kansas City, MO 64108', 'Rolla, MO 65402',
              'Springfield, MO 65897', 'St Joseph, MO 64504', 'St Louis, MO 63111', 'Ames, IA 50010',
              'Enid, OK 73773', 'West Palm Beach, FL 33412', 'International Falls, MN 56649', 'Frostbite Falls, MN 56650',
              'Charlotte, NC 28241', 'Upper Marlboro, MD 20774', 'Camdenton, MO 65020', 'San Fransisco, CA 94016']  # create list of information
    for x in cities:
        col = x.split(",")
        if len(col) == 2:
            city = col[0].strip()
            temp = col[1].strip()
        else:
            city = x[:15].strip()
            temp = x[15:].strip()
        state = temp[:2]
        zipCode = int(temp[-5:])
        print("%-20s\t%s\t%d" % (city, state, zipCode))

main()

Reformat csv file using python?

I have this csv file with only two entries. Here it is:
Meat One,['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']
The first entry is a title and the second is a list of business headings.
The problem lies with entry two.
Here is my code:
import csv
with open('phonebookCOMPK-Directory.csv', "rt") as textfile:
    reader = csv.reader(textfile)
    for row in reader:
        row5 = row[5].replace("[", "").replace("]", "")
        listt = [(''.join(row5))]
        print(listt[0])
it prints:
'Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers'
What I need to do is create a list containing these words and then use a for loop to print every item separately, like this:
Abattoirs
Exporters
Food Delivery
Butchers Retail
Meat Dealers-Retail
Meat Freezer
Meat Packers
Actually I am trying to reformat my current csv file and clean it so it is more precise and understandable.
The complete first line of the csv is this:
Meat One,+92-21-111163281,Al Shaheer Corporation,Retailers,2008,"['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']","[[' Outlets Address : Shop No. Z-10, Station Shopping Complex, MES Market, Malir-Cantt, Karachi. Landmarks : MES Market, Station Shopping Complex City : Karachi UAN : +92-21-111163281 '], [' Outlets Address : Shop 13, Ground Floor, Plot 14-D, Sky Garden, Main Tipu Sultan Road, KDA Scheme No.1, Karachi. Landmarks : Nadra Chowrangi, Sky Garden, Tipu Sultan Road City : Karachi UAN : +92-21-111163281 '], ["" Outlets Address : Near Jan's Broast, Boat Basin, Khayaban-e-Roomi, Block 5, Clifton, Karachi. Landmarks : Boat Basin, Jans Broast, Khayaban-e-Roomi City : Karachi UAN : +92-21-111163281 View Map ""], [' Outlets Address : Gulistan-e-Johar, Karachi. Landmarks : Perfume Chowk City : Karachi UAN : +92-21-111163281 '], [' Outlets Address : Tee Emm Mart, Creek Vista Appartments, Khayaban-e-Shaheen, Phase VIII, DHA, Karachi. Landmarks : Creek Vista Appartments, Nueplex Cinema, Tee Emm Mart, The Place City : Karachi Mobile : 0302-8333666 '], [' Outlets Address : Y-Block, DHA, Lahore. Landmarks : Y-Block City : Lahore UAN : +92-42-111163281 '], [' Outlets Address : Adj. PSO, Main Bhittai Road, Jinnah Supermarket, F-7 Markaz, Islamabad. Landmarks : Bhittai Road, Jinnah Super Market, PSO Petrol Pump City : Islamabad UAN : +92-51-111163281 ']]","Agriculture, fishing & Forestry > Farming equipment & services > Abattoirs in Pakistan"
First column is Name
Second column is Number
Third column is Owner
Fourth column is Business type
Fifth column is Y.O.E
Sixth column is Business Headings
Seventh column is Outlets (List of lists containing every branch address)
Eighth column is classification
There is no restriction to using csv.reader; I am open to any technique available to clean my file.
Think of it in terms of two separate tasks:
Collect some data items from a ‘dirty’ source (this CSV file)
Store that data somewhere so that it’s easy to access and manipulate programmatically (according to what you want to do with it)
Processing dirty CSV
One way to do this is to have a function deserialize_business() to distill structured business information from each incoming line in your CSV. This function can be complex because that's the nature of the task, but it's still advisable to split it into self-contained smaller functions (such as get_outlets(), get_headings(), and so on). This function can return a dictionary, but depending on what you want it can be a [named] tuple, a custom object, etc.
This function would be an ‘adapter’ for this particular CSV data source.
Example of deserialization function:
def deserialize_business(csv_line):
    """
    Distills structured business information from the given raw CSV line.
    Returns a dictionary like {name, phone, owner,
    btype, yoe, headings[], outlets[], category}.
    """
    pieces = [piece.strip("[[\"\']] ") for piece in csv_line.strip().split(',')]
    name = pieces[0]
    phone = pieces[1]
    owner = pieces[2]
    btype = pieces[3]
    yoe = pieces[4]
    # after yoe the headings begin, until the substring "Outlets Address"
    headings = pieces[4:pieces.index("Outlets Address")]
    # outlets go from the substring "Outlets Address" until the category
    outlet_pieces = pieces[pieces.index("Outlets Address"):-1]
    # combine each individual outlet's information into a string
    # and let ``deserialize_outlet()`` deal with that
    raw_outlets = ', '.join(outlet_pieces).split("Outlets Address")
    outlets = [deserialize_outlet(outlet) for outlet in raw_outlets]
    # category is the last piece
    category = pieces[-1]
    return {
        'name': name,
        'phone': phone,
        'owner': owner,
        'btype': btype,
        'yoe': yoe,
        'headings': headings,
        'outlets': outlets,
        'category': category,
    }
Example of calling it:
with open("phonebookCOMPK-Directory.csv") as f:
lineno = 0
for line in f:
lineno += 1
try:
business = deserialize_business(line)
except:
# Bad line formatting?
log.exception(u"Failed to deserialize line #%s!", lineno)
else:
# All is well
store_business(business)
Storing the data
You’ll have the store_business() function take your data structure and write it somewhere. Maybe it’ll be another CSV that’s better structured, maybe multiple CSVs, a JSON file, or you can make use of SQLite relational database facilities since Python has it built-in.
It all depends on what you want to do later.
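For example, a minimal sketch of a store_business() that appends each record to a JSON Lines file (the file name is an assumption):

import json

def store_business(business):
    # append one JSON object per line; headings and outlets stay as lists
    with open("businesses.jsonl", "a", encoding="utf-8") as out:
        out.write(json.dumps(business, ensure_ascii=False) + "\n")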
Relational example
In this case your data would be split across multiple tables. (I’m using the word “table” but it can be a CSV file, although you can as well make use of an SQLite DB since Python has that built-in.)
Table identifying all possible business headings:
business heading ID, name
1, Abattoirs
2, Exporters
3, Food Delivery
4, Butchers Retail
5, Meat Dealers-Retail
6, Meat Freezer
7, Meat Packers
Table identifying all possible categories:
category ID, parent category, name
1, NULL, "Agriculture, fishing & Forestry"
2, 1, "Farming equipment & services"
3, 2, "Abattoirs in Pakistan"
Table identifying businesses:
business ID, name, phone, owner, type, yoe, category
1, Meat One, +92-21-111163281, Al Shaheer Corporation, Retailers, 2008, 3
Table describing their outlets:
business ID, city, address, landmarks, phone
1, Karachi UAN, "Shop 13, Ground Floor, Plot 14-D, Sky Garden, Main Tipu Sultan Road, KDA Scheme No.1, Karachi", "Nadra Chowrangi, Sky Garden, Tipu Sultan Road", +92-21-111163281
1, Karachi UAN, "Near Jan's Broast, Boat Basin, Khayaban-e-Roomi, Block 5, Clifton, Karachi", "Boat Basin, Jans Broast, Khayaban-e-Roomi", +92-21-111163281
Table describing their headings:
business ID, business heading ID
1, 1
1, 2
1, 3
…
Handling all this would require a complex store_business() function. It may be worth looking into SQLite and some ORM framework if going with the relational way of keeping the data.
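As an illustration only (the table and column names are assumptions based on the sketches above), the headings table could be created and filled with the built-in sqlite3 module:

import sqlite3

conn = sqlite3.connect("directory.db")
conn.execute("""CREATE TABLE IF NOT EXISTS business_heading (
                    id INTEGER PRIMARY KEY,
                    name TEXT UNIQUE)""")
headings = ['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail',
            'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']
conn.executemany("INSERT OR IGNORE INTO business_heading (name) VALUES (?)",
                 [(h,) for h in headings])
conn.commit()
conn.close()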
Since listt[0] is still a single string, you need to split it before printing. You can replace the line:
print(listt[0])
with:
print(*[h.strip(" '") for h in listt[0].split(',')], sep='\n')
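Since column 5 in the full file is a stringified Python list, another option (not from the answers above) is to parse it with ast.literal_eval instead of stripping brackets by hand; a minimal sketch, assuming column 5 holds the headings as in the question's sample line:

import ast
import csv

with open('phonebookCOMPK-Directory.csv', "rt") as textfile:
    reader = csv.reader(textfile)
    for row in reader:
        headings = ast.literal_eval(row[5])  # e.g. ['Abattoirs', 'Exporters', ...]
        for heading in headings:
            print(heading)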
