I have a large data set and am looking for something that will split my Street Address into two columns: Street Number and Street Name.
I am trying to figure out how to do this efficiently, since I first need to process the street address and then check whether the first token of the split contains a digit.
So far I have working code that looks like this. I created two functions: one extracts the street number from the street address, and the other removes the first occurrence of that street number from the street address.
def extract_street_number(row):
    if any(map(str.isdigit, row.split(" ")[0])):
        return row.split(" ")[0]

def extract_street_name(address, streetnumber):
    if streetnumber:
        return address.replace(streetnumber, "", 1)
    else:
        return address
Then I use apply to build the two columns:
df['Street_Number'] = df.apply(lambda row: extract_street_number(row['Address']), axis=1)
df['Street_Name'] = df.apply(lambda row: extract_street_name(row['Address'], row['Street_Number']), axis=1)
I'm wondering if there is a more efficient way to do this. With the current routine I need to build the Street Number column before I can process the Street Name column.
I'm thinking of something like building the two series in a single pass over the address column. The pseudocode is something like this; I just can't figure out how to write it in Python (a rough sketch follows the pseudocode).
Pseudocode:
Split Address into two columns at the first space:
street_data = address.split(" ", maxsplit=1)
If street_data[0] contains digits, fill the columns this way:
df[street_number] = street_data[0]
df[street_name] = street_data[1]
Otherwise (street_data[0] has no digits), fill them this way:
df[street_number] = ""
df[street_name] = street_data[0] + " " + street_data[1]
# or just simply the address
df[street_name] = address
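For what it's worth, here is a minimal vectorized sketch of that pseudocode (column names are taken from the sample below; treat it as a starting point rather than a definitive solution):

import pandas as pd

df = pd.DataFrame({'Address': ['111 Rubin Center', 'Monroe St', '513 Banks St',
                               '5600 77 Center Dr', '1013 1/2 E Main St',
                               '1234C Main St', '37-01 Fair Lawn Ave']})

# split on the first space only
parts = df['Address'].str.split(' ', n=1, expand=True)
# a leading token counts as a street number when it contains at least one digit
has_digit = parts[0].str.contains(r'\d')

df['Street_Number'] = parts[0].where(has_digit, '')
df['Street_Name'] = parts[1].where(has_digit, df['Address'])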
By the way, here is a working sample of the data:
# In
df = pd.DataFrame({'Address': ['111 Rubin Center', 'Monroe St', '513 Banks St',
                               '5600 77 Center Dr', '1013 1/2 E Main St',
                               '1234C Main St', '37-01 Fair Lawn Ave']})
# Out (desired)
  Street_Number    Street_Name
0           111   Rubin Center
1                    Monroe St
2           513       Banks St
3          5600   77 Center Dr
4          1013  1/2 E Main St
5         1234C        Main St
6         37-01  Fair Lawn Ave
TL;DR:
This can be achieved in three steps-
Step 1-
df['Street Number'] = [street_num[0] if any(i.isdigit() for i in street_num[0]) else 'N/A'
                       for street_num in df.Address.apply(lambda s: s.split(" ", 1))]
Step 2-
df['Street Address'] = [street_num[1] if any(i.isdigit() for i in street_num[0]) else 'N/A'
                        for street_num in df.Address.apply(lambda s: s.split(" ", 1))]
Step 3-
mask = df['Street Address'].str.contains("N/A")
df.loc[mask, 'Street Address'] = df.loc[mask, 'Address']
Explanation-
Two more test cases were added to the dataframe for flexibility (rows 7 and 8).
Step 1 - We separate the street numbers from the addresses. This is done by splitting each address string, taking the first element of the resulting list, and assigning it to the Street Number column.
If the first element doesn't contain a number, 'N/A' is placed in the Street Number column.
Step 2 - Since the first element of the split string holds the street number, the second element must be the street name, so it is assigned to the Street Address column.
Step 3 - Because of step 2, Street Address becomes 'N/A' for addresses that do not contain a number; step 3 resolves that by copying the original Address back in.
Hence, we can solve this in three steps, after hours of struggle put in.
A solution reflecting your pseudocode is below.
First, let's split "Address" and store the parts somewhere:
new = df["Address"].str.split(" ", n = 1, expand = True)
df["First Part"]= new[0]
df["Last Part"]= new[1]
Next, let's write down the conditions:
cond1 = df['First Part'].apply(str.isdigit)
cond2 = df['Last Part'].apply(str.isdigit)
Now check which rows meet the conditions:
df.loc[cond1 & ~cond2, "Street"] = df.loc[cond1 & ~cond2, "Last Part"]
df.loc[cond1 & ~cond2, "Number"] = df.loc[cond1 & ~cond2, "First Part"]
df.loc[~cond1 & ~cond2, "Street"] = df.loc[~cond1 & ~cond2, ['First Part', 'Last Part']].apply(lambda x: x[0] + ' ' + x[1], axis = 1)
Finally, let's clean up the auxiliary columns:
df.drop(["First Part", "Last Part"], axis = 1, inplace=True)
df
            Address        Street  Number
0  111 Rubin Center  Rubin Center     111
1         Monroe St     Monroe St     NaN
2      513 Banks St      Banks St     513
# mock test
df = pd.DataFrame({'Address': ['111 Rubin Center', 'Monroe St',
                               '513 Banks St', 'Banks 513 St',
                               'Rub Cent 111']})
Unless I'm missing something, a bit of regex should solve your request:
# gets the number only if it starts the line
df['Street_Number'] = df.Address.str.extract(r'(^\d+)')
# splits only if the number is at the start of the line
df['Street_Name'] = df.Address.str.split(r'^\d+').str[-1]
            Address Street_Number    Street_Name
0  111 Rubin Center           111   Rubin Center
1         Monroe St           NaN      Monroe St
2      513 Banks St           513       Banks St
3      Banks 513 St           NaN   Banks 513 St
4      Rub Cent 111           NaN   Rub Cent 111
Let me know where this falls flat.
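One place it does fall flat on the original sample: street numbers that aren't purely numeric ('1234C', '37-01') won't match ^\d+. A hedged sketch of a broader (hypothetical) pattern that accepts any leading token containing a digit:

# first whitespace-delimited token must contain at least one digit
extracted = df.Address.str.extract(r'^(?P<Street_Number>\S*\d\S*)\s+(?P<Street_Name>.*)$')
df['Street_Number'] = extracted['Street_Number']
# rows with no leading number keep the full address as the street name
df['Street_Name'] = extracted['Street_Name'].fillna(df.Address)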
I have a student rank dataset in which a few values are missing. I want to apply fuzzy matching on the Name and Rank columns within the same dataset, find the best-matching values, update the null values in the remaining columns, and add a Matched_Name column, a Matched_Rank column, and a Score. I'm a beginner, so it would be great if someone could help me. Thank you.
data:
Name School Marks Location Rank
0 JACK TML 90 AU 3
1 JHON SSP 85 NULL NULL
2 NULL TML NULL AU 3
3 BECK NTC NULL EU 2
4 JHON SSP NULL JP 1
5 SEON NTC 80 RS 5
Expected Data Output:
data:
Name School Marks Location Rank Matched_Name Matched_Rank Score
0 JACK TML 90 AU 3 Jack 3 100
1 JHON SSP 85 JP 1 JHON 1 100
2 BECK NTC NULL EU 2 - - -
3 SEON NTC 80 RS 5 - - -
How do I do this with fuzzy logic?
Here is my code:
import datetime
import pandas as pd
import fuzzymatcher

ds1 = pd.read_csv('dataset.csv')
ds2 = pd.read_csv('dataset.csv')
# Columns to match on from df_left
left_on = ["Name", "Rank"]
# Columns to match on from df_right
right_on = ["Name", "Rank"]
# Now perform the match
#Start the time
a = datetime.datetime.now()
print('started at :',a)
# It will take several minutes to run on this data set
matched_results = fuzzymatcher.fuzzy_left_join(ds1,
ds2,
left_on,
right_on)
b = datetime.datetime.now()
print('end at :', b)
print("Time taken: ", b-a)
print(matched_results)
try:
    print(matched_results.columns)
    cols = matched_results.columns
except:
    pass

matched_results.to_csv('matched_results.csv', index=False)
# Let's see the best matches
try:
    matched_results[cols].sort_values(by=['best_match_score'], ascending=False).head(5)
except:
    pass
fuzzywuzzy is usually used when names are not exact matches, which doesn't appear to be the case here. However, if your names aren't exact matches, you can do the following:
Create a list of all school names using
df['school_name'].tolist()
Find null values in your data frame.
Use
process.extractOne(current_name, school_names_list, scorer=fuzz.partial_ratio)
Just remember that you shouldn't use fuzzy matching if you have exact names. In that case you only need to filter the data frame like this:
filtered = df[df['school_name'] == x]
and use it to replace values in the original data frame.
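Putting those steps together, here is a minimal sketch; the school names, the misspelled query, and the use of partial_ratio are assumptions for illustration, not your actual data:

from fuzzywuzzy import fuzz, process

# hypothetical reference list of canonical school names
school_names_list = ['Trinity Model', 'South Side Public', 'North Tech College']

# best fuzzy match for a misspelled name, plus its similarity score
current_name = 'Trinty Model'
name, score = process.extractOne(current_name, school_names_list,
                                 scorer=fuzz.partial_ratio)
print(name, score)  # expected to pick 'Trinity Model' with a high score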
Given a dataframe
data = [['Bob','25'],['Alice','46'],['Alice','47'],['Charlie','19'],
['Charlie','19'],['Charlie','19'],['Doug','23'],['Doug','35'],['Doug','35.5']]
df = pd.DataFrame(data, columns = ['Customer','Sequence'])
Calculate the following:
First Sequence in each group is assigned a GroupID of 1.
Compare first Sequence to subsequent Sequence values in each group.
If difference is greater than .5, increment GroupID.
If GroupID was incremented, instead of comparing subsequent values to the first, use the current Sequence.
In the desired results table below...
Bob only has 1 record so the GroupID is 1.
Alice has 2 records and the difference between the two Sequence values (46 & 47) is greater than .5 so the GroupID is incremented.
Charlie's Sequence values are all the same, so all records get GroupID 1.
For Doug, the difference between the first two Sequence values (23 & 35) is greater than .5, so the GroupID for the second Sequence becomes 2. Now, since the GroupID was incremented, I want to compare the next value of 35.5 to 35, not 23, which means the last two rows share the same GroupID.
Desired results:
CustomerID  Sequence  GroupID
Bob               25        1
Alice             46        1
Alice             47        2
Charlie           19        1
Charlie           19        1
Charlie           19        1
Doug              23        1
Doug              35        2
Doug            35.5        2
My implementation:
# generate unique ID based on each customer's Sequence
df['EventID'] = df.groupby('Customer')['Sequence'].transform(lambda x: pd.factorize(x)[0]) + 1
# impute first Sequence for each customer for comparison
df['FirstSeq'] = np.where(df['EventID'] == 1, df['Sequence'], np.nan)
# group by and fill first Sequence forward
df['FirstSeq'] = df.groupby('Customer')['FirstSeq'].transform(lambda v: v.ffill())
# get difference of first Sequence and all others
df['FirstSeqDiff'] = abs(df['FirstSeq'] - df['Sequence'])
# create unique GroupID based on Sequence difference from first Sequence
df['GroupID'] = np.cumsum(df.FirstSeqDiff > 0.5) + 1
The above works for cases like Bob, Alice and Charlie but not Doug because it is always comparing to the first Sequence. How can I modify the code to change the compared Sequence value if the GroupID is incremented?
EDIT:
The dataframe will always be sorted by Customer and Sequence. I guess a better way to explain my goal is: assign a unique ID to all Sequence values whose differences are .5 or less, grouping by Customer.
The code has errors; adding df = df.astype({'Customer': str, 'Sequence': np.float64}) would fix them. But you still cannot get what you want with this design. Try defining your own function, myfunc, which solves the problem directly:
import numpy as np
import pandas as pd

data = [['Bob', '25'], ['Alice', '46'], ['Alice', '47'], ['Charlie', '19'],
        ['Charlie', '19'], ['Charlie', '19'], ['Doug', '23'], ['Doug', '35'], ['Doug', '35.5']]
df = pd.DataFrame(data, columns=['Customer', 'Sequence'])
df = df.astype({'Customer': str, 'Sequence': np.float64})

def myfunc(series):
    ret = []
    series = series.sort_values().values
    for i, val in enumerate(series):
        if i == 0:
            ret.append(1)
        else:
            # start a new group whenever the gap to the previous value exceeds 0.5
            ret.append(ret[-1] + (series[i] - series[i - 1] > 0.5))
    return ret

df['EventID'] = df.groupby('Customer')['Sequence'].transform(lambda x: myfunc(x))
print(df)
Happy coding my friend.
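If you prefer a vectorized version, here is a minimal sketch of the same consecutive-difference idea, continuing from the df above (it assumes the frame is already sorted by Customer and Sequence, as stated in the edit, and that Sequence has been cast to float):

# a new group starts whenever the gap to the previous row within a customer exceeds 0.5
breaks = df.groupby('Customer')['Sequence'].diff().gt(0.5)
# running count of breaks per customer, starting at 1
df['GroupID'] = breaks.astype(int).groupby(df['Customer']).cumsum() + 1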
I have the following two data frames, taken from Excel files:
df_a = 10000 rows (like the master list that has all unique #s)
df_b = 670 rows
I am loading an Excel file (df_b) that has zip, address, and state. I want to match on that info and then add the supplier # from df_a, so that I end up with one file that still has 670 rows but now also has the supplier # column.
df_a =
(10000 rows)
(unique)
supplier # ZIP ADDRESS STATE Unique Key
0 7100000 35481 14th street CA 35481-14th street-CA
1 7000005 45481 14th street CA 45481-14th street-CA
2 7000006 45482 140th circle CT 45482-140th circle-CT
3 7000007 35482 140th circle CT 35482-140th circle-CT
4 7000008 35483 13th road VT 35483-13th road-VT
df_b =
(670 rows)
ZIP ADDRESS STATE Unique Key
0 35481 14th street CA 35481-14th street-CA
1 45481 14th street CA 45481-14th street-CA
2 45482 140th circle CT 45482-140th circle-CT
3 35482 140th circle CT 35482-140th circle-CT
4 35483 13th road VT 35483-13th road-VT
OUTPUT:
df_c =
(670 rows)
ZIP ADDRESS STATE Unique Key (Unique)supplier #
0 35481 14th street CA 35481-14th street-CA 7100000
1 45481 14th street CA 45481-14th street-CA 7100005
2 45482 140th circle CT 45482-140th circle-CT 7100006
3 35482 140th circle CT 35482-140th circle-CT 7100007
4 35483 13th road VT 35483-13th road-VT 7100008
I tried merging the two dfs together, but they are not matching and instead I'm getting a bunch of NaNs:
df10 = df_a.merge(df_b, on='Unique Key', how='left')
The result is one data frame with lots of columns and no matches. I've also tried .map and .concat. I'm not sure what's going on.
Have you tried
df10 = df_a.merge(df_b, on='Unique Key', how='inner')
An 'inner join' retains only common records, which, IIUC, is what you're trying to achieve.
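If the goal is to keep the 670 rows of df_b and just attach the supplier number, another option (a sketch, assuming the column names shown in the question) is to make df_b the left frame and pull over only the supplier column:

# keep df_b's 670 rows; bring in only the supplier number from df_a
df_c = df_b.merge(df_a[['Unique Key', 'supplier #']], on='Unique Key', how='left')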
ADDED 2021-02-14
Creating CSVs from your test data and reading them into pandas:
df_mrg = df_a.merge(df_b[1:3], how='inner', on='Unique_Key')
df_mrg
produces the merged frame.
Notes:
the slice on df_b to create a subset
changed column names (spaces and symbols other than _ make my skin crawl)
I also manually removed leading and trailing whitespace from the Unique_Key cell values (there are string methods that can automate this; see the sketch below)
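A minimal sketch of that clean-up (assuming the key column is named Unique_Key in both frames):

# strip leading/trailing whitespace from the join key before merging
df_a['Unique_Key'] = df_a['Unique_Key'].str.strip()
df_b['Unique_Key'] = df_b['Unique_Key'].str.strip()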
Also consider that:
df_mrg = df_a.merge(df_b[1:3], how='right', on='Unique_Key')
will return the same dataframe as 'inner' for the present data, but could be something worth testing depending on your data and what you want to know.
Also, merge permits passing a list of columns. Since the source columns for your compound key are in both tables, you could test for potential problems with compound key by:
df_mrg2 = df_a.merge(df_b[1:3], how='inner', on=['ZIP','ADDRESS','STATE'])
np.where(df_mrg2['Unique_Key_x']==df_mrg2['Unique_Key_y'],True,False)
df_mrg2 returns the same record set as df_mrg, but without duplication of the 'on' fields.
All this goes beyond answering your question, but I hope it helps.
I know you can do this with a series, but I can't seem to do this with a dataframe.
I have the following:
    name                     note  age
0    jon   likes beer on tuesdays   10
1    jon   likes beer on tuesdays  NaN
2  steve  tonight we dine in heck   20
3  steve  tonight we dine in heck  NaN
I am trying to produce the following:
    name                     note  age
0    jon   likes beer on tuesdays   10
1    jon   likes beer on tuesdays   10
2  steve  tonight we dine in heck   20
3  steve  tonight we dine in heck   20
I know how to do this for string values using groupby and join, but that only works on strings. I'm having issues converting the entire age column to a string data type in the dataframe.
Any suggestions?
Use GroupBy.first with GroupBy.transform if you want to repeat the first value per group:
g = df.groupby('name')
df['note'] = g['note'].transform(' '.join)
df['age'] = g['age'].transform('first')
If you need to process multiple columns (i.e. all numeric columns with first and all string columns with join), you can build a dictionary mapping column names to functions, pass it to GroupBy.agg, and finally use DataFrame.join:
cols1 = df.select_dtypes(np.number).columns
cols2 = df.columns.difference(cols1).difference(['name'])
d1 = dict.fromkeys(cols2, lambda x: ' '.join(x))
d2 = dict.fromkeys(cols1, 'first')
d = {**d1, **d2}
df1 = df[['name']].join(df.groupby('name').agg(d), on='name')
I have one data frame that has resulted from a spatial join between 2 Geopandas.GeoDataFrame objects.
Because there was more than one item overlapping with the target feature, the rows have been duplicated so that each row has the inherited information from each of the overlapping entities. To simulate this situation, we can run the following lines:
import geopandas

world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
cities = geopandas.read_file(geopandas.datasets.get_path('naturalearth_cities'))
cities = cities[['geometry', 'name']]
cities = cities.rename(columns={'name':'City'})
countries_with_city = geopandas.sjoin(world, cities, how="inner", op='intersects')
I am trying to generate a new column in the world geodataframe that contains a list of length 0, 1, or more, with the "City" attribute of all the overlapping cities of each country. This is what I wrote so far:
for country in world.index:
    subset_countries = countries_with_city.loc[countries_with_city.index == world.loc[country, "name"]]
    a = subset_countries["City"].tolist()
    list_of_names = list(subset_countries["City"])
    world[list_of_names] = list_of_names
When I run this code, however, I get stuck at the line a = subset_countries["City"].tolist(). The error I get is 'str' object has no attribute 'tolist'.
From what I have tested and investigated, it seems I am getting this error because the first country [countries_with_city.loc[countries_with_city.index==world.loc[1, "name"]]] has only one city inside it. Hence, when I slice the dataframe, the fact that there is only one row with index=1 makes the outcome a string instead of a data frame that could then be turned into a list.
Is there a straightforward way to make the code work in every case (0, 1, or many cities)? The goal is to generate a list of city names that will then be written into the world dataframe.
I am working in Python 3.
If I understand you correctly, one approach is to build a mapping from country name to a list of city names:
# Build a Series with index=countries, values=cities
country2city = countries_with_city.groupby('name')['City'].agg(lambda x: list(x))
# Use the mapping on the name column of the world DataFrame
world['city_list'] = world['name'].map(country2city)
# Peek at a nontrivial part of the result
world.drop('geometry', axis=1).tail()
pop_est continent name iso_a3 gdp_md_est city_list
172 218519.0 Oceania Vanuatu VUT 988.5 NaN
173 23822783.0 Asia Yemen YEM 55280.0 [Sanaa]
174 49052489.0 Africa South Africa ZAF 491000.0 [Cape Town, Bloemfontein, Johannesburg, Pretoria]
175 11862740.0 Africa Zambia ZMB 17500.0 [Lusaka]
176 12619600.0 Africa Zimbabwe ZWE 9323.0 [Harare]
If you intend to print the city lists right away, you can join the strings in each list to remove the square brackets:
world['city_str'] = world['city_list'].apply(
    lambda x: ', '.join(c for c in x) if x is not np.nan else None)
# Sanity-check result
world.filter(like='city').tail()
city_list city_str
172 NaN None
173 [Sanaa] Sanaa
174 [Cape Town, Bloemfontein, Johannesburg, Pretoria] Cape Town, Bloemfontein, Johannesburg, Pretoria
175 [Lusaka] Lusaka
176 [Harare] Harare