I have the following two data frames, taken from Excel files:
df_a = 10000 rows (the master list that has all the unique #s)
df_b = 670 rows
I am loading an Excel file (df_b) that has zip, address, and state, and I want to match on that info and add the supplier # from df_a, so that I end up with one file that is still 670 rows but now has the supplier # column.
df_a =
(10000 rows, unique)
supplier # ZIP ADDRESS STATE Unique Key
0 7100000 35481 14th street CA 35481-14th street-CA
1 7000005 45481 14th street CA 45481-14th street-CA
2 7000006 45482 140th circle CT 45482-140th circle-CT
3 7000007 35482 140th circle CT 35482-140th circle-CT
4 7000008 35483 13th road VT 35483-13th road-VT
df_b =
(670 rows)
ZIP ADDRESS STATE Unique Key
0 35481 14th street CA 35481-14th street-CA
1 45481 14th street CA 45481-14th street-CA
2 45482 140th circle CT 45482-140th circle-CT
3 35482 140th circle CT 35482-140th circle-CT
4 35483 13th road VT 35483-13th road-VT
OUTPUT:
df_c =
(670 rows)
ZIP ADDRESS STATE Unique Key supplier # (unique)
0 35481 14th street CA 35481-14th street-CA 7100000
1 45481 14th street CA 45481-14th street-CA 7000005
2 45482 140th circle CT 45482-140th circle-CT 7000006
3 35482 140th circle CT 35482-140th circle-CT 7000007
4 35483 13th road VT 35483-13th road-VT 7000008
I tried merging the 2 dfs together, but they are not matching and instead I'm getting a bunch of NaNs:
df10 = df_a.merge(df_b, on='Unique Key', how='left')
The result is one data frame with lots of columns and no matches. I've also tried .map and .concat. I'm not sure what's going on.
Have you tried
df10 = df_a.merge(df_b, on='Unique Key', how='inner')
An 'inner' join retains only common records which, IIUC, is what you're trying to achieve.
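That said, NaNs after a left merge usually mean the key values don't match exactly (stray whitespace, case, or dtype differences), and merging in the other direction keeps df_b's 670 rows. A minimal sketch, assuming the mismatch comes from whitespace around the keys:

import pandas as pd

# Normalize the join key on both sides; hidden whitespace is the usual
# culprit when every merged row comes back NaN
for frame in (df_a, df_b):
    frame['Unique Key'] = frame['Unique Key'].astype(str).str.strip()

# Merge df_a's supplier # onto df_b so the result keeps df_b's 670 rows
df_c = df_b.merge(df_a[['Unique Key', 'supplier #']], on='Unique Key', how='left')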
ADDED 2021-02-14
Creating CSVs from your test data, reading them into pandas, and then running
df_mrg = df_a.merge(df_b[1:3], how='inner', on='Unique_Key')
df_mrg
produces the two matching rows (supplier # 7000005 and 7000006).
Notes:
I used the slice on df_b to create a subset.
I changed the column names (spaces and symbols other than _ make my skin crawl).
I also manually eliminated leading and trailing whitespace in the Unique_Key cell values (there are string methods that can automate this; see the sketch below).
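For instance, a one-line sketch of that whitespace cleanup, assuming the renamed key column is Unique_Key:

df_a['Unique_Key'] = df_a['Unique_Key'].str.strip()
df_b['Unique_Key'] = df_b['Unique_Key'].str.strip()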
Also consider that:
df_mrg = df_a.merge(df_b[1:3], how='right', on='Unique_Key')
will return the same dataframe as 'inner' for the present data, but could be something worth testing depending on your data and what you want to know.
Also, merge permits passing a list of columns. Since the source columns for your compound key are in both tables, you could test for potential problems with the compound key by:
df_mrg2 = df_a.merge(df_b[1:3], how='inner', on=['ZIP','ADDRESS','STATE'])
np.where(df_mrg2['Unique_Key_x'] == df_mrg2['Unique_Key_y'], True, False)  # with numpy imported as np
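For clean keys this should come back all True (here, for the two-row subset, array([ True,  True])).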
df_mrg2 returns the same record set as df_mrg, but without duplication of the 'on' fields.
All this goes way beyond answering your question, but I hope it helps.
I have a dataframe column with city locations, and some of the cells have the same value (city) twice within the cell. I was wondering how to get rid of one of the values, e.g. instead of saying Dublin Dublin, the cell should just say Dublin once.
I have tried df['city'].apply(set), but it doesn't give me what I am looking for.
Any advice much appreciated.
You can split each item on whitespace, deduplicate the resulting list of strings while preserving order, and then re-join:
df['city'] = df['city'].str.split().apply(lambda x: pd.Series(x).drop_duplicates().tolist()).str.join(' ')
Output:
>>> df
city
0 Los Angeles CA
1 none
2 London
3 Dublin
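A lighter-weight variant that is also order-preserving (a sketch; dict.fromkeys deduplicates while keeping first occurrences):

df['city'] = df['city'].str.split().apply(lambda x: ' '.join(dict.fromkeys(x)))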
I'm trying to create a simple program that finds negative values in a pandas dataframe and combines them with their matching rows. Basically, I have data that looks like this:
LastName DrugName RxNumber Amount ClientTotalCost
ADAMS Drug 100001 30 10.69
ADAMS Drug 100001 -25 -8.95
...
The idea is that I need to match up fills and refunds, then combine them into a single row. So, in the example above we'd have one row that looks like this:
LastName DrugName RxNumber Amount ClientTotalCost
ADAMS Drug 100001 5 1.74
Also, if a refund doesn't match to a fill row, I'm supposed to just delete it, which I've been accomplishing with .drop()
I'm imagining I can build a multiindex for this somehow, where each row that's a negative is marked as a refund and each fill row is marked as a fill. Then I just have some kind of for loop that goes through the list and attempts to match a certain number of times based on name/number of refunds.
Here's what I was trying:
pbm_negative_index = raw_pbm_data.loc[:, ['LastName', 'DrugName', 'RxNumber', 'ClientTotalCost']]
names = raw_pbm_data.loc[:, 'LastName']
unique_names = names.unique()
for n in unique_names:
    edf["Refund"] = edf["ClientTotalCost"].shift(1, fill_value=edf["ClientTotalCost"].head(1)) < 0
This obviously doesn't work and I'd like to use the indexing tools in Pandas to achieve a similar result.
Your specification reduces to two simple steps:
aggregate +ve & -ve matching rows
drop remaining -ve rows after aggregation
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""LastName DrugName RxNumber Amount ClientTotalCost
ADAMS Drug 100001 30 10.69
ADAMS Drug 100001 -25 -8.95
ADAMS2 Drug. 100001 -5 -1.95
"""), sep=r"\s+")
# aggregate
dfa = df.groupby(["LastName","DrugName","RxNumber"],as_index=False).agg({"Amount":"sum","ClientTotalCost":"sum"})
# drop remaining -ve amounts
dfa = dfa.drop(dfa.loc[dfa.Amount.lt(0)].index)
  LastName DrugName  RxNumber  Amount  ClientTotalCost
0    ADAMS     Drug    100001       5             1.74
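As a quick sanity check (a sketch), the unmatched ADAMS2 refund should now be gone and no negative amounts should remain:

# every remaining aggregated amount is non-negative after the drop
assert not dfa.Amount.lt(0).any()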
I have a large data set and am looking for something that will split my Street Address column into two columns, Street Number and Street Name.
I am trying to figure out how to do this efficiently, since I first need to process the street address and then check whether the first token of the split contains a digit.
So far I have working code that looks like this. I created two functions: one extracts the street number from the street address, while the other replaces the first occurrence of that street number in the street address.
def extract_street_number(row):
    # return the first token if it contains any digit, else None
    if any(map(str.isdigit, row.split(" ")[0])):
        return row.split(" ")[0]

def extract_street_name(address, streetnumber):
    # drop the first occurrence of the street number from the address
    if streetnumber:
        return address.replace(streetnumber, "", 1)
    else:
        return address
Then I use apply to build the two columns.
df[street_number] = df.apply(lambda row: extract_street_number(row[address_col]), axis=1)
df[street_name] = df.apply(lambda row: extract_street_name(row[address_col], row[street_number]), axis=1)
I'm wondering if there is a more efficient way to do this. With the current routine I need to build the Street Number column before I can process the Street Name column.
I'm thinking of building the two series in a single pass over the address column. The pseudocode is something like this; I just can't figure out how to code it in Python.
Pseudocode:
Split Address into two parts at the first space:
street_data = address.split(" ", maxsplit=1)
If street_data[0] has digits, then build the columns this way:
df[street_number] = street_data[0]
df[street_name] = street_data[1]
Else, if street_data[0] has no digits, build the columns this way:
df[street_number] = ""
df[street_name] = street_data[0] + " " + street_data[1]
# or just simply the address
df[street_name] = address
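A vectorized sketch of this pseudocode, assuming the source column is named 'Address' (the output column names here are just placeholders):

split = df['Address'].str.split(' ', n=1, expand=True)
has_digit = split[0].str.contains(r'\d')   # does the first token contain a digit?
df['Street_Number'] = split[0].where(has_digit, '')
df['Street_Name'] = split[1].where(has_digit, df['Address'])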
By the way, here is a working sample of the data:
# In
df = pd.DataFrame({'Address':['111 Rubin Center', 'Monroe St', '513 Banks St', '5600 77 Center Dr', '1013 1/2 E Main St', '1234C Main St', '37-01 Fair Lawn Ave']})
# Out
  Street_Number    Street_Name
0           111   Rubin Center
1                    Monroe St
2           513       Banks St
3          5600   77 Center Dr
4      1013 1/2      E Main St
5         1234C        Main St
6         37-01  Fair Lawn Ave
TL;DR:
This can be achieved in three steps-
Step 1-
df['Street Number'] = [street_num[0] if any(i.isdigit() for i in street_num[0]) else 'N/A' for street_num in df.Address.apply(lambda s: s.split(" ",1))]
Step 2-
df['Street Address'] = [street_num[1] if any(i.isdigit() for i in street_num[0]) else 'N/A' for street_num in df.Address.apply(lambda s: s.split(" ",1))]
Step 3-
df.loc[df['Street Address'] == 'N/A', 'Street Address'] = df['Address']
Explanation-
Added two more test cases to the dataframe for code flexibility (rows 7, 8)-
Step 1 - We separate the street numbers from the address here. This is done by splitting the address string, slicing the first element from the resulting list, and assigning it to the Street Number column.
If the first element doesn't contain a number, N/A is put in the Street Number column instead.
Step 2 - As the first element of the split string holds the Street Number, the second element has to be the Street Address, so it is assigned to the Street Address column.
Step 3 - Because of step two, the Street Address becomes 'N/A' for any Address that does not contain a number, and that is resolved by the statement above.
Hence, we can solve this in three steps.
A solution reflecting your pseudocode is below.
First let's divide "Address" and store it somewhere:
new = df["Address"].str.split(" ", n = 1, expand = True)
df["First Part"]= new[0]
df["Last Part"]= new[1]
Next let's write down the conditions:
cond1 = df['First Part'].apply(str.isdigit)
cond2 = df['Last Part'].apply(str.isdigit)
Now check what meets the given conditions:
df.loc[cond1 & ~cond2, "Street"] = df.loc[cond1 & ~cond2, "Last Part"]
df.loc[cond1 & ~cond2, "Number"] = df.loc[cond1 & ~cond2, "First Part"]
df.loc[~cond1 & ~cond2, "Street"] = df.loc[~cond1 & ~cond2, ['First Part', 'Last Part']].apply(lambda x: x[0] + ' ' + x[1], axis = 1)
Finally, let's clean up those auxiliary columns:
df.drop(["First Part", "Last Part"], axis = 1, inplace=True)
df
            Address        Street Number
0  111 Rubin Center  Rubin Center    111
1         Monroe St     Monroe St    NaN
2      513 Banks St      Banks St    513
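One hedged caveat: if an address contains no space at all, 'Last Part' comes back as None and apply(str.isdigit) raises a TypeError; filling missing values first avoids that:

df["Last Part"] = df["Last Part"].fillna("")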
#mock test
df = pd.DataFrame({'Address':['111 Rubin Center', 'Monroe St',
'513 Banks St', 'Banks 513 St',
'Rub Cent 111']})
Unless I'm missing something, a bit of regex should solve your request:
# gets the number only if it starts the line
df['Street_Number'] = df.Address.str.extract(r'(^\d+)')
# splits only if a number is at the start of the line, then trims the leading space
df['Street_Name'] = df.Address.str.split(r'^\d+').str[-1].str.strip()
Address Street_Number Street_Name
0 111 Rubin Center 111 Rubin Center
1 Monroe St NaN Monroe St
2 513 Banks St 513 Banks St
3 Banks 513 St NaN Banks 513 St
4 Rub Cent 111 NaN Rub Cent 111
Let me know where this falls flat.
I have one data frame that has resulted from a spatial join between 2 Geopandas.GeoDataFrame objects.
Because more than one item overlapped with the target feature, the rows have been duplicated so that each row holds the inherited information from each of the overlapping entities. To simulate this situation, we can run the following lines:
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
cities = geopandas.read_file(geopandas.datasets.get_path('naturalearth_cities'))
cities = cities[['geometry', 'name']]
cities = cities.rename(columns={'name':'City'})
countries_with_city = geopandas.sjoin(world, cities, how="inner", op='intersects')
I am trying to generate a new column in the world geodataframe that contains a list of length 0, 1, or more, with the "City" attribute of all the overlapping cities for each country. For this, I have written the following so far:
for country in world.index:
    subset_countries = countries_with_city.loc[countries_with_city.index == world.loc[country, "name"]]
    a = subset_countries["City"].tolist()
    list_of_names = list(subset_countries["City"])
    world[list_of_names] = list_of_names
When I run this code, however, I get stuck at the line a = subset_countries["City"].tolist(). The error I get is 'str' object has no attribute 'tolist'.
From what I have tested and investigated, it seems that I am getting this error because the first country [countries_with_city.loc[countries_with_city.index==world.loc[1, "name"]]] has only one city inside it. Hence, when I slice the dataframe, the fact that there is only one row with index=1 makes the outcome a string instead of a data frame that can then be turned into a list.
Is there a straightforward way to make the code work in every case (0, 1, or many cities)? The goal is to generate a list of city names that will then be written into the world dataframe.
I am working in Python 3.
If I understand you correctly, one approach is to build a mapping from country name to a list of city names:
# Build a Series with index=countries, values=cities
country2city = countries_with_city.groupby('name')['City'].agg(lambda x: list(x))
# Use the mapping on the name column of the world DataFrame
world['city_list'] = world['name'].map(country2city)
# Peek at a nontrivial part of the result
world.drop('geometry', axis=1).tail()
pop_est continent name iso_a3 gdp_md_est city_list
172 218519.0 Oceania Vanuatu VUT 988.5 NaN
173 23822783.0 Asia Yemen YEM 55280.0 [Sanaa]
174 49052489.0 Africa South Africa ZAF 491000.0 [Cape Town, Bloemfontein, Johannesburg, Pretoria]
175 11862740.0 Africa Zambia ZMB 17500.0 [Lusaka]
176 12619600.0 Africa Zimbabwe ZWE 9323.0 [Harare]
If you intend to print the city lists right away, you can join the strings in each list to remove the square brackets:
world['city_str'] = world['city_list'].apply(lambda x: ', '.join(x) if isinstance(x, list) else None)
# Sanity-check result
world.filter(like='city').tail()
city_list city_str
172 NaN None
173 [Sanaa] Sanaa
174 [Cape Town, Bloemfontein, Johannesburg, Pretoria] Cape Town, Bloemfontein, Johannesburg, Pretoria
175 [Lusaka] Lusaka
176 [Harare] Harare
OK, I have two columns in Excel that contain city names. I need a ranking of how many times a relationship between two cities occurs. For example, the ranking for the data below should be as follows: #1 is Austin to Dallas with 3 occurrences, #2 is Chicago to Boston with 2 occurrences, and #3 is Chicago to New York with 1 occurrence.
[sample data set]
You can use a =COUNTIFS formula to count specific city pairs.
For example:
Row 1 = Headers
Column A = Origin City
Column B = Destination City
Data in your table should be in A2:B7
You can use:
=COUNTIFS($A$2:$A$7,"Austin",$B$2:$B$7,"Dallas")
=COUNTIFS($A$2:$A$7,"Chicago",$B$2:$B$7,"Boston")
=COUNTIFS($A$2:$A$7,"Chicago",$B$2:$B$7,"New York")
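If you'd later pull this data into pandas (as in the rest of this page), a hedged sketch of the same ranking, assuming the columns are named 'Origin' and 'Destination' and the sample rows mirror the counts described above:

import pandas as pd

# hypothetical data matching the described counts
df = pd.DataFrame({
    'Origin':      ['Austin', 'Austin', 'Austin', 'Chicago', 'Chicago', 'Chicago'],
    'Destination': ['Dallas', 'Dallas', 'Dallas', 'Boston',  'Boston',  'New York'],
})

# count each origin-destination pair and rank by frequency
pair_counts = (df.groupby(['Origin', 'Destination'])
                 .size()
                 .sort_values(ascending=False))
print(pair_counts)   # Austin/Dallas 3, Chicago/Boston 2, Chicago/New York 1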