identify common column values across two different sized dataframes in pandas - python-3.x

I have two dataframes of different row and column sizes. I want to compare the two and create new columns in df2 based on whether values exist in df1. First for an example (I think you can copy/paste this text into a .csv to import), df1 looks like this:
subject block target dist1 dist2 dist3
7 1 doorlock candleholder01 jar03 stroller
7 2 glassescase clownfish kangaroo ram
7 3 badger chocolatefonduedish hosenozzle toycar04
7 4 hyena crocodile pig toad
7 1 scooter cormorant lizard rockbass
df2 like this:
subject image
7 acorn
7 chainsaw
7 doorlock
7 stroller
7 bathtub
7 clownfish
7 bagtie
7 birdie
7 witchhat
7 crocodile
7 honeybee
7 electricitymeter
7 flowerwreath
7 jar03
7 camera02a
and what I'd like to achieve is this:
subject image present type block
7 acorn 0 NA NA
7 chainsaw 0 NA NA
7 doorlock 1 target 1
7 stroller 1 dist3 1
7 bathtub 0 NA NA
7 clownfish 1 dist1 2
7 bagtie 0 NA NA
7 birdie 0 NA NA
7 witchhat 0 NA NA
7 crocodile 1 dist1 4
7 honeybee 0 NA NA
7 electricitymeter 0 NA NA
7 flowerwreath 0 NA NA
7 jar03 1 dist2 1
7 camera02a 0 NA NA
Specifically, I would like to identify, from the 4 columns in df1 ('target', 'dist1', 'dist2', 'dist3'), which values exist in the 'image' column of df2, and then (1) generate a column (boolean or 0/1) in df2 indicating whether that value exists in df1, (2) generate a second column in df2 with the name of the column in which that item exists in df1 (i.e. 'target', 'dist1', ...), and finally (3) generate a column in df2 with the df1 'block' value from which that item came from, if any.
I hope this is clear. I'd also like some ideas on how to handle the cases that don't match: should I code these as NaN or just empty strings? I will probably be using groupby() later, and I had some problems with groupby() when the DataFrame contained missing values.

You can do this by using melt on df1 and merge.
df1 = df1.melt(id_vars=['subject', 'block'], var_name='type', value_name='image')
df2['present'] = df2['image'].isin(df1['image']).astype(int)
pd.merge(df2, df1[['image', 'type', 'block']], on='image', how='left')
subject image present type block
0 7 acorn 0 NaN NaN
1 7 chainsaw 0 NaN NaN
2 7 doorlock 1 target 1.0
3 7 stroller 1 dist3 1.0
4 7 bathtub 0 NaN NaN
5 7 clownfish 1 dist1 2.0
6 7 bagtie 0 NaN NaN
7 7 birdie 0 NaN NaN
8 7 witchhat 0 NaN NaN
9 7 crocodile 1 dist1 4.0
10 7 honeybee 0 NaN NaN
11 7 electricitymeter 0 NaN NaN
12 7 flowerwreath 0 NaN NaN
13 7 jar03 1 dist2 1.0
14 7 camera02a 0 NaN NaN
As for the missing values, I would keep them as NaN. pandas is pretty powerful in terms of working with missing data, so may as well take advantage of this.
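For reference, here is a self-contained sketch of the melt-and-merge approach on a cut-down version of the sample data, followed by a groupby that keeps the unmatched rows. The frame names long1, result and counts are illustrative, and the groupby step is only an assumption about what the later analysis might look like; dropna=False requires pandas 1.1 or later.

```python
import pandas as pd

# Cut-down version of the sample frames from the question
df1 = pd.DataFrame({
    "subject": [7, 7],
    "block": [1, 2],
    "target": ["doorlock", "glassescase"],
    "dist1": ["candleholder01", "clownfish"],
    "dist2": ["jar03", "kangaroo"],
    "dist3": ["stroller", "ram"],
})
df2 = pd.DataFrame({
    "subject": [7, 7, 7, 7],
    "image": ["acorn", "doorlock", "stroller", "clownfish"],
})

# Long format: one row per (subject, block, type, image)
long1 = df1.melt(id_vars=["subject", "block"], var_name="type", value_name="image")

# Flag matches, then pull in type and block with a left merge
df2["present"] = df2["image"].isin(long1["image"]).astype(int)
result = df2.merge(long1[["image", "type", "block"]], on="image", how="left")

# Unmatched rows have NaN in "type"; dropna=False keeps them as their own group
counts = result.groupby("type", dropna=False).size()
print(result)
print(counts)
```

By default groupby() drops rows whose key is NaN, which may be the problem described in the question; passing dropna=False keeps them.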

Related

How to add value to specific index that is out of bounds

I have a list array
list = [[0, 1, 2, 3, 4, 5],[0],[1],[2],[3],[4],[5]]
Say I add [6, 7, 8] to the first row as the header for my three new columns, what's the best way to add values in these new columns, without getting index out of bounds? I've tried first filling all three columns with "" but when I add a value, it then pushes the "" out to the right and increases my list size.
Would it be any easier to use a Pandas dataframe? Are you allowed "gaps" in a Pandas dataframe?
Going by the OP's comment, I think a pandas DataFrame is the more appropriate solution. You cannot have 'gaps', but you can have NaN values, like this:
import numpy as np
import pandas as pd
# create sample data
a = np.arange(1, 6)
df = pd.DataFrame(zip(*[a]*5))
print(df)
output:
0 1 2 3 4
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 5 5 5 5 5
for adding empty columns:
# add new columns, not empty but filled w/ nan
df[5] = df[6] = df[7] = float('nan')
# fill a single value in column 7, row index 4 (.loc avoids chained assignment)
df.loc[4, 7] = 123
print(df)
output:
0 1 2 3 4 5 6 7
0 1 1 1 1 1 NaN NaN NaN
1 2 2 2 2 2 NaN NaN NaN
2 3 3 3 3 3 NaN NaN NaN
3 4 4 4 4 4 NaN NaN NaN
4 5 5 5 5 5 NaN NaN 123.0
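As a possible alternative, the same result can be sketched with reindex, which adds any missing column labels in one step and fills them with NaN. The sample frame is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Same 5x5 sample frame as above: row i holds the value i + 1 in every column
df = pd.DataFrame(np.arange(1, 6).repeat(5).reshape(5, 5))

# reindex adds the missing column labels 5, 6 and 7, filled with NaN
df = df.reindex(columns=range(8))

# .loc[row label, column label] sets a single cell
df.loc[4, 7] = 123
print(df)
```

Because every cell already exists (as NaN), assigning by label never goes out of bounds and never shifts other values.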

Groupby count of non NaN of another column and a specific calculation of the same columns in pandas

I have a data frame as shown below
ID Class Score1 Score2 Name
1 A 9 7 Xavi
2 B 7 8 Alba
3 A 10 8 Messi
4 A 8 10 Neymar
5 A 7 8 Mbappe
6 C 4 6 Silva
7 C 3 2 Pique
8 B 5 7 Ramos
9 B 6 7 Serge
10 C 8 5 Ayala
11 A NaN 4 Casilas
12 A NaN 4 De_Gea
13 B NaN 2 Seaman
14 C NaN 7 Chilavert
15 B NaN 3 Courtous
From the above, I would like to calculate the number of players with Score1 less than or equal to 6 in each Class, along with the count of non-NaN rows (Class-wise).
Expected output:
Class Total_Number Count_Non_NaN Score1_less_than_6_# Avg_score1
A 6 4 0 8.5
B 5 3 2 6
C 4 3 2 5
I tried the code below:
df2 = df.groupby('Class').agg(Total_Number = ('Score1','size'),
Score1_less_than_6 = ('Score1',lambda x: x.between(0,6).sum()),
Avg_score1 = ('Score1','mean'))
df2 = df2.reset_index()
df2
Groupby and aggregate using a dictionary
df['s'] = df['Score1'].le(6)
df.groupby('Class').agg(**{'total_number': ('Score1', 'size'),
'count_non_nan': ('Score1', 'count'),
'score1_less_than_six': ('s', 'sum'),
'avg_score1': ('Score1', 'mean')})
total_number count_non_nan score1_less_than_six avg_score1
Class
A 6 4 0 8.5
B 5 3 2 6.0
C 4 3 2 5.0
Try:
x = df.groupby("Class", as_index=False).agg(
Total_Number=("Class", "count"),
Count_Non_NaN=("Score1", lambda x: x.notna().sum()),
Score1_less_than_6=("Score1", lambda x: (x <= 6).sum()),
Avg_score1=("Score1", "mean"),
)
print(x)
Prints:
Class Total_Number Count_Non_NaN Score1_less_than_6 Avg_score1
0 A 6 4.0 0.0 8.5
1 B 5 3.0 2.0 6.0
2 C 4 3.0 2.0 5.0
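The difference between 'size' and 'count' is what makes both answers work: 'size' counts every row in the group, while 'count' skips NaN. A minimal self-contained sketch on the Class and Score1 columns from the question (the column name Score1_le_6 and the frame name summary are illustrative):

```python
import numpy as np
import pandas as pd

# Class and Score1 columns from the question, in ID order
df = pd.DataFrame({
    "Class": list("ABAAACCBBCAABCB"),
    "Score1": [9, 7, 10, 8, 7, 4, 3, 5, 6, 8] + [np.nan] * 5,
})

summary = df.groupby("Class").agg(
    Total_Number=("Score1", "size"),    # all rows, NaN included
    Count_Non_NaN=("Score1", "count"),  # NaN excluded
    Score1_le_6=("Score1", lambda s: int(s.le(6).sum())),  # NaN <= 6 is False
    Avg_score1=("Score1", "mean"),      # NaN skipped
)
print(summary)
```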

For and if loop combination takes lot of time in Pandas (Data manipulation)

I have two datasets, each with about half a million observations. I wrote the code below, but it never seems to finish executing. I would like to know if there is a better way of doing it. I'd appreciate any input.
Below are sample formats of my dataframes. Both dataframes share a set of 'sid' values , meaning all the 'sid' values in 'df2' will have a match in 'df1' 'sid' values. The 'tid' values and consequently the 'rid' values (which are a combination of 'sid' and 'tid' values) may not appear in both sets.
The task is simple. I would like to create the 'tv' column in df2. Wherever the 'rid' in df2 matches with the 'rid' in 'df1', the 'tv' column in df2 takes the corresponding 'tv' value from df1. If it does not match, the 'tv' value in 'df2' will be the median 'tv' value for the matching 'sid' subset in 'df1'.
In fact my original task includes creating a few more similar columns like 'tv' in df2 (based on their values in 'df1' ; these columns exist in 'df1').
I believe as my code contains for loop combined with if else statement and multiple value assignment statements, it is taking forever to execute. Appreciate any inputs.
df1
sid tid rid tv
0 0 0 0-0 9
1 0 1 0-1 8
2 0 3 0-3 4
3 1 5 1-5 2
4 1 7 1-7 3
5 1 9 1-9 14
6 1 10 1-10 24
7 1 11 1-11 13
8 2 14 2-14 2
9 2 16 2-16 5
10 3 17 3-17 6
11 3 18 3-18 8
12 3 20 3-20 5
13 3 21 3-21 11
14 4 23 4-23 6
df2
sid tid rid
0 0 0 0-0
1 0 2 0-2
2 1 3 1-3
3 1 6 1-6
4 1 9 1-9
5 2 10 2-10
6 2 12 2-12
7 3 1 3-1
8 3 15 3-15
9 3 1 3-1
10 4 19 4-19
11 4 22 4-22
rids = [rid.split('-') for rid in df1.rid]
for r in df2.rid:
    s,t = r.split('-')
    if [s,t] in rids:
        df2.loc[df2.rid== r,'tv'] = df1.loc[df1.rid == r,'tv']
    else:
        df2.loc[df2.rid== r,'tv'] = df1.loc[df1.sid == int(s),'tv'].median()
The expected df2 shall be as follows:
sid tid rid tv
0 0 0 0-0 9.0
1 0 2 0-2 8.0
2 1 3 1-3 13.0
3 1 6 1-6 13.0
4 1 9 1-9 14.0
5 2 10 2-10 3.5
6 2 12 2-12 3.5
7 3 1 3-1 7.0
8 3 15 3-15 7.0
9 3 1 3-1 7.0
10 4 19 4-19 6.0
11 4 22 4-22 6.0
You can left merge df2 with a subset of df1 on 'rid' (only the 'tv' column is needed from df1), then fill the unmatched rows with the per-'sid' median. This assumes df2 does not already have a 'tv' column, as in the sample; otherwise the merge produces 'tv_x' and 'tv_y' instead of 'tv':
out = df2.merge(df1[['rid', 'tv']], on='rid', how='left')
out['tv'] = out['tv'].fillna(out['sid'].map(df1.groupby('sid')['tv'].median()))
out
OR
Since you said that:
all the 'sid' values in 'df2' will have a match in 'df1' 'sid' values
So you can also left merge them on ['sid','rid'] and then fill the missing 'tv' values with the per-'sid' median of df1's 'tv' column, mapped onto df2 with map(). Again, this assumes df2 has no 'tv' column yet:
out = df2.merge(df1[['sid', 'rid', 'tv']], on=['sid', 'rid'], how='left')
out['tv'] = out['tv'].fillna(out['sid'].map(df1.groupby('sid')['tv'].median()))
out
output of out:
sid tid rid tv
0 0 0 0-0 9.0
1 0 2 0-2 8.0
2 1 3 1-3 13.0
3 1 6 1-6 13.0
4 1 9 1-9 14.0
5 2 10 2-10 3.5
6 2 12 2-12 3.5
7 3 1 3-1 7.0
8 3 15 3-15 7.0
9 3 1 3-1 7.0
10 4 19 4-19 6.0
11 4 22 4-22 6.0
Here is a suggestion without any loops, based on dictionaries:
# tv values keyed by rid
matching_values = dict(zip(df1['rid'], df1['tv']))
# median tv values keyed by sid
median_values = df1.groupby('sid')['tv'].median().to_dict()
# look up tv by rid, then fall back to the median for the row's sid
df2['tv'] = df2['rid'].map(matching_values).fillna(df2['sid'].map(median_values))
This should do the trick. The logic is that we first build two dictionaries: one with the "rid" values as keys and the matching "tv" values as values, and one with the "sid" values as keys and the median "tv" values as values. We then look up each row of df2 by "rid" and, where there is no match, fall back to the median for its "sid".
Don't use for loops in pandas; they are known to be slow, and you don't get to benefit from the internal optimizations that have been made.
Try to use the split-apply-combine pattern:
1. Split df1 by sid to calculate the median: df1.groupby('sid')['tv'].median()
2. Join df2 with the tv column of df1: df2.join(df1.set_index('rid')['tv'], on='rid')
3. Fill the NaN values with the median calculated in step 1.
(Haven't tested the code.)
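The three steps can be sketched as follows on a small subset of the sample data. The names sid_median and out are illustrative, and only the 'tv' column of df1 is joined so that the shared 'sid' and 'tid' columns do not clash.

```python
import pandas as pd

# Small subset of the sample frames from the question
df1 = pd.DataFrame({
    "sid": [0, 0, 0, 1, 1, 1],
    "tid": [0, 1, 3, 5, 7, 9],
    "tv": [9, 8, 4, 2, 3, 14],
})
df1["rid"] = df1["sid"].astype(str) + "-" + df1["tid"].astype(str)

df2 = pd.DataFrame({"sid": [0, 0, 1, 1], "tid": [0, 2, 3, 9]})
df2["rid"] = df2["sid"].astype(str) + "-" + df2["tid"].astype(str)

# Step 1: median tv per sid
sid_median = df1.groupby("sid")["tv"].median()

# Step 2: bring in tv for matching rid values (NaN where there is no match)
out = df2.join(df1.set_index("rid")["tv"], on="rid")

# Step 3: fill the gaps with the median for the row's sid
out["tv"] = out["tv"].fillna(out["sid"].map(sid_median))
print(out)
```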

Join with column having the max sequence number

I have a margin table
item margin
0 a 3
1 b 4
2 c 5
and an item table
item sequence
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 c 1
6 c 2
7 c 3
I want to join the two table so that the margin will only be joined to the product with maximum sequence number, the desired outcome is
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
How to achieve this?
Below is the code for margin and item table
import pandas as pd
df_margin=pd.DataFrame({"item":["a","b","c"],"margin":[3,4,5]})
df_item=pd.DataFrame({"item":["a","a","a","b","b","c","c","c"],"sequence":[1,2,3,1,2,1,2,3]})
One option would be to merge then replace extra values with NaN via Series.where:
new_df = df_item.merge(df_margin)
new_df['margin'] = new_df['margin'].where(
new_df.groupby('item')['sequence'].transform('max').eq(new_df['sequence'])
)
Or with loc (np here is NumPy, imported with import numpy as np):
new_df = df_item.merge(df_margin)
new_df.loc[new_df.groupby('item')['sequence']
           .transform('max').ne(new_df['sequence']), 'margin'] = np.nan
Another option would be to assign a temporary column to both frames (in df_item it is True where the sequence value is the maximum for its item, in df_margin it is True everywhere), then do an outer merge and drop the temporary column:
new_df = (
    df_item.assign(
        t=df_item
        .groupby('item')['sequence']
        .transform('max')
        .eq(df_item['sequence'])
    ).merge(df_margin.assign(t=True), how='outer').drop(columns='t')
)
All of these produce new_df:
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
You could do:
df_item.merge(df_item.groupby('item')['sequence'].max().\
reset_index().merge(df_margin), 'left')
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
Breakdown:
df_new = df_item.groupby('item')['sequence'].max().reset_index().merge(df_margin)
df_item.merge(df_new, 'left')
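Put together with the sample frames from the question, the breakdown can be run as the following sketch (result is an illustrative name). The second merge uses both shared columns, 'item' and 'sequence', as keys, so only the row with the maximum sequence for each item picks up a margin.

```python
import pandas as pd

df_margin = pd.DataFrame({"item": ["a", "b", "c"], "margin": [3, 4, 5]})
df_item = pd.DataFrame({
    "item": ["a", "a", "a", "b", "b", "c", "c", "c"],
    "sequence": [1, 2, 3, 1, 2, 1, 2, 3],
})

# One row per item with its maximum sequence, plus the margin for that item
df_new = df_item.groupby("item")["sequence"].max().reset_index().merge(df_margin)

# Left merge on the shared columns ('item' and 'sequence')
result = df_item.merge(df_new, how="left")
print(result)
```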

Display row with False values in validated pandas dataframe column [duplicate]

I was validating 'Price' column in my dataframe. Sample:
ArticleId SiteId ZoneId Date Quantity Price CostPrice
53 194516 9 2 2018-11-26 11.0 40.64 27.73
164 200838 9 2 2018-11-13 5.0 99.75 87.24
373 200838 9 2 2018-11-27 1.0 99.75 87.34
pd.to_numeric(df_sales['Price'], errors='coerce').notna().value_counts()
And I'd love to display those rows with False values so I know what's wrong with them. How do I do that?
True 17984
False 13
Name: Price, dtype: int64
Thank you.
You could print your rows when price isnull():
print(df_sales[df_sales['Price'].isnull()])
ArticleId SiteId ZoneId Date Quantity Price CostPrice
1 200838 9 2 2018-11-13 5 NaN 87.240
pd.to_numeric(df['Price'], errors='coerce').isna() returns a Boolean Series, which can be used to select the rows that cause errors.
This includes rows with NaN as well as rows with strings that cannot be parsed as numbers.
import pandas as pd
# test data
df = pd.DataFrame({'Price': ['40.64', '99.75', '99.75', pd.NA, 'test', '99. 0', '98 0']})
Price
0 40.64
1 99.75
2 99.75
3 <NA>
4 test
5 99. 0
6 98 0
# find the value of the rows that are causing issues
problem_rows = df[pd.to_numeric(df['Price'], errors='coerce').isna()]
# display(problem_rows)
Price
3 <NA>
4 test
5 99. 0
6 98 0
Alternative
Create an extra column and then use it to select the problem rows
df['Price_Updated'] = pd.to_numeric(df['Price'], errors='coerce')
Price Price_Updated
0 40.64 40.64
1 99.75 99.75
2 99.75 99.75
3 <NA> NaN
4 test NaN
5 99. 0 NaN
6 98 0 NaN
# find the problem rows
problem_rows = df.Price[df.Price_Updated.isna()]
Explanation
Updating the column with .to_numeric(), and then checking for NaNs will not tell you why the rows had to be coerced.
# update the Price row
df.Price = pd.to_numeric(df['Price'], errors='coerce')
# check for NaN
problem_rows = df.Price[df.Price.isnull()]
# display(problem_rows)
3 NaN
4 NaN
5 NaN
6 NaN
Name: Price, dtype: float64
