pandas merge with condition - python-3.x

I have two dataframe and want to merge it based on max of another column
df1:
C2
A
B
C
df2:
C1 C2 val
X A 100
Y A 50.5
Z A 60
E B 90
F B 45
G C 100
I tried,
df3 = df1.merge(df2, on='C2', how='inner')['val'].max()
I get the error, AttributeError: 'numpy.float64' object has no attribute 'head'
The val column has only numbers. How should I modify this and Why do I encounter this error ?
The expected output is:
df3:
C2 C1 val
A X 100
B E 90
C G 100
Thanks in advance.

I think you need merge by left join:
df3 = df2.merge(df1, on='C2', how='left')
And then groupby with idxmax for indices of max values per groups and select rows by loc:
df3 = df3.loc[df3.groupby('C2')['val'].idxmax()]
Or use sort_values with drop_duplicates:
df3 = df3.sort_values(['C2', 'val']).drop_duplicates('C2', keep='last')
print (df3)
C1 C2 val
0 X A 100.0
3 E B 90.0
5 G C 100.0
Why do I encounter this error ?
Problem is you get scalar - max value of column val:
df3 = df1.merge(df2, on='C2', how='inner')['val'].max()
print (df3)
100.0
So if use print (df3.head()) it failed.

Related

Concat values on dataframe columns excluding NaN's

I have a dataframe with n store columns, here I'm just showing the first 2:
ref_id store_0 store_1
0 100 c b
1 300 d NaN
I want a way to concat only the non-NaN values from store columns into a new column adding a comma between each value, and finally drop those columns. Desired output is:
ref_id stores
0 100 c,b
1 300 d
Right now I've tried df['stores'] = df['store_0'] + ',' + df['store_1'] with this result:
ref_id store_0 store_1 stores
0 100 c b c,b
1 300 d NaN NaN
You can use:
cols = df.filter(like='store_').columns
df2 = (df
.drop(columns=cols)
.assign(stores=df[cols].agg(lambda s: s.dropna()
.str.cat(sep=','),
axis=1))
)
Or, for in place modification:
cols = df.filter(like='store_').columns
df['stores'] = df[cols].agg(lambda s: s.dropna().str.cat(sep=','), axis=1)
df.drop(columns=cols, inplace=True)
Output:
ref_id stores
0 100 c,b
1 300 d
You can try
df_ = df.filter(like='store')
df = (df.assign(store=df_.apply(lambda row : row.str.cat(sep=','), axis=1))
.drop(df_.columns, axis=1))
print(df)
ref_id store
0 100 c,b
1 300 d
Try with
df['store'] = df.filter(like = 'store').apply(lambda x : ','.join(x[x==x]),1)
df
Out[60]:
ref_id store_0 store_1 store
0 100 c b c,b
1 300 d NaN d

Finding intervals in pandas dataframe based on values in another dataframe

I have two data frames. One dataframe (A) looks like:
Name. gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
The other dataframe (B) looks like
ID_sim. position string
1 89 aa
4 568 bb
5 938437 cc
I want to accomplish two tasks here:
I want to get a list of indices for rows (from dataframe B) for which position column falls in the interval (specified by start_coordinate and end_coordinate column) in dataframe A.
The result for this task will be:
lst = [0,1]. ### because row 0 of B falls in interval of row 1 in A and row 1 of B falls in interval of row 3 of A.
The indices that I get from task 1, I want to keep it from dataframe B to create a new dataframe. Thus, the new dataframe will look like:
position string
89 aa
568 bb
I used .between() to accomplish this task. The code is as follows:
lst=dfB[dfB['position'].between(dfA.loc[0,'start_coordinate'],dfA.loc[len(dfA)-1,'end_coordinate'])].index.tolist()
result=dfB[dfB.index.isin(lst)]
result.shape
However, when I run this piece of code I get the following error:
KeyError: 0
What could possibly be raising this error? And how can I solve this?
We can try numpy broadcasting here
s, e = dfA[['start_coordinate', 'end_coordinate']].to_numpy().T
p = dfB['position'].to_numpy()[:, None]
dfB[((p >= s) & (p <= e)).any(1)]
ID_sim. position string
0 1 89 aa
1 4 568 bb
You could use Pandas IntervalIndex to get the positions, and afterwards, use a boolean to pull the relevant rows from B :
Create IntervalIndex:
intervals = pd.IntervalIndex.from_tuples([*zip(A['start_coordinate'],
A['end_coordinate'])
],
closed='both')
Get indexers for B.position, create a boolean array with the values and filter B:
# get_indexer returns -1 if an index is not found.
B.loc[intervals.get_indexer(B.position) >= 0]
Out[140]:
ID_sim. position string
0 1 89 aa
1 4 568 bb
This should work. Less elegant but easier to comprehend.
import pandas as pd
data = [['Name.','gender', 'start_coordinate','end_coordinate','ID'],
['Peter','M',30,150,1],
['Hugo','M',4500,6000,2],
['Jennie','F',300,700,3]]
data2 = [['ID_sim.','position','string'],
['1',89,'aa'],
['4',568,'bb'],
['5',938437,'cc']]
df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data2[1:], columns=data2[0])
merged = pd.merge(df1, df2, left_index=True, right_index=True)
print (merged[(merged['position'] > merged['start_coordinate']) & (merged['position'] < merged['end_coordinate'])])

Sum in Column based on condition in rows in pandas dataframe [duplicate]

I have a dataframe which I want to plot with matplotlib, but the index column is the time and I cannot plot it.
This is the dataframe (df3):
but when I try the following:
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
I'm getting an error obviously:
KeyError: 'YYYY-MO-DD HH-MI-SS_SSS'
So what I want to do is to add a new extra column to my dataframe (named 'Time) which is just a copy of the index column.
How can I do it?
This is the entire code:
#Importing the csv file into df
df = pd.read_csv('university2.csv', sep=";", skiprows=1)
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
#Add Magnetic Magnitude Column
df['magnetic_mag'] = np.sqrt(df['MAGNETIC FIELD X (μT)']**2 + df['MAGNETIC FIELD Y (μT)']**2 + df['MAGNETIC FIELD Z (μT)']**2)
#Subtract Earth's Average Magnetic Field from 'magnetic_mag'
df['magnetic_mag'] = df['magnetic_mag'] - 30
#Copy interesting values
df2 = df[[ 'ATMOSPHERIC PRESSURE (hPa)',
'TEMPERATURE (C)', 'magnetic_mag']].copy()
#Hourly Average and Standard Deviation for interesting values
df3 = df2.resample('H').agg(['mean','std'])
df3.columns = [' '.join(col) for col in df3.columns]
df3.reset_index()
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
Thank you !!
I think you need reset_index:
df3 = df3.reset_index()
Possible solution, but I think inplace is not good practice, check this and this:
df3.reset_index(inplace=True)
But if you need new column, use:
df3['new'] = df3.index
I think you can read_csv better:
df = pd.read_csv('university2.csv',
sep=";",
skiprows=1,
index_col='YYYY-MO-DD HH-MI-SS_SSS',
parse_dates='YYYY-MO-DD HH-MI-SS_SSS') #if doesnt work, use pd.to_datetime
And then omit:
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
EDIT: If MultiIndex or Index is from groupby operation, possible solutions are:
df = pd.DataFrame({'A':list('aaaabbbb'),
'B':list('ccddeeff'),
'C':range(8),
'D':range(4,12)})
print (df)
A B C D
0 a c 0 4
1 a c 1 5
2 a d 2 6
3 a d 3 7
4 b e 4 8
5 b e 5 9
6 b f 6 10
7 b f 7 11
df1 = df.groupby(['A','B']).sum()
print (df1)
C D
A B
a c 1 9
d 5 13
b e 9 17
f 13 21
Add parameter as_index=False:
df2 = df.groupby(['A','B'], as_index=False).sum()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
Or add reset_index:
df2 = df.groupby(['A','B']).sum().reset_index()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
You can directly access in the index and get it plotted, following is an example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
#Get index in horizontal axis
plt.plot(df.index, df[0])
plt.show()
#Get index in vertiacal axis
plt.plot(df[0], df.index)
plt.show()
You can also use eval to achieve this:
In [2]: df = pd.DataFrame({'num': range(5), 'date': pd.date_range('2022-06-30', '2022-07-04')}, index=list('ABCDE'))
In [3]: df
Out[3]:
num date
A 0 2022-06-30
B 1 2022-07-01
C 2 2022-07-02
D 3 2022-07-03
E 4 2022-07-04
In [4]: df.eval('index_copy = index')
Out[4]:
num date index_copy
A 0 2022-06-30 A
B 1 2022-07-01 B
C 2 2022-07-02 C
D 3 2022-07-03 D
E 4 2022-07-04 E

How to convert column into row?

Assuming I have two rows where for most of the columns the values are same, but not for all. I would like to group these two rows into one where ever the values are same and if the values are different then create an extra column and assign the column name as 'column1'
Step 1: Here assuming I have columns which has same value in both the rows 'a','b','c' and columns which has different values are 'd','e','f' so I am grouping using 'a','b','c' and then unstacking 'd','e','f'
Step 2: Then I am dropping the levels then renaming it to 'a','b','c','d','d1','e','e1','f','f1'
But in my actual case I have 500+ columns and million rows, I dont know how to expand this to 500+ columns where I have constrains like
1) I dont know which all columns will have same values
2) And which all columns will have different values that needs to be converted into new column after grouping with the columns that has same value
df.groupby(['a','b','c']) ['d','e','f'].apply(lambda x:pd.DataFrame(x.values)).unstack().reset_index()
df.columns = df.columns.droplevel()
df.columns = ['a','b','c','d','d1','e','e1','f','f1']
To be more clear, the below code creates the sample dataframe & expected output
df = pd.DataFrame({'Cust_id':[100,100, 101,101,102,103,104,104], 'gender':['M', 'M', 'F','F','M','F','F','F'], 'Date':['01/01/2019', '02/01/2019','01/01/2019',
'01/01/2019','03/01/2019','04/01/2019','03/01/2019','03/01/2019'],
'Product': ['a','a','b','c','d','d', 'e','e']})
expected_output = pd.DataFrame({'Cust_id':[100, 101,102,103,104], 'gender':['M', 'F','M','F','F'], 'Date':['01/01/2019','01/01/2019','03/01/2019','04/01/2019', '03/01/2019'], 'Date1': ['02/01/2019', 'NA','NA','NA','NA']
, 'Product': ['a', 'b', 'd', 'd','e'], 'Product1':['NA', 'c','NA','NA','NA' ]})
you may do following to get expected_output from df
s = df.groupby('Cust_id').cumcount().astype(str).replace('0', '')
df1 = df.pivot_table(index=['Cust_id', 'gender'], columns=s, values=['Date', 'Product'], aggfunc='first')
df1.columns = df1.columns.map(''.join)
Out[57]:
Date Date1 Product Product1
Cust_id gender
100 M 01/01/2019 02/01/2019 a a
101 F 01/01/2019 01/01/2019 b c
102 M 03/01/2019 NaN d NaN
103 F 04/01/2019 NaN d NaN
104 F 03/01/2019 03/01/2019 e e
Next, replace columns having duplicated values with NA
df_expected = df1.where(df1.ne(df1.shift(axis=1)), 'NA').reset_index()
Out[72]:
Cust_id gender Date Date1 Product Product1
0 100 M 01/01/2019 02/01/2019 a NA
1 101 F 01/01/2019 NA b c
2 102 M 03/01/2019 NA d NA
3 103 F 04/01/2019 NA d NA
4 104 F 03/01/2019 NA e NA
You can try this code - it could be a little cleaner but I think it does the job
df = pd.DataFrame({'a':[100, 100], 'b':['tue', 'tue'], 'c':['yes', 'yes'],
'd':['ok', 'not ok'], 'e':['ok', 'maybe'], 'f':[55, 66]})
df_transformed = pd.DataFrame()
for column in df.columns:
col_vals = df.groupby(column)['b'].count().index.values
for ix, col_val in enumerate(col_vals):
temp_df = pd.DataFrame({column + str(ix) : [col_val]})
df_transformed = pd.concat([df_transformed, temp_df], axis = 1)
Output for df_transformed

Upsert function in Dataframe - Python

I am trying to update one dataframe with another dataframe with respect to the first column. If there is an extra row in the second dataframe, it should be inserted in the first dataframe. It there is a row with the same data in the first column but different data in the other coulmns, that row should be updated. Also, the row which has no value in the first column should be dropped.
Code used -
df = df_1.combine_first(df_2)\
.reset_index()\
.reindex(columns=df_1.columns)
df = df.drop_duplicates(subset='A', keep= 'last', inplace=False)
df.dropna(subset=['A'])
print ("Final Data")
print (df)
First Dataframe -
A B C
0 45 a b
1 98 c d
2 67 bn k
Second Dataframe -
A B C
0 45 a d
1 98 c d
2 67 bn k
3 90 x z
4
Final should look like -
A B C
0 45 a d
1 98 c d
2 67 bn k
3 90 x z
The final dataframe that I get -
A B C
0 45.0 a b
1 98.0 c d
2 67.0 bn k
3 90.0 x z
4
So, neither the data is getting updated, nor is it removing the row with null values. What am I missing?
Based on my understanding of your question, your second dataframe basically supercedes the first, if there is a matching index. If there isn't, then the difference is added to the first dataframe. I am also assuming that there are no duplicate keys in the first column, A.
Framing this requirement a little differently, the final output should contain all the rows in the second dataframe, as well as the values (since they are meant to overwrite the first dataframe if there's a match). Therefore, we will start off using the second dataframe as it is, and then add back the rows that exist in the first dataframe but not in the second. See the example below. (I'm also using a slightly different first dataframe to highlight the effects)
import pandas as pd
df1 = pd.DataFrame({'A':[45,98,67,91],'B':['a','c','bn','y'],'C':['b','d','k','oo']})
df2 = pd.DataFrame({'A':[45,98,67,90,''],'B':['a','c','bn','x',''],'C':['d','d','k','z','']})
# Remove rows with empty values in first column. This should be whatever conditions applicable to you i.e. checking for np.nan instead of str('')
df2 = df2.loc[df2['A'] != '']
df1.set_index('A', inplace=True)
df2.set_index('A', inplace=True)
# Find keys in dataframe 1 that are not in dataframe 2
idx_diff = df1.index.difference(df2.index)
# Append these rows to dataframe 2
df_ins = df1.loc[idx_diff]
df3 = df2.append(df_ins)
df3.reset_index(inplace=True)
>>>df3
A B C
0 45 a d
1 98 c d
2 67 bn k
3 90 x z
4 91 y oo

Resources