Pandas merge, conditional update in Conflicting Columns - python-3.x

I have two DataFrames, df1 and df2 (see below), which I would like to
merge on one of the common columns
while conditionally updating the other common column.
Sample DataFrames and expected results:
df1:
A B C
0 123 1819. NaN
1 456 NaN 115
2 789 9012. NaN
3 121 8732. NaN
4 883 NaN 171
5 771 8871. 191
# df2:
C B
0 115 41853
1 115 22723
2 115 57302
3 115 91494
4 171 43607
5 171 36327
6 191 39874
7 191 25456
8 191 76283
9 191 97506
merge on column C
# how='left' is necessary
pd.merge(df1, df2, on='C', how='left')
A B_x C B_y
0 123 1819.0 NaN NaN
1 456 NaN 115.0 41853.0
2 456 NaN 115.0 22723.0
3 456 NaN 115.0 57302.0
4 456 NaN 115.0 91494.0
5 789 9012.0 NaN NaN
6 121 8732.0 NaN NaN
7 883 NaN 171.0 43607.0
8 883 NaN 171.0 36327.0
9 771 NaN 191.0 39874.0
10 771 NaN 191.0 25456.0
11 771 NaN 191.0 76283.0
12 771 NaN 191.0 97506.0
Conditionally combine columns B_x and B_y, i.e. replace the NaN values from the left table (B_x) with the non-NaN values from the right table (B_y).
PS: Assume that both B_x and B_y are never simultaneously NaN
The End Result:
A C B
0 123 NaN 1819
1 456 115.0 41853
2 456 115.0 22723
3 456 115.0 57302
4 456 115.0 91494
5 789 NaN 9012
6 121 NaN 8732
7 883 171.0 43607
8 883 171.0 36327
9 771 191.0 39874
10 771 191.0 25456
11 771 191.0 76283
12 771 191.0 97506
I am aware of the function combine_first, but it works only with indices.

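After the merge, use np.where:
import numpy as np
import pandas as pd

df = pd.merge(df1, df2, on='C', how='left')
# Take B_y where B_x is missing, otherwise keep B_x
df['B'] = np.where(df.B_x.isnull(), df.B_y, df.B_x)
df.drop(columns=['B_x', 'B_y'], inplace=True)
df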
Out[136]:
A C B
0 123 NaN 1819.0
1 456 115.0 41853.0
2 456 115.0 22723.0
3 456 115.0 57302.0
4 456 115.0 91494.0
5 789 NaN 9012.0
6 121 NaN 8732.0
7 883 171.0 43607.0
8 883 171.0 36327.0
9 771 191.0 8871.0
10 771 191.0 8871.0
11 771 191.0 8871.0
12 771 191.0 8871.0
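Note that rows 9 to 12 above keep 8871 from df1, while the expected result in the question shows the df2 values for those rows. Which one you get depends on which side is given priority when both B_x and B_y are present. A minimal sketch of both choices, using fillna instead of np.where and rebuilding the sample frames from the question (the names left_first, right_first and result are only illustrative):

```python
import numpy as np
import pandas as pd

# Sample frames as given in the question
df1 = pd.DataFrame({
    "A": [123, 456, 789, 121, 883, 771],
    "B": [1819, np.nan, 9012, 8732, np.nan, 8871],
    "C": [np.nan, 115, np.nan, np.nan, 171, 191],
})
df2 = pd.DataFrame({
    "C": [115] * 4 + [171] * 2 + [191] * 4,
    "B": [41853, 22723, 57302, 91494, 43607,
          36327, 39874, 25456, 76283, 97506],
})

merged = df1.merge(df2, on="C", how="left")

# Left side wins when both are present (same as the np.where answer)
left_first = merged["B_x"].fillna(merged["B_y"])

# Right side wins when both are present (matches the question's expected table)
right_first = merged["B_y"].fillna(merged["B_x"])

result = merged.drop(columns=["B_x", "B_y"]).assign(B=right_first)
print(result)
```

If B_x and B_y are never both present on the same row, the two variants give the same column.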

Related

concat a transposed dataframe to another dataframe and write it to a spreadsheet

I have a data frame that looks like this:
Date Demand Forecast Error [Error] Error% Error^2
1 47
2 70
3 95
4 122 61 51 51 54 23
5 142 86 34 51 54 23
6 155 110 34 51 45 36
7 189 130 54 51 45 86
8 208 152
9 160 174
10 142 176
11 160
12 160
13 160
and a second data frame, df2, that looks like this:
Bias MAPE MAE RMSE_rel
0 0.143709 0.273529 42.285714 43.198692
1 22.952381 0.273529 0.264758 0.270475
df2 would be transposed, with new columns Absulote and Scaled, to look like this:
df2 = df2.set_index('Bias').T
df2.columns = ["Absulote", "Scaled"]
Absulote Scaled
MAPE 0.273529 0.273529
MAE 42.285714 0.264758
RMSE_rel 43.198692 0.270475
and there is no Bias row.
To concatenate both, I do this:
complete_df = pd.concat([df1, df2], axis=0, ignore_index=True)
and I get this result:
Demand Forecast Error Absulote Scaled
0 47.0 NaN NaN NaN NaN
1 70.0 NaN NaN NaN NaN
2 95.0 NaN NaN NaN NaN
3 122.0 70.666667 51.333333 NaN NaN
4 142.0 95.666667 46.333333 NaN NaN
5 155.0 119.666667 35.333333 NaN NaN
6 189.0 139.666667 49.333333 NaN NaN
7 208.0 162.000000 46.000000 NaN NaN
8 160.0 184.000000 -24.000000 NaN NaN
9 142.0 185.666667 -43.666667 NaN NaN
10 NaN 170.000000 NaN NaN NaN
11 NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN
14 NaN NaN NaN 0.273529 0.273529
15 NaN NaN NaN 42.285714 0.264758
16 NaN NaN NaN 43.198692 0.270475
The Bias, MAPE, MAE and RMSE_rel labels are missing from the result.
I want the result to imitate the following:
Is there any way to achieve that?

Fill NaN values based on specific condition in pandas

I have a dataframe as shown below
Date t_factor t1 t2 t3 t_function
2020-02-01 5 4 NaN NaN 4
2020-02-03 23 6 NaN NaN 6
2020-02-06 14 9 NaN NaN 9
2020-02-09 23 NaN NaN NaN 0
2020-02-10 23 NaN NaN NaN 0
2020-02-11 23 NaN NaN NaN 0
2020-02-13 30 NaN 3 NaN 3
2020-02-20 29 NaN 66 NaN 66
2020-02-29 100 NaN 291 NaN 291
2020-03-01 38 NaN NaN NaN 0
2020-03-10 38 NaN NaN NaN 0
2020-03-11 38 NaN NaN 4 4
2020-03-26 70 NaN NaN 4 4
2020-03-29 70 NaN NaN 4 4
In it, I would like to fill the NaN values that come after a non-NaN value with the last non-NaN value of that column.
The columns I want to impute are t1, t2 and t3.
Expected Output
Date t_factor t1 t2 t3 t_function
2020-02-01 5 4 NaN NaN 4
2020-02-03 23 6 NaN NaN 6
2020-02-06 14 9 NaN NaN 9
2020-02-09 23 9 NaN NaN 0
2020-02-10 23 9 NaN NaN 0
2020-02-11 23 9 NaN NaN 0
2020-02-13 30 9 3 NaN 3
2020-02-20 29 9 66 NaN 66
2020-02-29 100 9 291 NaN 291
2020-03-01 38 9 291 NaN 0
2020-03-10 38 9 291 NaN 0
2020-03-11 38 9 291 4 4
2020-03-26 70 9 291 4 4
2020-03-29 70 9 291 4 4
Use ffill:
df[['t1', 't2', 't3']] = df[['t1', 't2', 't3']].ffill()
Result:
Date t_factor t1 t2 t3 t_function
0 2020-02-01 5 4.0 NaN NaN 4
1 2020-02-03 23 6.0 NaN NaN 6
2 2020-02-06 14 9.0 NaN NaN 9
3 2020-02-09 23 9.0 NaN NaN 0
4 2020-02-10 23 9.0 NaN NaN 0
5 2020-02-11 23 9.0 NaN NaN 0
6 2020-02-13 30 9.0 3.0 NaN 3
7 2020-02-20 29 9.0 66.0 NaN 66
8 2020-02-29 100 9.0 291.0 NaN 291
9 2020-03-01 38 9.0 291.0 NaN 0
10 2020-03-10 38 9.0 291.0 NaN 0
11 2020-03-11 38 9.0 291.0 4.0 4
12 2020-03-26 70 9.0 291.0 4.0 4
13 2020-03-29 70 9.0 291.0 4.0 4
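As a quick check of why ffill fits here: it only propagates values forward, so the NaN values before the first valid entry of a column stay NaN, as in the expected output. A small sketch on a shortened, made-up version of the t1 and t2 columns:

```python
import numpy as np
import pandas as pd

# Shortened illustration of the t1/t2 pattern from the question
df = pd.DataFrame({
    "t1": [4, 6, 9, np.nan, np.nan],
    "t2": [np.nan, np.nan, np.nan, 3, np.nan],
})

filled = df[["t1", "t2"]].ffill()
print(filled)

# Leading NaN values in t2 are untouched; trailing ones take the last value
leading_still_nan = filled["t2"].iloc[:3].isna().all()
```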
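We can define a function for that:
def improve(values):
    # Replace each NaN with the value just before it (a manual forward fill)
    values = list(values)
    for i in range(1, len(values)):
        if pd.isnull(values[i]):
            values[i] = values[i - 1]
    return values
I hope you get the basic idea.
Now you can call it on the whole column (Series.apply would pass one element at a time, so it is not used here):
df['t1'] = improve(df['t1'])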
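Here is how I would do it:
def fill_na(col):
    # Index label of the last non-NaN value in the column
    ind = df[col].last_valid_index()
    # Assign through .loc so that df itself is modified, not a temporary copy;
    # ind + 1 assumes the default integer index
    df.loc[ind + 1:, col] = df.loc[ind, col]
fill_na('t1')
fill_na('t2')
fill_na('t3')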

Create a specific column by looping over the user defined dictionary in pandas

I have a df as shown below.
Date t_factor
2020-02-01 5
2020-02-03 23
2020-02-06 14
2020-02-09 23
2020-02-10 23
2020-02-11 23
2020-02-13 30
2020-02-20 29
2020-02-29 100
2020-03-01 38
2020-03-10 38
2020-03-11 38
2020-03-26 70
2020-03-29 70
From that I would like to create a function that will calculate the column called t_function based on the calculated values t1, t2 and t3.
where input parameters are stored in a dictionary as shown below.
d1 = {'b1': {'s': '2020-02-01', 'e':'2020-02-06', 'coef':[3, 1, 0]},
'b2': {'s': '2020-02-13', 'e':'2020-02-29', 'coef':[2, 0, 1]},
'b3': {'s': '2020-03-11', 'e':'2020-03-29', 'coef':[4, 0, 0]}}
Expected output:
Date t_factor t1 t2 t3 t_function
2020-02-01 5 4 NaN NaN 4
2020-02-03 23 6 NaN NaN 6
2020-02-06 14 9 NaN NaN 9
2020-02-09 23 NaN NaN NaN 0
2020-02-10 23 NaN NaN NaN 0
2020-02-11 23 NaN NaN NaN 0
2020-02-13 30 NaN 3 NaN 3
2020-02-20 29 NaN 66 NaN 66
2020-02-29 100 NaN 291 NaN 291
2020-03-01 38 NaN NaN NaN 0
2020-03-10 38 NaN NaN NaN 0
2020-03-11 38 NaN NaN 4 4
2020-03-26 70 NaN NaN 4 4
2020-03-29 70 NaN NaN 4 4
I tried the code below:
from datetime import datetime
import numpy as np

def fun(x, start="2020-02-01", end="2020-02-06", a0=3, a1=1, a2=0):
    start = datetime.strptime(start, "%Y-%m-%d")
    end = datetime.strptime(end, "%Y-%m-%d")
    if start <= x.Date <= end:
        # Days elapsed since start, counting the start day as 1
        t2 = (x.Date - start)/np.timedelta64(1, 'D') + 1
        diff = a0 + a1*t2 + a2*(t2)**2
    else:
        diff = np.nan  # np.NaN was removed in NumPy 2.0
    return diff
df["t1"] = df.apply(lambda x: fun(x), axis=1)
df["t2"] = df.apply(lambda x: fun(x, "2020-02-13", "2020-02-29", 2, 0, 1), axis=1)
df["t3"] = df.apply(lambda x: fun(x, "2020-03-11", "2020-03-29", 4, 0, 0), axis=1)
df["t_function"] = df['t1'].fillna(0) + df['t2'].fillna(0) + df['t3'].fillna(0)
I would like to change the above code so that it loops over the dictionary d1.
Note:
The dictionary d1 may have more than three keys such as 'b1', 'b2', 'b3', 'b4' then we have to create t1, t2, t3 and t4 columns. I would like to automate this with looping over the dictionary d1:
I would propose that you store the data as a list of tuples. Like so,
params = [('2020-02-01', '2020-02-06', 3, 1, 0),
('2020-02-13', '2020-02-29', 2, 0, 1),
('2020-03-11', '2020-03-29', 4, 0, 0)]
Now all you need is to loop over params and add the columns to your dataframe df.
total = None
for i, param in enumerate(params):
    s, e, a0, a1, a2 = param
    df[f"t{i+1}"] = df.apply(lambda x: fun(x, s, e, a0, a1, a2), axis=1)
    if i==0:
        total = df[f"t{i+1}"].fillna(0)
    else:
        total += df[f"t{i+1}"].fillna(0)
df["t_function"] = total
This gives the desired output:
Date t_factor t1 t2 t3 t_function
0 2020-02-01 5 4.0 NaN NaN 4.0
1 2020-02-03 23 6.0 NaN NaN 6.0
2 2020-02-06 14 9.0 NaN NaN 9.0
3 2020-02-09 23 NaN NaN NaN 0.0
4 2020-02-10 23 NaN NaN NaN 0.0
5 2020-02-11 23 NaN NaN NaN 0.0
6 2020-02-13 30 NaN 3.0 NaN 3.0
7 2020-02-20 29 NaN 66.0 NaN 66.0
8 2020-02-29 100 NaN 291.0 NaN 291.0
9 2020-03-01 38 NaN NaN NaN 0.0
10 2020-03-10 38 NaN NaN NaN 0.0
11 2020-03-11 38 NaN NaN 4.0 4.0
12 2020-03-26 70 NaN NaN 4.0 4.0
13 2020-03-29 70 NaN NaN 4.0 4.0
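If you would rather keep the dictionary d1 as it is, the same loop can run over d1.values() directly. The following is only a sketch under a few assumptions: Date is already a datetime column, the keys of d1 are in the order b1, b2, ..., and the helper name add_t_columns is made up for this example. It also replaces the row-wise apply with column arithmetic.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-02-01", "2020-02-03", "2020-02-06", "2020-02-09", "2020-02-10",
        "2020-02-11", "2020-02-13", "2020-02-20", "2020-02-29", "2020-03-01",
        "2020-03-10", "2020-03-11", "2020-03-26", "2020-03-29",
    ]),
    "t_factor": [5, 23, 14, 23, 23, 23, 30, 29, 100, 38, 38, 38, 70, 70],
})

d1 = {'b1': {'s': '2020-02-01', 'e': '2020-02-06', 'coef': [3, 1, 0]},
      'b2': {'s': '2020-02-13', 'e': '2020-02-29', 'coef': [2, 0, 1]},
      'b3': {'s': '2020-03-11', 'e': '2020-03-29', 'coef': [4, 0, 0]}}

def add_t_columns(frame, spec):
    """Add one t<i> column per entry of spec, plus their NaN-safe sum."""
    out = frame.copy()
    total = pd.Series(0.0, index=out.index)
    for i, p in enumerate(spec.values(), start=1):
        start, end = pd.Timestamp(p['s']), pd.Timestamp(p['e'])
        a0, a1, a2 = p['coef']
        # Day number within the window, counting the start day as 1
        t = (out["Date"] - start).dt.days + 1
        inside = out["Date"].between(start, end)
        # Keep the polynomial only inside [start, end]; NaN elsewhere
        out[f"t{i}"] = (a0 + a1 * t + a2 * t**2).where(inside)
        total += out[f"t{i}"].fillna(0)
    out["t_function"] = total
    return out

result = add_t_columns(df, d1)
print(result)
```

A fourth key such as 'b4' would produce a t4 column without any change to the function.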

Dropna By Column by levels in multiindex and swap for non-na values

I am trying to do some transformations and am kind of stuck. Hopefully somebody can help me out here.
l0 a b c d e f
l1 1 2 1 2 1 2 1 2 1 2 1 2
0 NaN NaN NaN NaN 93.4 NaN NaN NaN NaN NaN 19.0 28.9
1 NaN 9.0 NaN NaN 43.5 32.0 NaN NaN NaN NaN NaN 3.4
2 NaN 5.0 NaN NaN 93.3 83.6 NaN NaN NaN NaN 59.5 28.2
3 NaN 19.6 NaN NaN 72.8 47.4 NaN NaN NaN NaN 31.5 67.2
4 NaN NaN NaN NaN NaN 62.5 NaN NaN NaN NaN NaN 1.8
I have a dataframe (shown above) with MultiIndex columns and, as you can see, many NaN values. Selecting the columns along level = 0 (i.e. l0),
I would like to drop the entire column group if all of its values are NaN. So, in this case, the columns
l0 = ['b', 'd', 'e'] # drop-cols
should be dropped from the DataFrame:
l0 a c f
l1 1 2 1 2 1 2
0 NaN NaN 93.4 NaN 19.0 28.9
1 NaN 9.0 43.5 32.0 NaN 3.4
2 NaN 5.0 93.3 83.6 59.5 28.2
3 NaN 19.6 72.8 47.4 31.5 67.2
4 NaN NaN NaN 62.5 NaN 1.8
This gives me the dataframe shown above. I would then like to slide values along the rows if all the entries before them are null (or swap values between adjacent columns). E.g. looking at index = 0, i.e. the first row:
l0 a c f
l1 1 2 1 2 1 2
0 NaN NaN 93.4 NaN 19.0 28.9
Since all the values in col a are null,
I would like to slide / swap values first between col a and col c,
and then reciprocate the same for the columns on the right side, i.e. replace the entries in col c with those of col f and make all entries in col f NaN, giving me
l0 a c f
l1 1 2 1 2 1 2
0 93.4 NaN 19.0 28.9 NaN NaN
This is really to save memory for processing and storing information, as interchanging the labels ['a', 'b', 'c'...] does not change the meaning of the data.
EDIT: Any ideas for (2)?
I have managed to solve (1) with the following code:
for c in df.columns.get_level_values(0).unique():
    if df[c].isna().all().all():
        df = df.drop(columns=[c])
df
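You can do this with all:
# all(level=...) was removed in pandas 2.0, so group the per-column result by level 0 instead
s = df.isnull().all().groupby(level=0).all()
df.drop(s.index[s], axis=1, level=0)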
Out[55]:
a c f
1 2 1 2 1 2
l1
0 NaN NaN 93.4 NaN 19.0 28.9
1 NaN 9.0 43.5 32.0 NaN 3.4
2 NaN 5.0 93.3 83.6 59.5 28.2
3 NaN 19.6 72.8 47.4 31.5 67.2
4 NaN NaN NaN 62.5 NaN 1.8
groupby and filter
df.groupby(axis=1, level=0).filter(lambda d: ~d.isna().all().all())
a c f
1 2 1 2 1 2
0 NaN NaN 93.4 NaN 19.0 28.9
1 NaN 9.0 43.5 32.0 NaN 3.4
2 NaN 5.0 93.3 83.6 59.5 28.2
3 NaN 19.6 72.8 47.4 31.5 67.2
4 NaN NaN NaN 62.5 NaN 1.8
A little bit shorter
df.groupby(axis=1, level=0).filter(lambda d: ~np.all(d.isna()))
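For part (2), sliding the values to the left, one possible reading is: within each row, every level-0 block that is entirely NaN is moved to the right, and the remaining blocks keep their order. The sketch below follows that reading only. It assumes every level-0 group has the same number of sub-columns and that the columns of a group sit next to each other; the helper name slide_blocks_left is made up for this example.

```python
import numpy as np
import pandas as pd

nan = np.nan
columns = pd.MultiIndex.from_product([["a", "c", "f"], [1, 2]],
                                     names=["l0", "l1"])
# Frame after the all-NaN groups b, d and e have been dropped
df = pd.DataFrame(
    [[nan, nan, 93.4, nan, 19.0, 28.9],
     [nan, 9.0, 43.5, 32.0, nan, 3.4],
     [nan, 5.0, 93.3, 83.6, 59.5, 28.2],
     [nan, 19.6, 72.8, 47.4, 31.5, 67.2],
     [nan, nan, nan, 62.5, nan, 1.8]],
    columns=columns,
)

def slide_blocks_left(frame):
    """Per row, move level-0 blocks that are entirely NaN to the right."""
    n_groups = frame.columns.get_level_values(0).nunique()
    width = frame.shape[1] // n_groups
    # Shape (rows, groups, sub-columns)
    blocks = frame.to_numpy(dtype=float).reshape(len(frame), n_groups, width)
    empty = np.isnan(blocks).all(axis=2)
    # A stable sort puts non-empty blocks first and keeps their relative order
    order = np.argsort(empty, axis=1, kind="stable")
    moved = np.take_along_axis(blocks, order[:, :, None], axis=1)
    return pd.DataFrame(moved.reshape(len(frame), -1),
                        index=frame.index, columns=frame.columns)

slid = slide_blocks_left(df)
print(slid)
```

Rows in which group a already holds a value (rows 1 to 3 here) are left unchanged.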

Element-wise division by rows between dataframe and series

I've just started with pandas some weeks ago and now I am trying to perform an element-wise division on rows, but couldn't figure out the proper way to achieve it. Here is my case and data
date type id ... 1096 1097 1098
0 2014-06-13 cal 1 ... 17.949524 16.247619 15.465079
1 2014-06-13 cow 32 ... 0.523429 -0.854286 -1.520952
2 2014-06-13 cow 47 ... 7.676000 6.521714 5.892381
3 2014-06-13 cow 107 ... 4.161714 3.048571 2.419048
4 2014-06-13 cow 137 ... 3.781143 2.557143 1.931429
5 2014-06-13 cow 255 ... 3.847273 2.509091 1.804329
6 2014-06-13 cow 609 ... 6.097714 4.837714 4.249524
7 2014-06-13 cow 721 ... 3.653143 2.358286 1.633333
8 2014-06-13 cow 817 ... 6.044571 4.934286 4.373333
9 2014-06-13 cow 837 ... 9.649714 8.511429 7.884762
10 2014-06-13 cow 980 ... 1.817143 0.536571 -0.102857
11 2014-06-13 cow 1730 ... 8.512571 7.114286 6.319048
12 2014-06-13 dark 1 ... 168.725714 167.885715 167.600001
my_data.columns
Index(['date', 'type', 'id', '188', '189', '190', '191', '192', '193', '194',
...
'1089', '1090', '1091', '1092', '1093', '1094', '1095', '1096', '1097',
'1098'],
dtype='object', length=914)
My goal is to divide all the rows by the row with "type" == "cal", but only for the columns '188' through '1098' (911 columns).
These are the approaches I have tried:
Extracting the row of interest and using it with apply(), divide() and
operator '/':
>>> cal_r = my_data[my_data["type"]=="cal"].iloc[:,3:]
my_data.apply(lambda x: x.iloc[3:]/cal_r, axis=1)
0 188 189 190 191 192 193 194 195 ... 1091 10...
1 188 189 190 ... 10...
2 188 189 190 ... 109...
3 188 189 190 ... 1096...
4 188 189 190 191 ... ...
5 188 189 190 ... 10...
6 188 189 190 ... 109...
7 188 189 190 ... 1096...
8 188 189 190 ... 1096...
9 188 189 190 ... 1096 ...
10 188 189 190 ... 1...
11 188 189 190 ... 109...
12 188 189 190 191 ... ...
dtype: object
>>> mydata.apply(lambda x: x.iloc[3:].divide(cal_r,axis=1), axis=1)
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 6014, in apply
return op.get_result()
File "/usr/local/lib/python3.5/dist-packages/pandas/core/apply.py", line 142, in get_result
return self.apply_standard()
File "/usr/local/lib/python3.5/dist-packages/pandas/core/apply.py", line 248, in apply_standard
self.apply_series_generator()
File "/usr/local/lib/python3.5/dist-packages/pandas/core/apply.py", line 277, in apply_series_generator
results[i] = self.f(v)
File "<input>", line 1, in <lambda>
File "/usr/local/lib/python3.5/dist-packages/pandas/core/ops.py", line 1375, in flex_wrapper
self._get_axis_number(axis)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 375, in _get_axis_number
.format(axis, type(self)))
ValueError: ("No axis named 1 for object type <class 'pandas.core.series.Series'>", 'occurred at index 0')
Without using apply:
>>> my_data.iloc[:,3:].divide(cal_r)
188 189 190 191 192 193 ... 1093 1094 1095 1096 1097 1098
0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0
1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
The commands my_data.iloc[:,3:].divide(cal_r, axis=1) and my_data.iloc[:,3:]/cal_r give the same result: only the first row is divided.
If I select just one row, it works:
my_data.iloc[5,3:]/cal_r
188 189 190 ... 1096 1097 1098
0 48.8182 48.8274 22.4476 ... 0.214338 0.154428 0.116671
[1 rows x 911 columns]
Is there something basic I am missing? I suspect that I need to repeat the cal_r row once for every row of the whole data.
Any hint or guidance is really appreciated.
Related: divide pandas dataframe elements by its line max
I believe you need to convert cal_r to a NumPy array, so that the division is not aligned on the row index:
cal_r = my_data.iloc[(my_data["type"]=="cal").values, 3:]
print (cal_r)
1096 1097 1098
0 17.949524 16.247619 15.465079
my_data.iloc[:, 3:] /= cal_r.values
print (my_data)
date type id 1096 1097 1098
0 2014-06-13 cal 1 1.000000 1.000000 1.000000
1 2014-06-13 cow 32 0.029161 -0.052579 -0.098348
2 2014-06-13 cow 47 0.427644 0.401395 0.381012
3 2014-06-13 cow 107 0.231857 0.187632 0.156420
4 2014-06-13 cow 137 0.210654 0.157386 0.124890
5 2014-06-13 cow 255 0.214338 0.154428 0.116671
6 2014-06-13 cow 609 0.339715 0.297749 0.274782
7 2014-06-13 cow 721 0.203523 0.145147 0.105614
8 2014-06-14 cow 817 0.336754 0.303693 0.282788
9 2014-06-14 cow 837 0.537603 0.523857 0.509843
10 2014-06-14 cow 980 0.101236 0.033025 -0.006651
11 2014-06-14 cow 1730 0.474251 0.437866 0.408601
12 2014-06-14 dark 1 9.400010 10.332943 10.837319
Or convert the one-row DataFrame to a Series with DataFrame.squeeze, or select the first row by position to get a Series:
my_data.iloc[:, 3:] = my_data.iloc[:, 3:].div(cal_r.squeeze())
#alternative
#my_data.iloc[:, 3:] = my_data.iloc[:, 3:].div(cal_r.iloc[0])
print (my_data)
date type id 1096 1097 1098
0 2014-06-13 cal 1 1.000000 1.000000 1.000000
1 2014-06-13 cow 32 0.029161 -0.052579 -0.098348
2 2014-06-13 cow 47 0.427644 0.401395 0.381012
3 2014-06-13 cow 107 0.231857 0.187632 0.156420
4 2014-06-13 cow 137 0.210654 0.157386 0.124890
5 2014-06-13 cow 255 0.214338 0.154428 0.116671
6 2014-06-13 cow 609 0.339715 0.297749 0.274782
7 2014-06-13 cow 721 0.203523 0.145147 0.105614
8 2014-06-14 cow 817 0.336754 0.303693 0.282788
9 2014-06-14 cow 837 0.537603 0.523857 0.509843
10 2014-06-14 cow 980 0.101236 0.033025 -0.006651
11 2014-06-14 cow 1730 0.474251 0.437866 0.408601
12 2014-06-14 dark 1 9.400010 10.332943 10.837319
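To see why the original attempt gave NaN everywhere except the first row: dividing a DataFrame by a one-row DataFrame aligns on both the row index and the columns, so only row 0 finds a partner. Dividing by a Series indexed by the column labels aligns on the columns only and is applied to every row. A small self-contained sketch with made-up numbers (only two of the numeric columns are kept):

```python
import pandas as pd

# Tiny stand-in for my_data; the values are invented for illustration
my_data = pd.DataFrame({
    "date": ["2014-06-13"] * 3,
    "type": ["cal", "cow", "dark"],
    "id": [1, 32, 1],
    "1096": [2.0, 1.0, 8.0],
    "1097": [4.0, -2.0, 4.0],
})

values = my_data.iloc[:, 3:]
cal_frame = my_data.loc[my_data["type"] == "cal"].iloc[:, 3:]

# DataFrame / DataFrame: aligned on the row index, so rows 1 and 2 become NaN
aligned = values / cal_frame

# DataFrame / Series: the Series is matched against the column labels
cal_row = cal_frame.iloc[0]
ratios = values.div(cal_row, axis="columns")
print(ratios)
```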
