Merge 2 different DataFrames - Python 3.6

I want to merge 2 tables, and the blanks should be filled with the first table's row values.
DF1:
Col1 Col2 Col3
A B C
DF2:
Col6 Col8
1 2
3 4
5 6
7 8
9 10
I am expecting the result below:
Col1 Col2 Col3 Col6 Col8
A B C 1 2
A B C 3 4
A B C 5 6
A B C 7 8
A B C 9 10
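For reference, the two frames can be built like this (a minimal sketch; values copied from the tables above):
import pandas as pd

df1 = pd.DataFrame({'Col1': ['A'], 'Col2': ['B'], 'Col3': ['C']})
df2 = pd.DataFrame({'Col6': [1, 3, 5, 7, 9], 'Col8': [2, 4, 6, 8, 10]})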

Use assign, but then it is necessary to change the order of the columns:
df = df2.assign(**df1.iloc[0])[df1.columns.append(df2.columns)]
print (df)
Col1 Col2 Col3 Col6 Col8
0 A B C 1 2
1 A B C 3 4
2 A B C 5 6
3 A B C 7 8
4 A B C 9 10
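Why this works: df1.iloc[0] is a Series indexed by df1's column names, so **-unpacking it passes each entry to assign as a scalar, and a scalar broadcasts to every row of df2. A sketch of the same idea without unpacking:
df = df2.copy()
for col, val in df1.iloc[0].items():
    df[col] = val  # a scalar assignment broadcasts to all rows
df = df[df1.columns.append(df2.columns)]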
Or concat and replace NaNs by forward filling with ffill:
df = pd.concat([df1, df2], axis=1).ffill()
print (df)
Col1 Col2 Col3 Col6 Col8
0 A B C 1 2
1 A B C 3 4
2 A B C 5 6
3 A B C 7 8
4 A B C 9 10
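For intuition, the intermediate frame before ffill holds df1's values only at index 0, with NaN below; the forward fill then copies row 0 downward:
print(pd.concat([df1, df2], axis=1))
Col1 Col2 Col3 Col6 Col8
0 A B C 1 2
1 NaN NaN NaN 3 4
2 NaN NaN NaN 5 6
3 NaN NaN NaN 7 8
4 NaN NaN NaN 9 10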

You can also merge both dataframes by index with an outer join and forward fill the data:
df2.merge(df1, left_index=True, right_index=True, how='outer').ffill()
Out:
Col6 Col8 Col1 Col2 Col3
0 1 2 A B C
1 3 4 A B C
2 5 6 A B C
3 7 8 A B C
4 9 10 A B C
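As a follow-up, if df1 could ever hold more than one row, a cross join generalizes this pattern (requires pandas >= 1.2; with the single-row df1 here it reproduces the expected output without any filling):
df = df1.merge(df2, how='cross')
print(df)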

Related

Group by and drop duplicates in pandas dataframe

I have a pandas dataframe as below. I want to group by all three columns and retain the groups with the max of col3.
import pandas as pd
df = pd.DataFrame({'col1':['A', 'A', 'A', 'A', 'B', 'B'], 'col2':['1', '1', '1', '1', '2', '3'], 'col3':['5', '5', '2', '2', '2', '3']})
df
col1 col2 col3
0 A 1 5
1 A 1 5
2 A 1 2
3 A 1 2
4 B 2 2
5 B 3 3
My expected output
col1 col2 col3
0 A 1 5
1 A 1 5
4 B 2 2
5 B 3 3
I tried the code below, but it returns the last row of each group; instead I want to sort by col3 and keep the group with the max col3:
df.drop_duplicates(keep='last', subset=['col1','col2','col3'])
col1 col2 col3
1 A 1 5
3 A 1 2
4 B 2 2
5 B 3 3
For example: here I want to drop the first group because 2 < 5, so I keep the group whose col3 is 5.
df.sort_values(by=['col1', 'col2', 'col3'], ascending=False)
a_group = df.groupby(['col1', 'col2', 'col3'])
for name, group in a_group:
    group = group.reset_index(drop=True)
    print(group)
col1 col2 col3
0 A 1 2
1 A 1 2
col1 col2 col3
0 A 1 5
1 A 1 5
col1 col2 col3
0 B 2 2
col1 col2 col3
0 B 3 3
You can't group on all columns, since the column whose max you wish to retain has different values. Instead, don't include that column in the group and consider the others:
col_to_max = 'col3'
i = df.columns.difference([col_to_max])  # every column except col_to_max
out = df[df[col_to_max] == df.groupby(list(i))[col_to_max].transform('max')]
print(out)
col1 col2 col3
0 A 1 5
1 A 1 5
4 B 2 2
5 B 3 3
So, spelled out with the actual column names (transform('max') broadcasts each group's max back onto its rows, so the comparison keeps every row that attains it):
out = df[df.col3==df.groupby(['col1','col2'])['col3'].transform('max')]
col1 col2 col3
0 A 1 5
1 A 1 5
4 B 2 2
5 B 3 3
I believe you can use groupby with nlargest(2). Also make sure that 'col3' is numeric:
>>> df['col3'] = df['col3'].astype(int)
>>> df.groupby(['col1','col2'])['col3'].nlargest(2).reset_index().drop('level_2',axis=1)
col1 col2 col3
0 A 1 5
1 A 1 5
2 B 2 2
3 B 3 3
You can get the index of rows that don't hold their group's max col3 value, and the index of duplicated rows, then drop the intersection:
ind = df.assign(max = df.groupby("col1")["col3"].transform("max")).query("max != col3").index
ind2 = df[df.duplicated(keep=False)].index
df.drop(set(ind).intersection(ind2))
col1 col2 col3
0 A 1 5
1 A 1 5
4 B 2 2
5 B 3 3

groupby column in pandas

I am trying to group by a column value in pandas, but I'm not getting the result I want.
Example:
Col1 Col2 Col3
A 1 2
B 5 6
A 3 4
C 7 8
A 11 12
B 9 10
-----
Result needed, grouping by Col1:
Col1 Col2 Col3
A 1,3,11 2,4,12
B 5,9 6,10
C 7 8
but I am getting this output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000025BEB4D6E50>
I can do this in Excel Power Query with its Group By function and "count all rows", but I can't get the same with Python and pandas. Any help?
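For reference, the example frame can be rebuilt like this (a sketch; values copied from the table above):
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'B', 'A', 'C', 'A', 'B'],
                   'Col2': [1, 5, 3, 7, 11, 9],
                   'Col3': [2, 6, 4, 8, 12, 10]})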
Try this (a bare groupby returns a lazy DataFrameGroupBy object, which is what you saw printed; applying an aggregation materializes the result):
(
df
.groupby('Col1')
.agg(lambda x: ','.join(x.astype(str)))
.reset_index()
)
it outputs
Col1 Col2 Col3
0 A 1,3,11 2,4,12
1 B 5,9 6,10
2 C 7 8
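If what you actually need is the Power Query style "count all rows" per group, size gives that directly (a sketch; the column name 'count' is arbitrary):
print(df.groupby('Col1').size().reset_index(name='count'))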
Very good! Building on this, I created a solution for grouping between 0 values:
df[df['A'] != 0].groupby((df['A'] == 0).cumsum()).sum()
It groups the column's values between the 0s and sums them.
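A sketch of that between-zeros pattern on a made-up frame (the column name 'A' and the values are assumptions for illustration):
import pandas as pd

tmp = pd.DataFrame({'A': [0, 1, 2, 0, 3, 4, 0, 5]})
# each 0 starts a new segment; the cumulative sum of the zero-mask labels the segments
print(tmp[tmp['A'] != 0].groupby((tmp['A'] == 0).cumsum()).sum())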

Shuffle pandas columns

I have the following data frame:
Col1 Col2 Col3 Type
0 1 2 3 1
1 4 5 6 1
2 7 8 9 2
and I would like to have a shuffled output like:
Col3 Col1 Col2 Type
0 3 1 2 1
1 6 4 5 1
2 9 7 8 2
How to achieve this?
Use DataFrame.sample with axis=1:
df = df.sample(frac=1, axis=1)
If you need the last column to keep its position, shuffle only the remaining column names:
a = df.columns[:-1].to_numpy()
np.random.shuffle(a)
print (a)
['Col3' 'Col1' 'Col2']
df = df[np.append(a, ['Type'])]
print (df)
Col3 Col1 Col2 Type
0 3 1 2 1
1 6 4 5 1
2 9 7 8 2
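If the shuffle needs to be reproducible, DataFrame.sample accepts a random_state seed (the value 0 below is arbitrary):
df = df.sample(frac=1, axis=1, random_state=0)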

pandas fill column with random numbers with a total for each row

I've got a pandas dataframe like this:
id foo
0 A col1
1 A col2
2 B col1
3 B col3
4 D col4
5 C col2
I'd like to create four additional columns based on the unique values in the foo column: col1, col2, col3, col4.
id foo col1 col2 col3 col4
0 A col1 75 20 5 0
1 A col2 20 80 0 0
2 B col1 82 10 8 0
3 B col3 5 4 80 11
4 D col4 0 5 10 85
5 C col2 12 78 5 5
The logic for creating the columns is as follows:
if foo = col1 then col1 contains a random number between 75-100 and the other columns (col2, col3, col4) contain random numbers, such that the total for each row is 100
I can manually create a new column and assign a random number, but I'm unsure how to enforce that each row sums to 100.
Appreciate any help!
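For reference, the input frame used by the answers below (a sketch; values copied from the table above):
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': ['A', 'A', 'B', 'B', 'D', 'C'],
                   'foo': ['col1', 'col2', 'col1', 'col3', 'col4', 'col2']})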
My two cents
d = []
s = np.random.randint(75, 100, size=6)
for x in 100 - s:
    a = np.random.randint(100, size=3)
    b = np.random.multinomial(x, a / a.sum())
    d.append(b.tolist())
# shuffle each row so the large value lands in a random column
s = [np.random.choice(x, 4, replace=False) for x in np.column_stack((s, np.array(d)))]
df = pd.concat([df, pd.DataFrame(s, index=df.index)], axis=1)
df
id foo 0 1 2 3
0 A col1 16 1 7 76
1 A col2 4 2 91 3
2 B col1 4 4 1 91
3 B col3 78 8 8 6
4 D col4 8 87 3 2
5 C col2 2 0 11 87
IIUC,
df['col1'] = df.apply(lambda x: np.where(x['foo'] == 'col1', np.random.randint(75,100), np.random.randint(0,100)), axis=1)
df['col2'] = df.apply(lambda x: np.random.randint(0,100-x['col1'],1)[0], axis=1)
df['col3'] = df.apply(lambda x: np.random.randint(0,100-x[['col1','col2']].sum(),1)[0], axis=1)
df['col4'] = 100 - df[['col1','col2','col3']].sum(1).astype(int)
df[['col1','col2','col3','col4']].sum(1)  # sanity check: every row sums to 100
Output:
id foo col1 col2 col3 col4
0 A col1 92 2 5 1
1 A col2 60 30 0 10
2 B col1 89 7 3 1
3 B col3 72 12 0 16
4 D col4 41 52 3 4
5 C col2 72 2 22 4
My Approach
import numpy as np

def weird(lower, upper, k, col, cols):
    # draw the "big" number for the target column, then split the remainder
    # randomly over the other k - 1 columns so the row sums to upper
    first_num = np.random.randint(lower, upper)
    delta = upper - first_num
    the_rest = np.random.rand(k - 1)
    the_rest = the_rest / the_rest.sum() * delta
    the_rest = the_rest.astype(int)
    the_rest[-1] = delta - the_rest[:-1].sum()  # absorb rounding so the sum is exact
    # sorted(..., key=...) puts col first (False sorts before True),
    # so first_num lands in the column named by foo
    key = lambda x: x != col
    return dict(zip(sorted(cols, key=key), [first_num, *the_rest]))

def f(c):
    return weird(75, 100, 4, c, ['col1', 'col2', 'col3', 'col4'])

df.join(pd.DataFrame([*map(f, df.foo)]))
id foo col1 col2 col3 col4
0 A col1 76 2 21 1
1 A col2 11 76 11 2
2 B col1 75 4 10 11
3 B col3 0 1 97 2
4 D col4 5 4 13 78
5 C col2 9 77 6 8
If we subtract 75 from the numbers between 75-100, the problem becomes generating a table of random numbers between 0-25 in which each row sums to 25. That can be solved with sorted cut points and a row-wise diff (a reverse cumsum):
num_cols = 4
# generate num_cols - 1 random cut points in [0, 25) and sort them in each row
a = np.sort(np.random.randint(0, 25, (len(df), num_cols - 1)), axis=1)
# create a dataframe and attach a last column with the value 25
new_df = pd.DataFrame(a)
new_df[num_cols - 1] = 25
# differences between consecutive cut points give num_cols parts that sum to 25
parts = new_df.diff(axis=1)
parts[0] = new_df[0]  # the first part is the first cut point itself
# add the parts to the dummies (75 in the foo column, 0 elsewhere)
dummies = pd.get_dummies(df.foo) * 75
dummies += parts.values
And dummies is (values are random; each row sums to 100):
col1 col2 col3 col4
0 76.0 13.0 2.0 9.0
1 1.0 79.0 2.0 18.0
2 76.0 5.0 8.0 11.0
3 1.0 3.0 86.0 10.0
4 1.0 2.0 9.0 88.0
5 1.0 82.0 10.0 7.0
which can be concatenated to the original dataframe.
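A sketch of that final step (joining on the shared default index; casting to int since the diff arithmetic produced floats):
out = df.join(dummies.astype(int))
print(out)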

Pandas: Calculate Median of Group over Columns

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'COL1': ['A', 'A','A','A','B','B'],
'COL2' : ['AA','AA','BB','BB','BB','BB'],
'COL3' : [2,3,4,5,4,2],
'COL4' : [0,1,2,3,4,2]})
df
COL1 COL2 COL3 COL4
0 A AA 2 0
1 A AA 3 1
2 A BB 4 2
3 A BB 5 3
4 B BB 4 4
5 B BB 2 2
I would like, as efficiently as possible (i.e. via groupby and lambda x or better), to find the median of columns 3 and 4 for each distinct group of columns 1 and 2.
The desired result is as follows:
COL1 COL2 COL3 COL4 MEDIAN
0 A AA 2 0 1.5
1 A AA 3 1 1.5
2 A BB 4 2 3.5
3 A BB 5 3 3.5
4 B BB 4 4 3
5 B BB 2 2 3
Thanks in advance!
You already had the idea -- groupby COL1 and COL2 and calculate median.
import numpy as np
m = df.groupby(['COL1', 'COL2'])[['COL3','COL4']].apply(np.median)
m.name = 'MEDIAN'
print(df.join(m, on=['COL1', 'COL2']))
COL1 COL2 COL3 COL4 MEDIAN
0 A AA 2 0 1.5
1 A AA 3 1 1.5
2 A BB 4 2 3.5
3 A BB 5 3 3.5
4 B BB 4 4 3.0
5 B BB 2 2 3.0
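An equivalent without apply: stack COL3 and COL4 into one Series so a single median covers both columns per group, then join it back onto the original rows (a sketch using the same df):
m = (df.set_index(['COL1', 'COL2'])[['COL3', 'COL4']]
       .stack()
       .groupby(level=['COL1', 'COL2'])
       .median()
       .rename('MEDIAN'))
print(df.join(m, on=['COL1', 'COL2']))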
Alternatively, per-column group medians (note this returns a separate median for COL3 and COL4 in each group, not the single combined median above):
df.groupby(['COL1', 'COL2'])[['COL3','COL4']].median()
