The code snippet below:
import pandas as pd
df = pd.DataFrame(
    {'type': ['A', 'B', 'A', 'C', 'C', 'A'],
     'value': [5, 6, 7, 7, 9, 1]}
)
Gives:
type value
0 A 5
1 B 6
2 A 7
3 C 7
4 C 9
5 A 1
I want this:
pd.DataFrame(
{'A': [5, 0, 7, 0, 0, 1],
'B': [0, 6, 0, 0, 0, 0],
'C': [0, 0, 0, 7, 9, 0]}
)
A B C
0 5 0 0
1 0 6 0
2 7 0 0
3 0 0 7
4 0 0 9
5 1 0 0
I did try using for loops, but I want something more efficient. Any help would be great!
Use get_dummies and multiply by the value column:
final_df = pd.get_dummies(df['type']).mul(df['value'], axis=0)
A B C
0 5 0 0
1 0 6 0
2 7 0 0
3 0 0 7
4 0 0 9
5 1 0 0
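Note: newer pandas versions return boolean indicator columns from get_dummies, so the dummies may print as True/False; the multiplication above still yields integers. To be explicit you can pass dtype=int (a standard get_dummies parameter):
final_df = pd.get_dummies(df['type'], dtype=int).mul(df['value'], axis=0)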
Use Series.unstack to reshape:
df = df.set_index('type', append=True)['value'].unstack(fill_value=0).rename_axis(None, axis=1)
print (df)
A B C
0 5 0 0
1 0 6 0
2 7 0 0
3 0 0 7
4 0 0 9
5 1 0 0
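For intuition, the intermediate MultiIndex Series that unstack pivots looks like this (append=True keeps the original row positions as the outer index level):
print (df.set_index('type', append=True)['value'])
  type
0 A 5
1 B 6
2 A 7
3 C 7
4 C 9
5 A 1
Name: value, dtype: int64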
Or a NumPy solution: broadcast the value column against the indicator DataFrame created by get_dummies:
df = pd.get_dummies(df['type']) * df['value'].values[:, None]
print (df)
A B C
0 5 0 0
1 0 6 0
2 7 0 0
3 0 0 7
4 0 0 9
5 1 0 0
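Starting from the original df, the same reshape can also be written with DataFrame.pivot, keeping the existing index and filling the holes, if you prefer that spelling:
df.pivot(columns='type', values='value').fillna(0).astype(int).rename_axis(None, axis=1)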
I have a pandas dataframe as below. I want to apply the following condition:
if column 'A' is 1, update the value of column 'F' with the previous value of 'F'. This can be done by row-by-row iteration, but that is not efficient. I want a vectorized way of doing it.
df = pd.DataFrame({'A': [1, 1, 1, 0, 0, 0, 1, 0, 0], 'C': [1, 1, 1, 0, 0, 0, 1, 1, 1], 'D': [1, 1, 1, 0, 0, 0, 1, 1, 1],
                   'F': [2, 0, 0, 0, 0, 1, 1, 1, 1]})
df
A C D F
0 1 1 1 2
1 1 1 1 0
2 1 1 1 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 1
6 1 1 1 1
7 0 1 1 1
8 0 1 1 1
My desired output:
A C D F
0 1 1 1 2
1 1 1 1 2
2 1 1 1 2
3 0 0 0 0
4 0 0 0 0
5 0 0 0 1
6 1 1 1 1
7 0 1 1 1
8 0 1 1 1
I tried the code below, but it does not work, because shift does not see the updated previous row:
df['F'] = df.groupby(['A'])['F'].shift(1)
df
A C D F
0 1 1 1 NaN
1 1 1 1 2.0
2 1 1 1 0.0
3 0 0 0 NaN
4 0 0 0 0.0
5 0 0 0 0.0
6 1 1 1 0.0
7 0 1 1 1.0
8 0 1 1 1.0
transform('first')
The grouper df.A.rsub(1).cumsum() is constant across each A == 0 row and the run of A == 1 rows that follows it, so transform('first') broadcasts the F value from the start of each such group:
df.F.groupby(df.A.rsub(1).cumsum()).transform('first')
0 2
1 2
2 2
3 0
4 0
5 1
6 1
7 1
8 1
Name: F, dtype: int64
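For reference, this is the grouper driving the operation (a quick sketch of the intermediate step):
print(df.A.rsub(1).cumsum())
0 0
1 0
2 0
3 1
4 2
5 3
6 3
7 4
8 5
Name: A, dtype: int64
Every A == 0 row opens a new group, and each following run of A == 1 rows stays in that group, so transform('first') hands those rows the F value that opened the group.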
Assign it back to column 'F':
df.assign(F=df.F.groupby(df.A.rsub(1).cumsum()).transform('first'))
A C D F
0 1 1 1 2
1 1 1 1 2
2 1 1 1 2
3 0 0 0 0
4 0 0 0 0
5 0 0 0 1
6 1 1 1 1
7 0 1 1 1
8 0 1 1 1
It can also be done without groupby:
where = df['A'].eq(1) & df['A'].ne(df['A'].shift())
df['F'] = df['F'].where(where).ffill().mask(df['A'].ne(1), df['F'])
print(df)
A C D F
0 1 1 1 2.0
1 1 1 1 2.0
2 1 1 1 2.0
3 0 0 0 0.0
4 0 0 0 0.0
5 0 0 0 1.0
6 1 1 1 1.0
7 0 1 1 1.0
8 0 1 1 1.0
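Unpacking that chain on the same df, a sketch with the intermediate values as comments (the variable names are just for illustration):
where = df['A'].eq(1) & df['A'].ne(df['A'].shift())
# where marks the first row of each run of ones:
# [True, False, False, False, False, False, True, False, False]
kept = df['F'].where(where)    # keep F only at run starts: [2, NaN, NaN, NaN, NaN, NaN, 1, NaN, NaN]
filled = kept.ffill()          # carry those values forward: [2, 2, 2, 2, 2, 2, 1, 1, 1]
result = filled.mask(df['A'].ne(1), df['F'])   # restore the original F wherever A != 1
# result: [2, 2, 2, 0, 0, 1, 1, 1, 1]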
I want to know how I can replace part of a big matrix with another, smaller matrix, using a non-sequential order of rows and columns. I mean:
import numpy as np
a = np.zeros([15, 15])
B = np.ones([5, 5])
ind1 = [0, 1, 2, 3, 4]
ind2 = [0, 5, 8, 7, 12]
# Now I want to replace like this
a[ind1, ind1] = a[ind1, ind1] + B
# and
a[ind2, ind2] = a[ind2, ind2] + B
This can be done very easily in MATLAB, but I do not know why indexing the columns with a list of numbers does not work in Python. Thank you in advance.
Your problem is about NumPy, not Python. Learn about indexing in NumPy: https://docs.scipy.org/doc/numpy-1.15.1/user/basics.indexing.html. It is actually not that different from MATLAB indexing.
For example:
import numpy as np
a = np.zeros(shape=[15, 15], dtype=int)
b = np.ones(shape=[5, 5], dtype=int)
a[0:5, 0:5] += b          # contiguous block: plain slices work
a[0:5, 5:10] += b * 2
ind_1 = [11, 6, 7, 12, 13]
ind_2 = [9, 7, 14, 13, 4]
a[np.ix_(ind_1, ind_2)] += b * 3   # non-contiguous rows/columns: np.ix_ builds the mesh
print(a)
Output:
[[1 1 1 1 1 2 2 2 2 2 0 0 0 0 0]
[1 1 1 1 1 2 2 2 2 2 0 0 0 0 0]
[1 1 1 1 1 2 2 2 2 2 0 0 0 0 0]
[1 1 1 1 1 2 2 2 2 2 0 0 0 0 0]
[1 1 1 1 1 2 2 2 2 2 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 3 0 0 3 0 3 0 0 0 3 3]
[0 0 0 0 3 0 0 3 0 3 0 0 0 3 3]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 3 0 0 3 0 3 0 0 0 3 3]
[0 0 0 0 3 0 0 3 0 3 0 0 0 3 3]
[0 0 0 0 3 0 0 3 0 3 0 0 0 3 3]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
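To spell out why the original attempt fails, a small sketch: passing two lists directly pairs them element-wise, so a[ind2, ind2] selects 5 individual elements rather than a 5x5 block, while np.ix_ builds the open mesh of all row/column combinations:
import numpy as np
a = np.zeros([15, 15])
B = np.ones([5, 5])
ind2 = [0, 5, 8, 7, 12]
# Element-wise pairing: picks (0,0), (5,5), (8,8), (7,7), (12,12) -- shape (5,)
print(a[ind2, ind2].shape)            # (5,)
# Open mesh: all 25 (row, column) combinations -- shape (5, 5)
a[np.ix_(ind2, ind2)] += B
print(a[np.ix_(ind2, ind2)].shape)    # (5, 5)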
I have this dataframe:
df = pd.DataFrame({'Jan' : ['US', 'GB', 'NL', 'CH', 'GB', 'US'],
'Feb': ['US', 'AU', 'RU', 'NO', 'AU', 0],
'Mar' : ['PL', 'AU', 'FI', 'US', 'CH', 'CH']})
I would like to create a stacked bar chart showing the count of countries per month, so I first need to transform this dataframe into this form:
Jan Feb Mar
US 2 1 1
GB 2 0 0
NL 1 0 0
CH 1 0 2
AU 0 2 1
RU 0 1 0
NO 0 1 0
PL 0 0 1
FI 0 0 1
0 0 1 0
My dataframe is large, but I want to display only the 10 most common countries for each month on the stacked bar plot. I noticed that pandas pivot isn't doing the job.
You could stack the frame into long form and count with pd.crosstab:
In [46]: s = df.stack().reset_index()
In [47]: pd.crosstab(s[0], s['level_1']).rename_axis(None, axis=1).rename_axis(None, axis=0)
Out[47]:
Feb Jan Mar
0 1 0 0
AU 2 0 1
CH 0 1 2
FI 0 0 1
GB 0 2 0
NL 0 1 0
NO 1 0 0
PL 0 0 1
RU 1 0 0
US 1 2 1
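crosstab sorts rows and columns alphabetically; to restore the month order, keep only the most common countries, and draw the chart, something like this sketch should work (it takes the top 10 overall rather than per month, and assumes matplotlib is available for plotting):
counts = pd.crosstab(s[0], s['level_1'])[df.columns]      # reorder columns to Jan, Feb, Mar
top = counts.loc[counts.sum(axis=1).nlargest(10).index]   # 10 most common countries overall
top.plot.bar(stacked=True)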
I have the following dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
        'name': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'],
        'Value': [1, 2, 3, 4, 5, 6, 0, 2, 4, 6, 3, 5]
    },
    columns=['name', 'id', 'Value'])
I can sort the data using id and value as shown below:
df.sort_values(['id', 'Value'], ascending=[True, False])
The printed table appears as follows:
name id Value
D 1 4
C 1 3
B 1 2
A 1 1
B 2 6
A 2 5
D 2 2
C 2 0
B 3 6
D 3 5
A 3 4
C 3 3
I would like to create 4 new columns (Rank1, Rank2, Rank3, Rank4): if the row holds the highest value in its group, Rank1 is set to 1, else 0; if it holds the second-highest value, Rank2 is set to 1, else 0; and the same for Rank3 and Rank4.
How could I do that? Thanks.
Use:
df = df.join(pd.get_dummies(df.groupby('id').cumcount().add(1)).add_prefix('Rank'))
print (df)
name id Value Rank1 Rank2 Rank3 Rank4
3 D 1 4 1 0 0 0
2 C 1 3 0 1 0 0
1 B 1 2 0 0 1 0
0 A 1 1 0 0 0 1
5 B 2 6 1 0 0 0
4 A 2 5 0 1 0 0
7 D 2 2 0 0 1 0
6 C 2 0 0 0 0 1
9 B 3 6 1 0 0 0
11 D 3 5 0 1 0 0
8 A 3 4 0 0 1 0
10 C 3 3 0 0 0 1
Details:
This builds on the frame already sorted by ['id', 'Value'] with Value descending, so the position within each group is the rank. For the count per group, use GroupBy.cumcount, then add 1:
print (df.groupby('id').cumcount().add(1))
3 1
2 2
1 3
0 4
5 1
4 2
7 3
6 4
9 1
11 2
8 3
10 4
dtype: int64
For the indicator columns, use get_dummies with add_prefix:
print (pd.get_dummies(df.groupby('id').cumcount().add(1)).add_prefix('Rank'))
Rank1 Rank2 Rank3 Rank4
3 1 0 0 0
2 0 1 0 0
1 0 0 1 0
0 0 0 0 1
5 1 0 0 0
4 0 1 0 0
7 0 0 1 0
6 0 0 0 1
9 1 0 0 0
11 0 1 0 0
8 0 0 1 0
10 0 0 0 1
This does not require a prior sort: rank the values within each id group with GroupBy.rank and one-hot encode the ranks:
df.join(
    pd.get_dummies(
        df.groupby('id').Value.rank(ascending=False).astype(int)
    ).add_prefix('Rank')
)
name id Value Rank1 Rank2 Rank3 Rank4
0 A 1 1 0 0 0 1
1 B 1 2 0 0 1 0
2 C 1 3 0 1 0 0
3 D 1 4 1 0 0 0
4 A 2 5 0 1 0 0
5 B 2 6 1 0 0 0
6 C 2 0 0 0 0 1
7 D 2 2 0 0 1 0
8 A 3 4 0 0 1 0
9 B 3 6 1 0 0 0
10 C 3 3 0 0 0 1
11 D 3 5 0 1 0 0
More robust: rank adapts to any group size, and method='first' breaks ties by position, so each rank appears exactly once per group even with duplicate values:
df.join(
    pd.get_dummies(
        df.groupby('id').Value.rank(method='first', ascending=False).astype(int)
    ).add_prefix('Rank')
)
name id Value Rank1 Rank2 Rank3 Rank4
0 A 1 1 0 0 0 1
1 B 1 2 0 0 1 0
2 C 1 3 0 1 0 0
3 D 1 4 1 0 0 0
4 A 2 5 0 1 0 0
5 B 2 6 1 0 0 0
6 C 2 0 0 0 0 1
7 D 2 2 0 0 1 0
8 A 3 4 0 0 1 0
9 B 3 6 1 0 0 0
10 C 3 3 0 0 0 1
11 D 3 5 0 1 0 0
Given df
A = pd.DataFrame([[1, 5, 2, 1, 2], [2, 4, 4, 1, 2], [3, 3, 1, 1, 2], [4, 2, 2, 3, 0],
[5, 1, 4, 3, -4], [1, 5, 2, 3, -20], [2, 4, 4, 2, 0], [3, 3, 1, 2, -1],
[4, 2, 2, 2, 0], [5, 1, 4, 2, -2]],
columns=['a', 'b', 'c', 'd', 'e'],
index=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
How can I create a column 'f' that takes the last value of column 'e' before each change of value in column 'd' and holds it until the next change in 'd'? The output would be:
a b c d e f
1 1 5 2 1 2 nan
2 2 4 4 1 2 nan
3 3 3 1 1 2 nan
4 4 2 2 3 0 2
5 5 1 4 3 -4 2
6 1 5 2 3 -20 2
7 2 4 4 2 0 -20
8 3 3 1 2 -1 -20
9 4 2 2 2 0 -20
10 5 1 4 2 -2 -20
Edit: #Noobie presented a solution (below) that breaks down on real data whenever column 'd' contains a value smaller than the previous one, because it relies on sorting by 'd'.
I think we should offer better native support for dealing with contiguous groups, but until then you can use the compare-cumsum-groupby pattern:
g = (A["d"] != A["d"].shift()).cumsum()
A["f"] = A["e"].groupby(g).last().shift().loc[g].values
which gives me
In [41]: A
Out[41]:
a b c d e f
1 1 5 2 1 2 NaN
2 2 4 4 1 2 NaN
3 3 3 1 1 2 NaN
4 4 2 2 3 0 2.0
5 5 1 4 3 -4 2.0
6 1 5 2 3 -20 2.0
7 2 4 4 2 0 -20.0
8 3 3 1 2 -1 -20.0
9 4 2 2 2 0 -20.0
10 5 1 4 2 -2 -20.0
This works because g is a counter that labels each contiguous run of d values; on the posted example g is [1, 1, 1, 2, 2, 2, 3, 3, 3, 3], which need not equal column d itself. Once we have g, we can use it to group column e:
In [55]: A["e"].groupby(g).last()
Out[55]:
d
1 2
2 -20
3 -2
Name: e, dtype: int64
and then
In [57]: A["e"].groupby(g).last().shift()
Out[57]:
d
1 NaN
2 2.0
3 -20.0
Name: e, dtype: float64
In [58]: A["e"].groupby(g).last().shift().loc[g]
Out[58]:
d
1 NaN
1 NaN
1 NaN
2 2.0
2 2.0
2 2.0
3 -20.0
3 -20.0
3 -20.0
3 -20.0
Name: e, dtype: float64
Easy, my friend. Unleash the power of pandas!
A.sort_values(by='d', inplace=True)
A['lag'] = A.e.shift(1)
A['output'] = A.groupby('d').lag.transform(lambda x: x.iloc[0])
A
Out[57]:
a b c d e lag output
1 1 5 2 1 2 NaN NaN
2 2 4 4 1 2 2.0 NaN
3 3 3 1 1 2 2.0 NaN
4 4 2 2 2 0 2.0 2.0
5 5 1 4 2 -4 0.0 2.0
6 1 5 2 2 -20 -4.0 2.0
7 2 4 4 3 0 -20.0 -20.0
8 3 3 1 3 -1 0.0 -20.0
9 4 2 2 3 0 -1.0 -20.0
10 5 1 4 3 -2 0.0 -20.0