Subset and Loop to create a new column [duplicate] - python-3.x

With the DataFrame below as an example,
In [83]:
df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
df
Out[83]:
A B values
0 1 1 10
1 1 2 15
2 2 1 20
3 2 2 25
What would be a simple way to generate a new column containing some aggregation of the data over one of the columns?
For example, if I sum values over items in A
In [84]:
df.groupby('A').sum()['values']
Out[84]:
A
1 25
2 45
Name: values
How can I get the following result?
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45

In [20]: df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
In [21]: df
Out[21]:
A B values
0 1 1 10
1 1 2 15
2 2 1 20
3 2 2 25
In [22]: df['sum_values_A'] = df.groupby('A')['values'].transform(np.sum)
In [23]: df
Out[23]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
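As a minimal sketch of the same idea: transform also accepts the name of the aggregation as a string, which, as far as I know, avoids the warning that some recent pandas versions emit when a NumPy function such as np.sum is passed. The mean_values_A column is only an illustrative addition and is not part of the original question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 1, 2],
                   'values': np.arange(10, 30, 5)})

# transform returns one value per original row, so the result
# lines up with df and can be assigned directly as a new column.
df['sum_values_A'] = df.groupby('A')['values'].transform('sum')

# Hypothetical extra column: the same pattern with another aggregation.
df['mean_values_A'] = df.groupby('A')['values'].transform('mean')
```

Because the result keeps the index of df, no join or merge step is needed afterwards.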

I found a way using join:
In [101]:
aggregated = df.groupby('A').sum()['values']
aggregated.name = 'sum_values_A'
df.join(aggregated,on='A')
Out[101]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
Does anyone have a simpler way to do it?

This is less direct, but I find it very intuitive: map creates a new column from an existing one, and the same pattern applies to many other cases:
gb = df.groupby('A').sum()['values']
def getvalue(x):
    return gb[x]
df['sum'] = df['A'].map(getvalue)
df
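A slightly shorter variant of the map approach, as a sketch: Series.map also accepts a Series and looks each value up in its index, so the helper function is optional. The name group_sums is just an illustrative choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 1, 2],
                   'values': np.arange(10, 30, 5)})

# Series indexed by the group key A, holding the per-group sums.
group_sums = df.groupby('A')['values'].sum()

# map looks up each value of column A in the index of group_sums.
df['sum'] = df['A'].map(group_sums)
```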

In [15]: def sum_col(df, col, new_col):
   ....:     df[new_col] = df[col].sum()
   ....:     return df
In [16]: df.groupby("A").apply(sum_col, 'values', 'sum_values_A')
Out[16]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45

Related

Reassigning multiple columns with same array

I've been breaking my head over this simple thing. I know we can assign a single value to multiple columns using .loc, but how do I assign the same array to multiple columns?
I know I can do the following. Let's say we have a DataFrame df in which I wish to replace some columns with the array arr:
df=pd.DataFrame({'a':[random.randint(1,25) for i in range(5)],'b':[random.randint(1,25) for i in range(5)],'c':[random.randint(1,25) for i in range(5)]})
>>df
a b c
0 14 8 5
1 10 25 9
2 14 14 8
3 10 6 7
4 4 18 2
arr = [i for i in range(5)]
# Suppose I wish to replace columns `a` and `b` with the array `arr`
df['a'],df['b']=[arr for j in range(2)]
Desired output:
a b c
0 0 0 16
1 1 1 10
2 2 2 1
3 3 3 20
4 4 4 11
I could also assign the columns one at a time in a loop. But is there a more efficient way without repetition or loops?
Let's try with assign:
cols = ['a', 'b']
df.assign(**dict.fromkeys(cols, arr))
a b c
0 0 0 5
1 1 1 9
2 2 2 8
3 3 3 7
4 4 4 2
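A self-contained sketch of the assign approach, using the fixed values from the question's first printout instead of random numbers so the result is reproducible. Note that assign returns a new DataFrame and leaves the original untouched; the name new_df is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [14, 10, 14, 10, 4],
                   'b': [8, 25, 14, 6, 18],
                   'c': [5, 9, 8, 7, 2]})
arr = list(range(5))
cols = ['a', 'b']

# dict.fromkeys builds {'a': arr, 'b': arr}; ** passes it as keyword arguments.
new_df = df.assign(**dict.fromkeys(cols, arr))
```

To keep the change, rebind the name, for example df = df.assign(...).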
A chained assignment statement also works: df.a = df.b = arr
df=pd.DataFrame({'a':[random.randint(1,25) for i in range(5)],'b':[random.randint(1,25) for i in range(5)],'c':[random.randint(1,25) for i in range(5)]})
arr = [i for i in range(5)]
df
a b c
0 2 8 18
1 17 15 25
2 6 5 17
3 12 15 25
4 10 10 6
df.a = df.b = arr
df
a b c
0 0 0 18
1 1 1 25
2 2 2 17
3 3 3 25
4 4 4 6
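One caveat, shown as a small sketch: attribute-style assignment (df.a = ...) only overwrites columns that already exist; pandas does not create new columns that way. The bracket form of the same chained assignment handles both cases. The column name d is a hypothetical new column added for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 17, 6, 12, 10],
                   'b': [8, 15, 5, 15, 10],
                   'c': [18, 25, 17, 25, 6]})
arr = list(range(5))

# Overwrites the existing column a and creates the new column d in one statement.
df['a'] = df['d'] = arr
```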

How to copy values from other dataframe based on condition (same values of specific column)?

I have two dataframes (df1 and df2) and they look like this:
data1 = {'col1':[1,2,3,4,1,2,3,4,1,2,3,4], 'col2':np.arange(1,13)*2}
df1 = pd.DataFrame(data1)
data2 = {'x': [1,2,3,4], 'y': [10,20,40,5]}
df2 = pd.DataFrame(data2)
I would like to add a new column 'col3' to df1 with the values of df2['y'] where df1['col1'] is equal to df2['x']. So df1 would end up like this:
col1 col2 col3
1 2 10
2 4 20
3 6 40
4 8 5
1 10 10
2 12 20
3 14 40
4 16 5
1 18 10
2 20 20
3 22 40
4 24 5
Could anyone help me?
Use map with a dictionary created from df2:
df1['col3'] = df1.col1.map(dict(df2[['x', 'y']].values))
or
df1['col3'] = df1.col1.map(dict(zip(df2.x, df2.y)))
Out[886]:
col1 col2 col3
0 1 2 10
1 2 4 20
2 3 6 40
3 4 8 5
4 1 10 10
5 2 12 20
6 3 14 40
7 4 16 5
8 1 18 10
9 2 20 20
10 3 22 40
11 4 24 5
Use a merge:
df1.merge(df2, how='left', left_on='col1', right_on='x') \
[['col1', 'col2', 'y']] \
.rename(columns={'y': 'col3'})
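To check that the two answers agree, here is a minimal sketch that runs both on the question's data. It uses a Series indexed by x as the lookup (an alternative to the dictionary, not taken from the answers above); the names lookup, mapped and merged are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
                    'col2': np.arange(1, 13) * 2})
df2 = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 40, 5]})

# Lookup table: index is x, values are y.
lookup = df2.set_index('x')['y']
mapped = df1['col1'].map(lookup)

# Left merge keeps every row of df1 in its original order.
merged = (df1.merge(df2, how='left', left_on='col1', right_on='x')
             [['col1', 'col2', 'y']]
             .rename(columns={'y': 'col3'}))
```

The map version only adds one column, while merge is the more general tool when several columns of df2 are needed.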

How to select rows in a DataFrame based on every transition for particular values in a particular column?

I have a DataFrame with an ID column and a Value column that only contains the values 0, 1 and 2. I want to keep only those rows that take part in a transition from 0 to 1 or from 1 to 2 in the Value column. This has to be done for each ID separately.
I tried grouping by ID and applying a difference function, so that I could take the rows for which the difference of values is 1, but it fails in certain cases.
df = df.loc[df['Value'].isin([0,1,2])]
df = df.sort_values(by=['UniqID'])
df['Value'].diff()
Given DataFrame:
Index UniqID Value
1    a    1
2    a    0
3    a    1
4    a    0
5    a    1
6    a    2
7    b    0
8    b    2
9    b    1
10    b    2
11    b    0
12    b    1
13    c    0
14    c    1
15    c    2
16    c    2
Expected Output:
2    a    0
3    a    1
4    a    0
5    a    1
6    a    2
9    b    1
10    b    2
11    b    0
12    b    1
13    c    0
14    c    1
15    c    2
I only expect the rows that take part in a transition from either 0 to 1 or 1 to 2.
Thank you in advance.
Use this solution of mine, which works per group with a list of pattern tuples:
np.random.seed(123)
N = 100
d = {
    'UniqID': np.random.choice(list('abcde'), N),
    'Value': np.random.choice([0,1,2], N),
}
df = pd.DataFrame(d).sort_values('UniqID')
#print (df)
pat = [(0, 1), (1, 2)]
a = np.array(pat)
s = (df.groupby('UniqID')['Value']
       .rolling(2, min_periods=1)
       .apply(lambda x: np.all(x[None :] == a, axis=1).any(), raw=True))
mask = (s.mask(s == 0)
         .groupby(level=0)
         .bfill(limit=1)
         .fillna(0)
         .astype(bool)
         .reset_index(level=0, drop=True))
df = df[mask]
print (df)
UniqID Value
99 a 1
98 a 2
12 a 1
63 a 2
38 a 0
41 a 1
9 a 1
72 a 2
64 b 1
67 b 2
33 b 0
68 b 1
57 b 1
71 b 2
10 b 0
8 b 1
61 c 1
66 c 2
46 c 0
0 c 1
40 c 2
21 d 0
74 d 1
15 d 1
85 d 2
6 d 1
88 d 2
91 d 0
83 d 1
4 d 1
34 d 2
96 d 0
48 d 1
29 d 0
84 d 1
32 e 0
62 e 1
37 e 1
55 e 2
16 e 0
23 e 1
This assumes that a transition is strictly from 1 -> 2 or from 0 -> 1. (This assumption is valid as well.)
Similar sample data:
index,id,value
1,a,1
2,a,0
3,a,1
4,a,0
5,a,1
6,a,2
7,b,0
8,b,2
9,b,1
10,b,2
11,b,0
12,b,1
13,c,0
14,c,1
15,c,2
16,c,2
Load this into a pandas DataFrame, then use the code below:
def grp_trns(x):
    x['dif']=x.value.diff().fillna(0)
    return pd.DataFrame(list(x[x.dif==1]['index']-1)+list(x[x.dif==1]['index']))
target_index=df.groupby('id').apply(lambda x:grp_trns(x)).values.squeeze()
print(df[df['index'].isin(target_index)][['index', 'id','value']])
It gives the desired DataFrame under this assumption:
index id value
1 2 a 0
2 3 a 1
3 4 a 0
4 5 a 1
5 6 a 2
8 9 b 1
9 10 b 2
10 11 b 0
11 12 b 1
12 13 c 0
13 14 c 1
14 15 c 2
Edit: To include the transition 1->0, here is the updated function:
def grp_trns(x):
    x['dif']=x.value.diff().fillna(0)
    index1=list(x[x.dif==1]['index']-1)+list(x[x.dif==1]['index'])
    index2=list(x[(x.dif==-1)&(x.value==0)]['index']-1)+list(x[(x.dif==-1)&(x.value==0)]['index'])
    return pd.DataFrame(index1+index2)
My version uses diff() and shift() to drop all rows whose diff value is equal to 0, 2 or -2:
df = pd.DataFrame({'index':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],'UniqId':['a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c'],'Value':[1,0,1,0,1,2,0,2,1,2,0,1,0,1,2,2]})
df['diff']=np.nan
for element in df['UniqId'].unique():
    # rows of the current group; .loc avoids chained assignment
    rows = df['UniqId']==element
    df.loc[rows, 'diff']=df.loc[rows, 'Value'].diff()
df['diff']=df['diff'].shift(-1)
df=df.loc[(df['diff']!=-2) & (df['diff']!=2) & (df['diff']!=0)]
print(df)
I am still waiting for clarification about the 2-1 and 1-2 transitions.
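For comparison, a shorter sketch based on a per-group shift, assuming (as in the question) that Value only contains 0, 1 and 2, so that a step of exactly +1 means either 0 -> 1 or 1 -> 2. It is not taken from any of the answers above; the names grouped, ends, starts and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'UniqID': list('aaaaaabbbbbbcccc'),
     'Value': [1, 0, 1, 0, 1, 2, 0, 2, 1, 2, 0, 1, 0, 1, 2, 2]},
    index=range(1, 17))

grouped = df.groupby('UniqID')['Value']

# Row ends a transition: previous value in the same group is one less.
ends = (df['Value'] - grouped.shift(1)).eq(1)

# Row starts a transition: next value in the same group is one more.
starts = (grouped.shift(-1) - df['Value']).eq(1)

# Keep every row that starts or ends a 0 -> 1 or 1 -> 2 transition.
result = df[starts | ends]
```

The shifts are done per group, so the last row of one ID is never compared with the first row of the next.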

Taking all duplicate values in column as single value in pandas

My current dataframe is:
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
I want to get a dataframe as:
Name term Grade
0 A 1 35
2 40
1 B 1 50
2 45
Is it possible to get my expected output? If yes, how can I do it?
Use duplicated to build a boolean mask, then numpy.where:
mask = df['Name'].duplicated()
#more general
#mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
The difference between the two masks can be seen on a changed DataFrame:
print (df)
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
4 A 4 43
5 A 3 46
If the same name appears in several separate consecutive groups, like the 2 A groups here, the more general solution is needed:
mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
4 A 4 43
5 3 46
mask = df['Name'].duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
4 4 43
5 3 46
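The two masks side by side, as a small sketch on the Name column of the changed DataFrame. The variable names simple_mask, run_mask, simple and by_run are illustrative.

```python
import numpy as np
import pandas as pd

names = pd.Series(['A', 'A', 'B', 'B', 'A', 'A'], name='Name')

# Flags every repeat of a name anywhere in the column.
simple_mask = names.duplicated()

# Labels each consecutive run (1, 1, 2, 2, 3, 3), then flags repeats within a run.
run_mask = names.ne(names.shift()).cumsum().duplicated()

simple = np.where(simple_mask, '', names)
by_run = np.where(run_mask, '', names)
```

With the simple mask the second A group loses its label entirely, while the run-based mask keeps the first A of each group.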

Pandas assign value of one column based on another

Given the following data frame:
import pandas as pd
df = pd.DataFrame(
{'A':[10,20,30,40,50,60],
'B':[1,2,1,4,5,4]
})
df
A B
0 10 1
1 20 2
2 30 1
3 40 4
4 50 5
5 60 4
I would like a new column 'C' whose values equal those in 'A' where the corresponding value in 'B' is less than 3, and 0 otherwise.
The desired result is as follows:
A B C
0 10 1 10
1 20 2 20
2 30 1 30
3 40 4 0
4 50 5 0
5 60 4 0
Thanks in advance!
Use np.where:
df['C'] = np.where(df['B'] < 3, df['A'], 0)
>>> df
A B C
0 10 1 10
1 20 2 20
2 30 1 30
3 40 4 0
4 50 5 0
5 60 4 0
Here you can use the pandas where method directly on the column:
In [3]:
df['C'] = df['A'].where(df['B'] < 3,0)
df
Out[3]:
A B C
0 10 1 10
1 20 2 20
2 30 1 30
3 40 4 0
4 50 5 0
5 60 4 0
Timings
In [4]:
%timeit df['A'].where(df['B'] < 3,0)
%timeit np.where(df['B'] < 3, df['A'], 0)
1000 loops, best of 3: 1.4 ms per loop
1000 loops, best of 3: 407 µs per loop
np.where is faster here, but pandas where does more checking and has more options, so the choice depends on the use case.
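A minimal sketch showing that both calls produce the same values and differ only in the return type: np.where gives a plain NumPy array, Series.where gives a Series that keeps the index. The names c_np and c_pd are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40, 50, 60],
                   'B': [1, 2, 1, 4, 5, 4]})

# NumPy: pick from A where the condition holds, else 0; returns an ndarray.
c_np = np.where(df['B'] < 3, df['A'], 0)

# pandas: keep A where the condition holds, replace the rest with 0; returns a Series.
c_pd = df['A'].where(df['B'] < 3, 0)

df['C'] = c_pd
```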
