I have a data frame df and I use several columns from it to groupby:
df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()
In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains number of rows in each group. In other words, I have mean but I also would like to know how many number were used to get these means. For example in the first group there are 8 values and in the second one 10 and so on.
In short: How do I get group-wise statistics for a dataframe?
Quick Answer:
The simplest way to get row counts per group is by calling .size(), which returns a Series:
df.groupby(['col1','col2']).size()
Usually you want this result as a DataFrame (instead of a Series) so you can do:
df.groupby(['col1', 'col2']).size().reset_index(name='counts')
If you want to find out how to calculate the row counts and other statistics for each group continue reading below.
Detailed example:
Consider the following example dataframe:
In [2]: df
Out[2]:
col1 col2 col3 col4 col5 col6
0 A B 0.20 -0.61 -0.49 1.49
1 A B -1.53 -1.01 -0.39 1.82
2 A B -0.44 0.27 0.72 0.11
3 A B 0.28 -1.32 0.38 0.18
4 C D 0.12 0.59 0.81 0.66
5 C D -0.13 -1.65 -1.64 0.50
6 C D -1.42 -0.11 -0.18 -0.44
7 E F -0.00 1.42 -0.26 1.17
8 E F 0.91 -0.47 1.35 -0.34
9 G H 1.48 -0.63 -1.14 0.17
First let's use .size() to get the row counts:
In [3]: df.groupby(['col1', 'col2']).size()
Out[3]:
col1 col2
A B 4
C D 3
E F 2
G H 1
dtype: int64
Then let's use .size().reset_index(name='counts') to get the row counts:
In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]:
col1 col2 counts
0 A B 4
1 C D 3
2 E F 2
3 G H 1
Including results for more statistics
When you want to calculate statistics on grouped data, it usually looks like this:
In [5]: (df
...: .groupby(['col1', 'col2'])
...: .agg({
...: 'col3': ['mean', 'count'],
...: 'col4': ['median', 'min', 'count']
...: }))
Out[5]:
col4 col3
median min count mean count
col1 col2
A B -0.810 -1.32 4 -0.372500 4
C D -0.110 -1.65 3 -0.476667 3
E F 0.475 -0.47 2 0.455000 2
G H -0.630 -0.63 1 1.480000 1
The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.
To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:
In [6]: gb = df.groupby(['col1', 'col2'])
...: counts = gb.size().to_frame(name='counts')
...: (counts
...: .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
...: .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
...: .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
...: .reset_index()
...: )
...:
Out[6]:
col1 col2 counts col3_mean col4_median col4_min
0 A B 4 -0.372500 -0.810 -1.32
1 C D 3 -0.476667 -0.110 -1.65
2 E F 2 0.455000 0.475 -0.47
3 G H 1 1.480000 -0.630 -0.63
Footnotes
The code used to generate the test data is shown below:
In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['E', 'F'],
...: ['E', 'F'],
...: ['G', 'H']
...: ])
...:
...: df = pd.DataFrame(
...: np.hstack([keys,np.random.randn(10,4).round(2)]),
...: columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
...: )
...:
...: df[['col3', 'col4', 'col5', 'col6']] = \
...: df[['col3', 'col4', 'col5', 'col6']].astype(float)
...:
Disclaimer:
If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.
On groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:
df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
Swiss Army Knife: GroupBy.describe
Returns count, mean, std, and other useful statistics per-group.
df.groupby(['A', 'B'])['C'].describe()
count mean std min 25% 50% 75% max
A B
bar one 1.0 0.40 NaN 0.40 0.40 0.40 0.40 0.40
three 1.0 2.24 NaN 2.24 2.24 2.24 2.24 2.24
two 1.0 -0.98 NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one 2.0 1.36 0.58 0.95 1.15 1.36 1.56 1.76
three 1.0 -0.15 NaN -0.15 -0.15 -0.15 -0.15 -0.15
two 2.0 1.42 0.63 0.98 1.20 1.42 1.65 1.87
To get specific statistics, just select them,
df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]
count mean
A B
bar one 1.0 0.400157
three 1.0 2.240893
two 1.0 -0.977278
foo one 2.0 1.357070
three 1.0 -0.151357
two 2.0 1.423148
Note: if you only need to compute 1 or 2 stats then it might be
faster to use groupby.agg and just compute those columns otherwise
you are performing wasteful computation.
describe works for multiple columns (change ['C'] to ['C', 'D']—or remove it altogether—and see what happens, the result is a MultiIndexed columned dataframe).
You also get different statistics for string data. Here's an example,
df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)
with pd.option_context('precision', 2):
display(df2.groupby(['A', 'B'])
.describe(include='all')
.dropna(how='all', axis=1))
C D
count mean std min 25% 50% 75% max count unique top freq
A B
bar one 14.0 0.40 5.76e-17 0.40 0.40 0.40 0.40 0.40 14 1 a 14
three 14.0 2.24 4.61e-16 2.24 2.24 2.24 2.24 2.24 14 1 b 14
two 9.0 -0.98 0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98 9 1 c 9
foo one 22.0 1.43 4.10e-01 0.95 0.95 1.76 1.76 1.76 22 2 a 13
three 15.0 -0.15 0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15 15 1 c 15
two 26.0 1.49 4.48e-01 0.98 0.98 1.87 1.87 1.87 26 2 b 15
For more information, see the documentation.
pandas >= 1.1: DataFrame.value_counts
This is available from pandas 1.1 if you just want to capture the size of every group, this cuts out the GroupBy and is faster.
df.value_counts(subset=['col1', 'col2'])
Minimal Example
# Setup
np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df.value_counts(['A', 'B'])
A B
foo two 2
one 2
three 1
bar two 1
three 1
one 1
dtype: int64
Other Statistical Analysis Tools
If you didn't find what you were looking for above, the User Guide has a comprehensive listing of supported statical analysis, correlation, and regression tools.
To get multiple stats, collapse the index, and retain column names:
df = df.groupby(['col1','col2']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
df
Produces:
We can easily do it by using groupby and count. But, we should remember to use reset_index().
df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()
Please try this code
new_column=df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).count()
df['count_it']=new_column
df
I think that code will add a column called 'count it' which count of each group
Create a group object and call methods like below example:
grp = df.groupby(['col1', 'col2', 'col3'])
grp.max()
grp.mean()
grp.describe()
If you are familiar with tidyverse R packages, here is a way to do it in python:
from datar.all import tibble, rnorm, f, group_by, summarise, mean, n, rep
df = tibble(
col1=rep(['A', 'B'], 5),
col2=rep(['C', 'D'], each=5),
col3=rnorm(10),
col4=rnorm(10)
)
df >> group_by(f.col1, f.col2) >> summarise(
count=n(),
col3_mean=mean(f.col3),
col4_mean=mean(f.col4)
)
col1 col2 n mean_col3 mean_col4
0 A C 3 -0.516402 0.468454
1 A D 2 -0.248848 0.979655
2 B C 2 0.545518 -0.966536
3 B D 3 -0.349836 -0.915293
[Groups: ['col1'] (n=2)]
I am the author of the datar package. Please feel free to submit issues if you have any questions about using it.
Another alternative:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df
A B C D
0 foo one 0.808197 2.057923
1 bar one 0.330835 -0.815545
2 foo two -1.664960 -2.372025
3 bar three 0.034224 0.825633
4 foo two 1.131271 -0.984838
5 bar two 2.961694 -1.122788
6 foo one -0.054695 0.503555
7 foo three 0.018052 -0.746912
pd.crosstab(df.A, df.B).stack().reset_index(name='count')
Output:
A B count
0 bar one 1
1 bar three 1
2 bar two 1
3 foo one 2
4 foo three 1
5 foo two 2
Related
I have a data frame df and I use several columns from it to groupby:
df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()
In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains number of rows in each group. In other words, I have mean but I also would like to know how many number were used to get these means. For example in the first group there are 8 values and in the second one 10 and so on.
In short: How do I get group-wise statistics for a dataframe?
Quick Answer:
The simplest way to get row counts per group is by calling .size(), which returns a Series:
df.groupby(['col1','col2']).size()
Usually you want this result as a DataFrame (instead of a Series) so you can do:
df.groupby(['col1', 'col2']).size().reset_index(name='counts')
If you want to find out how to calculate the row counts and other statistics for each group continue reading below.
Detailed example:
Consider the following example dataframe:
In [2]: df
Out[2]:
col1 col2 col3 col4 col5 col6
0 A B 0.20 -0.61 -0.49 1.49
1 A B -1.53 -1.01 -0.39 1.82
2 A B -0.44 0.27 0.72 0.11
3 A B 0.28 -1.32 0.38 0.18
4 C D 0.12 0.59 0.81 0.66
5 C D -0.13 -1.65 -1.64 0.50
6 C D -1.42 -0.11 -0.18 -0.44
7 E F -0.00 1.42 -0.26 1.17
8 E F 0.91 -0.47 1.35 -0.34
9 G H 1.48 -0.63 -1.14 0.17
First let's use .size() to get the row counts:
In [3]: df.groupby(['col1', 'col2']).size()
Out[3]:
col1 col2
A B 4
C D 3
E F 2
G H 1
dtype: int64
Then let's use .size().reset_index(name='counts') to get the row counts:
In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]:
col1 col2 counts
0 A B 4
1 C D 3
2 E F 2
3 G H 1
Including results for more statistics
When you want to calculate statistics on grouped data, it usually looks like this:
In [5]: (df
...: .groupby(['col1', 'col2'])
...: .agg({
...: 'col3': ['mean', 'count'],
...: 'col4': ['median', 'min', 'count']
...: }))
Out[5]:
col4 col3
median min count mean count
col1 col2
A B -0.810 -1.32 4 -0.372500 4
C D -0.110 -1.65 3 -0.476667 3
E F 0.475 -0.47 2 0.455000 2
G H -0.630 -0.63 1 1.480000 1
The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.
To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:
In [6]: gb = df.groupby(['col1', 'col2'])
...: counts = gb.size().to_frame(name='counts')
...: (counts
...: .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
...: .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
...: .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
...: .reset_index()
...: )
...:
Out[6]:
col1 col2 counts col3_mean col4_median col4_min
0 A B 4 -0.372500 -0.810 -1.32
1 C D 3 -0.476667 -0.110 -1.65
2 E F 2 0.455000 0.475 -0.47
3 G H 1 1.480000 -0.630 -0.63
Footnotes
The code used to generate the test data is shown below:
In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['E', 'F'],
...: ['E', 'F'],
...: ['G', 'H']
...: ])
...:
...: df = pd.DataFrame(
...: np.hstack([keys,np.random.randn(10,4).round(2)]),
...: columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
...: )
...:
...: df[['col3', 'col4', 'col5', 'col6']] = \
...: df[['col3', 'col4', 'col5', 'col6']].astype(float)
...:
Disclaimer:
If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.
On groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:
df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
Swiss Army Knife: GroupBy.describe
Returns count, mean, std, and other useful statistics per-group.
df.groupby(['A', 'B'])['C'].describe()
count mean std min 25% 50% 75% max
A B
bar one 1.0 0.40 NaN 0.40 0.40 0.40 0.40 0.40
three 1.0 2.24 NaN 2.24 2.24 2.24 2.24 2.24
two 1.0 -0.98 NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one 2.0 1.36 0.58 0.95 1.15 1.36 1.56 1.76
three 1.0 -0.15 NaN -0.15 -0.15 -0.15 -0.15 -0.15
two 2.0 1.42 0.63 0.98 1.20 1.42 1.65 1.87
To get specific statistics, just select them,
df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]
count mean
A B
bar one 1.0 0.400157
three 1.0 2.240893
two 1.0 -0.977278
foo one 2.0 1.357070
three 1.0 -0.151357
two 2.0 1.423148
Note: if you only need to compute 1 or 2 stats then it might be
faster to use groupby.agg and just compute those columns otherwise
you are performing wasteful computation.
describe works for multiple columns (change ['C'] to ['C', 'D']—or remove it altogether—and see what happens, the result is a MultiIndexed columned dataframe).
You also get different statistics for string data. Here's an example,
df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)
with pd.option_context('precision', 2):
display(df2.groupby(['A', 'B'])
.describe(include='all')
.dropna(how='all', axis=1))
C D
count mean std min 25% 50% 75% max count unique top freq
A B
bar one 14.0 0.40 5.76e-17 0.40 0.40 0.40 0.40 0.40 14 1 a 14
three 14.0 2.24 4.61e-16 2.24 2.24 2.24 2.24 2.24 14 1 b 14
two 9.0 -0.98 0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98 9 1 c 9
foo one 22.0 1.43 4.10e-01 0.95 0.95 1.76 1.76 1.76 22 2 a 13
three 15.0 -0.15 0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15 15 1 c 15
two 26.0 1.49 4.48e-01 0.98 0.98 1.87 1.87 1.87 26 2 b 15
For more information, see the documentation.
pandas >= 1.1: DataFrame.value_counts
This is available from pandas 1.1 if you just want to capture the size of every group, this cuts out the GroupBy and is faster.
df.value_counts(subset=['col1', 'col2'])
Minimal Example
# Setup
np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df.value_counts(['A', 'B'])
A B
foo two 2
one 2
three 1
bar two 1
three 1
one 1
dtype: int64
Other Statistical Analysis Tools
If you didn't find what you were looking for above, the User Guide has a comprehensive listing of supported statical analysis, correlation, and regression tools.
To get multiple stats, collapse the index, and retain column names:
df = df.groupby(['col1','col2']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
df
Produces:
We can easily do it by using groupby and count. But, we should remember to use reset_index().
df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()
Please try this code
new_column=df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).count()
df['count_it']=new_column
df
I think that code will add a column called 'count it' which count of each group
Create a group object and call methods like below example:
grp = df.groupby(['col1', 'col2', 'col3'])
grp.max()
grp.mean()
grp.describe()
If you are familiar with tidyverse R packages, here is a way to do it in python:
from datar.all import tibble, rnorm, f, group_by, summarise, mean, n, rep
df = tibble(
col1=rep(['A', 'B'], 5),
col2=rep(['C', 'D'], each=5),
col3=rnorm(10),
col4=rnorm(10)
)
df >> group_by(f.col1, f.col2) >> summarise(
count=n(),
col3_mean=mean(f.col3),
col4_mean=mean(f.col4)
)
col1 col2 n mean_col3 mean_col4
0 A C 3 -0.516402 0.468454
1 A D 2 -0.248848 0.979655
2 B C 2 0.545518 -0.966536
3 B D 3 -0.349836 -0.915293
[Groups: ['col1'] (n=2)]
I am the author of the datar package. Please feel free to submit issues if you have any questions about using it.
Another alternative:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df
A B C D
0 foo one 0.808197 2.057923
1 bar one 0.330835 -0.815545
2 foo two -1.664960 -2.372025
3 bar three 0.034224 0.825633
4 foo two 1.131271 -0.984838
5 bar two 2.961694 -1.122788
6 foo one -0.054695 0.503555
7 foo three 0.018052 -0.746912
pd.crosstab(df.A, df.B).stack().reset_index(name='count')
Output:
A B count
0 bar one 1
1 bar three 1
2 bar two 1
3 foo one 2
4 foo three 1
5 foo two 2
I have few lists and a dictionary and would like to create a pd dataframe.
Could someone help me out, I seem to be missing something:
one simple example bellow:
dict={"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
Using series I would do like this:
df = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
and would have the lists within the df as expected
for dict would do
df = pd.DataFrame(list(dic.items()), columns=['col3', 'col4'])
And would expect this result:
col1 col2 col3 col4
1 x a 1
2 y b 3
3 c text1
4
The problem is like this the first df would be overwritten by the second call of pd.Dataframe
How would I do this to have only one df with 4 columns?
I know one way would be to split the dict in 2 separate lists and just use Series over 4 lists, but I would think there is a better way to do this, out of 2 lists and 1 dict as above to have directly one df with 4 columns.
thanks for the help
you can also use pd.concat to concat two dataframe.
df1 = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
df2 = pd.DataFrame(list(dic.items()), columns=['col3', 'col4'])
df = pd.concat([df1, df2], axis=1)
Why not build each column seperately via dict.keys() and dict.values() instead of using dict.items()
df = pd.DataFrame({
'col1': pd.Series(l1),
'col2': pd.Series(l3),
'col3': pd.Series(dict.keys()),
'col4': pd.Series(dict.values())
})
print(df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
Alternatively:
column_values = [l1, l3, dict.keys(), dict.values()]
data = {f"col{i}": pd.Series(values) for i, values in enumerate(column_values)}
df = pd.DataFrame(data)
print(df)
col0 col1 col2 col3
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
You can unpack zipped values of list generated from d.items() and pass to itertools.zip_longest for add missing values for match by maximum length of list:
#dict is python code word, so used d for variable
d={"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
df = pd.DataFrame(zip_longest(l1, l3, *zip(*d.items()),
fillvalue=np.nan),
columns=['col1','col2','col3','col4'])
print (df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
Dataframe:
col1 col2
A 0
A 1
A nan
B 0
B 1
C and so on...
I am trying to change 1 to 0, 0 to 1 and nan stays as such in col2 wherever col1=='A'.
Code so far:
df.loc[(df.col1=='A') & (df.col2==0),'col2'] = 2
df.loc[(df.col1=='A') & (df.col2==1),'col2'] = 0
df.loc[(df.col1=='A') & (df.col2==2),'col2'] = 1
# Hope you understand why I am converting 0 to 2 first then to 1.
# Because if I convert all zeroes to 1 then all 1's will be converted to
# 0 in subsequent conversion.
Unique values in col2 are 0,1 and nan.
Is there a correct/better way of doing this?
Also, is there a way to directly swap these numbers instead of assignment operators?
One solution using Series.where and astype(bool) with ~ (NOT operator) and then back to astype(int). Then use loc with boolean indexing to assign back to DataFrame:
df.loc[df.col1.eq('A'), 'col2'] = df.col2.where(df.col2.isna(),
(~df.col2.astype(bool)).astype(int))
[out]
col1 col2
0 A 1.0
1 A 0.0
2 A NaN
3 B 0.0
4 B 1.0
5 C NaN
You can also try with df.mask():
m=df.col1.eq('A')&df.col2.isna() #condition
df.col2=1-df.col2.mask(m)
print(df)
col1 col2
0 A 1.0
1 A 0.0
2 A NaN
3 B 1.0
4 B 0.0
I am trying to change 1 to 0, 0 to 1 and nan stays as such in col2
wherever col1=='A'.
use np.where
df['col2] = np.where(df['col1'] == 'A', np.where(df['col2'] == 1, 0 , np.where(df['col2'].isnull() == True, df['col2'],1)),df['col2'])
Output
col1 col2
0 A 1.0
1 A 0.0
2 A NaN
3 B 0.0
4 B 1.0
5 C 0.0
In this case, you can also use your own function in combination with apply().
# import pandas
import pandas as pd
# make a sample data
list_of_rows = [
{'col1': A, 'col2': 1},
{'col1': A, 'col2': 0},
{'col1': A, 'col2': None},
{'col1': B, 'col2': 0},
{'col1': B, 'col2': 1},
{'col1': B, 'col2': None},
]
# make a pandas data frame
df = pd.DataFrame(list_of_rows)
# define a function
def change_values(row):
if row['col2'] == 0:
return 1
if row['col2'] == 1:
return 0
return row['col2']
# apply function to dataframe
df['col2'] = df.apply(lambda row: change_values(row), axis=1)
I have two dataframes:
df1 = pd.DataFrame([['ida', 1], ['idb', 2], ['idc', 3]], columns=['A','B'])
df2 = pd.DataFrame([['idb', 20], ['ida', 10], ['idb', 21], ['idb', 22]], columns=['A', 'C'])
and I would like to append the data from df2to df1into a list:
df3 =
|A B C
---------------
0 |ida 1 [10]
1 |idb 2 [20, 21, 22]
2 |idc 3 NaN
I can merge both frames:
df1.merge(df2, how='left')
A B C
0 ida 1 10.0
1 idb 2 20.0
2 idb 2 21.0
3 idb 2 22.0
4 idc 3 NaN
But then how do I "merge" matching rows? Also, in reality df2 is a lot larger and I only want to copy the columns "C", not columns "D", "E", "F"...
Alternatively, I can create a new column in df1 and then iterate over df2 to fill it:
for n, row in df2.iterrows():
idx = df1.index[row['A'] == df1['A']]
for i in idx: # hopefully only 1 or 0 values in idx
<assign value> df1.at[i, 'A'] = ???
The reason I want to have lists is that there is a flexible number of 'C'-values and I later want to calculate the average, standard deviation, ...
Edit: Typo
With version 0.24.x upwards of pandas you can use:
import numpy as np
import pandas as pd
df3 = (df1.merge(
df2.groupby('A')['C'].apply(np.array),
how='left',
left_on='A',
right_index=True))
And for your summary statistics:
df3['C'].apply(lambda x: np.std(x))
df3['C'].apply(lambda x: np.mean(x))
This is a perfect example of merging and after that groupby with applying the list function like the following:
# Merge on key columns A
df3 = pd.merge(df1, df2, on='A', how='outer')
# Output1
A B C
0 ida 1 10.0
1 idb 2 20.0
2 idb 2 21.0
3 idb 2 22.0
4 idc 3 NaN
# Groupby and apply list to keep values
df_final = df3.groupby('A').C.apply(list).reset_index()
A C
0 ida [10.0]
1 idb [20.0, 21.0, 22.0]
2 idc [nan]
EDIT:
If you want to only bring certain columns after a merge you can do the following:
df3 = pd.merge(df1, df2[['A', 'C']], on='A', how='outer')
I have a table below
I was trying to create an additional column to count if Std_1,Std_2 and Std_3 greater than its mean value.
for example, for ACCMGR Row, only Std_2 is greater than the average, so the new column should be 1.
Not sure how to do it.
You need to be a bit careful with how you specify the axes, but you can just use .gt + .mean + .sum
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame({'APPL': ['ACCMGR', 'ACCOUNTS', 'ADVISOR', 'AUTH', 'TEST'],
'Std_1': [106.875, 121.703, np.NaN, 116.8585, 1],
'Std_2': [130.1899, 113.4927, np.NaN, 112.4486, 4],
'Std_3': [107.186, 114.5418, np.NaN, 115.2699, np.NaN]})
Code
df = df.set_index('APPL')
df['cts'] = df.gt(df.mean(axis=1), axis=0).sum(axis=1)
df = df.reset_index()
Output:
APPL Std_1 Std_2 Std_3 cts
0 ACCMGR 106.8750 130.1899 107.1860 1
1 ACCOUNTS 121.7030 113.4927 114.5418 1
2 ADVISOR NaN NaN NaN 0
3 AUTH 116.8585 112.4486 115.2699 2
4 TEST 1.0000 4.0000 NaN 1
Considered dataframe
quantity price
0 6 1.45
1 3 1.85
2 2 2.25
apply lambda function on axis =1 , for each series of row check the column of value greater than mean and get the index of column
df.apply(lambda x:df.columns.get_loc(x[x>np.mean(x)].index[0]),axis=1)
Out:
quantity price > than mean
0 6 1.45 0
1 3 1.85 0
2 2 2.25 1