The docs show how to apply multiple functions at once to a groupby object, using a dict with the output column names as the keys:
In [563]: grouped['D'].agg({'result1' : np.sum,
   .....:                   'result2' : np.mean})
   .....:
Out[563]:
result2 result1
A
bar -0.579846 -1.739537
foo -0.280588 -1.402938
However, this only works on a SeriesGroupBy object. When a dict is passed to a DataFrame groupby in the same way, it expects the keys to be the column names that the functions will be applied to.
What I want to do is apply multiple functions to several columns (and certain columns will be operated on multiple times). Some functions will also depend on other columns in the groupby object (like sumif functions). My current solution is to go column by column, doing something like the code above, with lambdas for functions that depend on other rows. But this is taking a long time (I think it takes a long time to iterate through a groupby object). I'll have to change it so that I iterate through the whole groupby object in a single run, but I'm wondering if there's a built-in way in pandas to do this somewhat cleanly.
For example, I've tried something like
grouped.agg({'C_sum' : lambda x: x['C'].sum(),
             'C_std' : lambda x: x['C'].std(),
             'D_sum' : lambda x: x['D'].sum(),
             'D_sumifC3' : lambda x: x['D'][x['C'] == 3].sum(), ...})
but as expected I get a KeyError (since the keys have to be column names when agg is called on a DataFrame).
Is there any built in way to do what I'd like to do, or a possibility that this functionality may be added, or will I just need to iterate through the groupby manually?
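For reference, a rough sketch of the single-pass iteration I'd otherwise fall back to (using the hypothetical column names from my attempt above):

import pandas as pd

# One pass over the groups, computing everything at once
# ('A' is the grouping key; C and D are as in my attempt above).
rows = {}
for key, g in df.groupby('A'):
    rows[key] = {'C_sum': g['C'].sum(),
                 'C_std': g['C'].std(),
                 'D_sum': g['D'].sum(),
                 'D_sumifC3': g.loc[g['C'] == 3, 'D'].sum()}
result = pd.DataFrame.from_dict(rows, orient='index')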
The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.
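For reference, here is the removed pattern and the error it now produces; a minimal sketch (recent pandas exposes the exception as pandas.errors.SpecificationError):

import pandas as pd

df = pd.DataFrame({'group': [0, 0, 1, 1], 'a': [1.0, 2.0, 3.0, 4.0]})

# Dict of dicts to aggregate and rename in one step: removed in 0.25+.
try:
    df.groupby('group').agg({'a': {'a_sum': 'sum', 'a_max': 'max'}})
except pd.errors.SpecificationError as exc:
    print(exc)  # nested renamer is not supported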
If you need to work with two separate columns at the same time, I suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a DataFrame similar to the one above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df
a b c d group
0 0.418500 0.030955 0.874869 0.145641 0
1 0.446069 0.901153 0.095052 0.487040 0
2 0.843026 0.936169 0.926090 0.041722 1
3 0.635846 0.439175 0.828787 0.714123 1
A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.
df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': lambda x: x.max() - x.min()})
a b c d
sum max mean sum <lambda>
group
0 0.864569 0.446069 0.466054 0.969921 0.341399
1 1.478872 0.843026 0.687672 1.754877 0.672401
If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:
def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': max_min})
a b c d
sum max mean sum Max minus Min
group
0 0.864569 0.446069 0.466054 0.969921 0.341399
1 1.478872 0.843026 0.687672 1.754877 0.672401
Using apply and returning a Series
Now, if you have multiple columns that need to interact together, then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the entire group gets passed into the function as a DataFrame.
I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:
def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)
a_sum a_max b_mean c_d_prodsum
group
0 0.864569 0.446069 0.466054 0.173711
1 1.478872 0.843026 0.687672 0.630494
If you are in love with MultiIndexes, you can still return a Series with one like this:
def f_mi(x):
    d = []
    d.append(x['a'].sum())
    d.append(x['a'].max())
    d.append(x['b'].mean())
    d.append((x['c'] * x['d']).sum())
    return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                               ['sum', 'max', 'mean', 'prodsum']])

df.groupby('group').apply(f_mi)
a b c_d
sum max mean prodsum
group
0 0.864569 0.446069 0.466054 0.173711
1 1.478872 0.843026 0.687672 0.630494
For the first part, you can pass a dict with the column names as keys and a list of functions as the values:
In [28]: df
Out[28]:
A B C D E GRP
0 0.395670 0.219560 0.600644 0.613445 0.242893 0
1 0.323911 0.464584 0.107215 0.204072 0.927325 0
2 0.321358 0.076037 0.166946 0.439661 0.914612 1
3 0.133466 0.447946 0.014815 0.130781 0.268290 1
In [26]: f = {'A':['sum','mean'], 'B':['prod']}
In [27]: df.groupby('GRP').agg(f)
Out[27]:
A B
sum mean prod
GRP
0 0.719580 0.359790 0.102004
1 0.454824 0.227412 0.034060
UPDATE 1:
Because the aggregate function works on Series, references to the other column names are lost. To get around this, you can reference the full dataframe and index it using the group indices within the lambda function.
Here's a hacky workaround:
In [67]: f = {'A':['sum','mean'], 'B':['prod'], 'D': lambda g: df.loc[g.index].E.sum()}
In [69]: df.groupby('GRP').agg(f)
Out[69]:
A B D
sum mean prod <lambda>
GRP
0 0.719580 0.359790 0.102004 1.170219
1 0.454824 0.227412 0.034060 1.182901
Here, the resultant 'D' column is made up of the summed 'E' values.
UPDATE 2:
Here's a method that I think will do everything you ask. First make a custom lambda function. Below, g references the group. When aggregating, g will be a Series. Passing g.index to df.loc[] selects the current group from df. I then test if column C is less than 0.5. The returned boolean Series is passed to g[], which selects only those rows meeting the criteria.
In [95]: cust = lambda g: g[df.loc[g.index]['C'] < 0.5].sum()
In [96]: f = {'A':['sum','mean'], 'B':['prod'], 'D': {'my name': cust}}
In [97]: df.groupby('GRP').agg(f)
Out[97]:
A B D
sum mean prod my name
GRP
0 0.719580 0.359790 0.102004 0.204072
1 0.454824 0.227412 0.034060 0.570441
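Since the dict-of-dicts renaming in that last snippet is exactly what was later removed, here is a sketch of the same aggregation rewritten with named aggregation (pandas >= 0.25, covered in the answers below); the output names are my own choices:

# The cross-column filter still works by indexing the full frame
# through g.index, just as in the lambda above.
df.groupby('GRP').agg(
    A_sum=('A', 'sum'),
    A_mean=('A', 'mean'),
    B_prod=('B', 'prod'),
    my_name=('D', lambda g: g[df.loc[g.index, 'C'] < 0.5].sum()),
)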
Pandas >= 0.25.0, named aggregations
Since pandas version 0.25.0, we are moving away from dictionary-based aggregation and renaming, and moving toward named aggregation, which accepts tuples. Now we can simultaneously aggregate and rename to more informative column names:
Example:
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
a b c d group
0 0.521279 0.914988 0.054057 0.125668 0
1 0.426058 0.828890 0.784093 0.446211 0
2 0.363136 0.843751 0.184967 0.467351 1
3 0.241012 0.470053 0.358018 0.525032 1
Apply GroupBy.agg with named aggregation:
df.groupby('group').agg(
    a_sum=('a', 'sum'),
    a_mean=('a', 'mean'),
    b_mean=('b', 'mean'),
    c_sum=('c', 'sum'),
    d_range=('d', lambda x: x.max() - x.min())
)
a_sum a_mean b_mean c_sum d_range
group
0 0.947337 0.473668 0.871939 0.838150 0.320543
1 0.604149 0.302074 0.656902 0.542985 0.057681
As an alternative (mostly aesthetic) to Ted Petrou's answer, I found I preferred a slightly more compact listing. Please don't consider accepting it; it's just a much-more-detailed comment on Ted's answer, plus code/data. Python/pandas is not my first/best language, but I found this to read well:
df.groupby('group') \
  .apply(lambda x: pd.Series({
      'a_sum': x['a'].sum(),
      'a_max': x['a'].max(),
      'b_mean': x['b'].mean(),
      'c_d_prodsum': (x['c'] * x['d']).sum()
  }))
a_sum a_max b_mean c_d_prodsum
group
0 0.530559 0.374540 0.553354 0.488525
1 1.433558 0.832443 0.460206 0.053313
I find it more reminiscent of dplyr pipes and data.table chained commands. Not to say they're better, just more familiar to me. (I certainly recognize the power of, and for many the preference for, more formalized def functions for these types of operations. This is just an alternative, not necessarily better.)
I generated data in the same manner as Ted, but I'll add a seed for reproducibility.
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df
a b c d group
0 0.374540 0.950714 0.731994 0.598658 0
1 0.156019 0.155995 0.058084 0.866176 0
2 0.601115 0.708073 0.020584 0.969910 1
3 0.832443 0.212339 0.181825 0.183405 1
New in version 0.25.0.
To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where
The keywords are the output column names
The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.
>>> animals = pd.DataFrame({
... 'kind': ['cat', 'dog', 'cat', 'dog'],
... 'height': [9.1, 6.0, 9.5, 34.0],
... 'weight': [7.9, 7.5, 9.9, 198.0]
... })
>>> print(animals)
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
>>> print(
... animals
... .groupby('kind')
... .agg(
... min_height=pd.NamedAgg(column='height', aggfunc='min'),
... max_height=pd.NamedAgg(column='height', aggfunc='max'),
... average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
... )
... )
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
pandas.NamedAgg is just a namedtuple. Plain tuples are allowed as well.
>>> print(
... animals
... .groupby('kind')
... .agg(
... min_height=('height', 'min'),
... max_height=('height', 'max'),
... average_weight=('weight', np.mean),
... )
... )
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
Additional keyword arguments are not passed through to the aggregation functions. Only pairs of (column, aggfunc) should be passed as **kwargs. If your aggregation functions require additional arguments, partially apply them with functools.partial().
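For example, a minimal sketch of pre-binding an extra argument with functools.partial (the 0.9 quantile and the p90 name are illustrative choices, reusing the animals frame from above):

from functools import partial

# agg will not forward q=0.9 to the aggregation function itself,
# so bind the argument ahead of time.
p90 = partial(pd.Series.quantile, q=0.9)

animals.groupby('kind').agg(height_p90=('height', p90))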
Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.
>>> print(
... animals
... .groupby('kind')
... .height
... .agg(
... min_height='min',
... max_height='max',
... )
... )
min_height max_height
kind
cat 9.1 9.5
dog 6.0 34.0
This is a twist on exans' answer, which uses named aggregations. It's the same, but with argument unpacking, which allows you to still pass a dictionary to the agg function.
Named aggregations are a nice feature, but at first glance they might seem hard to write programmatically since they use keywords; it's actually simple with argument/keyword unpacking.
import numpy as np
import pandas as pd

animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                        'height': [9.1, 6.0, 9.5, 34.0],
                        'weight': [7.9, 7.5, 9.9, 198.0]})

agg_dict = {
    "min_height": pd.NamedAgg(column='height', aggfunc='min'),
    "max_height": pd.NamedAgg(column='height', aggfunc='max'),
    "average_weight": pd.NamedAgg(column='weight', aggfunc=np.mean)
}

animals.groupby("kind").agg(**agg_dict)
The Result
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
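And since the spec is now just a plain dict, it can be generated programmatically; a small sketch with arbitrary (column, function) pairs:

# Build the keyword dict from (column, aggfunc) pairs.
specs = [('height', 'min'), ('height', 'max'), ('weight', 'mean')]
agg_dict = {f'{col}_{func}': pd.NamedAgg(column=col, aggfunc=func)
            for col, func in specs}

animals.groupby('kind').agg(**agg_dict)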
Ted's answer is amazing. I ended up using a smaller version of that in case anyone is interested. Useful when you are looking for one aggregation that depends on values from multiple columns:
create a dataframe
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': [1, 1, 0, 1, 1, 0],
    'c': ['x', 'x', 'y', 'y', 'z', 'z']
})
print(df)
a b c
0 1 1 x
1 2 1 x
2 3 0 y
3 4 1 y
4 5 1 z
5 6 0 z
grouping and aggregating with apply (using multiple columns)
print(
    df
    .groupby('c')
    .apply(lambda x: x['a'][(x['a'] > 1) & (x['b'] == 1)].mean())
)
c
x 2.0
y 4.0
z 5.0
grouping and aggregating with aggregate (using multiple columns)
I like this approach since I can still use aggregate. Perhaps someone can tell me why apply is needed for getting at multiple columns when doing aggregations on groups.
It seems obvious now, but as long as you don't select the column of interest directly after the groupby, you will have access to all the columns of the dataframe from within your aggregation function.
only access to the selected column
df.groupby('c')['a'].aggregate(lambda x: x[x > 1].mean())
access to all columns since selection is after all the magic
df.groupby('c').aggregate(lambda x: x[(x['a'] > 1) & (x['b'] == 1)].mean())['a']
or similarly
df.groupby('c').aggregate(lambda x: x['a'][(x['a'] > 1) & (x['b'] == 1)].mean())
I hope this helps.
In JavaScript we have +number which converts a string into a number based on its value, as shown below:
x = '123' => +x returns 123
y = '12.3' => +y returns 12.3
Now if I use int in Python:
x = '123' => int(x) returns 123
y = '12.3' => int(y) raises a ValueError (and int(float(y)) truncates to 12), which is wrong as it should return 12.3
Whereas if I use float:
x = '123' => float(x) returns 123.0, which is wrong as it should return just 123. I know 123.0 is the same as 123, but I want to pass the result to another query language that distinguishes the numbers as described above.
y = '12.3' => float(y) returns 12.3
If, for some reason, you do not want to use literal_eval, I want to offer another solution using regex. There isn't a particularly "pretty" built-in method in Python to convert str representations of ints and floats to int and float respectively.
import re

def conv(n):
    if isinstance(n, str) and len(n) > 0:
        if re.match(r"^-?[0-9]+$", n):
            return int(n)
        elif re.match(r"^-?[0-9]+(\.|e)[0-9]+$", n):
            return float(n)
    return n
>>> conv("-123")
-123
>>> conv("123.4")
123.4
>>> conv("2e2")
200.0
JavaScript doesn't have a separate type for integers the way other languages (like C) do, so 123 in JS is really 123.0, stored as a double but displayed without the decimals. (There's BigInt if you want to dig further, but that would have been 123n.)
In Python, you can use literal_eval to get the numeric (literal) value from a string representation:
import ast
ast.literal_eval("123") # -> 123 (int)
ast.literal_eval("123.") # -> 123.0 (float)
ast.literal_eval("123.0") # -> 123.0 (float)
I have a training dataset with 43 attributes. Each attribute has some values stored as objects (strings containing certain characters).
Now, I'm trying to scale the values using a scaler, but it gives the following error:
could not convert string to float: '?'
Now, I don't know how to convert the objects to int or float in a single command, and converting each of the 43 attributes one by one is a bit tedious.
So I want to know how to do it for the complete dataset with a single command.
I use a convert function which tries to parse the string as an int.
If it cannot, it tries to parse it as a float, and if it still cannot, it assigns the value 0 (you can change the default used when the string is neither an int nor a float to something else):
l = []

def convert(s):
    x = 0
    try:
        x = int(s)
    except ValueError:
        try:
            x = float(s)
        except ValueError:
            pass
    l.append(x)

for i in ['1', '2', '3', '?', '4.5']:
    convert(i)

print(l)
# [1, 2, 3, 0, 4.5]
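Since the question mentions 43 columns in a dataset, it may be worth noting that pandas can do the whole table in one line; a sketch assuming the data is in a DataFrame df, with pd.to_numeric coercing unparseable tokens like '?' to NaN and fillna supplying the same default of 0:

import pandas as pd

df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['?', '4.5', '6']})

# Coerce every column: bad tokens become NaN, then fall back to 0.
df_numeric = df.apply(pd.to_numeric, errors='coerce').fillna(0)
print(df_numeric)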
My goal is to convert this list of strings to a Numpy Array.
I want to convert the first 2 columns to numerical data (integer)
import numpy as np

list1 = [['380850', '625105', 'Dota 2'],
         ['354804', '846193', "PLAYERUNKNOWN'S BATTLEGROUNDS"],
         ['204354', '467109', 'Counter-Strike: Global Offensive']]
dt = np.dtype('i,i,U')
cast_array = np.array([tuple(row) for row in list1], dtype=dt)
print(cast_array)
The result is ...
[OUT] [(380850, 625105, '') (354804, 846193, '') (204354, 467109, '')]
I am losing the string data. I am interested in:
Understanding why the string data is getting dropped
Finding any solution that converts the first 2 columns to integer type in a numpy array
This answer gave me the approach but doesn't seem to work for strings
Thanks to user 9769953's comment above, this is the solution.
# When specifying strings you need to specify the length
# (derived from the longest string in the list).
dtypestr = 'int, int, U' + str(max(len(i[2]) for i in list1))
cast_array = np.array([tuple(row) for row in list1], dtype=dtypestr)
print(cast_array)
The simplest way to do this at a high level is to use pandas, as said in the comments, which will silently manage the tricky problems:
In [64]: df=pd.DataFrame(list1)
In [65]: df2=df.apply(pd.to_numeric,errors='ignore')
In [66]: df2
Out[66]:
0 1 2
0 380850 625105 Dota 2
1 354804 846193 PLAYERUNKNOWN'S BATTLEGROUNDS
2 204354 467109 Counter-Strike: Global Offensive
In [67]: df2.dtypes
Out[67]:
0 int64
1 int64
2 object
dtype: object
df2.iloc[:, :2].values will be the numpy array; you can use all numpy accelerations on that part.
Your dtype is not what you expect it to be - you're running into https://github.com/numpy/numpy/issues/8969:
>>> dt = np.dtype('i,i,U')
>>> dt
dtype([('f0', '<i4'), ('f1', '<i4'), ('f2', '<U')])
>>> dt['f2'].itemsize
0 # 0-length strings!
You need to either specify a maximum number of characters
>>> dt = np.dtype('i,i,U16')
Or use an object type to store variable length strings:
>>> dt = np.dtype('i,i,O')
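As a quick sanity check against the data in the question (a sketch; U32 happens to be wide enough for the longest title here, while 'O' avoids picking a width at all):

import numpy as np

list1 = [['380850', '625105', 'Dota 2'],
         ['354804', '846193', "PLAYERUNKNOWN'S BATTLEGROUNDS"],
         ['204354', '467109', 'Counter-Strike: Global Offensive']]

dt = np.dtype('i,i,U32')  # wide enough for the longest title above
cast = np.array([tuple(row) for row in list1], dtype=dt)
print(cast['f2'])  # the string column now survives intact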
I want to use random.uniform to generate a float in, say, [-2, 2], but never generate 0. This is how I do it in a loop:
from random import uniform

flag = True
while flag:
    if uniform(-2, 2) is not 0:
        flag = False
I am wondering, is there a better way to do it?
Cheers
This is more something for Code Review, but very briefly:
from random import uniform

while True:
    if uniform(-2, 2) != 0.0:
        break
is probably the more Pythonic / standard way to do this (standard, as in that this pattern occurs in other languages as well).
It's rare that a flag variable is necessary to break out of a (while) loop. Perhaps when using a double loop.
Note: I changed your is not to !=, and your 0 to 0.0 (the latter is more so that it's clear we're comparing a float to a float).
You're comparing a float to an int, so they'll never be the same object. Besides, comparing numbers using is is a bad idea:
>>> 2*3 is 6 # this may work, but don't rely on it
True
>>> 10*60 is 600 # this obviously doesn't work
False
>>> 0 is 0 # sure, this works...
True
>>> 0.0 is 0 # but this doesn't: float vs int
False
Of course, to answer the actual question of whether there are other ways to generate those random numbers: probably a dozen.
With a list comprehension inside a list comprehension*:
[val for val in [uniform(-2, 2) for i in range(10)] if val != 0]
Using numpy:
import numpy as np

vals = np.random.uniform(-2, 2, 10)
vals = vals[vals != 0]
* I don't want to call it nested, since I feel that belongs to a slightly different double list comprehension.
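For completeness, here is the rejection idea packaged as a function (a sketch; the nonzero_uniform name is mine, and since hitting exactly 0.0 has probability near zero, the loop almost never repeats):

from random import uniform

def nonzero_uniform(a=-2.0, b=2.0):
    """Draw from uniform(a, b), redrawing in the rare case of exactly 0.0."""
    x = uniform(a, b)
    while x == 0.0:
        x = uniform(a, b)
    return x

print(nonzero_uniform())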