I have a dataframe below:
df = pd.DataFrame({'Product': ['A', 'A', 'C', 'D'], 'Volume': ['-3', '3', '1', '5']})
I am using groupby and sum.
final = df.groupby(['Product'])['Volume'].sum().reset_index()
print(final)
This is ok.
But I only want to print the rows where the sum != 0, like Product C and D.
How can I do that?
I tried to use:
if final != 0:
    print(final)
But this throws an error, and usually when I get this error my syntax is definitely wrong...
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Your data frame has Volume as strings; is that intended? If you want to sum it like numbers, you have to convert it to numbers first, and then you can apply the filter.
df = pd.DataFrame({'Product': ['A', 'A', 'C', 'D'], 'Volume': ['-3', '3', '1', '5']})
# convert from string to integers
df.Volume = df.Volume.astype(int)
final = df.groupby(['Product'])['Volume'].sum().reset_index()
# choose the ones with a nonzero sum
print(final[final.Volume != 0])
It will print only the C and D rows.
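For messier input (blanks, stray characters), `pd.to_numeric` with `errors='coerce'` is a common alternative to casting each element; unparseable entries become NaN instead of raising. A sketch of the same pipeline:

```python
import pandas as pd

df = pd.DataFrame({'Product': ['A', 'A', 'C', 'D'], 'Volume': ['-3', '3', '1', '5']})

# to_numeric turns unparseable entries into NaN instead of raising
df['Volume'] = pd.to_numeric(df['Volume'], errors='coerce')

final = df.groupby('Product')['Volume'].sum().reset_index()
nonzero = final[final['Volume'] != 0]
print(nonzero)
```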
Given,
import pandas as pd
df = pd.DataFrame({'Product': ['A', 'A', 'C', 'D'], 'Volume': [-3, 3, 1, 5]})
final = df.groupby(['Product'])['Volume'].sum().reset_index()
Use boolean selection to keep only the rows that match your criteria: df[some_series_of_booleans_based_on_condition]
print(final[final['Volume'] != 0])
#output:
Product Volume
1 C 1
2 D 5
The idea is that if [some series of booleans]: doesn't make sense for Python to interpret, so pandas raises the error you saw rather than guessing; note that it is a runtime ValueError, not a syntax error.
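To make that concrete, here is a minimal reproduction (names are illustrative): a boolean Series raises inside an `if`, while boolean indexing consumes the whole mask:

```python
import pandas as pd

s = pd.Series([1, 0, 5])
mask = s != 0  # a Series of booleans, not a single True/False

# `if mask:` raises, because pandas refuses to guess
# whether you mean mask.any() or mask.all()
try:
    if mask:
        pass
except ValueError:
    ambiguous = True

# boolean indexing uses the whole mask element-wise instead
selected = s[mask]
print(list(selected))
```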
I have a pandas series that looks like this:
import numpy as np
import string
import pandas as pd
np.random.seed(0)
data = np.random.randint(1,6,10)
index = list(string.ascii_lowercase)[:10]
a = pd.Series(data=data,index=index,name='apple')
a
>>>
a 5
b 1
c 4
d 4
e 4
f 2
g 4
h 3
i 5
j 1
Name: apple, dtype: int32
I want to group the series by its values and return a dict of lists of indices for those values, i.e. this result:
{1: ['b', 'j'], 2: ['f'], 3: ['h'], 4: ['c', 'd', 'e', 'g'], 5: ['a', 'i']}
Here is how I achieve that at the moment:
b = a.reset_index().set_index('apple').squeeze()
grouped = b.groupby(level=0).apply(list).to_dict()
grouped
>>>
{1: ['b', 'j'], 2: ['f'], 3: ['h'], 4: ['c', 'd', 'e', 'g'], 5: ['a', 'i']}
However, explicitly transforming the series first just to get to the result does not feel particularly pythonic. Is there a way to do this directly, by applying a single function (ideally) or a one-line combination of functions, to achieve the same result?
Thanks!
You can use the groupby function and apply a lambda expression to it in order to get the desired result in one line:
grouped = a.groupby(a.values).apply(lambda x: list(x.index)).to_dict()
Alternatively, either of the following produces the same dict:
grouped = dict(a.groupby(a.values).apply(lambda x: x.index.get_level_values(0)))
grouped = dict(a.groupby(a.values).apply(lambda x: x.index.tolist()))
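As a further one-liner, the GroupBy object's `groups` attribute already holds a mapping from each value to the Index of matching labels, so (assuming the same seeded series `a` from the question) converting each Index to a list gives the same dict:

```python
import numpy as np
import string
import pandas as pd

np.random.seed(0)
data = np.random.randint(1, 6, 10)
index = list(string.ascii_lowercase)[:10]
a = pd.Series(data=data, index=index, name='apple')

# .groups maps each group key to the Index of its labels
grouped = {k: list(v) for k, v in a.groupby(a.values).groups.items()}
print(grouped)
```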
I have the following data frame and list values
import pandas as pd
import numpy as np
df_merge = pd.DataFrame({'column1': ['a', 'c', 'e'],
'column2': ['b', 'd', 'f'],
'column3': [0.5, 0.6, .04],
'column4': [0.7, 0.8, 0.9]
})
bb = ['b','h']
dd = ['d', 'I']
ff = ['f', 'l']
I am trying to use np.where and np.select instead of an IF function:
condition = [(df_merge['column1'] == 'a') & (df_merge['column2'] == df_merge['column2'].isin(bb)),
             (df_merge['column1'] == 'c') & (df_merge['column2'] == df_merge['column2'].isin(dd)),
             (df_merge['column1'] == 'e') & (df_merge['column2'] == df_merge['column2'].isin(ff))]
choices1 = [((np.where(df_merge['column3'] >= 1, 'should not have, ','correct')) & (np.where(df_merge['column4'] >= 0.45, 'should not have, ','correct')))]
df_merge['Reason'] = np.select(condition, choices1, default='correct')
However, when I try to run the choices1 line, I get the following error:
TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
I am not sure whether np.where can be used inside choices as above.
np.where should be applied to both columns. The expected output is below:
df_merge = pd.DataFrame({'column1': ['a', 'c', 'e'],
'column2': ['b', 'd', 'f'],
'column3': [0.5, 0.6, .04],
'column4': [0.7, 0.8, 0.9],
'Reason': ['correct, should not have', 'correct, should not have', 'correct, should not have'],
})
Any help / guidance / alternative is much appreciated.
First, the condition list has to be the same length as choices1, so the last condition is commented out (removed) to get length 2.
Also, isin already returns a boolean condition (mask), so comparing a column against it makes no sense.
The last problem: a list of length 2 was needed, so & was replaced with , and the parentheses in the choices1 list were removed to avoid tuples:
condition = [(df_merge['column1'] == 'a') & df_merge['column2'].isin(bb),
(df_merge['column1'] == 'c') & df_merge['column2'].isin(dd)
# (df_merge['column1'] == 'e') & df_merge['column2'].isin(ff),
]
choices1 = [np.where(df_merge['column3'] >= 1, 'should not have','correct'),
np.where(df_merge['column4'] >= 0.45, 'should not have','correct')]
df_merge['Reason'] = np.select(condition, choices1, default='correct')
print (df_merge)
column1 column2 column3 column4 Reason
0 a b 0.50 0.7 correct
1 c d 0.60 0.8 should not have
2 e f 0.04 0.9 correct
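For reference, np.select evaluates the conditions position by position, in order, and takes the value from the matching choice array at that same position; a minimal illustration on a plain array (the values are made up):

```python
import numpy as np

x = np.array([1, 5, 10])
conditions = [x < 3, x < 8]        # checked in order, first match wins
choices = [x * 10, x * 100]        # value taken from the matching array
result = np.select(conditions, choices, default=-1)
# position 0: 1 < 3  -> 1 * 10  = 10
# position 1: 5 < 8  -> 5 * 100 = 500
# position 2: no condition matches -> default -1
print(result)
```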
Here I am attempting to query a column in dataframe df, which has the values 'Yes' or 'No', in order to perform a random letter assignment according to a probability distribution in the rows where the condition is met.
if (df['some_bool'] == 'Yes'):
    df['score'] = np.random.choice(['A', 'B', 'C'], len(df), p=[0.3, 0.2, 0.5])
What is the correct way of writing this? I receive the following error for the above:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Thanks!
Try this instead:
df['score'] = np.where(df['some_bool'] == 'Yes',
np.random.choice(['A', 'B', 'C'], len(df), p=[0.3, 0.2, 0.5]), '')
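An alternative, if the 'No' rows should keep whatever value they already have rather than becoming empty strings: assign through a boolean mask with .loc, drawing only as many random letters as there are matching rows (the example frame here is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'some_bool': ['Yes', 'No', 'Yes', 'No']})

mask = df['some_bool'] == 'Yes'
df['score'] = ''  # rows not matched keep this value
# draw exactly mask.sum() letters, one per matching row
df.loc[mask, 'score'] = np.random.choice(['A', 'B', 'C'],
                                         mask.sum(),
                                         p=[0.3, 0.2, 0.5])
```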
I have a dataframe containing 3 columns (['A', 'B', 'C']) and 3 rows.
We are using a for loop to fetch a value (stored in a variable) from the above dataframe, based on a condition on column B.
We then use a list to store the value held in the variable.
The problem: when checking the list, each entry contains the variable's value together with its type.
I'm not sure why this is happening, as the list should contain only the values.
Can anyone please help us find a solution for this?
Thanks,
Bhuwan
dataframe: columns A, B, C; row values a to i:
df = [[a, b, c], [d, b, f], [g, b, i]]
list_1 = []
for i in range(0, 9):
    variable_1 = df['A'][df.B == 'b']
    list_1.append(variable_1)
print(list_1)
Ideal output: ['a','d','g']
while we are actually getting output like:
['a type: object', 'd type: object', 'g type: object']
You can get your ideal output like this:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'd', 'g'], 'B': ['b', 'b', 'b'], 'C': ['c', 'f', 'i']})
list_1 = list(df[df['B'] == 'b']['A'].values) # <- this line
print(list_1)
> ['a', 'd', 'g']
You just need:
1) to filter your dataframe by column "B": df[df['B'] == 'b']
2) and only then take the values of the resulting column "A", turning them into a list.
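Both steps can also be written as a single .loc call, selecting the boolean mask and the column label together, then converting with .tolist():

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'd', 'g'], 'B': ['b', 'b', 'b'], 'C': ['c', 'f', 'i']})

# boolean mask for the rows plus the column label, in one .loc
list_1 = df.loc[df['B'] == 'b', 'A'].tolist()
print(list_1)
```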
df = pd.DataFrame([[3,3,3]]*4,index=['a','b','c','d'])
While we can extract a copy of a section of an index by specifying row numbers, like below:
i1=df.index[1:3].copy()
Unfortunately, we can't extract a copy of a section of an index by specifying its keys (as with the df.loc method). When I try the below:
i2=df.index['a':'c'].copy()
I get the below error:
TypeError: slice indices must be integers or None or have an __index__ method
Is there any alternative for selecting a subset of an index based on its keys? Thank you.
Simplest is loc with the index:
i1 = df.loc['b':'c'].index
print (i1)
Index(['b', 'c'], dtype='object')
Or it is possible to use get_loc for positions:
i1 = df.index
i1 = i1[i1.get_loc('b') : i1.get_loc('c') + 1]
print (i1)
Index(['b', 'c'], dtype='object')
i1 = i1[i1.get_loc('b') : i1.get_loc('d') + 1]
print (i1)
Index(['b', 'c', 'd'], dtype='object')
Alternative:
i1 = i1[i1.searchsorted('b') : i1.searchsorted('d') + 1]
print (i1)
Index(['b', 'c', 'd'], dtype='object')
Try using .loc; see the pandas indexing documentation:
i2 = df.loc['a':'c'].index
print(i2)
Output:
Index(['a', 'b', 'c'], dtype='object')
or
df.loc['a':'c'].index.tolist()
Output:
['a', 'b', 'c']
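If you would rather stay on the Index object itself instead of slicing the frame first, Index.slice_locs translates label bounds into integer positions (end-inclusive on labels); this sketch assumes the labels are present and the index is sorted:

```python
import pandas as pd

df = pd.DataFrame([[3, 3, 3]] * 4, index=['a', 'b', 'c', 'd'])
idx = df.index

# slice_locs('a', 'c') -> (0, 3): positional bounds for the label range
start, stop = idx.slice_locs('a', 'c')
i2 = idx[start:stop]
print(list(i2))
```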