Having problem using comparison operator in pandas groupby function

I'm stuck on the following error on line 4 while trying to get the maximum revenue among all shops for September 2014: "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
amt = []
group = df.groupby(["shop_id"])
for i, j in group:
    if ((j["month"] == '9') & (j["year"] == '2014')):  # line 4 - error
        #print(i)
        #print(j)
        #print(x)
        x = np.sum(j['item_cnt_day'] * j["item_price"])
        print("revenue: ", x)
        amt.append(x)
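One way to read the error: the condition on line 4 produces a boolean Series with one value per row of the group, while `if` needs a single True or False. A minimal sketch of a possible fix, assuming `month` and `year` are stored as strings (as the comparisons in the question suggest), is to select the rows with the mask first and then aggregate per shop. The sample data below is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question, values are made up.
df = pd.DataFrame({
    "shop_id": [1, 1, 2, 2, 3],
    "month": ["9", "9", "9", "8", "9"],
    "year": ["2014", "2014", "2014", "2014", "2013"],
    "item_cnt_day": [2, 1, 5, 10, 7],
    "item_price": [10.0, 20.0, 3.0, 100.0, 50.0],
})

# Element-wise mask: one True/False per row, used to select rows, not inside `if`.
mask = (df["month"] == "9") & (df["year"] == "2014")
sept = df.loc[mask]

# Revenue per row, summed per shop.
revenue = (sept["item_cnt_day"] * sept["item_price"]).groupby(sept["shop_id"]).sum()

max_revenue = revenue.max()    # highest revenue among the shops
best_shop = revenue.idxmax()   # shop_id that achieved it
print("max revenue:", max_revenue, "shop:", best_shop)
```

If the columns are actually integers, the comparisons would need `9` and `2014` instead of the string versions.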

Related

The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() while using model.fit

def train(self):
    self.model.fit([self.graph_train, self.pattern_train], self.y_train, batch_size=self.batch_size, epochs=self.epochs,
                   class_weight=self.class_weight)
Getting Error like this

Apply a function with lambda to Pandas

I have two time series, and for each date in time-series1 I would like to find the nearest date in time-series2. I found how to do it for a single date, but I would like to apply it to all of time-series1. They are in two different dataframes called o and p.
This is what my data looks like:
Time-series1:
o['date']
>>>0 2020-01-26
1 2020-01-28
2 2020-01-31
3 2020-02-15
4 2020-02-17
...
86 2021-01-10
87 2021-01-20
88 2021-01-27
89 2021-01-30
90 2021-02-14
Name: date, Length: 91, dtype: datetime64[ns]
Time-series2:
p['date']
>>>1 2020-02-17
3 2020-03-02
4 2020-03-03
5 2020-03-04
6 2020-03-05
...
172 2021-01-30
173 2021-02-06
174 2021-02-07
177 2021-02-12
179 2021-02-14
Name: date, Length: 144, dtype: datetime64[ns]
The function that I use:
def nearest(pivot, items):
    return min(items, key=lambda x: abs(x - pivot))
This works on a single date, for example:
nearest(o['date'][6], p['date'])
>>>Timestamp('2020-03-02 00:00:00')
When I try to apply it to the whole pandas Series I get an error:
o['date'].apply(nearest, args=(p['date']))
>>>---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-82c86ffd48ff> in <module>()
----> 1 o['date'].apply(nearest, args=(p['date']))
C:\Users\ran\Anaconda3\envs\main\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
4180
4181 # handle ufuncs and lambdas
-> 4182 if kwds or args and not isinstance(func, np.ufunc):
4183
4184 def f(x):
C:\Users\ran\Anaconda3\envs\main\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1325 def __nonzero__(self):
1326 raise ValueError(
-> 1327 f"The truth value of a {type(self).__name__} is ambiguous. "
1328 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
1329 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I feel that there is something basic I'm missing.
I guess I can do:
[nearest(x, p['date']) for x in o['date']]
But I would like to know how to apply it to a Pandas Series

Use Series.apply with a lambda function:
s = o['date'].apply(lambda x: nearest(x, p['date']))
Or pass args as a one-element tuple (note the trailing comma):
s = o['date'].apply(nearest, args=(p['date'], ))
A NumPy alternative with numpy.argmin should be faster:
a = o['date'].to_numpy()
b = p['date'].to_numpy()
pos = np.argmin(np.abs(a- b[:, None]), axis=0)
s = pd.Series(b[pos], index=o.index)
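For context on why the original call failed: `(p['date'])` without a trailing comma is just the Series itself, not a tuple, so `apply` ends up evaluating the truth value of a Series in the `if kwds or args ...` check shown in the traceback. The sketch below uses small made-up dates in place of `o` and `p` and checks that the `args` version and the NumPy version agree.

```python
import numpy as np
import pandas as pd

def nearest(pivot, items):
    # Return the element of `items` closest to `pivot`.
    return min(items, key=lambda x: abs(x - pivot))

# Hypothetical stand-ins for the question's dataframes o and p.
o = pd.DataFrame({"date": pd.to_datetime(["2020-01-26", "2020-03-01", "2021-02-10"])})
p = pd.DataFrame({"date": pd.to_datetime(["2020-02-17", "2020-03-02", "2021-02-14"])})

# Parentheses alone do not make a tuple; the trailing comma does.
not_a_tuple = (p["date"])
one_tuple = (p["date"],)

# Row-by-row version via apply.
s_apply = o["date"].apply(nearest, args=one_tuple)

# Broadcasted version: rows of the difference matrix are dates in p, columns are dates in o.
a = o["date"].to_numpy()
b = p["date"].to_numpy()
pos = np.argmin(np.abs(a - b[:, None]), axis=0)
s_numpy = pd.Series(b[pos], index=o.index)

print(s_apply.tolist())
print(s_numpy.tolist())
```

The broadcasted difference builds a len(p) x len(o) array, so memory use grows with the product of the two lengths.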

How to plot activities by hour based on dataset in ggplot?

I'm trying to create code that will plot the number of entries into the MTA by hour. I have a csv dataset of entries and the hours they occur. I've reduced the dataset to a Pandas dataframe with just the entries, and I've added a column for "Hour" to show the time that the entries occur.
However, when plotting, I keep getting a "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." and I'm unclear what this means, or why it's happening.
I've tried adjusting the "Hour" column to a datetime; I've tried using the index instead of the "Hour" column in the dataframe.
from ggplot import *
import pandas as pd
turnstile_weather = pd.read_csv('/home/pi/Documents/Data analysis/turnstile_data_master_with_weather.csv')
def plot_weather_data(turnstile_weather):
    entries_by_hour = pd.DataFrame(turnstile_weather['ENTRIESn_hourly'] \
        .groupby(turnstile_weather['Hour']).sum())
    entries_by_hour['Hour'] = set(turnstile_weather['Hour'])
    plot = ggplot(entries_by_hour, \
        aes(entries_by_hour['Hour'], entries_by_hour['ENTRIESn_hourly'])) \
        + geom_line()
    print(plot)
plot_weather_data(turnstile_weather)
I expect to get a line chart, with Hours as the X-axis and Entries by Hour as the Y-axis, but I get an error instead:
ValueError Traceback (most recent call last)
<ipython-input-9-3cf39740bb64> in <module>
10 print(plot)
11
---> 12 plot_weather_data(turnstile_weather)
<ipython-input-9-3cf39740bb64> in plot_weather_data(turnstile_weather)
7 entries_by_hour = pd.DataFrame(turnstile_weather['ENTRIESn_hourly'].groupby(turnstile_weather['Hour']).sum())
8 entries_by_hour['Hour'] = set(turnstile_weather['Hour'])
----> 9 plot = ggplot(entries_by_hour, aes(entries_by_hour['Hour'], entries_by_hour['ENTRIESn_hourly'])) + geom_line()
10 print(plot)
11
/usr/local/lib/python3.5/dist-packages/ggplot/ggplot.py in __init__(self, aesthetics, data)
53 self._aes = aesthetics
54 self.data = data.copy()
---> 55 self._handle_index()
56 self.data = self._aes._evaluate_expressions(self.data)
57 self.data = self._aes.handle_identity_values(self.data)
/usr/local/lib/python3.5/dist-packages/ggplot/ggplot.py in _handle_index(self)
132
133 def _handle_index(self):
--> 134 if '__index__' in self._aes.values():
135 self.data['__index__'] = self.data.index
136
/usr/lib/python3.5/_collections_abc.py in __contains__(self, value)
688 def __contains__(self, value):
689 for key in self._mapping:
--> 690 if value == self._mapping[key]:
691 return True
692 return False
~/.local/lib/python3.5/site-packages/pandas/core/generic.py in __nonzero__(self)
1476 raise ValueError("The truth value of a {0} is ambiguous. "
1477 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1478 .format(self.__class__.__name__))
1479
1480 __bool__ = __nonzero__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
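Reading the traceback, the error comes from `'__index__' in self._aes.values()`: the values passed to `aes` are compared with a string using `==`, and when those values are Series the comparison returns a Series whose truth value is ambiguous. A minimal sketch with made-up data reproduces that comparison using only pandas, and builds the hourly totals with the `Hour` column kept by `groupby` (a `set` is unordered, so it may not line up with the grouped sums). Passing column names as strings to `aes` is an assumption about the ggplot package used here, so it appears only as a comment.

```python
import pandas as pd

# Hypothetical stand-in for the turnstile CSV; values are made up.
turnstile_weather = pd.DataFrame({
    "Hour": [0, 4, 4, 8, 0],
    "ENTRIESn_hourly": [10, 20, 5, 7, 3],
})

# Sum entries per hour and keep "Hour" as a regular column.
entries_by_hour = turnstile_weather.groupby("Hour", as_index=False)["ENTRIESn_hourly"].sum()

# Same kind of membership test as in the traceback, with Series as the mapped values.
series_mapping = {"x": entries_by_hour["Hour"], "y": entries_by_hour["ENTRIESn_hourly"]}
try:
    _ = "__index__" in series_mapping.values()
    raised = False
except ValueError:
    raised = True

# With plain column names the same test is an ordinary string comparison.
name_mapping = {"x": "Hour", "y": "ENTRIESn_hourly"}
found = "__index__" in name_mapping.values()

print(raised, found)
# Assumed call with column names instead of Series (not run here):
# ggplot(entries_by_hour, aes(x='Hour', y='ENTRIESn_hourly')) + geom_line()
```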

Annotate seaborn clustermap with Pandas Dataframe

I am using seaborn (v.0.7.0) to plot a heat-map. Here is my code:
Updated code after fixing the problem
### Get Data
sns.set(style="white")
adata = pd.read_csv("Test.txt", sep="\t",index_col=0)
adata_log = np.log2(adata)
e = adata.iloc[0:7,0:3]
e_log = adata_log.iloc[0:7,0:3]
#### Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
#### Set color
hmcol = ["#ffffff","#ffffff","#fbe576","#c06e36","#9a2651"]
cmap = sns.blend_palette(hmcol,as_cmap=True)
#### Plot clustermap
sns.set(font_scale=0.8) ## 0.8 for normal use
aplot = sns.clustermap(e_log,cmap=cmap,method='average', metric='euclidean',standard_scale=None,row_cluster=False,col_cluster=False,row_linkage=None,col_linkage=None,linewidths=.05,square=True,annot=e,annot_kws={"size": 15},fmt='.2f')
aplot.cax.set_visible(False) #remove color bar
plt.setp(aplot.ax_heatmap.xaxis.get_majorticklabels(), rotation=90) ## X-Axis label rotations
plt.setp(aplot.ax_heatmap.yaxis.get_majorticklabels(), rotation=0) ## Y-Axis label rotations
##Save Figure
aplot.savefig("Test-Fig1.0.pdf",orientation='portrait',dpi=600)
Is there any way I can use the values in the dataframe 'e' as annotations? I tried
annot=e
in clustermap but it's giving me an error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Also, is there any way I can plot the figure in landscape mode? Here is the data and figure from the above code:
print(e)
X Y Z
A 100.72 90.20 13.58
B 160.98 162.24 12.85
C 6.76 8.03 0.66
D 241.49 277.89 29.43
E 156.78 145.54 30.72
F 6.09 5.96 0.93
G 4.57 1.16 0.74

Upgrading 'seaborn' to v0.7.1 solved the annotation problem. I have updated the code above with the fix.
For plotting in landscape, I guess the easiest way would be to transpose my data, i.e. change the input test file by copying the data and pasting it transposed.
Bade
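As an alternative to editing the input file by hand, the transpose can be done in pandas. A minimal sketch using the values printed in the question; passing the transposed frames to `sns.clustermap` (for example `e_log.T` with `annot=e.T`) is an assumption and is not run here.

```python
import pandas as pd

# Values copied from the printed frame `e` in the question.
e = pd.DataFrame(
    {
        "X": [100.72, 160.98, 6.76, 241.49, 156.78, 6.09, 4.57],
        "Y": [90.20, 162.24, 8.03, 277.89, 145.54, 5.96, 1.16],
        "Z": [13.58, 12.85, 0.66, 29.43, 30.72, 0.93, 0.74],
    },
    index=list("ABCDEFG"),
)

# Transpose so that rows A..G become columns and X, Y, Z become rows.
e_wide = e.T
print(e_wide)
```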

MatPlotLib Key Error: 0 When Annotating

In this bit of code, I am trying to create a bar chart with one bar (as the "total" bar which will be placed next to other bars) with a slice of a data frame. It works for the other slice of the data frame, but for the "total" slice (which is just one row), I keep getting this "Key Error: 0":
x_4t = dfCOPN['Percent'][-1:]
y_4t = dfCOPN['index'][-1:]
ind4t = np.arange(len(y_4t))
...
for i, text in enumerate(ind4t):
    if x_4t<72:
        ax4t.annotate(str(x_4t)[:-2]+"%", xy=(x_4t+2,ind4t+0.4),fontsize=9, color='black', va='center', ha='left')
    elif x_4t>=72:
        ax4t.annotate(str(x_4t[i])[:-2]+"%", xy=(x_4t[i]-2,ind4t[i]+0.4),fontsize=9, color='white', va='center', ha='right')
Here's the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-42-e192a1699cc5> in <module>()
174 # for i, text in enumerate(ind4t):
--> 175 if x_4t<72:
176 ax4t.annotate(str(x_4t)[:-2]+"%", xy=(x_4t+2,ind4t+0.4),fontsize=9, color='black', va='center', ha='left')
C:\Users\m\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
729 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
--> 730 .format(self.__class__.__name__))
731
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-42-e192a1699cc5> in <module>()
185 except:
186 for i, text in enumerate(ind4t):
--> 187 ax4t.annotate(str(x_4t[i])[:-2]+"%", xy=(x_4t[i]+2,ind4t[i]+0.4),fontsize=9, color='black', va='center', ha='left')
188 for i, text in enumerate(ind5t):
189 ax5t.annotate(str(x_5t[i])[:-2]+"%", xy=(x_5t[i]+2,ind5t[i]+0.4),fontsize=9, color='black', va='center', ha='left')
C:\Users\m\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
549 def __getitem__(self, key):
550 try:
--> 551 result = self.index.get_value(self, key)
552
553 if not np.isscalar(result):
C:\Users\m\Anaconda3\lib\site-packages\pandas\core\index.py in get_value(self, series, key)
1721
1722 try:
-> 1723 return self._engine.get_value(s, k)
1724 except KeyError as e1:
1725 if len(self) > 0 and self.inferred_type in ['integer','boolean']:
pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3204)()
pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:2903)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3843)()
pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6525)()
pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6463)()
KeyError: 0
I'll send a complete example if needed, but I just wanted to see if the answer is obvious enough to not do so.
Thanks in advance!

TL;DR: use iloc
x_4t.iloc[0]
The problem is that a pd.Series uses index (label) lookup, not positional indexing, via []. This is a powerful bit of syntax, because when you work with a Series the index value is frequently the interesting thing rather than the position (for example, a date index). However, if you expect Series and DataFrames to behave exactly like a numpy array, you will have issues (I speak from experience here).
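A minimal sketch of the difference, using a made-up frame in place of `dfCOPN` (column names follow the question, values are hypothetical). The last row keeps its original index label, so label `0` does not exist in the one-row slice, while position `0` does.

```python
import pandas as pd

# Hypothetical stand-in for dfCOPN; the last row plays the role of the "total" row.
dfCOPN = pd.DataFrame({
    "index": ["a", "b", "Total"],
    "Percent": [40.0, 75.0, 60.0],
})

# Last row only, selected by position; its index label is still 2.
x_4t = dfCOPN["Percent"].iloc[-1:]

# Label-based lookup with [] fails because there is no label 0 in this slice.
try:
    x_4t[0]
    key_error = False
except KeyError:
    key_error = True

# Positional lookup returns a scalar, which is also safe to use in `if`.
value = x_4t.iloc[0]
color = "black" if value < 72 else "white"
print(key_error, value, color)
```

Pulling out the scalar with `.iloc[0]` also avoids the "truth value of a Series is ambiguous" error from `if x_4t<72`.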
