I'm trying to create code that will plot the number of entries into the MTA by hour. I have a csv dataset of entries and the hours they occur. I've reduced the dataset to a Pandas dataframe with just the entries, and I've added a column for "Hour" to show the time that the entries occur.
However, when plotting, I keep getting a "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." and I'm unclear what this means, or why it's happening.
I've tried adjusting the "Hour" column to a datetime; I've tried using the index instead of the "Hour" column in the dataframe.
from ggplot import *
import pandas as pd

turnstile_weather = pd.read_csv('/home/pi/Documents/Data analysis/turnstile_data_master_with_weather.csv')

def plot_weather_data(turnstile_weather):
    entries_by_hour = pd.DataFrame(turnstile_weather['ENTRIESn_hourly']
                                   .groupby(turnstile_weather['Hour']).sum())
    entries_by_hour['Hour'] = set(turnstile_weather['Hour'])
    plot = ggplot(entries_by_hour,
                  aes(entries_by_hour['Hour'], entries_by_hour['ENTRIESn_hourly'])) \
        + geom_line()
    print(plot)

plot_weather_data(turnstile_weather)
I expect to get a line chart, with Hours as the X-axis and Entries by Hour as the Y-axis, but I get an error instead:
ValueError Traceback (most recent call last)
<ipython-input-9-3cf39740bb64> in <module>
10 print(plot)
11
---> 12 plot_weather_data(turnstile_weather)
<ipython-input-9-3cf39740bb64> in plot_weather_data(turnstile_weather)
7 entries_by_hour = pd.DataFrame(turnstile_weather['ENTRIESn_hourly'].groupby(turnstile_weather['Hour']).sum())
8 entries_by_hour['Hour'] = set(turnstile_weather['Hour'])
----> 9 plot = ggplot(entries_by_hour, aes(entries_by_hour['Hour'], entries_by_hour['ENTRIESn_hourly'])) + geom_line()
10 print(plot)
11
/usr/local/lib/python3.5/dist-packages/ggplot/ggplot.py in __init__(self, aesthetics, data)
53 self._aes = aesthetics
54 self.data = data.copy()
---> 55 self._handle_index()
56 self.data = self._aes._evaluate_expressions(self.data)
57 self.data = self._aes.handle_identity_values(self.data)
/usr/local/lib/python3.5/dist-packages/ggplot/ggplot.py in _handle_index(self)
132
133 def _handle_index(self):
--> 134 if '__index__' in self._aes.values():
135 self.data['__index__'] = self.data.index
136
/usr/lib/python3.5/_collections_abc.py in __contains__(self, value)
688 def __contains__(self, value):
689 for key in self._mapping:
--> 690 if value == self._mapping[key]:
691 return True
692 return False
~/.local/lib/python3.5/site-packages/pandas/core/generic.py in __nonzero__(self)
1476 raise ValueError("The truth value of a {0} is ambiguous. "
1477 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1478 .format(self.__class__.__name__))
1479
1480 __bool__ = __nonzero__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
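A guess at what is going on, based on the traceback: ggplot's `_handle_index` runs `'__index__' in self._aes.values()`. Because `aes()` was given whole Series objects rather than column-name strings, that membership test compares the string `'__index__'` against a Series element-wise, and truth-testing the resulting boolean Series raises exactly this ValueError. A minimal pandas-only sketch (with made-up data standing in for the CSV) reproduces the ambiguous comparison and shows the usual fix of passing column names to `aes()`:

```python
import pandas as pd

# Hypothetical stand-in for the turnstile CSV (made-up numbers).
turnstile_weather = pd.DataFrame({
    'Hour': [0, 0, 1, 1, 2],
    'ENTRIESn_hourly': [10, 20, 5, 15, 7],
})

# groupby already keys the result by hour; reset_index() turns the hour
# back into an ordinary column, replacing the manual set(...) assignment.
entries_by_hour = (turnstile_weather
                   .groupby('Hour')['ENTRIESn_hourly']
                   .sum()
                   .reset_index())

# ggplot internally evaluates `'__index__' in aes.values()`. When aes()
# was given whole Series objects, that test compares the string against
# a Series element-wise, yielding another Series -- whose truth value
# is ambiguous, hence the error:
comparison = '__index__' == entries_by_hour['Hour']
print(type(comparison))   # a pandas Series, not a single bool

# Passing column names (strings) to aes() avoids the comparison entirely:
# ggplot(entries_by_hour, aes(x='Hour', y='ENTRIESn_hourly')) + geom_line()
```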
I am new to pandas and still learning.
I am trying to add 2 to each value in a Series, label by label. One method is this:
numbers = pd.Series(np.random.randint(0,1000,10000))
for label, value in numbers.iteritems():
    numbers.set_values(label, value+2)
numbers.head()
Output:
AttributeError: 'Series' object has no attribute 'set_values'
Upon research I found out that it was deprecated and that at is used instead.
So when I used it like this:
for label, value in numbers.iteritems():
    numbers.at(label, value+2)
numbers.head()
Output:
TypeError: '_AtIndexer' object is not callable
So when I use it like this with at[]:
for label, value in numbers.iteritems():
    numbers.at[label, value+2]
numbers.head()
I get this output:
KeyError Traceback (most recent call last)
<ipython-input-43-b1f985a669d7> in <module>
1 for label, value in numbers.iteritems():
----> 2 numbers.at[label, value+2]
3
4 numbers.head()
C:\Users\Public\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
2078 return self.obj.loc[key]
2079
-> 2080 return super().__getitem__(key)
2081
2082 def __setitem__(self, key, value):
C:\Users\Public\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
2025
2026 key = self._convert_key(key)
-> 2027 return self.obj._get_value(*key, takeable=self._takeable)
2028
2029 def __setitem__(self, key, value):
C:\Users\Public\anaconda3\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
987
988 # Similar to Index.get_value, but we do not fall back to positional
--> 989 loc = self.index.get_loc(label)
990 return self.index._get_values_for_loc(self, loc, label)
991
C:\Users\Public\anaconda3\lib\site-packages\pandas\core\indexes\range.py in get_loc(self, key, method, tolerance)
356 except ValueError as err:
357 raise KeyError(key) from err
--> 358 raise KeyError(key)
359 return super().get_loc(key, method=method, tolerance=tolerance)
360
KeyError: (0, 10002)
What am I doing wrong and what can be fixed?
at is an accessor for the index label given by its argument. When you pass label, value+2 to it, it sees the argument as a 2-tuple and, on the first iteration, looks for an index labelled literally (0, 10002). That fails because your series has integer labels 0, 1, ..., not tuples.
So pass only label to at, and assign value + 2 to what it addresses:
numbers = pd.Series(np.random.randint(0,1000,10000))
for label, value in numbers.iteritems():
    # look up the value and set it
    numbers.at[label] = value + 2
(noting that this is equivalent to numbers += 2).
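As a quick check of that equivalence, here is a runnable sketch of the corrected loop next to the vectorised form. (Series.iteritems() has since been deprecated in favour of .items(), so that spelling is used here; the seed is added only for reproducibility.)

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # seeded only so the sketch is reproducible
numbers = pd.Series(np.random.randint(0, 1000, 10000))
expected = numbers + 2          # the vectorised equivalent, computed up front

# Label-wise loop: .at[label] both looks up and assigns a single value.
for label, value in numbers.items():
    numbers.at[label] = value + 2

print(numbers.equals(expected))  # True
```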
I have 2 time-series and I would like to find the nearest date from each date in time-series1 to time-series2. I found how to do it separately per date, but I would like to apply it to the entire time-series1. They are in two different dataframes called o and p
This is what my data looks like:
Time-series1:
o['date']
>>>0 2020-01-26
1 2020-01-28
2 2020-01-31
3 2020-02-15
4 2020-02-17
...
86 2021-01-10
87 2021-01-20
88 2021-01-27
89 2021-01-30
90 2021-02-14
Name: date, Length: 91, dtype: datetime64[ns]
Time-series2:
p['date']
>>>1 2020-02-17
3 2020-03-02
4 2020-03-03
5 2020-03-04
6 2020-03-05
...
172 2021-01-30
173 2021-02-06
174 2021-02-07
177 2021-02-12
179 2021-02-14
Name: date, Length: 144, dtype: datetime64[ns]
The function that I use:
def nearest(pivot,items):
return min(items, key=lambda x: abs(x - pivot))
Which works on a separate singular date, for example:
nearest(o['date'][6], p['date'])
>>>Timestamp('2020-03-02 00:00:00')
When I try to apply it to the whole pandas Series I get an error:
o['date'].apply(nearest, args=(p['date']))
>>>---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-82c86ffd48ff> in <module>()
----> 1 o['date'].apply(nearest, args=(p['date']))
C:\Users\ran\Anaconda3\envs\main\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
4180
4181 # handle ufuncs and lambdas
-> 4182 if kwds or args and not isinstance(func, np.ufunc):
4183
4184 def f(x):
C:\Users\ran\Anaconda3\envs\main\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1325 def __nonzero__(self):
1326 raise ValueError(
-> 1327 f"The truth value of a {type(self).__name__} is ambiguous. "
1328 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
1329 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I feel that there is something basic I'm missing.
I guess I can do:
[nearest(x, p['date']) for x in o['date']]
But I would like to know how to apply it to a Pandas Series
Use Series.apply with a lambda function:
s = o['date'].apply(lambda x: nearest(x, p['date']))
Or with args parameter:
s = o['date'].apply(nearest, args=(p['date'], ))
A NumPy alternative with numpy.argmin should be faster:
a = o['date'].to_numpy()
b = p['date'].to_numpy()
pos = np.argmin(np.abs(a - b[:, None]), axis=0)
s = pd.Series(b[pos], index=o.index)
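Both approaches can be checked against small stand-ins for o and p (a few of the dates from the question; the real frames are longer). Note why the original call failed: `args=(p['date'])` without a trailing comma is just a parenthesised Series, not a tuple, so pandas' internal `if kwds or args and not isinstance(func, np.ufunc)` check truth-tests the Series itself.

```python
import numpy as np
import pandas as pd

def nearest(pivot, items):
    return min(items, key=lambda x: abs(x - pivot))

# Small stand-ins for o and p built from a few dates in the question.
o = pd.DataFrame({'date': pd.to_datetime(['2020-01-26', '2020-02-17', '2021-01-30'])})
p = pd.DataFrame({'date': pd.to_datetime(['2020-02-17', '2020-03-02', '2021-02-14'])})

# args must be a tuple -- note the trailing comma.
s_apply = o['date'].apply(nearest, args=(p['date'],))

# Vectorised alternative: broadcast all pairwise distances, argmin over p.
a = o['date'].to_numpy()
b = p['date'].to_numpy()
pos = np.argmin(np.abs(a - b[:, None]), axis=0)
s_numpy = pd.Series(b[pos], index=o.index)

print(s_apply.tolist() == s_numpy.tolist())  # True
```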
I'm stuck on the following error on line 4, "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().", while trying to get the maximum revenue among all shops for September 2014.
amt=[]
group=df.groupby(["shop_id"])
for i,j in group:
    if ((j["month"] == '9') & (j["year"] == '2014')): #line 4 - error
        #print(i)
        #print(j)
        #print(x)
        x=np.sum(j['item_cnt_day']*j["item_price"])
        print("revenue: ",x)
        amt.append(x)
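The `if` fails because `(j["month"] == '9') & (j["year"] == '2014')` is a whole boolean Series, and Python cannot truth-test it. The usual approach is to filter the rows with the mask first and aggregate afterwards. A hedged sketch with made-up data in the shape the question implies:

```python
import pandas as pd

# Hypothetical sales data in the shape the question implies.
df = pd.DataFrame({
    'shop_id':      [1, 1, 2, 2],
    'month':        ['9', '9', '9', '8'],
    'year':         ['2014', '2014', '2014', '2014'],
    'item_cnt_day': [2, 1, 3, 5],
    'item_price':   [10.0, 20.0, 5.0, 7.0],
})

# Instead of truth-testing a boolean Series in `if`, use it as a row
# filter, then aggregate revenue per shop and take the maximum.
mask = (df['month'] == '9') & (df['year'] == '2014')
sept = df[mask]
revenue = (sept['item_cnt_day'] * sept['item_price']).groupby(sept['shop_id']).sum()
print(revenue.max())  # 40.0 with this made-up data
```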
I have a dataframe with two columns: 'case' and 'datetime' .
index    case          outDateTime
71809    10180227.0    2013-01-01 01:41:01
71810    10180229.0    2013-01-01 04:20:05
71811    10180230.0    2013-01-01 06:20:22
575      10180232.0    2013-01-01 02:01:13
23757    10180233.0    2013-01-01 01:48:49
My goal is to count the number of cases in each hour of every day. For this data it would be:
2013-01-01 01AM = 2
2013-01-01 02AM = 1
2013-01-01 03AM = 0
2013-01-01 04AM = 1
and so on
I wanted to use df.resample('H').agg(), but I get the following error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
I find this strange because I used to_datetime and set_index, so my dataframe's index should be the datetime column with a datetime type.
My code:
pd.to_datetime(Dout['outDateTime'],dayfirst=True)
Dout.set_index('outDateTime',inplace=True)
Dout.isnull().values.any()
False
Dout = Dout.resample('H').agg()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-196-75f9fbadd6cc> in <module>
----> 1 Dout = Dout.resample('H').agg()
~/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in resample(self, rule, how, axis, fill_method, closed, label, convention, kind, loffset, limit, base, on, level)
   7108                      axis=axis, kind=kind, loffset=loffset,
   7109                      convention=convention,
-> 7110                      base=base, key=on, level=level)
   7111         return _maybe_process_deprecations(r,
   7112                                            how=how,
~/anaconda3/lib/python3.7/site-packages/pandas/core/resample.py in resample(obj, kind, **kwds)
   1146     """ create a TimeGrouper and return our resampler """
   1147     tg = TimeGrouper(**kwds)
-> 1148     return tg._get_resampler(obj, kind=kind)
   1149
   1150
~/anaconda3/lib/python3.7/site-packages/pandas/core/resample.py in _get_resampler(self, obj, kind)
   1274             raise TypeError("Only valid with DatetimeIndex, "
   1275                             "TimedeltaIndex or PeriodIndex, "
-> 1276                             "but got an instance of %r" % type(ax).__name__)
   1277
   1278     def _get_grouper(self, obj, validate=True):
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
I think you are looking for:
df.set_index('outDateTime').resample('H').size()
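One likely culprit worth noting: in the question's code, the `pd.to_datetime(...)` result is never assigned back to the column, so the index set by `set_index` is still a plain object `Index`. A runnable sketch using the five sample rows from the question, with the assignment added:

```python
import pandas as pd

# The five sample rows from the question.
Dout = pd.DataFrame({
    'case': [10180227.0, 10180229.0, 10180230.0, 10180232.0, 10180233.0],
    'outDateTime': ['2013-01-01 01:41:01', '2013-01-01 04:20:05',
                    '2013-01-01 06:20:22', '2013-01-01 02:01:13',
                    '2013-01-01 01:48:49'],
})

# Assign the conversion back -- pd.to_datetime returns a new Series and
# does not modify the column in place.
Dout['outDateTime'] = pd.to_datetime(Dout['outDateTime'])

counts = Dout.set_index('outDateTime').resample('H').size()
print(counts)  # 2 cases at 01:00, 1 at 02:00, 0 at 03:00, 1 at 04:00, ...
```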
In this bit of code, I am trying to create a bar chart with one bar (a "total" bar to be placed next to the other bars) from a slice of a data frame. It works for the other slice of the data frame, but for the "total" slice (which is just one row) I keep getting this "KeyError: 0":
x_4t = dfCOPN['Percent'][-1:]
y_4t = dfCOPN['index'][-1:]
ind4t = np.arange(len(y_4t))
...
for i, text in enumerate(ind4t):
    if x_4t<72:
        ax4t.annotate(str(x_4t)[:-2]+"%", xy=(x_4t+2,ind4t+0.4),fontsize=9, color='black', va='center', ha='left')
    elif x_4t>=72:
        ax4t.annotate(str(x_4t[i])[:-2]+"%", xy=(x_4t[i]-2,ind4t[i]+0.4),fontsize=9, color='white', va='center', ha='right')
Here's the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-42-e192a1699cc5> in <module>()
174 # for i, text in enumerate(ind4t):
--> 175 if x_4t<72:
176 ax4t.annotate(str(x_4t)[:-2]+"%", xy=(x_4t+2,ind4t+0.4),fontsize=9, color='black', va='center', ha='left')
C:\Users\m\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
729 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
--> 730 .format(self.__class__.__name__))
731
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-42-e192a1699cc5> in <module>()
185 except:
186 for i, text in enumerate(ind4t):
--> 187 ax4t.annotate(str(x_4t[i])[:-2]+"%", xy=(x_4t[i]+2,ind4t[i]+0.4),fontsize=9, color='black', va='center', ha='left')
188 for i, text in enumerate(ind5t):
189 ax5t.annotate(str(x_5t[i])[:-2]+"%", xy=(x_5t[i]+2,ind5t[i]+0.4),fontsize=9, color='black', va='center', ha='left')
C:\Users\m\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
549 def __getitem__(self, key):
550 try:
--> 551 result = self.index.get_value(self, key)
552
553 if not np.isscalar(result):
C:\Users\m\Anaconda3\lib\site-packages\pandas\core\index.py in get_value(self, series, key)
1721
1722 try:
-> 1723 return self._engine.get_value(s, k)
1724 except KeyError as e1:
1725 if len(self) > 0 and self.inferred_type in ['integer','boolean']:
pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3204)()
pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:2903)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3843)()
pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6525)()
pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6463)()
KeyError: 0
I'll send a complete example if needed, but I just wanted to see if the answer is obvious enough to not do so.
Thanks in advance!
TL;DR: use iloc:
x_4t.iloc[0]
The problem is that a pd.Series uses label-based, not positional, indexing via []. This is a powerful idea/bit of syntax, as frequently when you are working with a Series the value of the index is the interesting thing, not the position in the Series (e.g. a date index). However, if you expect Series/DataFrames to behave exactly like a NumPy array, you will run into issues (I speak from experience here).
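To make both errors concrete, here is a small reproducible sketch (dfCOPN replaced by a hypothetical five-row frame). Slicing off the last row keeps that row's original label, so a [] lookup with 0 is a label lookup that cannot succeed, while .iloc[0] is positional and always works on a non-empty Series:

```python
import pandas as pd

# Hypothetical stand-in for dfCOPN: five rows, the last being the "total".
df = pd.DataFrame({'Percent': [10, 20, 30, 40, 99]})

# A one-row slice keeps its original label (4 here), so there is no label 0.
x_4t = df['Percent'].iloc[-1:]   # explicit positional form of the question's [-1:]

raised = False
try:
    x_4t[0]            # []-lookup is label-based here: no label 0 -> KeyError
except KeyError:
    raised = True

print(x_4t.iloc[0])    # positional lookup: 99
```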