I have two time series and I would like to find, for each date in time series 1, the nearest date in time series 2. I found how to do it for a single date, but I would like to apply it to the entire time series 1. They are stored in two different dataframes called o and p.
This is what my data looks like:
Time-series1:
o['date']
>>>0 2020-01-26
1 2020-01-28
2 2020-01-31
3 2020-02-15
4 2020-02-17
...
86 2021-01-10
87 2021-01-20
88 2021-01-27
89 2021-01-30
90 2021-02-14
Name: date, Length: 91, dtype: datetime64[ns]
Time-series2:
p['date']
>>>1 2020-02-17
3 2020-03-02
4 2020-03-03
5 2020-03-04
6 2020-03-05
...
172 2021-01-30
173 2021-02-06
174 2021-02-07
177 2021-02-12
179 2021-02-14
Name: date, Length: 144, dtype: datetime64[ns]
The function that I use:
def nearest(pivot, items):
    return min(items, key=lambda x: abs(x - pivot))
This works on a single date, for example:
nearest(o['date'][6], p['date'])
>>>Timestamp('2020-03-02 00:00:00')
When I try to apply it to the whole pandas Series I get an error:
o['date'].apply(nearest, args=(p['date']))
>>>---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-82c86ffd48ff> in <module>()
----> 1 o['date'].apply(nearest, args=(p['date']))
C:\Users\ran\Anaconda3\envs\main\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
4180
4181 # handle ufuncs and lambdas
-> 4182 if kwds or args and not isinstance(func, np.ufunc):
4183
4184 def f(x):
C:\Users\ran\Anaconda3\envs\main\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1325 def __nonzero__(self):
1326 raise ValueError(
-> 1327 f"The truth value of a {type(self).__name__} is ambiguous. "
1328 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
1329 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I feel that there is something basic I'm missing.
I guess I can do:
[nearest(x, p['date']) for x in o['date']]
But I would like to know how to apply it to a pandas Series.
Use Series.apply with a lambda function:
s = o['date'].apply(lambda x: nearest(x, p['date']))
Or with the args parameter; the trailing comma makes args a one-element tuple, whereas the original call args=(p['date']) passed the bare Series, which pandas then evaluated in a boolean context, hence the error:
s = o['date'].apply(nearest, args=(p['date'], ))
A NumPy alternative with numpy.argmin should be faster:
a = o['date'].to_numpy()
b = p['date'].to_numpy()
pos = np.argmin(np.abs(a - b[:, None]), axis=0)
s = pd.Series(b[pos], index=o.index)
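For reference, a minimal self-contained sketch (with made-up dates standing in for o and p) showing that the two approaches return the same result:

import numpy as np
import pandas as pd

# Hypothetical small stand-ins for the two dataframes above
o = pd.DataFrame({'date': pd.to_datetime(['2020-01-26', '2020-02-15', '2020-03-01'])})
p = pd.DataFrame({'date': pd.to_datetime(['2020-02-17', '2020-03-02', '2020-03-04'])})

def nearest(pivot, items):
    return min(items, key=lambda x: abs(x - pivot))

# apply-based version
s_apply = o['date'].apply(nearest, args=(p['date'],))

# broadcasting version: |a - b| is a (len(p), len(o)) matrix of timedeltas
a = o['date'].to_numpy()
b = p['date'].to_numpy()
pos = np.argmin(np.abs(a - b[:, None]), axis=0)
s_numpy = pd.Series(b[pos], index=o.index)

print((s_apply == s_numpy).all())  # True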
Related
I am doing data preprocessing and I am trying to convert a date string column into an int, but I get an error. Please help me convert it.
I have data like this:
0 Apr-12
1 Apr-12
2 Mar-12
3 Apr-12
4 Apr-12
I tried this:
d=df['d_date'].apply(lambda x: datetime.strptime(x, '%m%Y'))
I got an error.
ValueError Traceback (most recent call last)
<ipython-input-134-173081812744> in <module>()
----> 1 d=test['first_payment_date'].apply(lambda x: datetime.strptime(x, '%m%Y'))
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
4036 else:
4037 values = self.astype(object).values
-> 4038 mapped = lib.map_infer(values, f, convert=convert_dtype)
4039
4040 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-134-173081812744> in <lambda>(x)
----> 1 d=test['first_payment_date'].apply(lambda x: datetime.strptime(x, '%m%Y'))
~\Anaconda3\lib\_strptime.py in _strptime_datetime(cls, data_string, format)
563 """Return a class cls instance based on the input string and the
564 format string."""
--> 565 tt, fraction = _strptime(data_string, format)
566 tzname, gmtoff = tt[-2:]
567 args = tt[:6] + (fraction,)
~\Anaconda3\lib\_strptime.py in _strptime(data_string, format)
360 if not found:
361 raise ValueError("time data %r does not match format %r" %
--> 362 (data_string, format))
363 if len(data_string) != found.end():
364 raise ValueError("unconverted data remains: %s" %
ValueError: time data 'Apr12' does not match format '%m%Y'
IIUC, you need to use %b-%y, since Apr is %b and 12 is %y. Refer to Python's strftime directives for more information. Once you convert to datetime objects, you can then convert them to UNIX timestamps.
df:
col
0 Apr-12
1 Apr-12
For a numeric UNIX timestamp (seconds since the epoch):
pd.Series(pd.to_datetime(df['col'], format='%b-%y').values.astype(float)).div(10**9)
Output:
0 1.333238e+09
1 1.333238e+09
dtype: float64
Explanation:
pd.to_datetime(df['col'], format='%b-%y')
Outputs:
0 2012-04-01
1 2012-04-01
Name: col, dtype: datetime64[ns]
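If whole-second integers are preferred over the float output above, a variant (a sketch using a hypothetical df built from the same sample values) is:

import pandas as pd

df = pd.DataFrame({'col': ['Apr-12', 'Apr-12', 'Mar-12']})

# Parse the month-year strings, view the datetime64[ns] values as integer
# nanoseconds since the epoch, then floor-divide down to whole seconds.
unix_seconds = pd.to_datetime(df['col'], format='%b-%y').astype('int64') // 10**9
print(unix_seconds)
# 0    1333238400
# 1    1333238400
# 2    1330560000
# Name: col, dtype: int64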
I'm trying to create code that will plot the number of entries into the MTA by hour. I have a csv dataset of entries and the hours they occur. I've reduced the dataset to a Pandas dataframe with just the entries, and I've added a column for "Hour" to show the time that the entries occur.
However, when plotting, I keep getting a "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." and I'm unclear what this means, or why it's happening.
I've tried converting the "Hour" column to a datetime; I've tried using the index instead of the "Hour" column in the dataframe.
from ggplot import *
import pandas as pd
turnstile_weather = pd.read_csv('/home/pi/Documents/Data analysis/turnstile_data_master_with_weather.csv')
def plot_weather_data(turnstile_weather):
    entries_by_hour = pd.DataFrame(turnstile_weather['ENTRIESn_hourly'] \
        .groupby(turnstile_weather['Hour']).sum())
    entries_by_hour['Hour'] = set(turnstile_weather['Hour'])
    plot = ggplot(entries_by_hour, \
        aes(entries_by_hour['Hour'], entries_by_hour['ENTRIESn_hourly'])) \
        + geom_line()
    print(plot)

plot_weather_data(turnstile_weather)
I expect to get a line chart, with Hours as the X-axis and Entries by Hour as the Y-axis, but I get an error instead:
ValueError Traceback (most recent call last)
<ipython-input-9-3cf39740bb64> in <module>
10 print(plot)
11
---> 12 plot_weather_data(turnstile_weather)
<ipython-input-9-3cf39740bb64> in plot_weather_data(turnstile_weather)
7 entries_by_hour = pd.DataFrame(turnstile_weather['ENTRIESn_hourly'].groupby(turnstile_weather['Hour']).sum())
8 entries_by_hour['Hour'] = set(turnstile_weather['Hour'])
----> 9 plot = ggplot(entries_by_hour, aes(entries_by_hour['Hour'], entries_by_hour['ENTRIESn_hourly'])) + geom_line()
10 print(plot)
11
/usr/local/lib/python3.5/dist-packages/ggplot/ggplot.py in __init__(self, aesthetics, data)
53 self._aes = aesthetics
54 self.data = data.copy()
---> 55 self._handle_index()
56 self.data = self._aes._evaluate_expressions(self.data)
57 self.data = self._aes.handle_identity_values(self.data)
/usr/local/lib/python3.5/dist-packages/ggplot/ggplot.py in _handle_index(self)
132
133 def _handle_index(self):
--> 134 if '__index__' in self._aes.values():
135 self.data['__index__'] = self.data.index
136
/usr/lib/python3.5/_collections_abc.py in __contains__(self, value)
688 def __contains__(self, value):
689 for key in self._mapping:
--> 690 if value == self._mapping[key]:
691 return True
692 return False
~/.local/lib/python3.5/site-packages/pandas/core/generic.py in __nonzero__(self)
1476 raise ValueError("The truth value of a {0} is ambiguous. "
1477 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1478 .format(self.__class__.__name__))
1479
1480 __bool__ = __nonzero__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
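The traceback points at the likely cause: the dataframe ends up in the aesthetics slot (the __init__ shown takes (aesthetics, data) in that order), and whole Series objects are passed into aes, so ggplot evaluates value == self._mapping[key], which is a Series, in a boolean context. A hedged sketch of the conventional yhat-ggplot call pattern, passing aes first and referring to columns by name as strings (not tested against this exact ggplot version):

from ggplot import ggplot, aes, geom_line
import pandas as pd

turnstile_weather = pd.read_csv('/home/pi/Documents/Data analysis/turnstile_data_master_with_weather.csv')

def plot_weather_data(turnstile_weather):
    # Sum entries per hour; as_index=False keeps 'Hour' as an ordinary column
    entries_by_hour = (turnstile_weather
                       .groupby('Hour', as_index=False)['ENTRIESn_hourly']
                       .sum())
    # aes first, data second, and column names as strings so ggplot never
    # has to compare whole Series objects
    plot = ggplot(aes(x='Hour', y='ENTRIESn_hourly'), data=entries_by_hour) + geom_line()
    print(plot)

plot_weather_data(turnstile_weather)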
I have a dataframe with two columns: 'case' and 'outDateTime'.
index    case          outDateTime
71809    10180227.0    2013-01-01 01:41:01
71810    10180229.0    2013-01-01 04:20:05
71811    10180230.0    2013-01-01 06:20:22
575      10180232.0    2013-01-01 02:01:13
23757    10180233.0    2013-01-01 01:48:49
My goal is to count the number of cases in each hour of every day. For this data it would be:
2013-01-01 01AM = 2
2013-01-01 02AM = 1
2013-01-01 03AM = 0
2013-01-01 04AM = 1
and so on
I wanted to use df.resample('H').agg(), but I get the following error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
I find this strange because I used to_datetime and set_index, so my dataframe's index should be the datetime column with a datetime type.
My code:
pd.to_datetime(Dout['outDateTime'],dayfirst=True)
Dout.set_index('outDateTime',inplace=True)
Dout.isnull().values.any()
False
Dout = Dout.resample('H').agg()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-196-75f9fbadd6cc> in <module>
----> 1 Dout = Dout.resample('H').agg()
~/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in resample(self, rule, how, axis, fill_method, closed, label, convention, kind, loffset, limit, base, on, level)
7108 axis=axis, kind=kind, loffset=loffset,
7109 convention=convention,
-> 7110 base=base, key=on, level=level)
7111 return _maybe_process_deprecations(r,
7112 how=how,
~/anaconda3/lib/python3.7/site-packages/pandas/core/resample.py in resample(obj, kind, **kwds)
1146 """ create a TimeGrouper and return our resampler """
1147 tg = TimeGrouper(**kwds)
-> 1148 return tg._get_resampler(obj, kind=kind)
1149
1150
~/anaconda3/lib/python3.7/site-packages/pandas/core/resample.py in _get_resampler(self, obj, kind)
1274 raise TypeError("Only valid with DatetimeIndex, "
1275 "TimedeltaIndex or PeriodIndex, "
-> 1276 "but got an instance of %r" % type(ax).__name__)
1277
1278 def _get_grouper(self, obj, validate=True):
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
I think you are looking for:
df.set_index('outDateTime').resample('H').size()
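Note that in the code above the result of pd.to_datetime is never assigned back, so outDateTime stays an object column and set_index produces a plain Index, which is exactly what the error complains about. A minimal runnable sketch under that reading, with made-up rows shaped like the sample data:

import pandas as pd

# Hypothetical rows mirroring the sample above (day-first date strings)
Dout = pd.DataFrame({
    'case': [10180227.0, 10180229.0, 10180230.0, 10180232.0, 10180233.0],
    'outDateTime': ['01-01-2013 01:41:01', '01-01-2013 04:20:05',
                    '01-01-2013 06:20:22', '01-01-2013 02:01:13',
                    '01-01-2013 01:48:49'],
})

# Assign the parsed datetimes back so the column really is datetime64[ns]
Dout['outDateTime'] = pd.to_datetime(Dout['outDateTime'], dayfirst=True)

counts = Dout.set_index('outDateTime').resample('H').size()
print(counts)
# 2013-01-01 01:00:00    2
# 2013-01-01 02:00:00    1
# 2013-01-01 03:00:00    0
# 2013-01-01 04:00:00    1
# ...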
So I have a dataframe X which looks something like this:
X.head()
0 My wife took me here on my birthday for breakf...
1 I have no idea why some people give bad review...
3 Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
4 General Manager Scott Petello is a good egg!!!...
6 Drop what you're doing and drive here. After I...
Name: text, dtype: object
And then,
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(X)
But I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-61-8ff79b91e317> in <module>()
----> 1 X = cv.fit_transform(X)
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
867
868 vocabulary, X = self._count_vocab(raw_documents,
--> 869 self.fixed_vocabulary_)
870
871 if self.binary:
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
790 for doc in raw_documents:
791 feature_counter = {}
--> 792 for feature in analyze(doc):
793 try:
794 feature_idx = vocabulary[feature]
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
264
265 return lambda doc: self._word_ngrams(
--> 266 tokenize(preprocess(self.decode(doc))), stop_words)
267
268 else:
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(x)
230
231 if self.lowercase:
--> 232 return lambda x: strip_accents(x.lower())
233 else:
234 return strip_accents
~/anaconda3/lib/python3.6/site-packages/scipy/sparse/base.py in __getattr__(self, attr)
574 return self.getnnz()
575 else:
--> 576 raise AttributeError(attr + " not found")
577
578 def transpose(self, axes=None, copy=False):
AttributeError: lower not found
No idea why.
You need to specify the column name of the text data even if the dataframe has a single column.
X_countMatrix = cv.fit_transform(X['text'])
A CountVectorizer expects an iterable of documents as input, and when you supply a dataframe as the argument, the only thing that gets iterated is the column names. So even if you did not get an error, the result would be incorrect. It is lucky that you got an error and had a chance to correct it.
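A minimal sketch of the intended call, using a hypothetical two-row dataframe in place of the review data above:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the dataframe of review text above
X = pd.DataFrame({'text': [
    "My wife took me here on my birthday for breakfast",
    "I have no idea why some people give bad reviews",
]})

cv = CountVectorizer()
# Pass the column of strings (an iterable of documents), not the whole dataframe
X_countMatrix = cv.fit_transform(X['text'])

print(X_countMatrix.shape)         # (2, number of unique tokens)
print(sorted(cv.vocabulary_)[:5])  # first few vocabulary terms, alphabetically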
In this bit of code, I am trying to create a bar chart with one bar (a "total" bar that will be placed next to the other bars) from a slice of a data frame. It works for the other slice of the data frame, but for the "total" slice (which is just one row), I keep getting this "KeyError: 0":
x_4t = dfCOPN['Percent'][-1:]
y_4t = dfCOPN['index'][-1:]
ind4t = np.arange(len(y_4t))
...
for i, text in enumerate(ind4t):
    if x_4t<72:
        ax4t.annotate(str(x_4t)[:-2]+"%", xy=(x_4t+2,ind4t+0.4),fontsize=9, color='black', va='center', ha='left')
    elif x_4t>=72:
        ax4t.annotate(str(x_4t[i])[:-2]+"%", xy=(x_4t[i]-2,ind4t[i]+0.4),fontsize=9, color='white', va='center', ha='right')
Here's the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-42-e192a1699cc5> in <module>()
174 # for i, text in enumerate(ind4t):
--> 175 if x_4t<72:
176 ax4t.annotate(str(x_4t)[:-2]+"%", xy=(x_4t+2,ind4t+0.4),fontsize=9, color='black', va='center', ha='left')
C:\Users\m\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
729 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
--> 730 .format(self.__class__.__name__))
731
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-42-e192a1699cc5> in <module>()
185 except:
186 for i, text in enumerate(ind4t):
--> 187 ax4t.annotate(str(x_4t[i])[:-2]+"%", xy=(x_4t[i]+2,ind4t[i]+0.4),fontsize=9, color='black', va='center', ha='left')
188 for i, text in enumerate(ind5t):
189 ax5t.annotate(str(x_5t[i])[:-2]+"%", xy=(x_5t[i]+2,ind5t[i]+0.4),fontsize=9, color='black', va='center', ha='left')
C:\Users\m\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
549 def __getitem__(self, key):
550 try:
--> 551 result = self.index.get_value(self, key)
552
553 if not np.isscalar(result):
C:\Users\m\Anaconda3\lib\site-packages\pandas\core\index.py in get_value(self, series, key)
1721
1722 try:
-> 1723 return self._engine.get_value(s, k)
1724 except KeyError as e1:
1725 if len(self) > 0 and self.inferred_type in ['integer','boolean']:
pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3204)()
pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:2903)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3843)()
pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6525)()
pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6463)()
KeyError: 0
I'll send a complete example if needed, but I just wanted to see if the answer is obvious enough to not do so.
Thanks in advance!
TL;DR: use iloc
x_4t.iloc[0]
The problem is that a pd.Series uses label-based, not positional, indexing via []. This is a powerful idea, because frequently when you are working with a Series the value of the index is the interesting thing, not the position in the Series (e.g. a date index); however, if you expect Series / DataFrames to behave exactly like a numpy array you will run into issues (I speak from experience here).
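A small illustration of the distinction, with made-up values; the one-row slice keeps its original label (here 2), so [0] raises KeyError while .iloc[0] works:

import pandas as pd

s = pd.Series([10.0, 55.0, 80.0])   # default integer labels 0, 1, 2
x_4t = s.iloc[-1:]                  # last row only, but its label is still 2

print(x_4t.iloc[0])   # 80.0 -- position-based, always works
print(x_4t[2])        # 80.0 -- label-based lookup by the original label
# print(x_4t[0])      # KeyError: 0 -- there is no label 0 in this slice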