pandas timeseries select single date - python-3.x

Selecting a single date from a timeserie gives a KeyError.
Setup:
import pandas as pd
import numpy as np
ts = pd.DataFrame({'date': pd.date_range(start = '1/1/2017', periods = 5),
'observations': np.random.choice(range(0, 100), 5, replace = True)}).set_index('date')
Dataframe:
observations
date
2017-01-01 58
2017-01-02 88
2017-01-03 53
2017-01-04 4
2017-01-05 26
How do I select the number of observations for a single date?
ts['2017-01-01']
Returns: KeyError: '2017-01-01'
But...
ts['2017-01-01':'2017-01-01']
...seems to work just fine.
Any suggestions how to select/subset with a single date?

As #scnerd pointed out, when you do ts['2017-01-01'] it tries to find '2017-01-01' as a column's name of the dataframe ts, which gives you an KeyError as none of the columns in ts has this name
In order to look for an index' name, as in your example 'date' is set as index, you need to use loc method such as ts.loc['2017-01-01'] and you will get:
observations 54
Name: 2017-01-01 00:00:00, dtype: int32

Related

pandas computation on rolling 1 calendar month

I have a pandas DataFrame with date as the index and a column, 'spendings'. I intend to get the rolling max() of the 'spendings' column for the trailing 1 calendar month (not 30 days or 4 weeks).
I tried to capture a snippet with custom data for addressing the problem, below (borrowed from Pandas monthly rolling operation):
import pandas as pd
from io import StringIO
data = StringIO(
"""\
date spendings
20210325 15
20210405 20
20210415 10
20210425 40
20210505 3
20210515 2
20210525 2
20210527 1
"""
)
df = pd.read_csv(data,sep="\s+", parse_dates=True)
df.index = pd.to_datetime(df.date, format='%Y%m%d')
del(df['date'])
Now, to create a column 'max' to hold rolling last 1 calendar month's max() val, I use:
df['max'] = df.loc[(df.index - pd.tseries.offsets.DateOffset(months=1)):df.index, 'spendings'].max()
This raises an exception like:
TypeError: cannot do slice indexing on DatetimeIndex with these indexers [DatetimeIndex(['2021-02-25', '2021-03-05', '2021-03-15', '2021-03-25',
'2021-04-05', '2021-04-15', '2021-04-25'],
dtype='datetime64[ns]', name='date', freq=None)] of type DatetimeIndex
However, if I manually access a random month window like below, it works without exception:
>>> df['2021-04-16':'2021-05-15']
spendings
date
2021-04-25 40
2021-05-05 3
2021-05-15 2
(I could have followed the method using list comprehension here: https://stackoverflow.com/a/47199274/235415, but I would like to use panda's vectorized method. I have many DataFrames and each is very large - using list comprehension is very slow here).
Q: How to get the vectorized method of performing rolling 1 calendar month's max()?
The expected o/p, ie primarily the 'max' column (holding the max value of 'spendings' for last 1 calendar month) will be something like this:
>>> df
spendings max
date
2021-03-25 15 15
2021-04-05 20 20
2021-04-15 10 20
2021-04-25 40 40
2021-05-05 3 40
2021-05-15 2 40
2021-05-25 2 40
2021-05-27 1 3
The answer will be
[df.loc[x- pd.tseries.offsets.DateOffset(months=1):x, 'spendings'].max() for x in df.index]
Out[53]: [15, 20, 20, 40, 40, 40, 40, 3]

Pandas [.dt] property vs to_datetime

The question is intended to build an understandable grasp on subtle differences between .dt and pd.to_datetime
I want understand which method is suited/preferred and if one can be used as a defacto and other differences that are there between the two
values = {'date_time': ['20190902093000','20190913093000','20190921200000'],
}
df = pd.DataFrame(values, columns = ['date_time'])
df['date_time'] = pd.to_datetime(df['date_time'], format='%Y%m%d%H%M%S')
>>> df
date_time
0 2019-09-02 09:30:00
1 2019-09-13 09:30:00
2 2019-09-21 20:00:00
Using .dt
df['date'] = df['date_time'].dt.date
>>> df
date_time date
0 2019-09-02 09:30:00 2019-09-02
1 2019-09-13 09:30:00 2019-09-13
2 2019-09-21 20:00:00 2019-09-21
>>> df.dtypes
date_time datetime64[ns]
date object
dtype: object
>>> df.date.values
array([datetime.date(2019, 9, 2), datetime.date(2019, 9, 13),
datetime.date(2019, 9, 21)], dtype=object)
Using .dt , even though the elements are individually datetime , is inferred as object in the 'DataFrame` , which sometimes is suited but mostly its causes a lot of problems down the line and an implicit conversion is inevitable
Using pd.to_datetime
df['date_to_datetime'] = pd.to_datetime(df['date'],format='%Y-%m-%d')
>>> df.dtypes
date_time datetime64[ns]
date object
date_to_datetime datetime64[ns]
>>> df.date_to_datetime.values
array(['2019-09-02T00:00:00.000000000', '2019-09-13T00:00:00.000000000',
'2019-09-21T00:00:00.000000000'], dtype='datetime64[ns]')
Using pd.to_datetime , natively returns a datetime64[ns] array and inferred the same in the DataFrame , which in my experience is consistent and widely used , when dealing with dates using pandas
I m aware of the fact a native Date dtype does not exist in pandas , and is wrapped around datetime64[ns]
The two concepts are quite different.
pandas.to_datetime() is a function that can take a variety of inputs and convert them to a pandas datetime index. For example:
dates = pd.to_datetime([1610290846000000000, '2020-01-11', 'Jan 12 2020 2pm'])
print(dates)
# DatetimeIndex(['2021-01-10 15:00:46', '2020-01-11 00:00:00',
# '2020-01-12 14:00:00'],
# dtype='datetime64[ns]', freq=None)
pandas.Series.dt is an interface on a pandas series that gives you convenient access to operations on data stored as a pandas datetime. For example:
x = pd.Series(dates)
print(x.dt.date)
# 0 2021-01-10
# 1 2020-01-11
# 2 2020-01-12
# dtype: object
print(x.dt.hour)
# 0 15
# 1 0
# 2 14
# dtype: int64

how to set datetime type index for weekly column in pandas dataframe

I have a data as given below:
date product price amount
201901 A 10 20
201902 A 10 20
201903 A 20 30
201904 C 40 50
This data is saved in test.txt file.
Date column is given as a weekly column as a concatenation of year and weekid. I am trying to set the date column as an index, with given code:
import pandas as pd
import numpy as np
data=pd.read_csv("test.txt", sep="\t", parse_dates=['date'])
But it gives an error. How can I set the date column as an index with datetime type?
Use index_col parameter for setting index:
data=pd.read_csv("test.txt", sep="\t", index_col=[0])
EDIT: Using column name as index:
data=pd.read_csv("test.txt", sep="\t", index_col=['date'])
For converting index from int to date time, do this:
data.index = pd.to_datetime(data.index, format='%Y%m')
There might be simpler solutions than this too, using apply first I converted your Year-Weekid into Year-month-day format and then just simply used set_index to make date as index column.
import pandas as pd
data ={
'date' : [201901,201902,201903,201904,201905],
'product' : ['A','A','A','C','C'],
'price' : [10,10,10,20,20],
'amount' : [20,20,30,50,60]
}
df = pd.DataFrame(data)
# str(x)+'1' converts to Year-WeekId-Weekday, so 1 represents `Monday` so 2019020
# means 2019 Week2 Monday.
# If you want you can try with other formats too
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x)+'1',format='%Y%W%w'))
df.set_index(['date'],inplace=True)
df
Edit:
To see datetime in Year-WeekID format you can style the dataframe as follows, however if you set date as index column following code won't be able to work. And also remember following code just applies some styling so just useful for display purpose only, internally it will remain as date-time object.
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x)+'1',format='%Y%W%w'))
style_format = {'date':'{:%Y%W}'}
df.style.format(style_format)
You also can use the date_parser parameter:
import pandas as pd
from io import StringIO
from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y%m')
inputtxt = StringIO("""date product price amount
201901 A 10 20
201902 A 10 20
201903 A 20 30
201904 C 40 50""")
df = pd.read_csv(inputtxt, sep='\s+', parse_dates=['date'], date_parser=dateparse)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 4 non-null datetime64[ns]
1 product 4 non-null object
2 price 4 non-null int64
3 amount 4 non-null int64
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 256.0+ bytes

How to set datetime format for pandas dataframe column labels?

IPNI_RNC PATHID 2020-11-11 00:00:00 2020-11-12 00:00:00 2020-11-13 00:00:00 2020-11-14 00:00:00 2020-11-15 00:00:00 2020-11-16 00:00:00 2020-11-17 00:00:00 Last Day Violation Count
Above are the columns label after reading the excel file. There are 10 columns in df variable after reading the excel and 7 of the columns label are date.
My input data set is an excel file which changes everyday and I want to update it automatically. In excel, some columns label are date like 11-Nov-2020, 12-Nov-2020 but after reading the excel it becomes like 2020-11-11 00:00:00, 2020-11-12 00:00:00. I want to keep column labels as 11-Nov-2020, 12-Nov-2020 while reading excel by pd.read_excel if possible or I need to convert it later.
I am very new in python. Looking forward for your support
Thanks who have already came forward to cooperate me
You can of course use the standard python methods to parse the date values, but I would not recommend it, because this way you end up with python datetime objects and not with the pandas representation of dates. That means, it consumes more space, is probably not as efficient and you can't use the pandas methods to access e.g. the year. I'll show you, what I mean below.
In case you want to avoid the naming issue of your column names, you might want to try to prevent pandas to automatically assign the names and read the first line as data to fix it yourselfe automatically (see the section below about how you can do it).
The type conversion part:
# create a test setup with a small dataframe
import pandas as pd
from datetime import date, datetime, timedelta
df= pd.DataFrame(dict(id=range(10), date_string=[str(datetime.now()+ timedelta(days=d)) for d in range(10)]))
# test the python way:
df['date_val_python']= df['date_string'].map(lambda dt: str(dt))
# use the pandas way: (btw. if you want to explicitely
# specify the format, you can use the format= keyword)
df['date_val_pandas']= pd.to_datetime(df['date_string'])
df.dtypes
The output is:
id int64
date_string object
date_val_python object
date_val_pandas datetime64[ns]
dtype: object
As you can see date_val has type object, this is because it contains python objects of class datetime while date_val_pandas uses the internal datetime representation of pandas. You can now try:
df['date_val_pandas'].dt.year
# this will return a series with the year part of the date
df['date_val_python'].dt.year
# this will result in the following error:
AttributeError: Can only use .dt accessor with datetimelike values
See the pandas doc for to_datetime for more details.
The column naming part:
# read your dataframe as usual
df= pd.read_excel('c:/scratch/tmp/dates.xlsx')
rename_dict= dict()
for old_name in df.columns:
if hasattr(old_name, 'strftime'):
new_name= old_name.strftime('DD-MMM-YYYY')
rename_dict[old_name]= new_name
if len(rename_dict) > 0:
df.rename(columns=rename_dict, inplace=True)
This works, in case your column titles are stored as usual dates, which I suppose is true, because you get a time part after importing them.
strftime of the datetime module is the function you need:
If datetime is a datetime object, you can do
datetime.strftime("%d-%b-%Y")
Example:
>>> from datetime import datetime
>>> timestamp = 1528797322
>>> date_time = datetime.fromtimestamp(timestamp)
>>> print(date_time)
2018-06-12 11:55:22
>>> print(date_time.strftime("%d-%b-%Y"))
12-Jun-2018
In order to apply a function to certain dataframe columns, use:
datetime_cols_list = ['datetime_col1', 'datetime_col2', ...]
for col in dataframe.columns:
if col in datetime_cols_list:
dataframe[col] = dataframe[col].apply(lambda x: x.strftime("%d-%b-%Y"))
I am sure this can be done in multiple ways in pandas, this is just what came out the top of my head.
Example:
import pandas as pd
import numpy as np
np.random.seed(0)
# generate some random datetime values
rng = pd.date_range('2015-02-24', periods=5, freq='T')
other_dt_col = rng = pd.date_range('2016-02-24', periods=5, freq='T')
df = pd.DataFrame({ 'Date': rng, 'Date2': other_dt_col,'Val': np.random.randn(len(rng)) })
print (df)
# Output:
# Date Date2 Val
# 0 2016-02-24 00:00:00 2016-02-24 00:00:00 1.764052
# 1 2016-02-24 00:01:00 2016-02-24 00:01:00 0.400157
# 2 2016-02-24 00:02:00 2016-02-24 00:02:00 0.978738
# 3 2016-02-24 00:03:00 2016-02-24 00:03:00 2.240893
# 4 2016-02-24 00:04:00 2016-02-24 00:04:00 1.867558
datetime_cols_list = ['Date', 'Date2']
for col in df.columns:
if col in datetime_cols_list:
df[col] = df[col].apply(lambda x: x.strftime("%d-%b-%Y"))
print (df)
# Output:
# Date Date2 Val
# 0 24-Feb-2016 24-Feb-2016 1.764052
# 1 24-Feb-2016 24-Feb-2016 0.400157
# 2 24-Feb-2016 24-Feb-2016 0.978738
# 3 24-Feb-2016 24-Feb-2016 2.240893
# 4 24-Feb-2016 24-Feb-2016 1.867558

Parse dates and create time series from .csv

I am using a simple csv file which contains data on calory intake. It has 4 columns: cal, day, month, year. It looks like this:
cal month year day
3668.4333 1 2002 10
3652.2498 1 2002 11
3647.8662 1 2002 12
3646.6843 1 2002 13
...
3661.9414 2 2003 14
# data types
cal float64
month int64
year int64
day int64
I am trying to do some simple time series analysis. I hence would like to parse month, year, and day to a single column. I tried the following using pandas:
import pandas as pd
from pandas import Series, DataFrame, Panel
data = pd.read_csv('time_series_calories.csv', header=0, pars_dates=['day', 'month', 'year']], date_parser=True, infer_datetime_format=True)
My questions are: (1) How do I parse the data and (2) define the data type of the new column? I know there are quite a few other similar questions and answers (see e.g. here, here and here) - but I can't make it work so far.
You can use parameter parse_dates where define column names in list in read_csv:
import pandas as pd
import numpy as np
import io
temp=u"""cal,month,year,day
3668.4333,1,2002,10
3652.2498,1,2002,11
3647.8662,1,2002,12
3646.6843,1,2002,13
3661.9414,2,2003,14"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), parse_dates=[['year','month','day']])
print (df)
year_month_day cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
print (df.dtypes)
year_month_day datetime64[ns]
cal float64
dtype: object
Then you can rename column:
df.rename(columns={'year_month_day':'date'}, inplace=True)
print (df)
date cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
Or better is pass dictionary with new column name to parse_dates:
df = pd.read_csv(io.StringIO(temp), parse_dates={'dates': ['year','month','day']})
print (df)
dates cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414

Resources