How do I change the date format for a Pandas Index? - python-3.x

I'm loading some time series data in the following way:
snp = web.DataReader("^GSPC", 'yahoo', start, end)['Adj Close']
The index is then automatically formatted as 'datetime64[ns]'
I then resample the daily data to yearly like this:
snp_yr = snp.resample('A')
The date formatting is still the same as described above. How do I change this to show the year only (%Y)?
E.g. from '2015-12-31 00:00:00' to '2015'

I think you need DatetimeIndex.year, and if you need to convert to string, add astype:
df.index = df.index.year
Sample:
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=10, freq='3M')
df = pd.DataFrame({'a': range(10)},index=rng)
print (df)
a
2015-02-28 0
2015-05-31 1
2015-08-31 2
2015-11-30 3
2016-02-29 4
2016-05-31 5
2016-08-31 6
2016-11-30 7
2017-02-28 8
2017-05-31 9
df.index = df.index.year.astype(str)
print (df)
a
2015 0
2015 1
2015 2
2015 3
2016 4
2016 5
2016 6
2016 7
2017 8
2017 9
print (df.index)
Index(['2015', '2015', '2015', '2015', '2016', '2016', '2016', '2016', '2017',
'2017'],
dtype='object')
Another solution with strftime:
df.index = df.index.strftime('%Y')
print (df)
a
2015 0
2015 1
2015 2
2015 3
2016 4
2016 5
2016 6
2016 7
2017 8
2017 9
print (df.index)
Index(['2015', '2015', '2015', '2015', '2016', '2016', '2016', '2016', '2017',
'2017'],
dtype='object')
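Applied to the resampled series from the question, the same idea looks like this (a minimal sketch with synthetic daily data standing in for the Yahoo download; note that resample('A') needs an aggregation such as mean()):
import numpy as np
import pandas as pd

# synthetic stand-in for the downloaded 'Adj Close' series (assumption)
rng = pd.date_range('2015-01-01', '2017-12-31', freq='D')
snp = pd.Series(np.random.randn(len(rng)).cumsum(), index=rng)

snp_yr = snp.resample('A').mean()   # 'A' = year-end frequency; an aggregation is required
snp_yr.index = snp_yr.index.year    # integer years; add .astype(str) for '2015'-style labels
print (snp_yr)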

Related

Pandas: How to create a DateTime index

There is Pandas Dataframe as:
year month count
0 2014 Jan 12
1 2014 Feb 10
2 2015 Jan 12
3 2015 Feb 10
How to create a DateTime index from 'year' and 'month', so the result would be:
count
2014.01.31 12
2014.02.28 10
2015.01.31 12
2015.02.28 10
Use to_datetime with DataFrame.pop to consume and remove the columns in one step, then add offsets.MonthEnd:
dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
df.index = dates + pd.offsets.MonthEnd()
print (df)
count
2014-01-31 12
2014-02-28 10
2015-01-31 12
2015-02-28 10
Or:
dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
df.index = dates + pd.to_timedelta(dates.dt.daysinmonth - 1, unit='d')
print (df)
count
2014-01-31 12
2014-02-28 10
2015-01-31 12
2015-02-28 10
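A related sketch of the same idea (the sample frame is rebuilt here so it runs standalone) goes through a monthly period and back to a month-end timestamp:
import pandas as pd

df = pd.DataFrame({'year': [2014, 2014, 2015, 2015],
                   'month': ['Jan', 'Feb', 'Jan', 'Feb'],
                   'count': [12, 10, 12, 10]})

dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
# to_period('M') drops the day, to_timestamp(how='end') restores it as the month end
df.index = dates.dt.to_period('M').dt.to_timestamp(how='end').dt.normalize()
print (df)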

Key Error while subsetting Timeseries data using index

I have the following Timeseries data.
price_per_year.head()
price
date
2013-01-02 20.08
2013-01-03 19.78
2013-01-04 19.86
2013-01-07 19.40
2013-01-08 19.66
price_per_year.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 782 entries, 2013-01-02 to 2015-12-31
Data columns (total 1 columns):
price 756 non-null float64
dtypes: float64(1)
memory usage: 12.2 KB
I am trying to extract data for 3 years using the code below. Why am I getting KeyError: '2014' when the data, as shown below, clearly contains the year '2014'? Appreciate any inputs.
price_per_year['2014'].head()
price
date
2014-01-01 NaN
2014-01-02 39.59
2014-01-03 40.12
2014-01-06 39.93
2014-01-07 40.92
prices = pd.DataFrame()
for year in ['2013', '2014', '2015']:
    price_per_year = price_per_year.loc[year, ['price']].reset_index(drop=True)
    price_per_year.rename(columns={'price': year}, inplace=True)
    prices = pd.concat([prices, price_per_year], axis=1)
KeyError: '2014'
The code line price_per_year.loc['2014', ['price']] works fine when used independently outside the for loop, while price_per_year['price'][year] doesn't work when used in the for loop.
for year in ['2013', '2014', '2015']:
    price_per_year = price_per_year['price'][year].reset_index(drop=True)
KeyError: 'price'
Both price_per_year.loc[price_per_year.index.year == 2014, ['price']] used independently outside the for loop and price_per_year.loc[price_per_year.index.year == year, ['price']] used inside the for loop give errors.
for year in ['2013', '2014', '2015']:
    price_per_year.loc[price_per_year.index.year == '2014', ['price']].reset_index(drop=True)
TypeError: Cannot convert input [False] of type <class 'bool'> to Timestamp
The problem in your first code is that you overwrite price_per_year inside the loop, so the partial string indexing with DataFrame.loc no longer finds the next year and raises the KeyError. Keep the original frame intact and assign each year's slice to a new variable instead:
prices = pd.DataFrame()
for year in ['2013', '2014', '2015']:
    s = price_per_year['price'][year].reset_index(drop=True).rename(year)
    prices = pd.concat([prices, s], axis=1)
print (prices)
2013 2014 2015
0 20.08 19.86 19.66
1 19.78 19.40 19.66
Another, better solution reshapes with unstack:
print (df)
price
date
2013-01-02 20.08
2013-01-03 19.78
2014-01-02 19.86
2014-01-03 19.40
2015-01-02 19.66
2015-01-03 19.66
y = df.index.year
df = df.set_index([df.groupby(y).cumcount(), y])['price'].unstack()
print (df)
date 2013 2014 2015
0 20.08 19.86 19.66
1 19.78 19.40 19.66
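For the boolean-mask variant from the question, the TypeError comes from comparing DatetimeIndex.year (integers) against the string '2014'; a minimal sketch of the fixed loop, assuming the original price_per_year frame:
prices = pd.DataFrame()
for year in [2013, 2014, 2015]:
    # index.year holds ints, so compare with ints, and keep the original frame intact
    s = price_per_year.loc[price_per_year.index.year == year, 'price'].reset_index(drop=True)
    prices = pd.concat([prices, s.rename(str(year))], axis=1)
print (prices)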

Pandas Dataframe: Reduce the value of a 'Days' by 1 if the corresponding 'Year' is a leap year

If 'Days' is greater than, e.g., 10 and the corresponding 'Year' is a leap year, then reduce 'Days' by 1, only in that particular row. I tried some operations but couldn't do it. I am new to pandas. Appreciate any help.
sample data:
data = [['1', '2005'], ['2', '2006'], ['3', '2008'],['50','2009'],['69','2008']]
df=pd.DataFrame(data,columns=['Days','Year'])
I want 'Days' in row 5 to become 69 and everything else to remain the same.
In [98]: import calendar
In [99]: data = [['1', '2005'], ['2', '2006'], ['3', '2008'], ['50', '2009'], ['70', '2008']]; df = pd.DataFrame(data, columns=['Days', 'Year'])
In [100]: df = df.astype(int)
In [102]: df["New_Days"] = df.apply(lambda x: x["Days"] - 1 if (x["Days"] > 10 and calendar.isleap(x["Year"])) else x["Days"], axis=1)
In [103]: df
Out[103]:
Days Year New_Days
0 1 2005 1
1 2 2006 2
2 3 2008 3
3 50 2009 50
4 70 2008 69
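A vectorized sketch of the same rule (assuming the same integer Days/Year columns) uses numpy.where with calendar.isleap mapped over the Year column:
import calendar
import numpy as np
import pandas as pd

data = [['1', '2005'], ['2', '2006'], ['3', '2008'], ['50', '2009'], ['70', '2008']]
df = pd.DataFrame(data, columns=['Days', 'Year']).astype(int)

# subtract 1 only where Days > 10 and the year is a leap year
leap = df['Year'].map(calendar.isleap)
df['New_Days'] = np.where((df['Days'] > 10) & leap, df['Days'] - 1, df['Days'])
print (df)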

How to run a script on x axis of plots in matplotlib [duplicate]

I want to transform an integer between 1 and 12 into an abbreviated month name.
I have a df which looks like:
client Month
1 sss 02
2 yyy 12
3 www 06
I want the df to look like this:
client Month
1 sss Feb
2 yyy Dec
3 www Jun
Most of the info I found was not about python > pandas > dataframe, hence the question.
You can do this efficiently by combining calendar.month_abbr and df[col].apply() (cast to int, since the question's Month column holds strings like '02'):
import calendar
df['Month'] = df['Month'].apply(lambda x: calendar.month_abbr[int(x)])
Since the abbreviated month names are the first three letters of their full names, we can first convert the Month column to datetime, then use dt.month_name() to get the full month name, and finally use str.slice() to take the first three letters, all in pandas and in one line of code:
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.month_name().str.slice(stop=3)
df
Month client
0 Feb sss
1 Dec yyy
2 Jun www
The calendar module is useful, but calendar.month_abbr is array-like: it cannot be used directly in a vectorised fashion. For an efficient mapping, you can construct a dictionary and then use pd.Series.map:
import calendar
d = dict(enumerate(calendar.month_abbr))
df['Month'] = df['Month'].map(d)
Benchmarking shows a ~130x performance differential:
import calendar
d = dict(enumerate(calendar.month_abbr))
mapper = calendar.month_abbr.__getitem__
np.random.seed(0)
n = 10**5
df = pd.DataFrame({'A': np.random.randint(1, 13, n)})
%timeit df['A'].map(d) # 7.29 ms per loop
%timeit df['A'].map(mapper) # 946 ms per loop
Solution 1: One liner
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.strftime('%b')
Solution 2: Using apply()
def mapper(month):
    return month.strftime('%b')

df['Month'] = df['Month'].apply(mapper)
Reference:
http://strftime.org/
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
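Note that the mapper in Solution 2 assumes the Month column already holds datetimes; a minimal sketch of the full path from the question's '02'-style strings:
import pandas as pd

df = pd.DataFrame({'client': ['sss', 'yyy', 'www'], 'Month': ['02', '12', '06']})
df['Month'] = pd.to_datetime(df['Month'], format='%m')       # strings -> datetimes first
df['Month'] = df['Month'].apply(lambda m: m.strftime('%b'))  # then Feb, Dec, Jun
print (df)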
using datetime object methods
I'm surprised there isn't already a solution here using strftime.
Note: you'll need a valid datetime column before using the strftime method; use pd.to_datetime(df['date_column']) to cast your target column to datetime.
import pandas as pd
dates = pd.date_range('01-Jan 2020','01-Jan 2021',freq='M')
df = pd.DataFrame({'dates' : dates})
df['month_name'] = df['dates'].dt.strftime('%b')
dates month_name
0 2020-01-31 Jan
1 2020-02-29 Feb
2 2020-03-31 Mar
3 2020-04-30 Apr
4 2020-05-31 May
5 2020-06-30 Jun
6 2020-07-31 Jul
7 2020-08-31 Aug
8 2020-09-30 Sep
9 2020-10-31 Oct
10 2020-11-30 Nov
11 2020-12-31 Dec
Another method would be to slice the name using dt.month_name():
df['month_name_str_slice'] = df['dates'].dt.month_name().str[:3]
dates month_name month_name_str_slice
0 2020-01-31 Jan Jan
1 2020-02-29 Feb Feb
2 2020-03-31 Mar Mar
3 2020-04-30 Apr Apr
4 2020-05-31 May May
5 2020-06-30 Jun Jun
6 2020-07-31 Jul Jul
7 2020-08-31 Aug Aug
8 2020-09-30 Sep Sep
9 2020-10-31 Oct Oct
10 2020-11-30 Nov Nov
11 2020-12-31 Dec Dec
You can do this easily with a column apply.
import pandas as pd
df = pd.DataFrame({'client':['sss', 'yyy', 'www'], 'Month': ['02', '12', '06']})
look_up = {'01': 'Jan', '02': 'Feb', '03': 'Mar', '04': 'Apr', '05': 'May',
'06': 'Jun', '07': 'Jul', '08': 'Aug', '09': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}
df['Month'] = df['Month'].apply(lambda x: look_up[x])
df
Month client
0 Feb sss
1 Dec yyy
2 Jun www
One way of doing this is with the DataFrame's apply method, but to do that you need a mapping to convert the months. You could build it with a function / dictionary, or use Python's own datetime.
With datetime it would be something like:
import datetime

def mapper(month):
    date = datetime.datetime(2000, month, 1)  # you need a date object with the proper month
    return date.strftime('%b')                # %b returns the month's abbreviation

df['Month'].apply(mapper)
In a similar way, you could build your own map for custom names. It would look like this:
months_map = {1: 'Jan', 2: 'Feb'}

def mapper(month):
    return months_map[month]
Obviously, you don't need to define these functions explicitly and could use a lambda directly in the apply method.
Use strptime and a lambda function to go the other way, from the abbreviation back to the month number:
from time import strptime
df['Month'] = df['Month'].apply(lambda x: strptime(x,'%b').tm_mon)
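A quick usage sketch for this reverse direction, assuming the column already holds names like 'Feb':
from time import strptime
import pandas as pd

df = pd.DataFrame({'Month': ['Feb', 'Dec', 'Jun']})
df['Month'] = df['Month'].apply(lambda x: strptime(x, '%b').tm_mon)
print (df)  # 2, 12, 6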
Suppose we have a DF like this, and Date is already in DateTime Format:
df.head(3)
value
date
2016-05-19 19736
2016-05-26 18060
2016-05-27 19997
Then we can extract the month number and month name easily like this:
df['month_num'] = df.index.month
df['month'] = df.index.month_name()
value year month_num month
date
2017-01-06 37353 2017 1 January
2019-01-06 94108 2019 1 January
2019-01-05 77897 2019 1 January
2019-01-04 94514 2019 1 January
Having tested all of these on a large dataset, I have found the following to be fastest:
import calendar

def month_mapping():
    # I'm lazy so I have a stash of functions already written so
    # I don't have to write them out every time. This returns the
    # {1: 'Jan', ..., 12: 'Dec'} dict in the laziest way...
    abbrevs = {}
    for month in range(1, 13):
        abbrevs[month] = calendar.month_abbr[month]
    return abbrevs

abbrevs = month_mapping()
df['Month Abbrev'] = df['Date Col'].dt.month.map(abbrevs)
You can use the Pandas month_name() function. Example:
>>> idx = pd.date_range(start='2018-01', freq='M', periods=3)
>>> idx
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31'],
dtype='datetime64[ns]', freq='M')
>>> idx.month_name()
Index(['January', 'February', 'March'], dtype='object')
The best way would be to use month_name(), as commented by Nurul Akter Towhid (note it needs a datetime column and returns the full name):
df['Month'] = df.Month.dt.month_name()
First you need to strip the leading "0" (as you might otherwise get the exception "leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers"):
Step 1:
def func(i):
    if i[0] == '0':
        i = i[1]
    return i

df["Month"] = df["Month"].apply(lambda x: func(x))
Step 2:
import calendar
df["Month"] = df["Month"].apply(lambda x: calendar.month_name[int(x)])

When converting Date entry in pandas, how to ignore empty values?

I have an Excel file with some date entries, as below.
Case Number Status Date/Time Opened Date/Time Resolved Date/Time Closed
1 Closed 2/1/2017 7:15 AM 2/1/2017 10:44 AM 2/21/2017 11:50 AM
2 Assigned 2/2/2017 2:09 PM
3 Resolved 2/8/2017 10:32 AM 9/11/2017 8:49 PM
4 Closed 8/27/2018 6:00 AM 10/15/2018 9:10 AM 10/15/2018 9:10 AM
5 Resolved 12/26/2018 3:25 PM 2/11/2019 9:08 AM
Initially I'm converting them from the above pattern to $year-$mm-$dd.
Case Number Status Date/Time Opened Date/Time Resolved Date/Time Closed
1 Closed 2017-02-01 2017-02-01 2017-02-21
2 Assigned 2017-02-02 NaN NaN
3 Resolved 2017-02-08 2017-09-11 NaN
4 Closed 2018-08-27 2018-10-15 2018-10-15
5 Resolved 2018-12-26 2019-02-11 NaN
With these converted dates, I'm trying to extract the month and year in the format $mon $year.
I'm using following code to extract month and year.
df['Month Opened'] = pd.to_datetime(df["Date/Time Opened"]).map(lambda x: calendar.month_abbr[x.month] + " " + str(x.year))
When I have applied this formula with 'Date/Time Opened', I can see that it works as below.
Case Number Status Date/Time Opened Date/Time Resolved Date/Time Closed Month Opened
1 Closed 2017-02-01 2017-02-01 2017-02-21 Feb 2017
2 Assigned 2017-02-02 NaN NaN Feb 2017
3 Resolved 2017-02-08 2017-09-11 NaN Feb 2017
4 Closed 2018-08-27 2018-10-15 2018-10-15 Aug 2018
5 Resolved 2018-12-26 2019-02-11 NaN Dec 2018
Here's my complete code - http://tpcg.io/X5S8Pe
import pandas as pd
import calendar
CaseDetails = {
'Case Number': [1, 2, 3, 4, 5],
'Status': ['Closed', 'Assigned', 'Resolved', 'Closed', 'Resolved'],
'Date/Time Opened': ['2/1/2017 7:15 AM', '2/2/2017 2:09 PM', '2/8/2017 10:32 AM', '8/27/2018 6:00 AM', '12/26/2018 3:25 PM'],
'Date/Time Resolved': ['2/1/2017 10:44 AM', '', '9/11/2017 8:49 PM', '10/15/2018 9:10 AM', '2/11/2019 9:08 AM'],
'Date/Time Closed': ['2/21/2017 11:50 AM', '', '', '10/15/2018 9:10 AM', '']
}
df = pd.DataFrame(CaseDetails,columns= ['Case Number', 'Status', 'Date/Time Opened', 'Date/Time Resolved', 'Date/Time Closed'])
df['Date/Time Opened'] = pd.to_datetime(df['Date/Time Opened']).dt.date
df['Date/Time Resolved'] = pd.to_datetime(df['Date/Time Resolved']).dt.date
df['Date/Time Closed'] = pd.to_datetime(df['Date/Time Closed']).dt.date
print (df)
df['Month Opened'] = pd.to_datetime(df["Date/Time Opened"]).map(lambda x: calendar.month_abbr[x.month] + " " + str(x.year))
df['Month Closed'] = pd.to_datetime(df["Date/Time Closed"]).map(lambda x: calendar.month_abbr[x.month] + " " + str(x.year))
print (df)
As expected, my code converted the entries under 'Date/Time Opened' to the desired format. When I tried converting the other two date columns, I got the following error.
Traceback (most recent call last):
File "main.py", line 21, in <module>
df['Month Closed'] = pd.to_datetime(df["Date/Time Closed"]).map(lambda x: calendar.month_abbr[x.month] + " " + str(x.year))
File "/usr/lib64/python2.7/site-packages/pandas/core/series.py", line 2158, in map
new_values = map_f(values, arg)
File "pandas/_libs/src/inference.pyx", line 1569, in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)
File "main.py", line 21, in <lambda>
df['Month Closed'] = pd.to_datetime(df["Date/Time Closed"]).map(lambda x: calendar.month_abbr[x.month] + " " + str(x.year))
File "/usr/lib64/python2.7/calendar.py", line 56, in __getitem__
funcs = self._months[i]
TypeError: list indices must be integers, not float
I wanted to know: is there a way to convert the columns with empty values?
It is possible to use Series.dt.strftime here - it works nicely with missing values:
df['Date/Time Opened'] = pd.to_datetime(df['Date/Time Opened']).dt.strftime('%b %Y')
df['Date/Time Resolved'] = pd.to_datetime(df['Date/Time Resolved']).dt.strftime('%b %Y')
df['Date/Time Closed'] = pd.to_datetime(df['Date/Time Closed']).dt.strftime('%b %Y')
An alternative is to use apply with a list of columns:
cols = ['Date/Time Opened','Date/Time Resolved','Date/Time Closed']
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x).dt.strftime('%b %Y'))
print (df)
Case Number Status Date/Time Opened Date/Time Resolved Date/Time Closed
0 1 Closed Feb 2017 Feb 2017 Feb 2017
1 2 Assigned Feb 2017 NaT NaT
2 3 Resolved Feb 2017 Sep 2017 NaT
3 4 Closed Aug 2018 Oct 2018 Oct 2018
4 5 Resolved Dec 2018 Feb 2019 NaT
In your solution it is possible to use the nice trick that np.nan != np.nan, so an if-else statement is added to your function (this also needs import numpy as np):
f = lambda x: calendar.month_abbr[x.month] + " " + str(x.year) if x == x else np.nan
df['Date/Time Opened'] = pd.to_datetime(df['Date/Time Opened']).map(f)
df['Date/Time Resolved'] = pd.to_datetime(df['Date/Time Resolved']).map(f)
df['Date/Time Closed'] = pd.to_datetime(df['Date/Time Closed']).map(f)
print (df)
Case Number Status Date/Time Opened Date/Time Resolved Date/Time Closed
0 1 Closed Feb 2017 Feb 2017 Feb 2017
1 2 Assigned Feb 2017 NaN NaN
2 3 Resolved Feb 2017 Sep 2017 NaN
3 4 Closed Aug 2018 Oct 2018 Oct 2018
4 5 Resolved Dec 2018 Feb 2019 NaN
Or alternatively:
f = lambda x: calendar.month_abbr[x.month] + " " + str(x.year) if x == x else np.nan
cols = ['Date/Time Opened','Date/Time Resolved','Date/Time Closed']
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x).map(f))
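If the columns can contain other junk besides empty strings, a sketch with errors='coerce' turns anything unparseable into NaT before formatting:
cols = ['Date/Time Opened', 'Date/Time Resolved', 'Date/Time Closed']
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x, errors='coerce').dt.strftime('%b %Y'))
print (df)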
