Date stuck as unformattable in pandas dataframe - python-3.x

I am trying to plot time series data and my date column is stuck like this, and I cannot seem to figure out what datatype it is to change it, as adding verbose = True doesn't yield any explanation for the data.
Here is a screenshot of the output Date formatting
Here is the code I have for importing the dataset and all the formatting I've done to it. Ultimately, I want date and time to be separate values, but I cannot figure out why it's auto formatting it and applying today's date.
df = pd.read_csv('dataframe.csv')
df['Avg_value'] = (df['Open'] + df['High'] + df['Low'] + df['Close'])/4
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_datetime(df['Time'])
Any help would be appreciated
The output I'd be looking for would be something like:
Date: 2016-01-04 10:00:00
As one column rather than 2 separate ones.

When you pass a Pandas Series into pd.to_datetime(...), it parses the values and returns a new Series of dtype datetime64[ns] containing the parsed values:
>>> pd.to_datetime(pd.Series(["12:30:00"]))
0 2021-08-24 12:30:00
dtype: datetime64[ns]
The reason you see today's date is that a datetime64 value must have both date and time information. Since the date is missing, the current date is substituted by default.
Pandas does not really have a dtype that supports "just time of day, no date" (there is one that supports time intervals, such as "5 minutes"). The Python standard library does, however: the datetime.time class.
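As a rough illustration of the interval dtype mentioned above, a time-of-day string can also be parsed as an offset from midnight with pd.to_timedelta. This is only a sketch with a made-up sample value; whether a timedelta is a suitable stand-in for a time of day depends on what you do with it afterwards.

```python
import pandas as pd

# Made-up sample value, not taken from the question's dataset
times = pd.Series(["12:30:00"])

# Parsed as a duration since midnight, so no date gets attached
as_offset = pd.to_timedelta(times)
print(as_offset)
```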
Pandas provides the .dt accessor, whose .time attribute extracts a Series of datetime.time objects from a Series of datetime64 values:
>>> pd.to_datetime(pd.Series(["12:30:00"])).dt.time
0 12:30:00
dtype: object
Note that the dtype is just object which is the fallback dtype Pandas uses for anything which is not natively supported by Pandas or NumPy. This means that working with these datetime.time objects will be a bit slower than working with a native Pandas dtype, but this probably won't be a bottleneck for your application.
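For the single combined column asked for in the question, one possible approach is to join the date and time strings before parsing, so that pandas never has to invent a date. The sketch below assumes both columns are strings in the formats shown; the sample rows and the Datetime column name are hypothetical, since the real CSV is not available.

```python
import pandas as pd

# Hypothetical stand-in for dataframe.csv; the real column formats are assumed
df = pd.DataFrame({
    "Date": ["2016-01-04", "2016-01-04"],
    "Time": ["10:00:00", "10:01:00"],
})

# Join the two strings first, then parse once, so no default date is filled in
df["Datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"],
                                format="%Y-%m-%d %H:%M:%S")
print(df["Datetime"])
```

If the file uses a different layout, the format string would need to be adjusted to match.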
Recommended reference: https://pandas.pydata.org/docs/user_guide/timeseries.html

Related

Converting simple returns to monthly log returns

I have a pandas DataFrame with simple daily returns. I need to convert it to monthly log returns and add a column to the current DataFrame. I have to use np.log to compute the monthly return. But I can only compute daily log return. Below is my code.
df['return_monthly'] = np.log(data['Simple Daily Returns'] + 1)
The code only produces daily log returns. Is there a particular method I should be using in the above code to get the monthly return?
Please see my input for the pandas DataFrame; the third column in Excel is the expected output.
The question is a little confusing, but it seems like you want to group the rows by month. This can be done with DataFrame.resample if you have a datetime index, or with DataFrame.groupby or DataFrame.pivot.
Here is a simple implementation, let us know if this isn't what you're looking for. Furthermore, your values are less than 1, so the log is negative. You can adjust as needed. I aggregated the months with sum, but there are many other aggregation functions such as mean(), median(), size() and many more. See the link for a list of aggregating functions.
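import numpy as np
import pandas as pd
# create a dataframe with 1220 values that match your dataset
df = pd.DataFrame({
    'Date': pd.date_range(start='1/1/2019', end='5/4/2022', freq='1D'),
    'Return': np.random.uniform(low=1e-6, high=1.0, size=1220)  # avoid log(0), which is -inf
}).set_index('Date')  # set the index to the date so we can use resample
df['Log_return'] = np.log(df['Return'])
monthly = df.resample('M').sum()  # month-end bins; use 'ME' on pandas >= 2.2
print(monthly)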
Return Log_return
Date
2019-01-31 14.604863 -33.950987
2019-02-28 13.118111 -32.025086
2019-03-31 14.541947 -32.962914
2019-04-30 14.212689 -33.684422
2019-05-31 14.154918 -33.347081
2019-06-30 10.710209 -43.474120
2019-07-31 12.358001 -43.051723
2019-08-31 17.932673 -30.328784
...
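Applied to the original question, a monthly log return is the sum of the daily log returns log(1 + r) within each month. The following is only a sketch under assumptions: the daily returns are randomly generated placeholders, the index is assumed to be a DatetimeIndex, and the column names are borrowed from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical daily simple returns indexed by date (placeholder data)
dates = pd.date_range("2021-01-01", "2021-03-31", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"Simple Daily Returns": rng.normal(0.0, 0.01, len(dates))},
    index=dates,
)

# Daily log returns: log(1 + r)
daily_log = np.log1p(df["Simple Daily Returns"])

# Sum the daily log returns within each calendar month and broadcast the
# monthly total back onto every row of that month
month = df.index.to_period("M")
df["return_monthly"] = daily_log.groupby(month).transform("sum")
print(df.head())
```

Using transform keeps the original daily rows, which matches the request to add a column to the existing DataFrame; one row per month would need an aggregation instead.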

Python Pandas: Supporting 25 hours in datetime index

I want to use a date/time as an index for a dataframe in Pandas.
However, daylight saving time is not properly addressed in the database, so the date/time values for the day in which daylight saving time ends have 25 hours and are represented as such:
2019102700
2019102701
...
2019102724
I am using the following code to convert those values to a DateTime object that I use as an index to a Pandas dataframe:
df.index = pd.to_datetime(df["date_time"], format="%Y%m%d%H")
However, that gives an error:
ValueError: unconverted data remains: 4
Presumably because the to_datetime function is not expecting the hour to be 24. Similarly, the day in which daylight saving time starts only has 23 hours.
One solution I thought of was storing the dates as strings, but that seems neither elegant nor efficient. Is there any way to solve the issue of handling daylight saving time when using to_datetime?
If you know the timezone, here's a way to calculate UTC timestamps. Parse only the date part, localize to the actual time zone the data "belongs" to, and convert that to UTC. Now you can parse the hour part and add it as a time delta - e.g.
import pandas as pd
df = pd.DataFrame({'date_time_str': ['2019102722','2019102723','2019102724',
'2019102800','2019102801','2019102802']})
df['date_time'] = (pd.to_datetime(df['date_time_str'].str[:-2], format='%Y%m%d')
.dt.tz_localize('Europe/Berlin')
.dt.tz_convert('UTC'))
df['date_time'] += pd.to_timedelta(df['date_time_str'].str[-2:].astype(int), unit='h')
# df['date_time']
# 0 2019-10-27 20:00:00+00:00
# 1 2019-10-27 21:00:00+00:00
# 2 2019-10-27 22:00:00+00:00
# 3 2019-10-27 23:00:00+00:00
# 4 2019-10-28 00:00:00+00:00
# 5 2019-10-28 01:00:00+00:00
# Name: date_time, dtype: datetime64[ns, UTC]
I'm not sure if it is the most elegant or efficient solution, but I would:
df.loc[df.date_time.str[-2:]=='24', 'date_time'] = (pd.to_numeric(df.date_time[df.date_time.str[-2:]=='24'])+100-24).apply(str)
df.index = pd.to_datetime(df["date_time"], format="%Y%m%d%H")
Pick the first and the last index, convert them to tz_aware datetime, then you can generate a date_range that handles 25-hour days. And assign the date_range to your df index:
start = pd.to_datetime(df.index[0]).tz_localize("Europe/Berlin")
end = pd.to_datetime(df.index[-1]).tz_localize("Europe/Berlin")
index_ = pd.date_range(start, end, freq="15min")
df = df.set_index(index_)
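To see why a timezone-aware date_range copes with the long day, here is a small sketch for the date in the question. The hourly frequency is an assumption based on the question's data; the snippet above uses 15-minute steps.

```python
import pandas as pd

# Hourly, timezone-aware index for the day daylight saving time ends in Berlin.
# date_range includes both endpoints, so drop the final midnight.
day = pd.date_range("2019-10-27", "2019-10-28", freq="h", tz="Europe/Berlin")[:-1]

print(len(day))                 # number of local hours in that day
print(day[day.hour == 2])       # the repeated 02:00 hour, with two different offsets
```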

Pandas - Exclude Timezone when using .apply(pd.to_datetime) [duplicate]

I have been struggling with removing the time zone info from a column in a pandas dataframe. I have checked the following question, but it does not work for me:
Can I export pandas DataFrame to Excel stripping tzinfo?
I used tz_localize to assign a timezone to a datetime object, because I need to convert to another timezone using tz_convert. This adds a UTC offset such as "-06:00". I need to get rid of this offset, because it results in an error when I try to export the dataframe to Excel.
Actual output
2015-12-01 00:00:00-06:00
Desired output
2015-12-01 00:00:00
I have tried to get the characters I want using the str() method, but it seems the result of tz_localize is not a string. My solution so far is to export the dataframe to csv, read the file, and to use the str() method to get the characters I want.
Is there an easier solution?
If your series contains only datetimes, then you can do:
my_series.dt.tz_localize(None)
This will remove the timezone information (it will not change the time) and return a series of naive local times, which can be exported to Excel using to_excel(), for example.
It may help to strip the last 6 characters:
print(df)
datetime
0 2015-12-01 00:00:00-06:00
1 2015-12-01 00:00:00-06:00
2 2015-12-01 00:00:00-06:00
df['datetime'] = df['datetime'].astype(str).str[:-6]
print(df)
datetime
0 2015-12-01 00:00:00
1 2015-12-01 00:00:00
2 2015-12-01 00:00:00
To remove the timezone from all datetime columns in a DataFrame with mixed columns, use:
for col in df.select_dtypes(['datetimetz']).columns:
    df[col] = df[col].dt.tz_localize(None)
If you still can't save df to an Excel file, use this instead (it converts the values to UTC before dropping the timezone, rather than keeping the local times):
for col in df.select_dtypes(['datetimetz']).columns:
    df[col] = df[col].dt.tz_convert(None)
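The difference between the two calls is easy to miss, so here is a small sketch with a made-up timestamp. The America/Chicago zone is an assumption chosen to reproduce the -06:00 offset from the question.

```python
import pandas as pd

# Made-up sample in US Central time (UTC-06:00 in December)
s = pd.Series(pd.to_datetime(["2015-12-01 00:00:00"])).dt.tz_localize("America/Chicago")

naive_local = s.dt.tz_localize(None)  # keeps the wall-clock time
naive_utc = s.dt.tz_convert(None)     # converts to UTC, then drops the timezone

print(naive_local.iloc[0])
print(naive_utc.iloc[0])
```

Both results are timezone-naive and can be written with to_excel(); which one is appropriate depends on whether the spreadsheet should show local or UTC times.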
Following Beatriz Fonseca's suggestion, I ended up doing the following:
from datetime import datetime
df['dates'].apply(lambda x:datetime.replace(x,tzinfo=None))
If it is always the last 6 characters that you want to ignore, you may simply slice your current string:
>>> '2015-12-01 00:00:00-06:00'[0:-6]
'2015-12-01 00:00:00'

PANDAS date summarisation

I have a pandas dataframe that looks like:
import pandas as pd
df= pd.DataFrame({'type':['Asset','Liability','Asset','Liability','Asset'],'Amount':[10,-10,20,-20,5],'Maturity Date':['2018-01-22','2018-02-22','2018-06-22','2019-06-22','2020-01-22']})
df
I want to aggregate the dates so it shows the first four quarters and then the year end. For the dataset above, I would expect:
df1= pd.DataFrame({'type':['Asset','Liability','Asset','Liability','Asset'],'Amount':[10,-10,20,-20,5],'Maturity Date':['2018-01-22','2018-02-22','2018-06-22','2019-06-22','2020-01-22'],'Mat Group':['1Q18','1Q18','2Q18','FY19','FY20']})
df1
Right now I achieve this using a set of loc statements such as:
df.loc[(df['Maturity Date'] >'2018-01-01') & (df['Maturity Date'] <='2018-03-31'),'Mat Group']="1Q18"
df.loc[(df['Maturity Date'] >'2018-04-01') & (df['Maturity Date'] <='2018-06-30'),'Mat Group']="2Q18"
I was wondering if there is a more elegant way to achieve the same result? Perhaps have the buckets in a list and parse through the list so that the bucketing can be made more flexible ?
This is a bit specific to your case, but I would use:
the strftime format %y to get the two-digit year
the pandas built-in quarter to get the quarter
the python format function to construct strings
a lambda to apply it to the column
Here is the result. Maybe there is a better answer, but this one is pretty concise.
df['Maturity Date'] = pd.to_datetime(df['Maturity Date'])  # needed for .quarter and %y
df['Mat Group'] = df['Maturity Date'].apply(
    lambda x: '{}Q{:%y}'.format(x.quarter, x) if x.year < 2019
    else 'FY{:%y}'.format(x))
df
# Amount Maturity Date type Mat Group
# 0 10 2018-01-22 Asset 1Q18
# 1 -10 2018-02-22 Liability 1Q18
# 2 20 2018-06-22 Asset 2Q18
# 3 -20 2019-06-22 Liability FY19
# 4 5 2020-01-22 Asset FY20
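A vectorised variant of the same idea is sketched below, in case the row-wise apply becomes slow on a larger frame. The last_quarterly_year variable is a hypothetical knob introduced here to make the cutoff adjustable; it is not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "type": ["Asset", "Liability", "Asset", "Liability", "Asset"],
    "Amount": [10, -10, 20, -20, 5],
    "Maturity Date": ["2018-01-22", "2018-02-22", "2018-06-22",
                      "2019-06-22", "2020-01-22"],
})

last_quarterly_year = 2018  # hypothetical cutoff: later years are labelled FYyy

dates = pd.to_datetime(df["Maturity Date"])
yy = dates.dt.strftime("%y")                              # two-digit year
quarter_label = dates.dt.quarter.astype(str) + "Q" + yy   # e.g. 1Q18

# Quarter labels up to the cutoff year, full-year labels afterwards
df["Mat Group"] = np.where(dates.dt.year <= last_quarterly_year,
                           quarter_label, "FY" + yy)
print(df)
```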

Efficient way of converting String column to Date in Pandas (in Python), but without Timestamp

I have a DataFrame which contains two String columns df['month'] and df['year']. I want to create a new column df['date'] by combining the month and year columns. I have done that successfully using the structure below -
df['date']=pd.to_datetime((df['month']+df['year']),format='%m%Y')
where by for df['month'] = '08' and df['year']='1968'
we get df['date']=1968-08-01
This is exactly what I wanted.
Problem at hand: My DataFrame has more than 200,000 rows and I notice that sometimes, in addition, I also get Timestamp like the one below for a few rows and I want to avoid that -
1972-03-01 00:00:00
I solved this issue by using the .dt accessor, which can be used to manipulate the Series, whereby I explicitly extracted only the date using the code below-
df['date']=pd.to_datetime((df['month']+df['year']),format='%m%Y') #Line 1
df['date']=df['date'].dt.date #Line 2
The problem was solved, just that the Line 2 took 5 times more time than Line 1.
Question: Is there any way where I could tweak Line 1 into giving just the dates and not the Timestamp? I am sure this simple problem cannot have such an inefficient solution. Can I solve this issue in a more time and resource efficient manner?
AFAIK there is no date dtype in Pandas, only datetime, so there will always be a time part.
Even though Pandas shows: 1968-08-01, it has a time part: 00:00:00.
Demo:
In [32]: df = pd.DataFrame(pd.to_datetime(['1968-08-01', '2017-08-01']), columns=['Date'])
In [33]: df
Out[33]:
Date
0 1968-08-01
1 2017-08-01
In [34]: df['Date'].dt.time
Out[34]:
0 00:00:00
1 00:00:00
Name: Date, dtype: object
And if you want to have a string representation, there is a faster way:
df['date'] = df['year'].astype(str) + '-' + df['month'].astype(str) + '-01'
UPDATE: be aware that .dt.date gives you Python datetime.date objects stored in an object-dtype column, not a native datetime dtype:
In [53]: df.dtypes
Out[53]:
Date datetime64[ns]
dtype: object
In [54]: df['new'] = df['Date'].dt.date
In [55]: df
Out[55]:
Date new
0 1968-08-01 1968-08-01
1 2017-08-01 2017-08-01
In [56]: df.dtypes
Out[56]:
Date datetime64[ns]
new object # <--- NOTE !!!
dtype: object
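A quick way to check what actually ends up in such an object column is to look at the type of a single element. This is a small sketch reusing the demo frame from above:

```python
import datetime
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["1968-08-01", "2017-08-01"])})
df["new"] = df["Date"].dt.date

first = df["new"].iloc[0]
print(type(first))       # element type inside the object column
print(df["new"].dtype)   # column dtype reported by pandas
```

Because each element is a separate Python object, building the column is plausibly where the extra time for Line 2 in the question goes.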
