Efficient way of converting String column to Date in Pandas (in Python), but without Timestamp

I have a DataFrame that contains two string columns, df['month'] and df['year']. I want to create a new column df['date'] by combining the month and year columns. I have done that successfully using the line below:
df['date']=pd.to_datetime((df['month']+df['year']),format='%m%Y')
whereby for df['month'] = '08' and df['year'] = '1968'
we get df['date'] = 1968-08-01
This is exactly what I wanted.
Problem at hand: my DataFrame has more than 200,000 rows, and I notice that for a few rows I also get a full Timestamp like the one below, which I want to avoid:
1972-03-01 00:00:00
I solved this issue by using the .dt accessor, which can be used to manipulate the Series; with it I explicitly extracted only the date, using the code below:
df['date']=pd.to_datetime((df['month']+df['year']),format='%m%Y') #Line 1
df['date']=df['date'].dt.date #Line 2
That solved the problem, except that Line 2 took about five times longer to run than Line 1.
Question: is there any way to tweak Line 1 so that it gives just the dates and not a Timestamp? I am sure this simple problem cannot require such an inefficient solution. Can I solve this in a more time- and resource-efficient manner?

AFAIK we don't have a date dtype in Pandas; we only have datetime, so we will always have a time part.
Even though Pandas shows: 1968-08-01, it has a time part: 00:00:00.
Demo:
In [32]: df = pd.DataFrame(pd.to_datetime(['1968-08-01', '2017-08-01']), columns=['Date'])

In [33]: df
Out[33]:
        Date
0 1968-08-01
1 2017-08-01

In [34]: df['Date'].dt.time
Out[34]:
0    00:00:00
1    00:00:00
Name: Date, dtype: object
And if you want to have a string representation, there is a faster way:
df['date'] = df['year'].astype(str) + '-' + df['month'].astype(str) + '-01'
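One caveat (my note, not part of the original answer): this only yields clean ISO-style strings when the month values are zero-padded. If they might come through as '8' rather than '08', pad them first:
# a sketch: zero-pad the month so '8' and '08' both end up as '08'
df['date'] = df['year'].astype(str) + '-' + df['month'].astype(str).str.zfill(2) + '-01'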
UPDATE: be aware that .dt.date will give you plain Python date objects in an object-dtype column, not a datetime64 one:
In [53]: df.dtypes
Out[53]:
Date    datetime64[ns]
dtype: object

In [54]: df['new'] = df['Date'].dt.date

In [55]: df
Out[55]:
        Date         new
0 1968-08-01  1968-08-01
1 2017-08-01  2017-08-01

In [56]: df.dtypes
Out[56]:
Date    datetime64[ns]
new             object   # <--- NOTE !!!
dtype: object
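If the goal is just a cleaner display while keeping the fast datetime64 dtype, two alternatives are worth knowing (a sketch of my own, assuming df['date'] is already datetime64[ns]):
# 1) keep datetime64 and zero out any stray time components;
#    pandas then prints bare dates, as in the demo above
df['date'] = df['date'].dt.normalize()
# 2) if a plain string like '1968-08-01' is what you really need
df['date_str'] = df['date'].dt.strftime('%Y-%m-%d')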

Related

Python Pandas: Supporting 25 hours in datetime index

I want to use a date/time as an index for a dataframe in Pandas.
However, daylight saving time is not properly addressed in the database, so the date/time values for the day in which daylight saving time ends have 25 hours and are represented as such:
2019102700
2019102701
...
2019102724
I am using the following code to convert those values to a DateTime object that I use as an index to a Pandas dataframe:
df.index = pd.to_datetime(df["date_time"], format="%Y%m%d%H")
However, that gives an error:
ValueError: unconverted data remains: 4
Presumably because the to_datetime function is not expecting the hour to be 24. Similarly, the day in which daylight saving time starts only has 23 hours.
One solution I thought of was storing the dates as strings, but that seems neither elegant nor efficient. Is there any way to solve the issue of handling daylight saving time when using to_datetime?
If you know the timezone, here's a way to calculate UTC timestamps. Parse only the date part, localize to the actual time zone the data "belongs" to, and convert that to UTC. Now you can parse the hour part and add it as a time delta - e.g.
import pandas as pd

df = pd.DataFrame({'date_time_str': ['2019102722', '2019102723', '2019102724',
                                     '2019102800', '2019102801', '2019102802']})
df['date_time'] = (pd.to_datetime(df['date_time_str'].str[:-2], format='%Y%m%d')
                     .dt.tz_localize('Europe/Berlin')
                     .dt.tz_convert('UTC'))
# add the hour part as a timedelta; to_timedelta is more portable than
# astype('timedelta64[h]'), which newer pandas versions reject
df['date_time'] += pd.to_timedelta(df['date_time_str'].str[-2:].astype(int), unit='h')
# df['date_time']
# 0 2019-10-27 20:00:00+00:00
# 1 2019-10-27 21:00:00+00:00
# 2 2019-10-27 22:00:00+00:00
# 3 2019-10-27 23:00:00+00:00
# 4 2019-10-28 00:00:00+00:00
# 5 2019-10-28 01:00:00+00:00
# Name: date_time, dtype: datetime64[ns, UTC]
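A quick sanity check on the result (my addition): since everything now sits on a UTC axis, the series should be strictly increasing and step by exactly one hour, with no duplicates or gaps:
assert df['date_time'].is_monotonic_increasing
assert (df['date_time'].diff().dropna() == pd.Timedelta(hours=1)).all()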
I'm not sure if it is the most elegant or efficient solution, but I would:
# assumes the long day labels its hours 01-25, so hour 25 maps to 01:00
# of the next day: +100 - 24 turns e.g. 2019102725 into 2019102801
mask = df.date_time.str[-2:] == '25'
df.loc[mask, 'date_time'] = (pd.to_numeric(df.date_time[mask]) + 100 - 24).apply(str)
df.index = pd.to_datetime(df["date_time"], format="%Y%m%d%H")
Pick the first and the last index, convert them to tz-aware datetimes, then generate a date_range that handles 25-hour days, and assign it to your df index:
start = pd.to_datetime(df.index[0], format="%Y%m%d%H").tz_localize("Europe/Berlin")
end = pd.to_datetime(df.index[-1], format="%Y%m%d%H").tz_localize("Europe/Berlin")
index_ = pd.date_range(start, end, freq="h")  # hourly data
df = df.set_index(index_)
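To convince yourself that a tz-aware date_range really produces the 25-hour day (my own check, using the same Europe/Berlin zone):
rng = pd.date_range('2019-10-27 00:00', '2019-10-27 23:59', freq='h', tz='Europe/Berlin')
print(len(rng))        # 25 -- the 02:00 hour occurs twice on this day
print(rng[2], rng[3])  # 2019-10-27 02:00:00+02:00  2019-10-27 02:00:00+01:00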

Iterate over pandas dataframe while updating values

I've looked through a bunch of similar questions, but I cannot figure out how to apply the principles to my own case. I'm therefore trying to work out a simple example I can build from; basically, I need the idiots' guide before I can look at more complex examples.
Consider a dataframe that contains a list of names and times, plus a known start time. I then want to update the dataframe with the finish time, which is calculated as starttime + Time.
import pandas as pd
import datetime
df = pd.DataFrame({"Name": ["Kate","Sarah","Isabell","Connie","Elsa","Anne","Lin"],
"Time":[3, 6,1, 7, 23,3,4]})
starttime = datetime.datetime.strptime('2020-02-04 00:00:00', '%Y-%m-%d %H:%M:%S')
I know that for each case I can calculate the finish time using
finishtime = starttime + datetime.timedelta(minutes=df.iloc[0, 1])
what I can't figure out is how to use this while iterating over the df rows and updating a third column in the dataframe with the output.
I tried
df["FinishTime"] = np.nan
for row in df.itertuples():
df.at[row,"FinishTime"] = starttine + datetime.datetime.timedelta(minutes = row.Time)
but it gave a lot of errors I couldn't unravel. How am I meant to do this?
I am aware that the advice on iterating over a dataframe is: don't. I'm not committed to iterating; I just need some way to calculate that final column and add it to the dataframe. My real data is about 200k lines.
Use pd.to_timedelta()
import datetime
import pandas as pd

starttime = datetime.datetime.strptime('2020-02-04 00:00:00', '%Y-%m-%d %H:%M:%S')
df = pd.DataFrame({"Name": ["Kate", "Sarah", "Isabell", "Connie", "Elsa", "Anne", "Lin"],
                   "Time": [3, 6, 1, 7, 23, 3, 4]})
df.Time = pd.to_timedelta(df.Time, unit='m')
# df = df.assign(FinishTime=df.Time + starttime) works too; as pointed out by
# Trenton McKinney, .assign() is only one way to create new columns, and
# assigning with df['new_col'] has the benefit of not copying the full df
df['FinishTime'] = df.Time + starttime
print(df)
Output
      Name     Time          FinishTime
0     Kate 00:03:00 2020-02-04 00:03:00
1    Sarah 00:06:00 2020-02-04 00:06:00
2  Isabell 00:01:00 2020-02-04 00:01:00
3   Connie 00:07:00 2020-02-04 00:07:00
4     Elsa 00:23:00 2020-02-04 00:23:00
5     Anne 00:03:00 2020-02-04 00:03:00
6      Lin 00:04:00 2020-02-04 00:04:00
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_timedelta.html
Avoid looping in pandas at all cost
Maybe not at all cost, but pandas takes advantage of C implementations to improve performance by several orders of magnitude. There are many (many) common functions already implemented for our convenience.
Here is a great stackoverflow conversation about this very topic.
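For completeness (my addition, not part of the original answer): the loop from the question can also be made to work, assuming the question's original setup where Time still holds integer minutes. The fix is to index with row.Index, the row's index label, rather than with the whole named tuple:
import datetime
df["FinishTime"] = pd.NaT
for row in df.itertuples():
    # row is a namedtuple; row.Index is the DataFrame index label
    df.at[row.Index, "FinishTime"] = starttime + datetime.timedelta(minutes=row.Time)
The vectorized to_timedelta approach remains the better choice for 200k rows, though.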

Trouble selecting entire day from DateTime object in pandas DataFrame?

I am having trouble selecting an entire day from a DateTime column in my DataFrame.
I originally started with the date and time in separate columns, and it was a simple thing to select all rows containing a specific date. One DateTime column seemed more convenient, but I have not been able to find out how to select all entries for a specific date. When I don't specify a time, I get an empty DataFrame.
#minimal example
import pandas as pd
df = pd.DataFrame({'date_time':['2019-01-01 00:00:00', '2019-01-01 00:00:00', '2019-01-01 00:30:00', '2019-01-01 01:00:00']})
I can select specific times no problem:
df[df.date_time == '2019-01-01 00:00:00']
But this gives an empty DataFrame:
df[df.date_time == '2019-01-01']
What I want it to return is every entry that has the specified date, regardless of the time.
First convert the datetime to a date, then convert that to a string:
df['date_time'] = pd.to_datetime(df['date_time']).dt.date.astype(str)
df[df['date_time'] == '2019-01-01']
Your df has date_time as 'object'. You should first convert it to a real datetime with
df.date_time = pd.to_datetime(df.date_time)
This will do the trick. If you try now:
df[df.date_time == '2019-01-01']
you'll get your desired result (note there are 2 records, because both fall exactly at 00:00:00):
   date_time
0 2019-01-01
1 2019-01-01
However, if you want to completely ignore the time, add a date-only column:
df.date_time = pd.to_datetime(df.date_time)
df['date'] = pd.to_datetime(df['date_time'].dt.date)
and only then do:
df[df.date == '2019-01-01']
and the desired result:
            date_time       date
0 2019-01-01 00:00:00 2019-01-01
1 2019-01-01 00:00:00 2019-01-01
2 2019-01-01 00:30:00 2019-01-01
3 2019-01-01 01:00:00 2019-01-01
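Another idiomatic option (my addition): once the column is a real datetime and used as the index, pandas supports partial string indexing, so a bare date string selects the entire day:
df['date_time'] = pd.to_datetime(df['date_time'])
day = df.set_index('date_time').loc['2019-01-01']  # every row on that date, any time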

Date Format changing automatically in pandas data frame [duplicate]

I'm learning Python (3.6 with Anaconda) for my studies.
I'm using pandas to import an xls file with 2 columns: Date (dd-mm-yyyy) and Price.
But pandas changes the date format:
xls_file = pd.read_excel('myfile.xls')
print(xls_file.iloc[0, 0])
I'm getting:
2010-01-04 00:00:00
instead of:
04-01-2010, or at least 2010-01-04
I don't know why hh:mm:ss is added; I get the same result for every row of the Date column. I also tried different things using to_datetime, but it didn't fix it.
Any idea?
Thanks
What you need is to define the format in which the datetime values get printed. There might be a more elegant way to do it, but something like this will work:
In [11]: df
Out[11]:
   id       date
0   1 2017-09-12
1   2 2017-10-20

# Specifying the format (pd.datetime was removed in recent pandas;
# call strftime on the Timestamp directly)
In [16]: print(df.iloc[0, 1].strftime("%Y-%m-%d"))
2017-09-12
If you want to store the date as a string in your specific format for every row, use the vectorized .dt.strftime rather than formatting a single value:
In [17]: df["datestr"] = df["date"].dt.strftime("%Y-%m-%d")

In [18]: df
Out[18]:
   id       date     datestr
0   1 2017-09-12  2017-09-12
1   2 2017-10-20  2017-10-20

In [19]: df.dtypes
Out[19]:
id                  int64
date       datetime64[ns]
datestr            object
dtype: object

From Dataframe to Datestamp python3

Recently I came across a really weird CSV file with 2 columns (with headers), one for dates and the second one for prices. The date format was "dd.mm.yyyy".
d = {'Date': ['31.12.1991', '02.01.1992', '03.01.1992', '06.01.1992'],
     'Prices': [9.62, 9.5, 9.73, 9.45]}
df = pd.DataFrame(data=d)
prices = pd.DataFrame(df['Prices'])
date = pd.DataFrame(df['Date'])
date = date.to_string(header=True)
date = df.to_datetime(utc=True, infer_datetime_format=True)
frame = date.join(values)
print(df)
I tried to make it work by isolating the date column, transforming it first into a string with the to_string() function and then back to a date with to_datetime, but it was no use.
Any suggestions?
Thanks in advance
An interesting way to generalize this for the whole dataframe:
Note: this uses errors='ignore' in order to skip columns that might not be suitable for parsing as dates. The trade-off is that if a column is intended to be parsed as dates but has a bad date value, this approach will leave that column unaltered. The point is to make sure you don't have bad date values.
import numpy as np

df.assign(
    **df.select_dtypes(exclude=[np.number]).apply(
        pd.to_datetime, errors='ignore', dayfirst=True
    )
)
        Date  Prices
0 1991-12-31    9.62
1 1992-01-02    9.50
2 1992-01-03    9.73
3 1992-01-06    9.45
Another example
df = pd.DataFrame(dict(
    A=1, B='B', C='6.7.2018', D=1-1j,
    E='1.2.2017', F=pd.Timestamp('2016-08-08')
), [0])
df

   A  B         C       D         E          F
0  1  B  6.7.2018  (1-1j)  1.2.2017 2016-08-08

df.assign(
    **df.select_dtypes(exclude=[np.number]).apply(
        pd.to_datetime, errors='ignore', dayfirst=True
    )
)

   A  B          C       D          E          F
0  1  B 2018-07-06  (1-1j) 2017-02-01 2016-08-08
Setup
borrowed from jezrael:
d = {'Date': ['31.12.1991', '02.01.1992', '03.01.1992', '06.01.1992'],
     'Prices': [9.62, 9.5, 9.73, 9.45]}
df = pd.DataFrame(data=d)
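A portability note (mine): errors='ignore' in pd.to_datetime is deprecated in recent pandas releases, so a more future-proof variant of the same idea loops over the non-numeric columns and keeps a column unchanged whenever parsing fails:
import numpy as np
for col in df.select_dtypes(exclude=[np.number]).columns:
    try:
        df[col] = pd.to_datetime(df[col], dayfirst=True)
    except (ValueError, TypeError):
        pass  # not a date column; leave it untouched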
You could try to parse the dates when you read in the file. You can specify that the format has the day first instead of the month.
import pandas as pd
df = pd.read_csv('test.csv', parse_dates=['Date'], dayfirst=True)
print(df)
# Date Prices
#0 1991-12-31 9.62
#1 1992-01-02 9.50
#2 1992-01-03 9.73
#3 1992-01-06 9.45
df.dtypes
#Date datetime64[ns]
#Prices float64
#dtype: object
However, your data really needs to be clean and properly formatted for this to work. From the docs on parse_dates:
If a column or index contains an unparseable date, the entire column
or index will be returned unaltered as an object data type. For
non-standard datetime parsing, use pd.to_datetime after pd.read_csv
Sample Data: test.csv
Date,Prices
31.12.1991,9.62
02.01.1992,9.5
03.01.1992,9.73
06.01.1992,9.45
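As a side note (mine): newer pandas versions (2.0+, if I remember correctly) also accept an explicit format at read time, which avoids any dayfirst guessing:
df = pd.read_csv('test.csv', parse_dates=['Date'], date_format='%d.%m.%Y')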
I believe you need:
d = {'Date': ['31.12.1991', '02.01.1992', '03.01.1992', '06.01.1992'],
     'Prices': [9.62, 9.5, 9.73, 9.45]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
print (df)

        Date  Prices
0 1991-12-31    9.62
1 1992-01-02    9.50
2 1992-01-03    9.73
3 1992-01-06    9.45
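If the format is fixed, passing it explicitly is stricter (and usually faster) than dayfirst=True; a small variant of the line above:
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')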
