Pandas strptime returns a string instead of DateTime object [duplicate] - python-3.x

This question already has answers here: How to convert string to datetime format in pandas python?
I have the following pandas DataFrame:
import pandas as pd

data = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                     "end_time": ["2016-01-13", "2016-01-01", "2016-11-12", "2016-01-17", "2016-03-13"]})
I want to transform the end_time column into a column of datetime objects. But when I do it like this (as suggested everywhere):
import datetime

data["end"] = data["end_time"].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d"))
the output is still a string column:
id end_time end
0 1 2016-01-13 2016-01-13
1 2 2016-01-01 2016-01-01
2 3 2016-11-12 2016-11-12
3 4 2016-01-17 2016-01-17
4 5 2016-03-13 2016-03-13
How to solve this?

strftime is designed to return a string object (see the Python docs for details).
If we want to convert end_time to datetime64[ns] and assign it to a new column named end, we can use:
data['end'] = pd.to_datetime(data.end_time)
strptime will also convert the strings to datetime64[ns], but the to_datetime method is preferable.
data["end"] = data["end_time"].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d"))
Output
id end_time end
0 1 2016-01-13 2016-01-13
1 2 2016-01-01 2016-01-01
2 3 2016-11-12 2016-11-12
3 4 2016-01-17 2016-01-17
4 5 2016-03-13 2016-03-13
Datatypes:
data.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 5 non-null int64
1 end_time 5 non-null object
2 end 5 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 248.0+ bytes
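The two names are easy to mix up, so here is a minimal plain-Python sketch of the difference (no pandas involved):
import datetime

d = datetime.datetime.strptime("2016-01-13", "%Y-%m-%d")  # str -> datetime
s = d.strftime("%Y-%m-%d")                                # datetime -> str
print(type(d))  # <class 'datetime.datetime'>
print(type(s))  # <class 'str'>
Note that a datetime column prints exactly like the original strings; data.info() (or data.dtypes) is what reveals the actual dtype.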

Related

How to parse correctly to date in python from specific ISO format

I connected to a table on a db where there are two columns with dates.
I had no problem parsing the column with values formatted like this: 2017-11-03
But I can't find a way to parse the other column, with dates formatted like this: 2017-10-03 05:06:52.840 +02:00
My attempts
If I parse a single value through the strptime method
import datetime as dt

dt.datetime.strptime("2017-12-14 22:16:24.037 +02:00", "%Y-%m-%d %H:%M:%S.%f %z")
I get the correct output
datetime.datetime(2017, 12, 14, 22, 16, 24, 37000, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)))
but if I try to use the same format while parsing the table into the dataframe, the column dtype is object:
Licenze_FromGY = pd.read_sql(query, cnxn, parse_dates={"EndDate":"%Y-%m-%d", "LastUpd":"%Y-%m-%d %H:%M:%S.%f %z"})
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Tenant 1000 non-null int64
1 IdService 1000 non-null object
2 Code 1000 non-null object
3 Aggregate 1000 non-null object
4 Bundle 991 non-null object
5 Status 1000 non-null object
6 Value 1000 non-null int64
7 EndDate 258 non-null datetime64[ns]
8 Trial 1000 non-null bool
9 LastUpd 1000 non-null object
I also tried changing the format, either in the read_sql method or in the pd.to_datetime() method, but then all the values become NaT:
Licenze_FromGY["LastUpd"] = pd.to_datetime(Licenze_FromGY["LastUpd"], format="%Y-%m-%d %H:%M:%S.%fZ", errors="coerce")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Tenant 1000 non-null int64
1 IdService 1000 non-null object
2 Code 1000 non-null object
3 Aggregate 1000 non-null object
4 Bundle 991 non-null object
5 Status 1000 non-null object
6 Value 1000 non-null int64
7 EndDate 258 non-null datetime64[ns]
8 Trial 1000 non-null bool
9 LastUpd 0 non-null datetime64[ns]
dtypes: bool(1), datetime64[ns](2), int64(2), object(5)
memory usage: 71.4+ KB
None
Can anyone help?
pandas cannot handle mixed UTC offsets in one Series (column). Assuming you have data like this
import pandas as pd

df = pd.DataFrame({"datetime": ["2017-12-14 22:16:24.037 +02:00",
                                "2018-08-14 22:16:24.037 +03:00"]})
if you just parse to datetime,
df["datetime"] = pd.to_datetime(df["datetime"])
df["datetime"]
0 2017-12-14 22:16:24.037000+02:00
1 2018-08-14 22:16:24.037000+03:00
Name: datetime, dtype: object
you get dtype object. The elements of the Series are Python datetime.datetime objects, which limits the datetime functionality compared to the pandas datetime64 dtype.
You can get that e.g. by parsing to UTC:
df["datetime"] = pd.to_datetime(df["datetime"], utc=True)
df["datetime"]
0 2017-12-14 20:16:24.037000+00:00
1 2018-08-14 19:16:24.037000+00:00
Name: datetime, dtype: datetime64[ns, UTC]
You might set an appropriate time zone to re-create the UTC offset:
df["datetime"] = pd.to_datetime(df["datetime"], utc=True).dt.tz_convert("Europe/Athens")
df["datetime"]
0 2017-12-14 22:16:24.037000+02:00
1 2018-08-14 22:16:24.037000+03:00
Name: datetime, dtype: datetime64[ns, Europe/Athens]
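Applied to the question's DataFrame, the fix would look roughly like this (a sketch; Licenze_FromGY and LastUpd are the names from the question, and the target time zone is just an example):
Licenze_FromGY["LastUpd"] = pd.to_datetime(Licenze_FromGY["LastUpd"], utc=True)
# or, to keep a concrete time zone instead of plain UTC:
Licenze_FromGY["LastUpd"] = pd.to_datetime(Licenze_FromGY["LastUpd"], utc=True).dt.tz_convert("Europe/Athens")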

How to check if dates in a pandas column are after a date

I have a pandas dataframe
date
0 2010-03
1 2017-09-14
2 2020-10-26
3 2004-12
4 2012-04-01
5 2017-02-01
6 2013-01
I basically want to filter where dates are after 2015-12 (Dec 2015)
To get this:
date
0 2017-09-14
1 2020-10-26
2 2017-02-01
I tried this
df = df[(df['date']> "2015-12")]
but I'm getting an error
ValueError: Wrong number of items passed 17, placement implies 1
First, your solution works correctly for me:
df = df[(df['date']> "2015-12")]
print (df)
date
1 2017-09-14
2 2020-10-26
5 2017-02-01
If you convert to datetimes, which should be more robust, it works for me too:
df = df[(pd.to_datetime(df['date'])> "2015-12")]
print (df)
date
1 2017-09-14
2 2020-10-26
5 2017-02-01
Detail:
print (pd.to_datetime(df['date']))
0 2010-03-01
1 2017-09-14
2 2020-10-26
3 2004-12-01
4 2012-04-01
5 2017-02-01
6 2013-01-01
Name: date, dtype: datetime64[ns]
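For a self-contained check, here is a small sketch of the same approach, comparing against an explicit Timestamp (assuming "after Dec 2015" means strictly after 2015-12-31):
import pandas as pd

df = pd.DataFrame({"date": ["2010-03", "2017-09-14", "2020-10-26", "2004-12",
                            "2012-04-01", "2017-02-01", "2013-01"]})
# Partial dates like "2010-03" parse to the first day of the period;
# newer pandas (2.x) may require format="ISO8601" for mixed-precision strings.
mask = pd.to_datetime(df["date"]) > pd.Timestamp("2015-12-31")
print(df[mask])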

Pandas csv reader - how to force a column to be a specific data type (and replace NaN with null)

I am just getting started with Pandas and I am reading a csv file using the read_csv() method. The difficulty I am having is that I need to set a column to a specific data type.
df = pd.read_csv('test_data.csv', delimiter=',', index_col=False)
My df looks like this:
RangeIndex: 4 entries, 0 to 3
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 product_code 4 non-null object
1 store_code 4 non-null int64
2 cost1 4 non-null float64
3 start_date 4 non-null int64
4 end_date 4 non-null int64
5 quote_reference 0 non-null float64
6 min1 4 non-null int64
7 cost2 2 non-null float64
8 min2 2 non-null float64
9 cost3 1 non-null float64
10 min3 1 non-null float64
dtypes: float64(6), int64(4), object(1)
memory usage: 480.0+ bytes
You can see that I have multiple 'min' columns: min1, min2, min3.
min1 is correctly detected as int64, but min2 and min3 are float64.
This is because min1 is fully populated, whereas min2 and min3 are sparsely populated; as you can see above, min2 has 2 NaN values.
Trying to change the data type using
df['min2'] = df['min2'].astype('int')
I get this error:
IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
Ideally I want to change the data type to Int and have NaN replaced by a null (i.e. I don't want a 0).
I have tried a variety of methods, e.g. fillna, but can't crack this.
All help greatly appreciated.
Since pandas 1.0, you can use a generic pandas.NA to replace numpy.nan. This is useful as an integer NA.
To perform the conversion, use the "Int64" dtype (note the capital I).
df['min2'] = df['min2'].astype('Int64')
Example:
s = pd.Series([1, 2, None, 3])
s.astype('Int64')
Or:
pd.Series([1, 2, None, 3], dtype='Int64')
Output:
0 1
1 2
2 <NA>
3 3
dtype: Int64
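You can also request the nullable dtype directly at read time, so the sparse columns never pass through float64. A sketch, assuming a reasonably recent pandas and the same file and column names as the question:
df = pd.read_csv('test_data.csv', delimiter=',', index_col=False,
                 dtype={'min2': 'Int64', 'min3': 'Int64'})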

Group dates quarterly in a pandas dataframe and count their occurrences

My DataFrame, dataframe_time, looks like:
INSERTED_UTC
0 2018-05-29
1 2018-05-22
2 2018-02-10
3 2018-04-30
4 2018-03-02
5 2018-11-26
6 2018-03-07
7 2018-05-12
8 2019-02-03
9 2018-08-03
10 2018-04-27
print(type(dataframe_time['INSERTED_UTC'].iloc[1]))
<class 'datetime.date'>
I am trying to group the dates together and count their occurrences quarterly. Desired output:
Quarter Count
2018-03-31 3
2018-06-30 5
2018-09-30 1
2018-12-31 1
2019-03-31 1
2019-06-30 0
I am running the following command to group them together
dataframe_time['INSERTED_UTC'].groupby(pd.Grouper(freq='Q'))
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'
First, convert the dates to datetimes, then use DataFrame.resample with the on parameter to select the column of datetimes:
dataframe_time.INSERTED_UTC = pd.to_datetime(dataframe_time.INSERTED_UTC)
df = dataframe_time.resample('Q', on='INSERTED_UTC').size().reset_index(name='Count')
Or your solution can be changed to:
df = (dataframe_time.groupby(pd.Grouper(freq='Q', key='INSERTED_UTC'))
                    .size()
                    .reset_index(name='Count'))
print (df)
INSERTED_UTC Count
0 2018-03-31 3
1 2018-06-30 5
2 2018-09-30 1
3 2018-12-31 1
4 2019-03-31 1
You can convert the dates to quarters with to_period('Q') and group by those:
df.INSERTED_UTC = pd.to_datetime(df.INSERTED_UTC)
df.groupby(df.INSERTED_UTC.dt.to_period('Q')).size()
You can also use value_counts:
df.INSERTED_UTC.dt.to_period('Q').value_counts()
Output:
INSERTED_UTC
2018Q1 3
2018Q2 5
2018Q3 1
2018Q4 1
2019Q1 1
Freq: Q-DEC, dtype: int64
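Note that the desired output also lists 2019-06-30 with a count of 0, while resample and groupby only cover quarters that actually occur in the data. A sketch for padding the range explicitly (the end quarter here is an assumption):
idx = pd.period_range("2018Q1", "2019Q2", freq="Q")
counts = dataframe_time.INSERTED_UTC.dt.to_period("Q").value_counts().reindex(idx, fill_value=0).sort_index()
print(counts)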

How to merge month, day, and year columns into a date column?

The date is split across separate columns:
Month Day Year
8 12 1993
8 12 1993
8 12 1993
I want to merge them into one column:
Date
8/12/1993
8/12/1993
8/12/1993
I tried
df_date = df.Timestamp((df_filtered.Year*10000+df_filtered.Month*100+df_filtered.Day).apply(str),format='%Y%m%d')
I get this error
AttributeError: 'DataFrame' object has no attribute 'Timestamp'
Using pd.to_datetime with astype(str), zero-padding the month and day so the concatenated string is unambiguous:
1. as string type:
df['Date'] = pd.to_datetime(df['Month'].astype(str).str.zfill(2) + df['Day'].astype(str).str.zfill(2) + df['Year'].astype(str), format='%m%d%Y').dt.strftime('%m/%d/%Y')
Month Day Year Date
0 8 12 1993 08/12/1993
1 8 12 1993 08/12/1993
2 8 12 1993 08/12/1993
2. as datetime type:
df['Date'] = pd.to_datetime(df['Month'].astype(str).str.zfill(2) + df['Day'].astype(str).str.zfill(2) + df['Year'].astype(str), format='%m%d%Y')
Month Day Year Date
0 8 12 1993 1993-08-12
1 8 12 1993 1993-08-12
2 8 12 1993 1993-08-12
Here is the solution:
df = pd.DataFrame({'Month': [8, 8, 8], 'Day': [12, 12, 12], 'Year': [1993, 1993, 1993]})
# row[0] is Month, row[1] is Day, row[2] is Year (the positional order of the columns)
# This way dates will be a DataFrame
dates = df.apply(lambda row:
                 pd.Series(pd.Timestamp(row[2], row[0], row[1]), index=['Date']),
                 axis=1)
# And this way dates will be a Series:
# dates = df.apply(lambda row:
#                  pd.Timestamp(row[2], row[0], row[1]),
#                  axis=1)
The apply method generates a new Series or DataFrame by iteratively applying the provided function (a lambda in this case) and joining the results.
You can read about the apply method in the official pandas documentation, and about lambda expressions in the Python docs.
EDIT:
@JohnClements suggested a better solution, using the pd.to_datetime method:
dates = pd.to_datetime(df).to_frame('Date')
Also, if you want your output to be a string in the question's Month/Day/Year order, you can use
dates = df.apply(lambda row: f"{row[0]}/{row[1]}/{row[2]}",
                 axis=1)
You can try:
df = pd.DataFrame({'Month': [8,8,8], 'Day': [12,12,12], 'Year': [1993, 1993, 1993]})
df['date'] = pd.to_datetime(df)
Result:
Month Day Year date
0 8 12 1993 1993-08-12
1 8 12 1993 1993-08-12
2 8 12 1993 1993-08-12
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
Month 3 non-null int64
Day 3 non-null int64
Year 3 non-null int64
date 3 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(3)
memory usage: 176.0 bytes
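If you additionally want the question's exact unpadded string form ("8/12/1993"), plain string concatenation avoids platform-specific strftime flags; a small sketch:
df['Date'] = (df['Month'].astype(str) + '/'
              + df['Day'].astype(str) + '/'
              + df['Year'].astype(str))
print(df['Date'].iloc[0])  # 8/12/1993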
