Pandas and datetime coercion. Can't convert whole column to Timestamp - python-3.x

So, I have an issue. Pandas keeps telling me that
'datetime.date' is coerced to a datetime.
In the future pandas will not coerce, and a TypeError will be raised. To retain the current behavior, convert the 'datetime.date' to a datetime with 'pd.Timestamp'.
I'd like to get rid of this warning.
Until now I had a dataframe with some data, and I was doing some filtering and manipulation. At some point I have a column with dates in string format. I don't care about timezones etc.; only day accuracy matters. I get the warning mentioned above when I convert the strings to datetime, like below:
df['Some_date'] = pd.to_datetime(df['Some_date'], format='%m/%d/%Y')
So I tried to do something like that:
df['Some_date'] = pd.Timestamp(df['Some_date'])
But it fails as pd.Timestamp doesn't accept Series as an argument.
I'm looking for the quickest way to convert those strings to Timestamp.
=====================================
EDIT
I'm sorry for the confusion. The warning is actually raised in another place. It happens when I try to filter my data like this:
df = df[(df['Some_date'] > firstday)]
Here firstday is calculated with the datetime module, like this:
import datetime
def get_dates_filter():
    lastday = datetime.date.today().replace(day=1) - datetime.timedelta(days=1)
    firstday = lastday.replace(day=1)
    return firstday, lastday
So probably the issue is comparing two different types of date representation

In pandas, Python dates are still poorly supported; it is best to work with datetimes that have no time component.
If the column holds Python dates, you can convert them to strings before calling to_datetime:
df['Some_date'] = pd.to_datetime(df['Some_date'].astype(str))
If you need to remove the times from the datetimes in the column, use:
df['Some_date'] = pd.to_datetime(df['Some_date'].astype(str)).dt.floor('d')
Test:
rng = pd.date_range('2017-04-03', periods=3).date
df = pd.DataFrame({'Some_date': rng})
print (df)
Some_date
0 2017-04-03
1 2017-04-04
2 2017-04-05
print (type(df.loc[0, 'Some_date']))
<class 'datetime.date'>
df['Some_date'] = pd.to_datetime(df['Some_date'].astype(str))
print (df)
Some_date
0 2017-04-03
1 2017-04-04
2 2017-04-05
print (type(df.loc[0, 'Some_date']))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
print (df['Some_date'].dtype)
datetime64[ns]
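The EDIT in the question traces the warning to comparing the datetime64 column against a datetime.date. One possible way around it, following the hint in the warning text itself, is to wrap the bounds in pd.Timestamp before filtering. This is a minimal sketch: the today parameter and the sample rows are hypothetical, added only so the run is reproducible.

```python
import datetime

import pandas as pd


def get_dates_filter(today=None):
    """Return the first and last day of the previous month as Timestamps.

    `today` is a hypothetical parameter added for reproducibility;
    the question's version always uses datetime.date.today().
    """
    today = today or datetime.date.today()
    lastday = today.replace(day=1) - datetime.timedelta(days=1)
    firstday = lastday.replace(day=1)
    # Wrap in pd.Timestamp so both sides of the comparison are datetime-like.
    return pd.Timestamp(firstday), pd.Timestamp(lastday)


# Hypothetical sample data in the question's %m/%d/%Y string format.
df = pd.DataFrame({"Some_date": ["03/15/2017", "04/03/2017", "04/20/2017"]})
df["Some_date"] = pd.to_datetime(df["Some_date"], format="%m/%d/%Y")

firstday, lastday = get_dates_filter(datetime.date(2017, 5, 10))
filtered = df[df["Some_date"] > firstday]
print(filtered)
```

With both sides being Timestamp values, pandas has nothing to coerce, so the comparison should not trigger the warning.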

Related

Converting year-month to next year-quarter

I have the date below, expressed as a year-month string: '202112'.
I want to convert it to a year-quarter and report the next quarter, so from the string above I want to get 2022Q1.
I tried the following without success:
import pandas as pd
pd.PeriodIndex(pd.to_datetime('202112') ,freq='Q')
Could you please help me obtain the expected quarter? Any pointer will be very helpful.
import pandas as pd
df = pd.DataFrame({"Date": ['202112']}) # dummy data
df['next_quarter'] = pd.PeriodIndex(pd.to_datetime(df['Date'], format='%Y%m'), freq='Q') + 1
print(df)
Output:
Date next_quarter
0 202112 2022Q1
Note that the Date column may be a string type, but next_quarter will be a period dtype. You can convert it to a string if that's what you want.
I think one issue you're running into is that '202112' is not a valid date format. You'll want to use '2021-12'. Then you can do something like this:
pd.to_datetime('2021-12').to_period('Q') + 1
You can convert your date to this new format by simply inserting a - at index 4 of your string like so: date[:4] + '-' + date[4:]
This will take your date, convert it to quarters, and add 1 quarter.
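For a single string rather than a column, the two answers can be combined into a small helper. This is only a sketch: the function name next_quarter is hypothetical, and it assumes the input is always six digits in YYYYMM order.

```python
import pandas as pd


def next_quarter(yearmon: str) -> str:
    """Return the quarter after the one containing `yearmon` (YYYYMM)."""
    # An explicit format avoids any ambiguity in how '202112' is parsed.
    period = pd.to_datetime(yearmon, format="%Y%m").to_period("Q")
    return str(period + 1)


print(next_quarter("202112"))  # 2022Q1
```

Passing the format explicitly also removes the need to insert a hyphen into the string first.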

Float is not converting to integer pandas

I used this code to convert my float numbers into integers, but it does not work. Here are all the steps I have gone through so far:
Step 1: I converted timestamp1 and timestamp2 to datetime in order to subtract them and get days:
a=pd.to_datetime(df['timestamp1'], format='%Y-%m-%dT%H:%M:%SZ')
b=pd.to_datetime(df['timestamp2'], format='%Y-%m-%dT%H:%M:%SZ')
df['delta'] = (b-a).dt.days
Step 2: Converted the strings into integers as the day:
df['delta'] = pd.to_datetime(df['delta'], format='%Y-%m-%d', errors='coerce')
df['delta'] = df['delta'].dt.day
Step 3: I am trying to convert floats into integers.
categorical_feature_mask = df.dtypes==object
categorical_cols = df.columns[categorical_feature_mask].tolist()
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df[categorical_cols] = df[categorical_cols].apply(lambda col: le.fit_transform(col))
df[categorical_cols].head(10)
However, it throws an error TypeError: ('argument must be a string or number', 'occurred at index col1')
To convert a float column that contains NaN values to integer, there are two things you can do:
Convert to a plain NumPy int and replace the NaN values with an arbitrary value, like this:
df[col].fillna(0).astype("int32")
If you want to preserve the NaN values, use this:
df[col].astype("Int32")
Note the difference: the capital "I". For further information on this pandas implementation, see the documentation on the nullable integer data type.
Why do you need to do that? Because by default, when your column has at least one NaN value, pandas treats the column as float, since this is how NumPy behaves.
The same thing happens with strings: if you have at least one string value in your column, the whole column is labeled as object by pandas, which is why your first attempt failed.
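The difference between the two conversions is easier to see side by side. The sketch below uses a hypothetical three-row delta column with one missing value and "Int64" as the nullable dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value forces the column to float64.
df = pd.DataFrame({"delta": [1.0, np.nan, 3.0]})
print(df["delta"].dtype)  # float64

# Option 1: replace NaN with a sentinel, then cast to a NumPy integer.
filled = df["delta"].fillna(0).astype("int32")

# Option 2: cast to the nullable integer dtype, keeping the gap as <NA>.
nullable = df["delta"].astype("Int64")

print(filled.tolist())
print(nullable.isna().sum())
```

Option 1 changes the data (a missing value becomes 0), so it is only appropriate when the sentinel cannot be mistaken for a real measurement.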
You can convert columns from float to int using this. The nullable "Int64" dtype keeps null values as <NA>; errors="ignore" makes astype return the original column unchanged if the conversion fails.
df[column_name] = df[column_name].astype("Int64", errors="ignore")

Date Manipulation and Comparisons Python,Pandas and Excel

I have a datetime column [TRANSFER_DATE] in an Excel sheet that shows dates formatted as
1/4/2019 0:45; when a cell is selected, it appears as
01/04/2019 00:45:08. I am using a Python script to read this column, and it shows the datetime as 01/04/2019 00:45:08.
However, when I try to compare the column [TRANSFER_DATE] with another date, I get this error:
ValueError: "Can only use .dt accessor with datetimelike values" while evaluating
implying those values are not actually recognized as datetime values:
mask_part_date = data.loc[data['TRANSFER_DATE'].dt.date.astype(str) == '2019-04-12']
As seen in this question, the Excel import might have silently failed for some of the values in the column. If you check the column type with:
data.dtypes
it might show as object instead of datetime64.
If you force your column to have datetime values, that might solve your issue:
data['TRANSFER_DATE'] = pd.to_datetime(data['TRANSFER_DATE'], errors='coerce')
You will spot the non-converted values as NaT and you can debug those manually.
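Regarding your comparison, once the column holds datetime values you can compare against a Timestamp directly instead of converting to strings. Because the values carry a time of day, normalize them to midnight first:
mask_part_date = data.loc[data['TRANSFER_DATE'].dt.normalize() == pd.Timestamp('2019-04-12')]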
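The coerce-then-filter approach can be tried end to end on a small frame. This is a sketch with hypothetical values, including one deliberately unparseable entry. It assumes ISO-style strings and passes an explicit format; real Excel data may need a different one.

```python
import pandas as pd

# Hypothetical column as it might arrive from Excel: dtype object, one bad value.
data = pd.DataFrame(
    {"TRANSFER_DATE": ["2019-04-12 00:45:08", "not a date", "2019-04-13 10:00:00"]}
)

# Unparseable values become NaT instead of raising.
data["TRANSFER_DATE"] = pd.to_datetime(
    data["TRANSFER_DATE"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

# Rows that failed to convert, for manual debugging.
bad_rows = data[data["TRANSFER_DATE"].isna()]

# Compare on the calendar day by dropping the time of day.
mask = data["TRANSFER_DATE"].dt.normalize() == pd.Timestamp("2019-04-12")
matches = data.loc[mask]
print(matches)
```

NaT never compares equal to a Timestamp, so rows that failed to convert simply drop out of the filter.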

Cannot convert object as strings to int - Unable to parse string

I have a data frame with one column denoting ranges of ages. The data type of the Age column is shown as string. I am trying to convert the string values to numeric so that the model can interpret the features.
I tried the following to convert to 'int'.
df.Age = pd.to_numeric(df.Age)
I get the following error:
ValueError: Unable to parse string "0-17" at position 0
I also tried using the 'errors = coerce' parameter but it gave me a different error:
df.Age = pd.to_numeric(df.Age, errors='coerce').astype(int)
Error:
ValueError: Cannot convert non-finite values (NA or inf) to integer
But there are no NA values in any column in my df
Age seems to be a categorical variable, so you should treat it as such. pandas has a neat category dtype which converts your labels to integers under the hood:
df['Age'] = df['Age'].astype('category')
Then you can access the underlying integers using the cat accessor:
codes = df['Age'].cat.codes # This returns integers
Also you probably want to make Age an ordered categorical variable, for which you can also find a neat recipe in the docs.
from pandas.api.types import CategoricalDtype
age_category = CategoricalDtype([...your labels in order...], ordered=True)
df['Age'] = df['Age'].astype(age_category)
Then you can access the underlying codes in the same way and be sure that they reflect the order you entered for your labels.
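As a concrete illustration of the ordered categorical recipe, here is a short sketch. The three age labels are hypothetical, since the question shows only "0-17"; substitute the labels in your own data.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical sample; only "0-17" appears in the question itself.
df = pd.DataFrame({"Age": ["18-25", "0-17", "26-35", "0-17"]})

# Labels listed from youngest to oldest so the codes follow that order.
age_labels = ["0-17", "18-25", "26-35"]
age_category = CategoricalDtype(age_labels, ordered=True)

df["Age"] = df["Age"].astype(age_category)
codes = df["Age"].cat.codes  # integer position of each label in age_labels
print(codes.tolist())
```

Any value not present in the label list becomes NaN with code -1, which is a quick way to spot labels you forgot to include.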
At first glance, I would say it is because you are attempting to convert a string that has not only an int in it. Your string is "0-17", which is not an integer. If it had been "17" or "0", the conversion would have worked.
val = int("0")
val = int("17")
I have no idea what your to_numeric method is, so I am not sure if I am answering your question.
Why don't you split the range into two columns?
a = df["age"].str.split("-", n=1, expand=True)
df['age_from'] = a[0]
df['age_to'] = a[1]
Here is what I got at the end!
         date    age
0  2018-04-15  12-20
1  2018-04-15   2-30
2  2018-04-18  5-46+
         date    age age_from age_to
0  2018-04-15  12-20       12     20
1  2018-04-15   2-30        2     30
2  2018-04-18  5-46+        5    46+
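The split columns above are still strings, and "46+" would fail in to_numeric. One possible follow-up, sketched on the same three sample rows, is to strip the trailing plus sign before converting. Treating "46+" as 46 is an assumption about what the open-ended range should mean.

```python
import pandas as pd

df = pd.DataFrame({"age": ["12-20", "2-30", "5-46+"]})

# Split once on the hyphen into lower and upper bounds.
parts = df["age"].str.split("-", n=1, expand=True)

df["age_from"] = pd.to_numeric(parts[0])
# Assumption: an open-ended bound such as "46+" is treated as 46.
df["age_to"] = pd.to_numeric(parts[1].str.rstrip("+"))
print(df.dtypes)
```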

pandas to_datetime formatting

I am trying to compare a pandas to_datetime object to another to_datetime object. In both locations, I am entering the date as date = pd.to_datetime('2017-01-03'), but when I run a print statement on each, in one case I get 2017-01-03, but in another I get 2017-01-03 00:00:00. This causes a problem because if I use an if statement comparing them such as if date1 == date2: they will not compare as equal, when in reality they are. Is there a format statement that I can use to force the to_datetime() command to yield the 2017-01-03 format?
You can use the date() method to keep only the date part of a pandas Timestamp, or the strftime(format) method to convert it into a string in a chosen format.
date = pd.to_datetime('2017-01-03').date()
print(repr(date))
>datetime.date(2017, 1, 3)
or
date = pd.to_datetime('2017-01-03').strftime("%Y-%m-%d")
print(repr(date))
>'2017-01-03'
try .date()
pd.to_datetime('2017-01-03').date()
You can use
pd.to_datetime().date()
For example:
a='2017-12-24 22:44:09'
b='2017-12-24'
if pd.to_datetime(a).date() == pd.to_datetime(b).date():
    print('perfect')
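To see why the comparison can fail and how the fix behaves, here is a sketch using the two strings from the example above. The normalize() variant is an alternative not mentioned in the answers; it keeps the values as Timestamp objects instead of converting them to datetime.date.

```python
import pandas as pd

a = pd.to_datetime("2017-12-24 22:44:09")
b = pd.to_datetime("2017-12-24")

# Different times of day, so the raw Timestamps are not equal.
same_instant = a == b

# Compare calendar days only, via datetime.date objects...
same_day_date = a.date() == b.date()

# ...or by resetting both to midnight and staying in Timestamp.
same_day_norm = a.normalize() == b.normalize()

print(same_instant, same_day_date, same_day_norm)
```

Two Timestamps built from the same date-only string compare equal regardless of how they print, so a failed comparison usually means one side carries a time of day.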
