Pandas "<=" operator unexpected behaviour for datetime - python-3.x

I have an analytics events table with a column event_datetime.
I want to filter the pandas table to just two days, for example, Feb 1st and 2nd:
event_data.loc[(event_data['event_datetime'] >= '2023-02-01') & (event_data['event_datetime'] <= '2023-02-02')]
But this code returns events for 2023-02-01 only, although I wrote less than or equal to '2023-02-02'.
In SQL it works fine. Am I missing something? I didn't find anything about this in the pandas docs...

Thanks to @Galo do Leste's and @Nick ODell's hints, the problem was with the datetime format of the column - it was not converted to str format. With a datetime column, the string '2023-02-02' is parsed as midnight (2023-02-02 00:00:00), so <= cuts off every event later that day.
So if the column has a datetime type, the <= operator does not work as I expected.
After properly casting the column to string format
event_data['event_datetime'] = event_data['event_datetime'].dt.strftime('%Y-%m-%d')
it works fine.
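An alternative that keeps the column as a datetime (a minimal sketch, assuming event_data['event_datetime'] has dtype datetime64[ns]) is to make the upper bound exclusive, so events at any time on Feb 2nd still match:

# '2023-02-03' parses to midnight, so a strict < keeps all of Feb 2nd.
mask = (event_data['event_datetime'] >= '2023-02-01') & (event_data['event_datetime'] < '2023-02-03')
feb_events = event_data.loc[mask]

This avoids the string conversion entirely and keeps datetime arithmetic available for later steps.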

Related

How to convert DB2 string column with different formats of datetime to datetime?

I have a varchar column [DB_TIMESTAMP] in a (DB2) table which gets data from different sources/environments. This column has different formats in it, like:
11/15/2019 11:30:02
11/15/2019 11:22 AM
2019/11/15 11:15 AM
I have to put remarks using CASE in my query: if any row is more than 2 hours behind the current DateTime according to this column, mark it pending.
I tried the following, but it needs the column in DateTime format, which it is not, because of the different formats of the data entered in it:
CASE WHEN (DAYS(CURRENT DATE) - DAYS(DB_TIMESTAMP)) > 2
[for checking the 2 hours difference]
I think this column needs to be converted into DateTime first; then the above may work, but how?
Please help.
Shamshad Ali
Try something, maybe it helps:
CASE WHEN DAYS(REPLACE(CONVERT(nvarchar(500), CURRENT_DATE, 106), ' ', '-'))
   - DAYS(REPLACE(CONVERT(nvarchar(500), DB_TIMESTAMP, 106), ' ', '-')) > 2
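If the rows can be pulled into pandas instead, the mixed formats can be normalized there; a hedged sketch (the column name DB_TIMESTAMP is taken from the question, and the format list is an assumption based on the three samples shown):

import pandas as pd

# Sample data mirroring the formats in the question.
df = pd.DataFrame({'DB_TIMESTAMP': ['11/15/2019 11:30:02', '11/15/2019 11:22 AM', '2019/11/15 11:15 AM']})

# Try each known format in turn; values a format cannot parse stay NaT until a later format matches.
formats = ['%m/%d/%Y %H:%M:%S', '%m/%d/%Y %I:%M %p', '%Y/%m/%d %I:%M %p']
parsed = pd.Series(pd.NaT, index=df.index)
for fmt in formats:
    parsed = parsed.fillna(pd.to_datetime(df['DB_TIMESTAMP'], format=fmt, errors='coerce'))

# Mark rows more than 2 hours behind the current time as pending.
df['remark'] = (pd.Timestamp.now() - parsed > pd.Timedelta(hours=2)).map({True: 'pending', False: ''})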

Date Manipulation and Comparisons: Python, Pandas and Excel

I have a datetime column [TRANSFER_DATE] in an Excel sheet that shows dates formatted as 1/4/2019 0:45; when the cell is selected, it appears as 01/04/2019 00:45:08. I am using a Python script to read this column [TRANSFER_DATE], which shows the datetime as 01/04/2019 00:45:08.
However, when I try to compare the column [TRANSFER_DATE] with another date, I get this error:
ValueError: "Can only use .dt accessor with datetimelike values"
implying those values are not actually recognized as datetime values:
mask_part_date = data.loc[data['TRANSFER_DATE'].dt.date.astype(str) == '2019-04-12']
As seen in this question, the Excel import might have silently failed for some of the values in the column. If you check the column type with:
data.dtypes
it might show as object instead of datetime64.
If you force your column to have datetime values, that might solve your issue:
data['TRANSFER_DATE'] = pd.to_datetime(data['TRANSFER_DATE'], errors='coerce')
You will spot the non-converted values as NaT and you can debug those manually.
Regarding your comparison, after the dataframe conversion to datetime objects, this might be more efficient:
mask_part_date = data.loc[data['TRANSFER_DATE'] == pd.Timestamp('2019-04-12')]
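To track down the offending cells, you might filter for the coerced values (a short sketch, assuming the pd.to_datetime conversion above has already been applied):

# Rows where coercion produced NaT are the values Excel stored in an unexpected format.
bad_rows = data[data['TRANSFER_DATE'].isna()]
print(bad_rows)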

Creating new column using for loop returns NaN value in Python/Pandas

I am using Python/Pandas to manipulate a data frame. I have a column 'month' (values from 1.0 to 12.0). Now I want to create another column 'quarter'. When I write -
for x in data['month']:
    print((x - 1) // 3 + 1)
I get the proper output, that is, the quarter number (1, 2, 3, 4, etc.).
But I am not able to assign the output to the new column.
for x in data['month']:
    data['quarter'] = ((x - 1) // 3 + 1)
This creates the quarter column with missing or 'NaN' values -
My question is: why am I getting missing values while creating the column?
Note: I am using Python 3.6 and Anaconda 1.7.0. 'data' is the data frame I am using. Initially I had only the date, which I converted to month and year using
data['month'] = pd.DatetimeIndex(data['first_approval']).month
Interestingly, this month column shows dtype: float64. I have read somewhere that "dtype('float64') is equivalent to None" but I didn't understand that statement clearly. Any suggestion or help will be highly appreciated.
This is what I had in the beginning:
This is what I am getting after running the for loop:
The easiest way to get the quarter from the date would be
data['quarter'] = pd.DatetimeIndex(data['date']).quarter
the same way you obtained the month information.
The line below sets the entire column to the last value computed by the loop, because the assignment runs once per iteration and overwrites the whole column with a single scalar each time. (If that last value was NaN, e.g. from a missing or unparseable date, the whole column ends up NaN.)
data['quarter'] = ((x - 1) // 3 + 1)
Try with the below:
df['quarter'] = df['month'].apply(lambda x: (x - 1) // 3 + 1)
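A fully vectorized version avoids the Python-level loop and apply altogether (a sketch; 'month' is float64 as in the question, so the result is cast to a nullable integer):

# Vectorized arithmetic operates on the whole column at once; <NA> marks missing months.
df['quarter'] = ((df['month'] - 1) // 3 + 1).astype('Int64')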

SAS: Date reading issue

I have imported an Excel sheet where date1 is 4/1/16, date2 is 5/29/14, and date3 is 5/2/14. However, when I import the sheet into SAS, PROC PRINT shows the first two date columns as "42461" and "41788" while date3 is 05/02/2014.
I need these date formats to be consistent because I am doing a Cox regression with PROC PHREG.
Any thoughts about how to make these dates consistent?
Thanks!
This probably depends on how the data is represented in Excel and how it is imported into SAS. First, are the formats the same in Excel? The first two are being imported as numbers, the third as a string.
In Excel, you can format the column using a date format. Perhaps your import method will recognize this. You can also define another column as a string, using text(<whatever>, "YYYY-MM-DD") to convert to a string in that format.
Alternatively, you can import everything as numbers and then add the value to 1899-12-31. That is the base date for Excel. This makes more sense if you think of "1" as being 1900-01-01.
Because your column had mixed numeric (date) and character values, SAS imported the field as character. So the actual dates got imported as the text version of the number that Excel stores for dates. The ones that look like date strings in SAS are the fields that were strings in Excel as well.
Or if, in your case, one of the three columns was all valid dates, then SAS imported it as a number and assigned a date format to it, so there is nothing to fix for that column.
The best way to fix it is to make sure that all of the values in the date column are either real dates or empty cells. Then PROC IMPORT will be able to make the right guess at how to import it.
Once you have the strings in SAS and you want to try to fix them, you need to decide which strings look like integers and which should be treated as date strings.
So you might just check whether they contain any non-digit characters and assume those are date strings instead of numbers. For the ones that look like integers, just adjust the number to account for the fact that Excel numbers dates from 1900 and SAS numbers them from 1960.
data want;
  set have;
  if missing(excel_string) then date = .;
  else if notdigit(trim(excel_string)) then date = input(excel_string, anydtdte32.);
  else date = input(excel_string, 32.) + '01JAN1900'd - 2;
  format date yymmdd10.;
run;
You might wonder why the minus 2. It is because Excel starts from 1 instead of 0, and also because Excel thinks 1900 was a leap year. Here are the Excel date numbers for some key dates and a little SAS program to convert them. Try it.
data excel_dates;
  input datestr :$10. excel_num :comma32. #1 sas_num :yymmdd10.;
  diff = sas_num - excel_num;
  format _numeric_ comma14.;
  sasdate1 = excel_num - 21916;
  sasdate2 = excel_num + '01JAN1900'd - 2;
  format sasdate: yymmdd10.;
cards;
1900-01-01 1
1900-02-28 59
1900-03-01 61
1960-01-01 21,916
2018-01-01 43,101
;
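For comparison, pandas can do the same serial-to-date conversion; a minimal sketch (the origin '1899-12-30' encodes the same "minus 2" adjustment, and is valid for serials after Excel's phantom 1900-02-29):

import pandas as pd

# Excel serial day numbers from the table above (from 61 = 1900-03-01 onward).
serials = pd.Series([61, 21916, 43101])

# origin='1899-12-30' bakes in Excel's off-by-one and its fictional 1900 leap day.
print(pd.to_datetime(serials, unit='D', origin='1899-12-30'))
# 1900-03-01, 1960-01-01, 2018-01-01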

Pandas: Calling df.loc[] from an index consisting of pd.datetime

Say I have a df as follows:
a=pd.DataFrame([[1,3]]*3,columns=['a','b'],index=['5/4/2017','5/6/2017','5/8/2017'])
a.index=pd.to_datetime(a.index,format='%m/%d/%Y')
The type of df.index is now
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
When we call a row of data based on the datetime index, it is possible to pass a string form of the date instead of a datetime object. In the above case, if I want the row for 5/4/2017, I can simply pass the date string to .loc as follows:
print(a.loc['5/4/2017'])
And we do not need to pass the datetime object
print(a.loc[pd.datetime(2017, 5, 4)])
My question is: when calling data from .loc with a string date, how does pandas know whether my date string follows m-d-y or d-m-y or some other order? In the case above, I used a.loc['5/4/2017'] and it succeeded in returning the value. Why wouldn't it think the string might mean April 5th, which is not in this index?
Here's my best shot:
Pandas has an internal function called pandas._guess_datetime_format. This is what gets called when passing the 'infer_datetime_format' argument to pandas.to_datetime. It takes a string and runs through a list of "guess" formats and returns its best guess on how to convert that string to a datetime object.
Referencing a datetime index with a string may use a similar approach.
I did some testing to see what would happen in the case you described - where a dataframe contains both the date 2017-04-05 and 2017-05-04.
In this case, the following:
df.loc['5/4/2017']
returned the data for May 4th, 2017, and
df.loc['4/5/2017']
returned the data for April 5th, 2017.
Attempting to reference 4/5/2017 in your original dataframe gave an "is not in the [index]" error.
Based on this, my conclusion is that pandas._guess_datetime_format defaults to a "%m/%d/%Y" format in cases where it cannot be distinguished from "%d/%m/%Y". This is the standard date format in the US.
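A short reproduction of that test (a sketch; the index is built so that both ambiguous dates, April 5th and May 4th, are present):

import pandas as pd

# Both candidate parses of '5/4/2017' exist in this index.
df = pd.DataFrame({'a': [1, 2]}, index=pd.to_datetime(['2017-04-05', '2017-05-04']))

print(df.loc['5/4/2017'])  # returns the 2017-05-04 row (month parsed first)
print(df.loc['4/5/2017'])  # returns the 2017-04-05 row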
