Trouble selecting entire day from DateTime object in pandas DataFrame? - python-3.x

I am having trouble selecting an entire day from a DateTime column in my DataFrame.
I originally started with the date and time in separate columns, and it was a simple thing to select all rows containing a specific date. One DateTime column seemed more convenient, but I have not been able to find out how to select all entries for a specific date. When I don't specify a time, I get an empty DataFrame.
#minimal example
import pandas as pd
df = pd.DataFrame({'date_time':['2019-01-01 00:00:00', '2019-01-01 00:00:00', '2019-01-01 00:30:00', '2019-01-01 01:00:00']})
I can select specific times no problem:
df[df.date_time == '2019-01-01 00:00:00']
But this gives an empty DataFrame:
df[df.date_time == '2019-01-01']
What I want it to return is every entry that has the specified date, regardless of the time.

First convert the column to datetime and extract the date part, then convert it to a string:
df['date_time'] = pd.to_datetime(df['date_time']).dt.date.astype(str)
df[df['date_time'] == '2019-01-01']

Your df has date_time as 'object'. You should first convert it to datetime with
df.date_time = pd.to_datetime(df.date_time)
This will do the trick. If you try now:
df[df.date_time == '2019-01-01']
you'll get your desired result (two records, because both occur at 00:00:00):
date_time
0 2019-01-01
1 2019-01-01
However, if you want to completely ignore the time, add a helper column that holds only the date:
df.date_time = pd.to_datetime(df.date_time)
df['date'] = pd.to_datetime(df['date_time'].dt.date)
and only then filter on it:
df[df.date == '2019-01-01']
and the desired result:
date_time date
0 2019-01-01 2019-01-01
1 2019-01-01 2019-01-01
2 2019-01-01 2019-01-01
3 2019-01-01 2019-01-01
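If you would rather not keep a helper column at all, one alternative (a sketch, not from the answers above) is to compare against the normalized timestamps, which zero out the time-of-day in place:

```python
import pandas as pd

df = pd.DataFrame({'date_time': ['2019-01-01 00:00:00', '2019-01-01 00:00:00',
                                 '2019-01-01 00:30:00', '2019-01-01 01:00:00']})
df['date_time'] = pd.to_datetime(df['date_time'])

# .dt.normalize() keeps the datetime64 dtype but sets the time part to midnight,
# so every row from the same calendar day compares equal to that day's date
same_day = df[df['date_time'].dt.normalize() == '2019-01-01']
```

This matches all four sample rows while leaving the original column untouched.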

Related

Widening long table grouped on date

I have run into a problem transforming a DataFrame. I'm trying to widen a table grouped on a datetime column, but I can't seem to make it work. I have tried to transpose it and pivot it, but I can't really get it the way I want.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point from 00:00 - 20:00 for each day. So I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
Judging from the details you have provided, I think you are dealing with time-series data acquired at 02:00:00 and 03:00:00 on different dates. Please correct me if I am wrong.
First we replicate your DataFrame object.
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we calculate the unique days on which you acquired data:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from the date string because the full nanosecond part would complicate parsing. We convert each string to a datetime object, take its date, and keep the unique values.
Now we create a new empty DataFrame in the desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
After this we can populate the values:
for day in unique_days:
    new_row_data = [day]  # start a row of 3 elements, to be inserted into the empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # value for 02:00 on that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # value for 03:00 the same day
    new_df.loc[len(new_df)] = new_row_data  # append the row at the last position
this should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
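For what it's worth, the loop can be avoided entirely: pandas' pivot does exactly this reshaping once the date and time parts are split into their own columns. A minimal sketch (the helper column names date and time are my own choice, not from the question):

```python
from io import StringIO

import pandas as pd

data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None, names=["datetime", "value"])
df["datetime"] = pd.to_datetime(df["datetime"])

# split the timestamp into a date part (rows) and a time-of-day part (columns)
df["date"] = df["datetime"].dt.date
df["time"] = df["datetime"].dt.strftime("%H:%M")

# pivot: one row per date, one column per time of day
wide = df.pivot(index="date", columns="time", values="value").reset_index()
```

With a data point for every hour from 00:00 to 20:00, the same call produces all the hour columns automatically, so no loop is needed.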

How to convert data frame for time series analysis in Python?

I have a dataset of around 13,000 rows and 2 columns (text and date) covering a two-year period. One of the columns is a date in yyyy-mm-dd format. I want to perform time-series analysis where the x axis would be the date (each day) and the y axis would be the frequency of text on the corresponding date.
I think if I create a new data frame with unique dates and the number of texts on each corresponding date, that would solve my problem.
Sample data
How can I create a new column with the frequency of text for each day?
Thanks in Advance!
Depending on the task you are trying to solve, I can see two options for this dataset.
Either, as you show in your example, count the number of occurrences of the text field on each day, independently of the value of the text field.
Or, count the number of occurrences of each unique value of the text field on each day. You will then have one column for each possible value of the text field, which may make more sense if the values are purely categorical.
First things to do:
import pandas as pd
df = pd.DataFrame(data={'Date':['2018-01-01','2018-01-01','2018-01-01', '2018-01-02', '2018-01-03'], 'Text':['A','B','C','A','A']})
df['Date'] = pd.to_datetime(df['Date']) #convert to datetime type if not already done
Date Text
0 2018-01-01 A
1 2018-01-01 B
2 2018-01-01 C
3 2018-01-02 A
4 2018-01-03 A
Then for option one :
df = df.groupby('Date').count()
Text
Date
2018-01-01 3
2018-01-02 1
2018-01-03 1
For option two :
df[df['Text'].unique()] = pd.get_dummies(df['Text'])
df = df.drop('Text', axis=1)
df = df.groupby('Date').sum()
A B C
Date
2018-01-01 1 1 1
2018-01-02 1 0 0
2018-01-03 1 0 0
The get_dummies function creates one column per possible value of the Text field. Each column is then a boolean indicator for each row of the dataframe, telling us which value of the Text field occurred in that row. We can then simply do a sum aggregation with a groupby on the Date field.
If you are not familiar with groupby and aggregation operations, I recommend reading this guide first.
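Both options can also be collapsed into single calls; this is a sketch of an equivalent, not something from the answer above. Option two in particular is exactly what pd.crosstab computes:

```python
import pandas as pd

df = pd.DataFrame(data={'Date': ['2018-01-01', '2018-01-01', '2018-01-01',
                                 '2018-01-02', '2018-01-03'],
                        'Text': ['A', 'B', 'C', 'A', 'A']})
df['Date'] = pd.to_datetime(df['Date'])

# one row per date, one column per unique Text value, cells hold counts
counts = pd.crosstab(df['Date'], df['Text'])

# total occurrences per day, regardless of the Text value (option one)
per_day = counts.sum(axis=1)
```

This skips both the get_dummies step and the explicit drop of the Text column.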

Is there any method in pandas to convert a dataframe from day to default d/m/y format?

I would like to convert all the day values in the data-frame into day/Feb/2020 format.
Here the date field consists of only the day of the month, and I want to convert it into a full date.
My current approach is:
import datetime
y = []
for day in planned_ds.Date:
    x = datetime.datetime(2020, 5, day)
    print(x)
Is there any easy method to convert all day data-frame to d/m/y format?
One way as assuming you have data like
df = pd.DataFrame([1,2,3,4,5], columns=["date"])
is to convert them to dates and then shift them to start when you need them to:
pd.to_datetime(df["date"], unit="D") - pd.Timestamp(1970, 1, 1) + pd.Timestamp(2020, 1, 31)
(pd.datetime has been removed from recent pandas versions; pd.Timestamp does the same job here.)
this results in
0 2020-02-01
1 2020-02-02
2 2020-02-03
3 2020-02-04
4 2020-02-05
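pd.to_datetime can also do the shifting itself through its origin parameter, which avoids the manual epoch arithmetic. A sketch of the same idea (assuming, as above, that the day numbers are anchored to the end of January 2020):

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5], columns=["date"])

# unit="D" treats each value as a number of days; origin sets which date is day 0
dates = pd.to_datetime(df["date"], unit="D", origin=pd.Timestamp("2020-01-31"))

# render in day/month-name/year form, e.g. 01/Feb/2020
formatted = dates.dt.strftime("%d/%b/%Y")
```

This yields the same 2020-02-01 through 2020-02-05 range as the subtraction above, plus the d/m/y strings the question asks for.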

How can I loop over rows in my DataFrame, calculate a value and put that value in a new column with this lambda function

./test.csv looks like:
price datetime
1 100 2019-10-10
2 150 2019-11-10
...
import pandas as pd
from datetime import datetime, timedelta
csv_df = pd.read_csv('./test.csv')
today = datetime.today()
csv_df['datetime'] = csv_df['expiration_date'].apply(lambda x: pd.to_datetime(x)) #convert `expiration_date` to datetime Series
def days_until_exp(expiration_date, today):
    diff = (expiration_date - today)
    return [diff]

csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: days_until_exp(csv_df['datetime'], today))
I am trying to iterate over a specific column in my DataFrame labeled csv_df['datetime'], which in each cell has just one value, a date, and do a calculation defined by diff.
Then I want the single value diff to be put into the new Series csv_df['days_until_expiration'].
The problem is, it's calculating values for every row (673 rows) and putting all those values in a list in each row of csv_df['days_until_expiration']. I realize it may be due to the brackets around [diff], but without them I get an error.
In Excel, I would just do something like =SUM(datetime - price) and click and drag down the rows to have it populate a new column. However, I want to do this in Pandas as it's part of a bigger application.
csv_df['datetime'] is a Series, so the x in apply is each cell of the Series. You call apply with a lambda and days_until_exp(), but you don't pass x to it; you pass the whole column instead. Therefore the result is wrong.
Anyway, without your sample data, I guess that you want the difference between csv_df['datetime'] and today(), or the sum of those differences. To do this, you don't need apply. Just do a direct vectorized operation on the Series.
I make 2 columns dataframe for sample:
csv_df:
datetime days_until_expiration
0 2019-09-01 NaN
1 2019-09-02 NaN
2 2019-09-03 NaN
The following returns a Series of deltas between csv_df['datetime'] and today(), which I guess is what you want:
td = datetime.datetime.today()
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days
csv_df:
datetime days_until_expiration
0 2019-09-01 115
1 2019-09-02 116
2 2019-09-03 117
OR:
To find sum of all deltas and assign the same sum value to csv_df['days_until_expiration']
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days.sum()
csv_df:
datetime days_until_expiration
0 2019-09-01 348
1 2019-09-02 348
2 2019-09-03 348
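Put together, a runnable version of the vectorized fix (using a fixed reference date instead of today() so the numbers are reproducible; the dates are the sample ones above):

```python
import pandas as pd

csv_df = pd.DataFrame({'datetime': pd.to_datetime(['2019-09-01', '2019-09-02',
                                                   '2019-09-03'])})

# fixed "today" purely so the example is reproducible; in real use, datetime.today()
td = pd.Timestamp('2019-05-09')

# vectorized subtraction: one timedelta per row, no apply needed
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days
```

With this reference date the column comes out as 115, 116, 117, matching the sample output above.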

Efficient way of converting String column to Date in Pandas (in Python), but without Timestamp

I have a DataFrame which contains two String columns, df['month'] and df['year']. I want to create a new column df['date'] by combining the month and the year columns. I have done that successfully using the structure below -
df['date']=pd.to_datetime((df['month']+df['year']),format='%m%Y')
whereby for df['month'] = '08' and df['year'] = '1968'
we get df['date'] = 1968-08-01
This is exactly what I wanted.
Problem at hand: My DataFrame has more than 200,000 rows and I notice that sometimes, in addition, I also get Timestamp like the one below for a few rows and I want to avoid that -
1972-03-01 00:00:00
I solved this issue by using the .dt acessor, which can be used to manipulate the Series, whereby I explicitly extracted only the date using the code below-
df['date']=pd.to_datetime((df['month']+df['year']),format='%m%Y') #Line 1
df['date']=df['date'].dt.date #Line 2
The problem was solved, except that Line 2 took 5 times longer than Line 1.
Question: Is there any way where I could tweak Line 1 into giving just the dates and not the Timestamp? I am sure this simple problem cannot have such an inefficient solution. Can I solve this issue in a more time and resource efficient manner?
AFAIK we don't have a date dtype in Pandas; we only have datetime, so we will always have a time part.
Even though Pandas shows: 1968-08-01, it has a time part: 00:00:00.
Demo:
In [32]: df = pd.DataFrame(pd.to_datetime(['1968-08-01', '2017-08-01']), columns=['Date'])
In [33]: df
Out[33]:
Date
0 1968-08-01
1 2017-08-01
In [34]: df['Date'].dt.time
Out[34]:
0 00:00:00
1 00:00:00
Name: Date, dtype: object
And if you want to have a string representation, there is a faster way:
df['date'] = df['year'].astype(str) + '-' + df['month'].astype(str) + '-01'
UPDATE: be aware that .dt.date gives you Python datetime.date objects in an object-dtype column, not a datetime64 column:
In [53]: df.dtypes
Out[53]:
Date datetime64[ns]
dtype: object
In [54]: df['new'] = df['Date'].dt.date
In [55]: df
Out[55]:
Date new
0 1968-08-01 1968-08-01
1 2017-08-01 2017-08-01
In [56]: df.dtypes
Out[56]:
Date datetime64[ns]
new object # <--- NOTE !!!
dtype: object
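A quick check (a sketch, not part of the original answer) of what .dt.date actually produces, namely datetime.date objects in an object-dtype column:

```python
import datetime

import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['1968-08-01', '2017-08-01'])})
df['new'] = df['Date'].dt.date

# each cell is a plain datetime.date, and the column dtype degrades to object,
# which is why further .dt operations on 'new' would fail
first = df['new'].iloc[0]
```

This is why the string-concatenation approach above is faster when a string is all you need: it never leaves and re-enters the datetime machinery.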
