get rows by date regardless of date format in pandas - python-3.x

I have data as follows:
Col1,ColDate
a,2020-09-11 08:43:00
b,2020-09-12 09:43:00
c,13-09-2020 09:43:00
d,09/16/2020 10:43:00
e,09/19/2020 12:43:00
f,09/12/2020 15:43:00
The intention is to get all rows between 11 Sep and 13 Sep, regardless of the date format. In pandas
I am trying the following:
df[df["ColDate"].between('11-09-2020','13-09-2020')]
I get an empty dataframe.

You can try this,
df[pd.to_datetime(df['ColDate']).dt.strftime('%d-%m-%Y').between('11-09-2020','13-09-2020')]
Col1 ColDate
0 a 2020-09-11 08:43:00
1 b 2020-09-12 09:43:00
2 c 13-09-2020 09:43:00
5 f 09/12/2020 15:43:00
but it is really hard to say which part will be treated as the month and which as the day, because the date formats are jumbled.
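A caveat worth noting: between here compares the formatted strings lexicographically, so a date like 12-10-2020 (12 Oct) would also slip through, since '11-09-2020' <= '12-10-2020' <= '13-09-2020' as strings. A safer sketch is to compare real datetimes; this assumes pandas >= 2.0, where format='mixed' parses each row independently (normalize() drops the time of day so rows on the 13th itself are kept):
parsed = pd.to_datetime(df['ColDate'], format='mixed')  # pandas >= 2.0; per-row format inference
df[parsed.dt.normalize().between('2020-09-11', '2020-09-13')]
On older pandas versions, plain pd.to_datetime(df['ColDate']) usually infers each element's format as well.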

Please check the snippet. You can first convert your ColDate column with pd.to_datetime and then apply a mask over it, like this.
df['ColDate'] = pd.to_datetime(df['ColDate'])
mask = (df['ColDate'] > '2020-09-11') & (df['ColDate'] <='2020-09-13')
df = df.loc[mask]
Output (note that row c is excluded: once parsed, 13-09-2020 09:43:00 falls after midnight on the 13th, so it fails the <= '2020-09-13' condition)
Col1 ColDate
0 a 2020-09-11 08:43:00
1 b 2020-09-12 09:43:00
5 f 2020-09-12 15:43:00

Related

Widening long table grouped on date

I have run into a problem transforming a dataframe. I'm trying to widen a table grouped on a datetime column, but I can't seem to make it work. I have tried to transpose it and pivot it, but I can't really get it the way I want it.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point per hour from 00:00 to 20:00 for each day, so I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or could you nudge me in the right direction?
Thanks in advance!
From the details you have provided, I assume you are dealing with time series data acquired on different dates at 02:00:00 and 03:00:00. Please correct me if I am wrong.
First we replicate your DataFrame object.
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we compute the unique days on which you acquired data:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from the timestamp, because %f only handles microseconds and the nanosecond digits would complicate parsing. We convert each string to a datetime object, take its date, and keep the unique values.
Now we create a new empty df in the desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
After this we can populate the values:
for day in unique_days:
    new_row_data = [day]  # start a row of 3 elements, to be inserted into the empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # value for 02:00 on that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # value for 03:00 on the same day
    new_df.loc[len(new_df)] = new_row_data  # append the row at the last position
This should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
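As a side note, since the real data has one point per hour, the loop can be avoided entirely; here is a minimal sketch using pivot on the same df as above, which generates the hour columns automatically:
df["date"] = pd.to_datetime(df["date"])
wide = (df.assign(day=df["date"].dt.date, hour=df["date"].dt.strftime("%H:%M"))
          .pivot(index="day", columns="hour", values="value")
          .reset_index())
This yields one row per day with one column per hour (02:00, 03:00, ...), and scales to the full 00:00-20:00 range without an explicit loop.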

How to convert data frame for time series analysis in Python?

I have a dataset of around 13,000 rows and 2 columns (text and date) covering a two-year period. One of the columns is a date in yyyy-mm-dd format. I want to perform time series analysis where the x axis would be the date (each day) and the y axis would be the frequency of text on the corresponding date.
I think if I create a new data frame with unique dates and the number of texts on each corresponding date, that would solve my problem.
Sample data (shown as an image in the original post).
How can I create a new column with the frequency of text for each day? (The expected output was also shown as an image.)
Thanks in advance!
Depending on the task you are trying to solve, I can see two options for this dataset.
Either, as you show in your example, count the number of occurrences of the text field on each day, independently of the value of the text field.
Or, count the number of occurrences of each unique value of the text field on each day. You will then have one column for each possible value of the text field, which may make more sense if the values are purely categorical.
First things to do:
import pandas as pd
df = pd.DataFrame(data={'Date':['2018-01-01','2018-01-01','2018-01-01', '2018-01-02', '2018-01-03'], 'Text':['A','B','C','A','A']})
df['Date'] = pd.to_datetime(df['Date']) #convert to datetime type if not already done
Date Text
0 2018-01-01 A
1 2018-01-01 B
2 2018-01-01 C
3 2018-01-02 A
4 2018-01-03 A
Then for option one:
df = df.groupby('Date').count()
Text
Date
2018-01-01 3
2018-01-02 1
2018-01-03 1
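One thing to watch for: days on which no text occurred are simply absent from this result. For a daily time series you may want to reinsert them with zero counts; a small sketch, assuming the grouped frame from above (its index is the datetime Date):
df = df.asfreq('D', fill_value=0)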
For option two:
df[df['Text'].unique()] = pd.get_dummies(df['Text'])
df = df.drop('Text', axis=1)
df = df.groupby('Date').sum()
A B C
Date
2018-01-01 1 1 1
2018-01-02 1 0 0
2018-01-03 1 0 0
The get_dummies function will create one column per possible value of the Text field. Each column is then a boolean indicator for each row of the dataframe, telling us which value of the Text field occurred in that row. We can then simply apply a sum aggregation with a groupby on the Date field.
If you are not familiar with groupby and aggregation operations, I recommend that you read this guide first.
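As a side note, a compact sketch of option two in a single call with pd.crosstab, which tabulates counts of each Text value per Date directly (run against the original df, before the get_dummies step):
print(pd.crosstab(df['Date'], df['Text']))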

Extract row from pandas dataframe

I have a data frame as in the image below. I want to extract the rows of the data frame that have '1395/01' as year and month. I used the code below, but I know it is not correct, because slicing like this operates on the rows of the Series rather than on the characters of each string. Can anyone show me a way without using nested for loops?
df[df['Date'][:7] == '1395/01']
I might use str.match here:
df[df['Date'].str.match(r'^1395/01')]
But in general it is usually preferable to store dates as datetime and not text. Also, the year 1395 seems dubious.
You can use loc and startswith to filter your dataframe.
Sample:
df = pd.DataFrame({'Date': ['1395/01/01', '1395/02/01', '1395/01/01', '1395/05/01']})
print(df)
Date
0 1395/01/01
1 1395/02/01
2 1395/01/01
3 1395/05/01
Solution:
print(df.loc[df['Date'].str.startswith('1395/01'), :])
Date
0 1395/01/01
2 1395/01/01
If you would like to extract year and month for all rows, you can use str.slice:
df['Extracted Date'] = df['Date'].str.slice(0, 7)
print(df)
Date Extracted Date
0 1395/01/01 1395/01
1 1395/02/01 1395/02
2 1395/01/01 1395/01
3 1395/05/01 1395/05
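As a side note, the slicing idea from the question also works once it is routed through the .str accessor, which slices the characters of each string instead of the rows of the Series:
print(df[df['Date'].str[:7] == '1395/01'])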

Sorting data frames with time in hours days months format

I have one data frame with a time column that is not in sorted order, and I want to sort it in ascending order. Can someone suggest a direct function or code to sort a data frame by such time values?
My input data frame:
Time       data1
1 month    43.391588
13 h       31.548372
14 months  41.956652
3.5 h      31.847388
Expected data frame:
Time       data1
3.5 h      31.847388
13 h       31.548372
1 month    43.391588
14 months  41.956652
You need to replace the units with numbers first using Series.replace, then evaluate the resulting arithmetic strings with pandas.eval, and finally use this helper column for sorting with DataFrame.sort_values:
d = {' months': '*30*24', ' month': '*30*24', ' h': '*1'}  # express everything in hours; ' months' must come before ' month'
df['sort'] = df['Time'].replace(d, regex=True).map(pd.eval)  # e.g. '14 months' -> '14*30*24' -> 10080.0
df = df.sort_values('sort')
print (df)
Time data1 sort
3 3.5 h 31.847388 3.5
1 13 h 31.548372 13.0
0 1 month 43.391588 720.0
2 14 months 41.956652 10080.0
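If you do not want to keep the helper column afterwards, a small sketch:
df = df.sort_values('sort').drop(columns='sort')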
First you have to check the type of the data in your dataframe, as this will indicate how you may proceed: df.dtypes (or, if the times live in the index, df.index.dtype).
The preferred option for sorting dataframes is df.sort_values().

How to calculate the time difference between rows, groupby row name, and extract only the most recent ones?

I want to calculate the number of days between two rows with a groupby and extract only one row per id with the latest date. I do not want all the rows with the same id value; I want only the most recent one, with the number of days as a new column.
In [37]: df
Out[37]:
id time
0 A 2016-11-25 16:32:17
1 A 2016-11-27 16:36:04
2 A 2016-11-29 16:35:29
3 B 2016-11-25 16:35:24
4 B 2016-11-28 16:35:46
I want the output as
id no of days
0 A 4 (approx)
1 B 3 (approx)
So what I want is, for each id, only the most recent row by date and time, omitting the rest of the rows.
IIUC
df.time=pd.to_datetime(df.time)
df.groupby('id').time.apply(lambda x : (x.max()-x.min()).days)
Out[1186]:
id
A 4
B 3
Name: time, dtype: int64
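If you also need to keep the most recent row per id, as the question asks, here is a sketch building on the same df (the column name no_of_days is my own choice):
days = df.groupby('id')['time'].agg(lambda s: (s.max() - s.min()).days).rename('no_of_days').reset_index()
latest = df.sort_values('time').drop_duplicates('id', keep='last')
print(latest.merge(days, on='id'))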
