Group dates quarterly in a pandas dataframe and count their occurrences - python-3.x

My Dataframe looks like
"dataframe_time"
INSERTED_UTC
0 2018-05-29
1 2018-05-22
2 2018-02-10
3 2018-04-30
4 2018-03-02
5 2018-11-26
6 2018-03-07
7 2018-05-12
8 2019-02-03
9 2018-08-03
10 2018-04-27
print(type(dataframe_time['INSERTED_UTC'].iloc[1]))
<class 'datetime.date'>
I am trying to group the dates together and find the count of their occurrences per quarter. Desired output:
Quarter Count
2018-03-31 3
2018-06-30 5
2018-09-30 1
2018-12-31 1
2019-03-31 1
2019-06-30 0
I am running the following command to group them together, but it raises a TypeError:
dataframe_time['INSERTED_UTC'].groupby(pd.Grouper(freq='Q'))
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'

First convert the dates to datetimes, then use DataFrame.resample with the on parameter to resample by the datetime column:
dataframe_time.INSERTED_UTC = pd.to_datetime(dataframe_time.INSERTED_UTC)
df = dataframe_time.resample('Q', on='INSERTED_UTC').size().reset_index(name='Count')
Alternatively, your solution works if you pass the column name to pd.Grouper through key:
df = (dataframe_time.groupby(pd.Grouper(freq='Q', key='INSERTED_UTC'))
.size()
.reset_index(name='Count'))
print (df)
INSERTED_UTC Count
0 2018-03-31 3
1 2018-06-30 5
2 2018-09-30 1
3 2018-12-31 1
4 2019-03-31 1

You can convert the dates to quarters by to_period('Q') and group by those:
df.INSERTED_UTC = pd.to_datetime(df.INSERTED_UTC)
df.groupby(df.INSERTED_UTC.dt.to_period('Q')).size()
You can also use value_counts; add sort_index() so the result is ordered by quarter rather than by count:
df.INSERTED_UTC.dt.to_period('Q').value_counts().sort_index()
Output:
INSERTED_UTC
2018Q1 3
2018Q2 5
2018Q3 1
2018Q4 1
2019Q1 1
Freq: Q-DEC, dtype: int64
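The desired output in the question also lists a quarter with a count of 0, which grouping alone does not produce. A minimal sketch of one way to get it, assuming the full range of quarters is known up front (the endpoints "2018Q1" and "2019Q2" and the name all_quarters are illustrative choices taken from the desired output, not part of the answers above):

```python
import datetime as dt

import pandas as pd

# Sample data from the question, stored as datetime.date objects.
dataframe_time = pd.DataFrame({"INSERTED_UTC": [
    dt.date(2018, 5, 29), dt.date(2018, 5, 22), dt.date(2018, 2, 10),
    dt.date(2018, 4, 30), dt.date(2018, 3, 2), dt.date(2018, 11, 26),
    dt.date(2018, 3, 7), dt.date(2018, 5, 12), dt.date(2019, 2, 3),
    dt.date(2018, 8, 3), dt.date(2018, 4, 27),
]})

# Convert to datetime64 first, then map each date to its calendar quarter.
quarters = pd.to_datetime(dataframe_time["INSERTED_UTC"]).dt.to_period("Q")

# Count rows per quarter; the result is indexed by quarter.
counts = quarters.groupby(quarters).size()

# Assumed range of quarters to report, so empty quarters show up as 0.
all_quarters = pd.period_range("2018Q1", "2019Q2", freq="Q")
counts = counts.reindex(all_quarters, fill_value=0)

print(counts)
```

Reindexing against an explicit range keeps the report shape stable even when a quarter has no rows.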

Related

How to check if dates in a pandas column are after a date

I have a pandas dataframe
date
0 2010-03
1 2017-09-14
2 2020-10-26
3 2004-12
4 2012-04-01
5 2017-02-01
6 2013-01
I basically want to filter where dates are after 2015-12 (Dec 2015)
To get this:
date
0 2017-09-14
1 2020-10-26
2 2017-02-01
I tried this
df = df[(df['date']> "2015-12")]
but I'm getting an error
ValueError: Wrong number of items passed 17, placement implies 1
First, your solution works correctly for me:
df = df[(df['date']> "2015-12")]
print (df)
date
1 2017-09-14
2 2020-10-26
5 2017-02-01
Converting to datetimes first, which should be more robust, also works for me:
df = df[(pd.to_datetime(df['date'])> "2015-12")]
print (df)
date
1 2017-09-14
2 2020-10-26
5 2017-02-01
Detail:
print (pd.to_datetime(df['date']))
0 2010-03-01
1 2017-09-14
2 2020-10-26
3 2004-12-01
4 2012-04-01
5 2017-02-01
6 2013-01-01
Name: date, dtype: datetime64[ns]
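How pd.to_datetime handles a column that mixes year-month and full dates can differ between pandas versions. As a hedged sketch, one option is to pad the short strings with a day and parse everything with a single explicit format. The padding step and the names full and parsed are assumptions for illustration, not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2010-03", "2017-09-14", "2020-10-26", "2004-12",
                            "2012-04-01", "2017-02-01", "2013-01"]})

# Append "-01" to year-month strings so every value has the form YYYY-MM-DD.
full = df["date"].where(df["date"].str.len() == 10, df["date"] + "-01")

# Parse with one explicit format instead of relying on format inference.
parsed = pd.to_datetime(full, format="%Y-%m-%d")

# Keep rows strictly after 1 December 2015; original strings are preserved.
result = df[parsed > pd.Timestamp("2015-12-01")]
print(result)
```

The filter is applied through a separate parsed series, so the date column itself keeps its original string values.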

How to fill malformed date values with 0 in Python

id date_original
1 20200305
2 2020305
3 2020035
4 202035
How can I convert the 'date_original' column into a 'date' column in a pandas dataframe?
id date
1 20200305
2 20200305
3 20200305
4 20200305
All of these values parse correctly for me when the format is specified to match YYYYMMDD (tested in pandas 1.1.3):
df['date_original'] = pd.to_datetime(df['date_original'], format='%Y%m%d', errors='coerce')
print (df)
id date_original
0 1 2020-03-05
1 2 2020-03-05
2 3 2020-03-05
3 4 2020-03-05
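If the goal is the zero-padded string form shown in the question rather than a datetime column, the same idea can be sketched with the standard library's datetime.strptime, which accepts a month or day written without its leading zero. This swaps in strptime for pd.to_datetime, and the helper name normalize_yyyymmdd is hypothetical. Note that some inputs are inherently ambiguous (2020111 could be read as 11 January or 1 November), so this is only a sketch for data like the sample:

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "date_original": [20200305, 2020305, 2020035, 202035]})


def normalize_yyyymmdd(value):
    """Parse a loosely written YYYYMMDD value and re-emit it zero-padded."""
    parsed = datetime.strptime(str(value), "%Y%m%d")
    return parsed.strftime("%Y%m%d")


# Apply the hypothetical helper to each value in the column.
df["date"] = df["date_original"].map(normalize_yyyymmdd)
print(df[["id", "date"]])
```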

Convert 6 digits date format to standard one in Pandas

I'm working with a dataframe that has one messy date column with irregular formats, i.e.:
date
0 19.01.01
1 19.02.01
2 1991/01/01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Is it possible to convert it to the standard format XXXX-XX-XX, which represents year-month-day? Thank you.
date
0 2019-01-01
1 2019-02-01
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Use pd.to_datetime with yearfirst=True
Ex:
df = pd.DataFrame({"date": ['19.01.01', '19.02.01', '1991/01/01', '1996-01-01', '1996-06-30', '1995-12-31', '1997-01-01']})
df['date'] = pd.to_datetime(df['date'], yearfirst=True).dt.strftime("%Y-%m-%d")
print(df)
Output:
date
0 2019-01-01
1 2019-02-01
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
It depends on the formats; the most general solution is to specify each format and use Series.combine_first:
date1 = pd.to_datetime(df['date'], format='%y.%m.%d', errors='coerce')
date2 = pd.to_datetime(df['date'], format='%Y/%m/%d', errors='coerce')
date3 = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
df['date'] = date1.combine_first(date2).combine_first(date3)
print (df)
date
0 2019-01-01
1 2019-02-01
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
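When there are more than a few candidate formats, the same pattern can be written as a loop. This is a sketch only: the formats list is assumed to cover every layout in the column, and it uses fillna instead of combine_first to merge each attempt into the running result:

```python
import pandas as pd

df = pd.DataFrame({"date": ["19.01.01", "19.02.01", "1991/01/01", "1996-01-01",
                            "1996-06-30", "1995-12-31", "1997-01-01"]})

# Candidate formats, tried in order; earlier formats take precedence.
formats = ["%y.%m.%d", "%Y/%m/%d", "%Y-%m-%d"]

parsed = None
for fmt in formats:
    # Values that do not match this format become NaT.
    attempt = pd.to_datetime(df["date"], format=fmt, errors="coerce")
    # Fill only the positions that no earlier format could parse.
    parsed = attempt if parsed is None else parsed.fillna(attempt)

df["date"] = parsed.dt.strftime("%Y-%m-%d")
print(df)
```

Any value that matches none of the formats stays missing, which makes unparsed rows easy to spot.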
Try the following, which replaces the '/' and '.' separators with '-':
df['date'] = df['date'].replace(r'[/.]', '-', regex=True)
Then use pd.to_datetime():
pd.to_datetime(df['date'])
Output:
0 2001-01-19
1 2001-02-19
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Name: 0, dtype: datetime64[ns]

How to compare two data frames based on the difference in dates

I have two data frames; each has an id column and a date column.
I want to find the rows in both data frames that have the same id and a date difference of more than 2 days.
Normally it's helpful to include a dataframe so that the responder doesn't need to create it. :)
import pandas as pd
from datetime import timedelta
Create two dataframes:
df1 = pd.DataFrame(data={"id":[0,1,2,3,4], "date":["2019-01-01","2019-01-03","2019-01-05","2019-01-07","2019-01-09"]})
df1["date"] = pd.to_datetime(df1["date"])
df2 = pd.DataFrame(data={"id":[0,1,2,8,4], "date":["2019-01-02","2019-01-06","2019-01-09","2019-01-07","2019-01-10"]})
df2["date"] = pd.to_datetime(df2["date"])
They will look like this:
DF1
id date
0 0 2019-01-01
1 1 2019-01-03
2 2 2019-01-05
3 3 2019-01-07
4 4 2019-01-09
DF2
id date
0 0 2019-01-02
1 1 2019-01-06
2 2 2019-01-09
3 8 2019-01-07
4 4 2019-01-10
Merge the two dataframes on 'id' columns:
df_result = df1.merge(df2, on="id")
Resulting in:
id date_x date_y
0 0 2019-01-01 2019-01-02
1 1 2019-01-03 2019-01-06
2 2 2019-01-05 2019-01-09
3 4 2019-01-09 2019-01-10
Then subtract the two date columns and filter for differences greater than two days.
df_result[(df_result["date_y"] - df_result["date_x"]) > timedelta(days=2)]
id date_x date_y
1 1 2019-01-03 2019-01-06
2 2 2019-01-05 2019-01-09
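The filter above only catches rows where the second date is later than the first. If the difference should count in either direction, one possible variation takes the absolute value of the gap. The suffixes and the name gap are illustrative choices, not part of the answer:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [0, 1, 2, 3, 4],
                    "date": ["2019-01-01", "2019-01-03", "2019-01-05",
                             "2019-01-07", "2019-01-09"]})
df2 = pd.DataFrame({"id": [0, 1, 2, 8, 4],
                    "date": ["2019-01-02", "2019-01-06", "2019-01-09",
                             "2019-01-07", "2019-01-10"]})
df1["date"] = pd.to_datetime(df1["date"])
df2["date"] = pd.to_datetime(df2["date"])

# Inner merge keeps only ids present in both frames.
merged = df1.merge(df2, on="id", suffixes=("_df1", "_df2"))

# Absolute gap, so the order of the two dates does not matter.
gap = (merged["date_df2"] - merged["date_df1"]).abs()

result = merged[gap > pd.Timedelta(days=2)]
print(result)
```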

Apply a value to max values in a groupby

I have a DF like this:
ID Time
1 20:29
1 20:45
1 23:16
2 11:00
2 13:00
3 01:00
I want to create a new column that puts a 1 next to the largest time value within each ID grouping like so:
ID Time Value
1 20:29 0
1 20:45 0
1 23:16 1
2 11:00 0
2 13:00 1
3 01:00 1
I know the answer involves a groupby mechanism and have been fiddling around with something like:
df.groupby('ID')['Time'].max() = 1
The idea is to write an anonymous function that operates on each of your groups and feed this to your groupby using apply:
df['Value']=df.groupby('ID',as_index=False).apply(lambda x : x.Time == max(x.Time)).values
Assuming that your 'Time' column is already datetime64, group by the 'ID' column and call transform with a lambda, which returns a series whose index is aligned with your original df:
In [92]:
df['Value'] = df.groupby('ID')['Time'].transform(lambda x: (x == x.max())).dt.nanosecond
df
Out[92]:
ID Time Value
0 1 2015-11-20 20:29:00 0
1 1 2015-11-20 20:45:00 0
2 1 2015-11-20 23:16:00 1
3 2 2015-11-20 11:00:00 0
4 2 2015-11-20 13:00:00 1
5 3 2015-11-20 01:00:00 1
The dt.nanosecond call is there because, for some reason, the dtype returned is datetime rather than boolean:
In [93]:
df.groupby('ID')['Time'].transform(lambda x: (x == x.max()))
Out[93]:
0 1970-01-01 00:00:00.000000000
1 1970-01-01 00:00:00.000000000
2 1970-01-01 00:00:00.000000001
3 1970-01-01 00:00:00.000000000
4 1970-01-01 00:00:00.000000001
5 1970-01-01 00:00:00.000000001
Name: Time, dtype: datetime64[ns]
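A variation that avoids the dtype workaround is to broadcast each group's maximum with transform("max") and compare it to the column directly. This is a sketch under the assumption that 'Time' holds HH:MM strings as in the question, parsed here with an explicit format:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 1, 2, 2, 3],
                   "Time": ["20:29", "20:45", "23:16", "11:00", "13:00", "01:00"]})

# Parse the HH:MM strings so they compare as times rather than as text.
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M")

# transform("max") returns each group's maximum aligned to the original rows.
group_max = df.groupby("ID")["Time"].transform("max")

# The comparison yields booleans; cast to int to get the 0/1 flag.
df["Value"] = (df["Time"] == group_max).astype(int)
print(df)
```

If several rows in a group tie for the maximum, each of them is flagged with 1.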
