Cannot convert object to date after groupby - python-3.x

I converted a column like this successfully on a different dataset a couple of days ago, but the same technique does not work on my current dataset, which looks like this:
totalHist.columns.values[[0, 1]] = ['Datez', 'Volumez']
totalHist.head()
Datez Volumez
0 2016-09-19 6.300000e+07
1 2016-09-20 3.382694e+07
2 2016-09-26 4.000000e+05
3 2016-09-27 4.900000e+09
4 2016-09-28 5.324995e+08
totalHist.dtypes
Datez object
Volumez float64
dtype: object
This used to do the trick:
totalHist['Datez'] = pd.to_datetime(totalHist['Datez'], format='%d-%m-%Y')
totalHist.dtypes
which now is giving me:
KeyError: 'Datez'
During handling of the above exception, another exception occurred:
How can I fix this? Before the conversion I run this groupby:
totalHist = df.groupby('Date', as_index = False).agg({"Trading_Value": "sum"})
totalHist.head()
totalHist.columns.values[[0, 1]] = ['Datez', 'Volumez']
totalHist.head()

You can just use .rename() to rename your columns
Generate some data (in same format as OP)
d = ['1/1/2018','1/2/2018','1/3/2018',
'1/3/2018','1/4/2018','1/2/2018','1/1/2018','1/5/2018']
df = pd.DataFrame(d, columns=['Date'])
df['Trading_Value'] = [1000,1005,1001,1001,1002,1009,1010,1002]
print(df)
Date Trading_Value
0 1/1/2018 1000
1 1/2/2018 1005
2 1/3/2018 1001
3 1/3/2018 1001
4 1/4/2018 1002
5 1/2/2018 1009
6 1/1/2018 1010
7 1/5/2018 1002
GROUP BY
totalHist = df.groupby('Date', as_index = False).agg({"Trading_Value": "sum"})
print(totalHist.head())
Date Trading_Value
0 1/1/2018 2010
1 1/2/2018 2014
2 1/3/2018 2002
3 1/4/2018 1002
4 1/5/2018 1002
Rename columns
totalHist.rename(columns={'Date':'Datez','Trading_Value':'Volumez'}, inplace=True)
print(totalHist)
Datez Volumez
0 1/1/2018 2010
1 1/2/2018 2014
2 1/3/2018 2002
3 1/4/2018 1002
4 1/5/2018 1002
Finally, convert to datetime
totalHist['Datez'] = pd.to_datetime(totalHist['Datez'])
print(totalHist.dtypes)
Datez datetime64[ns]
Volumez int64
dtype: object
This was done with python --version = 3.6.7 and pandas (0.23.4).
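The KeyError in the question most likely comes from writing into columns.values: that changes the labels that are displayed, but the lookup pandas uses for totalHist['Datez'] may not be updated. A minimal sketch of the whole pipeline with rename, using made-up rows in the question's ISO date layout (the values and the snake_case variable name are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample rows; dates use the same YYYY-MM-DD layout as the question.
df = pd.DataFrame({
    "Date": ["2016-09-19", "2016-09-19", "2016-09-20", "2016-09-26"],
    "Trading_Value": [3.0e7, 3.3e7, 3.382694e7, 4.0e5],
})

total_hist = (
    df.groupby("Date", as_index=False)
      .agg({"Trading_Value": "sum"})
      # rename() returns a new frame whose column index is rebuilt properly
      .rename(columns={"Date": "Datez", "Trading_Value": "Volumez"})
)

# The format string must match the data: year first here, not '%d-%m-%Y'.
total_hist["Datez"] = pd.to_datetime(total_hist["Datez"], format="%Y-%m-%d")
print(total_hist.dtypes)
```

Assigning total_hist.columns = ['Datez', 'Volumez'] would work as well; both replace the column index instead of mutating its underlying array.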

Related

Groupby dates quarterly in a pandas dataframe and find count for their occurrence

My Dataframe looks like
"dataframe_time"
INSERTED_UTC
0 2018-05-29
1 2018-05-22
2 2018-02-10
3 2018-04-30
4 2018-03-02
5 2018-11-26
6 2018-03-07
7 2018-05-12
8 2019-02-03
9 2018-08-03
10 2018-04-27
print(type(dataframe_time['INSERTED_UTC'].iloc[1]))
<class 'datetime.date'>
I am trying to group the dates together and find the count of their occurrence quarterly. Desired output:
Quarter Count
2018-03-31 3
2018-06-30 5
2018-09-30 1
2018-12-31 1
2019-03-31 1
2019-06-30 0
I am running the following command to group them together, and it fails with the error shown after it:
dataframe_time['INSERTED_UTC'].groupby(pd.Grouper(freq='Q'))
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'
First convert the dates to datetimes, then use DataFrame.resample with the on parameter to resample on the datetime column:
dataframe_time.INSERTED_UTC = pd.to_datetime(dataframe_time.INSERTED_UTC)
df = dataframe_time.resample('Q', on='INSERTED_UTC').size().reset_index(name='Count')
Or your solution can be changed to:
df = (dataframe_time.groupby(pd.Grouper(freq='Q', key='INSERTED_UTC'))
.size()
.reset_index(name='Count'))
print (df)
INSERTED_UTC Count
0 2018-03-31 3
1 2018-06-30 5
2 2018-09-30 1
3 2018-12-31 1
4 2019-03-31 1
You can convert the dates to quarters by to_period('Q') and group by those:
df.INSERTED_UTC = pd.to_datetime(df.INSERTED_UTC)
df.groupby(df.INSERTED_UTC.dt.to_period('Q')).size()
You can also use value_counts:
df.INSERTED_UTC.dt.to_period('Q').value_counts()
Output:
INSERTED_UTC
2018Q1 3
2018Q2 5
2018Q3 1
2018Q4 1
2019Q1 1
Freq: Q-DEC, dtype: int64
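For reference, a self-contained sketch of the to_period approach, starting from datetime.date objects as in the question. The frame is rebuilt here from the dates shown above; sort_index is added only so the quarters come out in chronological order.

```python
import datetime as dt

import pandas as pd

# The eleven dates from the question, stored as datetime.date objects.
dataframe_time = pd.DataFrame({"INSERTED_UTC": [
    dt.date(2018, 5, 29), dt.date(2018, 5, 22), dt.date(2018, 2, 10),
    dt.date(2018, 4, 30), dt.date(2018, 3, 2), dt.date(2018, 11, 26),
    dt.date(2018, 3, 7), dt.date(2018, 5, 12), dt.date(2019, 2, 3),
    dt.date(2018, 8, 3), dt.date(2018, 4, 27),
]})

# datetime.date objects live in an object column, so convert them first.
quarters = pd.to_datetime(dataframe_time["INSERTED_UTC"]).dt.to_period("Q")

# Count rows per quarter and order the result chronologically.
counts = quarters.value_counts().sort_index()
print(counts)
```

Quarters with no rows at all do not appear in this result.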

Convert 6 digits date format to standard one in Pandas

I'm working with a dataframe that has one messy date column with irregular formats, i.e.:
date
0 19.01.01
1 19.02.01
2 1991/01/01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Is it possible to convert it to the standard format XXXX-XX-XX, which represents year-month-day? Thank you.
date
0 2019-01-01
1 2019-02-01
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Use pd.to_datetime with yearfirst=True
Ex:
df = pd.DataFrame({"date": ['19.01.01', '19.02.01', '1991/01/01', '1996-01-01', '1996-06-30', '1995-12-31', '1997-01-01']})
df['date'] = pd.to_datetime(df['date'], yearfirst=True).dt.strftime("%Y-%m-%d")
print(df)
Output:
date
0 2019-01-01
1 2019-02-01
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
It depends on the formats; the most general solution is to specify each format and use Series.combine_first:
date1 = pd.to_datetime(df['date'], format='%y.%m.%d', errors='coerce')
date2 = pd.to_datetime(df['date'], format='%Y/%m/%d', errors='coerce')
date3 = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
df['date'] = date1.combine_first(date2).combine_first(date3)
print (df)
date
0 2019-01-01
1 2019-02-01
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
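Try the following: first replace the / and . separators with -, assigning the result back (the dot has to be inside a character class or escaped, otherwise it matches every character):
df['date'] = df['date'].replace(r'[/.]', '-', regex=True)
Then use pd.to_datetime():
pd.to_datetime(df['date'])
Output (note that without yearfirst=True the two-digit rows are read day first, so 19.01.01 becomes 2001-01-19):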
0 2001-01-19
1 2001-02-19
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Name: 0, dtype: datetime64[ns]
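One way to combine the ideas above is to normalize the separators first, which leaves only two layouts to parse explicitly. This is a rough sketch that assumes every short value is year-month-day with a two-digit year; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"date": ["19.01.01", "19.02.01", "1991/01/01", "1996-01-01",
                            "1996-06-30", "1995-12-31", "1997-01-01"]})

# Turn '.' and '/' into '-' so only two layouts remain.
normalized = df["date"].str.replace(r"[./]", "-", regex=True)

# Parse each layout explicitly; rows that do not match become NaT.
two_digit_year = pd.to_datetime(normalized, format="%y-%m-%d", errors="coerce")
four_digit_year = pd.to_datetime(normalized, format="%Y-%m-%d", errors="coerce")

# Prefer the two-digit parse and fall back to the four-digit one.
parsed = two_digit_year.fillna(four_digit_year)
df["date"] = parsed.dt.strftime("%Y-%m-%d")
print(df)
```

Spelling out the formats avoids relying on how the parser guesses the field order for ambiguous strings such as 19.01.01.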

manipulating pandas dataframe - conditional

I have a pandas dataframe that looks like this:
ID Date Event_Type
1 01/01/2019 A
1 01/01/2019 B
2 02/01/2019 A
3 02/01/2019 A
I want to be left with:
ID Date
1 01/01/2019
2 02/01/2019
3 02/01/2019
Where my condition is:
If the ID is the same AND the dates are within 2 days of each other then drop one of the rows.
If however the dates are more than 2 days apart then keep both rows.
How do I do this?
I believe you first need to convert the values to datetimes with to_datetime, then take the per-group diff and keep the first row of each group (where the diff is null, checked with isnull()) together with the rows whose gap to the previous row exceeds a timedelta threshold:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
s = df.groupby('ID')['Date'].diff()
df = df[(s.isnull() | (s > pd.Timedelta(2, 'd')))]
print (df)
ID Date Event_Type
0 1 2019-01-01 A
2 2 2019-01-02 A
3 3 2019-01-02 A
Checking the solution with other data:
print (df)
ID Date Event_Type
0 1 01/01/2019 A
1 1 04/01/2019 B <-difference 3 days
2 2 02/01/2019 A
3 3 02/01/2019 A
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
s = df.groupby('ID')['Date'].diff()
df = df[(s.isnull() | (s > pd.Timedelta(2, 'd')))]
print (df)
ID Date Event_Type
0 1 2019-01-01 A
1 1 2019-01-04 B
2 2 2019-01-02 A
3 3 2019-01-02 A
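The diff-based filter compares each row with the previous row of the same ID, so it seems to assume the rows are already in date order within each ID. A sketch that sorts first, on made-up rows that are deliberately out of order (the data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical rows, intentionally unsorted within ID 1.
df = pd.DataFrame({
    "ID": [1, 1, 2, 1, 3],
    "Date": ["06/01/2019", "01/01/2019", "02/01/2019", "02/01/2019", "02/01/2019"],
    "Event_Type": ["C", "A", "A", "B", "A"],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Sort so that diff() measures the gap to the chronologically previous row.
df = df.sort_values(["ID", "Date"])
gap = df.groupby("ID")["Date"].diff()

# Keep the first row of each ID (gap is NaT) and rows more than 2 days later.
kept = df[gap.isna() | (gap > pd.Timedelta(days=2))]
print(kept)
```

The comparison is always against the previous row, not against the last row that was kept, so a long run of closely spaced dates collapses to its first entry.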

grouping by weekly days pandas

I have a dataframe, df, containing
Index Date & Time eventName eventCount
0 2017-08-09 ABC 24
1 2017-08-09 CDE 140
2 2017-08-10 CDE 150
3 2017-08-11 DEF 200
4 2017-08-11 ABC 20
5 2017-08-16 CDE 10
6 2017-08-16 ABC 15
7 2017-08-17 CDE 10
8 2017-08-17 DEF 50
9 2017-08-18 DEF 80
...
I want to sum the eventCount for each day of the week and plot the total events per weekday (from MON to SUN). For example, summing the eventCount values of:
2017-08-09 and 2017-08-16 (Wednesdays) = 189
2017-08-10 and 2017-08-17 (Thursdays) = 210
2017-08-11 and 2017-08-18 (Fridays) = 300
I have tried
dailyOccurenceSum=df['eventCount'].groupby(lambda x: x.weekday).sum()
and I get this error: AttributeError: 'int' object has no attribute 'weekday'
Starting with df -
df
Index Date & Time eventName eventCount
0 0 2017-08-09 ABC 24
1 1 2017-08-09 CDE 140
2 2 2017-08-10 CDE 150
3 3 2017-08-11 DEF 200
4 4 2017-08-11 ABC 20
5 5 2017-08-16 CDE 10
6 6 2017-08-16 ABC 15
7 7 2017-08-17 CDE 10
8 8 2017-08-17 DEF 50
9 9 2017-08-18 DEF 80
First, convert Date & Time to a datetime column -
df['Date & Time'] = pd.to_datetime(df['Date & Time'])
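Next, call groupby + sum on the weekday name. On pandas 0.23 or later use dt.day_name(); the older dt.weekday_name attribute was removed in pandas 1.0.
df = df.groupby(df['Date & Time'].dt.day_name())['eventCount'].sum()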
df
Date & Time
Friday 300
Thursday 210
Wednesday 189
Name: eventCount, dtype: int64
If you want to sort by weekday, convert the index to categorical and call sort_index -
cat = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday', 'Sunday']
df.index = pd.Categorical(df.index, categories=cat, ordered=True)
df = df.sort_index()
df
Wednesday 189
Thursday 210
Friday 300
Name: eventCount, dtype: int64
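The error in the question arises because a function passed to groupby is called on the index labels, which are plain integers here, not on the dates. A self-contained sketch of the weekday totals, using the ten rows shown in the question and reindex (instead of a categorical index) to get Monday-to-Sunday order with zeros for missing days:

```python
import pandas as pd

df = pd.DataFrame({
    "Date & Time": ["2017-08-09", "2017-08-09", "2017-08-10", "2017-08-11",
                    "2017-08-11", "2017-08-16", "2017-08-16", "2017-08-17",
                    "2017-08-17", "2017-08-18"],
    "eventName": ["ABC", "CDE", "CDE", "DEF", "ABC", "CDE", "ABC", "CDE", "DEF", "DEF"],
    "eventCount": [24, 140, 150, 200, 20, 10, 15, 10, 50, 80],
})
df["Date & Time"] = pd.to_datetime(df["Date & Time"])

# Group on the weekday name derived from the date column.
totals = df.groupby(df["Date & Time"].dt.day_name())["eventCount"].sum()

# Put the result in calendar order; days without events get 0.
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
totals = totals.reindex(order, fill_value=0)
print(totals)
```

Calling totals.plot(kind='bar') on the result would then draw one bar per weekday, assuming matplotlib is installed.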

Python correlation matrix 3d dataframe

I have in SQL Server a historical return table by date and asset Id like this:
[Date] [Asset] [1DRet]
jan asset1 0.52
jan asset2 0.12
jan asset3 0.07
feb asset1 0.41
feb asset2 0.33
feb asset3 0.21
...
So I need to calculate the correlation matrix for a given date range for all asset combinations: A1,A2 ; A1,A3 ; A2,A3
I'm using pandas, and in the WHERE clause of my SQL SELECT I filter the date range and order the rows by date.
I have tried pandas df.corr(), numpy.corrcoef and SciPy, but I cannot get any of them to work on my n-variable dataframe.
The examples I have seen always use a dataframe with one asset per column and one row per day.
This is the code block where I do it:
qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"
result = conn.execute(qryRet)
df = pd.DataFrame(data=list(result),columns=result.keys())
df1d = df[['Date','Id_RiskFactor','1DReturn']]
corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)
conn.close()
For this I'm receiving the following message:
corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'
**Print(df1d.head())**
Date Id_RiskFactor 1DReturn
0 2010-09-02 149 0E-12
1 2010-09-02 150 -0.004242875148
2 2010-09-02 33 0.000590000011
3 2010-09-02 28 0.000099999997
4 2010-09-02 34 -0.000010000000
**print(df.head())**
Date Id_RiskFactor Value 1DReturn 5DReturn
0 2010-09-02 149 0.040096000000 0E-12 0E-12
1 2010-09-02 150 1.736700000000 -0.004242875148 -0.013014321215
2 2010-09-02 33 2.283000000000 0.000590000011 0.001260000048
3 2010-09-02 28 2.113000000000 0.000099999997 0.000469999999
4 2010-09-02 34 0.615000000000 -0.000010000000 0.000079999998
**print(corr.columns)**
Index([], dtype='object')
Create a sample DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'daily_return': np.random.random(15),
'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})
>>> df
daily_return date symbol
0 0.011467 2015-01-01 A
1 0.613518 2015-01-02 A
2 0.334343 2015-01-03 A
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
5 0.431729 2015-01-01 B
6 0.474905 2015-01-02 B
7 0.372366 2015-01-03 B
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
10 0.946504 2015-01-01 C
11 0.337204 2015-01-02 C
12 0.798704 2015-01-03 C
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.
corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2 A B C
symbol_1
A 1.000000 0.188065 -0.745115
B 0.188065 1.000000 -0.688808
C -0.745115 -0.688808 1.000000
You can select the subset of your DataFrame based on dates as follows:
start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
daily_return date symbol
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
If you want to flatten your correlation matrix:
corr.stack().reset_index()
symbol_1 symbol_2 0
0 A A 1.000000
1 A B 0.188065
2 A C -0.745115
3 B A 0.188065
4 B B 1.000000
5 B C -0.688808
6 C A -0.745115
7 C B -0.688808
8 C C 1.000000
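The empty corr.columns in the question may be a dtype issue: values printed as 0E-12 suggest the returns arrive from SQL as Decimal objects, and corr() only works on numeric columns. That is an assumption, not something the question confirms. Under that assumption, a sketch with made-up returns that casts to float and then pivots to one column per asset:

```python
from decimal import Decimal

import pandas as pd

# Hypothetical long-format data with Decimal returns, as a SQL driver might deliver.
dates = ["2010-09-02", "2010-09-03", "2010-09-06", "2010-09-07"]
returns = {
    149: ["0.001", "0.002", "0.003", "0.004"],
    150: ["0.002", "0.004", "0.006", "0.008"],
    33: ["0.004", "0.003", "0.002", "0.001"],
}
rows = [(d, asset, Decimal(v))
        for asset, values in returns.items()
        for d, v in zip(dates, values)]
df1d = pd.DataFrame(rows, columns=["Date", "Id_RiskFactor", "1DReturn"])

# Decimal values give an object column; cast to float before correlating.
df1d["1DReturn"] = df1d["1DReturn"].astype(float)

# One row per date, one column per asset, then pairwise correlations.
wide = df1d.pivot(index="Date", columns="Id_RiskFactor", values="1DReturn")
corr = wide.corr()
print(corr)
```

Because pivot is given a single values column, the result has flat column labels and no droplevel call is needed.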

Resources