How to retain feature columns after running dfs on entity? - featuretools

I tried the Featuretools example at the following URL: https://docs.featuretools.com/index.html
The customers dataframe has the following data:
In [4]: customers_df
Out[4]:
customer_id zip_code join_date date_of_birth
0 1 60091 2011-04-17 10:48:33 1994-07-18
1 2 13244 2012-04-15 23:31:04 1986-08-18
After creating a feature matrix for each customer in the data, around 73 features are created; however, the join_date and date_of_birth columns are not retained in feature_matrix_customers.
Query:
1) Is there an option to retain the join_date and date_of_birth columns in feature_matrix_customers?
2) Featuretools DFS does not extract the time from join_date and does not create any features for hours, minutes and seconds. Is there a way to get features for hours, minutes and seconds, similar to the year, month and day feature columns?

To extract other features related to the date, you need to include additional transform primitives in your call to ft.dfs:
import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
features = ft.dfs(entityset=es,
                  target_entity="customers",
                  agg_primitives=["count", "sum", "mode"],
                  trans_primitives=["day", "hour", "weekend", "month", "year"],
                  features_only=True)
I used the features_only parameter so this only returns feature definitions. The features variable now looks like this:
[<Feature: zip_code>,
<Feature: COUNT(transactions)>,
<Feature: DAY(date_of_birth)>,
<Feature: WEEKEND(join_date)>,
<Feature: COUNT(sessions)>,
<Feature: WEEKEND(date_of_birth)>,
<Feature: HOUR(date_of_birth)>,
<Feature: DAY(join_date)>,
<Feature: MODE(sessions.device)>,
<Feature: SUM(transactions.amount)>,
<Feature: YEAR(join_date)>,
<Feature: HOUR(join_date)>,
<Feature: YEAR(date_of_birth)>,
<Feature: MONTH(join_date)>,
<Feature: MONTH(date_of_birth)>,
<Feature: MODE(transactions.product_id)>,
<Feature: MODE(sessions.MODE(transactions.product_id))>,
<Feature: MODE(sessions.MONTH(session_start))>,
<Feature: MODE(sessions.DAY(session_start))>,
<Feature: MODE(sessions.YEAR(session_start))>,
<Feature: MODE(sessions.HOUR(session_start))>]
Featuretools only returns numeric and categorical features by default, so we have to manually add the datetime features like this:
features += [ft.Feature(es["customers"]["join_date"]), ft.Feature(es["customers"]["date_of_birth"])]
Now, we can calculate the features on actual data:
fm = ft.calculate_feature_matrix(entityset=es, features=features)
This returns a feature matrix with join_date and date_of_birth at the end of the dataframe:
zip_code COUNT(transactions) DAY(date_of_birth) WEEKEND(join_date) COUNT(sessions) WEEKEND(date_of_birth) HOUR(date_of_birth) DAY(join_date) MODE(sessions.device) SUM(transactions.amount) YEAR(join_date) HOUR(join_date) YEAR(date_of_birth) MONTH(join_date) MEAN(transactions.amount) MODE(transactions.product_id) MONTH(date_of_birth) MEAN(sessions.COUNT(transactions)) MODE(sessions.MODE(transactions.product_id)) MEAN(sessions.MEAN(transactions.amount)) MODE(sessions.MONTH(session_start)) MODE(sessions.DAY(session_start)) MEAN(sessions.SUM(transactions.amount)) MODE(sessions.YEAR(session_start)) MODE(sessions.HOUR(session_start)) SUM(sessions.MEAN(transactions.amount)) join_date date_of_birth
customer_id
1 60091 126 18 True 8 False 0 17 mobile 9025.62 2011 10 1994 4 71.631905 4 7 15.750000 4 72.774140 1 1 1128.202500 2014 6 582.193117 2011-04-17 10:48:33 1994-07-18
2 13244 93 18 True 7 False 0 15 desktop 7200.28 2012 23 1986 4 77.422366 4 8 13.285714 3 78.415122 1 1 1028.611429 2014 3 548.905851 2012-04-15 23:31:04 1986-08-18
3 13244 93 21 True 6 False 0 13 desktop 6236.62 2011 15 2003 8 67.060430 1 11 15.500000 1 67.539577 1 1 1039.436667 2014 5 405.237462 2011-08-13 15:42:34 2003-11-21
4 60091 109 15 False 8 False 0 8 mobile 8727.68 2011 20 2006 4 80.070459 2 8 13.625000 1 81.207189 1 1 1090.960000 2014 1 649.657515 2011-04-08 20:08:14 2006-08-15
5 60091 79 28 True 6 True 0 17 mobile 6349.66 2010 5 1984 7 80.375443 5 7 13.166667 3 78.705187 1 1 1058.276667 2014 0 472.231119 2010-07-17 05:27:50 1984-07-28
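For the second question: independent of Featuretools, the underlying extraction is what pandas' .dt accessor does; a minimal sketch on a hypothetical frame mirroring the one in the question. (Recent Featuretools versions also ship "minute" and "second" transform primitives that can be passed in trans_primitives alongside "hour", but check your installed version.)

```python
import pandas as pd

# Hypothetical customers frame mirroring the one in the question
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "join_date": pd.to_datetime(["2011-04-17 10:48:33", "2012-04-15 23:31:04"]),
})

# Extract hour, minute and second the same way year/month/day are extracted
customers["HOUR(join_date)"] = customers["join_date"].dt.hour
customers["MINUTE(join_date)"] = customers["join_date"].dt.minute
customers["SECOND(join_date)"] = customers["join_date"].dt.second
```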

Related

Calculate Percentage using Pandas DataFrame

Of all the medals won by these 5 countries across all Olympics, what is the percentage of medals won by each one of them?
I have combined all the Excel files into one using a pandas dataframe, but now I'm stuck on finding the percentage.
Country Gold Silver Bronze Total
0 USA 10 13 11 34
1 China 2 2 4 8
2 UK 1 0 1 2
3 Germany 12 16 8 36
4 Australia 2 0 0 2
0 USA 9 9 7 25
1 China 2 4 5 11
2 UK 0 1 0 1
3 Germany 11 12 6 29
4 Australia 1 0 1 2
0 USA 9 15 13 37
1 China 5 2 4 11
2 UK 1 0 0 1
3 Germany 10 13 7 30
4 Australia 2 1 0 3
Combined data sheet
Code that I have tried so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame()
for f in ['E:\\olympics\\Olympics-2002.xlsx', 'E:\\olympics\\Olympics-2006.xlsx',
          'E:\\olympics\\Olympics-2010.xlsx', 'E:\\olympics\\Olympics-2014.xlsx',
          'E:\\olympics\\Olympics-2018.xlsx']:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
df.to_excel("E:\\olympics\\combineddata.xlsx")

data = pd.read_excel("E:\\olympics\\combineddata.xlsx")
print(data)

final_Data = {}
for i in data['Country']:
    x = i
    t1 = (data[(data.Country == x)].Total).tolist()
    print("Name of Country=", i, int(sum(t1)))
    final_Data.update({i: int(sum(t1))})

t3 = data.groupby('Country').Total.sum()
t2 = df['Total'].sum()
t4 = t3 / t2 * 100
print(t3)
print(t2)
print(t4)
This is how I got the answer. Now I need to plot the result as a pie chart.
Let's assume you have created the DataFrame as df. Then you can do the following to first group by country and then calculate the percentages:
df = df.groupby('Country').sum()
df['Gold_percent'] = (df['Gold'] / df['Gold'].sum()) * 100
df['Silver_percent'] = (df['Silver'] / df['Silver'].sum()) * 100
df['Bronze_percent'] = (df['Bronze'] / df['Bronze'].sum()) * 100
df['Total_percent'] = (df['Total'] / df['Total'].sum()) * 100
df = df.round(2)
print(df)
The output will be as follows:
Gold Silver Bronze ... Silver_percent Bronze_percent Total_percent
Country ...
Australia 5 1 1 ... 1.14 1.49 3.02
China 9 8 13 ... 9.09 19.40 12.93
Germany 33 41 21 ... 46.59 31.34 40.95
UK 2 1 1 ... 1.14 1.49 1.72
USA 28 37 31 ... 42.05 46.27 41.38
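Since the asker wants a pie chart, a minimal matplotlib sketch; the per-country totals below are summed by hand from the three sample sheets in the question and are only illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Per-country medal totals summed across the three sheets shown above
totals = pd.Series({'USA': 96, 'China': 30, 'UK': 4, 'Germany': 95, 'Australia': 7})
percent = totals / totals.sum() * 100

fig, ax = plt.subplots()
ax.pie(percent, labels=percent.index, autopct='%1.1f%%')
ax.set_title('Share of total medals')
fig.savefig('medals_pie.png')
```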
I don't have the exact dataset you have, so I'm explaining with a similar one. Try adding a column with the sum of medals across each row, then find the percentage by dividing each row sum by the sum of the entire column.
I'm posting this as a model; check it:
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'ExshowroomPrice': [21000, 26000, 28000, 34000],
        'RTOPrice': [2200, 250, 2700, 3500]}
df = pd.DataFrame(cars, columns=['Brand', 'ExshowroomPrice', 'RTOPrice'])
Brand ExshowroomPrice RTOPrice
0 Honda Civic 21000 2200
1 Toyota Corolla 26000 250
2 Ford Focus 28000 2700
3 Audi A4 34000 3500
df['percentage'] = ((df.ExshowroomPrice + df.RTOPrice) * 100
                    / (df.ExshowroomPrice.sum() + df.RTOPrice.sum()))
print(df)
Brand ExshowroomPrice RTOPrice percentage
0 Honda Civic 21000 2200 19.719507
1 Toyota Corolla 26000 250 22.311942
2 Ford Focus 28000 2700 26.094348
3 Audi A4 34000 3500 31.874203
Hope it's clear.

Grouping data based on month-year in pandas and then dropping all entries except the latest one- Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by a particular month-year and then keeps only the entry with the latest date in that month-year, dropping the rest. The data runs until the year 2020.
I was only able to fetch the count by month-year. I could not put together code that groups the data by month-year and indicator and gets the correct result.
Use Series.dt.to_period for month periods, aggregate the index of the maximal date per group with DataFrameGroupBy.idxmax, and then pass the result to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('m'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90
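An alternative sketch that gives the same result: tag each row with its month period, sort by date, and keep only the last row per period with drop_duplicates (the "month" column is just a scratch label introduced here):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2000-01-30", "2000-01-31", "2000-03-30", "2000-02-27",
                            "2000-02-28", "2000-03-31", "2000-03-28"]),
    "Indicator": ["A", "A", "C", "B", "B", "C", "C"],
    "Value": [30, 40, 50, 60, 70, 90, 100],
})

# Sort chronologically, then keep the last (latest) row in each month period
out = (df.assign(month=df["Date"].dt.to_period("M"))
         .sort_values("Date")
         .drop_duplicates("month", keep="last")
         .drop(columns="month"))
```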

How to convert multi-indexed datetime index into integer?

I have a multi-indexed dataframe, the result of a groupby (by 'id' and 'date').
x y
id date
abc 3/1/1994 100 7
9/1/1994 90 8
3/1/1995 80 9
bka 5/1/1993 50 8
7/1/1993 40 9
I'd like to convert those dates into integer-like labels, such as
x y
id date
abc day 0 100 7
day 1 90 8
day 2 80 9
bka day 0 50 8
day 1 40 9
I thought it would be simple, but I couldn't get there easily. Is there a simple way to do this?
Try this:
s = 'day ' + df.groupby(level=0).cumcount().astype(str)
df1 = df.set_index([s], append=True).droplevel(1)
x y
id
abc day 0 100 7
day 1 90 8
day 2 80 9
bka day 0 50 8
day 1 40 9
You can calculate the new level and create a new index:
lvl1 = 'day ' + df.groupby('id').cumcount().astype('str')
df.index = pd.MultiIndex.from_tuples((x,y) for x,y in zip(df.index.get_level_values('id'), lvl1) )
output:
x y
abc day 0 100 7
day 1 90 8
day 2 80 9
bka day 0 50 8
day 1 40 9
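Both answers hinge on cumcount numbering the rows within each id; a self-contained sketch of the first approach, assuming index levels named 'id' and 'date' as in the question:

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [100, 90, 80, 50, 40], "y": [7, 8, 9, 8, 9]},
    index=pd.MultiIndex.from_tuples(
        [("abc", "3/1/1994"), ("abc", "9/1/1994"), ("abc", "3/1/1995"),
         ("bka", "5/1/1993"), ("bka", "7/1/1993")],
        names=["id", "date"]),
)

# Number rows within each id (0, 1, 2, ...) and label them "day N"
day = "day " + df.groupby(level=0).cumcount().astype(str)

# Append the new level, then drop the original 'date' level
df1 = df.set_index([day], append=True).droplevel("date")
```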

Search value in Next Month Record Pandas

Given that I have a df like this:
ID Date Amount
0 a 2014-06-13 12:03:56 13
1 b 2014-06-15 08:11:10 14
2 a 2014-07-02 13:00:01 15
3 b 2014-07-19 16:18:41 22
4 b 2014-08-06 09:39:14 17
5 c 2014-08-22 11:20:56 55
...
129 a 2016-11-06 09:39:14 12
130 c 2016-11-22 11:20:56 35
131 b 2016-11-27 09:39:14 42
132 a 2016-12-11 11:20:56 18
I need to create a column df['Checking'] to show whether the ID appears in the next month or not. I tried the code below:
df['Checking'] = df.apply(lambda x: check_nextmonth(x.Date, x.ID), axis=1)
where
def check_nextmonth(date, id):
    x = id in df['ID'][df['Date'].dt.to_period('M') == (date + relativedelta(months=1)).to_period('M')].values
    return x
but it takes too long to process a single row.
How can I improve this code, or is there another way to achieve what I want?
Using pd.to_datetime with some timestamp tricks:
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
df['tmp'] = (df['Date'] - pd.DateOffset(months=1)).dt.month
s = df.groupby('ID').apply(lambda x:x['Date'].dt.month.isin(x['tmp']))
df['Checking'] = s.reset_index(level=0)['Date']
Output:
ID Date Amount tmp Checking
0 a 2014-06-13 12:03:56 13 5 True
1 b 2014-06-15 08:11:10 14 5 True
2 a 2014-07-02 13:00:01 15 6 False
3 b 2014-07-19 16:18:41 22 6 True
4 b 2014-08-06 09:39:14 17 7 False
5 c 2014-08-22 11:20:56 55 7 False
Here's one method: check whether each group's next month equals the current month + 1, then assign the result back after sorting by ID.
check = df.groupby('ID').apply(lambda x : x['Date'].dt.month.shift(-1) == x['Date'].dt.month+1).stack().values
df = df.sort_values('ID').assign( checking = check).sort_index()
ID Date Amount checking
0 a 2014-06-13 12:03:56 13 True
1 b 2014-06-15 08:11:10 14 True
2 a 2014-07-02 13:00:01 15 False
3 b 2014-07-19 16:18:41 22 True
4 b 2014-08-06 09:39:14 17 False
5 c 2014-08-22 11:20:56 55 False
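Note that both answers compare raw month numbers, so a December record followed by a January record one year later will not match. A vectorized sketch using month periods (which roll over year boundaries correctly), assuming the goal is "does the same ID have any record in the calendar month immediately after this row's":

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["a", "b", "a", "b", "b", "c"],
    "Date": pd.to_datetime(["2014-06-13 12:03:56", "2014-06-15 08:11:10",
                            "2014-07-02 13:00:01", "2014-07-19 16:18:41",
                            "2014-08-06 09:39:14", "2014-08-22 11:20:56"]),
    "Amount": [13, 14, 15, 22, 17, 55],
})

month = df["Date"].dt.to_period("M")
# Set of (ID, month) pairs that actually occur in the data
seen = set(zip(df["ID"], month))
# A row checks out if the same ID occurs in the following month period
df["Checking"] = [(i, m + 1) in seen for i, m in zip(df["ID"], month)]
```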

Python correlation matrix 3d dataframe

I have in SQL Server a historical return table by date and asset ID, like this:
[Date] [Asset] [1DRet]
jan asset1 0.52
jan asset2 0.12
jan asset3 0.07
feb asset1 0.41
feb asset2 0.33
feb asset3 0.21
...
So I need to calculate the correlation matrix for a given date range for all asset combinations: A1,A2; A1,A3; A2,A3.
I'm using pandas, and in my SQL SELECT WHERE clause I'm filtering the date range and ordering by date.
I'm trying to do it using pandas df.corr(), numpy.corrcoef and SciPy, but I'm not able to do it for my n-variable dataframe.
The examples I see are always for a dataframe with one asset per column and one row per day.
This is the code block where I'm doing it:
qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"
result = conn.execute(qryRet)
df = pd.DataFrame(data=list(result),columns=result.keys())
df1d = df[['Date','Id_RiskFactor','1DReturn']]
corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)
conn.close()
I'm receiving this message:
corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'
print(df1d.head()):
Date Id_RiskFactor 1DReturn
0 2010-09-02 149 0E-12
1 2010-09-02 150 -0.004242875148
2 2010-09-02 33 0.000590000011
3 2010-09-02 28 0.000099999997
4 2010-09-02 34 -0.000010000000
print(df.head()):
Date Id_RiskFactor Value 1DReturn 5DReturn
0 2010-09-02 149 0.040096000000 0E-12 0E-12
1 2010-09-02 150 1.736700000000 -0.004242875148 -0.013014321215
2 2010-09-02 33 2.283000000000 0.000590000011 0.001260000048
3 2010-09-02 28 2.113000000000 0.000099999997 0.000469999999
4 2010-09-02 34 0.615000000000 -0.000010000000 0.000079999998
print(corr.columns):
Index([], dtype='object')
Create a sample DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'daily_return': np.random.random(15),
                   'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
                   'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})
>>> df
daily_return date symbol
0 0.011467 2015-01-01 A
1 0.613518 2015-01-02 A
2 0.334343 2015-01-03 A
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
5 0.431729 2015-01-01 B
6 0.474905 2015-01-02 B
7 0.372366 2015-01-03 B
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
10 0.946504 2015-01-01 C
11 0.337204 2015-01-02 C
12 0.798704 2015-01-03 C
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.
corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2 A B C
symbol_1
A 1.000000 0.188065 -0.745115
B 0.188065 1.000000 -0.688808
C -0.745115 -0.688808 1.000000
You can select the subset of your DataFrame based on dates as follows:
start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
daily_return date symbol
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
If you want to flatten your correlation matrix:
corr.stack().reset_index()
symbol_1 symbol_2 0
0 A A 1.000000
1 A B 0.188065
2 A C -0.745115
3 B A 0.188065
4 B B 1.000000
5 B C -0.688808
6 C A -0.745115
7 C B -0.688808
8 C C 1.000000
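If you only want each unique pair once when flattening (dropping self-correlations and the mirrored duplicates), one sketch masks everything at or below the diagonal before stacking; the matrix values here are copied from the example output above for illustration:

```python
import numpy as np
import pandas as pd

# Correlation matrix like the one above (values copied for illustration)
corr = pd.DataFrame(
    [[1.000000, 0.188065, -0.745115],
     [0.188065, 1.000000, -0.688808],
     [-0.745115, -0.688808, 1.000000]],
    index=list("ABC"), columns=list("ABC"))
corr.index.name, corr.columns.name = "symbol_1", "symbol_2"

# Keep only entries strictly above the diagonal, then stack the survivors
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().reset_index(name="corr")
```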
