featuretools: manual derivation of the features generated by dfs?

Code example:
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
# Normalized one more time
es = es.normalize_entity(
    new_entity_id="device",
    base_entity_id="sessions",
    index="device",
)
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="customers",
    agg_primitives=["std"],
    groupby_trans_primitives=["cum_count"],
    max_depth=2,
)
I'd like to look deeper into the STD(sessions.CUM_COUNT(device) by customer_id) feature.
I tried to generate this feature manually, but I got a different result:
import pandas as pd

df = ft.demo.load_mock_customer(return_single_table=True)
a = df.groupby("customer_id")['device'].cumcount()
a.name = "cumcount_device"
a = pd.concat([df, a], axis=1)
b = a.groupby("customer_id")['cumcount_device'].std()
>>> b
customer_id
1 36.517
2 26.991
3 26.991
4 31.610
5 22.949
Name: cumcount_device, dtype: float64
What am I missing?

Thanks for the question. The calculation needs to be based on the data frame from the sessions entity rather than the single merged table:
df = es['sessions'].df
cumcount = df['device'].groupby(df['customer_id']).cumcount()
std = cumcount.groupby(df['customer_id']).std()
std.round(3).loc[feature_matrix.index]
customer_id
5 1.871
4 2.449
1 2.449
3 1.871
2 2.160
dtype: float64
You should get the same output as in DFS.
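If you want to verify this programmatically, you can compare the manual result with the column DFS produced (a minimal sketch; it assumes the feature column is named exactly as shown in the question):
# Sketch: compare the manual calculation with the DFS output.
# The column name below is taken from the question and is assumed to match
# what ft.dfs generated in your feature_matrix.
feature_name = "STD(sessions.CUM_COUNT(device) by customer_id)"
manual = std.round(3).loc[feature_matrix.index]
print((manual == feature_matrix[feature_name].round(3)).all())  # expect True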

Related

Aggregating using custom function and several columns in pandas

Suppose I have the following data frame:
group num value
a 3 20
a 5 5
b 5 10
b 10 5
b 2 25
Now, for each group, I want to compute the weighted average of value weighted by num. Using tidyverse packages in R, this is straightforward:
> library(tidyverse)
> df <- tribble(
~group , ~num , ~value,
"a" , 3 , 20,
"a" , 5 , 5,
"b" , 5 , 10,
"b" , 10 , 5,
"b" , 2 , 25
)
> df %>%
group_by(group) %>%
summarise(new_value = sum(num * value) / sum(num))
# A tibble: 2 x 2
group new_value
<chr> <dbl>
1 a 10.6
2 b 8.82
Using pandas in Python, I can do the intermediate computation beforehand, then use sum() to sum up the variables, and finally perform the division using transform(), like this:
import pandas as pd
from io import StringIO
data = StringIO(
"""
group,num,value
a,3,20
a,5,5
b,5,10
b,10,5
b,2,25
""")
df = pd.read_csv(data)
df["tmp_value"] = df["num"] * df["value"]
df = df.groupby(["group"]) \
       [["num", "tmp_value"]] \
       .sum() \
       .transform(lambda x: x["tmp_value"] / x["num"], axis="columns")
print(df)
# group
# a 10.625000
# b 8.823529
# dtype: float64
Note that we explicitly need to first subset the columns of interest ([["num", "tmp_value"]]), compute the sum (sum()), and then do the division using transform(). In R, this is written in one simple step that is much more compact and readable, IMHO.
Now, how can I achieve that elegance with pandas? In other words, can it be cleaner, more elegant, and above all easier to read, as in R?
@an_drade - There is a very similar Stack Overflow question that provides the solution:
Pandas DataFrame aggregate function using multiple columns
Based on that post, the solution to your question is to create a Python function:
df = pd.DataFrame([['a',3,20],['a',5,5],['b',5,10],['b',10,5],['b',2,25]],
                  columns=['group','num','value'])

def wavg(group):
    d = group['num']
    w = group['value']
    return (d * w).sum() / d.sum()

final = df.groupby("group").apply(wavg)
group
a 10.625000
b 8.823529
dtype: float64
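If you want something closer to a one-liner, the same logic can be written inline with a lambda (a sketch of the same idea as wavg, not a different API):
# Same weighted average, written as a single expression
final = df.groupby("group").apply(lambda g: (g["num"] * g["value"]).sum() / g["num"].sum())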
This is the "R way" you wanted:
>>> from datar import f
>>> from datar.tibble import tribble
>>> from datar.dplyr import group_by, summarise
>>> from datar.base import sum
>>> # or if you are lazy:
>>> # from datar.all import *
>>>
>>> df = tribble(
... f.group , f.num , f.value,
... "a" , 3 , 20,
... "a" , 5 , 5,
... "b" , 5 , 10,
... "b" , 10 , 5,
... "b" , 2 , 25
... )
>>> df >> \
... group_by(f.group) >> \
... summarise(new_value = sum(f.num * f.value) / sum(f.num))
group new_value
<object> <float64>
0 a 10.625000
1 b 8.823529
I am the author of the datar package. Please feel free to submit issues if you have any questions about using it.

Python Passing Dynamic Table Name in For Loop

table_name = []
counter = 0
for year in ['2017', '2018', '2019']:
    table_name.append(f'temp_df_{year}')
    print(table_name[counter])
    table_name[counter] = pd.merge(table1, table2.loc[table2.loc[:, 'year'] == year, :],
                                   left_on='col1', right_on='col1', how='left')
    counter += 1
The print statement outputs are correct:
temp_df_2017
temp_df_2018
temp_df_2019
However, when I try to see what's in temp_df_2017, I get an error: name 'temp_df_2017' is not defined
I would like to create those three tables. How can I make this work?
PS: the ['2017', '2018', '2019'] list will vary. It could be a list of quarters. That's why I want to do this in a loop instead of writing the merge statement three times.
I think the easiest/most practical approach would be to create a dictionary to store the names and dataframes.
import pandas as pd
import numpy as np
# Create dummy data
data = np.arange(9).reshape(3,3)
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
df
Out:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
df_year_names = ['2017', '2018', '2019']
dict_of_dfs = {}
for year in df_year_names:
    df_name = f'some_name_year_{year}'
    dict_of_dfs[df_name] = df
dict_of_dfs.keys()
Out:
dict_keys(['some_name_year_2017', 'some_name_year_2018', 'some_name_year_2019'])
Then to access a particular year:
dict_of_dfs['some_name_year_2018']
Out:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
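Applied to the loop from the question, the same pattern might look like this (a sketch; table1 and table2 are the dataframes from your merge and are assumed to already exist):
# Hypothetical rewrite of the original loop, storing each merge result under its name
dict_of_dfs = {}
for year in ['2017', '2018', '2019']:
    dict_of_dfs[f'temp_df_{year}'] = pd.merge(
        table1,
        table2.loc[table2.loc[:, 'year'] == year, :],
        left_on='col1', right_on='col1', how='left'
    )

# Later, access a single result by name
dict_of_dfs['temp_df_2017']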

How to merge two dataframes and return data from another column in new column only if there is match?

I have two dataframes that look like this:
df1:
id
1
2
df2:
id value
2 a
3 b
How do I merge these two dataframes and return the data from the value column in a new column only if there is a match?
new_merged_df
id value new_value
1
2 a a
3 b
You can try this using @JJFord3's setup:
import pandas
df1 = pandas.DataFrame(index=[1,2])
df2 = pandas.DataFrame({'value' : ['a','b']},index=[2,3])
#Use isin to create new_value
df2['new_value'] = df2['value'].where(df2.index.isin(df1.index))
#Use reindex with union to rebuild dataframe with both indexes
df2.reindex(df1.index.union(df2.index))
Output:
value new_value
1 NaN NaN
2 a a
3 b NaN
import pandas
df1 = pandas.DataFrame(index=[1,2])
df2 = pandas.DataFrame({'value' : ['a','b']}, index=[2,3])
new_merged_df_outer = df1.merge(df2, how='outer', left_index=True, right_index=True)
new_merged_df_inner = df1.merge(df2, how='inner', left_index=True, right_index=True)
new_merged_df_inner = new_merged_df_inner.rename(columns={'value': 'new_value'})
new_merged_df = new_merged_df_outer.merge(new_merged_df_inner, how='left', left_index=True, right_index=True)
First, create an outer merge to keep all indexes.
Then create an inner merge to only get the overlap.
Then merge the inner merge back to the outer merge to get the desired column setup.
You can use a full outer join (this answer uses Spark's Scala Dataset API).
Let's model your data with case classes:
case class MyClass1(id: String)
case class MyClass2(id: String, value: String)
// this one for the result type
case class MyClass3(id: String, value: Option[String] = None, value2: Option[String] = None)
Creating some inputs:
val input1: Dataset[MyClass1] = ...
val input2: Dataset[MyClass2] = ...
Joining your data:
import spark.implicits._  // implicits from your SparkSession instance

val joined = input1.as("1").joinWith(input2.as("2"), $"1.id" === $"2.id", "full_outer")

joined map {
  case (left, null) if left != null => MyClass3(left.id)
  case (null, right) if right != null => MyClass3(right.id, Some(right.value))
  case (left, right) => MyClass3(left.id, Some(right.value), Some(right.value))
}
DataFrame.merge has an indicator parameter which:
If True, adds a column to output DataFrame called "_merge" with information on the source of each row.
This can be used to check whether there is a match:
import pandas as pd
df1 = pd.DataFrame(index=[1,2])
df2 = pd.DataFrame({'value' : ['a','b']},index=[2,3])
# creates a new column `_merge` with values `right_only`, `left_only` or `both`
merged = df1.merge(df2, how='outer', right_index=True, left_index=True, indicator=True)
merged['new_value'] = merged.loc[(merged['_merge'] == 'both'), 'value']
merged = merged.drop('_merge', axis=1)
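With the sample data above, the result looks like this (exact display formatting may vary with your pandas version):
print(merged)
  value new_value
1   NaN       NaN
2     a         a
3     b       NaN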
Use merge and isin:
df = df1.merge(df2,on='id',how='outer')
id_value = df2.loc[df2['id'].isin(df1.id.tolist()),'id'].unique()
mask = df['id'].isin(id_value)
df.loc[mask,'new_value'] = df.loc[mask,'value']
# alternative df['new_value'] = np.where(mask, df['value'], np.nan)
print(df)
id value new_value
0 1 NaN NaN
1 2 a a
2 3 b NaN

ARMA model order selection using arma_order_select_ic from statsmodel

I am using arma_order_select_ic from the statsmodels library to calculate the (p, q) order for an ARMA model, and I am using a for loop to iterate over the different companies, which are in the columns of the dataframe. The code is as follows:
import pandas as pd
from statsmodels.tsa.stattools import arma_order_select_ic
df = pd.read_csv("Adjusted_Log_Returns.csv", index_col = 'Date').dropna()
main_df = pd.DataFrame()
for i in range(146):
    order_selection = arma_order_select_ic(df.iloc[i].values, max_ar=4,
                                           max_ma=2, ic="aic")
    ticker = [df.columns[i]]
    df_aic_min = pd.DataFrame([order_selection["aic_min_order"]], index=ticker)
    main_df = main_df.append(df_aic_min)

main_df.to_csv("aic_min_orders.csv")
The code runs fine and I get all the results in the CSV file at the end, but what confuses me is that when I compute the (p, q) order outside the for loop for a single company, I get different results:
order_selection = arma_order_select_ic(df["ABL"].values, max_ar = 4,
max_ma = 2, ic = "aic")
The order for the company ABL is (1, 1) when computed in the for loop, while it's (4, 1) when computed outside of it.
So my question is: what am I doing wrong, and why does this happen? Any help would be appreciated.
Thanks in advance.
It's clear from your code that you're trying to find the parameters of an ARMA model for each column's data, but that's not what the code is doing: inside the loop you're finding the parameters for the rows.
Consider this:
import pandas as pd
df = pd.DataFrame({'a': [3, 4]})
>>> df.iloc[0]
a 3
Name: 0, dtype: int64
>>> df['a']
0 3
1 4
Name: a, dtype: int64
You should probably change your code to
for c in df.columns:
    order_selection = arma_order_select_ic(df[c].values, max_ar=4,
                                           max_ma=2, ic="aic")
    ticker = [c]
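A complete version of the corrected loop might look like the sketch below; it keeps the file names from the question and collects the per-ticker results with pd.concat:
# Sketch: loop over columns (one per company) instead of rows
results = []
for c in df.columns:
    order_selection = arma_order_select_ic(df[c].values, max_ar=4,
                                           max_ma=2, ic="aic")
    results.append(pd.DataFrame([order_selection["aic_min_order"]], index=[c]))

main_df = pd.concat(results)
main_df.to_csv("aic_min_orders.csv")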

Convert groupBYKey to ReduceByKey Pyspark

How can I convert groupByKey to reduceByKey in PySpark? I have attached a snippet. It applies a correlation for each region/dept combination. I used groupByKey, but it's very slow and hits shuffle errors (I have 10-20 GB of data and each group will have 2-3 GB). Please help me rewrite this using reduceByKey.
Data set
region dept week val1 val2
US CS 1 1 2
US CS 2 1.5 2
US CS 3 1 2
US ELE 1 1.1 2
US ELE 2 2.1 2
US ELE 3 1 2
UE CS 1 2 2
output
region dept corr
US CS 0.5
US ELE 0.6
UE CS .3333
Code
def testFunction(key, value):
    # Build a pandas dataframe from the grouped rows, then compute the correlation
    inputpdDF = []
    for val in value:
        keysValue = val.asDict().keys()
        inputpdDF.append(dict([(keyRDD, val[keyRDD]) for keyRDD in keysValue]))
    pdDF = pd.DataFrame(inputpdDF, columns=keysValue)
    corr = pearsonr(pdDF['val1'].astype(float), pdDF['val2'].astype(float))[0]
    corrDict = {"region": key.region, "dept": key.dept, "corr": corr}
    return [Row(**corrDict)]

resRDD = df.select(["region", "dept", "week", "val1", "val2"])\
    .map(lambda r: (Row(region=r.region, dept=r.dept), r))\
    .groupByKey()\
    .flatMap(lambda KeyValue: testFunction(KeyValue[0], list(KeyValue[1])))
Try:
>>> from pyspark.sql.functions import corr
>>> df.groupBy("region", "dept").agg(corr("val1", "val2"))
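If you want the output shaped like your expected table, you can alias the aggregated column (a small usage sketch; the alias name is just a choice):
from pyspark.sql.functions import corr

result = df.groupBy("region", "dept").agg(corr("val1", "val2").alias("corr"))
result.show()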
