How to create interesting values using value combinations from multiple features/columns - featuretools

I am fairly new to featuretools, and trying to understand if and how one can add interesting values to an entity set generated using multiple features.
For example, I have an entity set with two entities: customers and transactions. Transactions can be debit or credit (d_c) and can occur across different spending categories (tran_category): restaurants, clothing, groceries, etc.
Thus far, I am able to create interesting values for either of these features but not from a combination of them:
import featuretools as ft
x = ft.EntitySet()
x.entity_from_dataframe(entity_id = 'customers', dataframe = customer_ids, index = 'cust_id')
x.entity_from_dataframe(entity_id = 'transactions', dataframe = transactions, index = 'tran_id', time_index = 'transaction_date')
x_rel = ft.Relationship(x['customers']['cust_id'], x['transactions']['cust_id'])
x.add_relationship(x_rel)
x['transactions']['d_c'].interesting_values = ['D', 'C']
x['transactions']['tran_category'].interesting_values = ['restaurants', 'clothing', 'groceries']
How can I add an interesting value that combines values from d_c AND tran_category (i.e. restaurant debits, grocery credits, clothing debits, etc.)? The goal is to then use these interesting values to aggregate across transaction amounts, time between transactions, etc., using where_primitives:
feature_matrix, feature_defs = ft.dfs(entityset = x, target_entity = 'customers', agg_primitives = list_of_agg_primitives, where_primitives = list_of_where_primitives, trans_primitives = list_of_trans_primitives, max_depth = 3)

Currently, featuretools has no built-in way to do that.
One approach would be to create a new column, d_c__tran_category, that holds every combination of d_c and tran_category, and then add interesting values to that column:
x['transactions']['d_c__tran_category'].interesting_values = ['D_restaurants', 'C_restaurants', 'D_clothing', 'C_clothing','D_groceries', 'C_groceries']
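The combined column itself can be built in pandas before the dataframe is loaded into the entity set. The following is only a minimal sketch: the sample rows and the "_" separator are assumptions made for illustration, and only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample rows; the real `transactions` dataframe comes from the question.
transactions = pd.DataFrame({
    "tran_id": [1, 2, 3, 4],
    "cust_id": [10, 10, 11, 11],
    "d_c": ["D", "C", "D", "D"],
    "tran_category": ["restaurants", "groceries", "clothing", "restaurants"],
})

# Join the two columns with "_" so each row carries a single combined label.
transactions["d_c__tran_category"] = (
    transactions["d_c"] + "_" + transactions["tran_category"]
)

# Combinations that actually occur in the data, usable as interesting values.
combined_values = sorted(transactions["d_c__tran_category"].unique())
print(combined_values)
```

Listing only the combinations that occur in the data keeps the set of interesting values small; the full cross product shown above works as well.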

Related

Merge every x element in multiple lists to return new list

I'm writing a script that scrapes all of the data from my work's ticketing site. The end goal is to have it send a text when a new ticket enters the bucket, with all of the important info about the ticket.
Python 3.10
So far, it pulls from a scattered list and combines the elements into appropriate groups, i.e. ticket numbers, titles and priorities.
tn = rawTickets[0::14]
title = rawTickets[5::14]
priority = rawTickets[9::14]
With this I can say
num = x
wholeticket = tn[num], title[num], priority[num],
print(wholeticket)
and get ticket x from the list:
# Results: "tn0, title0, priority0"
I want it to print all of the available tickets in the list based on a range
totaltickets = 0
for lines in rawTickets:
    if lines == '':
        totaltickets += 1
numrange = range(totaltickets)
So let's say there are only 3 tickets in the queue.
I want it to print:
tn0, title0, priority0,
tn1, title1, priority1,
tn2, title2, priority2,
But I want to avoid doing this:
ticket1 = tn[0], title[0], priority[0],
ticket2 = tn[1], title[1], priority[1],
ticket3 = tn[2], title[2], priority[2],
You could use zip:
tickets = list(zip(rawTickets[0::14], rawTickets[5::14], rawTickets[9::14]))
This will give you a list of 3-tuples.
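To print every ticket without numbering them by hand, the zipped tuples can be looped over directly. This is a rough sketch: the `make_raw_ticket` helper and its sample data are hypothetical, built only to mimic the 14-fields-per-ticket layout described in the question.

```python
FIELDS_PER_TICKET = 14

def make_raw_ticket(n):
    """Hypothetical helper: one ticket's 14 flat fields, mostly empty."""
    fields = [""] * FIELDS_PER_TICKET
    fields[0] = f"tn{n}"        # ticket number at offset 0
    fields[5] = f"title{n}"     # title at offset 5
    fields[9] = f"priority{n}"  # priority at offset 9
    return fields

# Flat list of three tickets, standing in for the scraped rawTickets.
rawTickets = [field for n in range(3) for field in make_raw_ticket(n)]

# zip pairs up the n-th element of each slice into one tuple per ticket.
tickets = list(zip(rawTickets[0::14], rawTickets[5::14], rawTickets[9::14]))

lines = [", ".join(ticket) for ticket in tickets]
for line in lines:
    print(line)
```

Because zip stops at the shortest slice, no separate ticket count or range is needed.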
You could do something like this:
l1 = [*range(0,5)]
l2 = [*range(5,10)]
l3 = [*range(10,15)]
all_lst = [(l1[i], l2[i], l3[i]) for i in range(len(l1))]
Or you could use zip, as trincot suggested.
Note that for large lists, zip is typically faster than indexing into each list inside a comprehension.

How can I subset filtering a row per categories with a for loop

How can I create these subsets with a for loop?
I tried to rewrite the lines below but couldn't. I think it could be done with a groupby and a dictionary, but I couldn't get it to work.
df_belgium = df_sales[df_sales["Country"]=="Belgium"]
df_norway = df_sales[df_sales["Country"]=="Norway"]
df_portugal = df_sales[df_sales["Country"]=="portugal"]
The most straightforward way would be to loop through ["Belgium","Norway","portugal"]. However, creating objects with variable variable names like df_{country_name} is highly discouraged (see here), so I would recommend storing the subset dataframes in a dictionary keyed by country name.
You can use a dict comprehension:
df_sales_by_country = {country_name: df_sales[df_sales["Country"]==country_name] for country_name in ["Belgium","Norway","portugal"]}
Ideally, use groupby and store the sub-DataFrames in a dictionary:
d = dict(df.groupby('Country'))
Then access d['Belgium'] for example.
If you need to filter a subset of the countries:
# use a set for efficiency
keep = {'Belgium', 'Norway', 'Portugal'}
d = {key: g for key, g in df.groupby('Country') if key in keep}
or:
keep = ['Belgium', 'Norway', 'Portugal']
d = dict(df[df['Country'].isin(keep)].groupby('Country'))
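The groupby approach can be checked end to end on a small frame. The sales figures below are made up for illustration; only the `Country` column name and the dictionary-of-subsets idea come from the thread.

```python
import pandas as pd

# Hypothetical sales data standing in for df_sales from the question.
df_sales = pd.DataFrame({
    "Country": ["Belgium", "Norway", "Portugal", "Belgium", "Spain"],
    "Amount": [100, 250, 75, 40, 310],
})

keep = {"Belgium", "Norway", "Portugal"}

# groupby yields (key, sub-DataFrame) pairs; keep only the wanted countries.
by_country = {
    country: group
    for country, group in df_sales.groupby("Country")
    if country in keep
}

print(sorted(by_country))
print(by_country["Belgium"])
```

Each value is an ordinary DataFrame, so `by_country["Belgium"]` can be used wherever `df_belgium` was used before.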

Identifying duplicate items in a list

I want to figure out how to identify any case of identical items in a list.
Currently, there is a list of people and I want to first identify their surnames and put their surnames in a separate list called list_surnames.
Then I want to loop through that list, check whether any people share a surname, and if so add that to the amount value.
This code currently does not identify cases of duplication in that list.
I should say that I am brand new to programming, so I apologize if the code is horrible.
group = ["Jonas Hansen", "Bo Klaus Nilsen", "Ida Kari Lund Toftegaard", "Ole Hansen"]
amount = 0
list_surnames = []
for names in group:
    new_list = names.split(" ")
    extract_surname = new_list[-1:]
    for i in extract_surname:
        list_surnames.append(i)
for x in list_surnames:
    if x == list_surnames:
        amount += 1
print(list_surnames)
print(amount)
You can use collections.Counter to count how often each surname occurs:
from collections import Counter
l = ["Jonas Hansen", "Bo Klaus Nilsen", "Ida Kari Lund Toftegaard", "Ole Hansen"]
last = [names.split()[-1] for names in l]
print(last)
c = Counter(last)
print(c)
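From the Counter it is a short step to the duplicates themselves. The sketch below assumes that `amount` should be the number of people who share a surname with someone else; the question does not pin that down, so adjust the last step if a different count is wanted.

```python
from collections import Counter

group = ["Jonas Hansen", "Bo Klaus Nilsen", "Ida Kari Lund Toftegaard", "Ole Hansen"]

# Count each surname (the last whitespace-separated word of the name).
surname_counts = Counter(name.split()[-1] for name in group)

# Keep only the surnames that occur more than once.
duplicates = {surname: n for surname, n in surname_counts.items() if n > 1}

# Assumed meaning of "amount": how many people share a surname.
amount = sum(duplicates.values())

print(duplicates)
print(amount)
```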

Get feature names for dataframe.corr

I am using the cancer data set from sklearn and I need to find the correlations between features. I am able to find the correlated columns, but I am not able to present them in a "nice" way, so that they can be used as input for DataFrame.drop.
Here is my code:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
corr = df.corr()
# keep only the strict upper triangle, then filter for correlations above 0.6
corr_triu = corr.where(~np.tril(np.ones(corr.shape)).astype(bool))
corr_triu = corr_triu.stack()
corr_result = corr_triu[corr_triu > 0.6]
print(corr_result)
df.drop(columns=[?])
IIUC, you want to keep the columns that correlate with some other column in the dataset, i.e. drop the columns that don't appear in corr_result. To do that, collect the unique variable names from each level of corr_result's index. A variable can appear in both levels, so use a set to remove the repeats:
corr_result.index = corr_result.index.remove_unused_levels()
corr_vars = set()
corr_vars.update(corr_result.index.unique(level=0))
corr_vars.update(corr_result.index.unique(level=1))
all_vars = set(df.columns)
df.drop(columns=all_vars - corr_vars)
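The same idea can be tried on a tiny frame without sklearn. This is a sketch under assumptions: the three toy columns are invented, and it reads the names with `get_level_values` rather than `unique(level=...)`, which sidesteps the unused-levels step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" tracks "a" closely, "c" is unrelated to both.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.1, 2.0, 2.9, 4.2, 5.1, 5.8],
    "c": [1.0, -1.0, -1.0, 1.0, 1.0, -1.0],
})

corr = df.corr()

# Boolean mask for the strict upper triangle, so each pair is seen once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
corr_result = corr.where(upper).stack()
corr_result = corr_result[corr_result > 0.6]

# Names appearing on either side of a highly correlated pair.
corr_vars = set(corr_result.index.get_level_values(0)) | set(
    corr_result.index.get_level_values(1)
)

# Drop every column that is not part of any such pair.
to_drop = [col for col in df.columns if col not in corr_vars]
reduced = df.drop(columns=to_drop)
print(to_drop)
print(list(reduced.columns))
```

Building `to_drop` as a list keeps the original column order, which a set difference would not guarantee.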

How do I build a string of variable names?

I'm trying to build a string that contains all attributes of a class-object. The object name is jsonData and it has a few attributes, some of them being
jsonData.Serial,
jsonData.InstrumentSerial,
jsonData.Country
I'd like to build a string that has those attribute names in the format of this:
'Serial InstrumentSerial Country'
End goal is to define a schema for a Spark dataframe.
I'm open to alternatives, as long as I know order of the string/object because I need to map the schema to appropriate values.
You'll have to be careful about filtering out unwanted attributes, but try this:
' '.join([x for x in dir(jsonData) if '__' not in x])
That filters out all the "magic methods" like __init__ or __new__.
To include those, do
' '.join(dir(jsonData))
These take advantage of Python's built-in dir function, which returns an alphabetically sorted list of the names of an object's attributes.
I don't quite understand why you want to join the attribute names into a single string.
You could simply keep a list of attribute names, since a Python list preserves its order.
attribute_names = [x for x in dir(jsonData) if '__' not in x]
From there you can create your dataframe. If you don't need to specify the Spark types, you can just do:
df = spark.createDataFrame(data, schema = attribute_names)  # spark is an existing SparkSession
You could also create a StructType and specify the types in your schema.
I assume you have a list of jsonData records that you want to treat as Rows.
Let's treat it as a list of objects; the logic stays the same.
You can do that as follows:
my_object_list = [
    jsonDataClass(Serial = 1, InstrumentSerial = 'TDD', Country = 'France'),
    jsonDataClass(Serial = 2, InstrumentSerial = 'TDI', Country = 'Suisse'),
    jsonDataClass(Serial = 3, InstrumentSerial = 'TDD', Country = 'Grece')]
def build_record(obj, attr_names):
    from operator import attrgetter
    return attrgetter(*attr_names)(obj)
So the data argument referred to previously would be constructed as:
data = [build_record(x, attribute_names) for x in my_object_list]
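If the class can be written as a dataclass, the attribute names can be read in definition order instead of the alphabetical order that dir returns. The `JsonData` class below is a hypothetical stand-in for the question's object, so treat this as a sketch of the idea rather than the original setup.

```python
from dataclasses import dataclass, fields
from operator import attrgetter

# Hypothetical stand-in for the jsonData class from the question.
@dataclass
class JsonData:
    Serial: int
    InstrumentSerial: str
    Country: str

records = [
    JsonData(1, "TDD", "France"),
    JsonData(2, "TDI", "Suisse"),
]

# fields() returns the dataclass fields in the order they were defined.
attribute_names = [f.name for f in fields(JsonData)]
schema_string = " ".join(attribute_names)

# attrgetter with several names returns a tuple of values in the same order.
get_values = attrgetter(*attribute_names)
data = [get_values(record) for record in records]

print(schema_string)
print(data)
```

Since the names and the value tuples come from the same list, the schema and the rows stay aligned.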

Resources