How to select features from feature_def created through deep feature synthesis - featuretools

I am using deep feature synthesis to create new features. How can I select features from feature_defs? For example, I need to select all the features with the string "Age" in their names.
I tried the following code, which gave me the error "argument of type 'IdentityFeature' is not iterable":
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='titanic', max_depth=2)
features = []
for s in feature_defs:
    if 'Age' in s:
        features.append(s)

You need to use the .get_name() method on the feature definition. For example:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='titanic', max_depth=2)
features = []
for s in feature_defs:
    if 'Age' in s.get_name():
        features.append(s)
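Equivalently, the same selection can be written as a list comprehension (a minimal sketch using the same variables as above):
features = [s for s in feature_defs if 'Age' in s.get_name()]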

Related

Create a fork on a sklearn transformer pipeline to allow data to pass through

I have an sklearn pipeline that looks like this:
sum_along_columns = FunctionTransformer(np.sum, kw_args={"axis": 1})
col_trans = ColumnTransformer([("features_to_vectorize", DictVectorizer(), "col")])
out = FeatureUnion(
    [
        ("pipeline", Pipeline([("d_vec", col_trans), ("sum", sum_along_columns)])),
        ("column_transformer", col_trans),
    ]
)
You'll notice the duplicate step, features_to_vectorize, on the left and right sides of the FeatureUnion. features_to_vectorize is the result of applying a DictVectorizer to a pandas DataFrame column. I'd like to take features_to_vectorize and concatenate it with a transformation of itself. My current setup duplicates the transformation because I'm not sure how to create a fork at features_to_vectorize where I can pass that data through but also apply a transformation to it and later FeatureUnion the two. Any ideas how to better set this up to avoid the duplicate computation? Thanks.
Ideally the vectorization would happen once, with its output both summed and passed through before being concatenated.
SOLUTION:
col_trans = ColumnTransformer([("features_to_vectorize", DictVectorizer(), "col")])
ident = FunctionTransformer()
fts = FeatureUnion([("sum", SumColumns()), ("ident", ident)])
out = Pipeline([("dv", col_trans), ("sum_and_pass", fts)])
where SumColumns is a simple transformer applying np.sum(axis=1).reshape(-1, 1), in order to conform to the 2-D outputs that sklearn enforces.
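SumColumns itself isn't shown in the answer; a minimal sketch of such a transformer (the implementation details here are an assumption based on the description above) could be:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class SumColumns(BaseEstimator, TransformerMixin):
    """Sum each row and return a 2-D column vector, as sklearn expects."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # np.asarray also handles the np.matrix returned by sparse .sum()
        return np.asarray(X.sum(axis=1)).reshape(-1, 1)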
ColumnTransformer can send the same column to multiple transformers, so this should do:
sum_along_columns = FunctionTransformer(np.sum, kw_args={"axis": 1})
col_trans = ColumnTransformer([("features_to_vectorize", DictVectorizer(), "col")])
split = ColumnTransformer([
    ('sum', sum_along_columns, [0]),
    ('ident', 'passthrough', [0]),
])
out = Pipeline([
    ('vectorize', col_trans),
    ('split', split),
])
One issue is that after the 'vectorize' step in the pipeline you have an array, not a DataFrame, so we can't rely on feature names in split; hence the [0].
You could also stick with FeatureUnion and implement your own simple identity transformer, e.g. using FunctionTransformer again, instead of using ColumnTransformer's 'passthrough'.

How do you search for particular features?

When I last tried featuretools, I was searching for a particular feature that I was expecting. When you have more than 30 features, it becomes time-consuming to find one.
Does the feature_names object (the second object returned by the dfs method) have a method to search for text patterns (regex)?
feature_names is a list of featuretools.feature_base.feature_base.IdentityFeature objects.
P.S.: The featuretools API documentation does not describe the return objects.
Deep Feature Synthesis returns feature objects. If you call FeatureBase.get_name() on one of those objects, it will return the name as a string. You can use this to implement whatever selection logic you'd like. For example, here is code to build a list of all feature objects whose name contains "amount":
import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
fl = ft.dfs(target_entity="customers", entityset=es, features_only=True)

keep = []
for feature in fl:
    if "amount" in feature.get_name():
        keep.append(feature)
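Since the question asks about regex specifically, the same idea works with Python's re module (a minimal sketch building on the fl list above):
import re

pattern = re.compile(r'amount', re.IGNORECASE)
keep = [f for f in fl if pattern.search(f.get_name())]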

Custom NER Spacy throwing IndexError: list index out of range

I am trying to create a custom NER model using spaCy, but while training I am getting the following error:
gold = GoldParse(doc, entities=entity_offsets)
File "gold.pyx", line 565, in spacy.gold.GoldParse.__init__
IndexError: list index out of range
Any idea how I can fix this?
The most common resolution that came up after some googling was to trim leading and trailing whitespace in the training data, so I used the code below to trim it, but it was still of no use:
import re

# data: list of (text, annotations) training examples
invalid_span_tokens = re.compile(r'\s')
cleaned_data = []
for text, annotations in data:
    entities = annotations['entities']
    valid_entities = []
    for start, end, label in entities:
        valid_start = start
        valid_end = end
        # Advance the start index past any leading whitespace.
        while valid_start < len(text) and invalid_span_tokens.match(text[valid_start]):
            valid_start += 1
        # Pull the end index back past any trailing whitespace.
        while valid_end > 1 and invalid_span_tokens.match(text[valid_end - 1]):
            valid_end -= 1
        valid_entities.append([valid_start, valid_end, label])
    cleaned_data.append([text, {'entities': valid_entities}])
Ah, so the words would be a keyword argument on the GoldParse object. This lets you specify the gold-standard tokenization, if it doesn't match spaCy's tokenization. Assuming your input looks like this:
text = 'helloworld'
words = ['hello', 'world']
tags = ['INTJ', 'NOUN']
You can do the following:
from spacy.tokens import Doc
from spacy.gold import GoldParse

doc = Doc(nlp.vocab, words=words)  # Doc takes a vocab; build it from the gold tokens
gold = GoldParse(doc, words=words, tags=tags)
nlp.update([doc], [gold])
Alternatively, you can also use the new "simple training style" and just pass in the text as a string, and the annotations as a dictionary:
nlp.update([text], [{'words': words, 'tags': tags}])
In general, we'd recommend using the simple style, as it removes one layer of abstraction, and lets you get rid of the Doc and GoldParse imports. But ultimately, the style you choose depends on your personal preference.
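Tying this back to the NER use case, here is a minimal end-to-end sketch of the simple training style in spaCy v2 (the label, example text, and entity offsets are made up for illustration):
import spacy

nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
ner.add_label('FOOD')

# (text, annotations) pairs; offsets must not include leading/trailing whitespace
train_data = [('I like pizza', {'entities': [(7, 12, 'FOOD')]})]

optimizer = nlp.begin_training()
for i in range(10):
    for text, annotations in train_data:
        nlp.update([text], [annotations], sgd=optimizer)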

How to create interesting values using value combinations from multiple features/columns

I am fairly new to featuretools and am trying to understand whether and how one can add interesting values built from combinations of multiple features/columns to an entity set.
For example, I have an entity set with two entities: customers and transactions. Transactions can be debit or credit (d_c) and can occur across different spending categories (tran_category) - restaurants, clothing, groceries, etc.
Thus far, I am able to create interesting values for either of these features, but not for a combination of them:
import featuretools as ft

x = ft.EntitySet()
x.entity_from_dataframe(entity_id='customers', dataframe=customer_ids, index='cust_id')
x.entity_from_dataframe(entity_id='transactions', dataframe=transactions, index='tran_id', time_index='transaction_date')
x_rel = ft.Relationship(x['customers']['cust_id'], x['transactions']['cust_id'])
x.add_relationship(x_rel)
x['transactions']['d_c'].interesting_values = ['D', 'C']
x['transactions']['tran_category'].interesting_values = ['restaurants', 'clothing', 'groceries']
How can I add an interesting value that combines values from d_c AND tran_category (i.e. restaurant debits, grocery credits, clothing debits, etc.)? The goal is to then use these interesting values to aggregate across transaction amounts, time between transactions, etc., using where_primitives:
feature_matrix, feature_defs = ft.dfs(entityset=x, target_entity='customers', agg_primitives=list_of_agg_primitives, where_primitives=list_of_where_primitives, trans_primitives=list_of_trans_primitives, max_depth=3)
Currently, there is no way to do that.
One approach would be to create a new column d_c__tran_category that has all the possible combinations of d_c and tran_category and then add interesting values to that column.
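For example, the combined column could be built with pandas before creating the entity set (a sketch, using the column names from the question):
transactions['d_c__tran_category'] = transactions['d_c'] + '_' + transactions['tran_category']
after which the interesting values can be set on the new column: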
x['transactions']['d_c__tran_category'].interesting_values = ['D_restaurants', 'C_restaurants', 'D_clothing', 'C_clothing', 'D_groceries', 'C_groceries']

Getting Feature Names

Suppose I have 4 features named ['year2000', 'year2001', 'year2002', 'year2003'], used during learning with a decision tree classifier.
How can I obtain the names of the important features from feature_importances_, since it directly gives me numbers rather than feature names?
Suppose you put the feature names in a NumPy array (a plain list can't be indexed with an index array):
import numpy as np

feature_names = np.array(['year2000', 'year2001', 'year2002', 'year2003'])
Then the problem is just to get the indices of the features with top-k importance:
feature_importances = clf.feature_importances_
k = 3
top_k_idx = feature_importances.argsort()[-k:][::-1]
print(feature_names[top_k_idx])
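A self-contained sketch with a fitted classifier (the toy data here is made up purely for illustration):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

feature_names = np.array(['year2000', 'year2001', 'year2002', 'year2003'])

# Toy data: 4 features, with the target depending mostly on the first column.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Map the top-k importances back to names.
k = 3
top_k_idx = clf.feature_importances_.argsort()[-k:][::-1]
print(feature_names[top_k_idx])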
