Should we exclude target variable from DFS in featuretools? - featuretools

When passing dataframes as entities in an entityset and running DFS on it, are we supposed to exclude the target variable from DFS? I have a model that scored 0.76 roc_auc after traditional feature selection methods applied manually, and I used Featuretools to see if it improves the score. So I ran DFS on an entityset that included the target variable as well. Surprisingly, the roc_auc score went up to 0.996 and accuracy to 0.9997, so I am doubtful of the scores: since I passed the target variable into Deep Feature Synthesis as well, information related to the target might have been leaked into the training data. Am I assuming correctly?

Deep Feature Synthesis and Featuretools do allow you to keep your target in your entity set (in order to create new features using historical values of it), but you need to set up the “time index” and use “cutoff times” to do this without label leakage.
You use the time index to specify the column that holds the value for when data in each row became known. This column is specified using the time_index keyword argument when creating the entity using entity_from_dataframe.
Then, you use cutoff times when running ft.dfs() or ft.calculate_feature_matrix() to specify the last point in time at which data may be used when calculating each row of your feature matrix. Feature calculation will only use data up to and including the cutoff time. So, if this cutoff time is before the time index value of your target, you won’t have label leakage.
You can read about those concepts in detail in the documentation on Handling Time.
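Featuretools applies the cutoff-time filter for you, but the idea behind it can be illustrated with plain pandas on a hypothetical transactions table (column names here are made up for the sketch):

```python
import pandas as pd

# Hypothetical transactions for one customer; "time" records when each row became known.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 1],
    "time": pd.to_datetime(["2020-01-05", "2020-02-10", "2020-03-15", "2020-04-20"]),
    "amount": [100.0, 50.0, 75.0, 200.0],
})

cutoff_time = pd.Timestamp("2020-03-01")

# Only rows known at or before the cutoff time may feed a feature value.
usable = transactions[transactions["time"] <= cutoff_time]
mean_amount = usable["amount"].mean()  # aggregates the first two rows only
```

If the target's time index value falls after the cutoff time, it is excluded by exactly this kind of filter, which is why no label information leaks into the features.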
If you don’t want to deal with the target at all, you have two options:
- Use pandas to drop it from your dataframe entirely before making it an entity. If it’s not in the entityset, it can’t be used to create features.
- Set the drop_contains keyword argument in ft.dfs to ['target']. This stops any feature from being created which includes the string 'target'.
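The pandas route can be sketched as follows (the dataframe and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical dataframe with an id, one feature, and a label column named "target".
df = pd.DataFrame({"id": [1, 2], "feature_a": [0.5, 1.5], "target": [0, 1]})

# Drop the label before building the entity; DFS can then never see it.
features_only = df.drop(columns=["target"])
```

Once `features_only` is used to create the entity, no feature derived from the target can appear in the feature matrix.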
No matter which of the above options you choose, it is still possible to pass a target column directly through DFS. If you add the target to your cutoff times dataframe, it is passed through to the resulting feature matrix. That can be useful because it ensures the target column remains aligned with the other features. You can see an example of passing the label through here in the documentation.
Advanced Solution using Secondary Time Index
Sometimes a single time index isn’t enough to represent datasets where information in a row became known at two different times. This commonly occurs when the target is a column. To handle this situation, we need to use a “secondary time index”.
Here is an example from a Kaggle kernel on predicting when a patient will miss an appointment with a doctor where a secondary time index is used. The dataset has a scheduled_time, when the appointment is scheduled, and an appointment_day, which is when the appointment actually happens. We want to tell Featuretools that some information like the patient’s age is known when they schedule the appointment, but other information like whether or not a patient actually showed up isn't known until the day of the appointment.
To do this, we create an appointments entity with a secondary time index as follows:
es = ft.EntitySet('Appointments')
es = es.entity_from_dataframe(
    entity_id="appointments",
    dataframe=data,
    index='appointment_id',
    time_index='scheduled_time',
    secondary_time_index={'appointment_day': ['no_show', 'sms_received']})
This says that most columns can be used at the time index scheduled_time, but that the variables no_show and sms_received can’t be used until the value in the secondary time index, appointment_day.
We then make predictions at the scheduled_time by setting our cutoff times to be
cutoff_times = es['appointments'].df[['appointment_id', 'scheduled_time', 'no_show']]
By passing that dataframe into DFS, the no_show column will be passed through untouched, while historical values of no_show can still be used to create features. An example would be something like ages.PERCENT_TRUE(appointments.no_show), or “the percentage of people of each age who have not shown up in the past”.
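The aggregation behind a feature like ages.PERCENT_TRUE(appointments.no_show) can be sketched in plain pandas (hypothetical data; the real feature would additionally respect cutoff times):

```python
import pandas as pd

# Hypothetical appointments: each row has the patient's age and whether they no-showed.
appointments = pd.DataFrame({
    "age": [25, 25, 25, 40, 40],
    "no_show": [True, False, True, False, False],
})

# PERCENT_TRUE per age group: the fraction of appointments that were no-shows.
percent_true = appointments.groupby("age")["no_show"].mean()
```

Here patients aged 25 no-showed two out of three times, so their group's value is 2/3, while the 40-year-olds' value is 0.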

If you are using the target variable in your DFS, then you are leaking information about it into your training data. So you have to remove your target variable while doing any kind of feature engineering (manual or via DFS).

Related

Brightway2 - Get LCA scores of immediate exchanges

I'm having some problems regarding the post-processing analysis of my LCA results from brightway2. After running a LCA calculation, if, for example, I type top_activities() I get a list of a bunch of activities and their associated scores, however none of the activities/scores are the ones associated directly with my functional unit (they appear to be some exchanges of my exchanges...).
How can I get the LCA scores of the exchanges (both technosphere and biosphere) I defined when constructing my Functional Unit?
Thanks!
I've found the best way to get aggregated results for your foreground model in brightway is using the bw2analyzer.traverse_tagged_databases() function rather than top_activities(). Details in the docs are here.
It's designed to calculate upstream impacts of the elements of your foreground model and then aggregate the impacts based on a tag it finds in the activity. e.g. if you add 'tag':'use phase' or 'tag':'processing' to your activities you can aggregate impact results by life cycle stage.
BUT you can change the default label it looks for, so instead of tag you can tell it to look for name - that'll give you the aggregated upstream impact of each of the activities in your foreground model. It returns a dictionary with the names of your tags as keys, and impacts as values. It also returns a graph of your foreground system which you can use to create some cool tree/bullseye charts - see the docs for the format.
Here's the function you need:
results, graph = traverse_tagged_databases(functional_unit, method, label='name')
Here are a couple of examples of the kinds of visualisations you can make using the data traverse_tagged_databases gives you:
Waterfall chart example from the results dictionary
Bullseye chart example from the tagged graph
It is pretty easy to traverse the supply chain manually, and everyone wants to do this a slightly different way, so it isn't built in to Brightway yet. Here is a simple example:
from brightway2 import *

func_unit = Database("ecoinvent 3.4 cutoff").random()
lca = LCA({func_unit: 1}, methods.random())
lca.lci()
lca.lcia()
print(func_unit)
for exc in func_unit.technosphere():
    lca.redo_lcia({exc.input: exc['amount']})
    print(exc.input, exc['amount'], lca.score)

How does SAS pick reference group when using CLASS statement?

How does SAS pick reference group when using CLASS statement?
I have a categorical variable and it can take on about 200 different values. Is it good practice to create dummies for only specific characteristics of this variable? I know that the other values are rarely used and in a correlation analysis they are not significant in predicting Y. The example is: There are about 200 different add-ons and the outcome variable is Sale (success vs. no success) the model is a logistic regression. I want to see whether any of these add ons seem to be more popular among customers and therefore are more likely to lead to a sale. Other IV are: how much the customer already pays on a monthly basis, where the customer comes from and which location the sales agent comes from.
How does SAS pick reference group when using CLASS statement?
By default, SAS uses the last value in sort order as the reference level (ref=LAST). This can be changed with the ref= option:
class var(ref='B');
Is it good practice to create dummies for only specific characteristics of this variable?
That's a question better asked on Cross Validated.

OpenMDAO 1.x: recording desvars, constraints and objective

How can you get information about which variables are design vars, objectives or constraints from the information saved by recorders? It would be useful to print this information to a file to track optimization progress during a run. It looks like the RecordingManager.record_iteration doesn't really allow for this at the moment, since you only pass the root system and a metadata dict meant for optimizer settings.
Would it be possible to add an argument to the RecordingManager.record_iteration called e.g. optproblem, which is a dictionary with dictionaries with desvars, constraints and objective?
A simple OptimizationRecorder could then dump out column formatted files with the quantities for easy plotting during the optimisation.
This is something we have on our list of to-do's for the near future. Our current planned approach is going to be to augment the meta-data (already being saved) of variables with labels identifying them as des-vars, objectives, and constraints. Then you could pull that information out as part of a custom case recorder if you want. We plan on doing it this way because it doesn't require modifying the recorder's api at all. I think we'll have something like this implemented in the next month or so.

How to use Apache Spark ALS (alternating-least-squares) algorithm with limited Rating values

I am trying to use ALS, but currently my data is limited to information about what each user bought. So I was trying to feed ALS from Apache Spark with ratings equal to 1 (one) whenever user X bought item Y (and only such information was provided to the algorithm).
I tried training it (dividing the data into train/test/validation sets) and also training on all the data, but in the end I got predictions with extremely similar values for any user-item pair (values differed only at the 5th or 6th decimal place, like 0.86001 and 0.86002).
I was thinking about it, and maybe it is because I can only provide a rating equal to 1, so ALS cannot be used in such an extreme situation?
Is there any trick with ratings I could use to fix this problem? I only have information about what was bought. Later I am going to get more data, but at the moment I have to use some kind of collaborative filtering until I acquire more data; in other words, I need to show users some kind of recommendation on the startup page. I chose ALS for this, but maybe I should use something else, and if so, what exactly?
Of course, I tried changing parameters like iterations, lambda, and rank.
In this case, the key is that you must use trainImplicit, which ignores Rating's value. Otherwise you're asking it to predict ratings in a world where everyone rates everything 1. The right answer is invariably 1, so all your answers are similar.

Weka attribute selection output

I want to perform attribute selection in Weka, but my dataset is rather big, and the program runs quite a while. That's why I want to see the current best set of attributes found. How do I do it?
For example, genetic search has the "Report Frequency" parameter, but all the results are shown after the whole search is finished, that's not what I need.
There is no progress bar, so I don't even know how long I will have to wait...
Feature or Attribute selection is a standard problem in data-mining and Machine learning domains.
If you want to select a good set of attributes, you should preprocess your data by ranking attributes based on their quality. Ranking methods such as the p-metric or the t-statistic, which are based on statistical measures, are popular. One cannot simply select attributes at random from a large set without some intuition about the nature of the attributes.
If you do not need to run the attribute selection on your whole dataset you could use a smaller sample of your dataset (simply edit your ARFF file) to run the attribute selection.
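One way to build that smaller sample, assuming a standard ARFF layout (header lines up to and including @data, then one instance per line), is a short stdlib-only Python script; the file names in the usage comment are hypothetical:

```python
import random

def sample_arff(lines, fraction, seed=0):
    """Keep the full ARFF header but only a random fraction of the data rows."""
    header, data, in_data = [], [], False
    for line in lines:
        if in_data and line.strip() and not line.lstrip().startswith("%"):
            data.append(line)  # an instance row after @data
        else:
            header.append(line)  # @relation, @attribute, @data, comments, blanks
            if line.strip().lower() == "@data":
                in_data = True
    rng = random.Random(seed)
    kept = rng.sample(data, max(1, int(len(data) * fraction)))
    return header + kept

# Usage with hypothetical file names:
# with open("big.arff") as f:
#     sampled = sample_arff(f.read().splitlines(), fraction=0.1)
# with open("small.arff", "w") as f:
#     f.write("\n".join(sampled))
```

Running attribute selection on the 10% file gives a quick, rough ranking that you can sanity-check before committing to the long run on the full dataset.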
