How does SAS pick the reference group when using the CLASS statement?
I have a categorical variable that can take about 200 different values. Is it good practice to create dummies for only specific characteristics of this variable? I know that the other values are rarely used, and in a correlation analysis they are not significant in predicting Y. For example: there are about 200 different add-ons, the outcome variable is Sale (success vs. no success), and the model is a logistic regression. I want to see whether any of these add-ons are more popular among customers and therefore more likely to lead to a sale. The other independent variables are how much the customer already pays per month, where the customer comes from, and which location the sales agent comes from.
How does SAS pick the reference group when using the CLASS statement?
By default, the last level in sort order is used as the reference level (REF=LAST). This can be changed with the ref= option:
class var(ref='B');
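For readers unfamiliar with reference coding, here is a small plain-Python sketch (not SAS) of what choosing a reference level means: the reference level gets no indicator column, so every coefficient is read relative to it. The function name reference_code and the sample values are made up for illustration; in SAS this corresponds to reference parameterization (param=ref), which is an assumption about how the model is set up.

```python
def reference_code(values, ref):
    """Dummy-code a categorical column, omitting the reference level `ref`."""
    levels = sorted(set(values))
    if ref not in levels:
        raise ValueError(f"reference level {ref!r} not found in data")
    # One 0/1 column per non-reference level, in sort order.
    columns = [lvl for lvl in levels if lvl != ref]
    rows = [[1 if v == lvl else 0 for lvl in columns] for v in values]
    return columns, rows


columns, rows = reference_code(["A", "B", "C", "B"], ref="B")
print(columns)  # ['A', 'C']
print(rows)     # [[1, 0], [0, 0], [0, 1], [0, 0]]
```

Rows for the reference level "B" are all zeros, which is why its effect is absorbed into the intercept.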
Is it good practice to create dummies for only specific
characteristics of this variable?
That's a question better asked on Cross Validated.
Can chatbots like Rasa learn from a trusted user (new employees, product IDs, product categories or properties), or unlearn when these entities are no longer current?
Or do I have to go through formal data collection, training sessions and testing (confidence rates above a given ratio) before the new version can be made operational?
If you have entity values that are being checked against a shifting list of valid values, it's more scalable to check those values against a database that is always up to date (e.g. your backend systems probably have a queryable list of current employees). Then if a user provides a value that used to be valid and now isn't, it will act the same as if a user provided an invalid value in the first place.
This way, the entity extraction can stay the same regardless of whether some training examples become outdated -- though of course it's always good to try to keep your data up to date!
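A minimal sketch of that idea in plain Python (not Rasa's API): the extracted value is checked against whatever the backend returns at request time, so nothing needs retraining when the list changes. The names validate_entity and fetch_employees, and the in-memory list standing in for a database query, are assumptions for illustration.

```python
def validate_entity(value, fetch_current_values):
    """Return the canonical value if it is currently valid, else None."""
    # Look the value up at request time, so a changed list needs no retraining.
    current = {v.lower(): v for v in fetch_current_values()}
    return current.get(value.strip().lower())


# Stand-in for a backend query; the list can change between calls.
employees = ["Alice Smith", "Bob Jones"]


def fetch_employees():
    return employees


print(validate_entity("alice smith", fetch_employees))  # Alice Smith
employees.remove("Alice Smith")                          # the employee leaves
print(validate_entity("alice smith", fetch_employees))  # None
```

A value that used to be valid is then handled exactly like one that was never valid.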
Many chatbots do not have such a function. Exceptions are advanced ones like Alexa, which has offered a "Remember" keyword since around 2017: the user asks Alexa to commit certain facts to memory.
IMHO such a feature is a mark of "intelligence". It is not trivial to implement in ML systems, where the coefficients of a neural network model are updated by back-propagation after passing training examples. Rule-based systems (such as CHAT-80, a QA system on geography) store their knowledge in relations that can be updated more transparently.
When passing dataframes as entities in an entityset and running DFS on it, are we supposed to exclude the target variable from DFS? I have a model that reached a roc_auc score of 0.76 after manually trying traditional feature selection methods, and I used Featuretools to see if it improves the score. I ran DFS on an entityset that included the target variable as well. Surprisingly, the roc_auc score went up to 0.996 and accuracy to 0.9997, so I am doubtful of the scores: since I passed the target variable into Deep Feature Synthesis, information related to the target might have leaked into the training data. Am I assuming correctly?
Deep Feature Synthesis and Featuretools do allow you to keep your target in your entity set (in order to create new features using historical values of it), but you need to set up the “time index” and use “cutoff times” to do this without label leakage.
You use the time index to specify the column that holds the value for when data in each row became known. This column is specified using the time_index keyword argument when creating the entity using entity_from_dataframe.
Then, you use cutoff times when running ft.dfs() or ft.calculate_feature_matrix() to specify the last point in time you should use data when calculating each row of your feature matrix. Feature calculation will only use data up to and including the cutoff time. So, if this cutoff time is before the time index value of your target, you won’t have label leakage.
You can read about those concepts in detail in the documentation on Handling Time.
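To make the cutoff-time idea concrete, here is a plain-Python sketch (deliberately not Featuretools code) of a feature that only uses rows whose time index is at or before the cutoff. The purchases data and the mean_amount helper are made up for illustration.

```python
from datetime import date

# Each row records when its value became known (the "time index").
purchases = [
    {"customer": 1, "time": date(2020, 1, 5), "amount": 10.0},
    {"customer": 1, "time": date(2020, 2, 1), "amount": 30.0},
    {"customer": 1, "time": date(2020, 3, 9), "amount": 50.0},
]


def mean_amount(rows, customer, cutoff):
    """Mean purchase amount using only rows known at or before `cutoff`."""
    known = [r["amount"] for r in rows
             if r["customer"] == customer and r["time"] <= cutoff]
    return sum(known) / len(known) if known else None


print(mean_amount(purchases, 1, date(2020, 2, 1)))    # 20.0
print(mean_amount(purchases, 1, date(2020, 12, 31)))  # 30.0
```

The same customer gets a different feature value depending on the cutoff, which is what keeps later information out of earlier predictions.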
If you don’t want to deal with the target at all, you have two options:
Use pandas to drop it from your dataframe entirely before making it an entity. If it’s not in the entityset, it can’t be used to create features.
Set the drop_contains keyword argument in ft.dfs to ['target']. This stops any feature from being created whose name includes the string 'target'.
No matter which of the above options you choose, it is still possible to pass a target column directly through DFS. If you add the target to your cutoff times dataframe, it is passed through to the resulting feature matrix. That can be useful because it ensures the target column remains aligned with the other features. You can see an example of passing the label through in the documentation.
Advanced Solution using Secondary Time Index
Sometimes a single time index isn’t enough to represent datasets where information in a row became known at two different times. This commonly occurs when the target is a column. To handle this situation, we need to use a “secondary time index”.
Here is an example from a Kaggle kernel on predicting when a patient will miss an appointment with a doctor where a secondary time index is used. The dataset has a scheduled_time, when the appointment is scheduled, and an appointment_day, which is when the appointment actually happens. We want to tell Featuretools that some information like the patient’s age is known when they schedule the appointment, but other information like whether or not a patient actually showed up isn't known until the day of the appointment.
To do this, we create an appointments entity with a secondary time index as follows:
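import featuretools as ft

# `data` is the appointments DataFrame loaded earlier
es = ft.EntitySet('Appointments')
es = es.entity_from_dataframe(entity_id="appointments",
                              dataframe=data,
                              index='appointment_id',
                              time_index='scheduled_time',
                              # these columns only become known on appointment_day
                              secondary_time_index={'appointment_day': ['no_show', 'sms_received']})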
This says that most columns can be used as of the time index scheduled_time, but that the variables no_show and sms_received can’t be used until the time given by the secondary time index, appointment_day.
We then make predictions at the scheduled_time by setting our cutoff times to be
cutoff_times = es['appointments'].df[['appointment_id', 'scheduled_time', 'no_show']]
By passing that dataframe into DFS, the no_show column will be passed through untouched, while historical values of no_show can still be used to create features. An example would be something like ages.PERCENT_TRUE(appointments.no_show) or “the percentage of people of each age that have not shown up in the past”.
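As a rough illustration of what the secondary time index buys you, the plain-Python sketch below (not Featuretools code) computes a "percent no-show by age" feature using only outcomes whose appointment_day is at or before the cutoff. The sample rows and the percent_no_show helper are invented for this example.

```python
from datetime import date

appointments = [
    # scheduled_time is the time index; no_show is only known on appointment_day.
    {"age": 30, "scheduled_time": date(2016, 4, 1),
     "appointment_day": date(2016, 4, 10), "no_show": True},
    {"age": 30, "scheduled_time": date(2016, 4, 2),
     "appointment_day": date(2016, 4, 20), "no_show": False},
    {"age": 30, "scheduled_time": date(2016, 4, 3),
     "appointment_day": date(2016, 5, 15), "no_show": True},
]


def percent_no_show(rows, age, cutoff):
    """Share of no-shows for `age`, using only outcomes known by `cutoff`."""
    known = [r["no_show"] for r in rows
             if r["age"] == age and r["appointment_day"] <= cutoff]
    return sum(known) / len(known) if known else None


# At a cutoff of 2016-05-01 only the first two outcomes are known.
print(percent_no_show(appointments, 30, date(2016, 5, 1)))  # 0.5
```

The third appointment was already scheduled by the cutoff, but its outcome is excluded because it was not yet known.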
If you are using the target variable in your DFS, then you are leaking information about it into your training data. So you have to remove your target variable whenever you do any kind of feature engineering (manual or via DFS).
How can you get information about which variables are design vars, objectives or constraints from the information saved by recorders? It would be useful to print this information to a file to track optimization progress during a run. It looks like the RecordingManager.record_iteration doesn't really allow for this at the moment, since you only pass the root system and a metadata dict meant for optimizer settings.
Would it be possible to add an argument to RecordingManager.record_iteration, called e.g. optproblem, which is a dictionary of dictionaries holding the desvars, constraints and objective?
A simple OptimizationRecorder could then dump out column formatted files with the quantities for easy plotting during the optimisation.
This is something we have on our list of to-do's for the near future. Our current planned approach is going to be to augment the meta-data (already being saved) of variables with labels identifying them as des-vars, objectives, and constraints. Then you could pull that information out as part of a custom case recorder if you want. We plan on doing it this way because it doesn't require modifying the recorder's api at all. I think we'll have something like this implemented in the next month or so.
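Until that lands, here is a purely hypothetical sketch of what a custom recorder could do once variables carry such labels. Nothing below is OpenMDAO API: the "role" metadata key, the format_row helper and the sample values are all assumptions made for illustration.

```python
def format_row(iteration, values, metadata,
               roles=("desvar", "objective", "constraint")):
    """Build one column-formatted line from variables labelled with a role."""
    # Keep only variables whose (hypothetical) 'role' label is of interest.
    names = sorted(n for n, meta in metadata.items() if meta.get("role") in roles)
    cells = [f"{iteration:6d}"] + [f"{values[n]:14.6e}" for n in names]
    return names, " ".join(cells)


metadata = {"x": {"role": "desvar"}, "f": {"role": "objective"},
            "g": {"role": "constraint"}, "tmp": {}}
values = {"x": 1.5, "f": 2.25, "g": -0.5, "tmp": 7.0}

names, line = format_row(3, values, metadata)
print(names)  # ['f', 'g', 'x']
print(line)   # iteration number followed by one column per labelled variable
```

Appending one such line per iteration to a text file gives something that can be plotted while the optimization is still running.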
I am trying to use ALS, but my data is currently limited to which items each user bought. So I fed ALS from Apache Spark with Ratings equal to 1 (one) whenever user X bought item Y (that is the only information I gave the algorithm).
I tried training it on a train/test/validation split and also on all the data, but in the end I got predictions with extremely similar values for every user-item pair (they differ only in the 5th or 6th decimal place, like 0.86001 and 0.86002).
Maybe that is because I can only provide ratings equal to 1. Can ALS not be used in such an extreme situation?
Is there any trick with ratings I could use to fix this? I only have information about what was bought. Later I will get more data, but for now I need some kind of collaborative filtering to show users recommendations on the start page. I chose ALS for that, but maybe I should use something else, and if so, what exactly?
Of course, I tried changing parameters such as iterations, lambda and rank.
In this case, the key is that you must use trainImplicit, which treats Rating's value as the strength of an observed interaction rather than as a score to predict. Otherwise you're asking it to predict ratings in a world where everyone rates everything 1. The right answer is invariably 1, so all your answers are similar.
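For intuition, here is a small plain-Python sketch of how the implicit-feedback formulation (Hu, Koren and Volinsky, which Spark's implicit ALS is based on) turns an interaction strength into a binary preference plus a confidence weight. The helper name implicit_targets is made up, and Spark's actual implementation may differ in details, so treat this as an assumption-laden illustration rather than Spark code.

```python
def implicit_targets(rating, alpha=1.0):
    """Return (preference, confidence) for one user-item interaction strength."""
    preference = 1.0 if rating > 0 else 0.0  # did the user interact at all?
    confidence = 1.0 + alpha * rating         # how much to trust that signal
    return preference, confidence


print(implicit_targets(1))  # (1.0, 2.0)  bought
print(implicit_targets(0))  # (0.0, 1.0)  never bought
```

With purchase-only data every observed pair gets the same preference and confidence, so the model learns from which pairs were observed rather than from the rating value.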
I have a task that is probably related to data analysis or even neural networks.
We have a data source from our partners, a job portal. The source values are arrays of different attributes related to a particular employee:
His/her gender,
Age,
Years of experience,
Portfolio (number of the projects done),
Profession and specialization (web design, web programming, management etc.),
Many others (around 20-30 in total)
Every employee has his or her own hourly salary rate. So, mathematically, we have some function
F(attr1, attr2, attr3, ...) = A*attr1 + B*attr2 + C*attr3 + ...
with unknown coefficients. But we know the result of the function for the specified arguments (let's say, we know that a male programmer with 20 years of experience and 10 works in his portfolio has a rate of $40 per hour).
So we have to find somehow these coefficients (A, B, C...), so we can predict the salary of any employee. This is the most important goal.
Another goal is to find which arguments are most important - in other words, which of them cause significant changes to the result of the function. So in the end we have to have something like this: "The most important attributes are years of experience; then portfolio; then age etc.".
There may be a situation when different professions vary too much from each other - for example, we simply may not be able to compare web designers with managers. In this case, we have to split them by groups and calculate these ratings for every group separately. But in the end we need to find 'shared' arguments that will be common for every group.
I'm thinking about neural networks because this seems like something they could handle. But I'm completely new to them and have no idea what to do.
I'd greatly appreciate any help: which tools to use, which algorithms, or even pseudo-code samples.
Thank you very much.
That is the most basic example of (linear) regression. You are using a linear function to model your data, and need to estimate the parameters.
Note that this is actually a part of classic mathematical statistics; not data mining yet but much much older.
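As a minimal sketch of what estimating the parameters looks like, the plain-Python code below solves the least-squares normal equations for a tiny made-up data set. The intercept column, the attribute choice and the numbers are assumptions for illustration; in practice you would use a statistics package rather than hand-written elimination.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear(rows, targets):
    """Least-squares coefficients for targets ~ rows, via the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data: columns are [1 (intercept), years of experience, portfolio size].
rows = [[1, 2, 1], [1, 5, 3], [1, 10, 4], [1, 20, 10], [1, 7, 8]]
rates = [10 + 1.0 * r[1] + 0.5 * r[2] for r in rows]  # exactly linear rates

coef = fit_linear(rows, rates)
print([round(c, 6) for c in coef])  # [10.0, 1.0, 0.5]
```

Because the made-up rates are exactly linear in the attributes, the fit recovers the generating coefficients; real salary data would leave residuals.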
There are various methods. Given that there will likely be outliers, I would suggest using RANSAC, which fits the model to the subset of points that agree with each other and ignores the rest.
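A bare-bones sketch of RANSAC for a single attribute, in plain Python: repeatedly fit a line through two random points, keep the candidate with the most inliers, then refit on those inliers. The threshold, iteration count, seed and data are all made-up assumptions; a library implementation would be preferable for real use.

```python
import random


def fit_line(points):
    """Ordinary least-squares line y = a*x + b through the given points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx


def ransac_line(points, threshold=1.0, iterations=200, seed=0):
    """Fit a line robustly: keep the two-point model with the most inliers."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # a vertical pair cannot define y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return fit_line(best_inliers)  # refit on the consensus set


# Made-up data: rate = 2 * years + 10, plus two gross outliers.
data = [(x, 2 * x + 10) for x in range(1, 11)] + [(3, 90), (8, 5)]
a, b = ransac_line(data)
print(round(a, 6), round(b, 6))  # 2.0 10.0
```

The two outliers never join the consensus set, so they do not pull the final fit the way they would in a plain least-squares fit over all points.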
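As for the importance, doesn't this boil down to "which is largest in magnitude, A, B or C"? Note that this comparison is only meaningful if the attributes are on comparable scales, so standardize them first.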
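One common way to compare coefficients across attributes with different units is to multiply each coefficient by the standard deviation of its attribute. The sketch below does that in plain Python; the coefficients, attribute names and values are made up, and this is only one of several possible importance measures.

```python
from statistics import pstdev


def scaled_importance(coefs, columns):
    """Rank attributes by |coefficient| * standard deviation of the attribute."""
    scores = {name: abs(coefs[name]) * pstdev(values)
              for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up fitted coefficients and attribute values.
coefs = {"years_experience": 1.0, "is_manager": 5.0}
columns = {
    "years_experience": [2, 5, 10, 20, 7],
    "is_manager": [0, 0, 1, 0, 0],
}

ranking = scaled_importance(coefs, columns)
print([name for name, _ in ranking])  # ['years_experience', 'is_manager']
```

Here years_experience ranks first even though its raw coefficient is smaller, because it varies far more across employees than the 0/1 manager flag does.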