Decision Trees - Scikit, Python - scikit-learn

I am trying to create a decision tree based on some training data. I have never created a decision tree before, but have completed a few linear regression models. I have 3 questions:
With linear regression I find it fairly easy to plot graphs, fit models, group factor levels, check P statistics etc. in an iterative fashion until I end up with a good predictive model. I have no idea how to evaluate a decision tree. Is there a way to get a summary of the model, (for example, .summary() function in statsmodels)? Should this be an iterative process where I decide whether a factor is significant - if so how can I tell?
I have been very unsuccessful in visualising the decision tree. On the various different ways I have tried, the code seems to run without any errors, yet nothing appears / plots. The only thing I can do successfully is tree.export_text(model), which just states feature_1, feature_2, and so on. I don't know what any of the features actually are. Has anybody come across these difficulties with visualising / have a simple solution?
The confusion matrix that I have generated is as follows:
[[ 0 395]
[ 0 3319]]
i.e. the model is predicting all rows to the same outcome. Does anyone know why this might be?

Scikit-learn is a library designed to build predictive models, so there are no tests of significance, confidence intervals, etc. You can always build your own statistics, but this is a tedious process. In scikit-learn, you can eliminate features recursively using RFE, RFECV, etc. You can find a list of feature selection algorithms here. For the most part, these algorithms get rid off the least important feature in each loop according to feature_importances (where the importance of each feature is defined as its contribution to the reduction in entropy, gini, etc.).
The most straight forward way to visualize a tree is tree.plot_tree(). In particular, you should try passing the names of the features to feature_names. Please show us what you have tried so far if you want a more specific answer.
Try another criterion, set a higher max_depth, etc. Sometimes datasets have unidentifiable records. For example, two observations with the exact same values in all features, but different target labels. Is this the case in your dataset?

Related

When and Whether should we normalize the ground-truth labels in the multi-task regression models?

I am trying a multi-task regression model. However, the ground-truth labels of different tasks are on different scales. Therefore, I wonder whether it is necessary to normalize the targets. Otherwise, the MSE of some large-scale tasks will be extremely bigger. The figure below is part of my overall targets. You can certainly find that columns like ASA_m2_c have much higher values than some others.
First, I have already tried some weighted loss techniques to balance the concentration of my model when it does gradient backpropagation. The result shows it didn't perform well.
Secondly, I have seen tremendous discussions regarding normalizing the input data, but hardly discovered any particular talking about normalizing the labels. It's partly because most of the people's problems are classification type and a single task. I do know pytorch provides a convenient approach to normalize the vision dataset by transform.normalize, which is still operated on the input rather than the labels.
Similar questions: https://forums.fast.ai/t/normalizing-your-dataset/49799
https://discuss.pytorch.org/t/ground-truth-label-normalization/26981/19
PyTorch - How should you normalize individual instances
Moreover, I think it might be helpful to provide some details of my model architecture. The input is first fed into a feature extractor and then several generators use the shared output representation from that extractor to predict different targets.
I've been working on a Multi-Task Learning problem where one head has an output of ~500 and another between 0 and 1.
I've tried Uncertainty Weighting but in vain. So I'd be grateful if you could give me a little clue about your studies.(If there is any progress)
Thanks.

How to see correlation between features in scikit-learn?

I am developing a model in which it predicts whether the employee retains its job or leave the company.
The features are as below
satisfaction_level
last_evaluation
number_projects
average_monthly_hours
time_spend_company
work_accident
promotion_last_5years
Department
salary
left (boolean)
During feature analysis, I came up with the two approaches and in both of them, I got different results for the features. as shown in the image
here
When I plot a heatmap it can be seen that satisfaction_level has a negative correlation with left.
On the other hand, if I just use pandas for analysis I got results something like this
In the above image, it can be seen that satisfaction_level is quite important in the analysis since employees with higher satisfaction_level retain the job.
While in the case of time_spend_company the heatmap shows it is important while on the other hand, the difference is not quite important in the second image.
Now I am confused about whether to take this as one of my features or not and which approach should I choose in order to choose features.
Some please help me with this.
BTW I am doing ML in scikit-learn and the data is taken from here.
Correlation between features have little to do with feature importance. Your heat map is correctly showing correlation.
In fact, in most of the cases when you talking about feature importance, you must provide context of a model that you are using. Different models may choose different features as important. Moreover many models assume that data comes from IID (Independent and identically distributed random variables), so correlation close to zero is desirable.
For example in sklearn learn regression to get estimation of feature importance you can examine coef_ parameter.

Using SVM to perform classification on multi-dimensional time series datasets

I would like to use scikit-learn's svm.SVC() estimator to perform classification tasks on multi-dimensional time series - that is, on time series where the points in the series take values in R^d, where d > 1.
The issue with doing this is that svm.SVC() will only take ndarray objects of dimension at most 2, whereas the dimension of such a dataset would be 3. Specifically, the shape of a given dataset would be (n_samples, n_features, d).
Is there a workaround available? One simple solution would just be to reshape the dataset so that it is 2-dimensional, however I imagine this would lead to the classifier not learning from the dataset properly.
Without any further knowledge about the data reshaping is the best you can do. Feature engineering is a very manual art that depends heavily on domain knowledge.
As a rule of thumb: if you don't really know anything about the data throw in the raw data and see if it works. If you have an idea what properties of the data may be beneficial for classification, try to work it in a feature.
Say we want to classify swiping patterns on a touch screen. This closely resembles your data: We acquired many time series of such patterns by recording the 2D position every few milliseconds.
In the raw data, each time series is characterized by n_timepoints * 2 features. We can use that directly for classification. If we have additional knowledge we can use that to create additional/alternative features.
Let's assume we want to distinguish between zig-zag and wavy patterns. In that case smoothness (however that is defined) may be a very informative feature that we can add as a further column to the raw data.
On the other hand, if we want to distinguish between slow and fast patterns, the instantaneous velocity may be a good feature. However, the velocity can be computed as a simple difference along the time axis. Even linear classifiers can model this easily so it may turn out that such features, although good in principle, do not improve classification of raw data.
If you have lots and lots and lots and lots of data (say an internet full of good examples) Deep Learning neural networks can automatically learn features to some extent, but let's say this is rather advanced. In the end, most practical applications come down to try and error. See what features you can come up with and try them out in practice. And beware the overfitting gremlin.

How do I combine the results (predictions) if I have generated predictions from say, randomforest, gbm, svm, nnet to get more accurate predictions?

More specifically, how to improve my predictions by creating an ensemble? I mean I know that we perform regression on the results(predictions) one gets from different algorithms(random forest, nnet, gbm, svm). But how do we do this ?
There exists a lot of different ways you can do to address this. One way would be to train a new top level model, which input would be a vector of the output from all other techniques.
Another way to solve this is by using the approach often used with support vector machines to solve the multiclass problem.
This approach resembles a tournament match between each model, where all the outputs are compared between them to choose a winner. Commonly this is done using one of two strategies “One Against One” or “One Against All” ( Here is a paper comparing them http://www.researchgate.net/publication/228643199_One_Against_One_or_One_Against_All_Which_One_is_Better_for_Handwriting_Recognition_with_SVMs )
These are the ways I have on top of my mind, but I am pretty sure you can find a lot of novel ways to address this.

Suitable data mining technique for this dataset

I'm working on a data mining project and would like to mine this dataset Higher Education Enrolments for interesting patterns or knowledge. My problem is figuring out which technique would work best for the dataset.
I'm currently working on the dataset using RapidMiner 5.0 and I removed two columns (E550 - Reference year, E931 - Total Student EFTSL) from the data as they would not be relevant to the analysis. The rest of the attributes are nominal except StudentID (integer) which I have used as my id. I'm currently using classification on it (Naive Bayes) but would like to get the opinion of others, hopefully those who have had more experience in this area. Thanks.
The best technique depends on many factors: type/distribution of training and target attribute, domain, value range of attributes, etc. The best technique to use is the result of data analysis and understanding.
In this particular case, you should clarify which is the attribute to predict.
Unless you already know what you are looking for, and know about the quality of the data source, you should always start by trying various exploratory analysis:
look at some of the first and second order statistics of all the
variables
generate histograms of each variable, to get an idea of the empirical
distribution of each
take a look at pairwise scatter plots of variables that might have
dependency
try other visualization that you might think of
These would give you a rough idea about what kind of pattern might be present and might be discoverable given the noise level. Then depending on what kind of pattern you are interested in, you could start trying various unsupervised pattern learning methods such as, PCA/ICA/factor analysis, clustering, or supervised methods, such as regression, classification.

Resources