"Most outlier" feature - scikit-learn

I am using the Sklearn implementation of Isolation Forest (IF) to detect outliers on a set of data of 20-30 features.
It is working very well, but I would like insight into which feature has the highest impact when an outlier is detected. Please note that my goal is not to get the feature importance (i.e. which feature the IF model relies on most to detect outliers) but rather, for one specific outlier that was flagged as -1, which feature was the furthest from the median/mean/cluster, etc.
Does this make sense?
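One way to approximate this, offered only as a sketch: Isolation Forest itself does not report a per-sample "most outlying feature", so the snippet below uses a separate univariate heuristic (a robust z-score, i.e. distance from the median scaled by the median absolute deviation). The helper name most_deviant_feature and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def most_deviant_feature(X_ref, x):
    """Return (index, robust z-scores) for the feature of x furthest from the reference median.

    Hypothetical helper: this is a univariate heuristic, not part of Isolation Forest.
    """
    median = np.median(X_ref, axis=0)
    mad = np.median(np.abs(X_ref - median), axis=0)
    scale = 1.4826 * mad + 1e-12  # 1.4826 makes the MAD comparable to a standard deviation
    z = np.abs(x - median) / scale
    return int(np.argmax(z)), z


# Synthetic data with one planted anomaly in feature 3 (assumption for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[0, 3] = 25.0

forest = IsolationForest(random_state=0).fit(X)
labels = forest.predict(X)  # -1 marks outliers, 1 marks inliers

# For a few flagged rows, report the feature that deviates most from the median.
for row in np.where(labels == -1)[0][:5]:
    idx, z = most_deviant_feature(X, X[row])
    print(f"row {row}: feature {idx} (robust z = {z[idx]:.1f})")
```

Because each feature is scored on its own, this will miss outliers that are only unusual as a combination of features.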


Decision Trees - Scikit, Python

I am trying to create a decision tree based on some training data. I have never created a decision tree before, but have completed a few linear regression models. I have 3 questions:
With linear regression I find it fairly easy to plot graphs, fit models, group factor levels, check P statistics etc. in an iterative fashion until I end up with a good predictive model. I have no idea how to evaluate a decision tree. Is there a way to get a summary of the model, (for example, .summary() function in statsmodels)? Should this be an iterative process where I decide whether a factor is significant - if so how can I tell?
I have been very unsuccessful in visualising the decision tree. In the various ways I have tried, the code seems to run without any errors, yet nothing appears / plots. The only thing I can do successfully is tree.export_text(model), which just states feature_1, feature_2, and so on. I don't know what any of the features actually are. Has anybody come across these difficulties with visualising, or have a simple solution?
The confusion matrix that I have generated is as follows:
[[ 0 395]
[ 0 3319]]
i.e. the model is predicting all rows to the same outcome. Does anyone know why this might be?
Scikit-learn is a library designed to build predictive models, so there are no tests of significance, confidence intervals, etc. You can always build your own statistics, but this is a tedious process. In scikit-learn, you can eliminate features recursively using RFE, RFECV, etc. You can find a list of feature selection algorithms here. For the most part, these algorithms get rid of the least important feature in each loop according to feature_importances_ (where the importance of each feature is defined as its contribution to the reduction in entropy, gini, etc.).
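As a rough sketch of the recursive elimination described above, the following uses RFE with a decision tree on synthetic data. The dataset and the choice of keeping 3 features are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 8 features, of which 3 carry signal (assumption for the demo).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Drop one feature per iteration (step=1) until 3 remain, ranking by feature_importances_.
selector = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=3, step=1)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the features that were kept
print(selector.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```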
The most straightforward way to visualize a tree is tree.plot_tree(). In particular, you should try passing the names of the features to the feature_names parameter. Please show us what you have tried so far if you want a more specific answer.
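A minimal sketch of the feature_names idea, using export_text since the question already has that working. The iris dataset stands in for the asker's data, which is not shown.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Passing feature_names replaces the generic feature_0, feature_1, ... labels.
report = export_text(model, feature_names=list(iris.feature_names))
print(report)
```

tree.plot_tree() accepts the same feature_names argument. If it runs but shows nothing in a plain script, a call to matplotlib's plt.show() afterwards may be what is missing.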
Try another criterion, set a higher max_depth, etc. Sometimes datasets have unidentifiable records. For example, two observations with the exact same values in all features, but different target labels. Is this the case in your dataset?
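To check for the conflicting records mentioned above, one option is to group by all feature columns and count distinct labels per group. The toy DataFrame and the column names f1, f2 and target are made up for illustration.

```python
import pandas as pd

# Toy data: rows 0 and 1 share identical features but have different labels.
df = pd.DataFrame(
    {
        "f1": [1, 1, 2, 2, 3],
        "f2": [0, 0, 1, 1, 5],
        "target": [0, 1, 1, 1, 0],
    }
)
feature_cols = ["f1", "f2"]

# Number of distinct target values for each unique combination of feature values.
labels_per_group = df.groupby(feature_cols)["target"].nunique()

# Groups with more than one label cannot be separated by any tree.
conflicting = labels_per_group[labels_per_group > 1]
print(conflicting)
```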

Emphasis on a feature while training a vanilla nn

I have some 360 odd features on which I am training my neural network model.
The accuracy I am getting is abysmally bad. There is one feature amongst the 360 that is more important than the others.
Right now, it does not enjoy any special status amongst the other features.
Is there a way to lay emphasis on one of the features while training the model? I believe this could improve my model's accuracy.
I am using Python 3.5 with Keras and Scikit-learn.
EDIT: I am attempting a regression problem
Any help would be appreciated
First of all, I would make sure that this feature alone has decent predictive power, but I am assuming that you have already checked this.
Then, one approach that you could take, is to "embed" your 359 other features in a first layer, and only feed in your special feature once you have compressed the remaining information.
Contrary to what most tutorials make you believe, you do not have to add in all features already in the first layer, but can technically insert them at any point in time (or even multiple times).
The first layer that captures your other inputs is then some form of "PCA approximator", where you are embedding a high-dimensional feature space (359 dimensions) into something that is less dominant over your other feature (maybe 20-50 dimensions as a starting point?)
Of course there is no guarantee that this will work, but you might have a much better chance of getting attention on your special feature. That said, I am fairly certain that in general you should still see an increase in performance if the single feature is strongly enough correlated with your output.
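The wiring described above can be sketched as a plain NumPy forward pass. This is only an illustration of the data flow, with random untrained weights and an assumed bottleneck width of 32. In Keras the same structure would typically be built with the functional API, using two Input tensors and a Concatenate layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_other, n_compressed = 4, 359, 32  # sizes are assumptions for the demo

x_other = rng.normal(size=(n_samples, n_other))  # the 359 ordinary features
x_special = rng.normal(size=(n_samples, 1))      # the one important feature

# First layer: compress the 359 features into a small representation (ReLU).
W1 = rng.normal(scale=0.05, size=(n_other, n_compressed))
b1 = np.zeros(n_compressed)
compressed = np.maximum(0.0, x_other @ W1 + b1)

# Inject the special feature only after the compression step.
merged = np.concatenate([compressed, x_special], axis=1)

# Output layer for regression: a single linear unit.
W2 = rng.normal(scale=0.1, size=(n_compressed + 1, 1))
b2 = np.zeros(1)
y_hat = merged @ W2 + b2

print(merged.shape, y_hat.shape)
```

After the merge the special feature is one of 33 inputs to the next layer instead of one of 360, which is the sense in which it gets more weight.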
The other question that is still open is the kind of task you are training for, i.e., whether you are doing some form of classification (if so, how many classes?), or regression. This might also influence architectural choices, and the amount of focus you can/should put on a single feature.
There are several feature selection and importance techniques in machine learning. Please follow this link.

How to see correlation between features in scikit-learn?

I am developing a model that predicts whether an employee keeps their job or leaves the company.
The features are as below
left (boolean)
During feature analysis, I tried two approaches and got different results for the features, as shown in the image.
When I plot a heatmap it can be seen that satisfaction_level has a negative correlation with left.
On the other hand, if I just use pandas for analysis I got results something like this
In the above image, it can be seen that satisfaction_level is quite important in the analysis since employees with higher satisfaction_level retain the job.
In the case of time_spend_company, however, the heatmap shows it is important, whereas in the second image the difference does not look significant.
Now I am confused about whether to keep this as one of my features, and which approach I should use to choose features.
Someone please help me with this.
BTW I am doing ML in scikit-learn and the data is taken from here.
Correlation between features has little to do with feature importance. Your heat map is correctly showing correlation.
In fact, in most cases when you are talking about feature importance, you must provide the context of the model you are using. Different models may choose different features as important. Moreover, many models assume that the data is IID (independent and identically distributed random variables), so correlation close to zero is desirable.
For example, for linear regression in sklearn you can examine the coef_ attribute to get an estimate of feature importance.
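A small sketch contrasting the two views on synthetic data: the pandas correlation matrix on one side, and the coefficients of a fitted linear model on the other. The column names a, b, c and y are invented, and the features are standardized first so that the coefficient magnitudes are comparable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "a": rng.normal(size=500),
        "b": rng.normal(size=500),
        "c": rng.normal(size=500),  # pure noise, unrelated to the target
    }
)
df["y"] = 3.0 * df["a"] - 1.0 * df["b"] + rng.normal(scale=0.1, size=500)

# View 1: pairwise correlations (this is what the heat map displays).
corr = df.corr()
print(corr["y"])

# View 2: model-specific importance via the fitted coefficients.
features = ["a", "b", "c"]
X = StandardScaler().fit_transform(df[features])
model = LinearRegression().fit(X, df["y"])
coefs = pd.Series(model.coef_, index=features)
print(coefs)
```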

How can I evaluate the implicit feedback ALS algorithm for recommendations in Apache Spark?

How can you evaluate the implicit feedback collaborative filtering algorithm of Apache Spark, given that the implicit "ratings" can vary from zero to anything, so a simple MSE or RMSE does not have much meaning?
To answer this question, you'll need to go back to the original paper that defined implicit feedback and the ALS algorithm: Collaborative Filtering for Implicit Feedback Datasets by Yifan Hu, Yehuda Koren and Chris Volinsky.
What is implicit feedback?
In the absence of explicit ratings, recommender systems can infer user preferences from the more abundant implicit feedback, which indirectly reflects opinion through observing user behavior.
Implicit feedback can include purchase history, browsing history, search patterns, or even mouse movements.
Do the same evaluation techniques, such as RMSE and MSE, apply here?
It is important to realize that we do not have reliable feedback regarding which items are disliked. The absence of a click or purchase can be related to multiple reasons. We also can't track user reactions to our recommendations.
Thus, precision-based metrics, such as RMSE and MSE, are not very appropriate, as they only make sense when we know which items users dislike.
However, purchasing or clicking on an item is an indication of having an interest in it. I wouldn't say like because a click or a purchase might have different meaning depending on the context of the recommender.
This makes recall-oriented measures applicable in this case. Under this scenario, several metrics have been introduced, the most important being the Mean Percentage Ranking (MPR), also known as Percentile Ranking.
Lower values of MPR are more desirable. The expected value of MPR for random predictions is 50%, and thus MPR > 50% indicates an algorithm no better than random.
Of course, it's not the only way to evaluate recommender systems with implicit ratings but it's the most common one used in practice.
For more information about this metric, I advise you to read the paper stated above.
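As a rough NumPy sketch of the metric (not Spark code): MPR is the average percentile position of the items a user actually interacted with in the test period, weighted by the strength of that interaction, where 0% is the top of the user's ranked list and 100% the bottom. The function name and the dense-matrix inputs are assumptions for illustration. A real evaluation would usually also exclude items already seen in training.

```python
import numpy as np


def mean_percentile_rank(scores, r_test):
    """MPR for dense (n_users, n_items) arrays of predicted scores and test feedback.

    Hypothetical helper; returns a value in [0, 1], lower is better.
    """
    scores = np.asarray(scores, dtype=float)
    r_test = np.asarray(r_test, dtype=float)
    n_users, n_items = scores.shape

    # Position of every item in each user's list sorted by descending score (0 = top).
    order = np.argsort(-scores, axis=1, kind="stable")
    positions = np.empty_like(order)
    rows = np.arange(n_users)[:, None]
    positions[rows, order] = np.arange(n_items)[None, :]

    # Convert positions to percentiles: 0.0 for the top item, 1.0 for the last.
    percentile = positions / (n_items - 1)

    # Weighted average of percentiles, weights being the observed test feedback.
    return float((r_test * percentile).sum() / r_test.sum())


scores = [[0.9, 0.1, 0.5]]
print(mean_percentile_rank(scores, [[1, 0, 0]]))  # test item is ranked first
print(mean_percentile_rank(scores, [[0, 1, 0]]))  # test item is ranked last
```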
Ok, now we know what we are going to use but what about Apache Spark?
Apache Spark still doesn't provide an out-of-the-box implementation for this metric but hopefully not for long. There is a PR waiting to be validated https://github.com/apache/spark/pull/16618 concerning adding RankingEvaluator for spark-ml.
The implementation nevertheless isn't complicated. You can refer to the code here if you are interested in getting it sooner.
I hope this answers your question.
One way of evaluating it is to split the data in a training set and a test set with a time cut. This way you train the model using your training set then run predictions and check the predictions against the test set.
Now for evaluation you can use Precision, Recall, F1... metrics.

Bayesian statistics, machine learning: prior v.s hyperprior

I have a linear regression (say) model
p(t|x;w) = N(t ; m , D);
Being Bayesian, I can put a Gaussian prior on parameter w.
However, I've realized that for some models we can put a Gaussian-Wishart hyperprior on the Gaussian to be 'more' Bayesian. Is this correct? Are both of these models valid Bayesian models?
It seems to me that we can always add a hyperprior, a hyper-hyperprior, and so on, because the result will still be a valid probabilistic model.
I am wondering what the difference is between putting a prior on the parameters and putting a hyperprior on the prior. Are they both Bayesian?
Using a hyperprior is still "valid Bayesian" in the sense that this sort of hierarchical modeling comes naturally to Bayesian models, and just about any book/course on Bayesian modeling does go through the use of hyperpriors.
It's completely fine to use Normal-Wishart as the prior (or hyperprior) of a Gaussian distribution. I guess it's, in some sense, even "more Bayesian" to do so if doing so models the phenomenon at hand more accurately.
I'm not sure what you mean by "are they both Bayesian" when it comes to the difference between using a prior and a hyperprior. Bayesian hierarchical models with hyperpriors are still Bayesian models.
Using hyperpriors only makes sense in a hierarchical Bayesian model. In that case you would be looking at multiple groups and estimate a group specific coefficient w_group based on group specific priors, with coefficients drawn from a global hyperprior.
If your prior and hyperprior reside on the same hierarchical level, which seems to be the case you are thinking about, then the effect on the results is the same as using a simple prior with a wider standard deviation. Since it still incurs additional computational cost, such stacking should be avoided.
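The "wider standard deviation" point can be checked by simulation in the simplest case, assumed here for illustration: a Gaussian prior on a scalar w whose mean itself has a Gaussian hyperprior. Marginally, w is then Gaussian with variance tau^2 + sigma^2. The values of tau and sigma are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
tau, sigma = 2.0, 1.0  # hyperprior and prior standard deviations (arbitrary choices)

# Stacked version: mu ~ N(0, tau^2), then w | mu ~ N(mu, sigma^2).
mu = rng.normal(0.0, tau, size=n)
w_stacked = rng.normal(mu, sigma)

# Single wider prior: w ~ N(0, tau^2 + sigma^2).
w_flat = rng.normal(0.0, np.sqrt(tau**2 + sigma**2), size=n)

print(w_stacked.var(), w_flat.var())  # both should be close to tau**2 + sigma**2
```

This equivalence is specific to a hyperprior on the mean with a single group. With several groups sharing the hyperprior, the hierarchy does change the inference, which is the case the previous paragraph describes.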
There is a lot of statistical literature on how to pick non-informative priors. Often the theoretically best solutions are improper distributions (their total integral is infinite), and there is a large risk of getting improper posteriors without well-defined means or even medians. So for practical purposes, picking wide normal distributions usually works best.
