xgboost: handling of missing values for split candidate search

In Section 3.4 of their paper, the authors explain how they handle missing values when searching for the best candidate split during tree growing. Specifically, they learn a default direction for nodes whose splitting feature has missing values in the current instance set. At prediction time, if the prediction path goes through such a node and the feature value is missing, the default direction is followed.
However, the prediction phase would break down when the feature value is missing and the node does not have a default direction (and this can occur in many scenarios). In other words, how do they associate a default direction with every node, even those whose splitting feature has no missing values in the active instance set at training time?

xgboost always accounts for a missing-value split direction, even if none are present in training. The default is the yes direction of the split criterion; the direction is learned from the data if any missing values are present in training.
From the author's link.
This can be observed with the following code:
require(xgboost)
data(agaricus.train, package='xgboost')
sum(is.na(agaricus.train$data))
##[1] 0
bst <- xgboost(data = agaricus.train$data,
               label = agaricus.train$label,
               max.depth = 4,
               eta = .01,
               nround = 100,
               nthread = 2,
               objective = "binary:logistic")
dt <- xgb.model.dt.tree(model = bst) ## records all the splits
> head(dt)
ID Feature Split Yes No Missing Quality Cover Tree Yes.Feature Yes.Cover Yes.Quality
1: 0-0 28 -1.00136e-05 0-1 0-2 0-1 4000.5300000 1628.25 0 55 924.50 1158.2100000
2: 0-1 55 -1.00136e-05 0-3 0-4 0-3 1158.2100000 924.50 0 7 679.75 13.9060000
3: 0-10 Leaf NA NA NA NA -0.0198104 104.50 0 NA NA NA
4: 0-11 7 -1.00136e-05 0-15 0-16 0-15 13.9060000 679.75 0 Leaf 763.00 0.0195026
5: 0-12 38 -1.00136e-05 0-17 0-18 0-17 28.7763000 10.75 0 Leaf 678.75 -0.0199117
6: 0-13 Leaf NA NA NA NA 0.0195026 763.00 0 NA NA NA
No.Feature No.Cover No.Quality
1: Leaf 104.50 -0.0198104
2: 38 10.75 28.7763000
3: NA NA NA
4: Leaf 9.50 -0.0180952
5: Leaf 1.00 0.0100000
6: NA NA NA
> all(dt$Missing == dt$Yes,na.rm = T)
[1] TRUE
Source code:
https://github.com/tqchen/xgboost/blob/8130778742cbdfa406b62de85b0c4e80b9788821/src/tree/model.h#L542
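The same check can also be sketched from the Python API (hedged: this assumes a reasonably recent xgboost Python package exposing Booster.trees_to_dataframe(), and uses a synthetic scikit-learn dataset with no missing values):

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

# Training data with no missing values at all.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
assert not np.isnan(X).any()

model = xgb.XGBClassifier(n_estimators=50, max_depth=4, learning_rate=0.01)
model.fit(X, y)

# Every non-leaf node still records a 'Missing' branch; in the R run above
# (no NaNs in training) that default branch coincides with the 'Yes' branch.
trees = model.get_booster().trees_to_dataframe()
splits = trees[trees["Feature"] != "Leaf"]
print((splits["Missing"] == splits["Yes"]).mean())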

My understanding of the algorithm is that, if no missing data is available at training time, a default direction is assigned probabilistically based on the distribution of the training data, i.e. just go in the direction taken by the majority of samples in the training set. In practice I'd say it's a bad idea to have missing data in your data set. Generally, the model will perform better if the data scientist cleans the data set up in a smart way before training the GBM algorithm, for example by replacing every NA with the mean/median value, or by imputing the value from the K nearest neighbours, averaging their values for that feature.
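As an illustration of those two strategies, a minimal sketch with scikit-learn (assuming a version that ships SimpleImputer and KNNImputer, roughly 0.22 or later; the toy array is made up):

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy feature matrix with a couple of missing entries.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# Replace each NaN with the column mean (use strategy="median" for the median).
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Replace each NaN with the average of the 2 nearest neighbours for that feature.
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)

print(mean_imputed)
print(knn_imputed)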
I'm also wondering why data would be missing at test time but not at training time. That seems to imply that the distribution of your data is evolving over time. An algorithm that can be trained as new data becomes available, like a neural net, may do better in your use case. Or you could always build a specialist model. For example, say the missing feature in your model is credit score, because some people may not allow you to access their credit. Why not train one model using credit and one not using credit? The model trained without credit may be able to recover much of the lift credit was providing by using other correlated features.

Thank you for sharing your thoughts @Josiah. Yes, I totally agree with you that it is better to avoid missing data in the dataset, but sometimes replacing it is not the optimal solution. In addition, if we have a learning algorithm such as GBM that can cope with missing values, why not give it a try? The scenario I'm thinking about is when you have some features with few missing values (<10%) or even fewer.
Regarding the second point, the scenario I have in mind is the following: the tree has already been grown to some depth, so the instance set is no longer the full one. For a new node, the best candidate is found to be a value of a feature f that originally contains some missing values, but not within the current instance set, so no default branch is defined. So even if f contains missing values in the training dataset, this node doesn't have a default branch, and a test instance falling here would be stuck.
Maybe you are right and the default branch will be the one with more examples, if no missing values are present. I'll try to reach out to the authors and post the reply here, if any.

Related

Assessing features to labelencode or get_dummies() on dataset in Python

I'm working on the heart attack analysis on Kaggle in Python.
I am a beginner and I'm trying to figure out whether it's still necessary to one-hot-encode or LabelEncode these features. I see so many people encoding the values for this project, but I'm confused because everything already looks scaled (apart from age, thalach, oldpeak and slope).
age: age in years
sex: (1 = male; 0 = female)
cp: ordinal values 1-4
thalach: maximum heart rate achieved
exang: (1 = yes; 0 = no)
oldpeak: depression induced by exercise
slope: the slope of the peak exercise
ca: values (0-3)
thal: ordinal values 0-3
target: 0= less chance, 1= more chance
Would you say it's still necessary to one-hot-encode, or should I just use a StandardScaler straight away?
I've seen many people encode the whole dataset for this project, but it makes no sense to me to do so. Please confirm whether using only StandardScaler would be enough.
When you apply StandardScaler, the columns end up with values in the same range. That helps the model keep its weights bounded, so gradient descent does not shoot off while converging, and the model converges faster.
Independently, in order to decide between ordinal values and one-hot encoding, consider whether the distance between the column values is meaningful. If it is, choose ordinal values; if you know the hierarchy of the categories, you can assign the ordinal values manually. Otherwise, you should use LabelEncoder. It seems the heart attack data is already given with manually assigned ordinal values, for example higher chest pain = 4.
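As a sketch of that split between scaling, one-hot encoding and leaving ordinal columns alone (hedged: which columns count as nominal here is an assumption made purely for illustration, and it uses scikit-learn's ColumnTransformer):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny frame with the same column names as the heart-attack data (values made up).
df = pd.DataFrame({
    "age": [63, 37, 41],
    "thalach": [150, 187, 172],
    "cp": [3, 2, 1],    # ordinal: left as-is
    "thal": [1, 2, 2],  # treated as nominal here, only for illustration
})

pre = ColumnTransformer(
    [("scale", StandardScaler(), ["age", "thalach"]),  # continuous columns
     ("onehot", OneHotEncoder(), ["thal"])],           # nominal -> dummy columns
    remainder="passthrough")                           # ordinal/binary pass through

X = pre.fit_transform(df)
print(X.shape)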
Also, it is helpful to refer to notebooks that perform well. Take a look at the one below for reference.
95% Accuracy - https://www.kaggle.com/code/abhinavgargacb/heart-attack-eda-predictor-95-accuracy-score

STL decomposition getting rid of NaN values

The following links were investigated but didn't provide the answer I was looking for or fix my problem: First, Second.
Due to confidentiality issues I cannot post the actual decomposition, but I can show my current code and give the length of the data set; if this isn't enough I will remove the question.
import numpy as np
from statsmodels.tsa import seasonal
def stl_decomposition(data):
    data = np.array(data)
    data = [item for sublist in data for item in sublist]
    decomposed = seasonal.seasonal_decompose(x=data, freq=12)
    seas = decomposed.seasonal
    trend = decomposed.trend
    res = decomposed.resid
A plot shows that it decomposes correctly according to an additive model. However, the trend and residual series have NaN values for the first and last 6 months. The current data set is of size 10*12; ideally this should work for something as small as only 2 years.
Is this still too small, as stated in the first link? I.e. do I need to extrapolate the extra points myself?
EDIT: It seems that half of the frequency is always NaN at both ends of the trend and residual series. The same still holds when decreasing the size of the data set.
According to this GitHub link, another user had a similar question and 'fixed' the issue: to avoid NaNs, an extra parameter can be passed.
decomposed = seasonal.seasonal_decompose(x=data, freq=12, extrapolate_trend='freq')
It will then use linear least squares to approximate the missing values. (Source)
Obviously the information was literally in their documentation and clearly explained, but I completely missed/misinterpreted it. Hence I am answering my own question, to save anyone with the same issue the adventure I had.
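A minimal end-to-end sketch of the behaviour (hedged: it uses a synthetic monthly series, and newer statsmodels versions name the argument period rather than freq, so adjust to your version):

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Two years of monthly data: linear trend + yearly seasonality + noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2018-01-01", periods=24, freq="MS")
values = (np.linspace(10, 20, 24)
          + 3 * np.sin(np.arange(24) * 2 * np.pi / 12)
          + rng.normal(0, 0.5, 24))
series = pd.Series(values, index=idx)

# Without extrapolation, the first and last period//2 trend values are NaN.
plain = seasonal_decompose(series, period=12)
print(plain.trend.isna().sum())   # 12 (6 at each end)

# With extrapolate_trend='freq', the ends are filled by linear least squares.
fixed = seasonal_decompose(series, period=12, extrapolate_trend="freq")
print(fixed.trend.isna().sum())   # 0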
According to the parameter definition below, setting extrapolate_trend to something other than 0 makes the trend estimation at the ends fall back to a different estimation method. I faced this issue when I had only a few observations to estimate from.
extrapolate_trend : int or 'freq', optional
If set to > 0, the trend resulting from the convolution is
linear least-squares extrapolated on both ends (or the single one
if two_sided is False) considering this many (+1) closest points.
If set to 'freq', use `freq` closest points. Setting this parameter
results in no NaN values in trend or resid components.

Using scipy.stats.entropy on gmm.predict_proba() values

Background so I don't throw out an XY problem -- I'm trying to check the goodness of fit of a GMM because I want statistical back-up for why I'm choosing the number of clusters I've chosen to group these samples. I'm checking AIC, BIC, entropy, and root mean squared error. This question is about entropy.
I've used kmeans to cluster a bunch of samples, and I want an entropy greater than 0.9 (stats and psychology are not my expertise and this problem is both). I have 59 samples; each sample has 3 features in it. I look for the best covariance type via
from sklearn import mixture

cv_types = ['spherical', 'tied', 'diag', 'full']  # the four covariance types
n_components_range = [2]
for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = mixture.GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(data3)
where the n_components_range is just [2] (later I'll check 2 through 5).
Then I take the GMM with the lowest AIC or BIC of the four, saved as best_eitherAB (not shown). I want to see whether the label assignments of the predictions are stable across time (I want to run 1000 iterations), so I know I then need to calculate the entropy, which needs the class assignment probabilities. So I predict the class assignment probabilities via gmm's method,
probabilities = best_eitherAB.predict_proba(data3)
all_probabilities.append(probabilities)
After all the iterations, I have an array of 1000 arrays, each containing 59 rows (the sample size) by 2 columns (one per class). Each inner row of two sums to 1, making it a probability.
Now, I'm not entirely sure what to do regarding the entropy. I can just feed the whole thing into scipy.stats.entropy,
entr = scipy.stats.entropy(all_probabilities)
and it spits out numbers: I get a 2-item numpy array for each of my samples. I could feed in just one of the 1000 tests and get one small array of two items, or I could feed in just a single column and get a single value back. But I don't know what this represents, and the numbers are between 1 and 3.
So my questions are: am I totally misunderstanding how I can use scipy.stats.entropy to calculate the stability of my classes? And if I'm not, what's the best way to find a single-number entropy that tells me how good my model selection is?
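For reference, a small sketch of what scipy.stats.entropy does with a 2-D input versus computing one entropy per sample (the 59x2 shape mirrors the case above; the probabilities are randomly generated only for illustration):

import numpy as np
from scipy.stats import entropy

# Fake class-membership probabilities for 59 samples and 2 classes.
rng = np.random.default_rng(0)
p = rng.dirichlet([1, 1], size=59)   # each row sums to 1

# Default behaviour: the input is renormalised and reduced along axis 0
# (down each column), which is why a (59, 2) input yields just 2 numbers.
print(entropy(p).shape)              # (2,)

# One entropy per sample instead: apply it row by row.
row_entropy = np.array([entropy(row) for row in p])
print(row_entropy.shape)             # (59,), each value in [0, ln 2]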

spark - MLlib: transform and manage categorical features

Consider big datasets with 2 billion+ samples and approximately 100+ features per sample. Among these, roughly 10% of the features are numerical/continuous variables and the rest are categorical variables (position, language, URL, etc.).
Let's use some examples:
e.g: dummy categorical feature
feature: Position
real values: SUD | CENTRE | NORTH
encoded values: 1 | 2 | 3
...it would make sense to use a reduction like SVD, because the distance between SUD and NORTH is greater than between SUD and CENTRE; moreover, it is possible to encode this variable (e.g. with OneHotEncoder or StringIndexer) because of the small cardinality of its value set.
e.g: real categorical feature
feature: url
real values: very high cardinality
encoded values: ?????
1) In MLlib, about 90% of the models work only with numerical values (apart from Frequent Itemset and DecisionTree techniques).
2) Feature transformers/reducers/extractors such as PCA or SVD are not good for this kind of data, and there is no implementation of (e.g.) MCA.
a) What would your approach be to engage with this kind of data in Spark, or using MLlib?
b) Do you have any suggestions for coping with this many categorical values?
c) After reading a lot of the literature, and counting the models implemented in Spark, my idea is that, for making inference on one of those features using the other (categorical) ones, the models at point 1 could be the best choice. What do you think about it?
(To standardize this into a classical use case, you can imagine the problem of inferring the gender of a person from visited URLs and other categorical features.)
Given that I am a newbie in regards to MLlib, may I ask you to provide a concrete example?
Thanks in advance
Well, first I would say Stack Overflow works in a different way: you should be the one providing a working example of the problem you are facing, and we help you out using that example.
Anyway, I got intrigued by the use of categorical values like the one you show as position. If this is a categorical value, as you mention, with 3 levels SUD, CENTRE, NORTH, there is no distance between them if they are truly categorical. In this sense I would create dummy variables like:
SUD_Cat CENTRE_Cat NORTH_Cat
SUD 1 0 0
CENTRE 0 1 0
NORTH 0 0 1
This is a true dummy representation of a categorical variable.
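A sketch of how that dummy representation could be produced in Spark MLlib (hedged: the OneHotEncoder API changed across Spark versions; this assumes roughly Spark 3.x with pyspark):

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("SUD",), ("CENTRE",), ("NORTH",), ("SUD",)],
                           ["position"])

# Map the string levels to indices, then expand them into indicator vectors.
indexed = StringIndexer(inputCol="position",
                        outputCol="position_idx").fit(df).transform(df)

encoder = OneHotEncoder(inputCols=["position_idx"],
                        outputCols=["position_vec"],
                        dropLast=False)   # keep all 3 indicator columns
encoder.fit(indexed).transform(indexed).show(truncate=False)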
On the other hand, if you want to take that distance into account, then you have to create another feature which encodes this distance explicitly, but that is not a dummy representation.
If the problem you are facing is that, after writing your categorical features as dummy variables (note that now all of them are numerical), you have very many features and you want to reduce your feature space, then that is a different problem.
As a rule of thumb, I try to use the entire feature space first (a plus, since Spark's computing power allows you to run modelling tasks on big datasets); if it is too big, then I would go for dimensionality reduction techniques, PCA etc.

Using kappa coefficient to evaluate results of crowd sourcing

I have 4 sets of manually tagged data for 0 and 1, by 4 different people. I have to get the final labelled data in terms of 0 and 1 using the 4 sets of manually tagged data. I have calculated the degree of agreement between the users as
A-B : 0.3276,
A-C : 0.3263,
A-D : 0.4917,
B-C : 0.2896,
B-D : 0.4052,
C-D : 0.3540.
I do not know how to use this to calculate the final data as a single set.
Please help.
The Kappa coefficient works only for a pair of annotators. For more than two, you need to employ an extension of it. One popular way of doing so is to use this expansion proposed by Richard Light in 1971, or to use the average expected agreement for all annotator pairs, proposed by Davies and Fleiss in 1982. I am not aware of any readily available calculator that will compute these for you, so you may have to implement the code yourself.
There is this Wikipedia page on Fleiss' kappa, however, which you might find helpful.
These techniques can only be used for nominal variables. If your data is not on the nominal scale, use a different measure like the intraclass correlation coefficient.
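For what it's worth, a hedged sketch of computing Fleiss' kappa over all four annotators at once, assuming a statsmodels version that ships statsmodels.stats.inter_rater (the 0/1 ratings below are made up):

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical 0/1 labels from the four annotators A-D, one row per item.
ratings = np.array([
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
])

# Convert raw ratings (items x raters) into an items x categories count table,
# then compute Fleiss' kappa across all annotators.
table, categories = aggregate_raters(ratings)
print(fleiss_kappa(table, method='fleiss'))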

Resources