decision tree for predictive modelling

decision tree for predictive modelling - statistics

I have a satellite data that provides radiance which i use to compute Flux (using surface and cloud info). Now using a regression method, I can have a mathematical model relating radiance and flux and can be used to predict the flux for new radiance values without the other new inputs.
Is it possible to do same using decision trees or regression trees..? In a regression there is mathematical equation connecting dependent and independent variable. using decision trees, how you develop such a model?

It's best if you ask this in stats.stackexchange.com. A simple global regression model is a special case of a regression tree where there is only one node, so you can definitely apply regression tree for your data. Decision trees are generally used for classification, not regression.

Related

Is it possible to customise tf.estimator for Unsupervised Learning with self-defined evaluation graph not having loss?

I am implementing a item2vec model using the idea of word2vec
with tf.estimator API for product recommendation.
There's no problem implementing training part with tf.estimator. The process is same as word2vec, and I see each transactions as a sentence. Only difference is how to generate training input:(target_item, context_item) pairs. After training the pseudo-classification problem, I could use trained embedding vector for each items to measure relationship between them.
The problem is, for evaluation part, it is not a typical supervised learning evaluation, ie. with eval data as input, going through the same graph, we obtain predictions and accuracy.
The evaluation input data I would like to use, is in a totally different format from training input data.
Format of Eval input data: (target_item, {context_item1, context_item2, ...}). With this, I could obtain top_k nearest items for each context_items and then see if the target_item is in the collection of these nearest items, so that I could obtain a hit-ratio from it.
However, tf.estimator.EstimatorSpec() for mode = MODE.EVAL requires a loss as input. So, does it mean evaluation can only reuse part of the training graph? What could I do if I don't have a loss function for evaluation in my case, as the evaluation does not go through the classification anymore?
Many thanks.

Incremental learning - Set Initial Weights or values for Parameters from previous model for ML algorithm in Spark 2.0

I am trying for setting the initial weights or parameters for a machine learning (Classification) algorithm in Spark 2.x. Unfortunately, except for MultiLayerPerceptron algorithm, no other algorithm is providing a way to set the initial weights/parameter values.
I am trying to solve Incremental learning using spark. Here, I need to load old model re-train the old model with new data in the system. How can I do this?
How can I do this for other algorithms like:
Decision Trees
Random Forest
SVM
Logistic Regression
I need to experiment multiple algorithms and then need to choose the best performing one.

How can I do this for other algorithms like:
Decision Trees
Random Forest
You cannot. Tree based algorithms are not well suited for incremental learning, as they look at the global properties of the data and have no "initial weights or values" that can be used to bootstrap the process.
Logistic Regression
You can use StreamingLogisticRegressionWithSGD which exactly implements required process, including setting initial weights with setInitialWeights.
SVM
In theory it could be implemented similarly to streaming regression StreamingLogisticRegressionWithSGD or StreamingLinearRegressionWithSGD, by extending StreamingLinearAlgorithm, but there is no such implementation built-in, ans since org.apache.spark.mllib is in a maintanance mode, there won't be.

It's not based on spark, but there is a C++ incremental decision tree.
see gaenari.
Continuous chunking data can be inserted and updated, and rebuilds can be run if concept drift reduces accuracy.

Modelling probabilities in a regularized (logistic?) regression model in python

I would like to fit a regression model to probabilities. I am aware that linear regression is often used for this purpose, but I have several probabilities at or near 0.0 and 1.0 and would like to fit a regression model where the output is constrained to lie between 0.0 and 1.0. I want to be able to specify a regularization norm and strength for the model and ideally do this in python (but an R implementation would be helpful as well). All the logistic regression packages I've found seem to be only suited for classification whereas this is a regression problem (albeit one where I want to use the logit link function). I use scikits-learn for my classification and regression needs so if this regression model can be implemented in scikits-learn, that would be fantastic (it seemed to me that this is not possible), but I'd be happy about any solution in python and/or R.

The question has two issues, penalized estimation and fractional or proportions data as dependent variable. I worked on each separately but never tried the combination.
Penalization
Statsmodels has had L1 regularized Logit and other discrete models like Poisson for some time. In recent months there has been a lot of effort to support more penalization but it is not in statsmodels yet. Elastic net for linear and Generalized Linear Model (GLM) is in a pull request and will be merged soon. More penalized GLM like L2 penalization for GAM and splines or SCAD penalization will follow over the next months based on pull requests that still need work.
Two examples for the current L1 fit_regularized for Logit are here
Difference in SGD classifier results and statsmodels results for logistic with l1 and https://github.com/statsmodels/statsmodels/blob/master/statsmodels/examples/l1_demo/short_demo.py
Note, the penalization weight alpha can be a vector with zeros for coefficients like the constant if they should not be penalized.
http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit_regularized.html
Fractional models
Binary and binomial models in statsmodels do not impose that the dependent variable is binary and work as long as the dependent variable is in the [0,1] interval.
Fractions or proportions can be estimated with Logit as Quasi-maximum likelihood estimator. The estimates are consistent if the mean function, logistic, cumulative normal or similar link function, is correctly specified but we should use robust sandwich covariance for proper inference. Robust standard errors can be obtained in statsmodels through a fit keyword cov_type='HC0'.
Best documentation is for Stata http://www.stata.com/manuals14/rfracreg.pdf and the references therein. I went through those references before Stata had fracreg, and it works correctly with at least Logit and Probit which were my test cases. (I don't find my scripts or test cases right now.)
The bad news for inference is that robust covariance matrices have not been added to fit_regularized, so the correct sandwich covariance is not directly available. The standard covariance matrix and standard errors of the parameter estimates are derived under the assumption that the model, i.e. the likelihood function, is correctly specified, which will not be the case if the data are fractions and not binary.
Besides using Quasi-Maximum Likelihood with binary models, it is also possible to use a likelihood that is defined for fractional data in (0, 1). A popular model is Beta regression, which is also waiting in a pull request for statsmodels and is expected to be merged within the next months.

How should decide about using linear regression model or non linear regression model

How should one decide between using a linear regression model or non-linear regression model?
My goal is to predict Y.
In case of simple x and y dataset I could easily decide which regression model should be used by plotting a scatter plot.
In case of multi-variant like x1,x2,...,xn and y. How can I decide which regression model has to be used? That is, How will I decide about going with simple linear model or non linear models such as quadric, cubic etc.
Is there any technique or statistical approach or graphical plots to infer and decide which regression model has to be used? Please advise.

That is a pretty complex question.
You start visually first: if the data is normally distributed, and satisfy conditions for classical linear model, you use linear model. I normally start by making a scatter plot matrix to observe the relationships. If it is obvious that the relationship is non linear then you use non-linear model. But, a lot of times, I visually inspect, assuming that the number of factors are just not too many.
For example, this would be a non linear model:
However, if you want to use data mining (and computationally demanding methods), I suggest starting with stepwise regression. What you do is set a model evaluation criteria first: could be R^2 for example. You start a model with nothing and sequentially add predictors or permutations of them until your model evaluation criteria is "maximized". However, adding new predictor almost always increases R^2, a type of over-fitting.
The solution is to split the data into training and testing. You should make model based on the training and evaluate the mean error on testing. The best model will be the one that that minimized mean error on the testing set.
If your data is sparse, try integrating ridge or lasso regression in model evaluation.
Again, this is a kind of a complex question. The answer also kind of depends on whether you are building descriptive or explanatory model.

Setting feature weights for KNN

I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
set the feature weights for each feature when "training" the KNN learner.
learn what the optimal weight values are with or without pre-processing the data.
On a related note, I understand generally KNN does not require training but since sklearn implements it using KDTrees, the tree must be generated from the training data. However, this sounds like its turning KNN into a binary tree problem. Is that the case?
Thanks.

kNN is simply based on a distance function. When you say "feature two is more important than others" it usually means difference in feature two is worth, say, 10x difference in other coords. Simple way to achive this is by multiplying coord #2 by its weight. So you put into the tree not the original coords but coords multiplied by their respective weights.
In case your features are combinations of the coords, you might need to apply appropriate matrix transform on your coords before applying weights, see PCA (principal component analysis). PCA is likely to help you with question 2.

The answer to question to is called "metric learning" and currently not implemented in Scikit-learn. Using the popular Mahalanobis distance amounts to rescaling the data using StandardScaler. Ideally you would want your metric to take into account the labels.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string