How to conduct multivariate regression in Excel? - excel

I've been searching the web for possibilities to conduct multivariate regressions in Excel, where I've seen that Analysis ToolPak can make the job done.
However it seems that Analysis ToolPak can handle multivariable linear regression but not multivariate linear regression (where the latter is that one may have more than one dependent variable Y1,...,Yn = x1+x2..+xn and the former that a dependent variable can have multiple independent variables Y = x1+x2+..+xn).
Is there a way to conduct multivariate regressions in Excel or should I start looking for other programs like R?
Thanks in advance

Is this what you are looking for?
http://smallbusiness.chron.com/run-multivariate-regression-excel-42353.html

Related

Decision Trees - Scikit, Python

I am trying to create a decision tree based on some training data. I have never created a decision tree before, but have completed a few linear regression models. I have 3 questions:
With linear regression I find it fairly easy to plot graphs, fit models, group factor levels, check P statistics etc. in an iterative fashion until I end up with a good predictive model. I have no idea how to evaluate a decision tree. Is there a way to get a summary of the model, (for example, .summary() function in statsmodels)? Should this be an iterative process where I decide whether a factor is significant - if so how can I tell?
I have been very unsuccessful in visualising the decision tree. On the various different ways I have tried, the code seems to run without any errors, yet nothing appears / plots. The only thing I can do successfully is tree.export_text(model), which just states feature_1, feature_2, and so on. I don't know what any of the features actually are. Has anybody come across these difficulties with visualising / have a simple solution?
The confusion matrix that I have generated is as follows:
[[ 0 395]
[ 0 3319]]
i.e. the model is predicting all rows to the same outcome. Does anyone know why this might be?
Scikit-learn is a library designed to build predictive models, so there are no tests of significance, confidence intervals, etc. You can always build your own statistics, but this is a tedious process. In scikit-learn, you can eliminate features recursively using RFE, RFECV, etc. You can find a list of feature selection algorithms here. For the most part, these algorithms get rid off the least important feature in each loop according to feature_importances (where the importance of each feature is defined as its contribution to the reduction in entropy, gini, etc.).
The most straight forward way to visualize a tree is tree.plot_tree(). In particular, you should try passing the names of the features to feature_names. Please show us what you have tried so far if you want a more specific answer.
Try another criterion, set a higher max_depth, etc. Sometimes datasets have unidentifiable records. For example, two observations with the exact same values in all features, but different target labels. Is this the case in your dataset?

What are the algorithms used in the pROC package for ROC analysis?

I am trying to figure out what algorithms are used within the pROC package to conduct ROC analysis. For instance what algorithm corresponds to the condition 'algorithm==2'? I only recently started using R in conjunction with Python because of the ease of finding CI estimates, significance test results etc. My Python code uses Linear Discriminant Analysis to get results on a binary classification problem. When using the pROC package to compute confidence interval estimates for AUC, sensitivity, specificity, etc., all I have to do is load my data and run the package. The AUC I get when using pROC is the same as the AUC that is returned by my Python code that uses Linear Discriminant Analysis (LDA). In order to be able to report consistent results I am trying to find out if LDA is one of the algorithm choices within pROC? Any ideas on this or how to go about figuring this out would be very helpful. Where can I access the source code for pROC?
The core algorithms of pROC are described in a 2011 BMC Bioinformatics paper. Some algorithms added later are described in the PDF manual. As every CRAN package, the source code is available from the CRAN package page. As many R packages these days it is also on GitHub.
To answer your question specifically, unfortunately I don't have a good reference for the algorithm to calculate the points of the ROC curve with algorithm 2. By looking at it you will realize it is ultimately equivalent to the standard ROC curve algorithm, albeit more efficient when the number of thresholds increases, as I tried to explain in this answer to a question on Cross Validated. But you have to trust me (and most packages calculating ROC curves) on it.
Which binary classifier you use, whether LDA or other, is irrelevant to ROC analysis, and outside the scope of pROC. ROC analysis is a generic way to assesses predictions, scores, or more generally signal coming out of a binary classifier. It doesn't assess the binary classifier itself, or the signal detector, only the signal itself. This makes it very easy to compare different classification methods, and is instrumental to the success of ROC analysis in general.

How should decide about using linear regression model or non linear regression model

How should one decide between using a linear regression model or non-linear regression model?
My goal is to predict Y.
In case of simple x and y dataset I could easily decide which regression model should be used by plotting a scatter plot.
In case of multi-variant like x1,x2,...,xn and y. How can I decide which regression model has to be used? That is, How will I decide about going with simple linear model or non linear models such as quadric, cubic etc.
Is there any technique or statistical approach or graphical plots to infer and decide which regression model has to be used? Please advise.
That is a pretty complex question.
You start visually first: if the data is normally distributed, and satisfy conditions for classical linear model, you use linear model. I normally start by making a scatter plot matrix to observe the relationships. If it is obvious that the relationship is non linear then you use non-linear model. But, a lot of times, I visually inspect, assuming that the number of factors are just not too many.
For example, this would be a non linear model:
However, if you want to use data mining (and computationally demanding methods), I suggest starting with stepwise regression. What you do is set a model evaluation criteria first: could be R^2 for example. You start a model with nothing and sequentially add predictors or permutations of them until your model evaluation criteria is "maximized". However, adding new predictor almost always increases R^2, a type of over-fitting.
The solution is to split the data into training and testing. You should make model based on the training and evaluate the mean error on testing. The best model will be the one that that minimized mean error on the testing set.
If your data is sparse, try integrating ridge or lasso regression in model evaluation.
Again, this is a kind of a complex question. The answer also kind of depends on whether you are building descriptive or explanatory model.

Comparing a Poisson Regression to a logistic Regression

I have data which has an associated binary outcome variable. Naturally I ran a logistic regression in order to see parameter estimates and odds ratios. I was curious though, to change this data from a binary outcome to count data. Then I ran a poisson regression (and negative binomial regression) on the count data.
I have no idea of how to compare these different models though, all comparisons I see seem to only be concerned with nested models.
How would you go about deciding on the best model to use in this situation?
Essentially both models will be roughly equal. What really matters is what is your objective- what you really want to predict. If you want to determine how many of cases are good or bad (1 or 0), then you go for logistic regression. If you are really interested on how much the cases are going to do (counts) then do poisson.
In other words, the only difference between these two models is the logistic transformation and the fact that logistic regression tries to minimize the misclassification error (-2 log likelihood) .To put it simply, even if you run a linear regression (OLS) on the binary outcome, you should not see big differences from your logistic model apart from the fact that the results may not be between 0 and 1 (e.g. the Area under the RoC curve will be similar to the logistic model) .
To sum up, don't worry about which of these two models is better, they should be roughly the same in the way the capture your features' information. Just think what makes more sense to optimize, counts or probabilties. The answer might have been different if you were considering non-linear models (e.g random forests or neural networks etc), but the two you are considering are both (almost) linear- so don't worry about it.
One thing to consider is the sample design. If you are using a case-control study, then logistic regression is the way to go because of its logit link function, rather than log of ratios as in Poisson regression. This is because, where there is an oversampling of cases such as in case-control study, odds ratio is unbiased.

Is it possible to compare the classification ability of two sets of features by ROC?

I am learning about SVM and ROC. As I know, people can usually use a ROC(receiver operating characteristic) curve to show classification ability of a SVM (Support Vector Machine). I am wondering if I can use the same concept to compare two subsets of features.
Assume I have two subsets of features, subset A and subset B. They are chosen from the same train data by 2 different features extraction methods, A and B. If I use these two subsets of features to train the same SVM by using the LIBSVM svmtrain() function and plot the ROC curves for both of them, can I compare their classification ability by their AUC values ? So if I have a higher AUC value for subsetA than subsetB, can I conclude that method A is better than method B ? Does it make any sense ?
Thank you very much,
Yes, you are on the right track. However, you need to keep a few things in mind.
Often using the two features A and B with the appropriate scaling/normalization can give better performance than the features individually. So you might also consider the possibility of using both the features A and B together.
When training SVMs using features A and B, you should optimize for them separately, i.e. compare the best performance obtained using feature A with the best obtained using feature B. Often the features A and B might give their best performance with different kernels and parameter settings.
There are other metrics apart from AUC, such as F1-score, Mean Average Precision(MAP) that can be computed once you have evaluated the test data and depending on the application that you have in mind, they might be more suitable.

Resources