I want to bootstrap a sample with multivariate observations in order to perform multivariate statistics. Should I bootstrap whole observations in rows (each observation represented by a row, with all of its variables kept together), or bootstrap each variable separately and combine the draws into an observation?
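For reference, here is a minimal sketch of the row-wise (case) bootstrap on made-up data: resampling whole rows with replacement keeps each observation's variables together, so the correlation structure between variables is preserved in every replicate.

```python
# Sketch: the case (row-wise) bootstrap resamples whole observations,
# preserving the correlation structure between the variables.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 observations, 3 variables

idx = rng.integers(0, len(X), len(X))  # draw row indices with replacement
X_boot = X[idx]                        # one bootstrap replicate, same shape

print(X_boot.shape)  # (100, 3)
```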
I use DecisionTreeClassifier from sklearn.
I need to override the splitting feature and the min_samples_leaf used in a particular tree node.
How can I do it?
You cannot define min_samples_leaf for a single node: to ensure compliance with a rule set on that one node, the model would probably end up assigning fewer samples to other nodes than the min_samples_leaf of the whole model.
If you are dealing with an imbalanced data set, I suggest oversampling or undersampling your data before feeding it to the model, or you could set the class weights manually.
According to scikit-learn's user guide:
Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant. Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value.
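A minimal sketch of the class-weight route, using a synthetic imbalanced data set: class_weight="balanced" rescales sample weights inversely to class frequencies, which achieves the weight normalization the user guide describes without resampling.

```python
# Sketch: instead of over/undersampling, let the tree weight samples
# inversely to class frequencies via class_weight="balanced".
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic imbalanced data: ~90% of samples in one class
X, y = make_classification(n_samples=300, weights=[0.9, 0.1],
                           random_state=0)

clf = DecisionTreeClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```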
Assume you have a scikit-learn pipeline with two steps: first, a transformer object (e.g., PCA) and, second and last, an object with a predict method (e.g., a logistic regression). Suppose that your grid contains four different hyperparameter configurations: two different numbers of principal components (for PCA) crossed with two different regularization parameters (for the logistic regression).
Is the first object (PCA) fitted 4 times (once for each configuration)? Or is it fitted only two times: for each number of principal components, fit PCA once and fit the logistic regression twice? The latter seems more efficient and equivalent in terms of results.
I thought that it was the second way, but I got confused by a comment on the scikit-learn GitHub (https://github.com/scikit-learn/scikit-learn/issues/4813#issuecomment-205185156)
In a hypothetical situation, I have 3 independent variables; one of them has a non-linear (exponential) relationship with the dependent variable, and the other two are linearly related to it. In such a case, what would be the best approach for running a regression analysis?
Note that I have already tried transforming the one non-linear independent variable.
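One common approach, sketched below on made-up data: transform only the non-linear predictor so that its relationship with the response becomes linear, then fit ordinary least squares on all three terms. The right transform depends on the actual functional form; here I assume the response depends on log(x3).

```python
# Sketch (hypothetical data): if the response depends on x3 only through
# log(x3), transforming that one column linearizes the model for OLS.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
x3 = rng.uniform(1, 10, size=n)
y = 2 * x1 - x2 + 3 * np.log(x3) + rng.normal(scale=0.1, size=n)

X = np.column_stack([x1, x2, np.log(x3)])  # transform only x3
model = LinearRegression().fit(X, y)
print(np.round(model.coef_, 2))  # close to the true [2, -1, 3]
```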
I've been searching the web for ways to conduct multivariate regressions in Excel, and I've seen that the Analysis ToolPak can get the job done.
However, it seems that the Analysis ToolPak can handle multivariable linear regression but not multivariate linear regression (the latter meaning that one may have more than one dependent variable, Y1, ..., Yn, each modeled on the predictors x1 + x2 + ... + xn; the former meaning a single dependent variable with multiple independent variables, Y = x1 + x2 + ... + xn).
Is there a way to conduct multivariate regressions in Excel, or should I start looking at other programs like R?
Thanks in advance
Is this what you are looking for?
http://smallbusiness.chron.com/run-multivariate-regression-excel-42353.html
I am new to machine learning and I am working on a classification problem with Categorical (nominal) data. I have tried applying BayesNet and a couple of Trees and Rules classification algorithms to the raw data. I am able to achieve an AUC of 0.85.
I further want to improve the AUC by pre-processing or transforming the data. However, since the data is categorical, I don't think that log transforms, addition, multiplication, etc. of different columns will work here.
Can somebody list the most common transformations applied to categorical data sets? (I tried one-hot encoding, but it takes a lot of memory!)
Categorical data is, in my experience, best handled with one-hot encoding (i.e. converting each category to a binary vector), as you've mentioned. If memory is an issue, it may be worthwhile to use an online classification algorithm and generate the encoded vectors on the fly.
Apart from this, if the categories represent ranges of an underlying quantity (such as age, height or income brackets), it may be possible to treat the centre of each range (or some appropriate mean, if there is an intra-label distribution) as a real number.
If you were applying clustering you could also treat the categorical labels as points on an axis (1,2,3,4,5 etc), scaled appropriately to the other features.
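On the memory point above, a small sketch: scikit-learn's OneHotEncoder returns a sparse matrix by default, which stores only the non-zero entries and so avoids the memory blow-up of a dense one-hot representation. (The toy data below is made up.)

```python
# Sketch: OneHotEncoder produces a sparse matrix by default, storing
# only non-zero entries instead of a full dense one-hot array.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red", "S"], ["blue", "M"], ["red", "L"], ["green", "M"]])

enc = OneHotEncoder()          # sparse output by default
X_enc = enc.fit_transform(X)   # a scipy.sparse matrix

print(X_enc.shape)  # (4, 6): 3 colour categories + 3 size categories
print(type(X_enc))
```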