Spark ML pipeline: 2 classes in label field, but Spark can't run binomial logistic regression - apache-spark

I validated that the label field has only 2 possible values by running SELECT DISTINCT on it.
But Spark gives an error saying it can't use the binomial family in logistic regression because it found 3 classes in the label field.
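One possible explanation, offered as an assumption rather than a diagnosis: Spark's logistic regression treats labels as class indices starting at 0 and, when the label column carries no metadata, appears to take the number of classes as the largest label plus one. Two distinct values such as 1 and 2 would then be counted as 3 classes. The plain-Python sketch below mimics that counting rule and shows a remapping to 0.0/1.0. The helper names are hypothetical, and in a real pipeline a similar remapping is usually done with `StringIndexer`.

```python
def inferred_num_classes(labels):
    """Mimic the assumed rule: class count = largest label index + 1."""
    return int(max(labels)) + 1


def remap_to_indices(labels):
    """Map the distinct label values to 0.0, 1.0, ... in sorted order."""
    mapping = {value: float(i) for i, value in enumerate(sorted(set(labels)))}
    return [mapping[v] for v in labels], mapping


# Hypothetical label column with two distinct values that are not 0 and 1.
raw_labels = [1.0, 2.0, 2.0, 1.0]
print("distinct values:", len(set(raw_labels)))
print("classes inferred before remapping:", inferred_num_classes(raw_labels))

fixed_labels, mapping = remap_to_indices(raw_labels)
print("classes inferred after remapping:", inferred_num_classes(fixed_labels))
```

If the distinct values are already 0 and 1, this explanation does not apply, and the label column's metadata or stray nulls would be the next things to check.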

Related

Logistic regression: dummy variables from different variables are perfectly correlated, what to do?

Let's say that we have a dataset with 3 variables: age_banded, number_of_accounts_banded and country.
Each of these 3 variables can take the value "missing" when there is no information about that customer. For logistic regression with categorical variables we want to create dummy variables. What do we do with the dummies for the missing category? If the three variables are always missing for the same customers, age_banded:missing, number_of_accounts_banded:missing and country:missing will all have correlation equal to 1, so there is no need to keep all three dummies in the dataset.
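One option, sketched here under the assumption that the "missing" dummies are exactly identical row by row, is to keep a single copy and drop the rest. The function name and the `country:UK` column are hypothetical and only illustrate the idea.

```python
def drop_duplicate_columns(columns):
    """Keep the first of any group of columns with identical values."""
    seen = set()
    kept = {}
    for name, values in columns.items():
        key = tuple(values)
        if key in seen:
            continue  # an identical column was already kept
        seen.add(key)
        kept[name] = values
    return kept


# Hypothetical dummy columns: the three "missing" dummies coincide.
dummies = {
    "age_banded:missing": [1, 0, 0, 1],
    "number_of_accounts_banded:missing": [1, 0, 0, 1],
    "country:missing": [1, 0, 0, 1],
    "country:UK": [0, 1, 0, 0],
}
kept = drop_duplicate_columns(dummies)
print(list(kept))
```

The surviving column then acts as a single "no information about this customer" indicator.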

How does sklearn Logistic Regression handle null values?

I have several features in my Logistic regression model. One of the features has multiple NULL values. I want these rows to be included in my model.
How does Logistic Regression handle null values when computing? Does it take into account the null value as part of the feature? I did not come across errors when running.
If Logistic Regression ignores it, would I have to change the null values to an actual string value in order for logistic regression to recognize it as part of the model?

Scoring Model giving reversed results using logistic regression

I am trying to implement a scoring model following the link https://rstudio-pubs-static.s3.amazonaws.com/376828_032c59adbc984b0ab892ce0026370352.html#1_introduction.
After the entire implementation, though, when I create a pivot of my generated scores and the original labels, the average score for the "good" labels is significantly lower than the one for the "high" labels.
Hence, my problem can be oversimplified to: why would logistic regression give reversed probabilities for a 0-1 target variable? (In my model I am using 0 for bad and 1 for good.)
Any suggestions and solutions would be welcome.

How do I run the Spark logistic regression with categorical features using python?

I have data with some categorical variables and I want to run a logistic regression using MLlib, but it seems like the model supports only continuous variables.
Does anyone know how to deal with this, please?
Logistic regression, like the other linear models, takes as input an RDD of LabeledPoint, where a LabeledPoint is a Double (the label) and the associated Vector (an array of doubles).
Categorical values (Strings) are not supported; however, you could convert those to binary columns.
For example, if you have a column RAG taking the values Red, Amber and Green, you would add three binary columns isRed, isAmber and isGreen, of which exactly one is 1 (true) and the others are 0 (false) for each sample.
For further explanation, see: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html
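The RAG example above can be sketched in plain Python as follows. This is only an illustration of the encoding itself, not of any MLlib API; the `one_hot` helper and the sample column are hypothetical.

```python
def one_hot(value, categories):
    """Return one binary entry per category; only the matching one is 1.0."""
    return [1.0 if value == category else 0.0 for category in categories]


# Column order corresponds to isRed, isAmber, isGreen.
categories = ["Red", "Amber", "Green"]
rag_column = ["Red", "Green", "Amber", "Green"]  # hypothetical sample values

encoded = [one_hot(value, categories) for value in rag_column]
for value, row in zip(rag_column, encoded):
    print(value, row)
```

Each encoded row can then be concatenated with the continuous features to form the feature vector of a labeled point.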

How to use SMOTE in Microsoft Azure

There is a module named SMOTE (Synthetic Minority Oversampling Technique) which increases the number of samples of under-sampled data. I guess we should choose a feature (the feature to be predicted) which is under-represented. How do I choose it? There seems to be no option for choosing the column.
I guess you are referring to the target variable (label column). You can set that using a Metadata Editor module. Choose your label column using the column selector and set the Fields property to Labels.
Here is the SMOTE definition: SMOTE is an approach for the construction of classifiers from imbalanced datasets, which is when classification categories are not approximately equally represented. The classification category is the feature that the classifier is trying to learn. There is no option for choosing the column in the SMOTE module because it should be the label column.
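To make the definition more concrete, here is a minimal plain-Python sketch of the core SMOTE idea: a synthetic minority sample is placed at a random point on the line between a minority sample and one of its nearest minority neighbours. It is not the Azure module's implementation; the function names, the parameters `k` and `seed`, and the data are assumptions for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class rows (feature vectors only, label omitted).
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 1.0]]
new_points = smote_sketch(minority, n_new=4)
print(new_points)
```

Only rows of the under-represented class are passed in, which is why the label column has to be identified before oversampling.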
Here are the details on how to use SMOTE in Azure Machine Learning: https://msdn.microsoft.com/en-us/library/azure/dn913076.aspx?f=255&MSPPError=-2147217396
You can do it through the column selector. In the sample, the blood donation data (a sample dataset in Azure ML) has 25% of people who donated (class 1).
