Does Spark.ml LogisticRegression assume numerical features only? - apache-spark

I was looking at the Spark 1.5 dataframe/row api and the implementation for the logistic regression. As I understand, the train method therein first converts the dataframe to RDD[LabeledPoint] as,
override protected def train(dataset: DataFrame): LogisticRegressionModel = {
// Extract columns from data. If dataset is persisted, do not persist oldDataset.
val instances = extractLabeledPoints(dataset).map {
case LabeledPoint(label: Double, features: Vector) => (label, features)
}
...
And then it proceeds to feature standardization, etc.
What confuses me is that a DataFrame is essentially an RDD[Row], and a Row may hold values of any type; for example, (1, true, "a string", null) seems to be a valid row of a DataFrame. If that is so, what does extractLabeledPoints above do? It seems to select only Array[Double] as the feature values in the Vector. What happens if a column in the DataFrame contains strings? And what happens to integer categorical values?
Thanks in advance,
Nikhil

Let's ignore Spark for a moment. Generally speaking, linear models, including logistic regression, expect numeric independent variables. This is not in any way specific to Spark / MLlib. If the input contains categorical or ordinal variables, these have to be encoded first. Some languages, like R, handle this in a transparent manner:
> df <- data.frame(x1 = c("a", "b", "c", "d"), y=c("aa", "aa", "bb", "bb"))
> glm(y ~ x1, df, family="binomial")
Call: glm(formula = y ~ x1, family = "binomial", data = df)
Coefficients:
(Intercept) x1b x1c x1d
-2.357e+01 -4.974e-15 4.713e+01 4.713e+01
...
but what is really used behind the scenes is a so-called design matrix:
> model.matrix( ~ x1, df)
  (Intercept) x1b x1c x1d
1           1   0   0   0
2           1   1   0   0
3           1   0   1   0
4           1   0   0   1
...
Skipping over the details, it is the same type of transformation as the one performed by the OneHotEncoder in Spark.
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
val df = sqlContext.createDataFrame(Seq(
Tuple1("a"), Tuple1("b"), Tuple1("c"), Tuple1("d")
)).toDF("x").repartition(1)
val indexer = new StringIndexer()
.setInputCol("x")
.setOutputCol("xIdx")
.fit(df)
val indexed = indexer.transform(df)
val encoder = new OneHotEncoder()
.setInputCol("xIdx")
.setOutputCol("xVec")
val encoded = encoder.transform(indexed)
encoded
.select($"xVec")
.map(_.getAs[Vector]("xVec").toDense)
.foreach(println)
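To make the encoding step concrete without needing R or a Spark cluster, here is a rough plain-Python sketch of the same treatment coding. It is only an illustration of the idea, not how R or Spark implement it. The function name design_matrix is made up for this example, and it assumes the first level in sorted order is the baseline that gets dropped.

```python
def design_matrix(values, prefix="x1"):
    """Build an intercept column plus one 0/1 dummy column per non-baseline level."""
    levels = sorted(set(values))
    rest = levels[1:]  # the first level is the baseline and gets no column (assumption)
    header = ["(Intercept)"] + [f"{prefix}{level}" for level in rest]
    rows = [[1] + [1 if value == level else 0 for level in rest] for value in values]
    return header, rows

header, rows = design_matrix(["a", "b", "c", "d"])
print(header)
for row in rows:
    print(row)
```

For the four values used above, this gives the same 0/1 pattern as the model.matrix output.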
Spark goes one step further: all features, even if the algorithm allows nominal/ordinal independent variables, have to be stored as Double in a spark.mllib.linalg.Vector. In spark.ml it is a DataFrame column; in spark.mllib it is a field in spark.mllib.regression.LabeledPoint.
How the feature vector is interpreted depends on the model, though. As mentioned above, linear models interpret the values as numerical variables. For Naive Bayes these are considered nominal. If a model accepts both numerical and nominal variables and treats each group differently, like decision / regression trees, you can provide the categoricalFeaturesInfo parameter.
It is worth pointing out that dependent variables should be encoded as Double as well but, unlike independent variables, may require additional metadata to be handled properly. If you take a look at the indexed DataFrame you'll see that StringIndexer not only transforms x, but also adds attributes:
scala> org.apache.spark.ml.attribute.Attribute.fromStructField(indexed.schema(1))
res12: org.apache.spark.ml.attribute.Attribute = {"vals":["d","a","b","c"],"type":"nominal","name":"xIdx"}
Finally some Transformers from ML, like VectorIndexer, can automatically detect and encode categorical variables based on the number of distinct values.
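The "number of distinct values" rule is easy to picture with a small plain-Python sketch. This is not Spark's implementation, just the idea: a feature with at most max_categories distinct values is treated as categorical and re-indexed to 0, 1, 2, ..., and everything else is left as-is. The function names and the toy rows are invented for the example.

```python
def detect_categorical(rows, max_categories):
    """Return {feature index: sorted distinct values} for features with few distinct values."""
    categorical = {}
    for j in range(len(rows[0])):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            categorical[j] = distinct
    return categorical

def index_rows(rows, categorical):
    """Replace categorical values by their category index; keep continuous values unchanged."""
    maps = {j: {v: float(i) for i, v in enumerate(vals)} for j, vals in categorical.items()}
    return [[maps[j][v] if j in maps else v for j, v in enumerate(row)] for row in rows]

rows = [
    [0.0, 1.5, 10.0],
    [1.0, 2.7, 20.0],
    [0.0, 3.1, 10.0],
    [1.0, 4.8, 30.0],
    [0.0, 5.2, 20.0],
]
categorical = detect_categorical(rows, max_categories=3)
indexed_rows = index_rows(rows, categorical)
print(categorical)
print(indexed_rows)
```

Here features 0 and 2 have two and three distinct values, so they count as categorical, while feature 1 has five and stays continuous.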

Copying clarification from zero323 in the comments:
Categorical values have to be encoded as Double before being passed to MLlib / ML estimators. There are quite a few built-in transformers, like StringIndexer or OneHotEncoder, which can be helpful here. If an algorithm treats categorical features differently from numerical ones, as DecisionTree does for example, you identify which variables are categorical using categoricalFeaturesInfo.
Finally some transformers use special attributes on columns to distinguish between different types of attributes.

Related

SkLearn: Feature Union with a dictionary and text data

I have a DataFrame like:
   text_data          worker_dicts                              outcomes
0  "Some string"      {"Sector":"Finance", "State: NJ"}         0
1  "Another string"   {"Sector":"Programming", "State: NY"}     1
It has both text information, and a column that is a dictionary. (The real worker_dicts has many more fields). I'm interested in the binary outcome column.
What I initially tried doing was to combine both text_data and worker_dict, crudely concatenating both columns, and then running Multinomial NB on that:
df['stacked_features']=df['text_data'].astype(str)+'_'+df['worker_dicts']
stacked_features = np.array(df['stacked_features'])
outcomes = np.array(df['outcomes'])
text_clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', ngram_range=(1, 3))),
('clf', MultinomialNB())])
text_clf = text_clf.fit(stacked_features, outcomes)
But I got very bad accuracy, and I think that fitting two independent models would be a better use of data than fitting one model on both types of features (as I am doing with stacking).
How would I go about utilizing Feature Union? worker_dicts is a little weird because it's a dictionary, so I'm very confused as to how I'd go about parsing that.
If your dictionary entries are categorical as they appear to be in your example, then I would create different columns from the dictionary entries before doing additional processing.
new_features = pd.DataFrame(df['worker_dicts'].values.tolist())
Then new_features will be its own dataframe with columns Sector and State and you can one hot encode those as needed in addition to TFIDF or other feature extraction for your text_data column. In order to use that in a pipeline, you would need to create a new transformer class, so I might suggest just applying the dictionary parsing and the TFIDF separately, then stacking the results, and adding OneHotEncoding to your pipeline as that allows you to specify columns to apply the transformer to. (As the categories you want to encode are strings you may want to use LabelBinarizer class instead of OneHotEncoder class for the encoding transformation.)
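To show what one hot encoding the dictionary entries amounts to, here is a small sketch in plain Python with no pandas or scikit-learn needed. It assumes each dictionary has simple string values under keys like Sector and State; the helper name one_hot_dicts is made up for this example.

```python
def one_hot_dicts(records):
    """Turn a list of flat dicts into 0/1 columns, one column per (key, value) pair seen."""
    columns = sorted({(key, value) for record in records for key, value in record.items()})
    names = [f"{key}={value}" for key, value in columns]
    matrix = [[1 if record.get(key) == value else 0 for key, value in columns]
              for record in records]
    return names, matrix

worker_dicts = [
    {"Sector": "Finance", "State": "NJ"},
    {"Sector": "Programming", "State": "NY"},
]
names, matrix = one_hot_dicts(worker_dicts)
print(names)
print(matrix)
```

scikit-learn's DictVectorizer does this kind of encoding for you; the sketch is only meant to show the shape of the result.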
If you want to just use TFIDF on all of the columns individually with a pipeline, you would need to use a nested Pipeline and FeatureUnion set up to extract columns as described here.
If you have your one hot encoded features in dataframes X1 and X2 as described below and your text features in X3, you could do something like the following to create a pipeline. (There are many other options, this is just one way)
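import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
X = pd.concat([X1, X2, X3], axis=1)
def select_text_data(X):
    return X['text_data']
def select_remaining_data(X):
    return X.drop('text_data', axis=1)
# pipeline that selects the text column and computes TF-IDF features for it
text_pipeline = Pipeline([
    ('column_selection', FunctionTransformer(select_text_data, validate=False)),
    ('tfidf', TfidfVectorizer())
])
# combine the TF-IDF features with the remaining (already encoded) columns
final_pipeline = Pipeline([
    ('feature-union', FeatureUnion([
        ('text-features', text_pipeline),
        ('other-features', FunctionTransformer(select_remaining_data))
    ])),
    ('clf', LogisticRegression())
])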
(MultinomialNB has no transform or fit_transform method, so it can only be used as the final estimator step of a pipeline, not as an intermediate step.)

What is StringIndexer, VectorIndexer, and how to use them?

Dataset<Row> dataFrame = ... ;
StringIndexerModel labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(dataFrame);
VectorIndexerModel featureIndexer = new VectorIndexer()
.setInputCol("s")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(dataFrame);
IndexToString labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels());
What are StringIndexer, VectorIndexer, and IndexToString, and what is the difference between them? How and when should I use them?
StringIndexer: use it if you want the machine learning algorithm to identify a column as a categorical variable, or if you want to convert textual data to numeric data while keeping the categorical context.
e.g. converting days (Monday, Tuesday, ...) to a numeric representation.
VectorIndexer: use this if you do not know the types of the incoming data, so you leave the logic of differentiating between categorical and non-categorical data to VectorIndexer.
e.g. data coming from a third-party API, where the data is hidden and is ingested directly into the training model.
IndexToString: just the opposite of StringIndexer. Use this if the final output column was indexed using StringIndexer and you now want to convert its numeric representation back to text so that the result can be understood better.
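A tiny plain-Python sketch of the StringIndexer / IndexToString round trip may help here. It is not Spark code. It mimics StringIndexer's default behaviour of giving index 0 to the most frequent label; breaking ties alphabetically is an assumption of this sketch, and all function names are invented.

```python
from collections import Counter

def fit_string_indexer(values):
    """Return labels ordered by descending frequency (ties alphabetical, an assumption)."""
    counts = Counter(values)
    return sorted(counts, key=lambda label: (-counts[label], label))

def string_to_index(values, labels):
    """Map each string to the position of its label, as a float."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    return [lookup[value] for value in values]

def index_to_string(indices, labels):
    """Reverse mapping: turn indices back into the original strings."""
    return [labels[int(i)] for i in indices]

days = ["Mon", "Tue", "Mon", "Wed", "Mon", "Tue"]
labels = fit_string_indexer(days)
indexed = string_to_index(days, labels)
restored = index_to_string(indexed, labels)
print(labels, indexed, restored)
```

The labels list plays the role of labelIndexer.labels() in the Java snippet: the same list is needed to map predictions back to strings.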
I know only about those two:
StringIndexer and VectorIndexer
StringIndexer:
converts a single column to an index column (similar to a factor column in R)
VectorIndexer:
is used to index categorical predictors in a featuresCol column. Remember that featuresCol is a single column consisting of vectors (refer to featuresCol and labelCol). Each row is a vector that contains the values of all predictors.
If you have string-type predictors, you will first need to index those columns with StringIndexer. featuresCol contains vectors, and vectors do not contain string values.
Take a look here for example: https://mingchen0919.github.io/learning-apache-spark/StringIndexer-and-VectorIndexer.html

Random forest classification: How to infer class probability from "probabilityCol"

Background:
I am running a random-forest classifier on a dataFrame with label classes as [0,1] . My goal is to extract the probability of label '1' from the probabilityCol column.
As per the spark ml docs,
probabilityCol Vector of length # classes equal to rawPrediction normalized to a multinomial distribution
Question:
What is the ordering of the target classes within the probabilityCol vector? Can it even be determined?
In case I want to extract the probability of a given class ('1' in my case), what is the recommended way to do so?
Any leads will be appreciated.
1) The ordering corresponds to the numeric values of labelCol (your target column name). In probability vector class '0' always goes first, then goes class '1' etc. RandomForest works only with numeric class values, so they always act like indexes.
2) Suppose you have dataframe prediction with column probability. To get the probability for class 1 you can use UDF function:
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions.udf
val classNum = 1
def getTop(x : DenseVector) : Double = {
x.toArray(classNum)
}
val udfGetTop = udf(getTop _)
val predictionTop = prediction
.select("labelIndexed", "probability")
.withColumn("label1Prob", udfGetTop($"probability"))
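The indexing logic itself is simple enough to show in plain Python. This is just a sketch with made-up raw scores, not Spark code: the raw prediction is normalized so that it sums to 1, and the probability of class k is the element at position k.

```python
def normalize(raw_prediction):
    """Scale non-negative raw scores so that they sum to 1."""
    total = sum(raw_prediction)
    return [value / total for value in raw_prediction]

def class_probability(probability, class_index):
    """The probability vector is ordered by numeric label, so class k sits at position k."""
    return probability[class_index]

raw_prediction = [3.0, 7.0]  # hypothetical raw scores for classes 0 and 1
probability = normalize(raw_prediction)
p1 = class_probability(probability, 1)
print(probability, p1)
```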

Spark: Normalising/Standardizing test-set using training set statistics

This is a very common process in Machine Learning.
I have a dataset and I split it into training set and test set.
Since I apply some normalizing and standardization to the training set,
I would like to use the same info of the training set (mean/std/min/max
values of each feature), to apply the normalizing and standardization
to the test set too. Do you know any optimal way to do that?
I am aware of the functions of MinMaxScaler, StandardScaler etc..
You can achieve this via a few lines of code on both the training and test set.
On the training side there are two approaches:
MultivariateStatisticalSummary
http://spark.apache.org/docs/latest/mllib-statistics.html
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column
Using SQL
from pyspark.sql.functions import mean, min, max
In [6]: df.select([mean('uniform'), min('uniform'), max('uniform')]).show()
+------------------+-------------------+------------------+
| AVG(uniform)| MIN(uniform)| MAX(uniform)|
+------------------+-------------------+------------------+
|0.5215336029384192|0.19657711634539565|0.9970412477032209|
+------------------+-------------------+------------------+
On the testing data, you can then manually "normalize" the data using the statistics obtained above from the training data. You can decide in which sense you wish to normalize, e.g.:
Student's T
val normalized = testData.map{ m =>
(m - trainMean) / trainingSampleStddev
}
Feature Scaling
val normalized = testData.map{ m =>
(m - trainMean) / (trainMax - trainMin)
}
There are others: take a look at https://en.wikipedia.org/wiki/Normalization_(statistics)
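Putting the two halves together, here is a minimal single-feature sketch in plain Python: statistics are computed on the training values only and then reused unchanged on the test values. The function names and the toy numbers are invented. Note that min_max_scale subtracts the training minimum (the usual min-max variant) rather than the training mean used in the snippet above.

```python
from statistics import mean, stdev

def fit_stats(train):
    """Compute the statistics on the training values only."""
    return {"mean": mean(train), "std": stdev(train), "min": min(train), "max": max(train)}

def standardize(values, stats):
    """Subtract the training mean and divide by the training sample standard deviation."""
    return [(value - stats["mean"]) / stats["std"] for value in values]

def min_max_scale(values, stats):
    """Rescale using the training min and max; test values may land outside [0, 1]."""
    span = stats["max"] - stats["min"]
    return [(value - stats["min"]) / span for value in values]

train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]
stats = fit_stats(train)
test_standardized = standardize(test, stats)
test_scaled = min_max_scale(test, stats)
print(stats, test_standardized, test_scaled)
```

The second test value is larger than the training maximum, so its scaled value ends up above 1; that is expected when the test set is scaled with training statistics.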

Spark: Dimensions mismatch error with RDD[LabeledPoint] union

I would ideally like to do the following:
In essence, what I want to do is for my dataset that is RDD[LabeledPoint], I want to control the ratio of positive and negative labels.
val training_data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "training_data.tsv")
This dataset has both cases and controls included in it. I want to control the ratio of cases to controls (my dataset is skewed). So I want to do something like sample training_data such that the ratio of cases to controls is 1:2 (instead of 1:500 say).
I was not able to do that, so I separated the training data into cases and controls as below and then tried to combine them later using the union operator, which gave me the dimensions mismatch error.
I have two datasets (both in Libsvm format):
val positives: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "positives.tsv")
val negatives: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "negatives.tsv")
I want to combine these two to form training data. Note both are in libsvm format.
training = positives.union(negatives)
When I use the above training dataset in model building (such as logistic regression), I get an error because positives and negatives can have different numbers of columns/dimensions. The error is: "Dimensions mismatch when merging with another summarizer". Any idea how to handle that?
In addition, I also want to do samplings such as
positives_subset = positives.sample()
I was able to solve this in the following way:
def create_subset(training: RDD[LabeledPoint], target_label: Double, sampling_ratio: Double): RDD[LabeledPoint] = {
  val training_filtered = training.filter { case LabeledPoint(label, features) => label == target_label }
  training_filtered.sample(false, sampling_ratio)
}
Then calling the above method as:
val positives = create_subset(training, 1.0, 1.0)
val negatives_sampled = create_subset(training, 0.0, sampling_ratio)
Then you can take the union as:
val training_subset_double = positives.union(negatives_sampled)
and then I was able to use the training_subset_double for model building.
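For anyone wondering how to pick sampling_ratio for a target such as 1:2, here is a rough plain-Python sketch of the same filter-then-sample idea on toy data. It is not Spark code; the data, the seed, and the helper sampling_ratio are made up for illustration. As with sampling an RDD without replacement, each negative is kept independently, so the final ratio is only approximately 1:2.

```python
import random

def sampling_ratio(n_pos, n_neg, neg_per_pos):
    """Fraction of negatives to keep so that roughly neg_per_pos negatives remain per positive."""
    return min(1.0, neg_per_pos * n_pos / n_neg)

def create_subset(data, target_label, ratio, rng):
    """Keep each example with the given label independently with probability `ratio`."""
    return [(label, feats) for label, feats in data
            if label == target_label and rng.random() < ratio]

# Toy data: 10 cases and 5000 controls, each with a dummy one-element feature list
training = [(1.0, [float(i)]) for i in range(10)] + [(0.0, [float(i)]) for i in range(5000)]
rng = random.Random(42)
n_pos = sum(1 for label, _ in training if label == 1.0)
n_neg = len(training) - n_pos
ratio = sampling_ratio(n_pos, n_neg, neg_per_pos=2)
positives = create_subset(training, 1.0, 1.0, rng)
negatives_sampled = create_subset(training, 0.0, ratio, rng)
training_subset = positives + negatives_sampled
print(ratio, len(positives), len(negatives_sampled))
```

Because both subsets come from the same loaded dataset, they share the same feature dimension, which is why the union no longer fails.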
