Random forest classification: How to infer class probability from "probabilityCol" - apache-spark

Background:
I am running a random forest classifier on a DataFrame whose label classes are [0, 1]. My goal is to extract the probability of label '1' from the probabilityCol column.
As per the Spark ML docs:
probabilityCol: Vector of length # classes equal to rawPrediction normalized to a multinomial distribution
Question:
What is the ordering of the target classes within the probabilityCol vector? Can it be determined at all?
If I want to extract the probability of a given class ('1' in my case), what is the recommended way to do so?
Any leads will be appreciated.

1) The ordering corresponds to the numeric values of labelCol (your target column). In the probability vector, class '0' always comes first, then class '1', and so on. RandomForest works only with numeric class values, so they always act as indexes.
2) Suppose you have a DataFrame prediction with a column probability. To get the probability for class 1, you can use a UDF:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

val classNum = 1
// Accept any ml Vector (dense or sparse) and read the entry at index classNum.
def getTop(x: Vector): Double = x(classNum)
val udfGetTop = udf(getTop _)
val predictionTop = prediction
  .select("labelIndexed", "probability")
  .withColumn("label1Prob", udfGetTop(col("probability")))
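To illustrate what the probability vector holds, here is a plain-Python sketch (not Spark code). It assumes the raw prediction is a list of non-negative per-class scores; the vote counts and the helper names normalize and class_probability are made up for illustration.

```python
def normalize(raw_prediction):
    """Scale non-negative raw scores so that they sum to 1."""
    total = sum(raw_prediction)
    if total == 0:
        raise ValueError("raw prediction sums to zero")
    return [score / total for score in raw_prediction]


def class_probability(probability, class_index):
    """Return the entry at class_index (position i belongs to numeric label i)."""
    return probability[class_index]


# Hypothetical raw scores for classes 0 and 1.
raw = [3.0, 7.0]
prob = normalize(raw)
print(class_probability(prob, 1))
```

Because the numeric label doubles as the position in the vector, reading index 1 is all that is needed for class '1'.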

Related

One Hot Encoding a composite field

I want to transform multiple columns with the same categorical values using a OneHotEncoder. I created a composite field and tried to use OneHotEncoder on it as below (items 1-3 come from the same list of items):
import pyspark.sql.functions as F
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Concatenate the columns, replacing nulls with "*" so concat never returns null.
def myConcat(*cols):
    return F.concat(*[F.coalesce(c, F.lit("*")) for c in cols])

df = df.withColumn("basket", myConcat("item1", "item2", "item3"))
indexer = StringIndexer(inputCol="basket", outputCol="basketIndex")
indexed = indexer.fit(df).transform(df)
encoder = OneHotEncoder(inputCol="basketIndex", outputCol="basketVec")
encoded = encoder.transform(indexed)
I am getting an out of memory error.
Does this approach work? How do I one-hot encode a composite field, or multiple columns with categorical values from the same list?
If you have an array of categorical values, why not try CountVectorizer?
import pyspark.sql.functions as F
from pyspark.ml.feature import CountVectorizer

# CountVectorizer expects an array<string> column, so build an array instead of a concatenated string.
df = df.withColumn("basket", F.array("item1", "item2", "item3"))
indexer = CountVectorizer(inputCol="basket", outputCol="basketIndex")
indexed = indexer.fit(df).transform(df)
Note: I can't comment yet (due to the fact that I'm a new user).
What is the cardinality of your "item1", "item2" and "item3" columns?
More specifically, what values does the following print?
k1 = df.select("item1").distinct().count()
k2 = df.select("item2").distinct().count()
k3 = df.select("item3").distinct().count()
k = k1 * k2 * k3
print(k1, k2, k3)
One-hot encoding basically creates a very sparse matrix with the same number of rows as your original DataFrame and up to k additional columns, where k is the product of the three numbers printed above.
Therefore, if those three numbers are large, you get an out-of-memory error.
The only solutions are to:
(1) increase your memory or
(2) introduce a hierarchy among the categories and use the higher level categories to limit k.
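To make the size argument concrete, here is a small plain-Python sketch (no Spark involved). It compares the worst-case width of a one-hot encoding of the composite field with the width of a single multi-hot vector over one shared item vocabulary, which is roughly what a count-vectorizer style encoding yields. The function names and the sample baskets are hypothetical.

```python
from math import prod


def composite_width(cardinalities):
    """Upper bound on distinct composite values: product of the cardinalities."""
    return prod(cardinalities)


def multi_hot(rows):
    """Encode each row as a 0/1 vector over the shared, sorted item vocabulary."""
    vocab = sorted({item for row in rows for item in row})
    position = {item: i for i, item in enumerate(vocab)}
    encoded = []
    for row in rows:
        vec = [0] * len(vocab)
        for item in row:
            vec[position[item]] = 1
        encoded.append(vec)
    return vocab, encoded


# Three columns with 1000 distinct items each: up to 10**9 composite categories.
print(composite_width([1000, 1000, 1000]))

# Hypothetical baskets drawn from one shared item list.
vocab, encoded = multi_hot([("apple", "milk", "bread"), ("milk", "eggs", "apple")])
print(vocab, encoded)
```

The composite encoding grows with the product of the cardinalities, while the shared-vocabulary encoding grows only with the number of distinct items.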

Reconstructing k-means using pre-computed cluster centres

I'm using k-means for clustering with 60 clusters. Since some of the clusters came out as meaningless, I deleted those cluster centers (8 of them) from the cluster center array and saved the result in clean_cluster_centers.
This time, I'm re-fitting the k-means model with init = clean_cluster_centers, n_clusters = 52 and max_iter = 1, because I want to avoid re-fitting as much as possible.
The basic idea is to recreate a new model from clean_cluster_centers. The problem is that, since we are removing a large number of clusters, the model quickly moves to more stable centers even with max_iter = 1. Is there any way to recreate the k-means model with exactly these centers?
If you've fitted a KMeans object, it has a cluster_centers_ attribute. You can directly update it by doing something like this:
cls.cluster_centers_ = new_cluster_centers
So if you want a new object with the clean cluster centers, just do something like the following:
import copy
cls = KMeans().fit(X)
# KMeans has no copy() method; deepcopy keeps all fitted attributes.
cls2 = copy.deepcopy(cls)
cls2.cluster_centers_ = new_cluster_centers
And now, since the predict function only checks that your object has a non-null attribute called cluster_centers_, you can use the predict function
def predict(self, X):
    """Predict the closest cluster each sample in X belongs to.

    In the vector quantization literature, `cluster_centers_` is called
    the code book and each value returned by `predict` is the index of
    the closest code in the code book.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        New data to predict.

    Returns
    -------
    labels : array, shape [n_samples,]
        Index of the cluster each sample belongs to.
    """
    check_is_fitted(self, 'cluster_centers_')
    X = self._check_test_data(X)
    x_squared_norms = row_norms(X, squared=True)
    return _labels_inertia(X, x_squared_norms, self.cluster_centers_)[0]
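As the docstring says, prediction only needs the code book. The following is a minimal pure-Python sketch of that nearest-center lookup, not scikit-learn's implementation; the names squared_distance and predict_closest and the sample centers are made up for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_closest(points, centers):
    """Return, for each point, the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        for p in points
    ]


# Hypothetical centers; drop the one at index 1 as "meaningless".
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
clean_centers = [c for i, c in enumerate(centers) if i != 1]
print(predict_closest([(9.0, 9.0), (19.0, 1.0)], clean_centers))
```

Note that the returned indices refer to positions in the cleaned list, so labels from the original 60-cluster model do not carry over unchanged.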

What is StringIndexer , VectorIndexer, and how to use them?

Dataset<Row> dataFrame = ... ;
StringIndexerModel labelIndexer = new StringIndexer()
    .setInputCol("label")
    .setOutputCol("indexedLabel")
    .fit(dataFrame);
VectorIndexerModel featureIndexer = new VectorIndexer()
    .setInputCol("s")
    .setOutputCol("indexedFeatures")
    .setMaxCategories(4)
    .fit(dataFrame);
IndexToString labelConverter = new IndexToString()
    .setInputCol("prediction")
    .setOutputCol("predictedLabel")
    .setLabels(labelIndexer.labels());
What is StringIndexer, VectorIndexer, IndexToString and what is the difference between them? How and When should I use them?
StringIndexer: use it if you want the machine learning algorithm to treat a column as a categorical variable, or if you want to convert textual data to numeric data while keeping the categorical context.
e.g. converting days (Monday, Tuesday, ...) to a numeric representation.
VectorIndexer: use this if you do not know the types of the incoming data, so you leave the logic of differentiating between categorical and non-categorical data to the algorithm.
e.g. data coming from a third-party API, where the data is hidden and is ingested directly into the training model.
IndexToString: just the opposite of StringIndexer. Use this if the final output column was indexed using StringIndexer and you now want to convert its numeric representation back to text so that the result can be understood better.
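A rough plain-Python sketch of the StringIndexer / IndexToString round trip may help here. It is not Spark code: the function names are hypothetical, and ordering labels by descending frequency with alphabetical tie-breaking is an assumption of this sketch modelled on StringIndexer's default ordering.

```python
from collections import Counter


def fit_string_indexer(values):
    """Return labels ordered by descending frequency (ties: alphabetical)."""
    counts = Counter(values)
    return sorted(counts, key=lambda v: (-counts[v], v))


def string_to_index(values, labels):
    """Replace each string by the (double) position of its label."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    return [lookup[v] for v in values]


def index_to_string(indices, labels):
    """Inverse mapping: turn numeric indices back into the original strings."""
    return [labels[int(i)] for i in indices]


days = ["Monday", "Tuesday", "Monday", "Sunday", "Monday", "Tuesday"]
labels = fit_string_indexer(days)
indexed = string_to_index(days, labels)
print(labels, indexed, index_to_string(indexed, labels))
```

The labels list plays the role of labelIndexer.labels() in the question's code: IndexToString needs it to map predictions back to text.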
I know only about those two:
StringIndexer and VectorIndexer
StringIndexer:
converts a single column to an index column (similar to a factor column in R)
VectorIndexer:
is used to index categorical predictors in a featuresCol column. Remember that featuresCol is a single column consisting of vectors (refer to featuresCol and labelCol). Each row is a vector that contains the values of each predictor.
If you have string-type predictors, you will first need to index those columns with StringIndexer. featuresCol contains vectors, and vectors do not contain string values.
Take a look here for example: https://mingchen0919.github.io/learning-apache-spark/StringIndexer-and-VectorIndexer.html
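The idea behind VectorIndexer's maxCategories setting can be sketched in plain Python as follows. This is an illustration under assumptions, not Spark's implementation: a feature counts as categorical when it has at most max_categories distinct values, and category indices are assigned in ascending value order, which is a choice of this sketch. All names and sample rows are hypothetical.

```python
def fit_vector_indexer(rows, max_categories):
    """Map feature index -> {value: category index} for low-cardinality features."""
    category_maps = {}
    for j in range(len(rows[0])):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            category_maps[j] = {v: float(i) for i, v in enumerate(distinct)}
    return category_maps


def transform(rows, category_maps):
    """Re-index categorical features; leave continuous features untouched."""
    return [
        [category_maps[j][v] if j in category_maps else v for j, v in enumerate(row)]
        for row in rows
    ]


rows = [
    [0.0, 1.5, 10.0],
    [1.0, 2.7, 20.0],
    [0.0, 3.1, 10.0],
    [1.0, 4.8, 30.0],
    [0.0, 5.9, 20.0],
]
maps = fit_vector_indexer(rows, max_categories=3)
print(sorted(maps))          # indices of features treated as categorical
print(transform(rows, maps))
```

Here the second feature has five distinct values, so it stays continuous, while the first and third are re-indexed as categories.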

Spark: Dimensions mismatch error with RDD[LabeledPoint] union

I would ideally like to do the following:
In essence, what I want to do is for my dataset that is RDD[LabeledPoint], I want to control the ratio of positive and negative labels.
val training_data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "training_data.tsv")
This dataset includes both cases and controls. I want to control the ratio of cases to controls (my dataset is skewed), so I want to sample training_data such that the ratio of cases to controls is 1:2 (instead of, say, 1:500).
I was not able to do that, so I separated the training data into cases and controls as below and then tried to combine them using the union operator, which gave me the "Dimensions mismatch" error.
I have two datasets (both in Libsvm format):
val positives: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "positives.tsv")
val negatives: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "negatives.tsv")
I want to combine these two to form training data. Note both are in libsvm format.
training = positives.union(negatives)
When I use the above training dataset in model building (such as logistic regression), I get an error, since positives and negatives can have a different number of columns/dimensions: "Dimensions mismatch when merging with another summarizer". Any idea how to handle that?
In addition, I also want to do samplings such as
positives_subset = positives.sample()
I was able to solve this in the following way:
def create_subset(training: RDD[LabeledPoint], target_label: Double, sampling_ratio: Double): RDD[LabeledPoint] = {
  // Keep only the points with the requested label.
  val training_filtered = training.filter { case LabeledPoint(label, _) => label == target_label }
  // Sample without replacement, keeping roughly sampling_ratio of the points.
  training_filtered.sample(false, sampling_ratio)
}
Then calling the above method as:
val positives = create_subset(training, 1.0, 1.0)
val negatives_sampled = create_subset(training, 0.0, sampling_ratio)
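Then you can take the union as:
val training_subset_double = positives.union(negatives_sampled)
and then I was able to use training_subset_double for model building. Since both subsets are filtered from the same RDD, they share the same feature dimension, which avoids the mismatch.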
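For choosing sampling_ratio, a small plain-Python sketch of the arithmetic might look like this. It mimics the filter-then-sample idea on an in-memory list rather than an RDD; the helper names, the seed parameter and the counts are assumptions for illustration.

```python
import random


def negative_sampling_ratio(n_pos, n_neg, negatives_per_positive):
    """Fraction of negatives to keep to reach the desired negative:positive ratio."""
    if n_neg <= 0:
        raise ValueError("need at least one negative example")
    return min(1.0, negatives_per_positive * n_pos / n_neg)


def create_subset(points, target_label, sampling_ratio, seed=0):
    """Keep points with target_label, each independently with prob. sampling_ratio."""
    rng = random.Random(seed)
    return [
        (label, features)
        for label, features in points
        if label == target_label and rng.random() < sampling_ratio
    ]


# 1:500 skew, target 1:2 -> keep 2 * 100 / 50000 of the negatives.
print(negative_sampling_ratio(100, 50000, 2))
```

As with sampling without replacement on an RDD, the resulting subset size is only approximately sampling_ratio times the number of matching points.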

Does Spark.ml LogisticRegression assume numerical features only?

I was looking at the Spark 1.5 dataframe/row api and the implementation for the logistic regression. As I understand, the train method therein first converts the dataframe to RDD[LabeledPoint] as,
override protected def train(dataset: DataFrame): LogisticRegressionModel = {
  // Extract columns from data. If dataset is persisted, do not persist oldDataset.
  val instances = extractLabeledPoints(dataset).map {
    case LabeledPoint(label: Double, features: Vector) => (label, features)
  }
  ...
And then it proceeds to feature standardization, etc.
What confuses me is that a DataFrame is essentially an RDD[Row], and a Row is allowed to hold values of any type; e.g. (1, true, "a string", null) seems to be a valid row of a DataFrame. If that is so, what does extractLabeledPoints above do? It seems to select only Array[Double] as the feature values in the Vector. What happens if a column in the DataFrame contains strings? Also, what happens to integer categorical values?
Thanks in advance,
Nikhil
Let's ignore Spark for a moment. Generally speaking, linear models, including logistic regression, expect numeric independent variables. This is not in any way specific to Spark / MLlib. If the input contains categorical or ordinal variables, these have to be encoded first. Some languages, like R, handle this in a transparent manner:
> df <- data.frame(x1 = c("a", "b", "c", "d"), y=c("aa", "aa", "bb", "bb"))
> glm(y ~ x1, df, family="binomial")
Call:  glm(formula = y ~ x1, family = "binomial", data = df)

Coefficients:
(Intercept)          x1b          x1c          x1d
 -2.357e+01   -4.974e-15    4.713e+01    4.713e+01
...
but what is really used behind the scenes is a so-called design matrix:
> model.matrix( ~ x1, df)
  (Intercept) x1b x1c x1d
1           1   0   0   0
2           1   1   0   0
3           1   0   1   0
4           1   0   0   1
...
Skipping over the details, it is the same type of transformation as the one performed by the OneHotEncoder in Spark.
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
val df = sqlContext.createDataFrame(Seq(
  Tuple1("a"), Tuple1("b"), Tuple1("c"), Tuple1("d")
)).toDF("x").repartition(1)
val indexer = new StringIndexer()
  .setInputCol("x")
  .setOutputCol("xIdx")
  .fit(df)
val indexed = indexer.transform(df)
val encoder = new OneHotEncoder()
  .setInputCol("xIdx")
  .setOutputCol("xVec")
val encoded = encoder.transform(indexed)
encoded
  .select($"xVec")
  .map(_.getAs[Vector]("xVec").toDense)
  .foreach(println)
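For comparison, the design matrix that R printed for x1 can be reproduced with a few lines of plain Python. This is only a sketch of treatment coding (intercept plus one 0/1 column per non-reference level), with the alphabetically first level taken as the reference; the function name design_matrix and the "x1" column prefix are chosen here to match the R output.

```python
def design_matrix(values, prefix="x1"):
    """Intercept column plus one indicator column per non-reference level."""
    levels = sorted(set(values))
    reference, others = levels[0], levels[1:]  # first level is the baseline
    columns = ["(Intercept)"] + [prefix + level for level in others]
    rows = [[1] + [1 if v == level else 0 for level in others] for v in values]
    return columns, rows


columns, rows = design_matrix(["a", "b", "c", "d"])
print(columns)
for row in rows:
    print(row)
```

Dropping one level keeps the columns linearly independent of the intercept, which is why four categories need only three indicator columns.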
Spark goes one step further: all features, even if the algorithm allows nominal/ordinal independent variables, have to be stored as Double in a spark.mllib.linalg.Vector. In the case of spark.ml it is a DataFrame column; in spark.mllib, a field in spark.mllib.regression.LabeledPoint.
Depending on the model, the interpretation of the feature vector can differ, though. As mentioned above, for a linear model these will be interpreted as numerical variables. For Naive Bayes these are considered nominal. If a model accepts both numerical and nominal variables and treats each group in a different way, like decision / regression trees, you can provide the categoricalFeaturesInfo parameter.
It is worth pointing out that dependent variables should be encoded as Double as well but, unlike independent variables, may require additional metadata to be handled properly. If you take a look at the indexed DataFrame, you'll see that StringIndexer not only transforms x but also adds attributes:
scala> org.apache.spark.ml.attribute.Attribute.fromStructField(indexed.schema(1))
res12: org.apache.spark.ml.attribute.Attribute = {"vals":["d","a","b","c"],"type":"nominal","name":"xIdx"}
Finally some Transformers from ML, like VectorIndexer, can automatically detect and encode categorical variables based on the number of distinct values.
Copying clarification from zero323 in the comments:
Categorical values have to be encoded as Double before being passed to MLlib / ML estimators. There are quite a few built-in transformers, like StringIndexer or OneHotEncoder, which can be helpful here. If an algorithm treats categorical features differently from numerical ones, like for example DecisionTree, you identify which variables are categorical using categoricalFeaturesInfo.
Finally some transformers use special attributes on columns to distinguish between different types of attributes.
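As a rough illustration of what categoricalFeaturesInfo carries: it maps a feature index to the number of categories of that feature. The plain-Python sketch below builds such a mapping from already-indexed data. The helper name, the sample rows, and the assumption that categories are encoded as 0.0, 1.0, ..., k-1 (so the arity is the maximum value plus one) are for illustration only.

```python
def categorical_features_info(rows, categorical_indices):
    """Map each categorical feature index to its number of categories.

    Assumes categories are encoded as consecutive doubles starting at 0.0.
    """
    info = {}
    for j in categorical_indices:
        info[j] = int(max(row[j] for row in rows)) + 1
    return info


# Hypothetical rows: features 0 and 2 are indexed categories, feature 1 is numeric.
rows = [
    [0.0, 3.2, 2.0],
    [1.0, 0.5, 0.0],
    [0.0, 7.1, 1.0],
]
print(categorical_features_info(rows, [0, 2]))
```

Features missing from the mapping are the ones a tree-based model would treat as continuous.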

Resources