Spark and categorical string variables

I'm trying to understand how Spark handles string categorical independent variables. I know that in Spark I have to convert strings to doubles using StringIndexer.
E.g., "a"/"b"/"c" => 0.0/1.0/2.0.
But what I would really like to avoid is then having to use OneHotEncoder on that column of doubles. This seems to make the pipeline unnecessarily messy, especially since Spark knows that the data is categorical. Hopefully the sample code below makes my question clearer.
val df = sqlContext.createDataFrame(Seq(
Tuple2(0.0,"a"), Tuple2(1.0, "b"), Tuple2(1.0, "c"), Tuple2(0.0, "c")
)).toDF("y", "x")
// index the string column "x"
val indexer = new StringIndexer().setInputCol("x").setOutputCol("xIdx").fit(df)
val indexed = indexer.transform(df)
// build a data frame of label, vectors
val assembler = (new VectorAssembler()).setInputCols(List("xIdx").toArray).setOutputCol("features")
val assembled = assembler.transform(indexed)
// build a logistic regression model and fit it
val logreg = (new LogisticRegression()).setFeaturesCol("features").setLabelCol("y")
val model = logreg.fit(assembled)
The logistic regression sees this as a model with only one independent variable.
res1: org.apache.spark.mllib.linalg.Vector = [0.7667490491775728]
But the independent variable is categorical with three categories = ["a", "b", "c"]. I know I never did a one of k encoding but the metadata of the data frame knows that the feature vector is nominal.
res2: = {"ml_attr":{"attrs":
How do I pass this information to LogisticRegression? Is this not the whole point of keeping dataframe metadata? There does not seem to be a CategoricalFeaturesInfo in SparkML. Do I really need to do a 1 of k encoding for each categorical feature?

Maybe I am missing something, but this really looks like a job for RFormula.
As the name suggests, it takes an "R-style" formula that describes how the feature vector is composed from the input data columns.
For each categorical input column (that is, a column of StringType) it adds a StringIndexer + OneHotEncoder to the final pipeline that implements the formula under the hood.
The output is a feature vector (of doubles) that can be used with any algorithm in the package, such as the one you are targeting.
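To make the difference concrete, here is a minimal plain-Python sketch (not Spark code) of what the index-then-one-hot step does to the question's column. The helper name `one_hot` and the `drop_last` parameter are made up for illustration; dropping the last category mirrors the usual reference-level convention and is an assumption of this sketch.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a 0/1 vector.

    With drop_last=True the last category becomes the all-zeros
    reference level, so k categories need only k - 1 columns.
    """
    width = num_categories - 1 if drop_last else num_categories
    return [1.0 if position == index else 0.0 for position in range(width)]

# Index mapping taken from the question's example: "a"/"b"/"c" => 0/1/2.
index_of = {"a": 0, "b": 1, "c": 2}
x = ["a", "b", "c", "c"]

# Indexing alone gives a single numeric column, hence a single coefficient.
indexed_only = [[float(index_of[v])] for v in x]

# One-hot encoding gives one column per non-reference category.
encoded = [one_hot(index_of[v], len(index_of)) for v in x]
```

With only the indexed column, a linear model sees one feature whose values 0, 1, 2 are treated as ordered magnitudes. After encoding, each non-reference category gets its own column and therefore its own coefficient.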


Can I use a StringIndexer without One-Hot-Encoding it in PMML (exporting from Spark)?

I'm trying to take a functional, fitted SparkML pipeline (Scala, Spark 2.1.1 for compatibility reasons) and turn it into PMML for interoperability and storage purposes.
At the moment, the pipeline has the following form: Array(StringIndexer,StringIndexer,VectorAssembler,VectorIndexer). I've tried the standard org.jpmml.sparkml.PMMLBuilder which works perfectly fine in situations where I'd already indexed the strings on the database. (I know how many distinct strings there are in these columns, and I'm completely certain that they'll stay categorical.) I'm planning on using them in a decision tree and a few other tree-based methods, and SparkML has lovely treatment of categorical variables in trees that make one-hot-encoding less than ideal.
val strCols = Array("stringCol1","stringCol2")
val strIndexers = strCols.map(c => new StringIndexer().setInputCol(c).setOutputCol(c+"_Indexed"))
val collist = df.columns.diff(strCols) ++ strCols.map(c => c+"_Indexed")
val vectorAssembler = new VectorAssembler()
val vectorIndexer = new VectorIndexer().setInputCol("rawFeatures").setOutputCol("features").setMaxCategories(35)
val pipeintro = new Pipeline().setStages(strIndexers :+ vectorAssembler :+ vectorIndexer)
val pipeIntro = pipeintro.fit(df)
val pmmlBuilder = new org.jpmml.sparkml.PMMLBuilder(df.schema, pipeIntro).buildFile(new File("out.pmml"))
I expected the code to complete running and output the appropriate PMML, but what I get instead is:
java.lang.IllegalArgumentException: Field stringCol1 has valid values [MT, IP, OB, GA, ED, OP]
at org.jpmml.converter.PMMLEncoder.toCategorical(
at org.jpmml.sparkml.feature.VectorIndexerModelConverter.encodeFeatures(
at org.jpmml.sparkml.FeatureConverter.registerFeatures(
at org.jpmml.sparkml.PMMLBuilder.buildFile(
I've checked for null values; there are none, nor are there other values that are invalid. There's some indication somewhere that StringIndexers are supposed to be one-hot-encoded before being put into a VectorAssembler, but that's suboptimal for this particular pipeline since it's intended to feed into a SparkML-defined tree, which deals well with multi-value categorical columns. Is that guidance hard-coded into PMML or the Spark-PMML encoder? Is there some other error that I'm missing?
This particular exception is about a conflicting definition of "stringCol1": its value space has been defined by StringIndexer ([MT, IP, OB, GA, ED, OP]), and now VectorIndexer is trying to redefine it with a different value space. One of those attempts is wrong.
This could be a bug in the JPMML-SparkML library or in your script. Perhaps the output of StringIndexerModel shouldn't be VectorIndexed at all?

How should I convert an RDD of Vectors to a Dataset?

I'm struggling to understand how the conversion among RDDs, DataSets and DataFrames works.
I'm pretty new to Spark, and I get stuck every time I need to move from one data model to another (especially from RDDs to Datasets and DataFrames).
Could anyone explain to me the right way to do it?
As an example, I now have an RDD[Vector] and I need to pass it to my machine learning algorithm, for example KMeans (Spark Dataset MLlib). So I need to convert it to a Dataset with a single column named "features" which should contain Vector-typed rows. How should I do this?
All you need is an Encoder. Imports:
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.ml.linalg
val rdd = sc.parallelize(Seq(
  linalg.Vectors.dense(1.0, 2.0), linalg.Vectors.sparse(2, Array(), Array())
))
val ds = spark.createDataset(rdd)(ExpressionEncoder(): Encoder[linalg.Vector])
  .toDF("features")
ds.show
ds.printSchema
// +---------+
// | features|
// +---------+
// |[1.0,2.0]|
// |(2,[],[])|
// +---------+
// root
// |-- features: vector (nullable = true)
To convert an RDD to a dataframe, the easiest way is to use toDF() in Scala. To use this function, you need to import the implicits, which is done through the SparkSession object as follows:
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val df = rdd.toDF("features")
toDF() takes an RDD of tuples. When the RDD is built up of common Scala objects they are converted implicitly, so there is nothing to do, and when the RDD has multiple columns there is nothing to do either, because the RDD already contains tuples. In this special case, however, you first need to convert RDD[Vector] to RDD[Tuple1[Vector]]. The conversion to a tuple can be done as follows:
val df = rdd.map(Tuple1(_)).toDF("features")
The above will convert the RDD to a dataframe with a single column called features.
To convert to a dataset, the easiest way is to use a case class. Make sure the case class is defined outside the Main object. First convert the RDD to a dataframe, then do the following:
case class A(features: org.apache.spark.ml.linalg.Vector)
val ds = df.as[A]
For completeness, the underlying RDD of a dataframe or dataset can be accessed using .rdd:
val rdd = df.rdd
Instead of converting back and forth between RDDs and dataframes/datasets, it's usually easier to do all the computations using the dataframe API. If there is no suitable function for what you want, it's usually possible to define a UDF (user-defined function).

How to use StringIndexer to generate numeric variables?

I was hoping to use StringIndexer as a means of ranking the 1000+ categories in my data set, generating an index which signifies relative frequency. I could then use this index as a numeric feature for my model. Unfortunately, StringIndexer by default stores some metadata flagging the index as categorical, forcing my model to use the index as a categorical variable.
Is there some way of disabling this, so the index variable can be used as a numeric variable?
Edit: I am using string indexer as a stage in a ML pipeline, so a solution would need to avoid manipulating the data frame directly. Also I will be saving and loading this pipeline, so a custom data transformer may be impractical. I suspect this isn't possible as Spark is currently written.
You can index the data and then replace the metadata. Let's say your data looks like this:
import spark.implicits._
val indexer = new StringIndexer().setInputCol("raw").setOutputCol("indexed")
val df = Seq("a", "b", "b", "c", "c", "c").toDF("raw")
val indexed = indexer.fit(df).transform(df)
We'll need a NumericAttribute:
import org.apache.spark.ml.attribute.NumericAttribute
and metadata:
val meta = NumericAttribute.defaultAttr.withName("indexed").toMetadata
Finally we can replace the metadata using the as method:
indexed.withColumn("indexed", $"indexed".as("indexed", meta))
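As a rough illustration of the index the question wants to reuse as a numeric feature, here is a plain-Python sketch (not Spark code) of a frequency ranking over the same sample data. The function name `frequency_rank` is hypothetical, and breaking ties alphabetically is an assumption of the sketch rather than something the answer states.

```python
from collections import Counter

def frequency_rank(values):
    """Rank distinct labels by descending frequency; rank 0.0 is the most frequent.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(rank) for rank, label in enumerate(ordered)}

raw = ["a", "b", "b", "c", "c", "c"]
ranks = frequency_rank(raw)          # "c" is most frequent, then "b", then "a"
indexed = [ranks[v] for v in raw]    # the column a model would receive
```

Whether the values in `indexed` are treated as a category or as a number is decided by the column metadata, which is exactly what the answer above replaces.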

Spark, ML, StringIndexer: handling unseen labels

My goal is to build a multiclass classifier.
I have built a pipeline for feature extraction. Its first step is a StringIndexer transformer that maps each class name to a label; this label will be used in the classifier training step.
The pipeline is fitted on the training set.
The test set has to be processed by the fitted pipeline in order to extract the same feature vectors.
My test set files have the same structure as the training set. A possible scenario is that an unseen class name appears in the test set; in that case the StringIndexer will fail to find the label and an exception will be raised.
Is there a solution for this case? Or how can we avoid that from happening?
With Spark 2.2 (released 7-2017) you are able to use the .setHandleInvalid("keep") option when creating the indexer. With this option, the indexer maps every unseen label to one additional index instead of failing.
val categoryIndexerModel = new StringIndexer()
  .setHandleInvalid("keep") // options are "keep", "error" or "skip"
From the documentation: there are three strategies regarding how StringIndexer will handle unseen labels when you have fit a StringIndexer on one dataset and then use it to transform another:
'error': throws an exception (which is the default)
'skip': skips the rows containing the unseen labels entirely (removes the rows on the output!)
'keep': puts unseen labels in a special additional bucket, at index numLabels
Please see the linked documentation for examples on how the output of StringIndexer looks for the different options.
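The three strategies can be sketched in plain Python (this mimics the documented behaviour and is not Spark code). The function name `transform` and its arguments are hypothetical, and the sketch raises a ValueError where Spark would raise its own exception.

```python
def transform(labels, values, handle_invalid="error"):
    """Index `values` against the fitted `labels`, handling unseen labels.

    'error': raise on the first unseen label (the default)
    'skip' : drop rows whose label was not seen during fitting
    'keep' : send every unseen label to one extra bucket, index len(labels)
    """
    index_of = {label: float(i) for i, label in enumerate(labels)}
    out = []
    for value in values:
        if value in index_of:
            out.append((value, index_of[value]))
        elif handle_invalid == "keep":
            out.append((value, float(len(labels))))
        elif handle_invalid == "skip":
            continue  # the row disappears from the output
        else:
            raise ValueError(f"Unseen label: {value}")
    return out

train_labels = ["a", "b", "c"]      # labels seen while fitting
test_values = ["a", "d", "c"]       # "d" was never seen

kept = transform(train_labels, test_values, "keep")
skipped = transform(train_labels, test_values, "skip")
```

Note that with "keep" all unseen labels share the same index, so they cannot be told apart afterwards, and with "skip" the output has fewer rows than the input.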
There's a way around this in Spark 1.6.
Here's the jira:
Here's an example:
val categoryIndexerModel = new StringIndexer()
  .setHandleInvalid("skip") // new method. values are "error" or "skip"
I started using this, but ended up going back to KrisP's 2nd bullet point about fitting this particular Estimator to the full dataset.
You'll need the fitted model later in the pipeline, when you convert the indices back to strings with IndexToString.
Here's the modified example:
val categoryIndexerModel = new StringIndexer()
  .fit(itemsDF) // Fit the Estimator and create a Model (Transformer)
// ... do some kind of classification ...
val categoryReverseIndexer = new IndexToString()
  .setLabels(categoryIndexerModel.labels) // Use the labels from the Model
No nice way to do it, I'm afraid. Either
filter out the test examples with unknown labels before applying StringIndexer
or fit StringIndexer to the union of train and test dataframe, so you are assured all labels are there
or transform the test example case with unknown label to a known label
Here is some sample code to perform above operations:
// get training labels from original train dataframe
val trainlabels = //Array[String]
// or get labels from a trained StringIndexer model
val trainlabels = simodel.labels
// define an UDF on your dataframe that will be used for filtering
val filterudf = udf { label:String => trainlabels.contains(label)}
// filter out the bad examples
val filteredTestdf = testdf.filter( filterudf(testdf(colname)))
// transform unknown value to some value, say "a"
val mapudf = udf { label:String => if (trainlabels.contains(label)) label else "a"}
// add a new column to testdf:
val transformedTestdf = testdf.withColumn( "newcol", mapudf(testdf(colname)))
In my case, I was running Spark ALS on a large data set and the data was not available at all partitions, so I had to cache() the data appropriately and it worked like a charm.
To me, ignoring the rows completely by setting an argument is not really a feasible way to solve the issue.
I ended up creating my own CustomStringIndexer transformer, which assigns a new value to all new strings that were not encountered while training. You can also do this by changing the relevant portions of the Spark feature code (just remove the if condition explicitly checking for this and make it return the length of the array instead) and recompiling the jar.
Not really an easy fix, but it certainly is a fix.
I remember seeing a bug in JIRA to incorporate this as well:
It is set to be released with Spark 2.2 though. Just have to wait I guess :S

Apache Spark: Applying a function from sklearn parallel on partitions

I'm new to Big Data and Apache Spark (and an undergrad doing work under a supervisor).
Is it possible to apply a function (i.e. a spline) to only partitions of the RDD? I'm trying to implement some of the work in the paper here.
The book "Learning Spark" seems to indicate that this is possible, but doesn't explain how.
"If you instead have many small datasets on which you want to train different learning models, it would be better to use a single- node learning library (e.g., Weka or SciKit-Learn) on each node, perhaps calling it in parallel across nodes using a Spark map()."
Actually, we have a library which does exactly that. We have several sklearn transformers and predictors up and running. Its name is sparkit-learn.
From our examples:
import numpy as np
from splearn.rdd import DictRDD
from splearn.feature_extraction.text import SparkHashingVectorizer
from splearn.feature_extraction.text import SparkTfidfTransformer
from splearn.svm import SparkLinearSVC
from splearn.pipeline import SparkPipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
X = [...]  # list of texts
y = [...]  # list of labels
X_rdd = sc.parallelize(X, 4)
y_rdd = sc.parallelize(y, 4)
Z = DictRDD((X_rdd, y_rdd),
            columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])
local_pipeline = Pipeline((
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC())
))
dist_pipeline = SparkPipeline((
    ('vect', SparkHashingVectorizer()),
    ('tfidf', SparkTfidfTransformer()),
    ('clf', SparkLinearSVC())
))
# fit the local pipeline on plain lists and the distributed one on the DictRDD
local_pipeline.fit(X, y)
dist_pipeline.fit(Z, clf__classes=np.unique(y))
y_pred_local = local_pipeline.predict(X)
y_pred_dist = dist_pipeline.predict(Z[:, 'X'])
You can find it here.
I'm not 100% sure that I am following, but there are a number of partition methods, such as mapPartitions. These operators hand you the Iterator on each node, and you can do whatever you want to the data and pass it back through a new Iterator.
rdd.mapPartitions { iter =>
  //Spin up something expensive that you only want to do once per node
  for (item <- iter) yield {
    //do stuff to the items using your expensive item
  }
}
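The per-partition pattern can be sketched in Python as a generator that consumes one partition's iterator and yields one fitted model. In PySpark a function of this shape could be passed to rdd.mapPartitions; here the partitions are simulated with plain lists so the example runs without a cluster. The function name `fit_line_per_partition` is hypothetical, and a least-squares straight line stands in for the spline (or any sklearn estimator) from the question.

```python
def fit_line_per_partition(rows):
    """Consume one partition of (x, y) pairs and yield one (slope, intercept).

    Yields nothing if the partition has fewer than two points or if all
    x values are identical, because no line can be fitted in those cases.
    """
    points = list(rows)  # materialise this partition only
    n = len(points)
    if n < 2:
        return
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    yield slope, mean_y - slope * mean_x

# Two simulated partitions, each holding its own small data set.
partitions = [
    [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],  # points on y = 2x + 1
    [(0.0, 0.0), (2.0, -2.0)],             # points on y = -x
]

# Locally, apply the function to each partition's iterator in turn.
models = [m for part in partitions for m in fit_line_per_partition(iter(part))]
```

Each partition is materialised in memory by `list(rows)`, so this only makes sense when a single partition fits comfortably on one worker.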
If your data set is small (it is possible to load it and train on one worker) you can do something like this:
def trainModel[T](modelId: Int, trainingSet: List[T]) = {
  //trains model with modelId and returns it
}
//fake data
val data = List()
val numberOfModels = 100
val broadcastedData = sc.broadcast(data)
val trainedModels = sc.parallelize(Range(0, numberOfModels))
  .map(modelId => (modelId, trainModel(modelId, broadcastedData.value)))
I assume you have some list of models (or somehow parameterized models) and you can give them ids. Then, in the function trainModel, you pick one depending on the id. As a result you get an RDD of pairs, each holding a model id and its trained model.
