With Spark MLlib, I'd build a model (like RandomForest), and it was then possible to evaluate it outside of Spark by loading the model and calling predict on it with a vector of features.
It seems like with Spark ML, predict is now called transform and only acts on a DataFrame.
Is there any way to build a DataFrame outside of Spark since it seems like one needs a SparkContext to build a DataFrame?
Re: Is there any way to build a DataFrame outside of Spark?
It is not possible. DataFrames live inside an SQLContext, which in turn lives inside a SparkContext. You might be able to work around it somehow, but the connection between DataFrames and SparkContext is by design.

Here is my solution for using Spark models outside of a Spark context (using PMML):
You create the model with a pipeline like this:
SparkConf sparkConf = new SparkConf();
SparkSession session = SparkSession.builder().enableHiveSupport().config(sparkConf).getOrCreate();
String tableName = "schema.table";
Properties dbProperties = new Properties();
String simpleUrl = "jdbc:impala://host:21050/schema";
Dataset<Row> data = session.read().jdbc(simpleUrl, tableName, dbProperties);
String[] inputCols = {"column1"};
StringIndexer indexer = new StringIndexer().setInputCol("column1").setOutputCol("indexed_column1");
StringIndexerModel alphabet = indexer.fit(data);
data = alphabet.transform(data);
VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
Predictor p = new GBTRegressor();
PipelineStage[] stages = {indexer,assembler, p};
Pipeline pipeline = new Pipeline().setStages(stages);
PipelineModel pmodel = pipeline.fit(data);
PMML pmml = ConverterUtil.toPMML(data.schema(),pmodel);
FileOutputStream fos = new FileOutputStream("model.pmml");
JAXBUtil.marshalPMML(pmml,new StreamResult(fos));
Using PMML for predictions (locally, without a Spark context; the evaluator is applied to a Map of arguments rather than to a DataFrame):
PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(new FileInputStream(pmmlFile));
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
MiningModelEvaluator evaluator = (MiningModelEvaluator) modelEvaluatorFactory.newModelEvaluator(pmml);
Map<String, Field> inputFieldMap = new HashMap<String, Field>();
Map<FieldName,String> args = new HashMap<FieldName, String>();
Field curField = evaluator.getInputFields().get(0);
args.put(curField.getName(), "1.0");
Map<FieldName, ?> result = evaluator.evaluate(args);

Spent days on this problem too. It's not straightforward. My third suggestion involves code I have written specifically for this purpose.
Option 1
As other commenters have said, predict(Vector) is now available. However, you need to know how to construct a vector. If you don't, see Option 3.
Option 2
If the goal is to avoid setting up a Spark server (standalone or cluster mode), then it's possible to start Spark in local mode. The whole thing will run inside a single JVM.
val spark = SparkSession.builder().config("spark.master", "local[*]").getOrCreate()
// create dataframe from file, or make it up from some data in memory
// use model.transform() to get predictions
But this brings unnecessary dependencies to your prediction module, and it consumes resources in your JVM at runtime. Also, if prediction latency is critical, for example making a prediction within a millisecond as soon as a request comes in, then this option is too slow.
Option 3
MLlib's FeatureHasher output can be used as the input to your learner. The class is good for one-hot encoding and also for fixing the size of your feature dimension, and you can use it even when all your features are numerical. If you use it in training, then all you need at prediction time is its hashing logic. It's implemented as a Spark transformer, so it's not easy to reuse outside of a Spark environment, which is why I pulled the hashing function out into a library. You apply FeatureHasher and your learner during training as normal. Then here's how you use the slimmed-down hasher at prediction time:
// Schema and hash size must stay consistent across training and prediction
val hasher = new FeatureHasherLite(mySchema, myHashSize)
// create sample data-point and hash it
val feature = Map("feature1" -> "value1", "feature2" -> 2.0, "feature3" -> 3, "feature4" -> false)
val featureVector = hasher.hash(feature)
// Make prediction
val prediction = model.predict(featureVector)
You can see details on my GitHub at tilayealemu/sparkmllite. If you'd rather copy my code, take a look at FeatureHasherLite.scala. There are code samples and unit tests too. Feel free to create an issue if you need help.
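To make the idea behind Option 3 concrete, here is a minimal plain-Python sketch of the hashing trick. It follows the convention that Spark's FeatureHasher documents (numeric features are hashed by name and contribute their value; string and boolean features are hashed as "name=value" and contribute 1.0), but it is not FeatureHasherLite and it is not compatible with Spark: the function name hash_features is made up for illustration, and MD5 stands in for Spark's MurmurHash3, so the indices will differ from what a Spark-trained model expects.

```python
import hashlib


def hash_features(features, num_features):
    """Map a dict of named features to a fixed-length dense vector.

    Illustration only: MD5 is a stand-in hash, so the resulting indices
    do NOT match the ones produced by Spark's FeatureHasher.
    """
    vector = [0.0] * num_features
    for name, value in features.items():
        # bool must be checked before numbers because bool is a subclass of int
        if isinstance(value, (bool, str)):
            # Categorical: hash "name=value" and add an indicator of 1.0
            key, weight = "{}={}".format(name, value), 1.0
        else:
            # Numeric: hash the feature name and add the value itself
            key, weight = name, float(value)
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        index = int(digest, 16) % num_features
        vector[index] += weight  # colliding features simply add up
    return vector


sample = {"feature1": "value1", "feature2": 2.0, "feature3": 3, "feature4": False}
vector = hash_features(sample, 16)
print(vector)
```

The point is that the vector length depends only on num_features, never on how many distinct values show up, which is why the same schema and hash size must be used at training and prediction time.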


Can I use a StringIndexer without One-Hot-Encoding it in PMML (exporting from Spark)?

I'm trying to take a functional, fitted SparkML pipeline (Scala, Spark 2.1.1 for compatibility reasons) and turn it into PMML for interoperability and storage purposes.
At the moment, the pipeline has the following form: Array(StringIndexer,StringIndexer,VectorAssembler,VectorIndexer). I've tried the standard org.jpmml.sparkml.PMMLBuilder, which works perfectly fine in situations where I'd already indexed the strings on the database. (I know how many distinct strings there are in these columns, and I'm completely certain that they'll stay categorical.) I'm planning on using them in a decision tree and a few other tree-based methods, and SparkML has a lovely treatment of categorical variables in trees that makes one-hot encoding less than ideal.
val strCols = Array("stringCol1","stringCol2")
val strIndexers = strCols.map(c => new StringIndexer().setInputCol(c).setOutputCol(c+"_Indexed"))
val collist = df.columns.diff(strCols) ++ strCols.map(c => c+"_Indexed")
val vectorAssembler = new VectorAssembler().setInputCols(collist).setOutputCol("rawFeatures")
val vectorIndexer = new VectorIndexer().setInputCol("rawFeatures").setOutputCol("features").setMaxCategories(35)
val pipeintro = new Pipeline().setStages(strIndexers :+ vectorAssembler :+ vectorIndexer)
val pipeIntro = pipeintro.fit(df)
val pmmlBuilder = new org.jpmml.sparkml.PMMLBuilder(df.schema, pipeIntro).buildFile(new File("out.pmml"))
I expected the code to complete running and output the appropriate PMML, but what I get instead is:
java.lang.IllegalArgumentException: Field stringCol1 has valid values [MT, IP, OB, GA, ED, OP]
at org.jpmml.converter.PMMLEncoder.toCategorical(
at org.jpmml.sparkml.feature.VectorIndexerModelConverter.encodeFeatures(
at org.jpmml.sparkml.FeatureConverter.registerFeatures(
at org.jpmml.sparkml.PMMLBuilder.buildFile(
I've checked for null values; there are none, nor are there other values that are invalid. There's some indication somewhere that StringIndexers are supposed to be one-hot-encoded before being put into a VectorAssembler, but that's suboptimal for this particular pipeline since it's intended to feed into a SparkML-defined tree, which deals well with multi-value categorical columns. Is that guidance hard-coded into PMML or the Spark-PMML encoder? Is there some other error that I'm missing?
This particular exception is about conflicting "stringCol1" definition - its value space has been defined by StringIndexer ([MT, IP, OB, GA, ED, OP]), and now VectorIndexer is trying to re-define it with a different value space. One of those attempts is wrong.
It could be a bug in the JPMML-SparkML library or in your script. Perhaps the output of StringIndexerModel shouldn't be VectorIndexed at all?

How to use an L1 penalty for feature selection?

I'm using Spark 1.6.0 and want to use an L1 penalty for feature selection.
But I can not get the detailed coefficients when calling the function:
lr = LogisticRegression(elasticNetParam=1.0, regParam=0.01,maxIter=100,fitIntercept=False,standardization=False)
model = lr.fit(training_df)  # training_df is a placeholder; the original DataFrame name was lost
print model.coefficients.toArray().astype(float).tolist()
I only get a sparse list like:
While when I use sklearn.linear_model.LogisticRegression model, I can get the detailed list without zero value in coef_ list like:
With Spark's better performance, I could finish my work faster. I just want to use an L1 penalty for feature selection.
I think I need more detailed coefficient values for my feature selection work, just as sklearn provides. How can I solve my problem?
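Below is a working code snippet for Spark 2.1.
The key to extracting the values is casting the fitted pipeline stage to its model class and reading its coefficients: modelL1.stages(4).asInstanceOf[LinearRegressionModel].coefficients
Spark 1.6 may have something similar.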
val holIndIndexer = new StringIndexer().setInputCol("holInd").setOutputCol("holIndIndexer")
val holIndEncoder = new OneHotEncoder().setInputCol("holIndIndexer").setOutputCol("holIndVec")
val time_intervaLEncoder = new OneHotEncoder().setInputCol("time_interval").setOutputCol("time_intervaLVec")
val assemblerL1 = (new VectorAssembler()
.setInputCols(Array("time_intervaLVec", "holIndVec", "length")).setOutputCol("features") )
val lrL1 = new LinearRegression().setFeaturesCol("features").setLabelCol("travel_time")
val pipelineL1 = new Pipeline().setStages(Array(holIndIndexer,holIndEncoder,time_intervaLEncoder,assemblerL1, lrL1))
val modelL1 = pipelineL1.fit(trainingDF) // trainingDF is a placeholder; the original DataFrame name was lost
val l1Coeff = modelL1.stages(4).asInstanceOf[LinearRegressionModel].coefficients
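Once the coefficients are available as a plain list (for example via coefficients.toArray().tolist() as in the question), L1-based feature selection usually comes down to keeping the features whose weight was not driven to zero. A small plain-Python sketch of that step follows; the function name select_features and the feature names and weights are made up for illustration, not taken from the question.

```python
def select_features(feature_names, coefficients, tol=0.0):
    """Return (name, weight) pairs with |weight| > tol, largest magnitude first."""
    kept = [(name, w) for name, w in zip(feature_names, coefficients) if abs(w) > tol]
    return sorted(kept, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical example: an L1 penalty has zeroed out f0 and f2
names = ["f0", "f1", "f2", "f3"]
weights = [0.0, -1.2, 0.0, 0.4]
selected = select_features(names, weights)
print(selected)  # [('f1', -1.2), ('f3', 0.4)]
```

Raising tol slightly above zero is one way to also drop coefficients that are tiny but not exactly zero; whether that is appropriate depends on how the features are scaled.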

How to get Precision/Recall using CrossValidator for training NaiveBayes Model using Spark

Suppose I have a Pipeline like this:
val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val idf = new IDF().setInputCol("features").setOutputCol("idffeatures")
val nb = new NaiveBayes()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb))
val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures, Array(10, 100, 1000)).addGrid(nb.smoothing, Array(0.01, 0.1, 1)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator()).setEstimatorParamMaps(paramGrid).setNumFolds(10)
val cvModel = cv.fit(trainingData) // trainingData is a placeholder; the original DataFrame name was lost
As you can see, I defined a CrossValidator using a BinaryClassificationEvaluator. I have seen a lot of examples that get metrics like Precision/Recall during testing, but those metrics are obtained using a separate data set for testing (see for example this documentation).
From my understanding, CrossValidator creates folds, uses one fold for testing, and then chooses the best model. My question is: is it possible to get Precision/Recall metrics during the training process?
Well, the only metric which is actually stored is the one you define when you create an instance of an Evaluator. For the BinaryClassificationEvaluator this can take one of two values:
areaUnderROC
areaUnderPR
with the former being the default, and it can be set using the setMetricName method.
These values are collected during the training process and can be accessed using CrossValidatorModel.avgMetrics. The order of the values corresponds to the order of EstimatorParamMaps (CrossValidatorModel.getEstimatorParamMaps).
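Because the metrics and the parameter maps share the same order, pairing them up is enough to see how each grid point scored. Here is a plain-Python sketch of that bookkeeping, assuming the two sequences have already been pulled out of the fitted model; the function name rank_param_maps and all values below are invented for illustration, and "best" assumes a larger-is-better metric such as area under the ROC curve.

```python
def rank_param_maps(param_maps, avg_metrics):
    """Pair each parameter map with its averaged metric, best (largest) first."""
    if len(param_maps) != len(avg_metrics):
        raise ValueError("expected exactly one metric per parameter map")
    return sorted(zip(param_maps, avg_metrics), key=lambda pair: pair[1], reverse=True)


# Hypothetical grid points and cross-validated metrics, in matching order
param_maps = [
    {"numFeatures": 10, "smoothing": 0.01},
    {"numFeatures": 100, "smoothing": 0.1},
    {"numFeatures": 1000, "smoothing": 1.0},
]
avg_metrics = [0.71, 0.83, 0.78]

ranked = rank_param_maps(param_maps, avg_metrics)
best_params, best_metric = ranked[0]
print(best_params, best_metric)
```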

Spark, ML, StringIndexer: handling unseen labels

My goal is to build a multiclass classifier.
I have built a pipeline for feature extraction. Its first step is a StringIndexer transformer that maps each class name to a label, and this label is used in the classifier training step.
The pipeline is fitted on the training set.
The test set has to be processed by the fitted pipeline in order to extract the same feature vectors.
My test set files have the same structure as the training set. However, the test set may contain a class name that was not seen during training; in that case the StringIndexer will fail to find the label and an exception will be raised.
Is there a solution for this case, or how can we avoid it?
With Spark 2.2 (released July 2017) you are able to use the .setHandleInvalid("keep") option when creating the indexer. With this option, the indexer maps all unseen labels to one additional index.
val categoryIndexerModel = new StringIndexer()
.setHandleInvalid("keep") // options are "keep", "error" or "skip"
From the documentation: there are three strategies regarding how StringIndexer will handle unseen labels when you have fit a StringIndexer on one dataset and then use it to transform another:
'error': throws an exception (which is the default)
'skip': skips the rows containing the unseen labels entirely (removes the rows on the output!)
'keep': puts unseen labels in a special additional bucket, at index numLabels
Please see the linked documentation for examples on how the output of StringIndexer looks for the different options.
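The three strategies listed above can be mimicked in a few lines of plain Python, which may help when reasoning about what each one does to the rows. This is only a sketch of the described behaviour, not Spark's implementation: fit_labels and transform are hypothetical helper names, and ties in label frequency are broken alphabetically here as an assumption.

```python
from collections import Counter


def fit_labels(values):
    """Order labels by descending frequency (ties broken alphabetically here)."""
    counts = Counter(values)
    return sorted(counts, key=lambda label: (-counts[label], label))


def transform(values, labels, handle_invalid="error"):
    """Index values against the fitted labels using 'error', 'skip' or 'keep'."""
    index = {label: float(i) for i, label in enumerate(labels)}
    out = []
    for value in values:
        if value in index:
            out.append(index[value])
        elif handle_invalid == "keep":
            out.append(float(len(labels)))  # extra bucket at index numLabels
        elif handle_invalid == "skip":
            continue  # the row is dropped from the output
        else:
            raise ValueError("Unseen label: {}".format(value))
    return out


labels = fit_labels(["a", "b", "a", "c", "a", "b"])  # ['a', 'b', 'c']
test_values = ["a", "d", "c"]                        # "d" was never seen in training
print(transform(test_values, labels, "skip"))        # [0.0, 2.0]
print(transform(test_values, labels, "keep"))        # [0.0, 3.0, 2.0]
```

Note how "skip" changes the number of output rows while "keep" preserves it, which matters if the predictions need to be joined back to the input.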
There's a way around this in Spark 1.6.
Here's the jira:
Here's an example:
val categoryIndexerModel = new StringIndexer()
.setHandleInvalid("skip") // new method. values are "error" or "skip"
I started using this, but ended up going back to KrisP's 2nd bullet point about fitting this particular Estimator to the full dataset.
You'll need the fitted model later in the pipeline, when you convert the indexes back to strings with IndexToString.
Here's the modified example:
val categoryIndexerModel = new StringIndexer()
.fit(itemsDF) // Fit the Estimator and create a Model (Transformer)
... do some kind of classification ...
val categoryReverseIndexer = new IndexToString()
.setLabels(categoryIndexerModel.labels) // Use the labels from the Model
No nice way to do it, I'm afraid. Either
filter out the test examples with unknown labels before applying StringIndexer
or fit StringIndexer to the union of train and test dataframe, so you are assured all labels are there
or transform the test example case with unknown label to a known label
Here is some sample code to perform above operations:
// get training labels from original train dataframe
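val trainlabels = traindf.select(colname).distinct.collect.map(_.getString(0)) //Array[String]; traindf is an assumed name for the training dataframe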
// or get labels from a trained StringIndexer model
val trainlabels = simodel.labels
// define an UDF on your dataframe that will be used for filtering
val filterudf = udf { label:String => trainlabels.contains(label)}
// filter out the bad examples
val filteredTestdf = testdf.filter( filterudf(testdf(colname)))
// transform unknown value to some value, say "a"
val mapudf = udf { label:String => if (trainlabels.contains(label)) label else "a"}
// add a new column to testdf:
val transformedTestdf = testdf.withColumn( "newcol", mapudf(testdf(colname)))
In my case, I was running Spark ALS on a large data set and the data was not available on all partitions, so I had to cache() the data appropriately, and it worked like a charm.
To me, ignoring the rows completely by setting an argument is not really a feasible way to solve the issue.
I ended up creating my own CustomStringIndexer transformer, which assigns a new value to all new strings that were not encountered during training. You can also do this by changing the relevant portions of the Spark feature code (just remove the if condition that explicitly checks for this and make it return the length of the array instead) and recompiling the jar.
Not really an easy fix, but it certainly is a fix.
I remember seeing a bug in JIRA to incorporate this as well:
It is set to be released with Spark 2.2 though. Just have to wait I guess :S

How to use RandomForest in Spark Pipeline

I want to tune my model with grid search and cross-validation in Spark. In Spark, the base model must be put in a pipeline. The official pipeline demo uses LogisticRegression as the base model, which can be instantiated with new. However, the RandomForest model cannot be instantiated with new by client code, so it seems it cannot be used in the pipeline API. I don't want to reinvent the wheel, so can anybody give some advice?
However, the RandomForest model cannot be instantiated with new by client code, so it seems it cannot be used in the pipeline API.
Well, that is true, but you are simply trying to use the wrong class. Instead of mllib.tree.RandomForest you should use ml.classification.RandomForestClassifier. Here is an example based on the one from the MLlib docs.
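import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._
case class Record(category: String, features: Vector)
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))
// Convert each LabeledPoint to a Record so the label becomes a string column
val trainDF = trainData.map(lp => Record(lp.label.toString, lp.features)).toDF
val testDF = testData.map(lp => Record(lp.label.toString, lp.features)).toDF
// Index the string category into a "label" column that carries nominal metadata
val indexer = new StringIndexer().setInputCol("category").setOutputCol("label")
val rf = new RandomForestClassifier()
val pipeline = new Pipeline()
  .setStages(Array(indexer, rf))
val model = pipeline.fit(trainDF)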
There is one thing I couldn't figure out here. As far as I can tell it should be possible to use labels extracted from LabeledPoints directly, but for some reason it doesn't work and raises an IllegalArgumentException:
RandomForestClassifier was given input with invalid label column label, without the number of classes specified.
Hence the ugly trick with StringIndexer. After applying it we get the required attributes ({"vals":["1.0","0.0"],"type":"nominal","name":"label"}), but some classes in ml seem to work just fine without them.
