Cluster centers have different dimensionality in Spark MLlib - apache-spark

I use Spark 2.0.2. I have data which is partitioned by day. I want to cluster the different partitions independently of each other and then compare the cluster centers (calculate the distance between them) to see how the clusters change over time.
I do exactly the same preprocessing (scaling, one-hot encoding, etc.) for every partition. I use a predefined pipeline for this, which works perfectly in "normal" learning and prediction contexts. But when I want to calculate the distance between the cluster centers, the corresponding vectors of the different partitions have a different size (different dimensionality).
Some code snippets:
The preprocessing pipeline is built like this:
val protoIndexer = new StringIndexer().setInputCol("protocol").setOutputCol("protocolIndexed").setHandleInvalid("skip")
val serviceIndexer = new StringIndexer().setInputCol("service").setOutputCol("serviceIndexed").setHandleInvalid("skip")
val directionIndexer = new StringIndexer().setInputCol("direction").setOutputCol("directionIndexed").setHandleInvalid("skip")
val protoEncoder = new OneHotEncoder().setInputCol("protocolIndexed").setOutputCol("protocolEncoded")
val serviceEncoder = new OneHotEncoder().setInputCol("serviceIndexed").setOutputCol("serviceEncoded")
val directionEncoder = new OneHotEncoder().setInputCol("directionIndexed").setOutputCol("directionEncoded")
val scaleAssembler = new VectorAssembler().setInputCols(Array("duration", "bytes", "packets", "tos", "host_count", "srv_count")).setOutputCol("scalableFeatures")
val scaler = new StandardScaler().setInputCol("scalableFeatures").setOutputCol("scaledFeatures")
val featureAssembler = new VectorAssembler().setInputCols(Array("scaledFeatures", "protocolEncoded", "urgent", "ack", "psh", "rst", "syn", "fin", "serviceEncoded", "directionEncoded")).setOutputCol("features")
val pipeline = new Pipeline().setStages(Array(protoIndexer, protoEncoder, serviceIndexer, serviceEncoder, directionIndexer, directionEncoder, scaleAssembler, scaler, featureAssembler))
pipeline.write.overwrite().save(config.getString("pipeline"))
Define k-means, load the predefined preprocessing pipeline, and add k-means to the pipeline:
val kmeans = new KMeans().setK(40).setTol(1.0e-6).setFeaturesCol("features")
val pipelineStages = Pipeline.load(config.getString("pipeline")).getStages
val pipeline = new Pipeline().setStages(pipelineStages ++ Array(kmeans))
Load the data partitions, calculate the features, fit the pipeline, get the k-means model, and show the size of the first cluster center as an example:
(1 to 7).map { day =>
  val data = sparkContext.textFile("path/to/data/" + day + "/")
  val rawFeatures = data.map(extractFeatures....).toDF(featureHeaders: _*)
  val model = pipeline.fit(rawFeatures)
  val kmeansModel = model.stages(model.stages.size - 1).asInstanceOf[KMeansModel]
  println(kmeansModel.clusterCenters(0).size)
}
For the different partitions, the cluster centers have different dimensions (but the same dimension for each of the 40 clusters within a partition), so I can't calculate the distance between them. I would expect them all to be of equal size (namely the size of my Euclidean space, which is 13, because I have 13 features), but I get weird numbers that I don't understand.
I saved the extracted feature vectors to a file to check them. Their format is as expected; every feature is present.
Any ideas what I'm doing wrong or if I have a misunderstanding? Thank you!

Skipping over the fact that KMeans is not a good choice for processing categorical data, your code doesn't guarantee:
The same index-to-feature relationship between batches. StringIndexer assigns indices by frequency: the most common string is encoded as 0, the least common one as numLabels - 1.
The same number of indices between batches, and as a result the same shape of the one-hot-encoded and assembled vectors. The size of a vector equals the number of unique labels, adjusted depending on the value of the dropLast parameter of OneHotEncoder.
As a consequence, the encoded vectors may have different dimensions and interpretations from batch to batch.
If you want consistent encoding you'll need a persistent dictionary mapping which ensures consistent indexing between batches.
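One possible realization of that, as a minimal sketch building on the question's own snippets (the loadDay helper is hypothetical, and extractFeatures/featureHeaders mirror the question's elided code): fit the saved preprocessing pipeline once on the union of all daily partitions, so every StringIndexer and OneHotEncoder sees the full set of labels, and then reuse that single fitted PipelineModel per day, fitting only k-means on each day:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.clustering.KMeans

// Hypothetical helper mirroring the question's loading code.
def loadDay(day: Int) =
  sparkContext.textFile("path/to/data/" + day + "/")
    .map(extractFeatures)                    // the question's (elided) feature extraction
    .toDF(featureHeaders: _*)

// Fit the preprocessing stages ONCE on the union of all days, so the
// indexers/encoders see every label and every day gets vectors of the
// same dimensionality.
val preprocessing: PipelineModel =
  Pipeline.load(config.getString("pipeline"))
    .fit((1 to 7).map(loadDay).reduce(_ union _))

// Per day: reuse the SAME fitted preprocessing model and fit only k-means.
val centersPerDay = (1 to 7).map { day =>
  val features = preprocessing.transform(loadDay(day))
  val kmeansModel = new KMeans()
    .setK(40).setTol(1.0e-6).setFeaturesCol("features")
    .fit(features)
  kmeansModel.clusterCenters                 // same size for every day now
}
This requires all days to be available when the preprocessing model is fitted; the key point is only that the indexers are fitted once and then reused, rather than refitted per partition.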

Related

Apply multiple SparkML pipelines to a single DataFrame

I trained several ML pipelines with SparkML and persisted them in HDFS. Now I want to apply the pipelines to the same dataframe. I implemented a generic scoring class which reads in the pipelines along with the data, applies each of the pipelines to the dataframe, and appends the models' predictions as new columns. This is my Java code example:
List<PipelineModel> models = readPipelineModels(...);
Dataset<Row> originalDf = spark.read().parquet(...);
Dataset<Row> mergedDf = originalDf;
for (PipelineModel pipelineModel : models) {
    Dataset<Row> applyDf = pipelineModel.transform(originalDf);
    applyDf = dropDuplicateColumns(applyDf, mergedDf); // drops columns in applyDf which are present in mergedDf
    mergedDf = mergedDf.withColumn("rowId", monotonically_increasing_id());
    applyDf = applyDf.withColumn("rowId", monotonically_increasing_id());
    mergedDf = mergedDf.join(applyDf, "rowId").drop("rowId").cache();
}
I noticed some performance issues, especially with large datasets. The join that is used to bind the dataframes is quite expensive, and a lot of shuffling is done between the stages.
Notice that I apply each model to the originalDf instead of mergedDf. If I apply the models to mergedDf in each iteration, I get an error saying that 'column xy is already present' from previous iterations.
Do you have any suggestions to improve the performance of this job?
A few notes:
I wouldn't create the two monotonically_increasing_id columns separately. There's no guarantee that it increments the same way on both: dataset 1 could get 1, 2, 3 and dataset 2 could get 1000, 2005, 3999.
I don't think you need to drop/merge DFs. I would just use the result of the applied model repeatedly.
Something like:
List<PipelineModel> models = readPipelineModels(...);
Dataset<Row> mergedDf = spark.read().parquet(...);
int i = 0;
for (PipelineModel pipelineModel : models) {
    i += 1;
    mergedDf = pipelineModel.transform(mergedDf);
    mergedDf = mergedDf.withColumnRenamed("yourModelOutput", "model_outputs_" + i);
}
FWIW I'm used to PySpark and kind of translating in my head - but that's the gist of how you'd solve it.
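For reference, a rough Scala sketch of the same chaining idea (readPipelineModels, the parquet path, and the "yourModelOutput" column name are placeholders, exactly as in the answer above):
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.DataFrame

// Sketch: chain the fitted pipelines instead of joining, renaming each
// model's output column so successive transforms don't collide.
val models: Seq[PipelineModel] = readPipelineModels()          // hypothetical loader, as in the question
val initialDf: DataFrame = spark.read.parquet("path/to/data")  // hypothetical path

val scoredDf = models.zipWithIndex.foldLeft(initialDf) { case (df, (model, i)) =>
  model.transform(df)
    .withColumnRenamed("yourModelOutput", "model_outputs_" + (i + 1))
}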

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

I am using Spark ML to run some ML experiments, and on a small dataset of 20 MB (the Poker dataset) with a Random Forest and a parameter grid, it takes 1 hour and 30 minutes to finish. With scikit-learn, it takes much, much less.
In terms of environment, I was testing with 2 slaves, 15 GB memory each, 24 cores. I assume it was not supposed to take that long, and I am wondering if the problem lies within my code, since I am fairly new to Spark.
Here it is:
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-testing.data")
dataframe = sqlContext.createDataFrame(df)
train, test = dataframe.randomSplit([0.7, 0.3])
columnTypes = dataframe.dtypes
categoricalCols = []
numericCols = []
for ct in columnTypes:
    if ct[1] == 'string' and ct[0] != 'label':
        categoricalCols += [ct[0]]
    elif ct[0] != 'label':
        numericCols += [ct[0]]
stages = []
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    stages += [stringIndexer]
assemblerInputs = map(lambda c: c + "Index", categoricalCols) + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel', handleInvalid='skip')
stages += [labelIndexer]
estimator = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features")
stages += [estimator]
parameters = {"maxDepth": [3, 5, 10, 15], "maxBins": [6, 12, 24, 32], "numTrees": [3, 5, 10]}
paramGrid = ParamGridBuilder()
for key, value in parameters.iteritems():
    paramGrid.addGrid(estimator.getParam(key), value)
estimatorParamMaps = paramGrid.build()
pipeline = Pipeline(stages=stages)
crossValidator = CrossValidator(estimator=pipeline, estimatorParamMaps=estimatorParamMaps, evaluator=MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1'), numFolds=3)
pipelineModel = crossValidator.fit(train)
predictions = pipelineModel.transform(test)
f1 = crossValidator.getEvaluator().evaluate(predictions)  # evaluate via the CrossValidator's evaluator
Thanks in advance, any comments/suggestions are highly appreciated :)
The following may not solve your problem completely, but it should give you some pointers to start with.
The first problem that you are facing is the disproportion between the amount of data and the resources.
Since you are parallelizing a local collection (a pandas DataFrame), Spark will use the default parallelism configuration, which most likely results in 48 partitions with less than 0.5 MB per partition. (Spark doesn't do well with small files or small partitions.)
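As a hedged illustration of that first point (written in Scala; the equivalent repartition/cache/randomSplit calls exist in the PySpark API used in the question), explicitly reducing the partition count and caching before cross-validation avoids the many near-empty partitions:
// Sketch: a ~20 MB dataset doesn't need dozens of partitions. Repartition to
// a handful and cache the result before feeding it to CrossValidator.
val compact = dataframe.repartition(4).cache()
val Array(train, test) = compact.randomSplit(Array(0.7, 0.3))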
The second problem is related to the expensive optimization/approximation techniques used by tree models in Spark.
Spark tree models use some tricks to bucket continuous variables optimally. With small data it is way cheaper to just compute the exact splits.
They mainly use approximate quantiles in this case.
Usually, in a single-machine framework like scikit-learn, the tree model uses the unique values of continuous features as split candidates for the best-fit calculation, whereas in Apache Spark the tree model uses quantiles for each feature as split candidates.
Also, don't forget that cross-validation is a heavy and long task, as its cost is proportional to the number of combinations of your 3 hyper-parameters times the number of folds times the time spent training each model (grid-search approach). You might want to cache your data for a start, but it will still not gain you much time. I believe that Spark is overkill for this amount of data. You might want to use scikit-learn instead, and maybe use spark-sklearn to distribute local model training.
Spark will train each model separately and sequentially, under the assumption that the data is distributed and big.
You can of course optimize performance by using columnar file formats like Parquet, tuning Spark itself, etc., but that's too broad to cover here.
You can read more about the scalability of tree models in spark-mllib in the following blog post:
Scalable Decision Trees in MLlib

How does Spark keep track of the splits in randomSplit?

This question, How does Sparks RDD.randomSplit actually split the RDD, explains how Spark's random split works, but I don't understand how Spark keeps track of which values went to one split so that those same values don't also go to the second split.
If we look at the implementation of randomSplit:
def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] = {
  // It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its
  // constituent partitions each time a split is materialized which could result in
  // overlapping splits. To prevent this, we explicitly sort each input partition to make the
  // ordering deterministic.
  val sorted = Sort(logicalPlan.output.map(SortOrder(_, Ascending)), global = false, logicalPlan)
  val sum = weights.sum
  val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
  normalizedCumWeights.sliding(2).map { x =>
    new DataFrame(sqlContext, Sample(x(0), x(1), withReplacement = false, seed, sorted))
  }.toArray
}
we can see that it creates two DataFrames that share the same sqlContext, each with a different Sample.
How are these two DataFrame(s) communicating with each other so that a value that fell in the first one is not included in the second one?
And is the data being fetched twice? (Assume the sqlContext is selecting from a DB, is the select being executed twice?).
It's exactly the same as sampling an RDD.
Assuming you have the weight array (0.6, 0.2, 0.2), Spark will generate one DataFrame for each range (0.0, 0.6), (0.6, 0.8), (0.8, 1.0).
When it's time to read a result DataFrame, Spark will just go over the parent DataFrame: for each item, it generates a random number, and if that number falls in the specified range, it emits the item. All child DataFrames share the same random number generator (technically, different generators with the same seed), so the sequence of random numbers is deterministic.
For your last question, if you did not cache the parent DataFrame, then the data for the input DataFrame will be re-fetched each time an output DataFrame is computed.
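A toy sketch of that mechanism in plain Scala (an illustration, not the actual Spark sampling code): because every split re-creates the generator with the same seed and walks the rows in the same deterministic order, row i always draws the same number, so it is kept by exactly one of the disjoint ranges:
import scala.util.Random

// Toy illustration: each "split" re-runs the same seeded generator over the
// same ordered rows, so the splits are disjoint without any communication.
val rows = Seq("r0", "r1", "r2", "r3", "r4")
val seed = 42L

def sampleRange(lb: Double, ub: Double): Seq[String] = {
  val rng = new Random(seed)                 // same seed for every split
  rows.filter { _ => val x = rng.nextDouble(); x >= lb && x < ub }
}

val splits = Seq(sampleRange(0.0, 0.6), sampleRange(0.6, 0.8), sampleRange(0.8, 1.0))
// The three results are disjoint and together contain every row.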

Spark, ML, StringIndexer: handling unseen labels

My goal is to build a multiclass classifier.
I have built a pipeline for feature extraction, and it includes as a first step a StringIndexer transformer to map each class name to a label; this label will be used in the classifier training step.
The pipeline is fitted to the training set.
The test set has to be processed by the fitted pipeline in order to extract the same feature vectors.
My test set files have the same structure as the training set. The possible scenario here is to encounter an unseen class name in the test set; in that case the StringIndexer will fail to find the label, and an exception will be raised.
Is there a solution for this case? or how can we avoid that from happening?
With Spark 2.2 (released July 2017) you are able to use the .setHandleInvalid("keep") option when creating the indexer. With this option, the indexer adds new indexes when it sees new labels.
val categoryIndexerModel = new StringIndexer()
.setInputCol("category")
.setOutputCol("indexedCategory")
.setHandleInvalid("keep") // options are "keep", "error" or "skip"
From the documentation: there are three strategies regarding how StringIndexer will handle unseen labels when you have fit a StringIndexer on one dataset and then use it to transform another:
'error': throws an exception (which is the default)
'skip': skips the rows containing the unseen labels entirely (removes the rows on the output!)
'keep': puts unseen labels in a special additional bucket, at index numLabels
Please see the linked documentation for examples on how the output of StringIndexer looks for the different options.
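As a small hedged illustration of the "keep" option (toy data; assumes spark.implicits._ is available for toDF):
import org.apache.spark.ml.feature.StringIndexer
import spark.implicits._

// Toy example: "a" is the most frequent training label, so it gets index 0;
// "d" never appears in training, so with "keep" it goes to the extra bucket
// at index numLabels (here 3.0) instead of raising an exception.
val train = Seq("a", "a", "b", "c").toDF("category")
val test  = Seq("a", "d").toDF("category")

val indexerModel = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("indexedCategory")
  .setHandleInvalid("keep")
  .fit(train)

indexerModel.transform(test).show()   // "a" -> 0.0, "d" -> 3.0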
There's a way around this in Spark 1.6.
Here's the jira: https://issues.apache.org/jira/browse/SPARK-8764
Here's an example:
val categoryIndexerModel = new StringIndexer()
.setInputCol("category")
.setOutputCol("indexedCategory")
.setHandleInvalid("skip") // new method. values are "error" or "skip"
I started using this, but ended up going back to KrisP's 2nd bullet point about fitting this particular Estimator to the full dataset.
You'll need this later in the pipeline when you convert the predictions back with IndexToString.
Here's the modified example:
val categoryIndexerModel = new StringIndexer()
.setInputCol("category")
.setOutputCol("indexedCategory")
.fit(itemsDF) // Fit the Estimator and create a Model (Transformer)
... do some kind of classification ...
val categoryReverseIndexer = new IndexToString()
.setInputCol(classifier.getPredictionCol)
.setOutputCol("predictedCategory")
.setLabels(categoryIndexerModel.labels) // Use the labels from the Model
No nice way to do it, I'm afraid. Either
filter out the test examples with unknown labels before applying StringIndexer
or fit StringIndexer to the union of train and test dataframe, so you are assured all labels are there
or transform the test example case with unknown label to a known label
Here is some sample code to perform the above operations:
// get training labels from original train dataframe
val trainlabels = traindf.select(colname).distinct.map(_.getString(0)).collect //Array[String]
// or get labels from a trained StringIndexer model
val trainlabels = simodel.labels
// define a UDF on your dataframe that will be used for filtering
val filterudf = udf { label:String => trainlabels.contains(label)}
// filter out the bad examples
val filteredTestdf = testdf.filter( filterudf(testdf(colname)))
// transform unknown value to some value, say "a"
val mapudf = udf { label:String => if (trainlabels.contains(label)) label else "a"}
// add a new column to testdf:
val transformedTestdf = testdf.withColumn( "newcol", mapudf(testdf(colname)))
In my case, I was running Spark ALS on a large data set and the data was not available in all partitions, so I had to cache() the data appropriately, and it worked like a charm.
To me, ignoring the rows completely by setting an argument (https://issues.apache.org/jira/browse/SPARK-8764) is not really a feasible way to solve the issue.
I ended up creating my own CustomStringIndexer transformer which assigns a new value to all new strings that were not encountered during training. You can also do this by changing the relevant portions of the Spark feature code (just remove the if condition that explicitly checks for this and make it return the length of the array instead) and recompiling the jar.
Not really an easy fix, but it certainly is a fix.
I remember seeing a bug in JIRA to incorporate this as well: https://issues.apache.org/jira/browse/SPARK-17498
It is set to be released with Spark 2.2 though. Just have to wait I guess :S

Linear Regression on Apache Spark

We have a situation where we have to run linear regression on millions of small datasets and store the weights and intercept for each of these datasets. I wrote the Scala code below to do so, where I feed each of these datasets as a row in an RDD and then try to run the regression on each (data is the RDD that has (label, features) stored in each row; in this case we have one feature per label):
val x = data.flatMap { line => line.split(' ') }.map { line =>
  val parts = line.split(',')
  val parsedData1 = LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
  val model = LinearRegressionWithSGD.train(sc.parallelize(List(parsedData1)), 100) // using parallelize to convert data to type RDD
  (model.intercept, model.weights)
}
The problem here is that LinearRegressionWithSGD expects an RDD as input, and nested RDDs are not supported in Spark. I chose this approach since all these datasets can be processed independently of each other, and hence I wanted to distribute them (which ruled out looping).
Can you please suggest whether I can use other types (arrays, lists, etc.) as input datasets to LinearRegressionWithSGD, or a better approach that will still distribute such computations in Spark?
val modelList = for { item <- dataSet } yield {
  val data = MLUtils.loadLibSVMFile(context, item).cache()
  val model = LinearRegressionWithSGD.train(data, 100) // numIterations, as in the question
  model
}
Maybe you can separate your input data into several files and store them in HDFS.
Using the directory of those files as input, you can get a list of models.
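Another hedged option, not part of the answer above: since each row is an independent dataset with a single feature, you can skip MLlib for the per-dataset fit and compute the closed-form least-squares solution inside an ordinary map, which stays distributed without nesting RDDs (the toy parsing into (label, feature) pairs below is an assumption; adapt it to your format):
// Toy data: one small dataset per RDD row, already parsed into (label, feature)
// pairs. For a single feature, ordinary least squares has a closed form:
//   slope = cov(x, y) / var(x),  intercept = mean(y) - slope * mean(x)
val datasets = sc.parallelize(Seq(
  Seq((1.0, 0.0), (2.0, 1.0), (3.0, 2.0)),   // y = x + 1
  Seq((0.0, 0.0), (2.0, 1.0), (4.0, 2.0))    // y = 2x
))

val models = datasets.map { points =>
  val n  = points.size.toDouble
  val mx = points.map(_._2).sum / n           // mean of the feature
  val my = points.map(_._1).sum / n           // mean of the label
  val slope =
    points.map { case (y, x) => (x - mx) * (y - my) }.sum /
    points.map { case (_, x) => (x - mx) * (x - mx) }.sum
  val intercept = my - slope * mx
  (intercept, slope)                          // analogous to (model.intercept, model.weights)
}

models.collect().foreach(println)             // prints (1.0,1.0) and (0.0,2.0)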
