Linear Regression on Apache Spark

We have a situation where we have to run linear regression on millions of small datasets and store the weights and the intercept for each of them. I wrote the Scala code below to do so: I feed each dataset as a row of an RDD and then try to run the regression on each row (data is the RDD in which each row holds (label, features); in this case we have one feature per label):
val x = data.flatMap { line => line.split(' ') }.map { line =>
  val parts = line.split(',')
  val parsedData1 = LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
  val model = LinearRegressionWithSGD.train(sc.parallelize(List(parsedData1)), 100) // using parallelize to convert the data to an RDD
  (model.intercept, model.weights)
}
The problem here is that LinearRegressionWithSGD expects an RDD as input, and nested RDDs are not supported in Spark. I chose this approach because all these datasets can be processed independently of each other, and I wanted to distribute the work (hence I ruled out looping).
Can you please suggest whether I can use other types (arrays, lists, etc.) as a dataset input to LinearRegressionWithSGD, or an even better approach that will still distribute such computations in Spark?

val modelList = for { item <- dataSet } yield {
  val data = MLUtils.loadLibSVMFile(context, item).cache()
  val model = LinearRegressionWithSGD.train(data, 100) // numIterations is required; 100 is just an example
  model
}
Maybe you can separate your input data into several files and store them in HDFS.
Use the directory of those files as input, and you will get a list of models.
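For completeness, here is a different workaround (a sketch of mine, not part of the answer above): since each dataset is tiny, it can be fitted locally inside an ordinary map over the RDD, so no nested RDDs are needed. Here datasets is a hypothetical RDD[Array[(Double, Double)]] holding one (label, feature) dataset per row, and the fit is the closed-form least-squares solution for a single feature:

import org.apache.spark.rdd.RDD

// Closed-form ordinary least squares for y = w * x + b, computed locally per dataset.
def fitLocally(points: Array[(Double, Double)]): (Double, Double) = {
  val n = points.length.toDouble
  val sumY  = points.map(_._1).sum
  val sumX  = points.map(_._2).sum
  val sumXY = points.map { case (y, x) => x * y }.sum
  val sumXX = points.map { case (_, x) => x * x }.sum
  val w = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX) // weight (slope)
  val b = (sumY - w * sumX) / n                                 // intercept
  (b, w)
}

// One model per row, computed in parallel across the cluster.
val models: RDD[(Double, Double)] = datasets.map(fitLocally)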

Related

Training ML models on Spark per partition, such that there is one trained model per partition of a DataFrame

How to do parallel model training per partition in spark using scala?
The solution given there is in PySpark; I'm looking for a solution in Scala.
How can you efficiently build one ML model per partition in Spark with foreachPartition?
Get the distinct partitions using the partition column.
Create a thread pool of, say, 100 threads.
Create a Future for each partition and run them.
Sample code may be as follows:
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration

// Get an ExecutionContext backed by a fixed-size thread pool
val threadPoolExecutorService = getExecutionContext("name", 100)
// check https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/HasParallelism.scala#L50
val uniquePartitionValues: List[String] = ... // getDistinctPartitionsUsingPartitionCol
// Asynchronous invocation of the training. The results are collected from the futures.
val uniquePartitionValuesFutures = uniquePartitionValues.map { partitionValue =>
  Future[Double] {
    try {
      // get the rows where partitionCol = partitionValue (quoted, since the value is a String)
      val partitionDF = mainDF.where(s"partitionCol = '$partitionValue'")
      // do preprocessing and training using any algo with partitionDF as input and return the accuracy
      ...
    } catch {
      ....
    }
  }(threadPoolExecutorService)
}
// Wait for the metrics to be calculated
val foldMetrics = uniquePartitionValuesFutures.map(Await.result(_, Duration.Inf))
println(s"output::${foldMetrics.mkString(" ### ")}")

Apply multiple SparkML pipelines to a single DataFrame

I trained several ML pipelines with SparkML and persisted them in HDFS. Now I want to apply the pipelines to the same DataFrame. I implemented a generic scoring class which reads in the pipelines along with the data, applies each pipeline to the DataFrame, and appends the model's predictions as new columns. This is my Java code example:
List<PipelineModel> models = readPipelineModels(...);
Dataset<Row> originalDf = spark.read().parquet(...);
Dataset<Row> mergedDf = originalDf;

for (PipelineModel pipelineModel : models) {
    Dataset<Row> applyDf = pipelineModel.transform(originalDf);
    applyDf = dropDuplicateColumns(applyDf, mergedDf); // drops columns in applyDf which are present in mergedDf
    mergedDf = mergedDf.withColumn("rowId", monotonically_increasing_id());
    applyDf = applyDf.withColumn("rowId", monotonically_increasing_id());
    mergedDf = mergedDf.join(applyDf, "rowId").drop("rowId").cache();
}
I noticed some performance issues, especially with large datasets. The join that is used to bind the DataFrames is quite expensive, and a lot of shuffling happens between the stages.
Note that I apply each model to originalDf instead of mergedDf. If I apply the models to mergedDf in each iteration, I get an error saying that 'column xy is already present' from previous iterations.
Do you have any suggestions to improve the performance of this job?
A few notes:
I wouldn't create the two monotonically_increasing_id columns separately. There's no guarantee that the IDs increment the same way on both sides: dataset 1 could get 1, 2, 3 and dataset 2 could get 1000, 2005, 3999.
I don't think you need to drop/merge DFs at all. I would just feed the result of each applied model into the next transform.
Something like:
List<PipelineModel> models = readPipelineModels(...);
Dataset<Row> mergedDf = spark.read().parquet(...);
int i = 0;

for (PipelineModel pipelineModel : models) {
    i += 1;
    mergedDf = pipelineModel.transform(mergedDf);
    mergedDf = mergedDf.withColumnRenamed("yourModelOutput", "model_outputs_" + i);
}
FWIW I'm used to PySpark and am kind of translating in my head, but that's the gist of how you'd solve it.

Splitting a pipeline in spark?

Assume that I have a Spark pipeline like this (formatted to emphasize the important steps):
val foos1 = spark_session.read(foo_file).flatMap(toFooRecord)
  .map(someComplicatedProcessing)
  .map(transform1)
  .distinct().collect().toSet
I'm adding a similar pipeline:
val foos2 = spark_session.read(foo_file).flatMap(toFooRecord)
  .map(someComplicatedProcessing)
  .map(transform2)
  .distinct().collect().toSet
Then I do something with both results.
I'd like to avoid doing someComplicatedProcessing twice (not parsing the file twice is nice, too).
Is there a way to take the stream after the .map(someComplicatedProcessing) step and create two parallel streams feeding off it?
I know that I can store the intermediate result on disk and thus save the CPU time at the cost of more I/O. Is there a better way? What words do I web-search for?
First option - cache intermediate results:
val cached = spark_session.read(foo_file).flatMap(toFooRecord)
  .map(someComplicatedProcessing)
  .cache()

val foos1 = cached.map(transform1)
  .distinct().collect().toSet
val foos2 = cached.map(transform2)
  .distinct().collect().toSet
Second option - use the RDD API and make a single pass:
val foos = spark_session.read(foo_file)
  .flatMap(toFooRecord)
  .map(someComplicatedProcessing)
  .rdd
  .flatMap(x => Seq(("t1", transform1(x)), ("t2", transform2(x))))
  .distinct
  .collect
  .groupBy(_._1)
  .mapValues(_.map(_._2))

val foos1 = foos("t1")
val foos2 = foos("t2")
The second option may require some type wrangling if transform1 and transform2 have incompatible return types.
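For example, if transform1 returns some type T1 and transform2 a different T2, a sketch of the same single pass that keeps the two result types apart without string tags or casting (my addition, building on the answer's pseudocode above) is to use Either:

val tagged = spark_session.read(foo_file)
  .flatMap(toFooRecord)
  .map(someComplicatedProcessing)
  .rdd
  // Left tags transform1 results, Right tags transform2 results
  .flatMap(x => Seq(Left(transform1(x)), Right(transform2(x))))
  .distinct
  .collect

val foos1 = tagged.collect { case Left(t1) => t1 }.toSet
val foos2 = tagged.collect { case Right(t2) => t2 }.toSet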

Cluster centers have different dimensionality in Spark MLlib

I use Spark 2.0.2. I have data which is partitioned by day. I want to cluster the different partitions independently of each other and then compare the cluster centers (calculate the distance between them) to see how the clusters change over time.
I do exactly the same preprocessing (scaling, one-hot encoding, etc.) for every partition. I use a predefined pipeline for this, which works perfectly in "normal" learning and prediction contexts. But when I want to calculate the distance between the cluster centers, the corresponding vectors of the different partitions have different sizes (different dimensionality).
Some code snippets:
The preprocessing pipeline is built like this:
val protoIndexer = new StringIndexer().setInputCol("protocol").setOutputCol("protocolIndexed").setHandleInvalid("skip")
val serviceIndexer = new StringIndexer().setInputCol("service").setOutputCol("serviceIndexed").setHandleInvalid("skip")
val directionIndexer = new StringIndexer().setInputCol("direction").setOutputCol("directionIndexed").setHandleInvalid("skip")
val protoEncoder = new OneHotEncoder().setInputCol("protocolIndexed").setOutputCol("protocolEncoded")
val serviceEncoder = new OneHotEncoder().setInputCol("serviceIndexed").setOutputCol("serviceEncoded")
val directionEncoder = new OneHotEncoder().setInputCol("directionIndexed").setOutputCol("directionEncoded")
val scaleAssembler = new VectorAssembler().setInputCols(Array("duration", "bytes", "packets", "tos", "host_count", "srv_count")).setOutputCol("scalableFeatures")
val scaler = new StandardScaler().setInputCol("scalableFeatures").setOutputCol("scaledFeatures")
val featureAssembler = new VectorAssembler().setInputCols(Array("scaledFeatures", "protocolEncoded", "urgent", "ack", "psh", "rst", "syn", "fin", "serviceEncoded", "directionEncoded")).setOutputCol("features")
val pipeline = new Pipeline().setStages(Array(protoIndexer, protoEncoder, serviceIndexer, serviceEncoder, directionIndexer, directionEncoder, scaleAssembler, scaler, featureAssembler))
pipeline.write.overwrite().save(config.getString("pipeline"))
Define k-means, load predefined preprocessing-pipeline, add k-means to the pipeline:
val kmeans = new KMeans().setK(40).setTol(1.0e-6).setFeaturesCol("features")
val pipelineStages = Pipeline.load(config.getString("pipeline")).getStages
val pipeline = new Pipeline().setStages(pipelineStages ++ Array(kmeans))
Load data partitions, calculate features, fit pipeline, get k-means model and show size of the first cluster center as example:
(1 to 7 by 1).map { day =>
  val data = sparkContext.textFile("path/to/data/" + day + "/")
  val rawFeatures = data.map(extractFeatures....).toDF(featureHeaders: _*)
  val model = pipeline.fit(rawFeatures)
  val kmeansModel = model.stages(model.stages.size - 1).asInstanceOf[KMeansModel]
  println(kmeansModel.clusterCenters(0).size)
}
For the different partitions, the cluster centers have different dimensions (but the same for each of the 40 clusters within a partition), so I can't calculate the distance between them. I would expect them all to be of equal size (namely the dimensionality of my Euclidean space, which is 13, because I have 13 features), but I get sizes that vary in ways I don't understand.
I saved the extracted feature vectors to a file to check them. Their format is as expected; every feature is present.
Any ideas what I'm doing wrong or if I have a misunderstanding? Thank you!
Skipping over the fact that k-means is not a good choice for processing categorical data, your code doesn't guarantee:
The same index-to-feature relationship between batches. StringIndexer assigns labels by frequency: the most common string is encoded as 0, the least common one as numLabels - 1.
The same number of indices between batches, and as a result the same shape of the one-hot-encoded and assembled vectors. The size of the vector equals the number of unique labels, adjusted depending on the value of the dropLast parameter of OneHotEncoder.
In consequence, the encoded vectors may have different dimensions and interpretations from batch to batch.
If you want consistent encoding, you'll need a persistent dictionary mapping which ensures consistent indexing between batches.
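One simple way to get that consistency is to fit the preprocessing stages once, on the union of all partitions, and reuse the fitted model everywhere: a fitted StringIndexerModel applies the same label-to-index mapping wherever it is used, so the vector sizes match. A sketch on top of the question's pipeline, where rawFeaturesFor is a hypothetical loader for one day's partition:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not in the question): load one day's partition as a DataFrame.
def rawFeaturesFor(day: Int): DataFrame = ???

// Fit the preprocessing pipeline ONCE, on the union of all partitions, so every
// StringIndexerModel carries the same label -> index mapping and vector sizes match.
val preprocessing = Pipeline.load(config.getString("pipeline"))
val fitted = preprocessing.fit((1 to 7).map(rawFeaturesFor).reduce(_ union _))

(1 to 7).map { day =>
  val features = fitted.transform(rawFeaturesFor(day))
  val kmeansModel = new KMeans().setK(40).setTol(1.0e-6).setFeaturesCol("features").fit(features)
  kmeansModel.clusterCenters(0).size // now identical across days
}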

How to eval spark.ml model without DataFrames/SparkContext?

With Spark MLlib, I'd build a model (like RandomForest), and then it was possible to evaluate it outside of Spark by loading the model and calling predict on it, passing a vector of features.
It seems like with Spark ML, predict is now called transform and only acts on a DataFrame.
Is there any way to build a DataFrame outside of Spark, since it seems like one needs a SparkContext to build a DataFrame?
Am I missing something?
Re: Is there any way to build a DataFrame outside of Spark?
It is not possible. DataFrames live inside an SQLContext, which itself lives in a SparkContext. Perhaps you could work around it somehow, but the whole story is that the connection between DataFrames and SparkContext is there by design.
Here is my solution for using Spark models outside of a Spark context (using PMML):
You create the model with a pipeline like this:
SparkConf sparkConf = new SparkConf();
SparkSession session = SparkSession.builder().enableHiveSupport().config(sparkConf).getOrCreate();

// Read the training data via JDBC
Properties dbProperties = new Properties();
dbProperties.setProperty("user", vKey);
dbProperties.setProperty("password", password);
dbProperties.setProperty("AuthMech", "3");
dbProperties.setProperty("source", "jdbc");
dbProperties.setProperty("driver", "com.cloudera.impala.jdbc41.Driver");
String tableName = "schema.table";
String simpleUrl = "jdbc:impala://host:21050/schema";
Dataset<Row> data = session.read().jdbc(simpleUrl, tableName, dbProperties);

// Build the pipeline: index the string column, assemble features, train a GBT regressor.
// The indexer is fitted by the pipeline itself, so no standalone fit/transform is needed.
String[] inputCols = {"indexed_column1"};
StringIndexer indexer = new StringIndexer().setInputCol("column1").setOutputCol("indexed_column1");
VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
Predictor p = new GBTRegressor();
p.set("maxIter", 20);
p.set("maxDepth", 2);
p.set("maxBins", 204);
p.setLabelCol("faktor");
PipelineStage[] stages = {indexer, assembler, p};
Pipeline pipeline = new Pipeline();
pipeline.setStages(stages);
PipelineModel pmodel = pipeline.fit(data);

// Convert the fitted pipeline to PMML and write it to a file
PMML pmml = ConverterUtil.toPMML(data.schema(), pmodel);
FileOutputStream fos = new FileOutputStream("model.pmml");
JAXBUtil.marshalPMML(pmml, new StreamResult(fos));
Using PMML for predictions (locally, without a Spark context; this can be applied to a Map of arguments rather than to a DataFrame):
PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(new FileInputStream(pmmlFile));
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
MiningModelEvaluator evaluator = (MiningModelEvaluator) modelEvaluatorFactory.newModelEvaluator(pmml);

// Build the argument map, keyed by the model's input field names
Map<FieldName, String> args = new HashMap<FieldName, String>();
InputField curField = evaluator.getInputFields().get(0);
args.put(curField.getName(), "1.0");
Map<FieldName, ?> result = evaluator.evaluate(args);
Spent days on this problem too. It's not straightforward. My third suggestion involves code I have written specifically for this purpose.
Option 1
As other commenters have said, predict(Vector) is now available. However, you need to know how to construct a vector. If you don't, see Option 3.
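For example, a minimal sketch (my addition, assuming Spark 3.0+, where predict(Vector) is public as the answer says, and a hypothetical model path; note that loading the model itself still requires an active SparkSession):

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegressionModel

// Loading still goes through Spark, but the per-row prediction below needs no DataFrame.
val model = LinearRegressionModel.load("/models/my-lr-model") // hypothetical path
// The feature order must match the VectorAssembler used at training time.
val prediction: Double = model.predict(Vectors.dense(1.0, 2.0, 3.0))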
Option 2
If the goal is to avoid setting up a Spark server (standalone or cluster mode), then it's possible to start Spark in local mode. The whole thing will run inside a single JVM.
val spark = SparkSession.builder().config("spark.master", "local[*]").getOrCreate()
// create dataframe from file, or make it up from some data in memory
// use model.transform() to get predictions
But this brings unnecessary dependencies to your prediction module, and it consumes resources in your JVM at runtime. Also, if prediction latency is critical, for example making a prediction within a millisecond as soon as a request comes in, then this option is too slow.
Option 3
MLlib FeatureHasher's output can be used as an input to your learner. The class is good for one-hot encoding and also for fixing the size of your feature dimension. You can use it even when all your features are numerical. If you use it in your training, then all you need at prediction time is the hashing logic. It's implemented as a Spark transformer, so it's not easy to reuse outside of a Spark environment. So I have done the work of pulling the hashing function out into a lib. You apply FeatureHasher and your learner during training as normal. Then here's how you use the slimmed-down hasher at prediction time:
// Schema and hash size must stay consistent across training and prediction
val hasher = new FeatureHasherLite(mySchema, myHashSize)
// create sample data-point and hash it
val feature = Map("feature1" -> "value1", "feature2" -> 2.0, "feature3" -> 3, "feature4" -> false)
val featureVector = hasher.hash(feature)
// Make prediction
val prediction = model.predict(featureVector)
You can see the details in my GitHub repo at tilayealemu/sparkmllite. If you'd rather copy my code, take a look at FeatureHasherLite.scala. There is sample code and unit tests too. Feel free to create an issue if you need help.
