optimal pyspark ML pipeline setup to reduce time to run - apache-spark

I have a Pipeline with close to 100 models, assembled like so
from pyspark.ml.feature import VectorAssembler, MinMaxScaler, PCA
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
assembler = VectorAssembler(inputCols=feat_cols, outputCol='features')
scaler = MinMaxScaler(inputCol='features', outputCol='features_scaled')
pca = PCA(k=25, inputCol='features_scaled', outputCol='pca_output')
transformer_pipe = Pipeline(stages=[assembler, scaler, pca])
transformer = transformer_pipe.fit(train)
train_transformed = transformer.transform(train)
test_transformed = transformer.transform(test)
models = []
for c in cols:
    models.append(
        LogisticRegression(
            regParam=1.,
            featuresCol='pca_output',
            labelCol=c,
            predictionCol=f'prediction_{c}',
            rawPredictionCol=f'raw_prediction_{c}',
            probabilityCol=f'probability_{c}',
            weightCol=f'weight_{c}',
            family='binomial'
        )
    )
pipeline = Pipeline(stages=models)
At this point I invoke the fit and transform methods to train the models and get my predictions:
pipe = pipeline.fit(train_transformed)
preds = pipe.transform(test_transformed)
And yet, what seems like a straightforward invocation takes ages on an EMR cluster with hundreds of cores (1 master instance, up to 9 core instances with 64 vCPUs each, set up with autoscaling).
I've run into OOM errors and have played with the session parameters in the notebook, calling
%%configure -f
{"conf":{"spark.executor.memory":"12g",
"spark.driver.memory":"12g",
"spark.driver.cores":"3",
"spark.driver.memoryOverhead":"2048",
"spark.executor.cores":"3"}
}
before the session begins, and while it no longer runs into memory issues, the end-to-end process still takes a very long time. Any idea how I can speed this process up? Thanks in advance!

Related

why exactly should we avoid using for loops in PySpark?

I was attempting to speed up some of my pipelines but couldn't get a precise answer. Are some for loops OK, depending on the implementation? When is it OK to use a loop without taking too much of a performance hit? I've read
This nice article by David Mudrauskas
This nice Stack Overflow answer
The Spark RDD docs, which advise:
In general, closures - constructs like loops or locally defined methods, should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Some code that does this may work in local mode, but that’s just by accident and such code will not behave as expected in distributed mode.
If we were to use a for loop to step through and train a series of models, persisting them in the models dict,
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
dv = ['y1','y2','y3', ...]
models = {}
for v in dv:
    assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
    model = LogisticRegression(featuresCol='features', labelCol=v, predictionCol=f'prediction_{v}')
    pipeline = Pipeline(stages=[assembler, model])
    pipe = pipeline.fit(train)
    models[v] = pipe
Would that be meaningfully slower than enumerating and training each of them explicitly, like below? Are they equivalent?
# y1
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
model = LogisticRegression(featuresCol='features', labelCol='y1', predictionCol='prediction_y1')
pipeline = Pipeline(stages=[assembler, model])
pipe = pipeline.fit(train)
models['y1'] = pipe
# y2
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
model = LogisticRegression(featuresCol='features', labelCol='y2', predictionCol='prediction_y2')
pipeline = Pipeline(stages=[assembler, model])
pipe = pipeline.fit(train)
models['y2'] = pipe
# y3
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
model = LogisticRegression(featuresCol='features', labelCol='y3', predictionCol='prediction_y3')
pipeline = Pipeline(stages=[assembler, model])
pipe = pipeline.fit(train)
models['y3'] = pipe
...
My understanding is that the SparkML library has parallelism built in, but I'm wondering whether using the loop degrades this parallelism, and whether there is a better way to train models in parallel. It's very slow on my setup, so maybe I'm doing something wrong... Thanks in advance!
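Both approaches are equivalent: a Python for loop on the driver issues the same sequence of fit calls as the unrolled code, one after another. Irrespective of the approach, the parallelism inside each fit depends on the number of cores you have across your executors. You can read more in this article: https://www.javacodegeeks.com/2018/10/anatomy-apache-spark-job.html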
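If the individual fits leave cores idle, one option that is sometimes used is to submit several fits from separate driver threads, since Spark's scheduler can run jobs submitted from different threads concurrently. The sketch below shows only the driver-side pattern in plain Python. `fit_all` and `fake_fit` are hypothetical names, `fake_fit` stands in for something like building the pipeline for one label and calling `pipeline.fit(train)`, and `max_workers=4` is an arbitrary assumption rather than a recommendation.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_all(labels, fit_one, max_workers=4):
    """Call fit_one(label) for every label from a pool of driver threads.

    Returns a dict mapping each label to whatever fit_one returned.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs labels with results.
        return dict(zip(labels, pool.map(fit_one, labels)))


def fake_fit(label):
    # Hypothetical placeholder: a real version would build the pipeline
    # for this label and return pipeline.fit(train).
    return f"model_for_{label}"


models = fit_all(["y1", "y2", "y3"], fake_fit)
```

Whether this helps depends on how much of the cluster a single fit already keeps busy, so it is worth measuring before and after.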

How to save the model after doing pipeline fit?

I wrote this code in Spark ML
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.Pipeline
val lr = new LogisticRegression()
val pipeline = new Pipeline()
.setStages(Array(fooIndexer, fooHotEncoder, assembler, lr))
val model = pipeline.fit(training)
This code takes a long time to run. Is it possible that after running pipeline.fit I save the model on HDFS so that I don't have to run it again and again?
Edit: Also, how do I load it back from HDFS when I need to apply transform on the model to make predictions?
Straight from the official documentation - saving:
// Now we can optionally save the fitted pipeline to disk
model.write.overwrite().save("/tmp/spark-logistic-regression-model")
and loading:
import org.apache.spark.ml.PipelineModel
// And load it back in during production
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
Related:
Save ML model for future usage

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

I am using Spark ML to run some ML experiments. On a small dataset of 20MB (the Poker dataset), a Random Forest with a parameter grid takes 1 hour and 30 minutes to finish. The same experiment with scikit-learn takes much less time.
In terms of environment, I was testing with 2 slaves, 15GB memory each, 24 cores. I assume it was not supposed to take that long and I am wondering if the problem lies within my code, since I am fairly new to Spark.
Here it is:
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-testing.data")
dataframe = sqlContext.createDataFrame(df)
train, test = dataframe.randomSplit([0.7, 0.3])
# Split the non-label columns into categorical (string) and numeric ones
columnTypes = dataframe.dtypes
categoricalCols, numericCols = [], []
for ct in columnTypes:
    if ct[1] == 'string' and ct[0] != 'label':
        categoricalCols += [ct[0]]
    elif ct[0] != 'label':
        numericCols += [ct[0]]
stages = []
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index")
    stages += [stringIndexer]
assemblerInputs = [c + "Index" for c in categoricalCols] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel', handleInvalid='skip')
stages += [labelIndexer]
estimator = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features")
stages += [estimator]
parameters = {"maxDepth" : [3, 5, 10, 15], "maxBins" : [6, 12, 24, 32], "numTrees" : [3, 5, 10]}
paramGrid = ParamGridBuilder()
for key, value in parameters.items():
    paramGrid.addGrid(estimator.getParam(key), value)
estimatorParamMaps = paramGrid.build()
pipeline = Pipeline(stages=stages)
crossValidator = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=estimatorParamMaps,
    evaluator=MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1'),
    numFolds=3)
pipelineModel = crossValidator.fit(train)
predictions = pipelineModel.transform(test)
# getEvaluator() is defined on the CrossValidator, not on the Pipeline
f1 = crossValidator.getEvaluator().evaluate(predictions)
Thanks in advance, any comments/suggestions are highly appreciated :)
The following may not solve your problem completely, but it should give you some pointers to start with.
The first problem you are facing is the disproportion between the amount of data and the resources.
Since you are parallelizing a local collection (a pandas DataFrame), Spark will use the default parallelism configuration, which most likely results in 48 partitions with less than 0.5 MB per partition. (Spark does not do well with small files or small partitions.)
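As a rough worked example of those numbers, assuming the default parallelism equals the total core count (2 workers with 24 cores each) and the data is about 20 MB, as stated in the question. The helper name `partition_size_mb` is hypothetical.

```python
def partition_size_mb(total_mb, num_workers, cores_per_worker):
    """Average partition size if the data is split into one partition per core."""
    num_partitions = num_workers * cores_per_worker
    return num_partitions, total_mb / num_partitions


# Figures taken from the question: 20 MB of data, 2 workers, 24 cores each.
num_partitions, size_mb = partition_size_mb(20, 2, 24)
print(num_partitions, round(size_mb, 2))  # 48 0.42
```

With partitions this small, per-task overhead can easily outweigh the actual work, so reducing the partition count (for instance with `coalesce` on the DataFrame) is one thing to try.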
The second problem is related to the expensive optimization/approximation techniques used by tree models in Spark.
Spark tree models use some tricks to bucket continuous variables optimally. With small data it is much cheaper to just compute the exact splits.
In this case Spark mainly uses approximate quantiles.
Usually, in a single-machine framework such as scikit-learn, the tree model uses the unique values of each continuous feature as split candidates for the best-fit calculation, whereas in Apache Spark the tree model uses quantiles of each feature as split candidates.
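To illustrate the difference in the number of split candidates, here is a small plain-Python sketch. It is only an illustration, not Spark's or scikit-learn's actual implementation. The function names are hypothetical, the feature is synthetic random data, and `max_bins=32` mirrors one of the `maxBins` values from the question.

```python
import random
import statistics


def exact_candidates(values):
    """One candidate threshold between each pair of consecutive unique values."""
    uniq = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]


def quantile_candidates(values, max_bins):
    """Candidate thresholds as the cut points of max_bins equal-frequency bins."""
    return statistics.quantiles(values, n=max_bins)


random.seed(0)
feature = [random.random() for _ in range(1000)]  # synthetic continuous feature

exact = exact_candidates(feature)
approx = quantile_candidates(feature, max_bins=32)
print(len(exact), len(approx))
```

The quantile approach caps the number of candidates at `max_bins - 1` regardless of data size, which pays off on large distributed data but adds overhead that a small dataset does not need.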
Also, don't forget that cross-validation is a heavy and long task: its cost is proportional to the number of combinations of your 3 hyper-parameters, times the number of folds, times the time spent training each model (grid search approach). You might want to cache your data, for example, as a start, but it will still not gain you much time. I believe Spark is overkill for this amount of data. You might want to use scikit-learn instead, and maybe use spark-sklearn to distribute local model training.
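To make the cross-validation cost concrete, the number of pipeline fits implied by the grid and fold count from the question can be counted directly:

```python
from itertools import product

# Grid and fold count taken from the question.
parameters = {
    "maxDepth": [3, 5, 10, 15],
    "maxBins": [6, 12, 24, 32],
    "numTrees": [3, 5, 10],
}
num_folds = 3

combinations = list(product(*parameters.values()))
fits_during_search = len(combinations) * num_folds
print(len(combinations), fits_during_search)  # 48 144
```

That is 144 fits during the search, plus one final refit with the best parameters. Spark 2.3 and later also expose a `parallelism` parameter on `CrossValidator` that may let several of these fits run concurrently. Whether it helps on a dataset this small would need to be measured.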
By default, Spark will learn each model separately and sequentially, under the assumption that the data is distributed and big.
You can of course optimize performance by using columnar file formats like Parquet, tuning Spark itself, etc., but that is too broad to discuss here.
You can read more about the scalability of tree models with spark-mllib in the following blog post:
Scalable Decision Trees in MLlib

How to increase the accuracy of neural network model in spark?

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row
// Load training data
val data = MLUtils.loadLibSVMFile(sc,"/home/.../neural.txt").toDF()
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
val layers = Array[Int](4, 5, 4, 4)
val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)
val model = trainer.fit(train)
// compute precision on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Precision:" + evaluator.evaluate(predictionAndLabels))
I am using MultilayerPerceptronClassifier to build a neural network in Spark. I am getting 62.5% accuracy. Which parameters should I change to get better accuracy?
As some people have said, the question is too broad and can't be answered without more detail, but some advice (independent of the models/algorithms used or the tools and libraries for implementing them) would be:
Use a cross validation set and perform some cross validation with different network architectures.
Plot "Learning curves"
Identify if you are having high bias or high variance
See if you can or need to apply feature scaling and/or normalization.
Do some "Error Analysis" (manually verify which examples failed and evaluate or categorize them to see if you can find a pattern)
Not necessarily in that order, but that could help you identify whether you are underfitting or overfitting, and whether you need more training data, to add or remove features, to add regularization, etc. In summary, perform machine learning debugging.
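As a minimal sketch of the bias/variance check from the list above: compare the training error against a target, and the validation error against the training error. The function name, the thresholds `target_error` and `gap_tolerance`, and all the error values below are hypothetical, chosen only for illustration.

```python
def diagnose(train_error, val_error, target_error=0.10, gap_tolerance=0.05):
    """Rough bias/variance reading from training and validation error rates."""
    findings = []
    if train_error > target_error:
        # The model cannot even fit the training data well.
        findings.append("high bias (underfitting)")
    if val_error - train_error > gap_tolerance:
        # The model fits the training data much better than unseen data.
        findings.append("high variance (overfitting)")
    return findings or ["no obvious problem"]


print(diagnose(train_error=0.36, val_error=0.375))  # ['high bias (underfitting)']
print(diagnose(train_error=0.02, val_error=0.30))   # ['high variance (overfitting)']
```

High bias usually points towards a larger network or better features, while high variance points towards more data or regularization.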
Hope that helps. You can find more in-depth details about this in Andrew Ng's series of videos, starting with this:
https://www.youtube.com/watch?v=qIfLZAa32H0

Apache Spark: Applying a function from sklearn parallel on partitions

I'm new to Big Data and Apache Spark (and an undergrad doing work under a supervisor).
Is it possible to apply a function (i.e. a spline) to only partitions of the RDD? I'm trying to implement some of the work in the paper here.
The book "Learning Spark" seems to indicate that this is possible, but doesn't explain how.
"If you instead have many small datasets on which you want to train different learning models, it would be better to use a single- node learning library (e.g., Weka or SciKit-Learn) on each node, perhaps calling it in parallel across nodes using a Spark map()."
Actually, we have a library which does exactly that. We have several sklearn transformers and predictors up and running. Its name is sparkit-learn.
From our examples:
import numpy as np
from splearn.rdd import DictRDD
from splearn.feature_extraction.text import SparkHashingVectorizer
from splearn.feature_extraction.text import SparkTfidfTransformer
from splearn.svm import SparkLinearSVC
from splearn.pipeline import SparkPipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
X = [...] # list of texts
y = [...] # list of labels
X_rdd = sc.parallelize(X, 4)
y_rdd = sc.parallelize(y, 4)
Z = DictRDD((X_rdd, y_rdd),
            columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])
local_pipeline = Pipeline((
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC())
))
dist_pipeline = SparkPipeline((
    ('vect', SparkHashingVectorizer()),
    ('tfidf', SparkTfidfTransformer()),
    ('clf', SparkLinearSVC())
))
local_pipeline.fit(X, y)
dist_pipeline.fit(Z, clf__classes=np.unique(y))
y_pred_local = local_pipeline.predict(X)
y_pred_dist = dist_pipeline.predict(Z[:, 'X'])
You can find it here.
I'm not 100% sure that I am following, but there are a number of partition methods, such as mapPartitions. These operators hand you the Iterator for each partition, and you can do whatever you want to the data and pass it back through a new Iterator:
rdd.mapPartitions(iter => {
  // Spin up something expensive that you only want to do once per partition
  for (item <- iter) yield {
    // do stuff to the items using your expensive item
  }
})
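The same per-partition pattern can be sketched in Python, where the function passed to `rdd.mapPartitions` receives an iterator and returns (or yields) the transformed items. The example below runs without Spark by treating plain lists as partitions. `process_partition` and the `expensive_resource` placeholder are hypothetical.

```python
def process_partition(rows):
    """Do the expensive setup once, then lazily transform every row."""
    # Placeholder for something costly, e.g. a fitted single-node model.
    expensive_resource = {"scale": 10}
    for row in rows:
        yield row * expensive_resource["scale"]


# Plain-Python stand-in for an RDD with two partitions.
partitions = [[1, 2, 3], [4, 5]]
results = [list(process_partition(iter(p))) for p in partitions]
print(results)  # [[10, 20, 30], [40, 50]]
```

With a real RDD, the call would be `rdd.mapPartitions(process_partition)`.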
If your data set is small (it is possible to load it and train on one worker) you can do something like this:
def trainModel[T](modelId: Int, trainingSet: List[T]) = {
//trains model with modelId and returns it
}
//fake data
val data = List()
val numberOfModels = 100
val broadcastedData = sc.broadcast(data)
val trainedModels = sc.parallelize(Range(0, numberOfModels))
.map(modelId => (modelId, trainModel(modelId, broadcastedData.value)))
I assume you have a list of models (or somehow parametrized models) and can give them IDs. Then, in the function trainModel, you pick one depending on the ID. As a result you will get an RDD of pairs of IDs and their trained models.
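A rough Python rendering of the same idea follows, run locally so that it does not need a cluster. The "training" here is a made-up stand-in (a mean plus an ID-dependent offset) chosen only to show how a model ID selects a configuration. All names and values are hypothetical.

```python
def train_model(model_id, training_set):
    """Pick a configuration by ID and 'train' it on the full data set."""
    # Hypothetical parametrization: each ID maps to a different offset.
    offset = model_id * 0.5
    mean = sum(training_set) / len(training_set)
    return {"model_id": model_id, "prediction": mean + offset}


data = [1.0, 2.0, 3.0]  # fake data, small enough to copy to every worker
number_of_models = 4

# Local stand-in for the distributed version, which would look roughly like:
#   bc = sc.broadcast(data)
#   sc.parallelize(range(number_of_models)).map(lambda i: (i, train_model(i, bc.value)))
trained_models = [(i, train_model(i, data)) for i in range(number_of_models)]
```

Each element of `trained_models` is an `(id, model)` pair, matching the shape of the RDD described above.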
