why exactly should we avoid using for loops in PySpark? - apache-spark

I was attempting to speed up some of my pipelines but couldn't get a precise answer. Are some for loops OK, depending on the implementation? When is it OK to use a loop without taking too much of a performance hit? I've read
This nice article by David Mudrauskas
This nice Stack Overflow answer
The Spark RDD docs, which advises
In general, closures - constructs like loops or locally defined methods, should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Some code that does this may work in local mode, but that’s just by accident and such code will not behave as expected in distributed mode.
If we were to use a for loop to step through and train a series of models, persisting them in the models dict,
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
dv = ['y1','y2','y3', ...]
models = {}
for v in dv:
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
model = LogisticRegression(featuresCol='features',labelCol=v,predictionCol=f'prediction_{v}')
pipeline = Pipeline(stages=[assembler,model])
pipe = pipeline.fit(train)
models[v] = pipe
would that be meaningfully slower than enumerating and training them each explicitly like below? are they equivalent?
# y1
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
model = LogisticRegression(featuresCol='features',labelCol='y1',predictionCol=f'prediction_y1')
pipeline = Pipeline(stages=[assembler,model])
pipe = pipeline.fit(train)
models[v] = pipe
#y2
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
model = LogisticRegression(featuresCol='features',labelCol='y2',predictionCol=f'prediction_y2')
pipeline = Pipeline(stages=[assembler,model])
pipe = pipeline.fit(train)
models[v] = pipe
#y3
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
model = LogisticRegression(featuresCol='features',labelCol='y3',predictionCol=f'prediction_y3')
pipeline = Pipeline(stages=[assembler,model])
pipe = pipeline.fit(train)
models[v] = pipe
...
my understanding is that the SparkML library has parallelism built in, but I'm wondering if using the loop degrades this parallelism, and if there is a better way to train models in parallel. It's very slow on my setup, so maybe I'm doing something wrong... Thanks in advance!

Both approaches are the same. Irrespective of the approach, the parallelism depends on the number of cores you have across your Executors. You can read more in this article: https://www.javacodegeeks.com/2018/10/anatomy-apache-spark-job.html

Related

optimal pyspark ML pipeline setup to reduce time to run

I have a Pipeline with close to 100 models, assembled like so
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
assembler = VectorAssembler(inputCols=feat_cols, outputCol='features')
scaler = MinMaxScaler(inputCol='features', outputCol='features_scaled')
pca = PCA(k=25, inputCol='features_scaled', outputCol='pca_output')
transformer_pipe = Pipeline(stages=[assembler, scaler, pca])
transformer = transformer_pipe.fit(train)
train_transformed = transformer.transform(train)
test_transformed = transformer.transform(test)
models = []
for c in cols:
models.append(LogisticRegression(
regParam=1.,
featuresCol='pca_output',
labelCol= c,
predictionCol=f'prediction_{c}',
rawPredictionCol=f'raw_prediction_{c}',
probabilityCol=f'probability_{c}',
weightCol=f'weight_{c}',
family='binomial'
)
)
pipeline = Pipeline(stages=models)
at this point I invoke the fit and transform methods to train the models and get my predictions
pipe = pipeline.fit(train_transformed)
preds = pipe.transform(test_transformed)
And yet, what seems like a straightforward invocation takes ages and ages on an EMR cluster with 100s of cores (1 master instances, up to 9 core instances, set up with autoscaling with 64 vCPUs each).
I've run into OOM memories and have played with the session parameters in the notebook calling
%%configure -f
{"conf":{"spark.executor.memory":"12g",
"spark.driver.memory":"12g",
"spark.driver.cores":"3",
"spark.driver.memoryOverhead":"2048",
"spark.executor.cores":"3"}
}
before the session begins, and while it no longer runs into memory issues, the end-to-end process still takes a very long time. Any idea how I can speed this process up? Thanks in advance!

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

I am using Spark ML to run some ML experiments, and on a small dataset of 20MB (Poker dataset) and a Random Forest with parameter grid, it takes 1h and 30 minutes to finish. Similarly with scikit-learn it takes much much less.
In terms of environment, I was testing with 2 slaves, 15GB memory each, 24 cores. I assume it was not supposed to take that long and I am wondering if the problem lies within my code, since I am fairly new to Spark.
Here it is:
df = pd.read_csv(http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-testing.data)
dataframe = sqlContext.createDataFrame(df)
train, test = dataframe.randomSplit([0.7, 0.3])
columnTypes = dataframe.dtypes
for ct in columnTypes:
if ct[1] == 'string' and ct[0] != 'label':
categoricalCols += [ct[0]]
elif ct[0] != 'label':
numericCols += [ct[0]]
stages = []
for categoricalCol in categoricalCols:
stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index")
stages += [stringIndexer]
assemblerInputs = map(lambda c: c + "Index", categoricalCols) + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel', handleInvalid='skip')
stages += [labelIndexer]
estimator = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features")
stages += [estimator]
parameters = {"maxDepth" : [3, 5, 10, 15], "maxBins" : [6, 12, 24, 32], "numTrees" : [3, 5, 10]}
paramGrid = ParamGridBuilder()
for key, value in parameters.iteritems():
paramGrid.addGrid(estimator.getParam(key), value)
estimatorParamMaps = (paramGrid.build())
pipeline = Pipeline(stages=stages)
crossValidator = CrossValidator(estimator=pipeline, estimatorParamMaps=estimatorParamMaps, evaluator=MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1'), numFolds=3)
pipelineModel = crossValidator.fit(train)
predictions = pipelineModel.transform(test)
evaluator = pipeline.getEvaluator().evaluate(predictions)
Thanks in advance, any comments/suggestions are highly appreciated :)
The following may not solve your problem completely but it should give you some pointer to start.
The first problem that you are facing is the disproportion between the amount of data and the resources.
This means that since you are parallelizing a local collection (pandas dataframe), Spark will use the default parallelism configuration. Which is most likely to be resulting in 48 partitions with less than 0.5mb per partition. (Spark doesn't do well with small files nor small partitions)
The second problem is related to expensive optimizations/approximations techniques used by Tree models in Spark.
Spark tree models use some tricks to optimally bucket continuous variables. With small data it is way cheaper to just get the exact splits.
It mainly uses approximated quantiles in this case.
Usually, in a single machine framework scenario, like scikit, the tree model uses unique feature values for continuous features as splits candidates for the best fit calculation. Whereas in Apache Spark, the tree model uses quantiles for each feature as a split candidate.
Also to add that you shouldn't forget as well that cross validation is a heavy and long tasks as it's proportional to the combination of your 3 hyper-parameters times the number of folds times the time spent to train each model (GridSearch approach). You might want to cache your data per example for a start but it will still not gain you much time. I believe that spark is an overkill for this amount of data. You might want to use scikit learn instead and maybe use spark-sklearn to distributed local model training.
Spark will learn each model separately and sequentially with the hypothesis that data is distributed and big.
You can of course optimize performance using columnar data based file formats like parquet and tuning spark itself, etc. it's too broad to talk about it here.
You can read more about tree models scalability with spark-mllib in this following blogpost :
Scalable Decision Trees in MLlib

ML Pipeline and metrics: Precision, Recall, AUC-ROC, F1Score

I'm using ML Pipeline, something like:
VectorAssembler assembler = new VectorAssembler()
.setInputCols(columns)
.setOutputCol("features");
LogisticRegression lr = new LogisticRegression().setLabelCol(targetColumn);
lr.setMaxIter(10).setRegParam(0.01).setFeaturesCol("features");
Pipeline logisticRegression = new Pipeline();
logisticRegression.setStages(new PipelineStage[] {assembler, lr});
PipelineModel logisticRegressionModel = logisticRegression.fit(learningData);
What I want to is the way to get standard metric like Precision, Recall, AUC-ROC, F1-SCORE, ACCURACY on this model.
I've found BinaryClassificationMetrics - but not sure if it's compatible at all.
RegressionEvaluator seems to return only mse|rmse|r2|mae.
So what is the right way to extract Precision, Recall, etc with ML Pipeline?
Couple of things missing from Ryan's answer above.
I can confirm the following works (Note: my use case was for Multiclass Classification)
val scoredTestSet = model.transform(testSet)
val predictionLabelsRDD = scoredTestSet.select("prediction", "label").rdd.map(r => (r.getDouble(0), r.getDouble(1)))
val multiModelMetrics = new MulticlassMetrics(predictionAndLabelsRDD)
once you have scored data, get the prediction and label and pass that to BinaryClassificationMetrics
something like below (thought it's in scala I hope it helps)
val scoredTestSet = logisticRegressionModel.transform(testSet)
val predictionLabelsRDD = scoredTestSet.select("prediction", "label").map(r => (r.getDouble(0), r.getDouble(1)))
val binMetrics = new BinaryClassificationMetrics(predictionAndLabels)
// binMetrics.areaUnderROC
other examples from https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification
prediction in this case is 1.0 or 0.0
you can also extract the probability and use that instead of the prediction so that binMetrics can show data for multiple thresholds

How to eval spark.ml model without DataFrames/SparkContext?

With Spark MLLib, I'd build a model (like RandomForest), and then it was possible to eval it outside of Spark by loading the model and using predict on it passing a vector of features.
It seems like with Spark ML, predict is now called transform and only acts on a DataFrame.
Is there any way to build a DataFrame outside of Spark since it seems like one needs a SparkContext to build a DataFrame?
Am I missing something?
Re: Is there any way to build a DataFrame outside of Spark?
It is not possible. DataFrames live inside SQLContext with it living in SparkContext. Perhaps you could work it around somehow, but the whole story is that the connection between DataFrames and SparkContext is by design.
Here is my solution to use spark models outside of spark context (using PMML):
You create model with a pipeline like this:
SparkConf sparkConf = new SparkConf();
SparkSession session = SparkSession.builder().enableHiveSupport().config(sparkConf).getOrCreate();
String tableName = "schema.table";
Properties dbProperties = new Properties();
dbProperties.setProperty("user",vKey);
dbProperties.setProperty("password",password);
dbProperties.setProperty("AuthMech","3");
dbProperties.setProperty("source","jdbc");
dbProperties.setProperty("driver","com.cloudera.impala.jdbc41.Driver");
String tableName = "schema.table";
String simpleUrl = "jdbc:impala://host:21050/schema"
Dataset<Row> data = session.read().jdbc(simpleUrl ,tableName,dbProperties);
String[] inputCols = {"column1"};
StringIndexer indexer = new StringIndexer().setInputCol("column1").setOutputCol("indexed_column1");
StringIndexerModel alphabet = indexer.fit(data);
data = alphabet.transform(data);
VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
Predictor p = new GBTRegressor();
p.set("maxIter",20);
p.set("maxDepth",2);
p.set("maxBins",204);
p.setLabelCol("faktor");
PipelineStage[] stages = {indexer,assembler, p};
Pipeline pipeline = new Pipeline();
pipeline.setStages(stages);
PipelineModel pmodel = pipeline.fit(data);
PMML pmml = ConverterUtil.toPMML(data.schema(),pmodel);
FileOutputStream fos = new FileOutputStream("model.pmml");
JAXBUtil.marshalPMML(pmml,new StreamResult(fos));
Using PPML for predictions (locally, without spark context, which can be applied to a Map of arguments and not on a DataFrame):
PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(new FileInputStream(pmmlFile));
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
MiningModelEvaluator evaluator = (MiningModelEvaluator) modelEvaluatorFactory.newModelEvaluator(pmml);
inputFieldMap = new HashMap<String, Field>();
Map<FieldName,String> args = new HashMap<FieldName, String>();
Field curField = evaluator.getInputFields().get(0);
args.put(curField.getName(), "1.0");
Map<FieldName, ?> result = evaluator.evaluate(args);
Spent days on this problem too. It's not straightforward. My third suggestion involves code I have written specifically for this purpose.
Option 1
As other commenters have said, predict(Vector) is now available. However, you need to know how to construct a vector. If you don't, see Option 3.
Option 2
If the goal is to avoid setting up a Spark server (standalone or cluster modes), then its possible to start Spark in local mode. The whole thing will run inside a single JVM.
val spark = SparkSession.builder().config("spark.master", "local[*]").getOrCreate()
// create dataframe from file, or make it up from some data in memory
// use model.transform() to get predictions
But this brings unnecessary dependencies to your prediction module, and it consumes resources in your JVM at runtime. Also, if prediction latency is critical, for example making a prediction within a millisecond as soon as a request comes in, then this option is too slow.
Option 3
MLlib FeatureHasher's output can be used as an input to your learner. The class is good for one hot encoding and also for fixing the size of your feature dimension. You can use it even when all your features are numerical. If you use that in your training, then all you need at prediction time is the hashing logic there. Its implemented as a spark transformer so it's not easy to re-use outside of a spark environment. So I have done the work of pulling out the hashing function to a lib. You apply FeatureHasher and your learner during training as normal. Then here's how you use the slimmed down hasher at prediction time:
// Schema and hash size must stay consistent across training and prediction
val hasher = new FeatureHasherLite(mySchema, myHashSize)
// create sample data-point and hash it
val feature = Map("feature1" -> "value1", "feature2" -> 2.0, "feature3" -> 3, "feature4" -> false)
val featureVector = hasher.hash(feature)
// Make prediction
val prediction = model.predict(featureVector)
You can see details in my github at tilayealemu/sparkmllite. If you'd rather copy my code, take a look at FeatureHasherLite.scala.There are sample codes and unit tests too. Feel free to create an issue if you need help.

Apache Spark: Applying a function from sklearn parallel on partitions

I'm new to Big Data and Apache Spark (and an undergrad doing work under a supervisor).
Is it possible to apply a function (i.e. a spline) to only partitions of the RDD? I'm trying to implement some of the work in the paper here.
The book "Learning Spark" seems to indicate that this is possible, but doesn't explain how.
"If you instead have many small datasets on which you want to train different learning models, it would be better to use a single- node learning library (e.g., Weka or SciKit-Learn) on each node, perhaps calling it in parallel across nodes using a Spark map()."
Actually, we have a library which does exactly that. We have several sklearn transformators and predictors up and running. It's name is sparkit-learn.
From our examples:
from splearn.rdd import DictRDD
from splearn.feature_extraction.text import SparkHashingVectorizer
from splearn.feature_extraction.text import SparkTfidfTransformer
from splearn.svm import SparkLinearSVC
from splearn.pipeline import SparkPipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
X = [...] # list of texts
y = [...] # list of labels
X_rdd = sc.parallelize(X, 4)
y_rdd = sc.parralelize(y, 4)
Z = DictRDD((X_rdd, y_rdd),
columns=('X', 'y'),
dtype=[np.ndarray, np.ndarray])
local_pipeline = Pipeline((
('vect', HashingVectorizer()),
('tfidf', TfidfTransformer()),
('clf', LinearSVC())
))
dist_pipeline = SparkPipeline((
('vect', SparkHashingVectorizer()),
('tfidf', SparkTfidfTransformer()),
('clf', SparkLinearSVC())
))
local_pipeline.fit(X, y)
dist_pipeline.fit(Z, clf__classes=np.unique(y))
y_pred_local = local_pipeline.predict(X)
y_pred_dist = dist_pipeline.predict(Z[:, 'X'])
You can find it here.
Im not 100% sure that I am following, but there are a number of partition methods, such as mapPartitions. These operators hand you the Iterator on each node, and you can do whatever you want to the data and pass it back through a new Iterator
rdd.mapPartitions(iter=>{
//Spin up something expensive that you only want to do once per node
for(item<-iter) yield {
//do stuff to the items using your expensive item
}
})
If your data set is small (it is possible to load it and train on one worker) you can do something like this:
def trainModel[T](modelId: Int, trainingSet: List[T]) = {
//trains model with modelId and returns it
}
//fake data
val data = List()
val numberOfModels = 100
val broadcastedData = sc.broadcast(data)
val trainedModels = sc.parallelize(Range(0, numberOfModels))
.map(modelId => (modelId, trainModel(modelId, broadcastedData.value)))
I assume you have some list of models (or some how parametrized models) and you can give them ids. Then in function trainModel you pick one depending on id. And as result you will get rdd of pairs of trained models and their ids.

Resources