Serve real-time predictions with trained Spark ML model [duplicate] - apache-spark

This question already has answers here:
How to serve a Spark MLlib model?
(4 answers)
Closed 5 years ago.
We are currently testing a prediction engine based on Spark's implementation of LDA in Python:
https://spark.apache.org/docs/2.2.0/ml-clustering.html#latent-dirichlet-allocation-lda
https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA
(we are using the pyspark.ml package, not pyspark.mllib)
We were able to successfully train a model on a Spark cluster (using Google Cloud Dataproc). Now we are trying to use the model to serve real-time predictions as an API (e.g. a Flask application).
What would be the best approach to achieve this?
Our main pain point is that it seems we need to bring up the whole Spark environment just to load the trained model and run the transform.
So far we've tried running Spark in local mode for each received request, but this approach gave us:
Poor performance (time to spin up the SparkSession, load the models, run the transform...)
Poor scalability (inability to process concurrent requests)
The whole approach seems quite heavy. Is there a simpler alternative, or even one that would not involve Spark at all?
Below is simplified code for the training and prediction steps.
Training code
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

def train(input_dataset):
    conf = pyspark.SparkConf().setAppName("lda-train")
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    # Generate count vectors
    count_vectorizer = CountVectorizer(...)
    vectorizer_model = count_vectorizer.fit(input_dataset)
    vectorized_dataset = vectorizer_model.transform(input_dataset)
    # Instantiate LDA model
    lda = LDA(k=100, maxIter=100, optimizer="em", ...)
    # Train LDA model
    lda_model = lda.fit(vectorized_dataset)
    # Save the fitted models to external storage
    vectorizer_model.write().overwrite().save("gs://...")
    lda_model.write().overwrite().save("gs://...")
Prediction code
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizerModel
from pyspark.ml.clustering import DistributedLDAModel

def predict(input_query):
    conf = pyspark.SparkConf().setAppName("lda-predict").setMaster("local")
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    # Load the fitted models from external storage
    vectorizer_model = CountVectorizerModel.load("gs://...")
    lda_model = DistributedLDAModel.load("gs://...")
    # Run prediction on the input data using the loaded models
    vectorized_query = vectorizer_model.transform(input_query)
    transformed_query = lda_model.transform(vectorized_query)
    ...
    spark.stop()
    return transformed_query
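For context, a minimal sketch of what wrapping this in a Flask application with a single shared SparkSession might look like (the model paths, the "tokens" column, and the request format are hypothetical placeholders). Creating the session and loading the models once at startup avoids the per-request spin-up, but the full Spark runtime is still embedded in the serving process and concurrent requests remain an issue:

# Sketch only: the SparkSession and models are created once at startup and
# shared across requests. The "gs://..." paths and the "tokens" column are
# hypothetical placeholders for whatever was used at training time.
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizerModel
from pyspark.ml.clustering import DistributedLDAModel
from flask import Flask, request, jsonify

app = Flask(__name__)

conf = pyspark.SparkConf().setAppName("lda-serving").setMaster("local[*]")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
vectorizer_model = CountVectorizerModel.load("gs://...")
lda_model = DistributedLDAModel.load("gs://...")

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # Build a one-row DataFrame from the request body (hypothetical schema).
    tokens = request.get_json()["tokens"]
    input_query = spark.createDataFrame([(tokens,)], ["tokens"])
    vectorized_query = vectorizer_model.transform(input_query)
    transformed_query = lda_model.transform(vectorized_query)
    # Return the topic distribution of the single input document.
    row = transformed_query.select("topicDistribution").first()
    return jsonify({"topicDistribution": row["topicDistribution"].toArray().tolist()})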

If you already have a trained machine learning model in Spark, you can use Hydrosphere Mist to serve the model (for testing or prediction) through a REST API without creating a SparkContext. This saves you from recreating the Spark environment and lets you rely only on web services for prediction.
Refer:
https://github.com/Hydrospheredata/mist
https://github.com/Hydrospheredata/spark-ml-serving
https://github.com/Hydrospheredata/hydro-serving

Related

How to cache random forest models in Spark

My platform is Spark 2.1.0, using Python.
I have about 100 random forest multiclass classification models, which I have saved in HDFS. There are 100 datasets saved in HDFS too.
I want to run predictions on each dataset using the corresponding model. If the models and datasets were cached in memory, prediction would be more than 10 times faster.
But I do not know how to cache the models, because a model is not an RDD or a DataFrame.
Thanks!
TL;DR Just cache the data if it is ever reused outside the prediction process; if not, you can even skip that.
RandomForestModel is a local object not backed by distributed data structures; there is no DAG to recompute, and the prediction process is a simple, map-only job. Therefore the model cannot be cached, and even if it could, the operation would be meaningless.
See also (Why) do we need to call cache or persist on a RDD
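To illustrate the point about caching only the data, here is a minimal sketch, assuming pyspark.ml random forest models, Parquet-stored datasets, and hypothetical HDFS paths and columns. The dataset is cached because it may be reused; the loaded model is just a local object:

# Minimal sketch: cache the reusable DataFrame, not the model.
# The HDFS paths, Parquet format, and feature column are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassificationModel

spark = SparkSession.builder.appName("rf-predict").getOrCreate()

dataset = spark.read.parquet("hdfs:///datasets/dataset_001")
dataset.cache()  # worth it only if the dataset is reused more than once

model = RandomForestClassificationModel.load("hdfs:///models/model_001")
predictions = model.transform(dataset)  # simple, map-only job
predictions.show()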

Spark Structured Streaming and Spark-Ml Regression

Is it possible to apply Spark ML regression to streaming sources? I see there is StreamingLogisticRegressionWithSGD, but it's for the older RDD API and I couldn't use it with Structured Streaming sources.
How am I supposed to apply regressions on Structured Streaming sources?
(A little OT) If I cannot use the streaming API for regression, how can I commit offsets or similar back to the source in a batch-processing way? (Kafka sink)
Today (Spark 2.2 / 2.3) there is no support for machine learning in Structured Streaming and there is no ongoing work in this direction. Please follow SPARK-16424 to track future progress.
You can however:
Train iterative, non-distributed models using the foreach sink and some form of external state storage. At a high level, a regression model could be implemented like this (see the sketch after this list):
Fetch the latest model when calling ForeachWriter.open and initialize a loss accumulator (not in the Spark sense, just a local variable) for the partition.
Compute the loss for each record in ForeachWriter.process and update the accumulator.
Push the losses to the external store when calling ForeachWriter.close.
This leaves the external storage in charge of computing the gradient and updating the model, with the implementation dependent on the store.
Try to hack SQL queries (see https://github.com/holdenk/spark-structured-streaming-ml by Holden Karau)
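A minimal Python sketch of that foreach-sink pattern, assuming a Spark version where writeStream.foreach accepts a writer object from Python (2.4+). Here streaming_df is assumed to be a streaming DataFrame with array-type "features" and numeric "label" columns, and fetch_latest_model / push_partition_losses are hypothetical stand-ins for whatever external state store is used:

# Sketch only: fetch_latest_model and push_partition_losses are hypothetical
# placeholders for the external store, and the linear-model loss assumes rows
# with "features" and "label" fields.
class LossCollectingWriter:
    def open(self, partition_id, epoch_id):
        # Fetch the latest model parameters from the external store.
        self.weights, self.bias = fetch_latest_model()
        self.losses = []  # plain local accumulator, not a Spark accumulator
        return True

    def process(self, row):
        prediction = sum(w * x for w, x in zip(self.weights, row.features)) + self.bias
        self.losses.append((row.features, row.label - prediction))

    def close(self, error):
        if error is None:
            # Hand the residuals to the external store, which computes the
            # gradient and updates the model.
            push_partition_losses(self.losses)

query = (streaming_df.writeStream
         .foreach(LossCollectingWriter())
         .start())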

Train Word2Vec on 100+GB of data

I have more than 100 GB of text data stored on S3 in multiple Parquet files. I need to train a Word2Vec model on this. I tried using Spark, but it runs into memory errors for more than 10 GB of data.
My next option is to train using TensorFlow on EMR, but I am unable to decide what the right training strategy is for such a case. One big node or multiple small nodes, and what size should that node be? How does TensorFlow manage distributed data? Is batch training an option?
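For context, a minimal sketch of what the Spark ML Word2Vec attempt referred to above might look like; the S3 path, the "tokens" column, and all parameter values are hypothetical placeholders:

# Sketch of a Spark ML Word2Vec setup on Parquet data from S3.
# The S3 paths, the "tokens" column, and the parameters are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("w2v-train").getOrCreate()

# Each Parquet row is assumed to hold a pre-tokenized document
# as an array<string> column named "tokens".
corpus = spark.read.parquet("s3://bucket/path/")

word2vec = Word2Vec(vectorSize=300, minCount=5,
                    inputCol="tokens", outputCol="vector")
model = word2vec.fit(corpus)
model.write().overwrite().save("s3://bucket/models/word2vec")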

How to update an ML model during a Spark Streaming job without restarting the application?

I've got a Spark Streaming job whose goal is to:
read a batch of messages
predict a variable Y given these messages using a pre-trained ML pipeline
The problem is, I'd like to be able to update the model used by the executors without restarting the application.
Simply put, here's what it looks like :
from pyspark.streaming.kafka import KafkaUtils

model = ...  # model initialization (pre-trained ML pipeline)

def preprocess(keyValueList):
    ...  # do some preprocessing

def predict(preprocessedRDD):
    if not preprocessedRDD.isEmpty():
        df = ...  # create df from rdd
        df = model.transform(df)
        ...  # more things to do

stream = KafkaUtils.createDirectStream(ssc, [kafkaTopic], kafkaParams)
stream.mapPartitions(preprocess).foreachRDD(predict)
In this case, the model is simply used. Not updated.
I've thought about several possibilities, but I have now crossed them all out:
broadcasting the model every time it changes (a broadcast variable cannot be updated; it is read-only)
reading the model from HDFS on the executors (it needs the SparkContext, so not possible)
Any ideas?
Thanks a lot!
I've solved this issue before in two different ways:
a TTL on the model
rereading the model on each batch
Both of those solutions suppose an additional job that regularly retrains on the data you've accumulated (e.g. once a day).
The function you pass to foreachRDD is executed on the driver; only the RDD operations themselves are performed by the executors. As such, you don't need to serialize the model, assuming you are using a Spark ML pipeline that operates on RDDs, which as far as I know they all do. Spark handles the training/prediction for you; you don't need to manually distribute it.
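A minimal sketch of the TTL / reread approach in the driver-side predict function, assuming the retraining job periodically overwrites a PipelineModel at a fixed path (the path and the 10-minute TTL are hypothetical):

# Sketch only: MODEL_PATH and MODEL_TTL_SECONDS are hypothetical; the
# retraining job is assumed to overwrite the PipelineModel at MODEL_PATH.
import time
from pyspark.ml import PipelineModel

MODEL_PATH = "hdfs:///models/current"
MODEL_TTL_SECONDS = 600

_model = None
_loaded_at = 0.0

def get_model():
    # Runs on the driver (inside foreachRDD), so a plain module-level
    # variable is enough; reload only when the TTL has expired.
    global _model, _loaded_at
    if _model is None or time.time() - _loaded_at > MODEL_TTL_SECONDS:
        _model = PipelineModel.load(MODEL_PATH)
        _loaded_at = time.time()
    return _model

def predict(preprocessedRDD):
    if not preprocessedRDD.isEmpty():
        df = ...  # create df from rdd
        df = get_model().transform(df)
        ...  # more things to do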

How best to fit many Spark ML models

(PySpark, either Spark 1.6 or 2.0, shared YARN cluster with dozens of nodes)
I'd like to run a bootstrapping analysis, with each bootstrap sample drawn from a dataset that's too large to fit on a single executor.
The naive approach I was going to start with is:
create a Spark DataFrame of the training dataset
for i in (1, 1000):
    use df.sample() to create a sample_df
    train the model (logistic classifier) on sample_df
Although each individual model is fit across the cluster, this doesn't seem to be very 'parallel' thinking.
Should I be doing this a different way?
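For reference, a minimal sketch of that naive sequential loop in PySpark, assuming a DataFrame with "features" and "label" columns is already prepared; the column names and iteration count are placeholders, and sampling with replacement at fraction 1.0 approximates a bootstrap resample:

# Sketch of the naive sequential approach described above.
# Column names ("features", "label") and the iteration count are placeholders.
from pyspark.ml.classification import LogisticRegression

df = ...  # Spark DataFrame of the training dataset, with "features" and "label"

models = []
for i in range(1000):
    # Sampling with replacement at fraction 1.0 gives a bootstrap resample.
    sample_df = df.sample(withReplacement=True, fraction=1.0, seed=i)
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    models.append(lr.fit(sample_df))  # each fit is distributed, but fits run one after another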
