Save Trained PySpark ML Pipeline to local file system - apache-spark

I have a PySpark pipeline that looks like this,
from pyspark.ml import Pipeline, clustering
from pyspark.ml.feature import StandardScaler, PCA

scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
pca = PCA(inputCol=scaler.getOutputCol(), outputCol="pca_output")
kmeans = clustering.KMeans(seed=2014)
pipeline = Pipeline(stages=[scaler, pca, kmeans])
I trained this pipeline on my data and can produce outputs with it.
Now I want to save the model to disk and use it later in another piece of code.
I'm able to call the save method on the pipeline object and save it to HDFS like this:
pipeline.save("test_pipe")
I want to save the pipeline to the local file system, though, for later use.
I've tried to write it like this, but it failed with an error:
pipeline.save("file:///home/test_pipe")
How can I save the trained pipeline to the local file system?
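For context: saving directly to file:/// from a cluster tends to fail because the write is distributed, so only part of the saved model lands on any one machine's local disk. A workaround that is often suggested (a minimal sketch, not a verified fix for the exact error above; the hdfs dfs -get call and the later local-mode reload are assumptions) is to fit the pipeline, save the resulting PipelineModel to HDFS as already works, then copy the saved directory down to the driver's local disk:

import subprocess
from pyspark.ml import PipelineModel

# Fit the pipeline and save the fitted model to HDFS (saving to HDFS already works per the question).
fitted_model = pipeline.fit(train_df)  # train_df: the training DataFrame (assumed name)
fitted_model.write().overwrite().save("test_pipe")

# Copy the saved directory from HDFS onto the driver node's local file system.
subprocess.run(["hdfs", "dfs", "-get", "test_pipe", "/home/test_pipe"], check=True)

# Later, e.g. in a local-mode job, the local copy can be reloaded with a file:// URI.
reloaded = PipelineModel.load("file:///home/test_pipe")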

Related

Pyspark - Save statsmodel trained models in GCS Buckets (or Driver)

I am working with PySpark to build regression models for each group in the data. To do so, I am using a Pandas UDF, and I am able to build the models successfully, generate the outputs, and return them as a DataFrame.
At the same time, I wanted to save each model for future use, but somehow I could not save it to the GCS bucket.
If I simply save the model to the current working directory, the models are saved, but I don't have access to models saved locally on the worker nodes.
The code used to save the model locally is as follows:
model.save(l2_category+'.pickle')
The code used to save the model to the GCS bucket is as follows:
with open(os.path.join(gcs_path, l2_category + '.pkl'), 'w') as f:
    pickle.dump(model, f)
But saving to GCS throws an error
FileNotFoundError: [Errno 2] No such file or directory: 'gs://darkstores-data-eng_stg/multi_sku_test/models/610a967aac6e434026bb7fa9.pkl'
Can someone help me with the best way to tackle this?
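One pattern that usually resolves this kind of error (a sketch only: the built-in open() does not understand gs:// URLs, so the file has to go through a GCS client such as google-cloud-storage, which is assumed to be installed on the workers; bucket and blob names below are illustrative):

import pickle
import tempfile
from google.cloud import storage  # assumed to be available on the worker nodes

def upload_model_to_gcs(model, bucket_name, blob_path):
    # Serialize the model to a local temporary file first; open() cannot write to gs:// paths.
    with tempfile.NamedTemporaryFile(suffix=".pkl") as tmp:
        pickle.dump(model, tmp)
        tmp.flush()
        # Upload the local file to the bucket with the GCS client.
        storage.Client().bucket(bucket_name).blob(blob_path).upload_from_filename(tmp.name)

# Illustrative call from inside the Pandas UDF, mirroring the paths in the question:
# upload_model_to_gcs(model, "darkstores-data-eng_stg", "multi_sku_test/models/" + l2_category + ".pkl")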

How to convert Neural network model (.h5) file to spark RDD file in python

I want to store the deep learning model data in the Spark environment as an RDD file, and to load the model back from the RDD file (i.e. reversing the conversion) in Python. Can you suggest a possible way of doing this?
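There is no model-aware conversion built into Spark, but a minimal sketch of one interpretation (treating the .h5 file as opaque bytes, with placeholder paths) is to hold the file's bytes in an RDD, persist that RDD, and write the bytes back out to a local .h5 file when the model is needed again:

# sc is the SparkContext; paths are placeholders.
model_rdd = sc.binaryFiles("hdfs:///models/model.h5")    # RDD of (path, file content as bytes)
model_rdd.saveAsPickleFile("hdfs:///models/model_rdd")   # persist the bytes as an RDD file

# Reverting the conversion: read the bytes back and restore a local .h5 file.
_, content = sc.pickleFile("hdfs:///models/model_rdd").first()
with open("/tmp/model.h5", "wb") as f:
    f.write(content)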

Serve real-time predictions with trained Spark ML model [duplicate]

This question already has answers here:
How to serve a Spark MLlib model?
(4 answers)
Closed 5 years ago.
We are currently testing a prediction engine based on Spark's implementation of LDA in Python:
https://spark.apache.org/docs/2.2.0/ml-clustering.html#latent-dirichlet-allocation-lda
https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA
(we are using the pyspark.ml package, not pyspark.mllib)
We were able to successfully train a model on a Spark cluster (using Google Cloud Dataproc). Now we are trying to use the model to serve real-time predictions as an API (e.g. a Flask application).
What would be the best approach to achieve this?
Our main pain point is that it seems we need to bring back the whole Spark environment in order to load the trained model and run the transform.
So far we've tried running Spark in local mode for each received request, but this approach gave us:
Poor performance (time to spin up the SparkSession, load the models, run the transform...)
Poor scalability (inability to process concurrent requests)
The whole approach seems quite heavy. Is there a simpler alternative, or even one that does not involve Spark at all?
Below is simplified code for the training and prediction steps.
Training code
def train(input_dataset):
    conf = pyspark.SparkConf().setAppName("lda-train")
    spark = SparkSession.builder.config(conf=conf).getOrCreate()

    # Generate count vectors
    count_vectorizer = CountVectorizer(...)
    vectorizer_model = count_vectorizer.fit(input_dataset)
    vectorized_dataset = vectorizer_model.transform(input_dataset)

    # Instantiate LDA model
    lda = LDA(k=100, maxIter=100, optimizer="em", ...)

    # Train LDA model
    lda_model = lda.fit(vectorized_dataset)

    # Save models to external storage
    vectorizer_model.write().overwrite().save("gs://...")
    lda_model.write().overwrite().save("gs://...")
Prediction code
def predict(input_query):
    conf = pyspark.SparkConf().setAppName("lda-predict").setMaster("local")
    spark = SparkSession.builder.config(conf=conf).getOrCreate()

    # Load models from external storage
    vectorizer_model = CountVectorizerModel.load("gs://...")
    lda_model = DistributedLDAModel.load("gs://...")

    # Run prediction on the input data using the loaded models
    vectorized_query = vectorizer_model.transform(input_query)
    transformed_query = lda_model.transform(vectorized_query)
    ...

    spark.stop()
    return transformed_query
If you already have a trained machine learning model in Spark, you can use Hydrosphere Mist to serve the models (for testing or prediction) through a REST API without creating a SparkContext yourself. This saves you from recreating the Spark environment and lets you rely only on web services for prediction.
Refer:
https://github.com/Hydrospheredata/mist
https://github.com/Hydrospheredata/spark-ml-serving
https://github.com/Hydrospheredata/hydro-serving
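If Spark has to stay in the serving path, another mitigation (a sketch only, with an assumed Flask route and input schema; it removes the per-request start-up cost but not the concurrency limits) is to create the local SparkSession and load the models once at service start-up, rather than once per request:

from flask import Flask, jsonify, request
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizerModel
from pyspark.ml.clustering import DistributedLDAModel

app = Flask(__name__)

# One long-lived session and one model load, paid at start-up instead of per request.
spark = SparkSession.builder.master("local[*]").appName("lda-serve").getOrCreate()
vectorizer_model = CountVectorizerModel.load("gs://...")  # same paths as in the question
lda_model = DistributedLDAModel.load("gs://...")

@app.route("/predict", methods=["POST"])
def predict():
    # "tokens" is an assumed input column name matching the CountVectorizer's inputCol.
    query_df = spark.createDataFrame([(request.json["tokens"],)], ["tokens"])
    result = lda_model.transform(vectorizer_model.transform(query_df))
    return jsonify(result.first()["topicDistribution"].toArray().tolist())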

How to update a ML model during a spark streaming job without restarting the application?

I've got a Spark Streaming job whose goal is to:
read a batch of messages
predict a variable Y given these messages using a pre-trained ML pipeline
The problem is, I'd like to be able to update the model used by the executors without restarting the application.
Simply put, here's what it looks like :
model = # model initialization

def preprocess(keyValueList):
    # do some preprocessing

def predict(preprocessedRDD):
    if not preprocessedRDD.isEmpty():
        df = # create df from rdd
        df = model.transform(df)
        # more things to do

stream = KafkaUtils.createDirectStream(ssc, [kafkaTopic], kafkaParams)
stream.mapPartitions(preprocess).foreachRDD(predict)
In this case, the model is simply used. Not updated.
I've thought about several possibilities, but I have now crossed them all out:
broadcasting the model every time it changes (a broadcast variable is read-only, so it cannot be updated)
reading the model from HDFS on the executors (this needs the SparkContext, so it is not possible there)
Any idea?
Thanks a lot!
I've solved this issue before in two different ways:
a TTL on the model
rereading the model on each batch
Both of these solutions assume an additional job that regularly retrains on the data you've accumulated (e.g. once a day).
The function you pass to foreachRDD is executed by the driver; it's only the RDD operations themselves that are performed by the executors. As such, you don't need to serialize the model - assuming you are using a Spark ML pipeline that operates on RDDs, which as far as I know they all do. Spark handles the training/prediction for you; you don't need to distribute it manually.
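A minimal sketch of the re-reading approach, assuming the separate training job writes the refreshed PipelineModel to a fixed HDFS path (the path, TTL, and DataFrame creation below are illustrative):

import time
from pyspark.ml import PipelineModel

MODEL_PATH = "hdfs:///models/current"   # written by the periodic training job (assumed)
MODEL_TTL_SECONDS = 3600                # reload the model at most once per hour

model = PipelineModel.load(MODEL_PATH)
last_load = time.time()

def predict(preprocessedRDD):
    global model, last_load
    # The foreachRDD function runs on the driver, so the model can be reloaded here
    # without explicitly shipping an updated copy to the executors.
    if time.time() - last_load > MODEL_TTL_SECONDS:
        model = PipelineModel.load(MODEL_PATH)
        last_load = time.time()
    if not preprocessedRDD.isEmpty():
        df = spark.createDataFrame(preprocessedRDD)  # df creation as in the question; spark is the SparkSession
        df = model.transform(df)
        # more things to do

stream.mapPartitions(preprocess).foreachRDD(predict)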

Pyspark reading caffe models from HDFS

I am using the Caffe library for image detection with the PySpark framework. I am able to run the Spark program in local mode, where the model is present in the local file system.
But when I want to deploy it in cluster mode, I don't know the correct way to do it. I have tried the following approach:
Adding the files to HDFS, and using addFile or --files when submitting jobs:
sc.addFile("hdfs:///caffe-public/dataset/test.caffemodel")
Reading the model on each worker node using:
model_weight = SparkFiles.get('test.caffemodel')
net = caffe.Net(model_define, model_weight, caffe.TEST)
SparkFiles.get() returns the local file location on the worker node (not the HDFS one), so I can reconstruct my model using the path it returns. This approach also works fine in local mode; however, in distributed mode it results in the following error:
ERROR server.TransportRequestHandler: Error sending result StreamResponse{streamId=/files/xxx, byteCount=xxx, body=FileSegmentManagedBuffer{file=xxx, offset=0,length=xxxx}} to /192.168.100.40:37690; closing connection
io.netty.handler.codec.EncoderException: java.lang.NoSuchMethodError: io.netty.channel.DefaultFileRegion.<init>(Ljava/io/File;JJ)V
It seems like the data is too large to shuffle, as discussed in Apache Spark: network errors between executors. However, the size of the model is only around 1 MB.
Updated:
I found that if the path in sc.addFile(path) is on HDFS, the error does not appear. However, when the path is on the local file system, the error appears.
My questions are:
Is there any other possibility that could cause the above exception, other than the size of the file? (Spark is running on YARN, and I use the default shuffle service, not an external shuffle service.)
If I do not add the file when submitting, how do I read the model file from HDFS using PySpark (so that I can reconstruct the model using the Caffe API)? Or is there any way to get the path other than SparkFiles.get()?
Any suggestions will be appreciated!!
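For the second question, one workaround that avoids SparkFiles entirely (a sketch; the HDFS path, the images_rdd input, and the per-record detection are illustrative, and caffe.Net is called exactly as in the question) is to read the weights once on the driver, broadcast the raw bytes, and rebuild the net from a worker-local temporary file inside each task:

# Read the ~1 MB weights file from HDFS on the driver and broadcast the raw bytes.
weights_bytes = sc.binaryFiles("hdfs:///caffe-public/dataset/test.caffemodel").first()[1]
weights_bc = sc.broadcast(bytes(weights_bytes))

def detect(partition):
    import caffe
    import tempfile
    # Write the broadcast bytes to a worker-local file so caffe can open the weights by path.
    with tempfile.NamedTemporaryFile(suffix=".caffemodel", delete=False) as tmp:
        tmp.write(weights_bc.value)
    net = caffe.Net(model_define, tmp.name, caffe.TEST)  # model_define as in the question
    for image in partition:
        # run detection on each image with net here (omitted)
        yield image

results = images_rdd.mapPartitions(detect)  # images_rdd: hypothetical RDD of input images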
