PySpark - Save statsmodels trained models in GCS Buckets (or Driver) - apache-spark

I am working with PySpark to build regression models for each group in the data. To do so, I am using pandas UDFs, and I am able to build the models successfully, generate the outputs, and return them as a DataFrame.
At the same time, I want to save each model for future use, but somehow I could not save them to a GCS bucket.
If I simply save a model to the current working directory, it is saved, but I don't have access to files saved locally on the worker nodes.
The code used to save the model locally is as follows:
model.save(l2_category+'.pickle')
The code used to save the model to the GCS bucket is as follows:
with open(os.path.join(gcs_path, l2_category + '.pkl'), 'wb') as f:
    pickle.dump(model, f)
But saving to GCS throws an error:
FileNotFoundError: [Errno 2] No such file or directory: 'gs://darkstores-data-eng_stg/multi_sku_test/models/610a967aac6e434026bb7fa9.pkl'
Can someone help me with the best way to tackle this?
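One approach, as a sketch rather than an answer from the thread: the builtin open() only understands local paths, which is why a gs:// URL raises FileNotFoundError. Serialize the model to bytes first, then hand the bytes to a GCS-aware client such as google-cloud-storage. The DummyModel class below is a stand-in for the fitted statsmodels result, and the commented-out upload assumes google-cloud-storage is installed with credentials available on the workers:

```python
import pickle

class DummyModel:
    """Stand-in for a fitted statsmodels results object."""
    params = [1.0, 2.0]

model = DummyModel()

# open() cannot resolve gs:// URLs, hence the FileNotFoundError.
# Serialize to bytes instead of writing through open():
payload = pickle.dumps(model)

# Hypothetical upload (assumption: google-cloud-storage is installed
# and the workers have GCS credentials):
#   from google.cloud import storage
#   bucket = storage.Client().bucket("darkstores-data-eng_stg")
#   blob = bucket.blob("multi_sku_test/models/" + l2_category + ".pkl")
#   blob.upload_from_string(payload)

# Round-trip locally to show the bytes form a valid pickle:
restored = pickle.loads(payload)
print(restored.params)
```

An equivalent route is gcsfs, whose open() accepts gs:// paths directly; either way, the key point is that a GCS-aware layer has to sit between pickle and the bucket.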

Related

Save Trained PySpark ML Pipeline to local file system

I have a PySpark pipeline that looks like this:
from pyspark.ml import Pipeline, clustering
from pyspark.ml.feature import PCA, StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
pca = PCA(inputCol=scaler.getOutputCol(), outputCol="pca_output")
kmeans = clustering.KMeans(seed=2014)
pipeline = Pipeline(stages=[scaler, pca, kmeans])
I trained this pipeline over my data and I can produce outputs with it.
Now, I want to save this model to disk and later use it in another piece of code.
I'm able to call the save method on the pipeline object and save it to HDFS like this:
pipeline.save("test_pipe")
I want to save the pipeline to local file system though for later use.
I've tried to write it like this, but it failed with an error:
pipeline.save("file:///home/test_pipe")
How can I save the trained pipeline to the local file system?
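A hedged aside, not from the original thread: one common cause of failures with file:// URIs is a malformed or relative path after the scheme; the path must be absolute, and in cluster mode it must be writable on the driver and every executor. A small sketch of building a well-formed URI (the target path is illustrative):

```python
import pathlib

# Build an absolute file:// URI; a relative or malformed path after
# "file://" is a common cause of save failures.
target = pathlib.Path("/home/test_pipe")
uri = target.as_uri()
print(uri)

# In the Spark job (assumption: `pipeline` supports the MLWritable API):
#   pipeline.write().overwrite().save(uri)
```

If the local path genuinely cannot be reached from the executors, the fallback is to save to HDFS as shown above and copy the result out afterwards (e.g. with `hdfs dfs -get`).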

Renaming Exported files from Spark Job

We are currently using a Spark job on Databricks which does processing on our data lake in S3.
Once the processing is done, we export our results to an S3 bucket using a normal
df.write()
The issue is that when we write the dataframe to S3, the file names are controlled by Spark, but as per our agreement we need to rename these files to meaningful names.
Since S3 doesn't have a rename feature, we are currently using boto3 to copy the file under the expected name and delete the original.
This process is very complex and does not scale as more clients get onboarded.
Do we have any better solution to rename files exported from Spark to S3?
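The copy-and-delete workaround described above can be sketched as a small helper. S3 really has no rename operation, so boto3's copy_object followed by delete_object is the standard emulation; the fake client and the key names below are illustrative so the helper can be exercised without AWS:

```python
def rename_s3_object(client, bucket, src_key, dst_key):
    """S3 has no rename; emulate it with copy + delete (boto3-style client)."""
    client.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": src_key},
        Key=dst_key,
    )
    client.delete_object(Bucket=bucket, Key=src_key)

class FakeS3Client:
    """Records calls so the helper can be demonstrated without AWS."""
    def __init__(self):
        self.calls = []
    def copy_object(self, **kwargs):
        self.calls.append(("copy", kwargs["Key"]))
    def delete_object(self, **kwargs):
        self.calls.append(("delete", kwargs["Key"]))

fake = FakeS3Client()
rename_s3_object(fake, "my-bucket", "part-00000.csv", "report.csv")
print(fake.calls)
```

In the real job the first argument would be `boto3.client("s3")`; note that copy_object is limited to objects up to 5 GB, above which a multipart copy is required.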
It's not possible to do it directly in Spark's save.
Spark uses Hadoop's output format, which requires data to be partitioned - that's why you have part- files. If the file is small enough to fit into memory, one workaround is to convert it to a pandas dataframe and save it as CSV from there.
df_pd = df.toPandas()
df_pd.to_csv("path")

How to convert a neural network model (.h5) file to a Spark RDD in Python

I want to store the deep learning model data in the Spark environment as an RDD, and to load the model back from the RDD (i.e., reverting the conversion) in Python. Can you give a possible way of doing it?

Pyspark reading caffe models from HDFS

I am using the caffe library for image detection with the PySpark framework. I am able to run the Spark program in local mode, where the model is present in the local file system.
But when I want to deploy it in cluster mode, I don't know the correct way to do it. I have tried the following approach:
Adding the files to HDFS, and using sc.addFile or --files when submitting jobs:
sc.addFile("hdfs:///caffe-public/dataset/test.caffemodel")
Reading the model in each worker node using
model_weight = SparkFiles.get('test.caffemodel')
net = caffe.Net(model_define, model_weight, caffe.TEST)
Since SparkFiles.get() returns the local file location on the worker node (not the HDFS one), I can reconstruct my model using the path it returns. This approach also works fine in local mode; however, in distributed mode it results in the following error:
ERROR server.TransportRequestHandler: Error sending result StreamResponse{streamId=/files/xxx, byteCount=xxx, body=FileSegmentManagedBuffer{file=xxx, offset=0,length=xxxx}} to /192.168.100.40:37690; closing connection
io.netty.handler.codec.EncoderException: java.lang.NoSuchMethodError: io.netty.channel.DefaultFileRegion.<init>(Ljava/io/File;JJ)V
It seems like the data is too large to shuffle, as discussed in Apache Spark: network errors between executors. However, the size of the model is only around 1 MB.
Update:
I found that if the path in sc.addFile(path) is on HDFS, the error does not appear. However, when the path is on the local file system, the error appears.
My questions are:
1. Is there any other possibility that could cause the above exception, other than the size of the file? (Spark is running on YARN, and I use the default shuffle service, not the external shuffle service.)
2. If I do not add the file at submit time, how do I read the model file from HDFS using PySpark, so that I can reconstruct the model using the caffe API? Or is there any way to get the path other than SparkFiles.get()?
Any suggestions will be appreciated!!
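One way to approach the second question, as a sketch under assumptions rather than an answer from the thread: sc.binaryFiles can read the model bytes straight from HDFS, and since caffe only opens local paths, those bytes can be spilled to a temp file on the worker. The Spark and caffe calls are commented out as assumptions; only the temp-file helper runs here:

```python
import os
import tempfile

def materialize_model(model_bytes):
    """Write model bytes to a local temp file so a library that only
    accepts local paths (like caffe) can open them."""
    fd, path = tempfile.mkstemp(suffix=".caffemodel")
    with os.fdopen(fd, "wb") as f:
        f.write(model_bytes)
    return path

# In the real job (assumption: `sc` is the SparkContext):
#   _, data = sc.binaryFiles(
#       "hdfs:///caffe-public/dataset/test.caffemodel").first()
#   net = caffe.Net(model_define, materialize_model(bytes(data)), caffe.TEST)

local_path = materialize_model(b"fake-weights")
with open(local_path, "rb") as f:
    print(f.read())
```

This sidesteps SparkFiles.get() entirely, at the cost of each task fetching the bytes from HDFS itself.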

MLlib model (RandomForestModel) saves model with numerous small parquet files

I'm trying to train the MLlib RandomForestRegression model using the RandomForest.trainRegressor API.
After training, when I try to save the model, the resulting model folder has a size of 6.5 MB on disk, but there are 1120 small parquet files in the data folder that seem unnecessary and are slow to upload/download to S3.
Is this the expected behavior? I'm already repartitioning the labeledPoints to have 1 partition, but this happens regardless.
Repartitioning with rdd.repartition(1) before training does not help much. It makes training potentially slower, because all parallel operations become effectively sequential: the whole parallelism machinery is based on partitions.
Instead, I came up with a simple hack and set spark.default.parallelism to 1, since the save procedure uses sc.parallelize to create the stream it saves.
Keep in mind that this will affect countless places in your app, such as groupBy and join. My suggestion is to extract the training & saving of the model into a separate application and run it in isolation.
