Vertex AI custom model training for a PySpark ML model - apache-spark

Is it possible to train a Spark/PySpark MLlib model using Vertex AI custom container training? I couldn't find any reference to Spark model training in the Vertex AI documentation. For distributed model training, the only options available seem to be PyTorch and TensorFlow.

It is possible with custom containers if you leverage the Spark Kubernetes operator, but this is not a well-documented workflow and requires a complex setup. GCP's preferred way to run Spark jobs is Dataproc (https://cloud.google.com/dataproc), which supports PySpark, SparkR, and Scala. You can still trigger a Dataproc Spark job from Vertex Pipelines and save the model for predictions in Vertex via MLeap.
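As a rough sketch of the Dataproc route, this is the kind of call a pipeline step could make to submit a PySpark training script to an existing cluster. It assumes the google-cloud-dataproc client library is installed and credentials are configured; the cluster name, bucket, and script path are hypothetical placeholders, not values from the question.

```python
def build_pyspark_job(cluster_name, main_python_file_uri, args=None):
    """Build the job payload for a Dataproc PySpark job."""
    return {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            "main_python_file_uri": main_python_file_uri,
            "args": list(args or []),
        },
    }


def submit_pyspark_job(project_id, region, job):
    """Submit the job and wait for it to finish."""
    # Imported lazily so the payload builder works without the client library.
    from google.cloud import dataproc_v1

    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    return operation.result()


# Hypothetical names, for illustration only.
job = build_pyspark_job(
    cluster_name="my-cluster",
    main_python_file_uri="gs://my-bucket/train.py",
    args=["--output", "gs://my-bucket/model"],
)
```

Calling submit_pyspark_job("my-project", "us-central1", job) would then run train.py on the cluster; the script itself is responsible for fitting and saving the model.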

Related

Deploying SNIPS model on spark clusters

I have developed a SNIPS NLU (or similar) model and am trying to deploy it on Spark clusters using PySpark, but I am not sure how to do it. Any help?

Using AWS Sagemaker for model performance without creating endpoint

I've been using Amazon SageMaker notebooks to build a PyTorch model for an NLP task.
I know you can use SageMaker for training, deployment, hyperparameter tuning, and model monitoring.
However, it looks like you have to create an inference endpoint in order to monitor the model's inference performance.
I already have an EC2 instance set up to perform inference tasks on our model, which is currently on a development box, and would rather not use an endpoint to make
Is it possible to use SageMaker to train, run hyperparameter tuning, and evaluate the model without creating an endpoint?
If you don't want to keep an inference endpoint up, one option is to use SageMaker Processing to run a job that takes your trained model and test dataset as input, performs inference and computes evaluation metrics, and saves them to S3 in a JSON file.
This Jupyter notebook example steps through (1) preprocessing training and test data, (2) training a model, and then (3) evaluating the model.
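To give an idea of what the evaluation step of such a Processing job might contain, here is a minimal sketch of the metric and report part only. The function names and the sample labels are made up for illustration, and loading the model and running inference are left out. In a real Processing job the output directory would be whatever path you declare as a processing output (conventionally somewhere under /opt/ml/processing/), so that SageMaker uploads it to S3.

```python
import json
import os
import tempfile


def compute_metrics(labels, predictions):
    """Binary-classification metrics from parallel lists of 0/1 values."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    correct = sum(1 for y, p in pairs if y == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def write_report(metrics, output_dir):
    """Write the metrics as evaluation.json and return the file path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "evaluation.json")
    with open(path, "w") as handle:
        json.dump(metrics, handle, indent=2)
    return path


# Illustrative labels and predictions; a temp dir stands in for the job's output dir.
labels = [1, 0, 1, 1, 0]
predictions = [1, 0, 0, 1, 1]
metrics = compute_metrics(labels, predictions)
report_path = write_report(metrics, tempfile.mkdtemp())
```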
You can run inference with your model on AWS SageMaker in two ways: set up an endpoint or create a batch transform job. You can probably try the latter.
The good thing about a batch transform job is that you can specify the S3 path for both the input and the output data. When the job completes, it uploads the output directly to the S3 path.
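For reference, a batch transform job can be started through boto3 roughly like this. It is only a sketch: the job name, model name, bucket paths, and instance type are hypothetical, and it assumes the model has already been registered in SageMaker and that AWS credentials are configured.

```python
def build_transform_request(job_name, model_name, input_s3_uri, output_s3_uri,
                            instance_type="ml.m5.large", content_type="text/csv"):
    """Build the keyword arguments for SageMaker's create_transform_job call."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": content_type,
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {"InstanceType": instance_type, "InstanceCount": 1},
    }


def start_transform_job(request):
    """Start the job; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without it

    return boto3.client("sagemaker").create_transform_job(**request)


# Hypothetical names and paths, for illustration only.
request = build_transform_request(
    job_name="nlp-eval-batch",
    model_name="my-pytorch-model",
    input_s3_uri="s3://my-bucket/test-data/",
    output_s3_uri="s3://my-bucket/predictions/",
)
```

Once the job finishes, the predictions land under the output path and can be scored offline, with no endpoint kept running.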

Can MLLib classifiers be trained and used without a Spark installation?

I want to use some of the classifiers provided by MLLib (random forests, etc) but I want to use them without connecting to a Spark cluster.
If I need to somehow run some Spark stuff in-process so that I have a Spark context to use, that's fine. But I haven't been able to find any information or an example for such a use case.
So my two questions are:
Is there a way to use the MLLib classifiers without a Spark context at all?
Otherwise, can I use them by starting a Spark context in-process, without needing any kind of actual Spark installation?
org.apache.spark.mllib models:
Cannot be trained without a Spark cluster.
Can usually be used for predictions without a cluster, with the exception of distributed models such as ALS.
org.apache.spark.ml models:
Require a Spark cluster for training.
Require a Spark cluster for predictions, although this might change in the future (https://issues.apache.org/jira/browse/SPARK-10413).
There are a number of third-party tools designed to export Spark ml models to a form that can be used in a Spark-agnostic environment (jpmml-spark and modeldb, to name a few, in no particular order of preference).
Spark mllib models have limited PMML support as well.
Commercial vendors usually provide their own tools for productionizing Spark models.
You can of course use a local "cluster", but it is probably still a bit too heavy for most applications. Starting a full context takes at least a few seconds and has a significant memory footprint.
Also:
Best Practice to launch Spark Applications via Web Application?
How to serve a Spark MLlib model?
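For the in-process local "cluster" option, a minimal PySpark sketch could look like the following. It assumes pyspark has been installed with pip (which bundles Spark itself) and that a Java runtime is available; the toy data and function names are made up for illustration.

```python
def local_master(threads=None):
    """Spark master URL for in-process local mode; None means all cores."""
    return "local[*]" if threads is None else f"local[{threads}]"


def train_and_predict_locally(threads=2):
    """Fit a small random forest in local mode and return its predictions."""
    # Imported lazily: needs `pip install pyspark` and a Java runtime.
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master(local_master(threads))
        .appName("local-rf-sketch")
        .getOrCreate()
    )
    try:
        # Toy training data: (label, features).
        train = spark.createDataFrame(
            [
                (0.0, Vectors.dense(0.0, 0.1)),
                (0.0, Vectors.dense(0.2, 0.0)),
                (1.0, Vectors.dense(0.9, 1.0)),
                (1.0, Vectors.dense(1.0, 0.8)),
            ],
            ["label", "features"],
        )
        model = RandomForestClassifier(numTrees=5, seed=42).fit(train)
        rows = model.transform(train).select("prediction").collect()
        return [row["prediction"] for row in rows]
    finally:
        spark.stop()
```

Calling train_and_predict_locally() starts and stops a full Spark context inside the current process, which is where the startup time and memory cost mentioned above come from.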

Apache Spark & Machine Learning - Using in production

I'm having some difficulty figuring out how to use Spark's machine learning capabilities in a real-life production environment.
What I want to do is the following:
Develop a new ML model using notebooks
Serve the learned model through a REST API (something like POST - /api/v1/mymodel/predict)
Let's say the ML training process is handled by a notebook, and once the model requirements are fulfilled the model is saved to an HDFS file, to be loaded later by a Spark application.
I know I could write a long-running Spark application that exposes the API and run it on my Spark cluster, but I don't think this is really a scalable approach: even if the data transformations and the ML functions run on the worker nodes, the HTTP/API-related code would still run on one node, the one on which spark-submit is invoked (correct me if I'm wrong).
Another approach is to use the same long-running application, but in a local standalone cluster. I could deploy the same application as many times as I want and put a load balancer in front of it. With this approach the HTTP/API part is handled fine, but the Spark part is not using the cluster capabilities at all (this might not be a problem, since it only has to perform a single prediction per request).
There is a third approach which uses SparkLauncher, which wraps the Spark job in a separate jar, but I don't really like flying jars, and it is difficult to retrieve the result of the prediction (a queue maybe, or HDFS).
So basically the question is: what is the best approach to consume Spark's ML models through a REST API?
Thank You
You have three options:
1. Trigger a batch ML job through a Spark API such as spark-jobserver upon client request.
2. Trigger a batch ML job through a scheduler such as Airflow, write the output to a DB, and expose the DB to the client via REST.
3. Keep a Structured Streaming job (or a recursive function) running to scan the input data source, update/append to the DB continuously, and expose the DB to the client via REST.
If you have a single prediction per request and your input data is constantly updated, I would suggest option 3. It transforms data in near real time, so the client always has access to the output, and you can notify the client when new data is ready via REST or SNS. You could keep a fairly small Spark cluster to handle data ingest and scale the REST service and DB with request and data volume (behind a load balancer).
If you anticipate infrequent requests and the data source is updated periodically, say once a day, option 1 or 2 is suitable, since you can launch a bigger cluster and shut it down when the job completes.
Hope it helps.
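To make the "write the output to a DB, expose the DB via REST" part more concrete, here is a small sketch with SQLite standing in for the real database. The table layout, ids, and scores are invented for illustration; in practice the rows would come from the batch or streaming Spark job, and the lookup would sit behind the REST service.

```python
import sqlite3


def write_predictions(conn, rows):
    """Upsert (entity_id, score) pairs, as the batch or streaming job would."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions "
        "(entity_id TEXT PRIMARY KEY, score REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO predictions (entity_id, score) VALUES (?, ?)",
        rows,
    )
    conn.commit()


def lookup_prediction(conn, entity_id):
    """What the REST layer would call per request; returns None if unknown."""
    row = conn.execute(
        "SELECT score FROM predictions WHERE entity_id = ?", (entity_id,)
    ).fetchone()
    return None if row is None else row[0]


# In-memory database and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
write_predictions(conn, [("user-1", 0.91), ("user-2", 0.12)])
```

The point of this design is that the request path never touches Spark: the cluster only has to keep the table fresh.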
The problem is that you don't want to keep your Spark cluster running and deploy your REST API inside it for prediction, as it's slow.
So, to achieve real-time prediction with low latency, here are a couple of solutions.
What we are doing is training the model, exporting it, and using it outside Spark to do the prediction.
PMML: You can export the model as a PMML file if the ML algorithm you used is supported by the PMML standard. Spark ML models can be exported as a PMML file using the JPMML library. You can then create your REST API and use JPMML-Evaluator to predict using your Spark ML models.
MLeap: MLeap is a common serialization format and execution engine for machine learning pipelines. It supports Spark, Scikit-learn, and TensorFlow for training pipelines and exporting them to an MLeap Bundle. Serialized pipelines (bundles) can be deserialized back into Spark for batch-mode scoring, or into the MLeap runtime to power real-time API services. It supports multiple platforms, though I have only used it for Spark ML models, where it works really well.
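Whichever export format you pick, the serving side ends up being an ordinary web service with no Spark in it. Below is a bare-bones sketch using only the Python standard library. The predict function is a hypothetical stand-in for whatever scoring call your exported model provides (a PMML evaluator, an MLeap runtime, and so on), and the route is the one from the question.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for the exported model's scoring call."""
    return 1.0 if sum(features) > 1.0 else 0.0


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/api/v1/mymodel/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def call_predict(port, features):
    """Client side: POST a feature vector and decode the JSON reply."""
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/api/v1/mymodel/predict",
        data=json.dumps({"features": features}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Port 0 lets the OS pick a free port; the server runs in a background thread.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
result = call_predict(server.server_port, [0.7, 0.6])
server.shutdown()
server.server_close()
```

Because the service is stateless, you can run as many copies as you need behind a load balancer, which is the scaling property the question was after.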

fit in distributed, predict in a stand alone

How can one train (fit) a model in a distributed big data platform (e.g Apache Spark) yet use that model in a stand alone machine (e.g. JVM) with as little dependency as possible?
I heard of PMML yet I am not sure if it is enough. Also Spark 2.0 supports persistent model saving yet I am not sure what is necessary to load and run those models.
Apache Spark persistence is about saving and loading Spark ML pipelines (think of it as Python's pickle mechanism, or R's RDS mechanism). The metadata is stored as JSON and the model data typically as Parquet. These data structures map to Spark ML classes, so they don't make sense on other platforms.
As for PMML, you can convert Spark ML pipelines to PMML documents using the JPMML-SparkML library. You can execute PMML documents (it doesn't matter whether they came from Apache Spark, Python, or R) using the JPMML-Evaluator library. If you're using Apache Maven to manage and build your project, JPMML-Evaluator can be included by adding just one dependency declaration to your project's POM.
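To illustrate why the persisted format is tied to Spark, here is a small sketch that reads the kind of JSON metadata Spark writes for each saved stage (in a metadata/part-00000 file under the save path). The sample line below is hand-written for illustration, with made-up values, but the keys follow what Spark's default writer produces.

```python
import json

# Illustrative metadata line; the values are hypothetical, not from a real model.
SAMPLE_METADATA = (
    '{"class": "org.apache.spark.ml.classification.LogisticRegressionModel", '
    '"timestamp": 1500000000000, "sparkVersion": "2.0.0", '
    '"uid": "logreg_example", "paramMap": {"maxIter": 100, "regParam": 0.0}}'
)


def describe_saved_stage(metadata_line):
    """Pull out the fields that tie a saved stage to a specific Spark class."""
    meta = json.loads(metadata_line)
    return {
        "spark_class": meta["class"],
        "spark_version": meta["sparkVersion"],
        "params": meta["paramMap"],
    }


stage = describe_saved_stage(SAMPLE_METADATA)
```

The class field names a JVM class inside Spark, so loading the model means instantiating that class, which is why a Spark-free runtime needs a converted format such as PMML.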
