I have a Spark streaming application which trains a model and periodically stores the model to HFS. In a http based web service, I would like to POST some values and retrieve a prediction for it. The service should also reload the model on demand (e.g. via GET request).
I implemented a web server with Spark and Spray, it works for proof-of-concept. But I'm not sure if it is a good design solution. What about providing the web server to external services if it runs on a cluster? How can I define on which node the service will be available? I'm not even sure if it is the right idea to use prediction models in this way. Maybe the best-practice is to integrate Spark in a standalone application and access the model on the shared filesystem (e.g. hfs), but this will lack cluster support, wont't it?
Summary: What is the best-practice design to build a prediction web service with Apache Spark.
Related
A peer of mine has created code that opens a restful api web service within an interactive spark job. The intent of our company is to use his code as a means of extracting data from various datasources. He can get it to work on his machine with a local instance of spark. He insists that this is a good idea and it is my job as DevOps to implement it with Azure Databricks.
As I understand it interactive jobs are for one-time analytics inquiries and for the development of non-interactive jobs to be run solely as ETL/ELT work between data sources. There is of course the added problem of determining the endpoint for the service binding within the spark cluster.
But I'm new to spark and I have scarcely delved into the mountain of documentation that exists for all the implementations of spark. Is what he's trying to do a good idea? Is it even possible?
The web-service would need to act as a Spark Driver. Just like you'd run spark-shell, run some commands , and then use collect() methods to bring all data to be shown in the local environment, that all runs in a singular JVM environment. It would submit executors to a remote Spark cluster, then bring the data back over the network. Apache Livy is one existing implementation for a REST Spark submission server.
It can be done, but depending on the process, it would be very asynchronous, and it is not suggested for large datasets, which Spark is meant for. Depending on the data that you need (e.g. highly using SparkSQL), it'd be better to query a database directly.
I have a question regarding a specific spark application usage.
So I want our Spark application to run as a REST API Server, like Spring Boot Applications, therefore it will not be a batch process, instead we will load the application and then we want to keep the application live (no call to spark.close()) and to use the application as Realtime query engine via some API which we will define. I am targeting to deploy it to Databricks. Any suggestions will be good.
I have checked Apache Livy, but not sure whether it will be good option or not.
Any suggestions will be helpful.
Spark isn't designed to run like this; it has no REST API server frameworks other than the HistoryServer and Worker UI built-in
If you wanted a long-running Spark action, then you could use Spark Streaming and issue actions to it via raw sockets, Kafka, etc. rather than HTTP methods
Good question let's discuss step by step
You can create it and it's working fine , following is example :
https://github.com/vaquarkhan/springboot-microservice-apache-spark
I am sure you must be thinking to create Dataset or Data frame and keep into memory and use as Cache (Redis,Gemfire etc ) but here is catch
i) If you have data in few 100k then you really not needed Apache Spark power Java app is good to return response really fast.
ii) If you have data in petabyte then loading into memory as dataset or data frame will not help as Apache Spark doesn’t support indexing since Spark is not a data management system but a fast batch data processing engine, and Gemfire you have flexibility to add index to fast retrieval of data.
Work Around :
Using Apache Ignite’s(https://ignite.apache.org/) In-memory indexes (refer Fast
Apache Spark SQL Queries)
Using data formats that supports indexing like ORC, Parquet etc.
So Why not use Sparing application with Apache Spark without using spark.close().
Spring application as micro service you need other services either on container or PCF/Bluemix/AWS /Azure/GCP etc and Apache Spark has own world and need compute power which is not available on PCF.
Spark is not a database so it cannot "store data". It processes data and stores it temporarily in memory, but that's not presistent storage.
Once Spark job submit you have to wait results in between you cannot fetch data.
How to use Spark with Spring application as Rest API call :
Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as Spark Context management, all via a simple REST interface or an RPC client library.
https://livy.apache.org/
I"m trying to figure out how to design my service in a microservice architecture.
My service aim is to calculate aggregations, we are using Apache Spark batch jobs.
So I'm really interested to get your advice and some experience you get in similar cases.
Im having some difficulties figuring out how to use spark's machine learning capabilities in a real life production environment.
What i want to do is the following:
Develop a new ml model using notebooks
Serve the learned model using REST api (something like POST - /api/v1/mymodel/predict)
Let say the ml training process is handled by a notebook, and once the model requirements are fulfilled it's saved into an hdfs file, to be later loaded by a spark application
I know i could write a long running spark application that exposes the api and run it on my spark cluster, but i don't think this is really a scalable approach, because even if the data transformations and the ml functions would run on the workers node, the http/api related code would still run on one node, the one on wich spark-submit is invoked (correct me if i'm wrong).
One other approach is to use the same long running application, but in a local-standalone cluster. I could deploy the same application as many times as i want, and put a load balancer in front of it. With this approach the http/api part is handled fine, but the spark part is not using the cluster capabilities at all (this could not be a problem, due to fact that it should only perform a single prediction per request)
There is a third approach wich uses SparkLauncher, wich wraps the spark job in a separate jar, but i don't really like flying jars, and it is difficult to retrieve the result of the prediction (a queue maybe, or hdfs)
So basically the question is: what is the best approach to consume spark's ml models through rest api?
Thank You
you have three options
trigger batch ML job via spark api spark-jobserver, upon client request
trigger batch ML job via scheduler airflow , write output to DB, expose DB via rest to client
keep structured-streaming / recursive functionon to scan input data source, update / append DB continuously, expose DB via rest to client
If you have single prediction per request, and your data input is constantly updated, I would suggest option 3, which would transform data in near-real-time at all times, and client would have constant access to output, you can notify client when new data is completed by sending notification via rest or sns, you could keep pretty small spark cluster that would handle data ingest, and scale rest service and DB upon request / data volume (load balancer)
If you anticipate rare requests where data source is updated periodically lets say once a day, option 1 or 2 will be suitable as you can launch bigger cluster and shut it down when completed.
Hope it helps.
The problem is you don't want to keep your spark cluster running and deploy your REST API inside it for the prediction as it's slow.
So to achieve real-time prediction with low latency, Here are a couple of solutions.
What we are doing is Training the model, exporting the model and use the model outside Spark to do the Prediction.
You can export the model as a PMML file if the ML Algorithm you used is supported by the PMML standards. Spark ML Models can be exported as JPMML file using the jpmml library. And then you can create your REST API and use JPMML Evaluator to predict using your Spark ML Models.
MLEAP MLeap is a common serialization format and execution engine for machine learning pipelines. It supports Spark, Scikit-learn and Tensorflow for training pipelines and exporting them to an MLeap Bundle. Serialized pipelines (bundles) can be deserialized back into Spark for batch-mode scoring or the MLeap runtime to power realtime API services. It supports multiple platforms, though I have just used it for Spark ML models and it works really well.
I have a spark application deployed on the cluster. I want to run the application with some variables passed from another application running on a remote machine. For example I will pass a query string from the application running remotely and I want my spark application to listen to that and process the query and give back the response to the caller.
Is it possible to do with any library or feature provided by spark.
A Spark application is like any other application. An application can take remote commands in a million different ways. Perhaps most common is to make the application an HTTP server. Then it can be remote controlled through a web interface or a REST API.
If you're using Spark through Scala, the Play Framework is a popular option.