Accessing spark program via APIs - apache-spark

I have a spark mlib program up and running that applies nlp on free text. I would need to access this program through rest apis. For eg: I need to apply ml on each rows of an excel sheet (convert each rows and get back results from the above ml program) by calling these APIs. How would I do that? basically trying to have an API wrapper around a spark program instead of going to jupyter notebook and manually doing it.

If you want to predict only via the rest API then your best solutions could be:
-Convert your model to PMML and make a web app to consume it
-Convert your model to MLeap and use the mleap-serving/mleap-spring-boot to host your machine learning model.
You can also implement spark in a web env, but you should consider the resource-heavy operations what spark needs.
Also you can use any scala IDE to work on your project. I prefer Intelij IDEA, but you can use Eclipse also. (you don't have to, you can simply export to PMML/Mleap from Jupiter)

Related

Deploying PySpark as a service

I have code in PySpark that parallelizes heavy computation. It takes two files and several parameters, performs the computation and generates information that at this moment is stored in a CSV file (tomorrow it would be ideal that the information is stored in a Postgres database).
Now, I want to consume this functionality as a service from a system made in Django, from which users will set the parameters of the Spark service, the selection of the two files and then query the results of the process.
I can think of several ways to cover this, but I don't know which one is the most convenient in terms of simplicity of implementation:
Use Spark API-REST: this would allow a request to be made from Django to the Spark cluster. The problem is that I found no official documentation, I get everything through blogs whose parameters and experiences correspond to a particular situation or solution. At no point does it mention, for example, how I could send files to be consumed through the API by Spark, or get the result.
Develop a simple API in Spark's Master to receive all parameters and execute the spark-submit command at the OS level. The awkwardness of this solution is that everything must be handled at the OS level, the parameter files and the final process result must be saved on disk and accessible by the Django server that wants to get it to save its information in the DB lately.
Integrate the Django app in the Master server, writing PySpark code inside It, Spark connects to the master server and runs the code that manipulates the RDDs. This scheme does not convince me because it sacrifices the independence between Spark and the Django application, which is already huge.
If someone could enlighten me about this, maybe due to lack of experience I am overlooking a cleaner, more robust, or idiomatic solution.
Thanks in advance

How to export a non PMML Algorithm for predicting the values in another JVM Process [duplicate]

I'm evaluating tools for production ML based applications and one of our options is Spark MLlib , but I have some questions about how to serve a model once its trained?
For example in Azure ML, once trained, the model is exposed as a web service which can be consumed from any application, and it's a similar case with Amazon ML.
How do you serve/deploy ML models in Apache Spark ?
From one hand, a machine learning model built with spark can't be served the way you serve in Azure ML or Amazon ML in a traditional manner.
Databricks claims to be able to deploy models using it's notebook but I haven't actually tried that yet.
On other hand, you can use a model in three ways :
Training on the fly inside an application then applying prediction. This can be done in a spark application or a notebook.
Train a model and save it if it implements an MLWriter then load in an application or a notebook and run it against your data.
Train a model with Spark and export it to PMML format using jpmml-spark. PMML allows for different statistical and data mining tools to speak the same language. In this way, a predictive solution can be easily moved among tools and applications without the need for custom coding. e.g from Spark ML to R.
Those are the three possible ways.
Of course, you can think of an architecture in which you have RESTful service behind which you can build using spark-jobserver per example to train and deploy but needs some development. It's not a out-of-the-box solution.
You might also use projects like Oryx 2 to create your full lambda architecture to train, deploy and serve a model.
Unfortunately, describing each of the mentioned above solution is quite broad and doesn't fit in the scope of SO.
One option is to use MLeap to serve a Spark PipelineModel online with no dependencies on Spark/SparkContext. Not having to use the SparkContext is important as it will drop scoring time for a single record from ~100ms to single-digit microseconds.
In order to use it, you have to:
Serialize your Spark Model with MLeap utilities
Load the model in MLeap (does not require a SparkContext or any Spark dependencies)
Create your input record in JSON (not a DataFrame)
Score your record with MLeap
MLeap is well integrated with all the Pipeline Stages available in Spark MLlib (with the exception of LDA at the time of this writing). However, things might get a bit more complicated if you are using custom Estimators/Transformers.
Take a look at the MLeap FAQ for more info about custom transformers/estimators, performances, and integration.
You are comparing two rather different things. Apache Spark is a computation engine, while mentioned by you Amazon and Microsoft solutions are offering services. These services might as well have Spark with MLlib behind the scene. They save you from the trouble building a web service yourself, but you pay extra.
Number of companies, like Domino Data Lab, Cloudera or IBM offer products that you can deploy on your own Spark cluster and easily build service around your models (with various degrees of flexibility).
Naturally you build a service yourself with various open source tools. Which specifically? It all depends on what you are after. How user should interact with the model? Should there be some sort of UI or jest a REST API? Do you need to change some parameters on the model or the model itself? Are the jobs more of a batch or real-time nature? You can naturally build all-in-one solution, but that's going to be a huge effort.
My personal recommendation would be to take advantage, if you can, of one of the available services from Amazon, Google, Microsoft or whatever. Need on-premises deployment? Check Domino Data Lab, their product is mature and allows easy working with models (from building till deployment). Cloudera is more focused on cluster computing (including Spark), but it will take a while before they have something mature.
[EDIT] I'd recommend to have a look at Apache PredictionIO, open source machine learning server - amazing project with lot's of potential.
I have been able to just get this to work. Caveats: Python 3.6 + using Spark ML API (not MLLIB, but sure it should work the same way)
Basically, follow this example provided on MSFT's AzureML github.
Word of warning: the code as-is will provision but there is an error in the example run() method at the end:
#Get each scored result
preds = [str(x['prediction']) for x in predictions]
result = ",".join(preds)
# you can return any data type as long as it is JSON-serializable
return result.tolist()
Should be:
#Get each scored result
preds = [str(x['prediction']) for x in predictions]
#result = ",".join(preds)
# you can return any data type as long as it is JSON-serializable
output = dict()
output['predictions'] = preds
return json.dumps(output)
Also, completely agree with MLeap assessment answer, this can make the process run way faster but thought I would answer the question specifically

Productionizing Spark Pipeline [duplicate]

I'm evaluating tools for production ML based applications and one of our options is Spark MLlib , but I have some questions about how to serve a model once its trained?
For example in Azure ML, once trained, the model is exposed as a web service which can be consumed from any application, and it's a similar case with Amazon ML.
How do you serve/deploy ML models in Apache Spark ?
From one hand, a machine learning model built with spark can't be served the way you serve in Azure ML or Amazon ML in a traditional manner.
Databricks claims to be able to deploy models using it's notebook but I haven't actually tried that yet.
On other hand, you can use a model in three ways :
Training on the fly inside an application then applying prediction. This can be done in a spark application or a notebook.
Train a model and save it if it implements an MLWriter then load in an application or a notebook and run it against your data.
Train a model with Spark and export it to PMML format using jpmml-spark. PMML allows for different statistical and data mining tools to speak the same language. In this way, a predictive solution can be easily moved among tools and applications without the need for custom coding. e.g from Spark ML to R.
Those are the three possible ways.
Of course, you can think of an architecture in which you have RESTful service behind which you can build using spark-jobserver per example to train and deploy but needs some development. It's not a out-of-the-box solution.
You might also use projects like Oryx 2 to create your full lambda architecture to train, deploy and serve a model.
Unfortunately, describing each of the mentioned above solution is quite broad and doesn't fit in the scope of SO.
One option is to use MLeap to serve a Spark PipelineModel online with no dependencies on Spark/SparkContext. Not having to use the SparkContext is important as it will drop scoring time for a single record from ~100ms to single-digit microseconds.
In order to use it, you have to:
Serialize your Spark Model with MLeap utilities
Load the model in MLeap (does not require a SparkContext or any Spark dependencies)
Create your input record in JSON (not a DataFrame)
Score your record with MLeap
MLeap is well integrated with all the Pipeline Stages available in Spark MLlib (with the exception of LDA at the time of this writing). However, things might get a bit more complicated if you are using custom Estimators/Transformers.
Take a look at the MLeap FAQ for more info about custom transformers/estimators, performances, and integration.
You are comparing two rather different things. Apache Spark is a computation engine, while mentioned by you Amazon and Microsoft solutions are offering services. These services might as well have Spark with MLlib behind the scene. They save you from the trouble building a web service yourself, but you pay extra.
Number of companies, like Domino Data Lab, Cloudera or IBM offer products that you can deploy on your own Spark cluster and easily build service around your models (with various degrees of flexibility).
Naturally you build a service yourself with various open source tools. Which specifically? It all depends on what you are after. How user should interact with the model? Should there be some sort of UI or jest a REST API? Do you need to change some parameters on the model or the model itself? Are the jobs more of a batch or real-time nature? You can naturally build all-in-one solution, but that's going to be a huge effort.
My personal recommendation would be to take advantage, if you can, of one of the available services from Amazon, Google, Microsoft or whatever. Need on-premises deployment? Check Domino Data Lab, their product is mature and allows easy working with models (from building till deployment). Cloudera is more focused on cluster computing (including Spark), but it will take a while before they have something mature.
[EDIT] I'd recommend to have a look at Apache PredictionIO, open source machine learning server - amazing project with lot's of potential.
I have been able to just get this to work. Caveats: Python 3.6 + using Spark ML API (not MLLIB, but sure it should work the same way)
Basically, follow this example provided on MSFT's AzureML github.
Word of warning: the code as-is will provision but there is an error in the example run() method at the end:
#Get each scored result
preds = [str(x['prediction']) for x in predictions]
result = ",".join(preds)
# you can return any data type as long as it is JSON-serializable
return result.tolist()
Should be:
#Get each scored result
preds = [str(x['prediction']) for x in predictions]
#result = ",".join(preds)
# you can return any data type as long as it is JSON-serializable
output = dict()
output['predictions'] = preds
return json.dumps(output)
Also, completely agree with MLeap assessment answer, this can make the process run way faster but thought I would answer the question specifically

How to deploy Deep Learning Model(Python) to web app(Node)

Hi I am very interested in machine learning.
familiar with keras and tensorflow and know how to train model and save it.
I want to build node.js(loopback) web application that provides machine learning API.
I'm quite new to web application and don't know how to utilize python code in the node app.
To be specific, I'm trying to make deep style application like this site for practice purpose.
For codes that is open to public in github. most starred implementation is in torch and
I am currently looking (TF).
It has script work as "main.py" that consists of
1) loading pretrained VGG model(about 500Mb) to memory.
2) and using each framework's function, it create output photo inside folder
So I want to run this with GPU instance in AWS
problem is, I don't know how to call "main.py" inside javascript.
Q1 How did many machine learning based services achieved this??
did they implement their server in python like django to load trained model and inference from it?
But I want to integrate in node.js especially loopback. how can I achieve this in a most efficient and common way??
Q2 Is calling "main.py" efficient whenever there's request to server?
I mean, If the script is called everytime request is made, It loads 500Mb object to memory all the time and that's so inefficient.
and as far as I know AWS GPU instance can utilize 4 GPU at a time. I wonder how can I utilize all of it with most efficient and common way...
Thank you for reading!
I'm too frustrated with myself... I really want to know method deploy my model to server
and thank you in advance
You can communicate between NodeJS and e.g. Django with a REST interface. Deploy a Django server with the REST framework and call an url from NodeJS. This way, you can trigger the python code. You can also let the python code call your NodeJS with another REST interface on the javascript side if that is needed.

Apache Spark - Map Reduce Debugging Tips

I am trying to implement Kafka - Spark environment. I am trying to debug my MapReduce logic (implemented in Java). Spark submit step is making it complicated to debug with break points in my algorithms. Incoming live data patterns are complex. It will be very time consuming process to simulate the complex algorithms. Better development environment will help developers to validate their map reduce logic on live stream data.
Please suggest any tips and tricks. Is it possible to have IDE breakpoints, or remote debugging with Apache-Spark.
I don't think it matters if you are developing streaming or batch spark application. You can always use intellij idea for graphical debugging of your application.
Also look at this video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ. In the end if you want to debug reactivity of your application according to data patterns I suggest to use Spark's internal tools to see for example how DAG is being created or how it's being distributed.
One thing I myself am working on is to use spark debugging tools, and build a log according to my application execution graph with added information that I get from profilers (usual OS tools like iotop or iostat) to find for example where I am not utilizing resources enough.
In the end you need these information together to make a decision and ironically it itself can become a data intensive application.

Resources