Integration of Arbitrary Java Machine Learning with Apache Spark

Basically, what I need to do is integrate the CTBNCToolkit with Apache Spark, so that this toolkit can take advantage of Spark's concurrency and clustering features.
More generally, I want to know whether the Apache Spark developers expose any way to integrate an arbitrary Java/Scala library so that the machine learning library runs on top of Spark's concurrency management.
The goal is to make standalone machine learning libraries faster and concurrent.

No, that's not possible.
What you want is for an arbitrary algorithm to run on Spark. But to parallelize the work, Spark uses RDDs or Datasets, so in order to run your tasks in parallel, the algorithm would have to use these classes.
The only thing you could try is to write your own Spark program that makes use of the other library, as sketched below, but I'm not sure whether that's possible in your case. However, isn't Spark ML enough for you?
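A minimal sketch of that last idea, assuming the per-record work is independent: the third-party library is driven from your own Spark program via mapPartitions, so Spark distributes the records while the library itself stays single-threaded inside each task. LegacyClassifier is a hypothetical stand-in for something like CTBNCToolkit; only the per-record scoring is parallelized here, not the library's internal training.

import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for a single-threaded third-party model; replace with the real library.
class LegacyClassifier extends Serializable {
  def classify(line: String): String = if (line.length % 2 == 0) "A" else "B"
}

object WrapLegacyLibrary {
  def main(args: Array[String]): Unit = {
    // The master is supplied by spark-submit (e.g. a YARN or standalone cluster).
    val spark = SparkSession.builder().appName("wrap-legacy-library").getOrCreate()

    val records = spark.sparkContext.textFile("hdfs:///data/records.csv")

    // Each partition becomes one task; the library is instantiated once per partition
    // and applied record by record.
    val predictions = records.mapPartitions { lines =>
      val model = new LegacyClassifier()
      lines.map(model.classify)
    }

    predictions.saveAsTextFile("hdfs:///data/predictions")
    spark.stop()
  }
}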

Related

Alternative to Apache Livy for Dask distributed

Dask is a pure Python based distributed computing platform, similar to Apache Spark.
Is there a way to run and monitor Dask distributed jobs/tasks through a REST API, like Apache Livy for Apache Spark?
Not quite what you asked, but take a look at Prefect, which has a strong integration with Dask (for task execution).

Spark Machine Learning running on a single machine: Is it distributed or not?

Recently I've been learning scalable machine learning, and Spark MLlib is the first tool I learned to use. I have already managed to implement some simple machine learning tasks, such as linear regression, with Spark MLlib, and they all run smoothly on my laptop.
However, the program is not deployed on a cluster; it's running on a single node. Is it still distributed in this kind of scenario? If it is distributed, does Spark automatically run tasks with multiple threads?
Can anybody tell me why Spark MLlib makes scalable machine learning easier to implement?
Well, it depends on what your definition of "distributed" is.
Spark MLlib is a framework that allows (but does not guarantee) you to write code that is capable of being distributed. It handles a lot of the distribution and synchronisation issues that come with distributed computing. So yes, it makes it much simpler for programmers to code and deploy distributed algorithms.
The reason Spark makes scalable ML easier is that you can focus on the algorithm rather than getting bogged down by data races and by how to distribute code to different nodes while taking data locality into account, etc. All of that is typically handled by the SparkContext / RDD classes, as the sketch below illustrates.
That being said, coding for Spark does not guarantee that it will be distributed optimally. There are still things to consider, like data partitioning and the level of parallelism, among many others.
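To make the single-machine case concrete, here is a minimal sketch (assuming Spark 2.x or later). With the master set to local[*], the driver and the executor share one JVM, so nothing runs on other machines, but tasks are still executed in parallel across the local CPU cores, one thread per task slot:

import org.apache.spark.sql.SparkSession

object LocalParallelism {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")              // use all available cores as task slots
      .appName("local-parallelism")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 1000000)
    println(s"partitions: ${rdd.getNumPartitions}")   // typically equals the number of cores

    // Each partition is summed by a separate task thread before the results are combined.
    println(s"sum: ${rdd.map(_.toLong).reduce(_ + _)}")
    spark.stop()
  }
}

The same program runs unchanged on a real cluster once master() points at a cluster manager, which is the main reason code written against MLlib on a laptop can later scale out.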

Can MLLib classifiers be trained and used without a Spark installation?

I want to use some of the classifiers provided by MLlib (random forests, etc.), but I want to use them without connecting to a Spark cluster.
If I need to somehow run some Spark stuff in-process so that I have a Spark context to use, that's fine. But I haven't been able to find any information or an example for such a use case.
So my two questions are:
Is there a way to use the MLLib classifiers without a Spark context at all?
Otherwise, can I use them by starting a Spark context in-process, without needing any kind of actual Spark installation?
org.apache.spark.mllib models:
Cannot be trained without a Spark cluster.
Can usually be used for predictions without a cluster, with the exception of distributed models like ALS.
org.apache.spark.ml models:
Require a Spark cluster for training.
Require a Spark cluster for predictions, although that might change in the future (https://issues.apache.org/jira/browse/SPARK-10413).
There are a number of third-party tools designed to export Spark ML models to a form that can be used in a Spark-agnostic environment (jpmml-spark and modeldb, to name a few, without any particular preference).
Spark MLlib models have limited PMML support as well.
Commercial vendors usually provide their own tools for productionizing Spark models.
You can of course use a local "cluster", as sketched below, but it is probably still a bit too heavy for most possible applications: starting a full context takes at least a few seconds and has a significant memory footprint.
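To illustrate the second option from the question, here is a minimal sketch of such an in-process local "cluster": no separate Spark installation, just the Spark jars on the classpath. An org.apache.spark.mllib random forest is trained through the local context and then scored on single local vectors; the toy data and parameters are illustrative only.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

object InProcessTraining {
  def main(args: Array[String]): Unit = {
    // Local, in-process "cluster": driver and executor threads share this JVM.
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("in-process"))

    // Toy training data; in practice this would be loaded from disk.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 0.0))
    ))

    val model = RandomForest.trainClassifier(
      training, numClasses = 2, categoricalFeaturesInfo = Map.empty[Int, Int],
      numTrees = 10, featureSubsetStrategy = "auto", impurity = "gini",
      maxDepth = 4, maxBins = 32)

    // Prediction on a single local vector does not touch the context at all.
    println(model.predict(Vectors.dense(0.5, 0.5)))
    sc.stop()
  }
}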
Also:
Best Practice to launch Spark Applications via Web Application?
How to serve a Spark MLlib model?

Drawbacks of using embedded Spark in Application

I have a use case where I launch a local (embedded) Spark inside an application server rather than going with a Spark REST job server or kernel, because the former (embedded Spark) has much lower latency than the alternatives. I am interested in:
Drawbacks of this approach, if there are any.
Whether the same can be used in production.
P.S. Low latency is the priority here.
EDIT: The size of the data being processed will be less than 100 MB in most cases.
I don't think it is a drawback at all. If you look at the implementation of the Hive Thriftserver within the Spark project itself, it also manages the SQLContext etc. in the Hive server process. This is especially viable when the amount of data is small and the driver can handle it easily, so I would also take it as a hint that this is okay for production use; a minimal sketch of the embedded approach follows below.
But I totally agree that the documentation, and advice in general on how to integrate Spark into interactive, customer-facing applications, lags behind the information available for big-data pipelines.
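As an illustration of that embedded approach, here is a minimal sketch: one long-lived local SparkSession is created at application-server startup and shared by all request handlers, so each request pays only job-scheduling latency rather than context start-up time. The handleRequest method and its wiring into a web framework are hypothetical.

import org.apache.spark.sql.SparkSession

object EmbeddedSpark {
  // Created once per JVM; SparkSession is safe to share across request-handling threads.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("embedded-spark")
    .config("spark.ui.enabled", "false")   // avoid UI port clashes inside an app server
    .getOrCreate()

  // Hypothetical request handler: with data under ~100 MB, a local job like this returns quickly.
  def handleRequest(path: String): Long =
    spark.read.parquet(path).count()
}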

What are the differences between Apache Spark and Apache Apex?

Apache Apex is an open source, enterprise-grade, unified stream and batch processing platform. It is used in the GE Predix platform for IoT.
What are the key differences between these 2 platforms?
Questions
From a data science perspective, how is it different from Spark?
Does Apache Apex provide functionality like Spark MLlib? If we have to build scalable ML models on Apache Apex, how do we do it, and which language should we use?
Will data scientists have to learn Java to build scalable ML models? Does it have a Python API like PySpark?
Can Apache Apex be integrated with Spark, and can we use Spark MLlib on top of Apex to build ML models?
Apache Apex is an engine for processing streaming data. Others that try to achieve the same include Apache Storm and Apache Flink. The differentiating factor for Apache Apex is that it comes with built-in support for fault tolerance and scalability, and a focus on operability, which are key considerations in production use cases.
Comparing it with Spark: Apache Spark is really a batch processing engine. If you consider Spark Streaming (which uses Spark underneath), then it is micro-batch processing, as the sketch below illustrates. In contrast, Apache Apex is true stream processing, in the sense that an incoming record does not have to wait for the next record before being processed; a record is processed and sent to the next level of processing as soon as it arrives.
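To make the micro-batch point concrete, here is a minimal sketch using classic Spark Streaming: the records received within each batch interval (one second here) are grouped into an RDD and processed as an ordinary Spark job, rather than record by record as in Apex. The socket source, host and port are illustrative only.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchExample {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, at least one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("micro-batch")
    val ssc = new StreamingContext(conf, Seconds(1))   // 1-second micro-batches

    // Every batch interval, the lines received in that second form one RDD.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}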
Currently, work is in progress to add support for integrating Apache Apex with machine learning libraries like Apache SAMOA and H2O.
Refer to https://issues.apache.org/jira/browse/SAMOA-49
Currently, it has support for Java and Scala.
https://www.datatorrent.com/blog/blog-writing-apache-apex-application-in-scala/
For Python, you may try it using Jython, but I haven't tried it myself, so I'm not very sure about it.
Integration with Spark may not be a good idea considering they are two different processing engines, but Apache Apex integration with machine learning libraries is in progress.
If you have any other questions or requests for features, you can post them to the mailing list for Apache Apex users: https://mail-archives.apache.org/mod_mbox/incubator-apex-users/
