As mentioned here, there is some overlap between UIMA and Spark in their distribution infrastructures. I was planning to use UIMA with Spark (I am now moving to uimaFIT). Can anyone tell me what problems we actually face when developing UIMA with Spark, and what pitfalls we might encounter?
(Sorry, I haven't done any research on this.)
The main problem is accessing objects, because UIMA tries to re-instantiate objects when running its analysis engines. If the objects have local references, then accessing them from a remote Spark cluster will be a problem. Some RDD functions might not work within the UIMA context. However, if you don't use a separate remote cluster, there won't be a problem. (I am talking about uimaFIT 2.2.)
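To illustrate the point about local references, here is a minimal sketch (my assumption, not code from the question) of the pattern that usually sidesteps the serialization problem: build the uimaFIT engine inside each partition, on the executor, rather than on the driver. MyAnnotator and the input path are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.uima.fit.factory.{AnalysisEngineFactory, JCasFactory}

// MyAnnotator stands in for your own uimaFIT annotator class (hypothetical).
val spark = SparkSession.builder().appName("uima-on-spark").getOrCreate()
val texts = spark.sparkContext.textFile("hdfs:///path/to/texts")

val results = texts.mapPartitions { docs =>
  // Create the engine and CAS on the executor, inside the partition, so no
  // non-serializable UIMA objects are captured in the closure on the driver.
  val engine = AnalysisEngineFactory.createEngine(classOf[MyAnnotator])
  val jcas = JCasFactory.createJCas()
  docs.map { text =>
    jcas.reset()
    jcas.setDocumentText(text)
    engine.process(jcas)
    jcas.getAnnotationIndex.size() // pull whatever you need out of the CAS here
  }
}

results.take(10).foreach(println)
```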
My team is trying to transition from Zeppelin to Jupyter for an application we've built, because Jupyter seems to have more momentum, more opportunities for customization, and to be generally more flexible. However, there are a couple of things in Zeppelin that we haven't been able to find equivalents for in Jupyter.
The main one is to have multi-lingual Spark support - is it possible in Jupyter to create a Spark data frame that's accessible via R, Scala, Python, and SQL, all within the same notebook? We've written a Scala Spark library to create data frames and hand them back to the user, and the user may want to use various languages to manipulate/interrogate the data frame once they get their hands on it.
Is Livy a solution to this in the Jupyter context, i.e. will it allow multiple connections (from the various language front-ends) to a common Spark back-end so they can manipulate the same data objects? I can't quite tell from Livy's web site whether a given connection only supports one language, or whether each session can have multiple connections to it.
If Livy isn't a good solution, can BeakerX fill this need? The BeakerX website says two of its main selling points are:
Polyglot magics and autotranslation, allowing you to access multiple languages in the same notebook, and seamlessly communicate between them;
Apache Spark integration including GUI configuration, status, progress, interrupt, and tables;
However, we haven't been able to use BeakerX to connect to anything other than a local Spark cluster, so we've been unable to verify how the polyglot implementation actually works. If we can get a connection to a YARN cluster (e.g. an EMR cluster in AWS), would the polyglot support give us access to the same session using different languages?
Finally, if neither of those work, would a custom Magic work? Maybe something that would proxy requests through to other kernels, e.g. spark and pyspark and sparkr kernels? The problem I see with this approach is that I think each of those back-end kernels would have their own Spark context, but is there a way around that I'm not thinking of?
(I know SO questions aren't supposed to ask for opinions or recommendations, so what I'm really asking for here is whether a possible path to success actually exists for the three alternatives above, not necessarily which of them I should choose.)
Another possibility is the SoS (Script of Scripts) polyglot notebook: https://vatlab.github.io/sos-docs/index.html#documentation.
It supports multiple Jupyter kernels in one notebook. SoS has several natively supported languages (R, Ruby, Python 2 & 3, MATLAB, SAS, etc.). Scala is not supported natively, but it's possible to pass information to the Scala kernel and capture output. There's also a seemingly straightforward way to add a new language (one that already has a Jupyter kernel); see https://vatlab.github.io/sos-docs/doc/documentation/Language_Module.html
I am using Livy in my application. The way it works is that any user can connect to an already established Spark session using REST (asynchronous calls). We have a cluster to which Livy sends Scala code for execution. It is up to you whether you want to close the session after sending the Scala code or not. If the session is open, then anyone with access can send Scala code once again to do further processing. I have not tried sending different languages in the same session created through Livy, but I know that Livy supports three languages in interactive mode, i.e. R, Python, and Scala. So, theoretically, you would be able to send code in any language for execution.
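To make the multi-language point concrete, here is a rough sketch (my assumption, not part of the original answer) of the REST calls against a Livy 0.5+ "shared" session, where each statement declares its own kind. The host, port, and session id 0 are placeholders; in practice you would parse the session id out of the creation response.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

val livy = "http://livy-host:8998" // placeholder endpoint
val client = HttpClient.newHttpClient()

def post(path: String, json: String): String = {
  val req = HttpRequest.newBuilder(URI.create(livy + path))
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(json))
    .build()
  client.send(req, HttpResponse.BodyHandlers.ofString()).body()
}

// 1. Create one shared interactive session; statements may then pick their own language.
post("/sessions", """{"kind": "shared"}""")

// 2. Register a DataFrame from Scala ...
post("/sessions/0/statements",
  """{"kind": "spark", "code": "val df = spark.range(10).toDF(\"n\"); df.createOrReplaceTempView(\"t\")"}""")

// 3. ... and read the same temp view from PySpark in the same session.
post("/sessions/0/statements",
  """{"kind": "pyspark", "code": "spark.table('t').count()"}""")
```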
Hope it helps to some extent.
I want to use some of the classifiers provided by MLlib (random forests, etc.), but I want to use them without connecting to a Spark cluster.
If I need to somehow run some Spark stuff in-process so that I have a Spark context to use, that's fine. But I haven't been able to find any information or an example for such a use case.
So my two questions are:
Is there a way to use the MLLib classifiers without a Spark context at all?
Otherwise, can I use them by starting a Spark context in-process, without needing any kind of actual Spark installation?
org.apache.spark.mllib models:
Cannot be trained without a Spark cluster.
Can usually be used for predictions without a cluster, with the exception of distributed models like ALS.
org.apache.spark.ml models:
Require a Spark cluster for training.
Require a Spark cluster for predictions, although this might change in the future (https://issues.apache.org/jira/browse/SPARK-10413).
There are a number of third-party tools designed to export Spark ML models into a form that can be used in a Spark-agnostic environment (jpmml-spark and modeldb, to name a few, without special preference).
Spark mllib models have limited PMML support as well.
Commercial vendors usually provide their own tools for productionizing Spark models.
You can of course use a local "cluster", but it is probably still a bit too heavy for most possible applications. Starting a full context takes at least a few seconds and has a significant memory footprint; see the sketch after the links below.
Also:
Best Practice to launch Spark Applications via Web Application?
How to serve a Spark MLlib model?
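As a minimal sketch of that in-process option (assuming only the Spark SQL and MLlib jars are on the classpath; the libsvm file is the sample data shipped with the Spark distribution and is used here as a placeholder), everything runs against a local[*] master without any separate Spark installation:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.sql.SparkSession

// In-process "cluster": the driver and executors live in this JVM.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("mllib-in-process")
  .getOrCreate()

// Placeholder training data; substitute your own DataFrame with label/features columns.
val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = rf.fit(training)
model.write.overwrite().save("/tmp/rf-model") // reload later with RandomForestClassificationModel.load

spark.stop()
```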
Basically what I need to do is to integrate the CTBNCToolkit with Apache Spark, so this toolkit can take advantage of the concurrency and clustering features of Apache Spark.
In general, I want to know whether there is any way, exposed by the Apache Spark developers, to integrate an arbitrary Java/Scala library such that the machine learning library can run on top of Spark's concurrency management.
So the goal is to make standalone machine learning libraries faster and concurrent.
No, that's not possible.
So what you want is for an arbitrary algorithm to run on Spark. But to parallelize the work, Spark uses RDDs or Datasets, so in order to run your tasks in parallel, the algorithm would have to use these classes.
The only thing you could try is to write your own Spark program that makes use of the other library. But I'm not sure whether that's possible in your case. However, isn't Spark ML enough for you?
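As a rough illustration of "write your own Spark program that makes use of another library" (a sketch under the assumption that the work can be split into independent chunks; learnOnSubset is a made-up stand-in, not the CTBNCToolkit API):

```scala
import org.apache.spark.sql.SparkSession

// Stand-in for a plain, single-machine library call (hypothetical; the real API will differ).
def learnOnSubset(rows: Seq[String]): String = s"model over ${rows.size} rows"

val spark = SparkSession.builder().appName("wrap-plain-library").getOrCreate()
val data = spark.sparkContext.textFile("hdfs:///trajectories")

// Spark parallelizes only what is expressed through RDD operations: each partition is handed
// to the library call independently, while the library's own internals stay single-threaded.
val partialResults = data
  .mapPartitions(rows => Iterator(learnOnSubset(rows.toSeq)))
  .collect()

partialResults.foreach(println)
```

Whether such per-partition results can be merged into one model is entirely up to the algorithm, which is exactly why an arbitrary library doesn't get faster just by being launched from Spark.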
How can one train (fit) a model on a distributed big data platform (e.g. Apache Spark), yet use that model on a standalone machine (e.g. a single JVM) with as few dependencies as possible?
I have heard of PMML, but I am not sure whether it is enough. Also, Spark 2.0 supports persistent model saving, but I am not sure what is necessary to load and run those models.
Apache Spark persistence is about saving and loading Spark ML pipelines in JSON data format (think of it as Python's pickle mechanism, or R's RDS mechanism). These JSON data structures map to Spark ML classes. They don't make sense on other platforms.
As for PMML, you can convert Spark ML pipelines to PMML documents using the JPMML-SparkML library. You can execute PMML documents (it doesn't matter whether they came from Apache Spark, Python, or R) using the JPMML-Evaluator library. If you're using Apache Maven to manage and build your project, then JPMML-Evaluator can be included by adding just one dependency declaration to your project's POM.
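A rough sketch of the export step, assuming the JPMML-SparkML 1.x PMMLBuilder API and hypothetical paths (check the library's README for the release matching your Spark version):

```scala
import java.io.File
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession
import org.jpmml.sparkml.PMMLBuilder

val spark = SparkSession.builder().master("local[*]").appName("pmml-export").getOrCreate()

// Placeholder locations: the schema of the training data and a previously fitted pipeline.
val training = spark.read.parquet("/tmp/training.parquet")
val pipelineModel = PipelineModel.load("/tmp/spark-pipeline-model")

// Serialize the fitted Spark ML pipeline as a standalone PMML document.
new PMMLBuilder(training.schema, pipelineModel).buildFile(new File("/tmp/model.pmml"))
```

The resulting model.pmml can then be scored on any JVM with JPMML-Evaluator, without a SparkSession.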
I am working on a project of Twitter Data Analysis using Apache Spark with Java and Cassandra for NoSQL databases.
In the project I am working on, I want to maintain an ArrayList of LinkedLists (using Java's built-in ArrayList and LinkedList) that is common to all mapper nodes. I mean, if one mapper writes some data into the ArrayList, it should be reflected on all other mapper nodes.
I am aware of the broadcast shared variable, but that is a read-only shared variable; what I want is a shared writable data structure where changes made by one mapper are reflected in all the others.
Any advice on how to achieve this in Apache Spark with Java would be of great help.
Thanks in advance
The short, and most likely disappointing, answer is that it is not possible given Spark's architecture. Worker nodes don't communicate with each other, and neither broadcast variables nor accumulators (write-only variables) are really shared variables. You can try different workarounds, like using external services or a shared file system to communicate, but this introduces all kinds of issues such as idempotency or synchronization.
As far as I can tell, the best you can get is updating state between batches, or using tools like StreamingContext.remember.
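For what it's worth, here is a minimal sketch (my addition, not part of the original answer) of the closest built-in facility, a CollectionAccumulator: executors can only add to it, and only the driver sees the merged result after an action completes, which is exactly why it doesn't satisfy the "shared writable" requirement.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("accumulator-demo").getOrCreate()
val sc = spark.sparkContext

// Tasks may add to the accumulator, but they cannot observe what other tasks have added;
// only the driver sees the merged contents once the action has finished.
val seen = sc.collectionAccumulator[String]("seenTweets")

sc.parallelize(Seq("tweet-1", "tweet-2", "tweet-3")).foreach { t =>
  seen.add(t) // write-only from the task's point of view
}

println(seen.value) // driver-side, merged view
```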