Does John Snow Labs’ NLP library, built on top of Apache Spark, support Java?

John Snow Labs’ NLP library is built on top of Apache Spark and the Spark ML library.
All its examples are provided in Scala and Python. Does it support Java? If yes, where can I find the related guides? If not, is there any plan to support Java?

In general, Scala libraries only need a dedicated Java API if their API (not the implementation) exposes functionality with no Java equivalent. Unfortunately, standard Scala function types are an example, at least until Scala 2.12 and Java 8. E.g. Spark makes a lot of use of ClassTags and implicits, which makes it hard to use directly from Java.
But this library is based on Spark ML, which doesn't have a separate Java API, and from a quick look, doesn't seem to need one (at least for the new DataFrame-based API). You can see its examples in Java at https://spark.apache.org/docs/2.3.0/ml-pipeline.html.
So the NLP library just creates instances of Transformer, Pipeline and other Spark ML types, and the code for creating them is trivially translatable to Java. You just need to know that Array(...) corresponds to new T[] { ... } (where T is the type of arguments). From this it doesn't seem to need a Java API, even if it could benefit from giving examples in Java. Unfortunately, it doesn't appear to provide even a Scaladoc link so I could see whether there is something in the API which is problematic to use from Java.

Related

Distributed Rules Engine

We have been using Drools engine for a few years now, but our data has grown, and we need to find a new distributed solution that can handle a large amount of data.
We have complex rules that look over a few days of data, and that is why Drools was a great fit for us: we just kept our data in memory.
Do you have any suggestions for something similar to Drools but distributed/scalable?
I researched the matter, and I couldn't find anything that meets our requirements.
Thanks.
Spark can apply Drools rules to the data faster than a traditional single-node application. The reference architecture for the Drools - Spark integration could be along the following lines. In addition, HACEP is a scalable and highly available architecture for Drools Complex Event Processing; it combines Infinispan, Camel, and ActiveMQ. Please refer to the following article for more on HACEP using Drools.
You can find a reference implementation of Drools - Spark integration in the following GitHub repository.
In my experience, Drools can be applied efficiently even to very large volumes of data (some tuning may be needed depending on your requirements), and it integrates easily with Apache Spark. Loading your rule file into memory for Spark processing takes very little memory, and Drools can be used with Spark Streaming as well as with Spark batch jobs.
See my complete article for reference and try it.
An alternative might be Jess.
Jess implements the Rete engine and accepts rules in multiple formats, including CLIPS and XML.
Jess uses an enhanced version of the Rete algorithm to process rules. Rete is a very efficient mechanism for solving the difficult many-to-many matching problem.
Jess has many unique features including backwards chaining and working memory queries, and of course Jess can directly manipulate and reason about Java objects. Jess is also a powerful Java scripting environment, from which you can create Java objects, call Java methods, and implement Java interfaces without compiling any Java code.
Try it yourself.
Maybe this could be helpful to you. It is a new project developed as part of the Drools ecosystem. https://github.com/kiegroup/openshift-drools-hacep
It seems that Databricks is also working on a rules engine, so if you are using the Databricks version of Spark, it is something to look into.
https://github.com/databrickslabs/dataframe-rules-engine
Take a look at https://www.elastic.co/blog/percolator
What you can do is convert each rule to an Elasticsearch query. You can then percolate your data against the percolator, which returns the rules that match the provided data.
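The percolation idea above inverts the usual search: the rules are stored as queries, and each incoming record is matched against all of them. Here is a minimal in-memory sketch of that idea. It is not Elasticsearch's API, and the rule names, fields, and thresholds are all made up for illustration.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Rule = Callable[[Record], bool]

# Hypothetical rules, each stored as a named predicate ("query") over a record.
RULES: Dict[str, Rule] = {
    "large_order": lambda r: r.get("amount", 0) > 1000,
    "foreign_card": lambda r: r.get("country") != "US",
}


def percolate(record: Record, rules: Dict[str, Rule]) -> List[str]:
    """Return the sorted names of all stored rules that match the record."""
    return sorted(name for name, rule in rules.items() if rule(record))


matches = percolate({"amount": 2500, "country": "US"}, RULES)
print(matches)
```

In a real deployment the stored rules would be Elasticsearch queries rather than Python lambdas, and the matching would be done by the cluster.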

Driver code in one language and executors in different languages

How can I define the logic of my executors in a different programming language from the one I use for the driver? Is this possible at all?
E.g.: I would write the driver in Scala, then call different functions written in Java or Python for the distributed processing of the dataset.
You could, but only under certain circumstances, and with some work.
It should be possible to use the code generation feature of SparkSQL/DataSet to implement methods in other languages and call them through JNI or other interfaces.
Furthermore, the generated code is Java code, so technically you are already running Java code, independently of which language you use to program the Spark program.
As far as I know, it's also possible to use Python UDFs inside a Spark program written in Java or Scala.
With the RDD API it should also be possible to call libraries in other programming languages - with Scala-Java mixes being trivial to implement, and non-JVM languages needing the appropriate bridging logic.
There's - at least in current versions of Spark - a performance penalty to pay, for getting data out of the JVM and back into it, so I would use this sparingly, and only when you have weighed the performance pros and cons carefully.
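The bridging logic and its cost can be illustrated without Spark: records are serialized to text, pushed through a separate process, and parsed back. The sketch below uses a second Python interpreter as a stand-in for the "other language" worker. The helper name `pipe_through` and the inline worker script are hypothetical, and real Spark bridging is considerably more involved.

```python
import subprocess
import sys
from typing import Iterable, List

# Hypothetical worker: reads one integer per line from stdin and writes its square.
WORKER_SOURCE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(int(line) ** 2)\n"
)


def pipe_through(records: Iterable[int], command: List[str]) -> List[int]:
    """Serialize records to text, run them through an external process, parse the output."""
    payload = "".join(f"{r}\n" for r in records)
    result = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return [int(line) for line in result.stdout.splitlines()]


squares = pipe_through([1, 2, 3], [sys.executable, "-c", WORKER_SOURCE])
print(squares)
```

Every record crosses the process boundary twice, once serialized on the way out and once parsed on the way back, which is the kind of overhead the answer warns about.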

Mahout recommender, Flink, Spark MLLib, 'gray box'

I'm new to Mahout-Samsara and I'm trying to understand the "domain" of the different projects and how they relate to each other.
I understand that Apache Mahout-Samsara deprecates many MapReduce algorithms, and that things will be based on Apache Flink or Spark or other engines like H2O (based on the introduction of the "Apache Mahout: Beyond MapReduce" book).
I want to try some recommender algorithms but I'm not so sure about what's new and what's 'deprecated'. I see the following links,
Mahout Recommender overview
Mahout Cooccurrence intro
referring to spark-rowsimilarity and spark-itemsimilarity. (I don't understand whether these links are talking about an off-the-shelf algorithm or a design. It's probably a design, because they are not listed at mahout.apache.org/users/basics/algorithms.html, but anyway.)
And at the same time, Apache Flink (or is it Spark MLLib?) implements the ALS algorithm for recommendation (Machine Learning for Flink and Spark MLlib).
General questions:
Is it that these algorithms from mahout.apache.org are deprecated and they are being migrated to Flink / Spark MLLib, so that the ML library and support at Flink / Spark MLLib will grow?
Is Flink / Spark MLLib intended to be more an engine or engine + algorithm library with good support for the algorithms?
Other links to help the conversation:
Flink Vision and
Roadmap
Mahout Algorithms
Specific question:
I want to try a recommender algorithm as a 'gray box' (part 'black box' because I don't want to get too deep into the math, part 'white box' because I want to tweak the model and the math to the extent that I need to improve results).
I'm not interested in other ML algorithms yet. I thought about starting with what's off-the-shelf and then changing the ALS implementation of MLLib. Would that be a good approach? Any other suggestions?
I've been working on ML on Flink for a while now, I'm doing my fair share of scouting, and I'm monitoring what is going on in this ecosystem. What you're asking implies a rational coordination between projects that simply doesn't exist. Algorithms get reimplemented over and over, and from what I see, it's easier to do that than to integrate with different frameworks. Samsara is actually one of the most portable solutions, but it's good for just a few applications.
Is it that these algorithms from mahout.apache.org are deprecated and they are being migrated to Flink / Spark MLLib, so that the ML library and support at Flink / Spark MLLib will grow?
This, as I said, would require a coordination between projects that does not exist.
Is Flink / Spark MLLib intended to be more an engine or engine + algorithm library with good support for the algorithms?
In an ideal ecosystem they would be the former (just engines), but they will keep building their own ML libraries for commercial purposes: computing engines with ML libraries out of the box sell really well. Actually, I'm working full time on Flink ML not because I believe it's necessarily the best way to do ML on Flink, but because, right now, it's something Flink requires in order to be sold in many environments.
@pferrel suggested PredictionIO, which is excellent software, but there are many alternatives under development: for example, Beam is designing a Machine Learning API to generalize over different runners' implementations (Flink, Spark, H2O, and so on). Other alternatives are data analysis platforms like Knime, RapidMiner, and others, which can build pipelines over Spark or other Big Data tools.
Spark-itemsimilarity and spark-rowsimilarity are command-line accessible drivers. They are based on classes in Mahout-Samsara. The description of these is for running code supported since v0.10.0.
The link https://mahout.apache.org/users/basics/algorithms.html shows which algos are supported on which "compute-engine". Anything in the "Mapreduce" column is in line for deprecation.
That said, Mahout-Samsara is less a collection of algorithms than pre-0.10.0 Mahout was. It now has an R-like DSL, which includes generalized tensor math, from which most of the Mahout-Samsara algos have been built. So think of Mahout as a "roll-your-own math and algorithm" tool. But every product is scalable on your choice of compute engine. The engines themselves are also available natively, so you don't have to use only the abstracted DSL.
Regarding how Mahout-Samsara relates to MLlib or any algo lib, there will be overlap and either can be used in your code interchangeably.
Regarding recommenders, the new SimilarityAnalysis.cooccurrence implements a major innovation called cross-occurrence, which allows a recommender to ingest almost anything known about a user or the user's context and even accounts for item-content similarity. The Mahout-Samsara part is the engine for Correlated Cross-Occurrence. See some slides here describing the algorithm: http://www.slideshare.net/pferrel/unified-recommender-39986309
There is a full, end-to-end implementation of this using the PredictionIO framework (PIO itself is now a proposed Apache incubator project) that is mature and can be installed using these instructions: https://github.com/actionml/cluster-setup/blob/master/install.md
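For anyone approaching this as a "gray box", it may help to see the basic quantity that cooccurrence-based recommenders start from: how often two items appear together in users' histories. The following is a toy sketch of plain cooccurrence counting only. It is not Mahout's SimilarityAnalysis code, it leaves out the cross-occurrence part and any statistical weighting or filtering a real implementation would apply, and the interaction data is invented.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

# Hypothetical interaction data: user -> set of items the user interacted with.
HISTORY: Dict[str, Set[str]] = {
    "alice": {"book", "lamp"},
    "bob": {"book", "lamp", "desk"},
    "carol": {"book", "desk"},
}


def cooccurrence(histories: Iterable[Set[str]]) -> Dict[Tuple[str, str], int]:
    """Count, for every unordered item pair, how many users interacted with both."""
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for items in histories:
        # Sorting makes each pair appear in one canonical order.
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return dict(counts)


counts = cooccurrence(HISTORY.values())
print(counts)
```

Items that cooccur often with what a user has already interacted with become candidate recommendations; tweaking how these raw counts are weighted is one place where the "white box" part of the work would happen.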

API compatibility between scala and python?

I have read a dozen pages of docs, and it seems that:
I can skip learning the scala part
the API is completely implemented in python (I don't need to learn scala for anything)
the interactive mode works as completely and as quickly as the scala shell and troubleshooting is equally easy
python modules like numpy will still be imported (no crippled python environment)
Are there areas where it falls short that would make this impossible?
In recent Spark releases (1.0+), we've implemented all of the missing PySpark features listed below. A few new features are still missing, such as Python bindings for GraphX, but the other APIs have achieved near parity (including an experimental Python API for Spark Streaming).
My earlier answers are reproduced below:
Original answer as of Spark 0.9
A lot has changed in the seven months since my original answer (reproduced at the bottom of this answer):
Spark 0.7.3 fixed the "forking JVMs with large heaps" issue.
Spark 0.8.1 added support for persist(), sample(), and sort().
The upcoming Spark 0.9 release adds partial support for custom Python -> Java serializers.
Spark 0.9 also adds Python bindings for MLLib (docs).
I've implemented tools to help keep the Java API up-to-date.
As of Spark 0.9, the main missing features in PySpark are:
zip() / zipPartitions.
Support for reading and writing non-text input formats, like Hadoop SequenceFile (there's an open pull request for this).
Support for running on YARN clusters.
Cygwin support (PySpark works fine under Windows PowerShell or cmd.exe, though).
Support for job cancellation.
Although we've made many performance improvements, there's still a performance gap between Spark's Scala and Python APIs. The Spark users mailing list has an open thread discussing its current performance.
If you discover any missing features in PySpark, please open a new ticket on our JIRA issue tracker.
Original answer as of Spark 0.7.2:
The Spark Python Programming Guide has a list of missing PySpark features. As of Spark 0.7.2, PySpark is currently missing support for sample(), sort(), and persistence at different StorageLevels. It's also missing a few convenience methods added to the Scala API.
The Java API was in sync with the Scala API when it was released, but a number of new RDD methods have been added since then and not all of them have been added to the Java wrapper classes. There's a discussion about how to keep the Java API up-to-date at https://groups.google.com/d/msg/spark-developers/TMGvtxYN9Mo/UeFpD17VeAIJ. In that thread, I suggested a technique for automatically finding missing features, so it's just a matter of someone taking the time to add them and submit a pull request.
Regarding performance, PySpark is going to be slower than Scala Spark. Part of the performance difference stems from a weird JVM issue when forking processes with large heaps, but there's an open pull request that should fix that. The other bottleneck comes from serialization: right now, PySpark doesn't require users to explicitly register serializers for their objects (we currently use binary cPickle plus some batching optimizations). In the past, I've looked into adding support for user-customizable serializers that would allow you to specify the types of your objects and thereby use specialized serializers that are faster; I hope to resume work on this at some point.
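To make the "pickle plus batching" idea concrete, here is a minimal sketch of serializing objects in groups, so that the per-call overhead is shared by several objects. It uses Python 3's `pickle` module rather than `cPickle`, it is not PySpark's actual serializer code, and the function names are hypothetical.

```python
import pickle
from typing import Iterable, Iterator, List


def dump_batches(items: Iterable[object], batch_size: int) -> Iterator[bytes]:
    """Pickle items in groups so each serialized blob carries several objects."""
    batch: List[object] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield pickle.dumps(batch, protocol=pickle.HIGHEST_PROTOCOL)
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield pickle.dumps(batch, protocol=pickle.HIGHEST_PROTOCOL)


def load_batches(blobs: Iterable[bytes]) -> Iterator[object]:
    """Unpickle each blob and yield the original items in order."""
    for blob in blobs:
        yield from pickle.loads(blob)


blobs = list(dump_batches(range(5), batch_size=2))
restored = list(load_batches(blobs))
print(len(blobs), restored)
```

A generic pickler like this works for any picklable object without registration, which is the convenience the answer describes; a type-specific serializer could be faster at the price of that generality.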
PySpark is implemented using a regular CPython interpreter, so libraries like numpy should work fine (this wouldn't be the case if PySpark were written in Jython).
It's pretty easy to get started with PySpark; simply downloading a pre-built Spark package and running the pyspark interpreter should be enough to test it out on your personal computer and will let you evaluate its interactive features. If you like to use IPython, you can use IPYTHON=1 ./pyspark in your shell to launch PySpark with an IPython shell.
I'd like to add some points about why many people who have used both APIs recommend the Scala API. It's very difficult for me to do this without pointing out just general weaknesses in Python vs Scala and my own distaste of dynamically typed and interpreted languages for writing production quality code. So here are some reasons specific to the use case:
Performance will never be quite as good as with Scala, not by orders of magnitude but by fractions, partly because Python is interpreted. This gap may widen in the future as Java 8 and JIT technology become part of the JVM and Scala.
Spark is written in Scala, so debugging Spark applications, learning how Spark works, and learning how to use Spark are much easier in Scala, because you can easily CTRL + B into the source code and read the lower levels of Spark to suss out what is going on. I find this particularly useful for optimizing jobs and debugging more complicated applications.
Now my final point may seem like just a Scala vs Python argument, but it's highly relevant to the specific use case, that is, scale and parallel processing. Scala actually stands for Scalable Language, and many interpret this to mean it was specifically designed with scaling and easy multithreading in mind. It's not just about lambdas; it's the head-to-toe features of Scala that make it the perfect language for doing Big Data and parallel processing. I have some Data Science friends who are used to Python and don't want to learn a new language, so they stick to their hammer. Python is a scripting language and was not designed for this specific use case; it's an awesome tool, but the wrong one for this job. The result is obvious in the code: their code is often 2 to 5x longer than my Scala code, as Python lacks a lot of features. Furthermore, they find it harder to optimize their code, as they are further away from the underlying framework.
Let me put it this way, if someone knows both Scala and Python, then they will nearly always choose to use the Scala API. The only people IME that use Python are those that simply do not want to learn Scala.

JVM languages for J2ME platform

I'm currently writing an embedded application for a J2ME environment (CLDC 1.1 configuration and IMP-NG profile). Being spoiled by all those new features in JVM-based languages (Groovy, Scala, Clojure, you name it), I was considering using one of them for my code.
However, most of the mentioned languages require a pretty decent JVM environment. Most so-called "dynamic" languages require the VM to have reflection. Many ask for annotations support. None of the above features are available under J2ME.
From what I found, Xtend looks like a viable option, as its compiler spits out plain Java, not bytecode, and doesn't require any library for runtime needs. Of course, the generated Java code also must meet some requirements, but the Xtend webpage looks promising in this regard:
Xtend just does classes and nothing else
Interface definitions in Java are already nice and concise. They have a decent default visibility and also in other areas there is very little to improve. Given all the knowledge and the great tools being able to handle these files there is no reason to define them in a different way. The same applies for enums and annotation types.
That's why Xtend can do classes only and relies on interfaces, annotations and enums being defined in Java. Xtend is really not meant to replace Java but to modernize it.
Am I right that it is possible to compile Xtend-generated code for the J2ME platform, or are there some constructs that will not work there?
Alternatively, can you recommend any other "rich" Java modification language that can be run on J2ME?
Update: Knowing that a "compiler" that produces another source code as its result is called a transcompiler, one can also find Mirah, a tool which requires no runtime library or specific Java features.
Xtend's generated code uses Google Guava heavily. If that is compatible with J2ME, Xtend could be the language of your choice. I'm not aware of anything that prevents it from being used on other platforms that provide a dedicated development kit (e.g. Android).
In addition to being able to generate Java source, Mirah recently added support for javac's --bootclasspath option, which allows you to generate your bytecode against a non-standard version of the java core classes, e.g. LeJOS.
It's still a little fresh, but it'd be nice to have more people using it on different javas.
