Apache Spark Java APIs limitations - apache-spark

Can someone provide some sample Java APIs that are yet to be implemented in Apache Spark? I am trying to see if there are any Scala Spark APIs that do not exist, or have limited functionality, in Java, if I decide to use the Java APIs instead.
That would be a deal-breaker for me.
Disclaimer:
Based on my googling/analysis I realize that Scala community support for Apache Spark is really good. Also I understand that in order to work efficiently with Spark you need to learn some Scala anyway (as the source code is in Scala).

Optimistic point of view:
Consider that:
The standard Scala backend is a Java VM. Scala classes are Java classes, and vice versa. You can call the methods of either language from methods in the other one. You can extend Java classes in Scala, and vice versa. The main limitation is that some Scala features do not have equivalents in Java, for example traits.
Conclusion: there is no missing API.
Pessimistic point of view:
Spark is written in Scala, has a Scala-centric API, and is not particularly Java friendly. There are multiple packages (like GraphX) which have no Java-friendly API, and you end up writing verbose glue code once in a while.
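As an illustration of that glue code, here is a hedged sketch of the kind of boilerplate you need when calling a Scala-centric Spark API from Java: many Scala methods take an implicit ClassTag, which the Scala compiler supplies for free but Java callers must construct by hand. This assumes spark-core and scala-library are on the classpath.

```java
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

public class ClassTagGlue {
    // In Scala this is just an implicit; in Java you must build it explicitly
    // and pass it to every API method whose Scala signature requires one.
    public static <T> ClassTag<T> tag(Class<T> clazz) {
        return ClassTag$.MODULE$.apply(clazz);
    }
}
```

None of this is needed when you stay within the JavaRDD/JavaPairRDD wrappers, but packages without Java wrappers (such as GraphX at the time) force you down to this level.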

Related

What is the best way to expose Cassandra REST API to web?

I would like to work with Cassandra from javascript web app using REST API.
REST should support the basic commands for working with the DB: create table, select/add/update/remove items. It would be perfect to have something similar to the OData protocol.
P.S. I'm looking for some library or component. Java is most preferred.
The Staash solution looks like a perfect fit for the task: https://github.com/Netflix/staash
You can use the DataStax drivers. I used them via Scala, but you can use Java. A Session object is a long-lived object; it should not be used in a request/response, short-lived fashion, but that's up to you.
ref. rules when using datastax drivers
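A minimal sketch of that long-lived usage pattern, assuming the DataStax Java driver (cassandra-driver-core) is on the classpath; the contact point and keyspace name are placeholders:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraClient {
    // Cluster and Session are heavyweight, thread-safe objects:
    // build them once at startup and share them across all requests.
    private static final Cluster cluster =
            Cluster.builder().addContactPoint("127.0.0.1").build();
    private static final Session session = cluster.connect("my_keyspace");

    public static Session session() {
        return session;
    }

    public static void shutdown() {
        // Close once, at application shutdown - never per request.
        session.close();
        cluster.close();
    }
}
```

Each REST request would then borrow the shared Session rather than opening its own connection.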
There is no "best" language for REST APIs; it depends on what you're comfortable using. Virtually all languages will be able to do this reasonably well, depending on your skill level.
The obvious choice is probably Java, because Cassandra is written in Java, the Java driver from DataStax is well supported, and it's probably pretty easy to find some Spring REST frameworks to do what you want. Second beyond that would be Python: again, good driver support, and REST frameworks with things like Django or Flask + Potion. The Ruby driver isn't bad, and there are lots of Ruby REST APIs out there, too.

datastax driver vs spring-data-cassandra

Hey, I am new to Cassandra and I am familiar with Spring's JdbcTemplate.
Can anyone please explain the difference between the two? Also, can you suggest which one is good to use?
Thanks.
spring-data-cassandra uses datastax's java-driver, so the decision to be made is really whether or not you need the functionality of spring-data.
Some features from spring data that may be useful for you (documented here):
Spring XML configuration for configuring your Cluster instance (especially useful if you are already using Spring).
object mapping component.
The java-driver also has a mapping component of its own that is worth exploring.
In my opinion if you are already using spring, it is worth looking into spring-data-cassandra. Otherwise, it would be good to start off with just the datastax java-driver.
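For a concrete picture of what that mapping component looks like, here is a hedged sketch assuming the DataStax object mapper module (cassandra-driver-mapping); the keyspace, table, and column names are hypothetical:

```java
import com.datastax.driver.mapping.Mapper;
import com.datastax.driver.mapping.MappingManager;
import com.datastax.driver.mapping.annotations.PartitionKey;
import com.datastax.driver.mapping.annotations.Table;

// Annotate a plain class with its keyspace/table mapping.
@Table(keyspace = "shop", name = "users")
public class User {
    @PartitionKey
    private String email;
    private String name;
    // getters and setters omitted for brevity
}

// Usage, given an existing Session:
//   MappingManager manager = new MappingManager(session);
//   Mapper<User> mapper = manager.mapper(User.class);
//   mapper.save(user);                       // INSERT
//   User u = mapper.get("alice@example.com"); // SELECT by primary key
```

spring-data-cassandra offers a similar annotation-driven mapping, plus repositories and Spring configuration on top; if you don't need that ecosystem, the driver's own mapper may be enough.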

Cassandra 1.2: Is CQL preferred over Thrift Based Clients

I'm finally getting the hang of Cassandra, part of the issue was learning / respecting the differences between Thrift and CQL3.
Many of the tutorials I am finding online are for CQL3. My question: Is CQL3 truly the preferred method, and is Thrift being discouraged? The reason I ask is that I spent a couple of days trying to get what I needed through Pycassa, which does not support Cassandra 1.2 and is based on the Thrift model.
Is CQL3 truly the preferred method, and is Thrift being discouraged?
Short answer is yes.
Longer answer is:
CQL3 should be preferred for many reasons:
Platform-agnostic language: CQL3 looks like SQL and is easier to handle than pure Thrift API code.
Higher level of abstraction: for end users, it is easier to query data with CQL3 than to juggle the low-level Thrift API, although some good higher-abstraction frameworks exist (Hector for Java, Pycassa for Python, ...).
Easier to administer for operational teams: when creating a new table or adding new referential data, it is easier to write CQL3 scripts that ops teams can understand, check, and execute than cryptic cassandra-cli scripts (set cf[rowKey][columnName] = ...). I'm migrating all of our cassandra-cli scripts to CQL3 because they are a pain to maintain.
Last but not least, CQL3 makes life easier for third-party framework developers. I've developed Achilles, an open-source persistence manager for Cassandra. The Thrift version was painful to implement; the CQL3 version was a piece of cake, especially because it uses the Java Driver from DataStax.
That being said, CQL3 is no bed of roses either. Before leveraging the full power of the query language, you need to understand how the Cassandra storage engine works. The language gives you the illusion that everything is easy and will work as in SQL, but the plain truth is no. There are some important semantic differences, especially when using the WHERE clause.
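To make one of those WHERE-clause differences concrete, here is a hedged sketch using a hypothetical table; the point is that, unlike SQL, CQL3 only lets the WHERE clause restrict columns that are part of the primary key (or are indexed):

```sql
-- Hypothetical table: user_id is the partition key,
-- posted_at is a clustering column.
CREATE TABLE posts (
    user_id   text,
    posted_at timestamp,
    content   text,
    PRIMARY KEY (user_id, posted_at)
);

-- Fine: restricts the partition key, then a clustering column.
SELECT * FROM posts
 WHERE user_id = 'alice' AND posted_at > '2013-01-01';

-- Rejected by Cassandra: content is neither part of the primary key
-- nor indexed, so it cannot appear in WHERE.
-- SELECT * FROM posts WHERE content = 'hello';
```

In SQL both queries would run (the second just slowly, via a scan); in CQL3 the second is a hard error, because the storage engine has no efficient way to serve it.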
Yes, CQL is the preferred API. It is much easier to use and not all operations are supported through the Thrift API.
You can use cassandra-dbapi2 for CQL3 in Python: https://code.google.com/a/apache-extras.org/p/cassandra-dbapi2/.
Short answer: Yes
However, given how new it is compared to Thrift-based clients, it is safe to assume there are vastly more Thrift clients in production.
Given that DataStax now produce their own Java driver that supports CQL3 exclusively, it is probably a good idea to follow suit: https://github.com/datastax/java-driver

Does Apache Camel replace Apache commons pipeline

I am implementing a parallelized data processing system that involves a bunch of conversions and filters of data as it moves through multiple stages. I recognize the Apache Commons Pipeline project as a good fit for this requirement, but Apache Camel seems to provide a superset of that functionality. Does Camel replace the Commons Pipeline?
Apache Camel's goal is more to be a mediator/routing engine for distributed systems and systems integration. That said, as you noticed, it is very lightweight and could easily serve as an engine for parallelized execution of data flows. I don't think you should see Camel as a replacement, rather as an alternative.
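As a rough sketch of how a staged conversion/filter pipeline might look in Camel's Java DSL (assuming camel-core on the classpath; the endpoint URIs and processor classes are hypothetical):

```java
import org.apache.camel.builder.RouteBuilder;

public class DataPipelineRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Each step in the route is a pipeline stage: messages flow
        // through the filter and processors in order.
        from("file:data/inbox")
            .filter(body().isNotNull())     // drop empty payloads
            .process(new ParseStage())      // hypothetical org.apache.camel.Processor
            .process(new TransformStage())  // hypothetical org.apache.camel.Processor
            .to("file:data/outbox");
    }
}
```

Parallelism would come from Camel's own concurrency options (e.g. multiple consumers or a SEDA endpoint) rather than from the pipeline structure itself, which is one of the trade-offs versus Commons Pipeline.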

Are all Java SE classes available in Java ME?

I'm a Java newbie. I wanted to know if all Java SE classes are available in Java ME. If not, why is that so?
No, only a subset is available, see http://java.sun.com/javame/technology/index.jsp for an introduction.
A brief overview is given in this Wikipedia article:
Noteworthy limitations
Compared to the Java SE environment, several APIs are absent entirely, and some APIs are altered such that code requires explicit changes to support CLDC. In particular, certain changes aren't just the absence of classes or interfaces, but actually change the signatures of existing classes in the base class library. An example of this is the absence of the Serializable interface, which does not appear in the base class library due to restrictions on reflection usage. All java.lang.* classes which normally implement Serializable do not, therefore, implement this tagging interface.
Other examples of limitations depend on the version being used, as some features were re-introduced with version 1.1 of CLDC.
CLDC 1.0 and 1.1
The Serializable interface is not supported.
Parts of the reflection capabilities of the Java standard edition:
The java.lang.reflect package and all of its classes are not supported.
Methods on java.lang.Class which obtain Constructors or Methods or Fields.
No finalization. CLDC does not include the Object.finalize() method.
Limited error handling. Non-runtime errors are handled by terminating the application or resetting the device.
No Java Native Interface (JNI)
No user-defined class loaders
No thread groups or daemon threads.
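The first two limitations above are easy to demonstrate from the Java SE side: the following snippet compiles and runs on Java SE, but neither line would be possible under CLDC, where java.lang.reflect does not exist and Serializable is absent from the base class library.

```java
import java.io.Serializable;
import java.lang.reflect.Method;

public class CldcGaps {
    public static void main(String[] args) throws Exception {
        // Java SE has full reflection; CLDC has no java.lang.reflect at all.
        Method m = String.class.getMethod("length");
        System.out.println(m.invoke("hello")); // prints 5

        // On Java SE, String implements Serializable; under CLDC the
        // interface itself does not exist, so this check cannot be written.
        System.out.println("hello" instanceof Serializable); // prints true
    }
}
```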
It's worth noting that where J2ME versions of J2SE classes are apparently available, they often have a reduced API. So you can't always assume that code using 'available' classes will port straight over.
If memory serves, there are one or two methods with differing names too. Memory doesn't serve well enough right now to recall a specific example.
No, Java ME is a significantly restricted subset of Java SE. Java SE is an enormous standard library, and most of the devices Java ME is intended to run on don't have the resources to support all that overhead.
Take a look at the javadocs for CLDC 1.1, the main, universally supported API accessible to Java ME.
No, they are not. The reason for this is that the standard library is quite large, making it difficult to use on embedded devices with small amounts of memory and slower processors.
See this page for more info about what's included and what's not.