HBase Thrift in CDH 5 - Node.js

I'm using the Node.js Thrift API to connect to HBase. Everything was working great until I upgraded from CDH 4.6 to CDH 5. After upgrading, I regenerated the Thrift API for Node.js with this command:
thrift --gen js:node /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hbase/include/thrift/hbase2.thrift
After replacing the original Node.js script with the newly generated one, everything stopped working.
You can view the new script and the basic methods in the demo I'm trying to run at https://github.com/lgrcyanny/Node-HBase-Thrift2
When I run the 'get' method, it returns "Internal error processing get".
When I run the 'put' method, it returns "Invalid method name: 'put'".
Is the new Thrift API completely incompatible? Am I missing something here?

There are two Thrift IDL files that come with HBase:
hbase-thrift/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift
hbase-thrift/src/main/resources/org/apache/hadoop/hbase/thrift2/Hbase.thrift
Both have a get() method, but only one of them has a put() method, which is exactly what your error messages above are telling us.
Cited from the package summary page:
There are currently 2 thrift server implementations in HBase, the packages:
org.apache.hadoop.hbase.thrift: This may one day be marked as deprecated.
org.apache.hadoop.hbase.thrift2: i.e. this package. This is intended to closely match the HTable interface and to one day supersede the older thrift (the old thrift mimics an API HBase no longer has).
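In other words, your bindings were generated from the thrift2 IDL, but the server you are talking to is running the old thrift service (whose get() takes different arguments and which has no put() at all, only mutateRow()). Either regenerate your bindings from the old thrift IDL, or run the thrift2 server instead (e.g. hbase-daemon.sh start thrift2). For illustration, here is a minimal, hedged sketch of the thrift2 calls in Java (the Node.js bindings expose the same service methods); host, port, table, and values are assumptions:
import java.nio.ByteBuffer;
import java.util.Collections;
import org.apache.hadoop.hbase.thrift2.generated.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class Thrift2Demo {
    public static void main(String[] args) throws Exception {
        // The server side must be the thrift2 implementation for these calls to resolve
        TTransport transport = new TSocket("localhost", 9090);
        transport.open();
        THBaseService.Client client = new THBaseService.Client(new TBinaryProtocol(transport));

        ByteBuffer table = ByteBuffer.wrap("demo_table".getBytes("UTF-8"));

        // put() only exists in the thrift2 service; the old thrift service uses mutateRow()
        TColumnValue cv = new TColumnValue(
                ByteBuffer.wrap("cf".getBytes("UTF-8")),
                ByteBuffer.wrap("col".getBytes("UTF-8")),
                ByteBuffer.wrap("value".getBytes("UTF-8")));
        client.put(table, new TPut(ByteBuffer.wrap("row1".getBytes("UTF-8")),
                Collections.singletonList(cv)));

        TResult result = client.get(table, new TGet(ByteBuffer.wrap("row1".getBytes("UTF-8"))));
        System.out.println(result);
        transport.close();
    }
}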
Also the install guides have a separate section for that scenario:
CDH 5 HBase Compatibility
CDH 5 HBase is [...] not wire compatible with CDH 4 [...]. Consequently, rolling upgrades from CDH 4 to CDH 5 are not possible because existing CDH 4 HBase clients cannot make requests to CDH 5 servers and CDH 5 HBase clients cannot make requests to CDH 4 servers. Clients of the Thrift and REST proxy servers, however, retain wire compatibility between CDH 4 and CDH 5. [...]
The HBase User API (Get, Put, Result, Scanner etc; see Apache HBase API documentation) has evolved and attempts have been made to make sure the HBase Clients are source code compatible and thus should recompile without needing any source code modifications. This cannot be guaranteed however, since with the conversion to ProtoBufs, some relatively obscure APIs have been removed. Rudimentary efforts have also been made to preserve recompile compatibility with advanced APIs such as Filters and Coprocessors. These advanced APIs are still evolving and our guarantees for API compatibility are weaker here.

Related

Is the cassandra-driver-core dependency removed as part of the DSE Cassandra 5.x Java driver?

With DSE Cassandra 5.x, should cassandra-driver-core be excluded from the client code's dependencies because it is deprecated, and should dse-java-driver-core be used instead?
I'm not 100% sure what you're referring to; I think the primary reason is the rework of authentication support, along with other things that are specific to the DSE driver. The OSS driver supports only username/password authentication, while the DSE driver also supports Kerberos, plus mixed internal/external authentication schemes.
But you can safely replace cassandra-driver-core with dse-java-driver-core as long as you don't need geo types, graph support, etc. - the code is compatible, with the same Cluster/Session. Look here for the full list of differences.

Upgrade issue in Application querying schema_keyspaces and using RoundRobin policy

In version 1 of our Java application, which uses Cassandra 2.1:
At startup we execute the query "SELECT * from system.schema_keyspaces;" to get keyspace info (if this fails, the application won't start).
In the new code, however, we get the keyspace information from the driver's cluster.metadata instance, and that version uses Cassandra 3.11.
We are using the DC-aware RoundRobin load balancing policy of the DataStax Java driver.
Now consider an upgrade scenario with 3 nodes A, B and C, where A is upgraded (new application + Cassandra 3.11), the upgrade of B is in progress (Cassandra is down there), C is not yet upgraded (old application + Cassandra 2.1), and the client application on node C restarts.
I get an InvalidQueryException if the old query from the Java client on node C gets executed against A (the client sends queries in round-robin fashion). If it fails, there is no handling in the old application. How can we resolve this issue?
com.datastax.driver.core.exceptions.InvalidQueryException: un-configured table schema_keyspaces
One way I figured out: remove A's IP from the client application's contact points and from the peers table on C's Cassandra node, restart the client application, and then let Cassandra restore the peers table entry.
The other way is to keep restarting the client application on C until its query actually hits the Cassandra 2.1 node and the application starts successfully. But that seems ugly to me.
In your application it's better to explicitly set the protocol version to match Cassandra 2.1 instead of relying on auto-negotiation. The driver's documentation explicitly recommends this.
According to the compatibility matrix you need to explicitly set the protocol version to V3, but this also depends on the driver version, so you may need to stick with V2.
Cluster cluster = Cluster.builder()
.addContactPoint("xxxx")
.withProtocolVersion(ProtocolVersion.V3)
.build();
After the upgrade to 3.11 is done, you can switch to protocol version V4.

Getting "AssertionError("Unknown application type")" when Connecting to DSE 5.1.0 Spark

I am connecting to DSE (Spark) using this:
new SparkConf()
.setAppName(name)
.setMaster("spark://localhost:7077")
This works fine with DSE 5.0.8 (Spark 1.6.3), but it now fails with DSE 5.1.0 with this error:
java.lang.AssertionError: Unknown application type
at org.apache.spark.deploy.master.DseSparkMaster.registerApplication(DseSparkMaster.scala:88) ~[dse-spark-5.1.0.jar:2.0.2.6]
After checking the dse-spark jar, I've come up with this:
if(rpcendpointref instanceof DseAppProxy)
And within Spark, it seems to be an RpcEndpointRef (NettyRpcEndpointRef).
How can I fix this problem?
I had a similar issue and fixed it by following this:
https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/spark/sparkRemoteCommands.html
Then you need to run your job using dse spark-submit, without specifying any master.
Resource Manager Changes
The DSE Spark Resource Manager is different from the OSS Spark Standalone Resource Manager. The DSE method uses a different URI, "dse://", because under the hood it is actually performing a CQL-based request. This has a number of benefits over the Spark RPC but, as noted, does not match some of the submission mechanisms possible in OSS Spark.
There are several articles on this on the DataStax blog, as well as documentation notes:
Network Security with DSE 5.1 Spark Resource Manager
Process Security with DSE 5.1 Spark Resource Manager
Instructions on the URL Change
Programmatic Spark Jobs
While it is still possible to launch an application using "setJars", you must also add the DSE-specific jars and config options to talk to the resource manager. In DSE 5.1.3+ there is a provided class, DseConfiguration, which can be applied to your SparkConf via DseConfiguration.enableDseSupport(conf) (or invoked via implicit) and will set these options for you.
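As a rough, hedged sketch (exact signatures vary by DSE version; the app name and dse:// URI below are illustrative):
SparkConf conf = new SparkConf()
    .setAppName("my-app")
    .setMaster("dse://127.0.0.1?"); // DSE resource manager URI, not spark://host:7077
// Adds the DSE-specific config options needed to talk to the resource manager;
// in Scala this can also be invoked implicitly as conf.enableDseSupport().
SparkConf dseConf = DseConfiguration.enableDseSupport(conf); // assuming the enriched conf is returned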
This is of course for advanced users only and we strongly recommend using dse spark-submit if at all possible.
I found a solution.
First of all, I think it is impossible to run a Spark job from within an application in DSE 5.1; it has to be submitted with dse spark-submit.
Once submitted, it works perfectly. For communication with the job I used Apache Kafka.
If you don't want to use a job, you can always go back to plain Apache Spark.

Differences between Hector Cassandra and JDBC

I'm currently starting a project that uses Apache Cassandra, so I'm interested in accessing my Cassandra database from Java. For that I'm using Hector. However, I have some doubts about the differences between access via Hector and via JDBC Cassandra (specifically this: https://code.google.com/a/apache-extras.org/p/cassandra-jdbc/).
I believe the following (although I'm not sure if I'm right):
One difference could be that they are APIs of different levels (I consider Hector a higher-level API than JDBC Cassandra)?
JDBC Cassandra uses CQL for accessing/modifying the database, while Hector doesn't use CQL (it only uses the methods provided for that).
I'd be thankful if someone could tell me whether I'm right or wrong in the previous lines, and about other differences between the two (Hector and JDBC Cassandra).
Thanks in advance!
The official Cassandra Java Driver (https://github.com/datastax/java-driver) is probably the best (IMHO, the only) choice for a new project, for several reasons:
New features
All other Cassandra clients (Hector, Astyanax, etc.) are based on the legacy Thrift RPC protocol. The RPC "one response per request" model has severe limitations; for example, it doesn't allow processing several requests at the same time on a single connection, or streaming large ResultSets.
So DataStax developed a new protocol that doesn't have the RPC limitations. The Thrift API won't be getting new features; it's only kept for backward compatibility. In contrast, the Java Driver is actively developed to incorporate the new features of Cassandra 2.0, like conditional updates and batching of prepared statements. An overview of the new features is here: http://www.datastax.com/dev/blog/cql-in-cassandra-2-0
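For example, a short sketch with the Java Driver (1.x/2.x-era API; keyspace, table, and queries are illustrative) showing two requests in flight on one connection, which Thrift's one-response-per-request model cannot do:
Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("demo");
// Both queries are multiplexed over the same connection by the native protocol
ResultSetFuture f1 = session.executeAsync("SELECT * FROM users WHERE id = 1");
ResultSetFuture f2 = session.executeAsync("SELECT * FROM users WHERE id = 2");
Row r1 = f1.getUninterruptibly().one(); // block for each result as needed
Row r2 = f2.getUninterruptibly().one();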
Convenience
In the early Cassandra days (0.7) our company used an in-house low-level Thrift client. Later we used Hector, Pelops and Astyanax in various projects. I can say that clients based on the Java Driver look the simplest and cleanest to me.
Performance
We have done some performance testing of the Cassandra Java Driver against other clients. In most scenarios the performance is roughly the same. However, there are certain situations where the Java Driver significantly outperforms the other clients due to its asynchronous nature.
Btw, there's a couple of related questions with excellent answers:
Advantages of using cql over thrift
Cassandra Client Java API's
EDIT: When I wrote this, I wasn't aware that Achilles (https://github.com/doanduyhai/Achilles), mentioned in another answer, has a CQL implementation that works via the Java Driver. For the sake of completeness I must say that Achilles' DAO on top of CQL might be (or might one day become) a viable alternative to plain CQL via the Java Driver.
@mol Why restrict yourself to Hector and cassandra-jdbc if you're starting a new project?
There are many other interesting choices:
Astyanax, as Martin mentioned (Thrift & CQL3)
FireBrand (Thrift via Hector)
Achilles, which I've just developed (CQL3 & Cassandra 2.0 via Java Driver core)
Java Driver Core for plain CQL3
Hector is indeed a higher-level API. Internally it uses Cassandra's Thrift API to execute its functions; it does not convert them to equivalent CQL calls. But its API also provides access to CQL, in which case it passes the CQL (via Thrift) to Cassandra's CQL APIs.
CQL in Cassandra is a SQL-like language that works via the Cassandra APIs. So it does not provide any capability beyond what the APIs offer, but it does make Cassandra easier to use at times. If you are considering Hector, I would also look at Astyanax, which is a newer take on a high-level Java API for Cassandra.
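To make the two paths concrete, here is a hedged Hector 1.x sketch (cluster name, keyspace, column family, and query are illustrative):
Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
Keyspace keyspace = HFactory.createKeyspace("demo", cluster);
// Plain Thrift-level read of a single column
ColumnQuery<String, String, String> colQuery = HFactory.createStringColumnQuery(keyspace);
colQuery.setColumnFamily("users").setKey("jsmith").setName("first");
QueryResult<HColumn<String, String>> col = colQuery.execute();
// CQL passed (via Thrift) to Cassandra's CQL API
CqlQuery<String, String, String> cqlQuery = new CqlQuery<String, String, String>(
        keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
cqlQuery.setQuery("SELECT * FROM users");
QueryResult<CqlRows<String, String, String>> rows = cqlQuery.execute();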
Since you are starting a new project, it is best to start with CQL and the native Java driver:
http://www.datastax.com/documentation/developer/java-driver/1.0/webhelp/index.html#common/drivers/introduction/introArchOverview_c.html
Per DataStax, it is 10-15% faster than the Thrift APIs, as it uses the binary protocol.
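A minimal, hedged sketch with the native driver (1.0-era API; contact point, keyspace, and table are illustrative):
Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("demo");
ResultSet rs = session.execute("SELECT firstname FROM users WHERE lastname = 'Jones'");
for (Row row : rs) {
    System.out.println(row.getString("firstname"));
}
cluster.shutdown(); // driver 1.x used shutdown(); later versions use close()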

Cassandra and hector upgrade to 1.0 and unit testing

We have been using Cassandra 0.7, and since the stable version of Cassandra 1.0.0 is out, we planned to upgrade to it. It's low risk since we are not in production yet. We were using Hector 0.7-29, which had a testutils package with an EmbeddedServerHelper class that we used to start an embedded server in all our unit tests.
However, the upgraded version of Hector, 1.0-1 (which is for Cassandra 1.0.x), has removed this package (me.prettyprint.cassandra.testutils) from its core distribution.
I would like to know the plan moving forward for unit testing with the new Hector 1.0-1 API client. Is there still a way to start an embedded Cassandra server?
Thanks for your help.
There is a new 'test' module which holds EmbeddedSchemaLoader and EmbeddedServerHelper. We took them out of core so they could be used outside of Hector (the module now has no direct dependency on Hector).
https://github.com/rantav/hector/tree/master/test
Let us know how everything works out.
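A hedged sketch of how the test-module helpers can be wired into a JUnit test (method names and the thrift port are assumptions; check them against the module's source):
public class CassandraDaoTest {
    @BeforeClass
    public static void startEmbeddedCassandra() throws Exception {
        EmbeddedServerHelper embedded = new EmbeddedServerHelper();
        embedded.setup(); // boots an in-process Cassandra for the test JVM
    }

    @AfterClass
    public static void stopEmbeddedCassandra() throws Exception {
        EmbeddedServerHelper.teardown();
    }

    @Test
    public void connectsToEmbeddedServer() {
        // 9170 is the thrift port commonly used by the embedded helper (assumption)
        Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "localhost:9170");
        assertNotNull(cluster);
    }
}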
