About Java Cassandra Client, which one is better? How about CQL? [closed] - cassandra

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I am trying to develop an application using Hive as the database, and I have also been looking at NoSQL solutions as an alternative.
Having now decided to develop with Cassandra, my next question is which client I should use. Which one is better: Hector, a pure Java solution, or Kundera, with its JPA-style development?
I prefer Hector, but I am curious about Kundera. Is anyone using Kundera? Which is better?
I'm also curious about CQL (Cassandra Query Language). Can it be integrated with Hector?

Hector is slowly moving towards CQL integration. The first steps have been made, but because of experience with an unstable API, the developers seem to have postponed a new release. The CQL API is still rather new and aims to be nearly equivalent to SQL syntax. I took some basic steps with CRUD operations to verify that data could be written and read via CQL.
Nevertheless, the CQL JAR is not yet usable out of the box like a standard JDBC driver, and it is missing some important features. Looking at the rather hard-to-understand Thrift API and the not-much-simpler Hector API, I am convinced that CQL will establish itself as the state-of-the-art access API for Cassandra in versions 0.8.1 and 1.0, while Thrift will remain the native, raw access layer for some time.
The competition between the two APIs has little bearing on the decision to use Hector. Hector itself provides additional services such as failover and connection handling in the cluster. These are features addressed by neither Thrift nor CQL.
I don't really believe in all the other O/R mappers, or even in those claiming to provide full-fledged JPA. I cannot imagine how that could work well.

Answering your question about clients - Hector essentially provides access to Cassandra's native API (columns, column families, rows, etc.), whereas Kundera aims to hide these details and provide object-to-database mapping.
Kundera therefore probably makes it easier to quickly persist a range of Java objects into Cassandra - but it may not provide an efficient mapping, and you may lose some of the performance that NoSQL approaches offer.
Hector expects you to adapt to the Cassandra data model - this is harder work, but is likely to deliver more performance.
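For illustration, a minimal Hector write might look roughly like this (the cluster, keyspace, column family and key names are made-up placeholders; this is a sketch of Hector's HFactory/Mutator style, not a complete program):

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    // Connect to the cluster and keyspace (names are placeholders).
    Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "localhost:9160");
    Keyspace keyspace = HFactory.createKeyspace("Keyspace1", cluster);

    // Insert a single column into the "Users" column family for row key "user-1".
    Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    mutator.insert("user-1", "Users", HFactory.createStringColumn("email", "user1@example.com"));

Kundera, by contrast, would let you annotate a plain Java class with JPA annotations and persist it through an EntityManager, at the cost of less control over the physical layout.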

There is now a new client, Astyanax, released by Netflix in January 2012.
"Astyanax is a Java Cassandra client. It borrows many concepts from
Hector but diverges in the connection pool implementation as well as
the client API. One of the main design considerations was to provide a
clean abstraction between the connection pool and Cassandra API so
that each may be customized and improved separately. Astyanax provides
a fluent style API which guides the caller to narrow the query from
key to column as well as providing queries for more complex use cases
that we have encountered. The operational benefits of Astyanax over
Hector include lower latency, reduced latency variance, and better
error handling."
The source code for Astyanax is hosted at Github: https://github.com/Netflix/astyanax
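To give a feel for the fluent style described above, a read with Astyanax might look something like this (a fragment assuming a Keyspace already obtained from a started AstyanaxContext; the column family, key and column names are invented for the example):

    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.model.ColumnFamily;
    import com.netflix.astyanax.model.ColumnList;
    import com.netflix.astyanax.serializers.StringSerializer;

    // Column family definition (name and types are placeholders).
    ColumnFamily<String, String> CF_USERS =
        new ColumnFamily<String, String>("Users", StringSerializer.get(), StringSerializer.get());

    // Fluent query: narrow from the row key down to its columns.
    // 'keyspace' comes from a configured AstyanaxContext (setup omitted in this sketch).
    ColumnList<String> columns = keyspace.prepareQuery(CF_USERS)
        .getKey("user-1")
        .execute()
        .getResult();
    String email = columns.getColumnByName("email").getStringValue();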

For details about using CQL with Cassandra and Hector, see:
https://github.com/rantav/hector/wiki/Using-CQL
The following mailing-list thread is a good discussion of where we will be going with CQL as an API:
http://groups.google.com/group/hector-users/browse_thread/thread/540dc9c3908fbb44/f5ee488f2178e2f4
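Going by the wiki page above, running CQL through Hector's CqlQuery looks roughly like this (a sketch assuming an existing Hector Keyspace; the table and key are placeholders):

    import me.prettyprint.cassandra.model.CqlQuery;
    import me.prettyprint.cassandra.model.CqlRows;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.query.QueryResult;

    // 'keyspace' is an existing Hector Keyspace instance.
    StringSerializer se = StringSerializer.get();
    CqlQuery<String, String, String> cqlQuery = new CqlQuery<String, String, String>(keyspace, se, se, se);
    cqlQuery.setQuery("SELECT * FROM Users WHERE KEY = 'user-1'");
    QueryResult<CqlRows<String, String, String>> result = cqlQuery.execute();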

For the sake of completeness I think the Pelops library should be mentioned too. Hector seems to be the most used, but Pelops has a simpler API. Pelops does not support CQL.
Coming from Ruby I find both to be extremely verbose and imperative, though.

Kundera no longer relies on Solandra for indexing. It now lets you use the secondary indexing support provided by Cassandra, and it also gives you a way to run JPA queries over OPP (the order-preserving partitioner), e.g. range queries. We are working to enable native CQL support.
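As an illustration of what such a JPA query can look like with Kundera (the persistence unit name and the User entity are hypothetical, and the range predicate assumes an order-preserving partitioner as mentioned above):

    import java.util.List;
    import javax.persistence.EntityManager;
    import javax.persistence.EntityManagerFactory;
    import javax.persistence.Persistence;
    import javax.persistence.Query;

    // "cassandra_pu" is a made-up persistence unit configured for Kundera.
    EntityManagerFactory emf = Persistence.createEntityManagerFactory("cassandra_pu");
    EntityManager em = emf.createEntityManager();

    // A JPQL range query over a hypothetical User entity.
    Query query = em.createQuery("SELECT u FROM User u WHERE u.age > 30");
    List<?> users = query.getResultList();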
Take a look at:
http://mevivs.wordpress.com/2012/02/13/how-to-crud-and-jpa-association-handling-using-kundera/
for more details.
-Vivek

There is no Java client on the same level as Hector; Hector is the best, and there is work in progress on the Hector side to support CQL. I saw CQL commits for Hector on GitHub this month, but I don't know their final state. You can ask on the Hector users group: http://groups.google.com/group/hector-users
There is also a very simple object mapper in Hector:
https://github.com/rantav/hector/wiki/Hector-Object-Mapper-%28HOM%29
My Best,
Serdar Irmak

Kundera 2.0.4 released:
Major Changes in this release:
Cross-datastore persistence (easy to migrate an existing MySQL app over to NoSQL)
Support for relational databases (e.g. MySQL)
Replaced Solandra with Lucene-based indexing
Support added for bi-directional associations
Performance improvement fixes
In our tests, 1 million inserts with proper indexing completed in 6 minutes.
Vivek

I have yet to try Hector, but I am involved in the latest Kundera 2.0.1 release. I suggest you give it a try. It has gone through major changes since its inception, and you can see a lot of new features being added and bugs being fixed. Currently it supports JPA 1.0 and Cassandra 0.7.6, but we are planning to add support for Cassandra 0.8 and JPA 2.0 very soon. There is a pretty good example here: https://github.com/impetus-opensource/Kundera/wiki/Getting-started that may help you get started.
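For a feel of what that looks like, a Kundera-managed entity is just a JPA-annotated class (the names here are invented; see the getting-started page for the real configuration details):

    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Table;

    // A hypothetical entity mapped onto a Cassandra column family.
    @Entity
    @Table(name = "users")
    public class User {
        @Id
        private String userId;

        @Column(name = "email")
        private String email;

        // getters and setters omitted for brevity
    }

Instances are then saved through a standard JPA EntityManager via em.persist(...).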

The Astyanax API produces human-readable code and does include connection pooling.

CQL support over Cassandra has been integrated in Kundera 2.0.6 (yet to be released). It now allows you to execute CQL as a native query.
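With the JPA API that would look something like this (a sketch; the table name and the User result class are made up, and em is an EntityManager from a Kundera persistence unit as shown earlier):

    import java.util.List;
    import javax.persistence.Query;

    // Run raw CQL through JPA's native-query interface.
    Query cql = em.createNativeQuery("SELECT * FROM users WHERE key = 'user-1'", User.class);
    List<?> rows = cql.getResultList();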
-Vivek

Related

datastax driver vs spring-data-cassandra

Hey, I am new to Cassandra and I am familiar with Spring's JdbcTemplate.
Can anyone please explain the difference between the two? Can you also suggest which one is better to use?
Thanks.
spring-data-cassandra uses datastax's java-driver, so the decision to be made is really whether or not you need the functionality of spring-data.
Some features from spring data that may be useful for you (documented here):
Spring XML configuration for configuring your Cluster instance (especially useful if you are already using Spring).
An object mapping component.
The java-driver has a mapping component of its own as well that is worth exploring.
In my opinion if you are already using spring, it is worth looking into spring-data-cassandra. Otherwise, it would be good to start off with just the datastax java-driver.
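For reference, bare java-driver usage is fairly compact (the contact point, keyspace, table and column names below are placeholders):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    // Connect and run a simple CQL query (names are placeholders).
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect("my_keyspace");

    ResultSet rs = session.execute("SELECT user_id, email FROM users WHERE user_id = 'user-1'");
    for (Row row : rs) {
        System.out.println(row.getString("email"));
    }
    cluster.close();

spring-data-cassandra wraps this same driver behind its template and repository abstractions, so the choice is mostly about whether you want that extra layer.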

Is CQL an abstraction implemented on top of Thrift?

I'm having difficulty establishing how CQL is implemented at the lowest level. Is it still making calls to the Thrift interface, or is it something completely different?
It is possible to use CQL from either Thrift or the new native binary protocol. However the native drivers provide much better support for the more sophisticated Cassandra functions. Thrift is essentially deprecated.
In addition to the other answer, I'd like to add that CQL (and its drivers) interact with Cassandra via a new, binary protocol, and not via Thrift. The original row/column structure of Cassandra is still intact, but it is largely abstracted away by CQL.
There was a great blog post on this subject on Planet Cassandra: Understanding How CQL3 Maps To Cassandra’s Internal Data Structure. It makes for a quick read and has some simple examples that explain how the abstraction provided by CQL3 works. I highly recommend running through the examples yourself on your own local Cassandra instance.
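A rough sketch of the idea, with made-up table and column names: a CQL3 table with a clustering column is stored internally as one wide storage row per partition key.

    // A CQL3 table with a compound primary key (illustrative only).
    String ddl = "CREATE TABLE playlists ("
               + "  playlist_id text,"
               + "  song_order  int,"
               + "  song_title  text,"
               + "  PRIMARY KEY (playlist_id, song_order)"
               + ")";
    // Internally (pre-3.0 storage engine): all CQL rows that share a playlist_id live in a single
    // storage row keyed by playlist_id, and each CQL row becomes cells whose composite names
    // combine the clustering value with the column name, e.g. (1, 'song_title'), (2, 'song_title'), ...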

Cassandra 1.2: Is CQL preferred over Thrift Based Clients

I'm finally getting the hang of Cassandra; part of the issue was learning and respecting the differences between Thrift and CQL3.
Many of the tutorials I am finding online are for CQL3. My question: Is CQL3 truly the preferred method, and is Thrift being discouraged? Reason I ask is I spent a couple of days trying to get what I needed through Pycassa which does not support Cassandra 1.2 and that is based on the Thrift model.
Is CQL3 truly the preferred method, and is Thrift being discouraged?
Short answer is yes.
Longer answer is:
CQL3 should be preferred for many reasons:
Platform-agnostic language: CQL3 looks like SQL and is easier to handle than pure Thrift API code.
Higher level of abstraction: for end users it is easier to query data with CQL3 than to juggle the low-level Thrift API, although some good higher-abstraction frameworks exist (Hector for Java, Pycassa for Python, ...).
Easier to administer for operational teams: when creating a new table or adding new reference data, it is easier to write CQL3 scripts that ops teams can understand, check and execute than cryptic cassandra-cli scripts (set cf[rowKey][columnName] = ...). I'm migrating all our cassandra-cli scripts to CQL3 because it's a pain in the ass to maintain them.
Last but not least, CQL3 makes life easier for third-party framework developers. I've developed Achilles, an open-source persistence manager for Cassandra. The Thrift version was painful to implement; the CQL3 version was a piece of cake, especially because it uses the Java Driver from DataStax.
That being said, CQL3 is no bed of roses either. Before leveraging the full power of the query language, you need to understand how the Cassandra storage engine works. The language gives you the illusion that everything is easy and will work like SQL, but the plain truth is that it doesn't. There are some important semantic differences, especially when using the WHERE clause.
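To make the WHERE clause point concrete, here is a small sketch (assuming a connected Session from the DataStax Java driver, and a made-up users table with user_id as the only primary key column):

    // Schema assumed: CREATE TABLE users (user_id text PRIMARY KEY, age int, email text)

    // Fine: the partition key is fully specified.
    session.execute("SELECT email FROM users WHERE user_id = 'user-1'");

    // Rejected by Cassandra: 'age' is neither part of the primary key nor indexed, so this would
    // need a full scan (it only works with a secondary index, or by adding ALLOW FILTERING).
    // session.execute("SELECT email FROM users WHERE age > 30");

    // Range predicates over the partition key itself have to go through token().
    session.execute("SELECT user_id FROM users WHERE token(user_id) > token('user-1')");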
Yes, CQL is the preferred API. It is much easier to use and not all operations are supported through the Thrift API.
You can use cassandra-dbapi2 for CQL3 in Python: https://code.google.com/a/apache-extras.org/p/cassandra-dbapi2/.
Short answer: Yes
However, given how new it is compared to Thrift-based clients, it is safe to assume there are vastly more Thrift clients in production.
Given that DataStax now produce their own Java driver that supports CQL3 exclusively, it is probably a good idea to follow suit: https://github.com/datastax/java-driver

Thrift, CQL3 or what?

Recently I noticed that Cassandra and DataStax are pushing CQL3 more. A new Java driver has even been released, and this one does not use Thrift at all.
And if you are not going to use "compact storage", you will not be able to use Thrift with your application. Thus, I believe that Thrift is fading out of Cassandra.
My question is: for a new application, should I go ahead and use CQL3? I still prefer Thrift because I want to know what's going on underneath, but on the other hand I do not want to be using something that is fading out and becoming legacy. What do you recommend?
My company recently went through the same thought process and ended up using CQL3 over thrift.
Although there is a slight lack of transparency with the additional layer of abstraction going on with CQL3, the ease and familiarity of writing SQL style statements makes the code much more readable and intuitive in my opinion. Plus we found the cqlsh interface far more user friendly than cassandra-cli for debugging and general db maintenance (the auto-complete is fab in cqlsh!).
Once you understand the underlying data structure and how CQL3 represents that data, the extra layer of abstraction pales into insignificance, really.
Datastax are encouraging developers to use cql3 for newer applications. From the Thrift to CQL3 Guide:
…we believe that CQL3 is a simpler and overall better API for Cassandra than the thrift API is. Therefore, new projects/applications are encouraged to use CQL3 (though remember that CQL3 is not final yet, and so this statement will only be fully valid with Cassandra 1.2). But the thrift API is not going anywhere.
Thrift won't be getting newer features (unless they are requested a lot), so it's safe to say that CQL3 is the better choice for new apps (of course there are exceptions… if you need low-level access, you need Thrift). My only pain is that DataStax's driver does not yet support SSL, but that is in the pipeline and will hopefully be a committed feature soon.

Cassandra vs Riak [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I am looking for an eventually consistent data store, and it looks like it may come down to Riak or Cassandra. Does anyone have experience with, or a view on, these two?
As you probably know, they are both architecturally strongly influenced by Dynamo (eventually consistent, no single points of failure, etc.). Both also go beyond Dynamo in providing a "richer than pure K/V" data model -- in Cassandra's case a Bigtable-like ColumnFamily model, in Riak's a document-oriented one. I have seen sane people choose both.
I believe points that favor Cassandra include
speed
support for clusters spanning multiple data centers
big names using it (digg, twitter, facebook, webex, ... -- http://n2.nabble.com/Cassandra-users-survey-tp4040068p4040393.html)
Points that favor Riak include
map/reduce support out of the box
/Cassandra dev, fwiw
Riak is used by
Mozilla Foundation
Ask.com sponsored listings
Comcast
Citigroup
Bet365
I think they both pass the test of credible reference customers/users.
Cassandra seems more mature, and is currently doing better in benchmarks. Riak seems easier to add a node to as your cluster grows.
For completeness: A good (probably biased) comparison between the two can be found at http://docs.basho.com/riak/1.3.2/references/appendices/comparisons/Riak-Compared-to-Cassandra/
Use and download are different. Best to get references.
Perhaps a private conversation could be had where Riak references in these companies could be shared? Not sure how to get such with Cassandra, but there is a community of companies that support Cassandra that seem like a good place to start. As these probably have community participants in Cassandra development, it may be a REALLY reasonable place to start.
I would like to hear Riak's answer to recent and large deployments where customers are happy.
I also would like to see the roadmap for each product. Cassandra is a bit easier to track (http://wiki.apache.org/cassandra/) than Riak in my view as Cassandra's wiki discusses limitations and things that are probably going to change going forward, but neither outline futures well. I could understand that of an open source community ... perhaps ... but I cannot for a product for which I must pay.
I also would suggest research of Cloudant, which has what appears to be a very nice layering of capabilities. It also looks like it is bringing to bear the capabilities elsewhere in Apache land. CouchDB is the Apache platform on which Cloudant is based. BUT the indexing with Lucene seems but the tip of the iceberg when it comes to where Cloudant could go. Creating and managing an index is a very systematic process, a kind of data pipeline, that could be scripted using other Apache community assets. AND capabilities like NLP also could be added through Lucene indirectly, or maybe directly into what is persisted.
It would be nice to see a proposed Cloudant roadmap, especially since the team could mine the riches of the Apache community and integrate such into Cloudant. Such probably exists as there is an operational component to the Cloudant revenue model that will require it, if for no other reason.
Another area of interest ... Cloudant's pricing model ... it is clear their revenue model is not based on software, but around service. That is quite attractive, and it seems consistent with the ecosystem surrounding Cassandra too. I don't know if the Basho folks have won over enough of the nosql community as yet ... don't see such from any buzz around their web site or product.
I like this Cloudant web page (https://cloudant.com/the-data-layer/). I was surprised to see the embedded Erlang capability ... I did not know CouchDB was written in Erlang as this seems unusual to me in the Apache community (my ignorance); CouchDB appears to be older than other nosql products I know (now) to be written in Erlang. Whatever their strategy, they at least count Amazon EC2 and Microsoft Azure as hosting partners, indicating an appreciation of Microsoft and !Microsoft worlds - all very important if properly recognizing the middleware value potential (beyond cache or hash table applications) that these types of data stores could have.
Finally, while I don't know the board well, Andy Palmer's guidance looks like it will be valuable. He can bring some guidance vis-a-vis structured data (through VoltDB) to a world that rightly or wrongly may be unfairly branded as KVP hash tables of unstructured data. The need for structure and ecosystem surrounding nosql "databases" is being recognized ... witness Google's efforts with Spanner ... KVP/little structure/need for search-ability motivated Google's investment in the Spanner space. While we all may not need something like Spanner, we probably do need an improving and robust "enterprise" management and interoperability capability in these nosql databases to make it reasonable to incorporate them into modern cloud architectures. The needed structure can come from ease of interoperability and functional richness. It can also come from new capabilities that support conversion of unstructured data to structured data (e.g. indexes, use of NLP to create structured and parsed renderings of things inside of a KVP blob, and plenty of other things that, if put into a roadmap and published, could entice and grow a user base). Cloudant looks like it has a good chance of success ... I will take a closer look at it ...
And look what I found about CouchDB ...
CouchDB comes with a suite of features, such as on-the-fly document transformation and real-time change notifications, that makes web app development a breeze. It even comes with an easy to use web administration console. You guessed it, served up directly out of CouchDB! We care a lot about distributed scaling. CouchDB is highly available and partition tolerant, but is also eventually consistent. And we care a lot about your data. CouchDB has a fault-tolerant storage engine that puts the safety of your data first.
