Confusing between Thrift API and CQL - cassandra

I am working in a Java web application, using NoSQL (target is Cassandra). I use Astyanax as Cassandra client since it is suggested the best client of Cassandra for now. I've just approached Cassandra for 2 weeks, so many things is so weird to me.
During my working, I encountered some problems and I do not know how to overcome:
Is table created from CQL like column family created by Thrift API? I feel they are similar, but maybe there are some differences behind. For example:
table create by CQL command cannot be accessed by Thrift API
Thrift-based APIs cannot work with tables created by CQL, but CQL methods can access column family created by Thrift API!
​Is primary key in table correspond to row key in column family?
In CQL I can declare a table which contains a collection/set/map inside. Can I do the same thing in Thrift API?
If my application needs both of them (column families and tables), how can they deal with each other?
I recognize one thing: I cannot use Thrift API to do manipulating data on tables create by CQL, and vice versa. I wonder that that, how can I remember which table/column family created from which way so that I can use the correct APIs to process data? For the time being, we don't have a general way to handle two of them, do we? AFAIK, Thrift API and CQL do not have a same interface, so they cannot understand each other?!
Could you please help me explain these things? Thank you so much.

Yes. It's impossible to update the Thrift APIs to be CQL-aware without breaking existing applications. So if you use CQL you are committing to using CQL clients only like the Java driver, and not Astyanax, Hector, et al. But this is no great sacrifice since CQL is much more usable.
For a simple PK (i.e., single column), yes. For a compound PK, it's a bit more complicated.
No. The Thrift API operates at a lower level, by design. (So you'd see the individual storage cells that make up the Map, for instance.)
I don't understand the question. With CQL you can do everything you could do with Thrift, but more easily.
Simple; don't mix the two. Stick with one or the other.

In my opinion, I believe focus is shifting towards making cassandra look like a RDBMS with SQL Queries to gain wider adoption.
But with inconsistencies between work done using Hector/Astyanax(thrift) and CQL, i think it will hurt adoption. Its almost a U turn from hector/astyanax to CQL in the middle of the journey.
Atleast CQL should have been planned in such a way that Thrift api (and high level java apis on top of it) have no problem in transitioning.

Related

Is Cassandra just a storage engine?

I've been evaluating Cassandra to replace MySQL in our microservices environment, due to MySQL being the only portion of the infrastructure that is not distributed. Our needs are both write and read intensive as it's a platform for exchanging raw data. A type of "bus" for lack of better description. Our selects are fairly simple and should remain that way, but I'm already struggling to get past some basic filtering due to the extreme limitations of select queries.
For example, if I need to filter data it has to be in the key. At that point I can't change data in the fields because they're part of the key. I can use a SASI index but then I hit a wall if I need to filter by more than one field. The hope was that materialized views would help with this but in another post I was told to avoid them, due to some instability and problematic behavior.
It would seem that Cassandra is good at storage but realistically, not good as a standalone database platform for non-trivial applications beyond very basic filtering (i.e. a single field.) I'm guessing I'll have to accept the use of another front-end like Elastic, Solr, etc. The other option might be to accept the idea of filtering data within application logic, which is do-able, as long as the data sets coming back remain small enough.
Apache Cassandra is far more than just a storage engine. Its design is a distributed database oriented towards providing high availability and partition tolerance which can limit query capability if you want good and reliable performance.
It has a query engine, CQL, which is quite powerful, but it is limited in a way to guide user to make effective queries. In order to use it effectively you need to model your tables around your queries.
More often than not, you need to query your data in multiple ways, so users will often denormalize their data into multiple tables. Materialized views aim to make that user experience better, but it has had its share of bugs and limitations as you indicated. At this point if you consider using them you should be aware of their limitations, although that is generally good idea for evaluating anything.
If you need advanced querying capabilities or do not have an ahead of time knowledge of what the queries will be, Cassandra may not be a good fit. You can build these capabilities using products like Spark and Solr on top of Cassandra (such as what DataStax Enterprise does), but it may be difficult to achieve using Cassandra alone.
On the other hand there are many use cases where Cassandra is a great fit, such as messaging, personalization, sensor data, and so on.

Cassandra version differences

I started reading Cassandra the definitive guide, which is based on Cassandra 0.7. Now, I'm trying to experiment with Cassandra 2.1.5 and it seems that there's a lot of differences which is really confusing.
For example, I see that in 0.7 version CQL did not exist. On the other hand, data model seems quite different. You can now define a schema with CQL, while in version 0.7 there was no schema.
Can anyone shortly explain the differences, especially about the data model?
I understand that in 0.7 version the idea was about different length rows, that is, rows that have different number of columns. But now I understand that each column is actually a field that contains a number of parameters, so you can have as much fields as you want within the same row (same key).
Can someone summarize the differences? Maybe I did not understand correctly.
An important point to consider, is that the underlying storage model remains the same. CQL is simply an abstraction layer on top of that model, to make it easier to work with and model your data. DataStax MVP John Berryman has a great article on this: Understanding How CQL3 Maps to Cassandra’s Internal Data Structure
In this article, Berryman observes that:
The value of the CQL primary key is used internally as the row key (which in the new CQL paradigm is being called a “partition key”).
The names of the non-primary key CQL fields are used internally as columns names. The values of the non-primary key CQL fields are then internally stored as the corresponding column values.
Additionally, he outlines the benefits of using the CQL-based approach:
It provides fast look-up by partition key and efficient scans and slices by cluster key.
It groups together related data as CQL rows. This means that you can do in one query what would otherwise take multiple queries into different column families.
It allows for individual fields to be added, modified, and deleted independently.
It is strictly better than the old Cassandra paradigm. Proof: you can coerce CQL Tables to behave exactly like old-style Cassandra ColumnFamilies. (See the examples here.)
It extends easily to implementation of sets lists and maps (which are super ugly if you’re working directly in old cassandra) — but that’s for another blog post.
The CQL protocol allows for asynchronous communication as compared with the synchronous, call-response communication required by Thrift. As a result, CQL is capable of being much faster and less resource intensive than Thrift – especially when using single threaded clients.
can have as much fields as you want within the same row (same key).
Actually, there is a hard limit of about 2 billion columns per partition (rowkey).

Is it possible to insert/write data without defining columns in Cassandra?

I am trying to understand the fundamentals of Cassandra data model. I am using CQL. As per I know the schema must be defined before anyone can insert into new columns. If someone needs to add any column can use ALTER TABLE and can INSERT value to that new column.
But in cassandra definitive guide there is written that Cassandra is schema less.
In Cassandra, you don’t define the columns up front; you just define the column
families you want in the keyspace, and then you can start writing data without defining
the columns anywhere. That’s because in Cassandra, all of a column’s names are
supplied by the client.
I am getting confused and not finding any expected answer. Can someone please explain it to me or tell me if I am missing somthing?
Thanks in advance.
Theres two different APIs to interact with Cassandra for writing data. First there's the thrift API which always allowed to create columns dynamically, but also supports adding meta data for your columns.
Next theres the newer CQL based API. CQL was created to provide another abstraction layer that would make it more user friendly to work with Cassandra. With CQL you're required to define a schema upfront for your column names and datatypes. However, that doesn't mean its not possible to use dynamic columns using CQL.
See here for differences:
http://www.datastax.com/dev/blog/thrift-to-cql3
You are reading "Cassandra, the definitive guide": a 3/4 years old book that is telling you something that has changed long time ago. Now you have to define the tables structure before being able to write data.
Here you can find some reasons behind CQL introduction and the schema-less abandonment.
The official Datastax documentation should be your definitive guide.
HTH,
Carlo

Using Cassandra as a "schemaless NoSQL database"

I'm looking at using Cassandra for an enterprise web-site I'm working on, which could be used by up to 250 million users. Cassandra seems like an obvious choice because of the way it scales, although I was a little sad not to be able to use a schema-less database like Couch (for political reasons I won't go in to).
I've read that you can still use Cassandra like a schema-less database, using either a super-column or simply serializing objects in to normal columns. At the moment I'm using .NET for my front-end.
Are there any libraries out there already that help with using Cassandra in this way?
Has anyone done anything like this already using .NET? Any tips?
Any advice gratefully received!
Thanks,
Steve.
Datomic is schemaless. Attributes are modeled and generic objects can be created, saved, queried with any combination of attributes.
http://www.datomic.com
http://docs.datomic.com/storage.html#cassandra

How to understand CQL3 table model from scratch?

I recently visited http://cassandra.apache.org/doc/cql3/CQL.html#CQLSyntax and just read this:
[...] please note that as such, they do not refer to the concept of
rows and columns found in the internal implementation of Cassandra and
in the thrift and CQL v2 API.
If we assume that I've understood Cassandras data model (column family, etc.) I do not understand where CQL3 differs from?
Is the table model of CQL3 related to Cassandras Column Family however?
I mean, how about perfomance issues?
How is the CQL3 implementation compared to the relational tables?
What is the internal implementation and/or concept of CQL3's row-and-column-model?
I know that there is something like Composite Column model. Is this the difference to the classical Cassandra row-column-model?
Moreover I am interested in theoretical details.
Asking these questions: What did I not understand?
Unfortunately I just started getting deeper into CQL. But I am very familiar with the cassandra-cli and (my)SQL.
This should be your one-stop-shopee to learn CQL data model:
http://www.datastax.com/dev/blog/cql3-for-cassandra-experts

Resources