I am trying to understand the fundamentals of the Cassandra data model. I am using CQL. As far as I know, the schema must be defined before anyone can insert into new columns. If someone needs to add a column, they can use ALTER TABLE and then INSERT a value into that new column.
But the book Cassandra: The Definitive Guide states that Cassandra is schema-less:
In Cassandra, you don’t define the columns up front; you just define the column
families you want in the keyspace, and then you can start writing data without defining
the columns anywhere. That’s because in Cassandra, all of a column’s names are
supplied by the client.
I am getting confused and cannot find a clear answer. Can someone please explain this to me or tell me if I am missing something?
Thanks in advance.
There are two different APIs for writing data to Cassandra. First, there is the Thrift API, which has always allowed columns to be created dynamically but also supports adding metadata for your columns.
Next, there is the newer CQL-based API. CQL was created to provide another abstraction layer that makes Cassandra more user-friendly to work with. With CQL you are required to define a schema up front for your column names and data types. However, that doesn't mean it's not possible to use dynamic columns with CQL.
See here for differences:
http://www.datastax.com/dev/blog/thrift-to-cql3
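As a rough illustration of the point about dynamic columns in CQL: they are usually modelled with a clustering column, so each new clustering value behaves like a new column inside the partition. The sketch below is a plain in-memory Java model, not driver code. The table in the comment and the names DynamicColumnsSketch, insert, columnCount, sensor_id, event_time and reading are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory model only (no Cassandra driver involved). Assumed, hypothetical table:
//   CREATE TABLE readings (sensor_id text, event_time bigint, reading text,
//                          PRIMARY KEY (sensor_id, event_time));
final class DynamicColumnsSketch {
    // partition key -> (clustering value -> reading), sorted by clustering value
    private final Map<String, TreeMap<Long, String>> partitions = new TreeMap<>();

    // Models: INSERT INTO readings (sensor_id, event_time, reading) VALUES (?, ?, ?)
    void insert(String sensorId, long eventTime, String reading) {
        partitions.computeIfAbsent(sensorId, k -> new TreeMap<>()).put(eventTime, reading);
    }

    // Number of "dynamic columns" currently stored in one partition.
    int columnCount(String sensorId) {
        TreeMap<Long, String> cells = partitions.get(sensorId);
        return cells == null ? 0 : cells.size();
    }

    public static void main(String[] args) {
        DynamicColumnsSketch table = new DynamicColumnsSketch();
        table.insert("s1", 1000L, "20.5");
        table.insert("s1", 2000L, "21.0"); // new clustering value, no ALTER TABLE needed
        table.insert("s2", 1000L, "18.2");
        System.out.println(table.columnCount("s1")); // prints 2
        System.out.println(table.columnCount("s2")); // prints 1
    }
}
```

The schema stays fixed at three declared columns, yet each partition can hold a different number of entries, which is the behaviour the old "schema-less" description referred to.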
You are reading "Cassandra, the definitive guide", a three- or four-year-old book that describes behaviour that changed a long time ago. Now you have to define the table structure before you can write data.
Here you can find some reasons behind the introduction of CQL and the abandonment of the schema-less model.
The official Datastax documentation should be your definitive guide.
HTH,
Carlo
Related
I started reading Cassandra: The Definitive Guide, which is based on Cassandra 0.7. Now I'm trying to experiment with Cassandra 2.1.5, and there seem to be a lot of differences, which is really confusing.
For example, I see that CQL did not exist in version 0.7. The data model also seems quite different: you can now define a schema with CQL, while in version 0.7 there was no schema.
Can anyone briefly explain the differences, especially regarding the data model?
I understand that in version 0.7 the idea was rows of different lengths, that is, rows with different numbers of columns. But now I understand that each column is actually a field that contains a number of parameters, so you can have as many fields as you want within the same row (same key).
Can someone summarize the differences? Maybe I did not understand correctly.
An important point to consider is that the underlying storage model remains the same. CQL is simply an abstraction layer on top of that model, to make it easier to work with and model your data. DataStax MVP John Berryman has a great article on this: Understanding How CQL3 Maps to Cassandra’s Internal Data Structure
In this article, Berryman observes that:
The value of the CQL primary key is used internally as the row key (which in the new CQL paradigm is being called a “partition key”).
The names of the non-primary key CQL fields are used internally as columns names. The values of the non-primary key CQL fields are then internally stored as the corresponding column values.
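The two observations above can be pictured with a small in-memory Java model. This is only a sketch of the mapping for a single-column primary key. It ignores timestamps, row markers and the real on-disk encoding, and the class name CqlRowMappingSketch, the method toInternal and the example row are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified in-memory model of the quoted mapping; not Cassandra's actual storage code.
final class CqlRowMappingSketch {
    // Internal view: row key (partition key) -> sorted column name -> column value
    static Map<String, TreeMap<String, String>> toInternal(
            String primaryKeyValue, Map<String, String> nonKeyFields) {
        Map<String, TreeMap<String, String>> storage = new TreeMap<>();
        // The primary key value becomes the row key; every other field becomes one column.
        storage.put(primaryKeyValue, new TreeMap<>(nonKeyFields));
        return storage;
    }

    public static void main(String[] args) {
        // Hypothetical CQL row: user_id = 'u1', name = 'Ann', email = 'ann@example.com'
        Map<String, String> fields = new TreeMap<>();
        fields.put("name", "Ann");
        fields.put("email", "ann@example.com");
        System.out.println(toInternal("u1", fields));
    }
}
```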
Additionally, he outlines the benefits of using the CQL-based approach:
It provides fast look-up by partition key and efficient scans and slices by cluster key.
It groups together related data as CQL rows. This means that you can do in one query what would otherwise take multiple queries into different column families.
It allows for individual fields to be added, modified, and deleted independently.
It is strictly better than the old Cassandra paradigm. Proof: you can coerce CQL Tables to behave exactly like old-style Cassandra ColumnFamilies. (See the examples here.)
It extends easily to implementation of sets lists and maps (which are super ugly if you’re working directly in old cassandra) — but that’s for another blog post.
The CQL protocol allows for asynchronous communication, as compared with the synchronous, call-response communication required by Thrift. As a result, CQL is capable of being much faster and less resource-intensive than Thrift, especially when using single-threaded clients.
can have as many fields as you want within the same row (same key).
Actually, there is a hard limit of about 2 billion columns per partition (rowkey).
I am new to Cassandra and this may have been covered somewhere, but I haven't been able to find it here, on Planet Cassandra, or in the DataStax documentation.
I have inherited a set of keyspaces created by another programmer who has left the company. There is a particular data item that he was supposed to have stored in the keyspace; however, it's not listed in the schema (as displayed by the Cassandra CLI).
The programmer stated that it was in the 'blob'; however, there aren't any columns in the keyspace defined as a 'blob'.
When I use the DataStax DevCenter tool, however, a 'key' column is listed as a 'blob' that isn't in the schema...
key (blob)
assignExpirydate (text)
bookingClass (text)
... etc.
Since it wasn't in the schema, I'm assuming that the column is created by Cassandra and is not what I'm looking for, but I would like to verify that.
So my question is: is there documentation that covers (or a person who knows) whether Cassandra creates this column? A quick explanation of it would be appreciated as well.
Thanks
If the table was created with an older version of Cassandra through cassandra-cli, you may not see it in DevCenter. You should take a look at the docs on using "thrift" tables from "cql3" to see what is going on. If you use cassandra-cli and run list on the table, it should show you all the data in there. When using Thrift you can insert data with any column name you want; it doesn't have to be defined in the schema.
Links to thrift/cql3 information:
http://www.datastax.com/dev/blog/thrift-to-cql3
http://www.datastax.com/dev/blog/cql3-for-cassandra-experts
http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
I am working on a Java web application that uses NoSQL (the target is Cassandra). I use Astyanax as the Cassandra client, since it is said to be the best Cassandra client for now. I have only been working with Cassandra for two weeks, so many things still seem strange to me.
During my work I ran into some problems that I do not know how to overcome:
1. Is a table created with CQL like a column family created through the Thrift API? I feel they are similar, but maybe there are some differences underneath. For example:
   - a table created by a CQL command cannot be accessed through the Thrift API
   - Thrift-based APIs cannot work with tables created by CQL, but CQL methods can access a column family created through the Thrift API!
2. Does the primary key in a table correspond to the row key in a column family?
3. In CQL I can declare a table that contains a collection such as a set or map. Can I do the same thing with the Thrift API?
4. If my application needs both of them (column families and tables), how can they deal with each other?
5. I noticed one thing: I cannot use the Thrift API to manipulate data in tables created by CQL, and vice versa. How can I remember which table or column family was created in which way, so that I can use the correct API to process the data? For the time being, we don't have a general way to handle both, do we? AFAIK, the Thrift API and CQL do not share an interface, so they cannot understand each other?!
Could you please help explain these things? Thank you so much.
1. Yes. It's impossible to update the Thrift APIs to be CQL-aware without breaking existing applications. So if you use CQL you are committing to using CQL clients only like the Java driver, and not Astyanax, Hector, et al. But this is no great sacrifice since CQL is much more usable.
2. For a simple PK (i.e., single column), yes. For a compound PK, it's a bit more complicated.
3. No. The Thrift API operates at a lower level, by design. (So you'd see the individual storage cells that make up the Map, for instance.)
4. I don't understand the question. With CQL you can do everything you could do with Thrift, but more easily.
5. Simple; don't mix the two. Stick with one or the other.
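The answer about collections above says that a Thrift client would see the individual storage cells that make up a Map. A minimal sketch of that idea, as an in-memory Java model: one CQL map column is flattened into one cell per map entry. The class MapCellsSketch, the method flatten and the ':' separator are assumptions made for illustration; Cassandra's real composite cell-name encoding is different.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model only: shows why a low-level client sees many cells for one map column.
final class MapCellsSketch {
    // Flattens one CQL map column into cells named "<column>:<mapKey>".
    // The ':' separator is a stand-in for the real composite encoding.
    static TreeMap<String, String> flatten(String columnName, Map<String, String> mapValue) {
        TreeMap<String, String> cells = new TreeMap<>();
        for (Map.Entry<String, String> entry : mapValue.entrySet()) {
            cells.put(columnName + ":" + entry.getKey(), entry.getValue());
        }
        return cells;
    }

    public static void main(String[] args) {
        // Hypothetical column: phones map<text, text>
        Map<String, String> phones = new LinkedHashMap<>();
        phones.put("work", "555-0100");
        phones.put("home", "555-0199");
        // One cell per map entry, sorted by cell name.
        System.out.println(flatten("phones", phones));
    }
}
```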
In my opinion, the focus is shifting towards making Cassandra look like an RDBMS with SQL queries to gain wider adoption.
But with the inconsistencies between work done using Hector/Astyanax (Thrift) and CQL, I think it will hurt adoption. It's almost a U-turn from Hector/Astyanax to CQL in the middle of the journey.
At least CQL should have been planned in such a way that the Thrift API (and the high-level Java APIs on top of it) would have no problem transitioning.
My column family stores data in the column names, and I want to perform a range query on the columns using Astyanax. Can anyone suggest how to do that?
There are lots of examples of range queries available here
Sample one
keyspace.prepareQuery(CF_TIME_UUID)
    .getKey(rowKey)
    .withColumnRange(
        new RangeBuilder()
            .setLimit(10)
            .setStart(TimeUUIDUtils.getTimeUUID(0))
            .setEnd(TimeUUIDUtils.getTimeUUID(Long.MAX_VALUE >> 8))
            .build())
    .execute();
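To make clear what such a column range query returns, here is a minimal in-memory sketch of the same idea: columns are kept sorted by name, and a slice returns at most a given number of them between a start and an end name. It uses only java.util and is not Astyanax code. The class ColumnSliceSketch, the method slice and the use of long column names are assumptions for illustration.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory model of a column slice over one row; not the Astyanax API.
final class ColumnSliceSketch {
    // Returns up to `limit` columns whose names fall in [start, end], in name order.
    // Requires start <= end, otherwise subMap throws IllegalArgumentException.
    static NavigableMap<Long, String> slice(
            NavigableMap<Long, String> row, long start, long end, int limit) {
        NavigableMap<Long, String> result = new TreeMap<>();
        for (Map.Entry<Long, String> column : row.subMap(start, true, end, true).entrySet()) {
            if (result.size() >= limit) {
                break; // same role as setLimit(...) in the range builder
            }
            result.put(column.getKey(), column.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical row whose column names are timestamps.
        NavigableMap<Long, String> row = new TreeMap<>();
        row.put(100L, "a");
        row.put(200L, "b");
        row.put(300L, "c");
        row.put(400L, "d");
        System.out.println(slice(row, 150L, 400L, 2)); // prints {200=b, 300=c}
    }
}
```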
I agree with abhi, and this is exactly how PlayOrm has implemented it. You can see the code of the columnSlice() API at https://github.com/deanhiller/playorm/blob/master/src/main/java/com/alvazan/orm/layer9z/spi/db/cassandra/CassandraSession.java
Also, if you are using PlayOrm for Cassandra you can just use its ColumnSlice API. An example is given at https://github.com/deanhiller/playorm/blob/master/src/test/java/com/alvazan/test/TestColumnSlice.java
Actually, you could just use the PlayOrm query language as well if your final goal is getting the entities out.
I understand that with Cassandra it is possible to search using secondary indexes, but the problem is that I am trying to search on information in a super column. So I want to search on a value within a super column but return everything within that row (not just that one super column). Is this possible?
My understanding is that Facebook and Twitter use Cassandra, and so it would seem quite pointless if they have search facilities but it is not possible to search using something built into Cassandra.
Please correct me if I have not understood the proper use of super columns within Cassandra.
Thanks.
You cannot search on a super column value, as secondary indexes are not supported for SCs. You should avoid using super columns for a variety of reasons, but mostly because they are effectively deprecated. Most super column use cases are supported through the use of composites--which will ultimately replace SCs. In the meantime, if you must search for a value in a SC, you will have to do so manually (i.e. in code) or using an external tool such as Hadoop or Solr.
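A rough sketch of the "manual, in code" option, under stated assumptions: rows are modelled in memory with composite column names of the form "<group>:<field>" standing in for a super column plus sub-column, and the search scans every row and returns the whole matching row. The class ManualSuperColumnSearchSketch, the method rowsContaining and the ':' separator are hypothetical. A real implementation would read the rows through a client library, and a full scan like this is expensive on large data sets.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of a client-side search; not a Cassandra API.
final class ManualSuperColumnSearchSketch {
    // rows: row key -> composite column name "<group>:<field>" -> value.
    // Returns every complete row that has some "<group>:<field>" cell equal to `wanted`.
    static Map<String, TreeMap<String, String>> rowsContaining(
            Map<String, TreeMap<String, String>> rows, String field, String wanted) {
        Map<String, TreeMap<String, String>> matches = new TreeMap<>();
        for (Map.Entry<String, TreeMap<String, String>> row : rows.entrySet()) {
            for (Map.Entry<String, String> cell : row.getValue().entrySet()) {
                if (cell.getKey().endsWith(":" + field) && cell.getValue().equals(wanted)) {
                    matches.put(row.getKey(), row.getValue()); // keep the whole row
                    break;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, TreeMap<String, String>> rows = new TreeMap<>();
        TreeMap<String, String> r1 = new TreeMap<>();
        r1.put("address:city", "Paris");
        r1.put("address:zip", "75001");
        r1.put("job:title", "Developer");
        rows.put("user1", r1);
        TreeMap<String, String> r2 = new TreeMap<>();
        r2.put("address:city", "Rome");
        rows.put("user2", r2);
        // Only user1 matches, and all three of its cells are returned.
        System.out.println(rowsContaining(rows, "city", "Paris"));
    }
}
```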