I'm new to Cassandra and NoSQL databases. As I understand it, NoSQL means the database is schema-free, so it should accept whatever data you insert. I created a table in Cassandra with 5 fields. My first insert query supplied 5 values and succeeded. Next I tried 6 values, and it threw an error saying there are unmatched column names/values (the 6th field). If Cassandra is NoSQL, shouldn't that 6th field be inserted into the table?
I googled this. A few people suggested using an ALTER query to change the schema. If that is the case, I can alter the schema in SQL as well. So why would I need to move to NoSQL?
Is my understanding correct?
Unmatched column names/values
com.datastax.driver.core.exceptions.InvalidQueryException: Unmatched column names/values
Yes, Cassandra is a NoSQL database. A NoSQL database can be broadly defined as a database which stores and maintains data in a non-relational way (No SQL), can store web-scale data easily, can scale out, and is generally distributed. Cassandra ticks all the boxes to be called a NoSQL database.
Coming back to your question about requiring a schema: Cassandra used to let you (and still lets you, as a deprecated feature) add columns on the fly using the Thrift API. The Thrift API is going to be completely removed in Cassandra 4.0. Cassandra now supports schema-based CQL.
You can still design your table so that columns can be added dynamically using CQL, for example:
CREATE TABLE keyspace.table_name (
partition_key text,
column_name text,
column_value text,
PRIMARY KEY ((partition_key),column_name));
Each (column_name, column_value) pair is now a row. All rows with the same partition_key are grouped in one partition, and the rows within a partition are sorted by column_name.
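To make the grouping concrete, here is a minimal Python sketch that models the table above in memory. It is only an illustration of the layout, not how Cassandra is implemented; the sample keys and values and the helper name `group_rows` are made up.

```python
from collections import defaultdict

# Rows as (partition_key, column_name, column_value); the data is made up.
rows = [
    ("user1", "email", "a@example.com"),
    ("user1", "age", "30"),
    ("user2", "city", "Pune"),
    ("user1", "nickname", "al"),
]

def group_rows(rows):
    """Group rows by partition key and sort each group by column name."""
    partitions = defaultdict(dict)
    for partition_key, column_name, column_value in rows:
        # Writing the same (partition_key, column_name) again overwrites the value.
        partitions[partition_key][column_name] = column_value
    return {pk: dict(sorted(cols.items())) for pk, cols in partitions.items()}

grouped = group_rows(rows)
for pk, cols in grouped.items():
    print(pk, cols)
```

Each partition can hold a different set of "columns", which is what gives the schema-free feel even though the table itself has a fixed schema.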
Related
I want to use the IN clause on a non-primary-key column in Cassandra. Is that possible? If not, is there an alternative or suggestion?
Three possible solutions:
Create a secondary index. This is not recommended due to performance problems.
See if you can make that column part of the primary key of the existing table.
Create another denormalised table that is optimised for your query, i.e. model the data by query pattern.
Update:
Even after you move that column into the primary key, operations with the IN clause can be optimised further. I found this very useful: "cassandra lookup by list of primary keys in java".
I'm trying to learn Cassandra but I'm confused by the terminology.
In many places it says that a row stores key/value pairs.
But when I define a table, it is more like declaring a SQL table, i.e. you create a table and specify the column names and data types.
Can someone clarify this?
Cassandra is a column-based NoSQL database. At its lowest level it does store simple key-value pairs, but it stores these key-value pairs in collections. This grouping of keys and collections is analogous to rows and columns in a traditional relational model. Cassandra tables have a schema and can be queried (with restrictions) using a SQL-like language called CQL.
In your comment you ask about apples being stored in a different table from oranges. The answer to that specific question is no, they will be in the same table. However, Cassandra tables have an additional concept called the partition key that doesn't really have an analogous concept in the relational world. Take for example the following table definition:
CREATE TABLE fruit_types (
fruit text,
location text,
cost float,
PRIMARY KEY ((fruit), location)
);
In this table definition you will notice that we are defining the schema for the table. You will also notice that we are defining a PRIMARY KEY. This primary key is similar to, but not exactly like, the relational concept. In Cassandra the PRIMARY KEY is made up of two parts: the PARTITION KEY and the CLUSTERING COLUMNS. The PARTITION KEY is the first part of the PRIMARY KEY and can contain one or more fields, grouped by parentheses. The PARTITION KEY is hashed to determine the node that owns the data, and it is also used to physically divide the information on disk into files. The CLUSTERING COLUMNS are the remaining columns listed in the PRIMARY KEY and, amongst other things, define how the data is physically ordered on disk within each partition. I suggest you do some additional reading on the PRIMARY KEY here if you're interested in more detail:
https://docs.datastax.com/en/cql/3.0/cql/ddl/ddl_compound_keys_c.html
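As a rough illustration of the two roles, the following Python sketch hashes the partition key (`fruit`) to pick an owning node and keeps the rows of each partition sorted by the clustering column (`location`). This is a simplification under stated assumptions: real Cassandra uses a partitioner and a token ring rather than a plain modulo, and the node names, sample rows, and function names here are hypothetical.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names

def owner_node(partition_key, nodes=NODES):
    """Pick a node by hashing the partition key (simplified: hash modulo node count)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return nodes[token % len(nodes)]

def place_rows(rows):
    """Return {node: {fruit: [(location, cost), ...]}} with each partition sorted by location."""
    cluster = {}
    for fruit, location, cost in rows:
        partition = cluster.setdefault(owner_node(fruit), {}).setdefault(fruit, {})
        partition[location] = cost
    return {
        node: {fruit: sorted(cols.items()) for fruit, cols in parts.items()}
        for node, parts in cluster.items()
    }

# Made-up sample data for the fruit_types table.
rows = [("apple", "Spain", 1.2), ("apple", "Chile", 0.9), ("orange", "Spain", 1.5)]
layout = place_rows(rows)
print(layout)
```

All rows for the same fruit land on the same node, and within that partition they are ordered by location, which is why queries that specify the partition key are cheap.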
Basically, Cassandra storage is like a sparse matrix. Earlier versions had a command-line tool called cassandra-cli which could show the exact storage footprint of your column family (a.k.a. table in later versions). Later the community decided to adopt an RDBMS-like syntax for better understanding, because the query language (CQL) syntax is similar to SQL.
The main storage key is the partition key, which is the result of a hash function applied to the partition column(s) chosen in your table; the rest of the columns are attached to it like a sparse matrix.
I have a table like this
CREATE TABLE my_table(
category text,
name text,
PRIMARY KEY((category), name)
) WITH CLUSTERING ORDER BY (name ASC);
I want to write a query that will sort by name through the entire table, not just each partition.
Is that possible? What would be the "Cassandra way" of writing that query?
I've read other answers on Stack Overflow, and some examples create a single partition with one id (bucket) as the primary key, but I don't want that because I want my data spread across the nodes by category.
Cassandra doesn't support sorting across partitions; it only supports sorting within partitions.
So what you could do is query each category separately and it would return the sorted names for each partition. Then you could do a merge of those sorted results in your client (which is much faster than a full sort).
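The client-side merge can be sketched with Python's standard `heapq.merge`, which lazily combines inputs that are each already sorted. The per-category result lists below are made-up stand-ins for what each per-partition query would return.

```python
import heapq

def merge_sorted_partitions(partitions):
    """Lazily merge per-partition results that are each already sorted by name."""
    # Each row is a (category, name) tuple; we merge on the name.
    return heapq.merge(*partitions, key=lambda row: row[1])

# Hypothetical results of three per-category queries, each sorted by name.
fruit = [("fruit", "apple"), ("fruit", "mango")]
tools = [("tools", "drill"), ("tools", "saw")]
toys = [("toys", "ball"), ("toys", "yoyo")]

merged = list(merge_sorted_partitions([fruit, tools, toys]))
for category, name in merged:
    print(name, category)
```

Because the inputs are already sorted, the merge does far fewer comparisons than sorting all rows from scratch, and it can stream results instead of holding everything in memory.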
Another way would be to use Spark to read the table into an RDD and sort it inside Spark.
Always model Cassandra tables around the access patterns (relational databases and Cassandra fill different needs).
Up to Cassandra 2.x, one had to model a new column family (table) for each access pattern. So if your access pattern needs a specific column to be sorted, model a table with that column in the partition/clustering key. The code will then have to insert into both the master table and the projection table. Note that, depending on your business logic, this may be difficult to keep synchronised if there are concurrent updates, especially if an update has to be performed after a read on the projections.
With Cassandra 3.x there are now materialized views, which give you a similar feature but are handled internally by Cassandra. I'm not sure they fit your problem, as I haven't played much with 3.x, but they may be worth investigating.
There is more on materialized views on their blog.
I am new to Cassandra and I am using the DataStax Cassandra driver to access my keyspace. I have a legacy table which was created using the Cassandra Thrift client. I need to retrieve two column values from each partition in one query, like a multigetslice query in the Hector API. How can I do this using CQL and the DataStax Java driver?
--edit--
My column family is a legacy table, which looks like the following in cqlsh.
CREATE TABLE messages (
key blob,
column1 text,
value blob,
PRIMARY KEY ((key), column1)
);
I need to select two values for each key. I use this table to store the messages of each user, with the user id as the row key, the message id as the column name, and the message as the value. I need to show the two latest messages from each user.
Try using an IN filter condition.
I think you should execute one request per partition (and execute them concurrently if you need more than one partition). Assuming you want the top two in the natural order of 'column1':
SELECT column1, value FROM messages WHERE key=<blob> LIMIT 2;
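The "one request per partition, executed concurrently" idea can be sketched as follows. This is only an illustration in Python: `FAKE_TABLE` and `fetch_first_two` are hypothetical stand-ins for the real table and for running the query above through a driver, and no actual driver API is used.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up in-memory stand-in for the messages table: key -> rows sorted by column1.
FAKE_TABLE = {
    b"user1": [("m1", b"hi"), ("m2", b"hello"), ("m3", b"bye")],
    b"user2": [("m7", b"yo")],
}

def fetch_first_two(key):
    """Stand-in for: SELECT column1, value FROM messages WHERE key=? LIMIT 2."""
    return FAKE_TABLE.get(key, [])[:2]

def fetch_for_keys(keys, max_workers=4):
    """Run one request per partition key concurrently and collect results per key."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input keys.
        return dict(zip(keys, pool.map(fetch_first_two, keys)))

results = fetch_for_keys([b"user1", b"user2"])
print(results)
```

In a real application the stand-in function would issue the per-key query through your driver, ideally using its asynchronous execution facilities.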
I need details on both the performance and the query aspects. I learnt from some site that only a key can be given when using a column family. If so, what would you suggest for my keyspace? I need to use group by, order by, count, sum, ifnull, concat, joins, and sometimes nested queries.
To answer the original question you posed: a column family and a table are the same thing.
The name "column family" was used in the older Thrift API.
The name "table" is used in the newer CQL API.
More info on the APIs can be found here:
http://wiki.apache.org/cassandra/API
If you need to use "group by,order by,count,sum,ifnull,concat ,joins and some times nested querys" as you state then you probably don't want to use Cassandra, since it doesn't support most of those.
CQL supports COUNT, but only up to 10000. It supports ORDER BY, but only on clustering keys. The other things you mention are not supported at all.
Refer to the document: https://cassandra.apache.org/doc/old/CQL-3.0.html
It specifies that the CQL language reference supports the TABLE keyword wherever COLUMNFAMILY is supported.
This shows that TABLE and COLUMNFAMILY are synonyms.
In Cassandra there is no difference between a table and a column family. They are one concept.
For Cassandra 3+ and cqlsh 5.0.1
To verify, enter the following at a cqlsh prompt within a keyspace (ksp):
CREATE COLUMNFAMILY myTable (
... id text PRIMARY KEY,
... name int
... );
Then type 'desc myTable'.
You'll see something like this (table options omitted; unquoted names are lower-cased):
CREATE TABLE ksp.mytable (
    id text PRIMARY KEY,
    name int
);
They are synonyms, and Cassandra uses TABLE by default.
Here is a small example to help understand the concept.
A keyspace is an object that holds column families and user-defined types.
CREATE KEYSPACE University
WITH replication = {'class': 'SimpleStrategy',
'replication_factor': 3};
CREATE TABLE University.student (roll int PRIMARY KEY,
dept text,
name text,
semester int);
The 'Create table' statement creates the table 'student' in the keyspace 'University' with the columns roll, dept, name and semester. roll is the primary key, so it is also the partition key.
All the data for a given roll value will be in a single partition.
Key aspects while altering Keyspace in Cassandra
Keyspace Name: Keyspace name cannot be altered in Cassandra.
Strategy Name: Strategy name can be altered by specifying new strategy name.
Replication Factor: Replication factor can be altered by specifying new replication factor.
DURABLE_WRITES: The DURABLE_WRITES value can be altered by setting it to true or false. By default, it is true. If set to false, no updates will be written to the commit log, and vice versa.
Execution: An "Alter Keyspace" command can, for example, change the keyspace strategy from 'SimpleStrategy' to 'NetworkTopologyStrategy' and the replication factor from 3 to 1 for DataCenter1.
A column family is roughly comparable to a relational database table, with differences in how it is distributed and, arguably, in the philosophy behind it.
Imagine you have a user entity that might contain 15 columns. In a relational database you might split those columns into smaller structures of related columns, which we all know as tables. In a distributed database such as Cassandra you can concatenate the entries of all those tables into a single long row, so if you use a profiler or database manager you will see a single table with 15 columns instead of 2 or 3 tables. Another interesting thing is that the data of a column family is written to different nodes of the cluster and is located by the row key. This means you have a single key for the whole column family and don't need to maintain a PK or FK for every table, or maintain 1-1, 1-n and n-n relationships between them. Easy!