When I try to find information on composite columns, I cannot find anything newer than 2013 (specifically, the top Google link has no CQL code where it discusses using composite columns, and apparently uses a very old Java driver). Do composite columns still exist in newer versions of Cassandra? I mean, apart from having a composite key.
I am new to Cassandra and want to learn whether composite columns are suitable for my use case, described below. Consider a table with 4 double-valued columns, say w, x, y and z. The data are collected from 3 sources, say a, b and c. Each source may be missing some part of the data, so there is a maximum of 12 numbers in each row of the table.
Instead of creating 3 tables with 4 columns to store values from the different sources, and later merging the tables to fill in the missing fields, I am thinking of having a table that models the 4 data columns as 4 super columns or composite columns. Something like a:w, b:w, c:w, a:x, b:x, c:x, a:y, b:y, c:y, a:z, b:z, c:z. Additionally, every row has a timestamp as the primary key.
What I want to find out is whether I can have a query like SELECT *:w AS w FROM MyTable such that for every row, one value for w is returned from any source that is available (it doesn't matter which source). I also want to preserve the ability to retrieve data from a specific source, like SELECT a:w FROM MyTable.
-----------------------------------------------------------------
| key | a:w | b:w | c:w | a:x | b:x | c:x | a:y | b:y | c:y | ...
-----------------------------------------------------------------
|  1  | 10  | 10  |  -  | ....
|  2  |  -  |  1  |  2  | ....
|  3  | 11  |  -  |  -  | ....
|  4  | 12  | 11  | 11  | ....
-----------------------------------------------------------------
SELECT *:w AS w FROM MyTable
(10, 1, 11, 12) // would be an acceptable answer
SELECT a:w AS w FROM MyTable
(10, 11, 12) // would be an acceptable answer
"Composite column" is vocabulary from the Thrift protocol. Internally, up to Cassandra 2.2 the storage engine still deals with composite columns and translates them into clustering columns, the new vocabulary that comes with CQL.
Since Cassandra 3.x, the storage engine has been rewritten so data is no longer stored using composite columns; the storage engine is aligned with the new CQL semantics, i.e. partition keys and clustering columns. For backward compatibility, clustering columns are still translated back to composite column semantics when dealing with the legacy Thrift protocol.
If you are just starting with Cassandra, forget about the old Thrift protocol and use the CQL semantics right away.
For your needs, the following schema should do the job:
CREATE TABLE my_data(
    data text,
    source text,
    PRIMARY KEY ((data), source)
);
INSERT INTO my_data(data, source) VALUES('data1','src1');
INSERT INTO my_data(data, source) VALUES('data1','src2');
...
INSERT INTO my_data(data, source) VALUES('dataN','src1');
...
INSERT INTO my_data(data, source) VALUES('dataN','srcN');
//Select all sources for data1
SELECT source FROM my_data WHERE data='data1';
//Select data and source
SELECT * FROM my_data WHERE data='data1' AND source='src1';
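Mapping this back onto the original use case, a possible sketch (the table and column names here are illustrative, not part of the answer above) stores one row per timestamp and source, with the four values as regular columns:
CREATE TABLE measurements(
    ts timestamp,
    source text,
    w double,
    x double,
    y double,
    z double,
    PRIMARY KEY ((ts), source)
);
//All w values for a given timestamp, from whichever sources reported one
SELECT source, w FROM measurements WHERE ts = '2016-01-01 00:00:00+0000';
//The w value from one specific source
SELECT w FROM measurements WHERE ts = '2016-01-01 00:00:00+0000' AND source = 'a';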
Related
I have a simple question about using Cassandra as a key-value data store.
I am still new to Cassandra concepts and I have searched for an answer to this question in several places, but I could not find a solid answer.
Let's consider the following scenario, where I have a set of sensor values to be stored in a database and the sensor names are determined at runtime.
In scenario 1, I am getting the following values:
{"id": "001", "sensor1" : 25, "sensor2":15}
In scenario 2, I am getting the following values:
{"id": "002", "sensor1" : 25, "sensor3":30}
So in these scenarios, if I use a key-value datastore such as AWS DynamoDB to store the data, it would look something like this:
+-----+---------+---------+---------+
| ID | sensor1 | sensor2 | sensor3 |
+-----+---------+---------+---------+
| 001 | 25 | 15 | |
| 002 | 25 | | 30 |
+-----+---------+---------+---------+
But my problem is: can we accomplish the same thing with Cassandra, given that it is not a complete key-value store?
In Cassandra, I need to define a table before I start persisting data into it. So in that case I need to know all of my sensor names beforehand.
I am not sure whether we can dynamically add columns to a Cassandra table when we add data through an 'insert' CQL statement.
I am aware that I can store data as a map inside one of the defined columns in Cassandra, but that is not what we do with key-value stores.
You need to put ID as the partition key, and the sensor name as a clustering column, something like:
create table sensors (
    id int,
    sensor text,
    value int,
    primary key(id, sensor)
);
then you could make queries like this (to select all data for a given measurement):
select * from sensors where id = 001;
or this (to select data for a given measurement and sensor):
select * from sensors where id = 001 and sensor = 'sensor1';
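For instance, the two scenarios from the question would be stored as one row per (id, sensor) pair; a sketch of the inserts (using integer ids in place of the question's string ids, since the schema declares id as int):
insert into sensors (id, sensor, value) values (1, 'sensor1', 25);
insert into sensors (id, sensor, value) values (1, 'sensor2', 15);
insert into sensors (id, sensor, value) values (2, 'sensor1', 25);
insert into sensors (id, sensor, value) values (2, 'sensor3', 30);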
cqlsh create table:
CREATE TABLE emp(
emp_id int PRIMARY KEY,
emp_name text,
emp_city text,
emp_sal varint,
emp_phone varint
);
insert data
INSERT INTO emp (emp_id, emp_name, emp_city,
emp_phone, emp_sal) VALUES(1,'ram', 'Hyderabad', 9848022338, 50000);
select data
SELECT * FROM emp;
emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+-----------+----------+------------+---------
1 | Hyderabad | ram | 9848022338 | 50000
2 | Hyderabad | robin | 9848022339 | 40000
3 | Chennai | rahman | 9848022330 | 45000
This looks just the same as MySQL. Where is the column family, and where is the column?
A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns.
A column is the basic data structure of Cassandra with three values, namely key or column name, value, and a time stamp.
So the table emp is a column family?
And INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) VALUES(1,'ram', 'Hyderabad', 9848022338, 50000); creates a row which contains columns?
Is a column here something like emp_id => 1 or emp_name => 'ram'?
In Cassandra, although the column families are defined, the columns are not. You can freely add any column to any column family at any time.
What does this mean?
Can I have something like this?
emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+-----------+----------+------------+---------
1 | Hyderabad | ram | 9848022338 | 50000
2 | Hyderabad | robin | 9848022339 | 40000 | asdfasd | asdfasdf
3 | Chennai | rahman | 9848022330 | 45000
A super column is a special column, therefore, it is also a key-value pair. But a super column stores a map of sub-columns.
Where is the super column, and how do I create it?
Column family is an old name; now it's simply called a table.
Super column is also an old term; you now have the "Map" data type, for example, or user-defined types for more complex structures.
About freely adding columns: in the old days, Cassandra worked with an unstructured data paradigm, so you didn't have to define columns before inserting them. That is no longer possible, since the Cassandra team moved to a "structured"-only model (as much of the database industry concluded that unstructured data creates more problems than it is worth).
Anyway, Cassandra's data representation at the storage level is very different from MySQL's, and indeed only the columns that aren't empty are saved. It may look like the same row when you run a select from cqlsh, but it is stored and queried in a very different way.
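As an illustration of the "Map" alternative mentioned above, a sketch (the table name and map keys here are made up):
CREATE TABLE emp_attrs(
    emp_id int PRIMARY KEY,
    attrs map<text, text>
);
INSERT INTO emp_attrs(emp_id, attrs) VALUES(1, {'city': 'Hyderabad', 'name': 'ram'});
//Add or overwrite a single entry without rewriting the whole map
UPDATE emp_attrs SET attrs['dept'] = 'IT' WHERE emp_id = 1;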
The name column family is an old term for what's now simply called a table, such as "emp" in your example. Each table contains one or many columns, such as "emp_id", "emp_name".
Being able to "freely add columns at any time" means that you can always omit values for columns (they will be null) or add new columns using the ALTER TABLE statement.
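For example, a sketch against the emp table from the question (the new column name is made up):
ALTER TABLE emp ADD emp_dept text;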
I have a table/column family in Cassandra 3.7 with sensor data.
CREATE TABLE test.sensor_data (
house_id int,
sensor_id int,
time_bucket int,
sensor_time timestamp,
sensor_reading map<int, float>,
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
);
Now when I select from this table I find duplicates for the same primary key, something I thought was impossible.
cqlsh:test> select * from sensor_data;
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+---------------------------------+----------------
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
I think part of the problem is that this data has both been written "live" using Java and the DataStax Java driver, and been loaded together with historic data from another source using sstableloader.
Regardless, this shouldn't be possible.
I have no way of connecting the legacy cassandra-cli to this cluster; perhaps that would have told me something that I can't see using cqlsh.
So, the questions are:
* Is there anyway this could happen under known circumstances?
* Can I read more raw data using cqlsh? Specifically, the write time of these two rows. The writetime() function can't operate on primary keys or collections, and that is all I have.
Thanks.
Update:
This is what I've tried, from comments, answers and other sources
* selecting using blobAsBigInt gives the same big integer for all identical rows
* connecting using cassandra-cli, after enabling thrift, is possible but reading the table isn't. It's not supported after 3.x
* dumping out using sstabledump is ongoing but expected to take another week or two ;)
I wouldn't expect to see nanoseconds in a timestamp field, and additionally I'm of the impression they're not fully supported. Try this:
SELECT house_id, sensor_id, time_bucket, blobAsBigint(timestampAsBlob(sensor_time)) FROM test.sensor_data;
I WAS able to replicate it by inserting the rows via an integer:
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800000);
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800001);
This makes sense, because I would suspect one of your drivers is using a bigint to insert the timestamp, and the other is likely actually using the datetime.
Tried playing with both timezones and bigints to reproduce this... seems like only bigint is reproducible:
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+--------------------------+----------------
1 | 2 | 3 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-01 23:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 01:01:00+0000 | null
edit: Tried some shenanigans using bigint in place of datetime insert, managed to reproduce...
Adding some observations on top of what Nick mentioned:
Cassandra primary key = one or a combination of {partition key(s) + clustering key(s)}
Keeping in mind that the partition key(s), written within the inner parentheses, can be simple (one key) or composite (multiple keys) and serve to uniquely identify a partition, while the clustering key(s) sort the data within it, the following has been observed.
Query using select: it is sufficient to query using all the partition key(s); additionally you can filter on clustering key(s), but only in the same order in which they are declared in the primary key at table creation.
Update using set or update: the update statement's condition clause must include not only all the partition key(s) but also all the clustering key(s), as sketched below.
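A sketch of both rules against the sensor_data table from the question:
//Select: the partition key columns alone are sufficient
SELECT * FROM test.sensor_data WHERE house_id = 1 AND sensor_id = 2 AND time_bucket = 3;
//Update: every partition key column AND every clustering column is required
UPDATE test.sensor_data SET sensor_reading = {1: 101}
WHERE house_id = 1 AND sensor_id = 2 AND time_bucket = 3
AND sensor_time = '2016-01-02 03:04:05+0000';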
Answering the question: is there any way this could happen under known circumstances?
Yes, it is possible when the same data is inserted from different sources.
To explain further: if one inserts data from code (an API etc.) into Cassandra and then inserts the same data from DataStax Studio or any other tool used for direct querying, a duplicate record can be inserted.
If the same data is pushed multiple times from code alone, from a querying tool alone, or repeatedly from any other single source performing the same operation, the write behaves idempotently and the data is not inserted again.
A possible explanation is the way the underlying storage engine computes internal indexes or hashes to identify a row for a given set of columns (since the store is column-based).
Note:
The above information about duplication when the same data is pushed from different sources has been observed, tested and validated.
Language used: C#
Framework: .NET Core 3
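The idempotent single-source case is easy to demonstrate (a sketch; inserting the same full primary key twice is an upsert, so only one row results):
INSERT INTO test.sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1, 2, 3, '2016-01-02 03:04:05+0000');
INSERT INTO test.sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1, 2, 3, '2016-01-02 03:04:05+0000');
//Still a single row
SELECT count(*) FROM test.sensor_data WHERE house_id = 1 AND sensor_id = 2 AND time_bucket = 3;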
"sensor_time" is part of the primary key. It is not in "Partition Key", but is "Clustering Column". this is why you get two "rows".
However, in the disk table, both "visual rows" are stored on single Cassandra row. In reality, they are just different columns and CQL just pretend they are two "visual rows".
Clarification - I did not worked with Cassandra for a while so I might not use correct terms. When i say "visual rows", I mean what CQL result shows.
Update
You can create the following experiment (please forgive and fix any syntax errors I may make).
This is supposed to create a table with a composite primary key:
"state" is the "Partition Key" and
"city" is the "Clustering Column".
create table cities(
    state int,
    city int,
    name text,
    primary key((state), city)
);
insert into cities(state, city, name) values(1, 1, 'New York');
insert into cities(state, city, name) values(1, 2, 'Corona');
select * from cities where state = 1;
this will return something like:
1, 1, New York
1, 2, Corona
But on disk this will be stored in a single row, like this:
+-------+-----------------+-----------------+
| state | city = 1 | city = 2 |
| +-----------------+-----------------+
| | city | name | city | name |
+-------+------+----------+------+----------+
| 1 | 1 | New York | 2 | Corona |
+-------+------+----------+------+----------+
When you have such a composite primary key, you can select or delete on the partition key alone, e.g.
select * from cities where state = 1;
delete from cities where state = 1;
In the question, the primary key is defined as:
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
this means that
"house_id", "sensor_id" and "time_bucket" form the "Partition Key", and
"sensor_time" is the "Clustering Column".
So when you select, the real row is split and shown as if there were several rows.
Update
http://www.planetcassandra.org/blog/primary-keys-in-cql/
The PRIMARY KEY definition is made up of two parts: the Partition Key
and the Clustering Columns. The first part maps to the storage engine
row key, while the second is used to group columns in a row. In the
storage engine the columns are grouped by prefixing their name with
the value of the clustering columns. This is a standard design pattern
when using the Thrift API. But now CQL takes care of transposing the
clustering column values to and from the non key fields in the table.
Then read the explanations in "The Composite Enchilada".
I am a newbie in Cassandra. I am not sure about adding a set of columns to a row many times; e.g., I want to add call-related information columns (columns like timestamp_calling_no, timestamp_tower_id, timestamp_start_time, timestamp_end_time, timestamp_duration, timestamp_call_type, etc.) to a row whenever the same mobile number makes a call, using Hector/Astyanax/Java/CQL.
Please give your suggestions. Thanks in advance.
Adding columns to a row is an inexpensive operation, and Cassandra stores column names sorted, so using a timestamp as a column name solves the problem of slicing. Cassandra can have 2 billion columns in a row in the CDR CF, so we can easily keep adding columns.
If you were trying to run a query which required Cassandra to scan all rows, then yes, it would perform poorly.
It's good that you recognise that there are a number of APIs available. I would recommend using CQL3, mostly because the Thrift APIs are now kept only for backwards compatibility, and because CQL can be used with most languages while Astyanax and Hector are Java-specific.
I'd use a compound key (under "Clustering, compound keys, and more").
//An example keyspace using CQL2
cqlsh>
CREATE KEYSPACE calls WITH strategy_class = 'SimpleStrategy'
AND strategy_options:replication_factor = 1;
//And next create a CQL3 table with a compound key
//Compound key is formed from the number and call's start time
cqlsh>
CREATE TABLE calls.calldata (
    number text,
    timestamp_start_time timestamp,
    timestamp_end_time timestamp,
    PRIMARY KEY (number, timestamp_start_time)
) WITH COMPACT STORAGE;
The above schema allows you to insert rows containing the same number as a key multiple times, but because the start of the call is part of the key, the combination creates a unique key each time.
Next, insert some test data using CQL3 (for the purpose of the example, of course):
cqlsh> //The example data below uses 2 different numbers
insert into calls.calldata (number, timestamp_start_time, timestamp_end_time)
values ('+441234567890', 1335361733545850, 1335361773545850);
insert into calls.calldata (number, timestamp_start_time, timestamp_end_time)
values ('+440987654321', 1335361734678700, 1335361737678700);
insert into calls.calldata (number, timestamp_start_time, timestamp_end_time)
values ('+441234567890', 1335361738208700, 1335361738900032);
insert into calls.calldata (number, timestamp_start_time, timestamp_end_time)
values ('+441234567890', 1335361740100277, 1335361740131251);
insert into calls.calldata (number, timestamp_start_time, timestamp_end_time)
values ('+440987654321', 1335361740176666, 1335361740213000);
And now we can retrieve all the data (again using CQL3):
cqlsh> SELECT * FROM calls.calldata;
number | timestamp_start_time | timestamp_end_time
---------------+---------------------------+---------------------------
+440987654321 | 44285-12-05 15:11:18+0000 | 44285-12-05 16:01:18+0000
+440987654321 | 44285-12-05 16:42:56+0000 | 44285-12-05 16:43:33+0000
+441234567890 | 44285-12-05 14:52:25+0000 | 44285-12-06 01:59:05+0000
+441234567890 | 44285-12-05 16:10:08+0000 | 44285-12-05 16:21:40+0000
+441234567890 | 44285-12-05 16:41:40+0000 | 44285-12-05 16:42:11+0000
Or part of the data. Because a compound key is used, you can retrieve all the rows for a specific number using CQL3:
cqlsh> SELECT * FROM calls.calldata WHERE number='+441234567890';
number | timestamp_start_time | timestamp_end_time
---------------+---------------------------+---------------------------
+441234567890 | 44285-12-05 14:52:25+0000 | 44285-12-06 01:59:05+0000
+441234567890 | 44285-12-05 16:10:08+0000 | 44285-12-05 16:21:40+0000
+441234567890 | 44285-12-05 16:41:40+0000 | 44285-12-05 16:42:11+0000
And if you want to be really specific, you can retrieve a specific number where the call started at a specific time (again thanks to the compound key)
cqlsh> SELECT * FROM calls.calldata WHERE number='+441234567890'
and timestamp_start_time=1335361733545850;
number | timestamp_start_time | timestamp_end_time
---------------+---------------------------+---------------------------
+441234567890 | 44285-12-05 14:52:25+0000 | 44285-12-06 01:59:05+0000
In frameworks like PlayOrm, there are multiple ways to do that. For example, you can use the @OneToMany or @NoSqlEmbedded pattern. For more details, visit http://buffalosw.com/wiki/Patterns-Page/
I am pretty new to NoSQL and Cassandra, but I was told by my architecture committee to use it. I just want to understand how to convert an RDBMS model to NoSQL.
I have a database where the user needs to import data from an Excel or CSV file. This file may have different columns each time.
For example, in the Excel file the data might look something like this:
Name | AName    | Industry | Interest Point | Start Date | End Date
x    | 111-121  | IT       | 2              | 1/1/2011   | 1/2/2011
x    | 111-122  | hotel    | 1              | ""         | ""
y    | 111-1000 | IT       | 2              | 1/1/2011   | 1/2/2011
After we upload this, the next Excel file might look like:
Name | AName    | Industry | Interest Point | Start Date | End Date | isTrue | isNegative
x    | 111-121  | IT       | 2              | 1/1/2011   | 1/2/2011 | yes    | no
x    | 111-122  | hotel    | 1              | ""         |          | no     | no
y    | 111-1000 | health   | 2              | 1/1/2010   |          | yes    | ""
I would not know in advance what columns I am going to create when importing the data. I am totally confused with NoSQL and unable to understand how to handle this, or how to import data when I don't know the table structure.
Start with the basic fact that a column family (Cassandra's term for a table) is made up of rows. Each row has a row key and some number of key/value pairs (called columns). For a particular column in a row, the name of the column is the key of the pair and the value of the column is the value of the pair. Just because you have a column by some name in one row does not necessarily mean you'll have a column by that name in any other row.
Internally, row keys, column names and column values are stored as byte arrays and you'll need to use serializers to convert program data to the byte arrays and back again.
It's up to you as to how you define the row key, column name and column value.
One approach would be to have a row in the CF correspond to a row from Excel. You'd have to identify the one Excel column that provides a unique id and store that in the row key. The remainder of the Excel columns can be stored in Cassandra columns, one-to-one. This lets you be very flexible with most column names, but you have to have a unique key value somewhere. The unique key requirement will always hold for any storage scheme you use.
There are other storage schemes, but they all boil down to defining in the Excel what your row key is and how you break the Excel data into key/value pairs, as sketched below.
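In CQL terms, one sketch of this approach stores each Excel cell as its own entry, keyed by the spreadsheet row's unique id and clustered by column name (the table and column names here are made up):
CREATE TABLE imported_rows(
    row_key text,
    column_name text,
    column_value text,
    PRIMARY KEY ((row_key), column_name)
);
//One Excel row becomes several entries, whatever its columns happen to be
INSERT INTO imported_rows(row_key, column_name, column_value) VALUES('111-121', 'Industry', 'IT');
INSERT INTO imported_rows(row_key, column_name, column_value) VALUES('111-121', 'Interest Point', '2');
//Rebuild the Excel row
SELECT column_name, column_value FROM imported_rows WHERE row_key = '111-121';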
Check out some NoSQL patterns, and I highly suggest reading "Building on Quicksand" by Pat Helland.
Some good patterns (with or without using PlayOrm):
http://buffalosw.com/wiki/Patterns-Page/