Column-oriented database questions - Cassandra

Folks,
I have recently started reading about NoSQL databases, as I am currently working on a data-warehousing application. I have already read the basics, and I have the following questions.
Question 1) How is an entire row retrieved in a column-oriented database, given that data for the same column is stored together?
Let's say we store the data below; internally a column-oriented DB will then keep
test|test1 together and 5|10 together.
key 1 : { name : test, value : 5 }
key 2 : { name : test1 , value : 10 }
So if we have to retrieve the data for key1, how does that happen? (A and B are my guesses.)
A) If it has to pick the data from each column's storage separately, that will be very costly.
B) Is there an indexing mechanism to fetch the data of all columns for a given row key?
Question 2)
I was reading through some of the docs and found that a column-oriented database is better suited to running aggregation functions on a single column, since the I/O is lower.
However, I did not find proper support for aggregation functions like SUM, AVG, etc. in NoSQL column-oriented stores like Cassandra and HBase. (There could be some tweaking/hacking/extra code involved, as in the questions below.)
How does Apache Cassandra do aggregate operations?
realtime querying/aggregating millions of records - hadoop? hbase? cassandra?
How to use hbase coprocessor to implement groupby?
Question 3) How do joins happen internally in a column-oriented database, and is it advisable to do them?

Nice question,
1) In Cassandra, if you are using cqlsh, it will look as if you are storing data in MySQL or some other RDBMS:
Connected to Test Cluster at localhost:9160.
[cqlsh 3.1.7 | Cassandra 1.2.9 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
Use HELP for help.
cqlsh> create keyspace test with replication={'class':'SimpleStrategy', 'replication_factor': 1};
cqlsh> USE test ;
cqlsh:test> create table entry(key text PRIMARY KEY, name text, value int );
cqlsh:test> INSERT INTO entry (key, name , value ) VALUES ( 'key1', 'test',5);
cqlsh:test> INSERT INTO entry (key, name , value ) VALUES ( 'key2', 'test1',10);
cqlsh:test> select * from entry;
 key  | name  | value
------+-------+-------
 key1 |  test |     5
 key2 | test1 |    10
cqlsh:test>
Note: you can select rows by key, or by criteria on another column using a secondary index.
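For example (a sketch; the index name entry_name_idx is my own choice):
cqlsh:test> create index entry_name_idx on entry (name);
cqlsh:test> select * from entry where name = 'test1';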
But in HBase the structure will look like the following:
rowkey | column family | column | value
-------+---------------+--------+-------
key1   | entry         | name   | test
key1   | entry         | value  | 5
key2   | entry         | name   | test1
key2   | entry         | value  | 10
Note: you can select each row using the key or any column value; it's very easy.
2) NoSQL stores support batch operations only for DML statements; aggregation functions like SUM and AVG are not built in, so aggregation has to happen on the client or via extra code like the coprocessor approaches linked above.
3) Joins are not supported in any of the NoSQL datastores; they are not meant for joins.
Hope this helps.

Related

Can we use Cassandra as a Key-Value Store?

I have a simple question about using Cassandra as a key-value data store.
I am still new to Cassandra concepts and I have searched for an answer to this question in several places, but I could not find a solid one.
Let's consider the following scenario, where I have a set of sensor values to be stored in a database and the sensor names are only determined at runtime.
In scenario 1, I am getting the following values:
{"id": "001", "sensor1" : 25, "sensor2":15}
In scenario 2, I am getting the following values:
{"id": "002", "sensor1" : 25, "sensor3":30}
So in these scenarios, if I use a key-value datastore such as AWS DynamoDB, the stored data would look something like this:
+-----+---------+---------+---------+
| ID  | sensor1 | sensor2 | sensor3 |
+-----+---------+---------+---------+
| 001 | 25      | 15      |         |
| 002 | 25      |         | 30      |
+-----+---------+---------+---------+
But my problem is: can we accomplish the same thing with Cassandra, since it is not a complete key-value store?
In Cassandra, I need to define a table before I start persisting data into it, so I would need to know all of my sensor names beforehand.
I am not sure whether we can dynamically add columns to a Cassandra table when we add data through a CQL INSERT.
I am aware that I can store the data as a map inside one of the defined columns in Cassandra, but that is not what we do with key-value stores.
You need to put ID as the partition key and the sensor name as a clustering column, something like:
create table sensors (
    id int,
    sensor text,
    value int,
    primary key(id, sensor));
Then you can make queries like this (to select all data for a given measurement):
select * from sensors where id = 001;
or this (to select data for a given measurement and sensor):
select * from sensors where id = 001 and sensor = 'sensor1';
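For example, the two scenarios from the question map onto this table as follows (a sketch; the string ids "001"/"002" are stored here as the integers 1 and 2):
insert into sensors (id, sensor, value) values (1, 'sensor1', 25);
insert into sensors (id, sensor, value) values (1, 'sensor2', 15);
insert into sensors (id, sensor, value) values (2, 'sensor1', 25);
insert into sensors (id, sensor, value) values (2, 'sensor3', 30);
Each sensor becomes its own row within the id partition, so no schema change is needed when a new sensor name shows up at runtime.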

Cassandra CLUSTERING ORDER BY is not working and shows incorrect results

Hi, I have created a table for storing data like this:
CREATE TABLE keyspace.test (
name text,
date text,
time double,
entry text,
details text,
PRIMARY KEY ((name, date), time)
) WITH CLUSTERING ORDER BY (time DESC);
And inserted data into the table. But a query like this gives an unordered result:
SELECT * FROM keyspace.test where name ='anand' and date in ('2017-04-01','2017-04-02','2017-04-03','2017-04-05');
Is there any problem with my table design?
I think you are misunderstanding Cassandra's clustering key order. Cassandra sorts data by clustering key within a single partition.
That is, in your case Cassandra sorts data by the clustering key time within a single (name, date) partition.
Example: let's insert some data:
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-01', 1, 'a');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-01', 2, 'b');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-01', 3, 'c');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-02', 0, 'nil');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-02', 4, 'd');
If we select data with your query :
SELECT * FROM test where name ='anand' and date in ('2017-04-01','2017-04-02','2017-04-03','2017-04-05') ;
Output :
 name  | date       | time | details | entry
-------+------------+------+---------+-------
 anand | 2017-04-01 |    3 |    null |     c
 anand | 2017-04-01 |    2 |    null |     b
 anand | 2017-04-01 |    1 |    null |     a
 anand | 2017-04-02 |    4 |    null |     d
 anand | 2017-04-02 |    0 |    null |   nil
You can see that the times 3, 2, 1 within the single partition anand:2017-04-01 are sorted descending, and the times 4, 0 within the single partition anand:2017-04-02 are sorted descending. Cassandra does not take care of sorting across different partitions.
Here is the doc :
In the table definition, a clustering column is a column that is part of the compound primary key definition, but not the first column, which is the position reserved for the partition key. Columns are clustered in multiple rows within a single partition. The clustering order is determined by the position of columns in the compound primary key definition.
Source : http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html
By the way, why is your date field of type text and your time field of type double?
You could use the date type for the date field and the timestamp type for the time field.
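A sketch of the same table with those types (assuming Cassandra 2.2+, where the date type is available):
CREATE TABLE keyspace.test (
    name text,
    date date,
    time timestamp,
    entry text,
    details text,
    PRIMARY KEY ((name, date), time)
) WITH CLUSTERING ORDER BY (time DESC);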
The query that you are using is OK, but it probably doesn't behave as you expect because the coordinator will not sort the results across partitions. I have also run into this problem a couple of times.
The solution is very simple: it's far better to execute the 4 separate queries you need on the client and then merge the results there. In short, the IN operator puts a lot of pressure on the coordinator node in the cluster; there's a nice read on this subject:
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
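A minimal sketch of the per-partition queries (run them separately, then merge and order the rows on the client):
SELECT * FROM test WHERE name = 'anand' AND date = '2017-04-01';
SELECT * FROM test WHERE name = 'anand' AND date = '2017-04-02';
SELECT * FROM test WHERE name = 'anand' AND date = '2017-04-03';
SELECT * FROM test WHERE name = 'anand' AND date = '2017-04-05';
Each query hits exactly one partition, and within each partition the rows already come back ordered by time DESC.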

Duplicate rows/columns for the same primary key in Cassandra

I have a table/column family in Cassandra 3.7 with sensor data.
CREATE TABLE test.sensor_data (
    house_id int,
    sensor_id int,
    time_bucket int,
    sensor_time timestamp,
    sensor_reading map<int, float>,
    PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
);
Now when I select from this table I find duplicates for the same primary key, something I thought was impossible.
cqlsh:test> select * from sensor_data;
 house_id | sensor_id | time_bucket | sensor_time                     | sensor_reading
----------+-----------+-------------+---------------------------------+----------------
        1 |         2 |           3 | 2016-01-02 03:04:05.000000+0000 |       {1: 101}
        1 |         2 |           3 | 2016-01-02 03:04:05.000000+0000 |       {1: 101}
I think part of the problem is that this data has been written both "live" using Java and the DataStax Java driver, and loaded together with historic data from another source using sstableloader.
Regardless, this shouldn't be possible.
I have no way of connecting the legacy cassandra-cli to this cluster; perhaps that would have told me something I can't see using cqlsh.
So, the questions are:
* Is there any way this could happen under known circumstances?
* Can I read more raw data using cqlsh? Specifically, the write time of these two rows. The writetime() function can't operate on primary keys or collections, and that is all I have.
Thanks.
Update:
This is what I've tried, based on comments, answers and other sources:
* selecting using blobAsBigInt gives the same big integer for all identical rows
* connecting using cassandra-cli, after enabling Thrift, is possible, but reading the table isn't; it's not supported after 3.x
* dumping out using sstabledump is ongoing but expected to take another week or two ;)
I don't expect to see nanoseconds in a timestamp field, and additionally I'm under the impression they're not fully supported. Try this (the timestamp has to go through timestampAsBlob before blobAsBigint will accept it):
SELECT house_id, sensor_id, time_bucket, blobAsBigint(timestampAsBlob(sensor_time)) FROM test.sensor_data;
I WAS able to replicate it by inserting the rows via an integer:
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800000);
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800001);
This makes sense, because I suspect one of your drivers is using a bigint to insert the timestamp, while the other is actually using the datetime.
Tried playing with both timezones and bigints to reproduce this... it seems only the bigint route is reproducible:
 house_id | sensor_id | time_bucket | sensor_time              | sensor_reading
----------+-----------+-------------+--------------------------+----------------
        1 |         2 |           3 | 2016-01-02 00:00:00+0000 |           null
        1 |         2 |           4 | 2016-01-01 23:00:00+0000 |           null
        1 |         2 |           4 | 2016-01-02 00:00:00+0000 |           null
        1 |         2 |           4 | 2016-01-02 00:00:00+0000 |           null
        1 |         2 |           4 | 2016-01-02 01:01:00+0000 |           null
Edit: tried some shenanigans using a bigint in place of a datetime insert, and managed to reproduce it...
Adding some observations on top of what Nick mentioned,
Cassandra primary key = one partition key, or a combination of {partition key(s) + clustering key(s)}.
Keeping in mind that the partition key(s), written inside the inner parentheses, can be simple (one key) or composite (multiple keys) for unique identification, and that clustering keys sort data within a partition, the following has been observed (see the sketch after this list):
Query using SELECT: it is sufficient to query using all the partition key(s); additionally you can query using clustering key(s), but only in the order in which they appear in the primary key definition.
Update using UPDATE ... SET: the update statement needs condition clauses that include not only all the partition key(s) but also all the clustering key(s).
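A sketch against the sensor_data table from the question:
-- SELECT needs all partition key columns:
SELECT * FROM test.sensor_data WHERE house_id = 1 AND sensor_id = 2 AND time_bucket = 3;
-- UPDATE needs all partition key columns AND all clustering key columns:
UPDATE test.sensor_data SET sensor_reading = {1: 101}
WHERE house_id = 1 AND sensor_id = 2 AND time_bucket = 3 AND sensor_time = '2016-01-02 03:04:05+0000';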
Answering the question - is there any way this could happen under known circumstances?
Yes, it is possible when the same data is inserted from different sources.
To explain further: if one inserts data from code (an API etc.) into Cassandra and then inserts the same data from DataStax Studio or any other tool used for direct querying, a duplicate record appears.
If the same data is pushed multiple times from the code alone, or from the querying tool alone, or repeatedly from any other single source, the write behaves idempotently and the data is not inserted again.
A possible explanation is the way the underlying storage engine computes internal indexes or hashes to identify a row for a given set of columns (since the store is column-based).
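For instance, repeating an identical INSERT from one source behaves as an upsert rather than creating a duplicate (a sketch against the question's table):
INSERT INTO test.sensor_data (house_id, sensor_id, time_bucket, sensor_time, sensor_reading)
VALUES (1, 2, 3, '2016-01-02 03:04:05+0000', {1: 101});
-- running the exact same INSERT again overwrites the same cells; no second row appears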
Note:
The above behavior - duplication when the same data is pushed from different sources - has been observed, tested and validated.
Language used: C#
Framework: .NET Core 3
"sensor_time" is part of the primary key. It is not in "Partition Key", but is "Clustering Column". this is why you get two "rows".
However, in the disk table, both "visual rows" are stored on single Cassandra row. In reality, they are just different columns and CQL just pretend they are two "visual rows".
Clarification - I did not worked with Cassandra for a while so I might not use correct terms. When i say "visual rows", I mean what CQL result shows.
Update
You can create the following experiment (please ignore and fix any syntax errors I may have made).
This is supposed to create a table with a composite primary key:
"state" is the "Partition Key" and
"city" is the "Clustering Column".
create table cities(
    state int,
    city int,
    name text,
    primary key((state), city)
);
insert into cities(state, city, name) values(1, 1, 'New York');
insert into cities(state, city, name) values(1, 2, 'Corona');
select * from cities where state = 1;
This will return something like:
1, 1, New York
1, 2, Corona
But on the disk this will be stored in a single row like this:
+-------+-----------------+-----------------+
| state | city = 1 | city = 2 |
| +-----------------+-----------------+
| | city | name | city | name |
+-------+------+----------+------+----------+
| 1 | 1 | New York | 2 | Corona |
+-------+------+----------+------+----------+
When you have such a composite primary key, you can select or delete on it, e.g.:
select * from cities where state = 1;
delete from cities where state = 1;
In the question, primary key is defined as:
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
this means
"house_id", "sensor_id", "time_bucket" is "Partition Key" and
"sensor_time" is the "Clustering Column".
So when you select, the real row is split and shown as if there were several rows.
Update
http://www.planetcassandra.org/blog/primary-keys-in-cql/
The PRIMARY KEY definition is made up of two parts: the Partition Key
and the Clustering Columns. The first part maps to the storage engine
row key, while the second is used to group columns in a row. In the
storage engine the columns are grouped by prefixing their name with
the value of the clustering columns. This is a standard design pattern
when using the Thrift API. But now CQL takes care of transposing the
clustering column values to and from the non key fields in the table.
Then read the explanations in "The Composite Enchilada".

Can an index be created on a UUID Column?

Is it possible to create an index on a UUID/TIMEUUID column in Cassandra? I'm testing out a model design which would have an index on a UUID column, but queries on that column always return 0 rows found.
I have a table like this:
create table some_data (site_id int, user_id int, run_id uuid, value int, primary key((site_id, user_id), run_id));
I create an index with this command:
create index idx on some_data (run_id) ;
No errors are thrown by CQL when I create this index.
I have a small bit of test data in the table:
 site_id | user_id | run_id                               | value
---------+---------+--------------------------------------+-------
       1 |       1 | 9e118af0-ac92-11e4-81ae-8d1bc921f26d |     3
However, when I run the query:
select * from some_data where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d
CQLSH just returns: (0 rows)
If I use an int for the run_id then the index behaves as expected.
Yes, you can create a secondary index on a UUID. The real question is "should you?"
In any case, I followed your steps, and got it to work.
Connected to Test Cluster at 192.168.23.129:9042.
[cqlsh 5.0.1 | Cassandra 2.1.2 | CQL spec 3.2.0 | Native protocol v3]
Use HELP for help.
aploetz@cqlsh> use stackoverflow ;
aploetz@cqlsh:stackoverflow> create table some_data (site_id int, user_id int, run_id uuid, value int, primary key((site_id, user_id), run_id));
aploetz@cqlsh:stackoverflow> create index idx on some_data (run_id) ;
aploetz@cqlsh:stackoverflow> INSERT INTO some_data (site_id, user_id, run_id, value) VALUES (1,1,9e118af0-ac92-11e4-81ae-8d1bc921f26d,3);
aploetz@cqlsh:stackoverflow> select * from usr_rec3 where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d;
code=2200 [Invalid query] message="unconfigured columnfamily usr_rec3"
aploetz@cqlsh:stackoverflow> select * from some_data where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d;
 site_id | user_id | run_id                               | value
---------+---------+--------------------------------------+-------
       1 |       1 | 9e118af0-ac92-11e4-81ae-8d1bc921f26d |     3
(1 rows)
Notice though, that when I ran this command, it failed:
select * from usr_rec3 where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d
Are you sure that you didn't mean to select from some_data instead?
Also, creating secondary indexes on high-cardinality columns (like a UUID) is generally not a good idea. If you need to query by run_id, then you should revisit your data model and come up with an appropriate query table to serve that.
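For illustration, such a query table might look like this (a sketch; the table name some_data_by_run is my own):
create table some_data_by_run (
    run_id uuid,
    site_id int,
    user_id int,
    value int,
    primary key(run_id, site_id, user_id));
A lookup like select * from some_data_by_run where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d; then hits a single partition instead of a cluster-wide index scan, at the cost of writing each row twice.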
Clarification:
Using secondary indexes in general is not considered good practice. In the new book Cassandra High Availability, Robbie Strickland identifies their use as an anti-pattern, due to poor performance.
Just because a column is of the UUID data type doesn't necessarily make it high-cardinality. That's more of a data model question for you. But knowing the nature of UUIDs and their underlying purpose toward being unique, is setting off red flags.
Put these two points together, and there isn't anything about creating an index on a UUID that sounds appealing to me. If it were my cluster, and (more importantly) I had to support it later, I wouldn't do it.

Cassandra: can you add dynamic columns within existing column clustering?

I'm using Cassandra 1.2.12 with CQL 3, and am having trouble modeling my column family.
I currently store snapshots of customer data at particular times. Works great:
CREATE TABLE data (
cust_id varchar,
time timeuuid,
data_text text,
PRIMARY KEY (cust_id, time)
);
The cust_id is the partition key and time is the clustering column, so, as I understand it, I can think of each row in the table like this:
| cust_id | timeuuid1 : data_text | timeuuid2 : data_text |
| CUST1   | data at this time     | data at this time     |
Now I'd like to store another group of metrics for each snapshot - but the name of each of these columns isn't fixed. So something like:
| cust_id | timeuuid1 : data_text | timeuuid1 : dynamicCol1 | timeuuid1 : dynamicCol2 | timeuuid1 : dynamicColN |
| CUST1   | data                  | {some value}            | {some value}            | {some value}            |
I've achieved dynamic columns for the timestamp by using a composite primary key, but I can't see how to achieve this within each cluster of columns, if you see what I mean.
If I add, say, "dynamicColumnName" to the existing composite key, I'll end up with the customer data duplicated for each dynamic column, which is not what I want.
Is this possible without using a map column? Hope you can help, thanks!
I am not a CQL user... With the Thrift API you dynamically add a column to a column family by inserting/updating a record with a value for a column named X. The column X starts to exist right there and then for that record.
Have you tried an INSERT statement specifying a column that you have not explicitly defined? I would expect that to have the same effect (the column is created).
