Duplicate rows/columns for the same primary key in Cassandra - cassandra

I have a table/columnfamily in Cassandra 3.7 with sensordata.
CREATE TABLE test.sensor_data (
house_id int,
sensor_id int,
time_bucket int,
sensor_time timestamp,
sensor_reading map<int, float>,
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
)
Now when I select from this table I find duplicates for the same primary key, something I thought was impossible.
cqlsh:test> select * from sensor_data;
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+---------------------------------+----------------
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
I think part of the problem is that this data has both been written "live" using java and Datastax java driver, and it has been loaded together with historic data from another source using sstableloader.
Regardless, this shouldn't be possible.
I have no way of connecting with the legacy cassandra-cli to this cluster, perhaps that would have told me something that I can't see using cqlsh.
So, the questions are:
* Is there anyway this could happen under known circumstances?
* Can I read more raw data using cqlsh? Specifically write time of these two rows. the writetime()-function can't operate on primary keys or collections, and that is all I have.
Thanks.
Update:
This is what I've tried, from comments, answers and other sources
* selecting using blobAsBigInt gives the same big integer for all identical rows
* connecting using cassandra-cli, after enabling thrift, is possible but reading the table isn't. It's not supported after 3.x
* dumping out using sstabledump is ongoing but expected to take another week or two ;)

I don't expect to see nanoseconds in a timestamp field and additionally i'm of the impression they're fully not supported? Try this:
SELECT house_id, sensor_id, time_bucket, blobAsBigint(sensor_time) FROM test.sensor_data;
I WAS able to replicate it doing by inserting the rows via an integer:
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800000);
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800001);
This makes sense because I would suspect one of your drivers is using a bigint to insert the timestamp, and one is likely actually using the datetime.
Tried playing with both timezones and bigints to reproduce this... seems like only bigint is reproducable
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+--------------------------+----------------
1 | 2 | 3 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-01 23:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 01:01:00+0000 | null
edit: Tried some shenanigans using bigint in place of datetime insert, managed to reproduce...

Adding some observations on top of what Nick mentioned,
Cassandra Primary key = one or combination of {Partition key(s) + Clustering key(s)}
Keeping in mind the concepts of partition keys used within angular brackets which can be simple (one key) or composite (multiple keys) for unique identification and clustering keys to sort data, the below have been observed.
Query using select: sufficient to query using all the partition key(s) provided, additionally can query using clustering key(s) but in the same order in which they have been mentioned in primary key during table creation.
Update using set or update: the update statement needs to have search/condition clauses which not only include all the partition key(s) but also all the clustering key(s)
Answering the question - Is there anyway this could happen under known circumstances?
Yes, it is possible when same data is inserted from different sources.
To explain further, incase one tries to insert data from code (API etc) into Cassandra and then tries inserting the same data from DataStax Studio/any tool used to perform direct querying, a duplicate record is inserted.
Incase the same data is being pushed multiple times either from code alone or querying tool alone or from another source used to do the same operation multiple times, the data behaves idempotently and is not inserted again.
The possible explanation could be the way the underlying storage engine computes internal indexes or hashes to identify a row pertaining to set of columns (since column based).
Note:
The above information of duplicacy incase same data is pushed from different sources has been observed, tested and validated.
Language used: C#
Framework: .NET Core 3

"sensor_time" is part of the primary key. It is not in "Partition Key", but is "Clustering Column". this is why you get two "rows".
However, in the disk table, both "visual rows" are stored on single Cassandra row. In reality, they are just different columns and CQL just pretend they are two "visual rows".
Clarification - I did not worked with Cassandra for a while so I might not use correct terms. When i say "visual rows", I mean what CQL result shows.
Update
You can create following experiment (please ignore and fix any syntax errors I will do).
This suppose to do table with composite primary key:
"state" is "Partition Key" and
"city" is "Clustering Column".
create table cities(
state int,
city int,
name text,
primary key((state), city)
);
insert into cities(state, city, name)values(1, 1, 'New York');
insert into cities(state, city, name)values(1, 2, 'Corona');
select * from cities where state = 1;
this will return something like:
1, 1, New York
1, 2, Corona
But on the disk this will be stored on single row like this:
+-------+-----------------+-----------------+
| state | city = 1 | city = 2 |
| +-----------------+-----------------+
| | city | name | city | name |
+-------+------+----------+------+----------+
| 1 | 1 | New York | 2 | Corona |
+-------+------+----------+------+----------+
When you have such composite primary key you can select or delete on it, e.g.
select * from cities where state = 1;
delete from cities where state = 1;
In the question, primary key is defined as:
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
this means
"house_id", "sensor_id", "time_bucket" is "Partition Key" and
"sensor_time" is the "Clustering Column".
So when you select, the real row is spitted and show as if there are several rows.
Update
http://www.planetcassandra.org/blog/primary-keys-in-cql/
The PRIMARY KEY definition is made up of two parts: the Partition Key
and the Clustering Columns. The first part maps to the storage engine
row key, while the second is used to group columns in a row. In the
storage engine the columns are grouped by prefixing their name with
the value of the clustering columns. This is a standard design pattern
when using the Thrift API. But now CQL takes care of transposing the
clustering column values to and from the non key fields in the table.
Then read the explanations in "The Composite Enchilada".

Related

Cassandra find records where list is empty [duplicate]

How do I query in cassandra for != null columns.
Select * from tableA where id != null;
Select * from tableA where name != null;
Then I wanted to store these values and insert these into different table.
I don't think this is possible with Cassandra. First of all, Cassandra CQL doesn't support the use of NOT or not equal to operators in the WHERE clause. Secondly, your WHERE clause can only contain primary key columns, and primary key columns will not allow null values to be inserted. I wasn't sure about secondary indexes though, so I ran this quick test:
create table nullTest (id text PRIMARY KEY, name text);
INSERT INTO nullTest (id,name) VALUES ('1','bob');
INSERT INTO nullTest (id,name) VALUES ('2',null);
I now have a table and two rows (one with null data):
SELECT * FROM nullTest;
id | name
----+------
2 | null
1 | bob
(2 rows)
I then try to create a secondary index on name, which I know contains null values.
CREATE INDEX nullTestIdx ON nullTest(name);
It lets me do it. Now, I'll run a query on that index.
SELECT * FROM nullTest WHERE name=null;
Bad Request: Unsupported null value for indexed column name
And again, this is done under the premise that you can't query for not null, if you can't even query for column values that may actually be null.
So, I'm thinking this can't be done. Also, if null values are a possibility in your primary key, then you may want to re-evaluate your data model. Again, I know the OP's question is about querying where data is not null. But as I mentioned before, Cassandra CQL doesn't have a NOT or != operator, so that's going to be a problem right there.
Another option, is to insert an empty string instead of a null. You would then be able to query on an empty string. But that still doesn't get you past the fundamental design flaw of having a null in a primary key field. Perhaps if you had a composite primary key, and only part of it (the clustering columns) had the possibility of being empty (certainly not part of the partitioning key). But you'd still be stuck with the problem of not being able to query for rows that are "not empty" (instead of not null).
NOTE: Inserting null values was done here for demonstration purposes only. It is something you should do your best to avoid, as inserting a null column value WILL create a tombstone. Likewise, inserting lots of null values will create lots of tombstones.
1) select * from test;
name | id | address
------------------+----+------------------
bangalore | 3 | ramyam_lab
bangalore | 4 | bangalore_ramyam
bangalore | 5 | jasgdjgkj
prasad | 11 | null
prasad | 12 | null
india | 6 | karnata
india | 7 | karnata
ramyam-bangalore | 3 | jasgdjgkj
ramyam-bangalore | 5 | jasgdjgkj
2)cassandra does't support null values selection.It is showing null for our understanding.
3) For handling null values use another strings like "not-available","null",then we can select data

Can we use Cassandra as a Key-Value Store?

I have a simple question about using Cassandra as a key-value data store.
I am stil new to Casandra concepts and I have searched for an answer for this question in several places but, I could not find a solid answer for this.
Let's consider following scenario where I have set of sensor values to be stored in a database where the sensor names will be determined in runtime.
In scenario 1, I am getting following values
{"id": "001", "sensor1" : 25, "sensor2":15}
In scenario 2, I am getting following values,
{"id": "002", "sensor1" : 25, "sensor3":30}
So in these scenarios, If I use a key-value datastore such as AWS Dynamodb to store data then it would look like something as follows,
+-----+---------+---------+---------+
| ID | sensor1 | sensor2 | sensor3 |
+-----+---------+---------+---------+
| 001 | 25 | 15 | |
| 002 | 25 | | 30 |
+-----+---------+---------+---------+
But, my problem is can we accomplish the same thing with Cassandra since it is not complete key-value store.
In cassandra, I need to define table before I start persisting data into it. So in that case I need to know all of my sensor names prior to that.
I am not sure whether we can dynamically add columns to Cassandra table when we add data through 'insert' cql.
I am aware that I can store data as a map inside one of defined column in cassandra but, that it not what we do with key-value stores.
You need to put ID as partition key, and the sensor name as clustering column, something like:
create table sensors (
id int,
sensor text,
value int,
primary key(id, sensor));
then you could make queries like this (to select all data for given measurement):
select * from sensors where id = 001;
or his (to select data for given measurement and sensor):
select * from sensors where id = 001 and sensor = 'sensor1';

Cassandra compound clustering key and queries with ordering

We use cassandra wide rows heavily to store per user time-series as they are perfect for that use-case. Let's assume we have a table:
create table user_events (
user_id text,
timestmp timestamp,
event text,
primary key((user_id), timestmp));
What if clashes on timestamp may happen (same user can emit two different events with the same timestamp). What is the best way to tweak this schema to resolve that assuming we have an ordering for all events present (have a sequence int for each event).
If I modify schema the following way:
create table user_events (
user_id text,
timestmp timestamp,
seq int,
event text,
primary key((user_id), timestmp, seq));
I won’t be able to do WHERE user_id = ? ORDER BY timestmp ASC, seq ASC – cassandra does not allow that.
I won’t be able to do WHERE user_id = ? ORDER BY timestmp ASC, seq ASC – cassandra does not allow that.
You might be seeing an error because you are repeating ASC. This should work:
WHERE user_id = ? ORDER BY timestmp,seq ASC
Also, as long as you have defined your primary key as PRIMARY KEY((user_id),timestmp,seq)) you don't even need to specify ORDER BY x[,y] ASC. It will cluster the data on disk in that order, and thus return it to you already sorted in that order. ORDER BY should only be necessary when you want to put your results in descending order (or whatever the opposite of how you have it defined is).
What if clashes on timestamp may happen?
I think your extra seq column should be sufficient, depending on how you plan on inserting the data. If you are setting the timestmp from the client, then you should be ok. However, look what happens when I (using your second table) INSERT rows while creating the timestamp two different ways.
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('Mal',dateof(now()),1,'commanding');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('Wash',dateof(now()),1,'piloting');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River',dateof(now()),1,'freaking out');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River',dateof(now()),3,'being weird');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River',dateof(now()),2,'killing reavers');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River','2015-01-13 13:14-0600',1,'freaking out');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River','2015-01-13 13:14-0600',3,'being weird');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River','2015-01-13 13:14-0600',2,'killing reavers');
Querying that data by a user_id of "River" yields:
aploetz#cqlsh:stackoverflow> SELECT * FROM user_events WHERE user_id='River';
user_id | timestmp | seq | event
---------+--------------------------+-----+-----------------
River | 2015-01-13 13:14:00-0600 | 1 | freaking out
River | 2015-01-13 13:14:00-0600 | 2 | killing reavers
River | 2015-01-13 13:14:00-0600 | 3 | being weird
River | 2015-01-14 12:58:41-0600 | 1 | freaking out
River | 2015-01-14 12:58:57-0600 | 3 | being weird
River | 2015-01-14 12:58:57-0600 | 2 | killing reavers
(6 rows)
Notice that using the now() function to generate a timeuuid, and then converting that to a timestamp with dateof() causes the two rows with the timestmp "2015-01-14 12:58:57-0600" to appear to be the same. But they are not the same, as you can tell by the seq column.
So just a bit of caution on using/generating timestamps. They might look the same, but they may not be stored as the same value. Just to be on the safe side, I would use a timeuuid instead.

Column oriented database related

Folks,
I am currently have started reading about NOSQL related DB as currently working on Database warehousing related application.
I have following questions. I have already read basics.
Question 1) How entire raw is retrived in column oriented database as data with same column is stored together ?
lets say we store data in following format so internally it will be stored like this in column oriented DB.
test|test1 together and 5|10 together.
key 1 : { name : test, value : 5 }
key 2 : { name : test1 , value : 10 }
So if we have to retrive data for key1 how does it happen ? (A and B is my guess)
A) If it has to pick data from each column storage seperately then it will be very costly
B) is there any indexing mechanism to fetch this data for all columns for given raw key ?
Question 2 )
I was reading through some of the docs and found column oriented Database is more suited to run aggregation function on single column as I/O will be less.
I didnot find proper support for aggregation function like SUM,AVG etc in NOSQL column oriented store like cassandra and HBASE. ( There could be some tweaking/hacking/more code writing like below)
How does Apache Cassandra do aggregate operations?
realtime querying/aggregating millions of records - hadoop? hbase? cassandra?
How to use hbase coprocessor to implement groupby?
Question 3 ) How the joins happens internally in column oriented database is it advisable to do ?
Nice question,
1) In Cassandra if you are using cqlsh then it will look like as you store data in mysql or some other rdbms stores.
Connected to Test Cluster at localhost:9160.
[cqlsh 3.1.7 | Cassandra 1.2.9 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
Use HELP for help.
cqlsh> create keyspace test with replication={'class':'SimpleStrategy', 'replication_factor': 1
<value>
cqlsh> create keyspace test with replication={'class':'SimpleStrategy', replication_factor': 1};
cqlsh> USE test ;
cqlsh:test> create table entry(key text PRIMARY KEY, name text, value int );
cqlsh:test> INSERT INTO entry (key, name , value ) VALUES ( 'key1', 'test',5);
cqlsh:test> INSERT INTO entry (key, name , value ) VALUES ( 'key2', 'test1',10);
cqlsh:test> select * from entry;
key | name | value
------+-------+-------
key1 | test | 5
key2 | test1 | 10
cqlsh:test>
Note:- you can select rows using key or using some criteria on other column by using secondary indexes.
But in hbase the structure will look like following
rowkey | column family | column | value
key1 | entry | name | test
key1 | entry | value | 5
key2 | entry | name | test1
key2 | entry | value | 10
Note:- you can select each row using key or any column value its very easy.
2) Yes nosqls also supports batch operation only for DMLs.
3) Joins are not supported in none of nosqls datastores. They are not meant for joins.
Hope it will help you.

Timestamp / date as key for cassandra column family / hector

I have to create and query a column family with composite key as [timestamp,long]. Also,
while querying I want to fire range query for timestamp (like timestamp between xxx and yyy) Is this possible ?
Currently I am doing something really funny (Which I know its not correct). I create keys with timestamp string for given range and concatenate with long.
like ,
1254345345435-1234
3423432423432-1234
1231231231231-9999
and pass set of keys to hector api. (so if i have date range for 1 month and I want every minute data, i create 30 * 24 * 60 * [number of secondary key - long])
I can solve concatenation issue with composite key. But query part is what I am trying to understand.
As far as I understood, As we are using RandomPartitioner we cannot really query based on range as keys are MD5 checksum. Whats ideal design for this kind of use case ?
my schema and requirements are as follows : (actual csh)
CREATE TABLE report(
ts timestamp,
user_id long,
svc1 long,
svc2 long,
svc3 long,
PRIMARY KEY(ts, user_id));
select from report where ts between (123445345435 and 32423423424) and user_id is in (123,567,987)
You cannot do range queries on the first component of a composite key. Instead, you should write a sentinel value such as a daystamp (the unix epoch at midnight on the current day) as the key, then write a composite column as timestamp:long. This way you can provide the keys that comprise your range, and slice on the timestamp component of the composite column.
Denormalize! You must model your schema in a manner that will enable the types of queries you wish to perform. We create a reverse (aka inverted, inverse) index for such scenarios.
CREATE TABLE report(
KEY uuid PRIMARY KEY,
svc1 bigint,
svc2 bigint,
svc3 bigint
);
CREATE TABLE ReportsByTime(
KEY ascii PRIMARY KEY
) with default_validation=uuid AND comparator=uuid;
CREATE TABLE ReportsByUser(
KEY bigint PRIMARY KEY
)with default_validation=uuid AND comparator=uuid;
See here for a nice explanation. What you are doing now is generating your own ascii key in the times table, to enable yourself to perform the range slice query you want - it doesn't have to be ascii though just something you can use to programmatically generate your own slice keys with.
You can use this approach to facilitate all of your queries, this likely isn't going to suit your application directly but the idea is the same. You can squeeze more out of this by adding meaningful values to the column keys of each table above.
cqlsh:tester> select * from report;
KEY | svc1 | svc2 | svc3
--------------------------------------+------+------+------
1381b530-1dd2-11b2-0000-242d50cf1fb5 | 332 | 333 | 334
13818e20-1dd2-11b2-0000-242d50cf1fb5 | 222 | 223 | 224
13816710-1dd2-11b2-0000-242d50cf1fb5 | 112 | 113 | 114
cqlsh:tester> select * from times;
KEY,1212051037 | 13818e20-1dd2-11b2-0000-242d50cf1fb5,13818e20-1dd2-11b2-0000-242d50cf1fb5 | 1381b530-1dd2-11b2-0000-242d50cf1fb5,1381b530-1dd2-11b2-0000-242d50cf1fb5
KEY,1212051035 | 13816710-1dd2-11b2-0000-242d50cf1fb5,13816710-1dd2-11b2-0000-242d50cf1fb5 | 13818e20-1dd2-11b2-0000-242d50cf1fb5,13818e20-1dd2-11b2-0000-242d50cf1fb5
KEY,1212051036 | 13818e20-1dd2-11b2-0000-242d50cf1fb5,13818e20-1dd2-11b2-0000-242d50cf1fb5
cqlsh:tester> select * from users;
KEY | 13816710-1dd2-11b2-0000-242d50cf1fb5 | 13818e20-1dd2-11b2-0000-242d50cf1fb5
-------------+--------------------------------------+--------------------------------------
23123123231 | 13816710-1dd2-11b2-0000-242d50cf1fb5 | 13818e20-1dd2-11b2-0000-242d50cf1fb5
Why don't you use wide rows, where Key is timestamp and Column Name as Long-Value then you can pass multiple key's (timestamp's) to getKeySlice and select multiple column's to withColumnSlice by there name (which is id).
As I don't know what is column name and value, I feel this can help you. Can you provide more details of your column family definition.

Resources