I am a newbie in Cassandra. I am not sure about adding a set of columns to a row many times, e.g., I want to add call-related columns (timestamp_calling_no, timestamp_tower_id, timestamp_start_time, timestamp_end_time, timestamp_duration, timestamp_call_type, etc.) to a row whenever the same mobile number makes a call, using Hector/Astyanax/Java/CQL.
Please give your suggestions. Thanks in advance.
Adding columns to a row is an inexpensive operation. Also, Cassandra stores column names sorted, so using a timestamp as a column name solves the problem of slicing. A Cassandra row can hold up to 2 billion columns, so the CDR column family can easily keep growing.
If you were trying to run a query which required Cassandra to scan all rows, then yes, it would perform poorly.
It's good that you recognise that there are a number of APIs available. I would recommend CQL3, mostly because the Thrift APIs are now kept only for backwards compatibility, and because CQL can be used from most languages, while Astyanax and Hector are Java-specific.
I'd use a compound key (under "Clustering, compound keys, and more").
//An example keyspace using CQL2
cqlsh>
CREATE KEYSPACE calls WITH strategy_class = 'SimpleStrategy'
AND strategy_options:replication_factor = 1;
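For reference, a CQL3 equivalent of that keyspace definition (assuming a single-node cluster) would be:
cqlsh>
CREATE KEYSPACE calls WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};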
//And next create a CQL3 table with a compound key
//Compound key is formed from the number and call's start time
cqlsh>
CREATE TABLE calls.calldata (
number text,
timestamp_start_time timestamp,
timestamp_end_time timestamp,
PRIMARY KEY (number, timestamp_start_time)
) WITH COMPACT STORAGE;
The above schema allows you to insert rows with the same number multiple times: because the call's start time is part of the key, each combination forms a unique key.
Next, insert some test data using CQL3 (for the purpose of the example, of course):
cqlsh> //This example data below uses 2 different numbers
insert into calls.calldata (number, timestamp_start_time, timestamp_end_time)
values ('+441234567890', 1335361733545850, 1335361773545850);
insert into calls.calldata (number, timestamp_start_time, timestamp_end_time)
values ('+440987654321', 1335361734678700, 1335361737678700);
insert into calls.calldata (number, timestamp_start_time, timestamp_end_time)
values ('+441234567890', 1335361738208700, 1335361738900032);
insert into calls.calldata (number, timestamp_start_time, timestamp_end_time)
values ('+441234567890', 1335361740100277, 1335361740131251);
insert into calls.calldata (number, timestamp_start_time, timestamp_end_time)
values ('+440987654321', 1335361740176666, 1335361740213000);
And now we can retrieve all the data (again using CQL3). Note that the inserts above supplied microsecond-precision values, while CQL timestamps count milliseconds since the epoch, which is why the dates below render far in the future:
cqlsh> SELECT * FROM calls.calldata;
number | timestamp_start_time | timestamp_end_time
---------------+---------------------------+---------------------------
+440987654321 | 44285-12-05 15:11:18+0000 | 44285-12-05 16:01:18+0000
+440987654321 | 44285-12-05 16:42:56+0000 | 44285-12-05 16:43:33+0000
+441234567890 | 44285-12-05 14:52:25+0000 | 44285-12-06 01:59:05+0000
+441234567890 | 44285-12-05 16:10:08+0000 | 44285-12-05 16:21:40+0000
+441234567890 | 44285-12-05 16:41:40+0000 | 44285-12-05 16:42:11+0000
Or just part of the data. Because a compound key is used, you can retrieve all the rows for a specific number using CQL3:
cqlsh> SELECT * FROM calls.calldata WHERE number='+441234567890';
number | timestamp_start_time | timestamp_end_time
---------------+---------------------------+---------------------------
+441234567890 | 44285-12-05 14:52:25+0000 | 44285-12-06 01:59:05+0000
+441234567890 | 44285-12-05 16:10:08+0000 | 44285-12-05 16:21:40+0000
+441234567890 | 44285-12-05 16:41:40+0000 | 44285-12-05 16:42:11+0000
And if you want to be really specific, you can retrieve a specific number where the call started at a specific time (again, thanks to the compound key):
cqlsh> SELECT * FROM calls.calldata WHERE number='+441234567890'
and timestamp_start_time=1335361733545850;
number | timestamp_start_time | timestamp_end_time
---------------+---------------------------+---------------------------
+441234567890 | 44285-12-05 14:52:25+0000 | 44285-12-06 01:59:05+0000
In frameworks like PlayOrm, there are multiple ways to do that. For example, you can use the #OneToMany or #NoSqlEmbedded pattern. For more details, visit http://buffalosw.com/wiki/Patterns-Page/
cqlsh create table:
CREATE TABLE emp(
emp_id int PRIMARY KEY,
emp_name text,
emp_city text,
emp_sal varint,
emp_phone varint
);
insert data
INSERT INTO emp (emp_id, emp_name, emp_city,
emp_phone, emp_sal) VALUES(1,'ram', 'Hyderabad', 9848022338, 50000);
select data
SELECT * FROM emp;
emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+-----------+----------+------------+---------
1 | Hyderabad | ram | 9848022338 | 50000
2 | Hyderabad | robin | 9848022339 | 40000
3 | Chennai | rahman | 9848022330 | 45000
This looks just the same as MySQL. Where is the column family? Where are the columns?
A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns.
A column is the basic data structure of Cassandra with three values, namely key or column name, value, and a time stamp.
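Because every column value carries that internal timestamp, you can inspect it from cqlsh. A minimal sketch against the emp table above (assuming the row with emp_id = 1 exists):
SELECT emp_name, WRITETIME(emp_name) FROM emp WHERE emp_id = 1;
//returns 'ram' together with the write time in microseconds since the epoch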
So the table emp is a column family?
And INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) VALUES(1,'ram', 'Hyderabad', 9848022338, 50000); creates a row which contains columns?
A column here is something like emp_id => 1 or emp_name => 'ram'?
In Cassandra, although the column families are defined, the columns are not. You can freely add any column to any column family at any time.
what does this mean?
I can have something like this?
emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+-----------+----------+------------+---------
1 | Hyderabad | ram | 9848022338 | 50000
2 | Hyderabad | robin | 9848022339 | 40000 | asdfasd | asdfasdf
3 | Chennai | rahman | 9848022330 | 45000
A super column is a special column, therefore, it is also a key-value pair. But a super column stores a map of sub-columns.
Where is the super column, and how do I create one?
Column family is an old name; now it's just called a table.
Super column is also an old term; in its place you now have the "Map" data type, for example, or user-defined types for more complex structures.
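For illustration, a minimal sketch of those modern equivalents (type and table names here are hypothetical):
CREATE TYPE address (street text, city text);
CREATE TABLE users (
    id int PRIMARY KEY,
    phones map<text, text>,  //plays roughly the role a super column's sub-columns did
    home frozen<address>     //user-defined type for a more complex structure
);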
About freely adding columns: in the old days Cassandra followed an unstructured-data paradigm, so you didn't have to define columns before inserting them. That is no longer possible, since the Cassandra team moved to being "structured" only (as many in the DB industry concluded that unstructured data causes more problems than it is worth).
Anyway, Cassandra's data representation at the storage level is very different from MySQL's, and it indeed stores data only for the columns that aren't empty. It may look like the same row when you run a select from cqlsh, but it is stored and queried in a very different way.
The name column family is an old term for what's now simply called a table, such as "emp" in your example. Each table contains one or many columns, such as "emp_id", "emp_name".
When saying that you can freely add columns at any time, this means that you're always able to omit values for columns (they will be null) or add columns using the ALTER TABLE statement.
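For example, a minimal sketch against the emp table above (the new column name is hypothetical):
ALTER TABLE emp ADD emp_email text;
//existing rows simply return null for the new column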
I have a table/columnfamily in Cassandra 3.7 with sensordata.
CREATE TABLE test.sensor_data (
house_id int,
sensor_id int,
time_bucket int,
sensor_time timestamp,
sensor_reading map<int, float>,
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
);
Now when I select from this table I find duplicates for the same primary key, something I thought was impossible.
cqlsh:test> select * from sensor_data;
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+---------------------------------+----------------
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
I think part of the problem is that this data has both been written "live" using java and Datastax java driver, and it has been loaded together with historic data from another source using sstableloader.
Regardless, this shouldn't be possible.
I have no way of connecting with the legacy cassandra-cli to this cluster, perhaps that would have told me something that I can't see using cqlsh.
So, the questions are:
* Is there anyway this could happen under known circumstances?
* Can I read more raw data using cqlsh? Specifically, the write time of these two rows. The writetime() function can't operate on primary keys or collections, and that is all I have.
Thanks.
Update:
This is what I've tried, from comments, answers and other sources
* selecting using blobAsBigInt gives the same big integer for all identical rows
* connecting using cassandra-cli, after enabling thrift, is possible but reading the table isn't. It's not supported after 3.x
* dumping out using sstabledump is ongoing but expected to take another week or two ;)
I wouldn't expect to see nanoseconds in a timestamp field, and I'm under the impression they're not fully supported anyway. Try this (sensor_time has to be converted to a blob first):
SELECT house_id, sensor_id, time_bucket, blobAsBigint(timestampAsBlob(sensor_time)) FROM test.sensor_data;
I WAS able to replicate it by inserting the rows via an integer:
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800000);
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800001);
This makes sense, because I suspect one of your writers is inserting the timestamp as a bigint, while the other is actually using a datetime.
I tried playing with both timezones and bigints to reproduce this... it seems only the bigint route is reproducible:
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+--------------------------+----------------
1 | 2 | 3 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-01 23:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 01:01:00+0000 | null
Edit: tried some shenanigans using a bigint in place of the datetime insert, and managed to reproduce the duplicate...
Adding some observations on top of what Nick mentioned:
Cassandra primary key = one of, or a combination of, {partition key(s) + clustering key(s)}.
Keeping in mind that the partition keys (the part inside the inner parentheses) can be simple (one key) or composite (multiple keys) and provide unique identification and distribution, while the clustering keys sort data within a partition, the following has been observed.
Query using SELECT: it is sufficient to query using all of the partition key(s); you can additionally filter on clustering key(s), but only in the same order in which they appear in the primary key definition of the table.
Update using UPDATE ... SET: the update statement's WHERE clause must include not only all of the partition key(s) but also all of the clustering key(s), as sketched below.
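A minimal sketch of both rules against the sensor_data table above (the reading value is hypothetical):
//valid SELECT: all partition key columns are specified
SELECT * FROM sensor_data WHERE house_id=1 AND sensor_id=2 AND time_bucket=3;
//an UPDATE must also pin down every clustering column
UPDATE sensor_data SET sensor_reading = {1: 99.5}
WHERE house_id=1 AND sensor_id=2 AND time_bucket=3 AND sensor_time='2016-01-02 03:04:05+0000';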
Answering the question - Is there any way this could happen under known circumstances?
Yes, it is possible when the same data is inserted from different sources.
To explain further: if one inserts data from code (an API etc.) into Cassandra and then inserts the same data from DataStax Studio or any other tool used for direct querying, a duplicate record is inserted.
If the same data is pushed multiple times from code alone, from a querying tool alone, or repeatedly from any other single source, the write behaves idempotently and the data is not inserted again.
A possible explanation is the way the underlying storage engine computes internal indexes or hashes to identify a row for a given set of columns (since the store is column-based).
Note:
The above behaviour (duplication when the same data is pushed from different sources) has been observed, tested and validated.
Language used: C#
Framework: .NET Core 3
"sensor_time" is part of the primary key. It is not in "Partition Key", but is "Clustering Column". this is why you get two "rows".
However, in the disk table, both "visual rows" are stored on single Cassandra row. In reality, they are just different columns and CQL just pretend they are two "visual rows".
Clarification - I did not worked with Cassandra for a while so I might not use correct terms. When i say "visual rows", I mean what CQL result shows.
Update
You can create the following experiment (please forgive and mentally fix any syntax errors I make).
This is supposed to create a table with a composite primary key:
"state" is the partition key and
"city" is the clustering column.
create table cities(
state int,
city int,
name text,
primary key((state), city)
);
insert into cities(state, city, name)values(1, 1, 'New York');
insert into cities(state, city, name)values(1, 2, 'Corona');
select * from cities where state = 1;
This will return something like:
1, 1, New York
1, 2, Corona
But on disk this will be stored in a single row, like this:
+-------+-----------------+-----------------+
| state | city = 1 | city = 2 |
| +-----------------+-----------------+
| | city | name | city | name |
+-------+------+----------+------+----------+
| 1 | 1 | New York | 2 | Corona |
+-------+------+----------+------+----------+
When you have such composite primary key you can select or delete on it, e.g.
select * from cities where state = 1;
delete from cities where state = 1;
In the question, primary key is defined as:
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
this means
"house_id", "sensor_id", "time_bucket" is "Partition Key" and
"sensor_time" is the "Clustering Column".
So when you select, the real row is split and shown as if there were several rows.
Update
http://www.planetcassandra.org/blog/primary-keys-in-cql/
The PRIMARY KEY definition is made up of two parts: the Partition Key
and the Clustering Columns. The first part maps to the storage engine
row key, while the second is used to group columns in a row. In the
storage engine the columns are grouped by prefixing their name with
the value of the clustering columns. This is a standard design pattern
when using the Thrift API. But now CQL takes care of transposing the
clustering column values to and from the non key fields in the table.
Then read the explanations in "The Composite Enchilada".
When I try to find information on composite columns, I cannot find anything newer than 2013 (specifically this one is Google's top link, which has no CQL code when talking about using the composite columns, and apparently uses a very old Java driver). Do composite columns still exist in newer versions of Cassandra? I mean, apart from having a composite key.
I am new to Cassandra and actually want to learn if they are suitable for my use-case, described in the following. Consider a table with 4 double-valued columns, say w, x, y, z. These data are collected from 3 sources, say a, b and c. Each source may be missing some part of the data, so there are a maximum of 12 numbers at each row of the table.
Instead of creating 3 tables with 4 columns to store values from the different sources, and later merging the tables to fill in the missing fields, I am thinking of having a table that models the 4 data columns as 4 super columns or composite columns. Something like a:w, b:w, c:w, a:x, b:x, c:x, a:y, b:y, c:y, a:z, b:z, c:z. Additionally, every row has a timestamp as the primary key.
What I want to find out is whether I can have a query like SELECT *:w AS w FROM MyTable such that for every row, one value for w is returned from any source that is available (it doesn't matter which source). Although I also want to preserve the ability to retrieve data from a specific source, like SELECT a:w FROM MyTable.
----------------------------------------------------------------
| key | a:w | b:w | c:w | a:x | b:x | c:x | a:y | b:y | c:y | ...
----------------------------------------------------------------
| 1 | 10 | 10 | - | ....
| 2 | - | 1 | 2 | ....
| 3 | 11 | - | - | ....
| 4 | 12 | 11 | 11 | ....
-----------------------------------------------------------------
SELECT *:w AS w FROM MyTable
(10, 1, 11, 12) // would be an acceptable answer
SELECT a:w AS w FROM MyTable
(10, 11, 12) // would be an acceptable answer
Composite column is vocabulary related to the Thrift protocol. Internally, until Cassandra 2.2, the storage engine still dealt with composite columns and translated them into clustering columns, the new vocabulary that comes with CQL.
Since Cassandra 3.x, the storage engine has been rewritten and no longer stores data using composite columns; the storage engine is aligned with the new CQL semantics, i.e. partition key/clustering column. For backward compatibility, clustering columns are still translated back to composite column semantics when dealing with the legacy Thrift protocol.
If you are just starting with Cassandra, forget about the old Thrift protocol and use the CQL semantics right away.
For your needs, the following schema should do the job:
CREATE TABLE my_data(
data text,
source text,
PRIMARY KEY ((data), source)
);
INSERT INTO my_data(data, source) VALUES('data1','src1');
INSERT INTO my_data(data, source) VALUES('data1','src2');
...
INSERT INTO my_data(data, source) VALUES('dataN','src1');
...
INSERT INTO my_data(data, source) VALUES('dataN','srcN');
//Select all sources for data1
SELECT source FROM my_data WHERE data='data1';
//Select data and source
SELECT * FROM my_data WHERE data='data1' AND source='src1';
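Note that this schema does not let you query by source alone. If you need that access path as well, a minimal sketch of one option (a secondary index; maintaining a second table keyed by source would also work):
CREATE INDEX ON my_data (source);
//now this works without specifying the partition key
SELECT data FROM my_data WHERE source='src1';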
We use Cassandra wide rows heavily to store per-user time series, as they are perfect for that use case. Let's assume we have a table:
create table user_events (
user_id text,
timestmp timestamp,
event text,
primary key((user_id), timestmp));
What if clashes on the timestamp happen (the same user can emit two different events with the same timestamp)? What is the best way to tweak this schema to resolve that, assuming we have a total ordering of all events (a sequence int for each event)?
If I modify schema the following way:
create table user_events (
user_id text,
timestmp timestamp,
seq int,
event text,
primary key((user_id), timestmp, seq));
I won’t be able to do WHERE user_id = ? ORDER BY timestmp ASC, seq ASC – cassandra does not allow that.
I won’t be able to do WHERE user_id = ? ORDER BY timestmp ASC, seq ASC – cassandra does not allow that.
You might be seeing an error because you are repeating ASC. This should work:
WHERE user_id = ? ORDER BY timestmp, seq ASC
Also, as long as you have defined your primary key as PRIMARY KEY((user_id), timestmp, seq), you don't even need to specify ORDER BY x[, y] ASC. Cassandra clusters the data on disk in that order, and thus returns it to you already sorted. ORDER BY is only necessary when you want your results in descending order (or whatever the opposite of the defined clustering order is).
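As an aside, if you usually want the newest events first, you can declare that order up front when creating the table; a sketch of the same schema with reversed clustering order:
create table user_events (
user_id text,
timestmp timestamp,
seq int,
event text,
primary key((user_id), timestmp, seq))
WITH CLUSTERING ORDER BY (timestmp DESC, seq DESC);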
What if clashes on timestamp may happen?
I think your extra seq column should be sufficient, depending on how you plan on inserting the data. If you are setting the timestmp from the client, then you should be ok. However, look what happens when I (using your second table) INSERT rows while creating the timestamp two different ways.
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('Mal',dateof(now()),1,'commanding');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('Wash',dateof(now()),1,'piloting');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River',dateof(now()),1,'freaking out');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River',dateof(now()),3,'being weird');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River',dateof(now()),2,'killing reavers');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River','2015-01-13 13:14-0600',1,'freaking out');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River','2015-01-13 13:14-0600',3,'being weird');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River','2015-01-13 13:14-0600',2,'killing reavers');
Querying that data by a user_id of "River" yields:
aploetz#cqlsh:stackoverflow> SELECT * FROM user_events WHERE user_id='River';
user_id | timestmp | seq | event
---------+--------------------------+-----+-----------------
River | 2015-01-13 13:14:00-0600 | 1 | freaking out
River | 2015-01-13 13:14:00-0600 | 2 | killing reavers
River | 2015-01-13 13:14:00-0600 | 3 | being weird
River | 2015-01-14 12:58:41-0600 | 1 | freaking out
River | 2015-01-14 12:58:57-0600 | 3 | being weird
River | 2015-01-14 12:58:57-0600 | 2 | killing reavers
(6 rows)
Notice that using the now() function to generate a timeuuid, and then converting it to a timestamp with dateof(), makes the two rows with the timestmp "2015-01-14 12:58:57-0600" appear to be the same. But they are not, as you can tell by the seq column.
So just a bit of caution on using/generating timestamps. They might look the same, but they may not be stored as the same value. Just to be on the safe side, I would use a timeuuid instead.
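A sketch of what that could look like (the table is renamed user_events_v2 here to avoid clashing with the schema above; seq becomes unnecessary because timeuuids are unique):
create table user_events_v2 (
user_id text,
event_id timeuuid,
event text,
primary key((user_id), event_id));
INSERT INTO user_events_v2 (user_id,event_id,event) VALUES ('River',now(),'freaking out');
//dateof() recovers the wall-clock time from the timeuuid
SELECT user_id, dateof(event_id), event FROM user_events_v2 WHERE user_id='River';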
I recently created a keyspace and a column family in Cassandra. I have the following:
CREATE TABLE reports (
id timeuuid PRIMARY KEY,
report varchar
);
I want to select reports within a time range, so my query is the following:
select dateOf(id), id
from keyspace.reports
where token(id) > token(maxTimeuuid('2013-07-16 16:10:48+0300'));
It returns;
dateOf(id) | id
--------------------------+--------------------------------------
2013-07-16 16:10:37+0300 | 1b3f6d00-ee19-11e2-8734-8d331d938752
2013-07-16 16:10:13+0300 | 0d4b20e0-ee19-11e2-bbb3-e3eef18ad51b
2013-07-16 16:10:37+0300 | 1b275870-ee19-11e2-b3f3-af3e3057c60f
2013-07-16 16:10:48+0300 | 21f9a390-ee19-11e2-89a2-97143e6cae9e
So, it's wrong.
When I try to use the following CQL:
select dateOf(id), id from keyspace.reports
where token(id) > token(minTimeuuid('2013-07-16 16:12:48+0300'));
dateOf(id) | id
--------------------------+--------------------------------------
2013-07-16 16:10:37+0300 | 1b3f6d00-ee19-11e2-8734-8d331d938752
2013-07-16 16:10:13+0300 | 0d4b20e0-ee19-11e2-bbb3-e3eef18ad51b
2013-07-16 16:10:37+0300 | 1b275870-ee19-11e2-b3f3-af3e3057c60f
2013-07-16 16:10:48+0300 | 21f9a390-ee19-11e2-89a2-97143e6cae9e
select dateOf(id), id from keyspace.reports
where token(id) > token(minTimeuuid('2013-07-16 16:13:48+0300'));
dateOf(id) | id
--------------------------+--------------------------------------
2013-07-16 16:10:37+0300 | 1b275870-ee19-11e2-b3f3-af3e3057c60f
2013-07-16 16:10:48+0300 | 21f9a390-ee19-11e2-89a2-97143e6cae9e
Is it random? Why isn't it giving meaningful output?
What's the best solution for this in Cassandra?
You are using the token function, which isn't really useful in your context (querying between times using mintimeuuid and maxtimeuuid) and is generating random-looking, incorrect output.
From the CQL documentation:
The TOKEN function can be used with a condition operator on the partition key column to query. The query selects rows based on the token of their partition key rather than on their value. The token of a key depends on the partitioner in use. The RandomPartitioner and Murmur3Partitioner do not yield a meaningful order.
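You can see this for yourself by inspecting the tokens directly; a quick sketch against your table:
select token(id), dateOf(id) from keyspace.reports;
//under RandomPartitioner/Murmur3Partitioner the token order bears no relation to the time order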
If you are looking to retrieve all records between two dates, it might make more sense to model your data as a wide row, with one record per column rather than one record per row, e.g., creating the table:
CREATE TABLE reports2 (
reportname text,
id timeuuid,
report text,
PRIMARY KEY (reportname, id)
);
Then populate the data:
insert into reports2(reportname,id,report) VALUES ('report', 1b3f6d00-ee19-11e2-8734-8d331d938752, 'a');
insert into reports2(reportname,id,report) VALUES ('report', 0d4b20e0-ee19-11e2-bbb3-e3eef18ad51b, 'b');
insert into reports2(reportname,id,report) VALUES ('report', 1b275870-ee19-11e2-b3f3-af3e3057c60f, 'c');
insert into reports2(reportname,id,report) VALUES ('report', 21f9a390-ee19-11e2-89a2-97143e6cae9e, 'd');
And query (no token calls!):
select dateOf(id),id from reports2 where reportname='report' and id>maxtimeuuid('2013-07-16 16:10:48+0300');
This returns the expected result:
dateOf(id) | id
--------------------------+--------------------------------------
2013-07-16 14:10:48+0100 | 21f9a390-ee19-11e2-89a2-97143e6cae9e
The downside to this is that all of your reports are in one row. Of course, you can now store lots of different reports (keyed by reportname here). To get all reports called mynewreport in August 2013 you could query:
select dateOf(id),id from reports2 where reportname='mynewreport' and id>=mintimeuuid('2013-08-01+0300') and id<mintimeuuid('2013-09-01+0300');