Cassandra returns unordered result set for numeric values

I am new to NoSQL and just started learning Cassandra. I have the following question to ask. I have created a simple table with one column to understand Cassandra partitioning and clustering, and I am trying to query all the values after insertion.
My table structure
create table if not exists music_library(custno int, primary key(custno));
I inserted following values in a sequential order
insert into music_library(custno) values (11);
insert into music_library(custno) values (12);
insert into music_library(custno) values (13);
insert into music_library(custno) values (14);
then I was querying this table
select * from music_library;
it returns values in the following order
13
11
14
12
but i was expecting
11
12
13
14
Why is it behaving like that?

I ran your exact statements and produced the same result. But I also adjusted your query to run the token function, and this is what it produced:
aaron@cqlsh:stackoverflow> select custno, token(custno) from music_library;

 custno | system.token(custno)
--------+----------------------
     13 | -5034495173465742853
     11 | -4156302194539278891
     14 |  4279681877540623768
     12 |  8582886034424406875

(4 rows)
Why is it behaving like that?
Simply put, because Cassandra cannot order results by the values of the partition keys.
As your table has a single primary key of custno, your rows are partitioned by the hashed token value of custno, and written to the nodes responsible for those token ranges. When you run an unbound query in Cassandra (query without a WHERE clause), the results are returned ordered by the hashed token values of their partition keys.
Using ORDER BY won't work here, either. ORDER BY can only sort data within a partition, and even then only on clustering keys. To get the custno values to order properly, you will need to find a new partition key, and then specify custno as a clustering key in ascending order.
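To illustrate (a hedged sketch; the exact error text varies by Cassandra version), trying to sort the original table directly is rejected, because custno is the partition key:

SELECT * FROM music_library ORDER BY custno;
-- rejected: ORDER BY works only on clustering columns, and only
-- when the partition key is restricted in the WHERE clause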
Edit 20190916 - follow-up clarifications
Will this tokenization happen for all the columns?
No. The partition keys are hashed into a token to determine their placement in the cluster (which node(s) they are written to). Individual column values are written within a partition.
How can I return the inserted numbers in order?
You cannot alter the order of this table without changing the model. Simply put, you'll have to find a way to organize the values you expect to return (with your query) together, i.e. find another partition key. Exactly how that looks depends on your business/query requirements.
For example, let's say that I wanted to track which customers purchased specific music albums. I might create a table that looks like this:
CREATE TABLE customers_by_album (
    album TEXT,
    band TEXT,
    custno INT,
    PRIMARY KEY (album,custno))
WITH CLUSTERING ORDER BY (custno ASC);
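For reference, rows like the following would produce the result shown below (these INSERTs are assumed; they are not part of the original example):

INSERT INTO customers_by_album (album,band,custno) VALUES ('Moving Pictures','Rush',11);
INSERT INTO customers_by_album (album,band,custno) VALUES ('Moving Pictures','Rush',12);
INSERT INTO customers_by_album (album,band,custno) VALUES ('Moving Pictures','Rush',13);
INSERT INTO customers_by_album (album,band,custno) VALUES ('Moving Pictures','Rush',14);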
After inserting some data, the following query returns results ordered by custno:
aaron@cqlsh:stackoverflow> SELECT album,token(album),band,custno FROM
customers_by_album WHERE album='Moving Pictures';

 album           | system.token(album) | band | custno
-----------------+---------------------+------+--------
 Moving Pictures | 7819329704333693835 | Rush |     11
 Moving Pictures | 7819329704333693835 | Rush |     12
 Moving Pictures | 7819329704333693835 | Rush |     13
 Moving Pictures | 7819329704333693835 | Rush |     14

(4 rows)
This works because I am querying data by a partition (album), and then "clustering" on custno, which leverages the on-disk sort order. Cassandra maintains that sort order within the partition as data is written, so it can read the rows back sequentially.
I wrote an article on this topic for DataStax a few years ago, and it's still quite relevant. Give it a read if you get a chance: https://www.datastax.com/dev/blog/we-shall-have-order

Related

Retrieving bucketing value in WITH statement for subsequent SELECT

I have several tables with bucketing applied. It works great when I specify the bucket/partition parameter upfront in my SELECT query. However, when I retrieve the bucket value I need from a different table, within a WITH select statement, Hive/Athena no longer seems to use the optimisation and searches the entire database instead. I would like to learn whether there is a way to write my query properly to maintain the optimisation.
For a simple example, I have two tables:
Table1
category | categoryid
---------+-----------
mass     |          1
Table2
categoryid | index | value
-----------+-------+------
         1 |     0 |    15
         1 |     1 |    10
         1 |     2 |     7
The bucketed/clustered column is categoryid. I have a single category ('mass') and would like to retrieve the values that correspond to it. So I have designed my SELECT like this:
WITH dataset AS (
    SELECT categoryid
    FROM Table1
    WHERE category='mass'
)
SELECT index, value
FROM Table2, dataset
WHERE Table2.categoryid = dataset.categoryid
This will run, but it seems to search the entire database, because Hive doesn't know the categoryid for bucketing before commencing the search. If I swap out the final Table2.categoryid=dataset.categoryid for Table2.categoryid=1, then it searches only a fraction of the db.
So is there some way of writing this query to ensure Hive doesn't search more buckets in the second table than it has to?
Athena is based on Presto. Unless there is some modification in Athena in this area (and I think there currently isn't), this cannot be made to work in a single query.
Recommended workaround: issue one query to gather the dataset.categoryid values, then pass them as constants into your main query:
WITH dataset AS (
    SELECT categoryid
    FROM Table1
    WHERE category='mass'
)
SELECT index, value
FROM Table2, dataset
WHERE Table2.categoryid = dataset.categoryid
AND Table2.categoryid IN ( <all possible values> );
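Concretely, with the sample data above (where 'mass' maps to categoryid 1), the workaround becomes two statements; the IN list in the second is simply whatever the first returned:

-- step 1: look up the bucket key(s)
SELECT categoryid FROM Table1 WHERE category='mass';  -- returns 1

-- step 2: inline the returned value(s) as constants, so the engine
-- can prune buckets in Table2 instead of scanning them all
SELECT index, value FROM Table2 WHERE categoryid IN (1);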
This is going to improve with the addition of Dynamic Filtering to Presto, which the Presto community is currently working on.

Group data and extract average in Cassandra cqlsh

Let's say we have a keyspace named sensors and a table named sensor_per_row.
This table has the following structure:
sensor_id | ts | value
In this case sensor_id is the partition key and ts (the timestamp at which the record was created) is the clustering key.
select sensor_id, value, TODATE(ts) as day, ts from sensors.sensor_per_row;
The outcome of this select is
 sensor_id | value | day        | ts
-----------+-------+------------+---------------
  Sensor 2 |  52.7 | 2019-01-04 | 1546640464138
  Sensor 2 |  52.8 | 2019-01-04 | 1546640564376
  Sensor 2 |  52.9 | 2019-01-04 | 1546640664617
How can I group the data by ts, or more specifically by date, and return the daily average value for each sensor, using cqlsh? For instance:
 sensor_id | system.avg(value) | day
-----------+-------------------+------------
  Sensor 2 |          52.52059 | 2018-12-11
  Sensor 2 |          42.52059 | 2018-12-10
  Sensor 3 |          32.52059 | 2018-12-11
One way, I guess, is to use a UDF (user-defined function), but a UDF runs against only one row at a time. Is it possible to select data inside a UDF?
Another way is to use Java etc., with multiple queries for each day, or to process the data at some other contact point such as a REST web service, but I don't know how efficient that would be... any suggestions?
NoSQL Limitations
While working with NoSQL, we generally have to give up:
Some ACID guarantees.
Consistency from CAP.
Shuffling operations: JOIN, GROUP BY.
You may perform the above operations by reading the rows back from the table and summing/averaging them in your application.
You can also refer to the answer MAX(), DISTINCT and group by in Cassandra
So I found the solution; I will post it in case somebody else has the same question.
From what I read, data modeling seems to be the answer, which means:
In a Cassandra db we have partition keys and clustering keys. Cassandra can handle multiple simultaneous inserts, which gives us the possibility of inserting the same data into more than one table at a time. That pretty much means we can create different tables for the same data-collection application, to be used much like materialized views (in the MySQL sense).
For instance, let's say we have the log schema {sensor_id, region, value}.
The first thing that comes to mind is to generate a table called sensor_per_row like:
 sensor_id | value | region | ts
-----------+-------+--------+----
This is a very efficient way of storing the data for a long time, but given Cassandra's query restrictions it is not that simple to visualize the data or gain analytics from it.
Because of that, we can create additional tables with a TTL (time to live), which simply controls how long the data will be stored.
For instance, if we want the daily measurements of a specific sensor, we can create a table with day and sensor_id as the partition key and the timestamp as the clustering key, in descending order.
If we also add a TTL value of 24*60*60 = 86400 seconds, which is one day, the table holds only the current day's data.
So creating, let's say, a table sensor_per_day with the above format and TTL will actually give us the daily measurements (see the sketch below). Older rows expire automatically as their TTL runs out, while the data remains permanently stored in the previous table, sensor_per_row.
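A sketch of what that might look like (the column types and the default_time_to_live option are assumptions, since the original schema isn't shown):

CREATE TABLE sensor_per_day (
    day date,
    sensor_id text,
    ts timestamp,
    value double,
    PRIMARY KEY ((day, sensor_id), ts))
WITH CLUSTERING ORDER BY (ts DESC)
AND default_time_to_live = 86400;  -- 24*60*60 seconds = one day

-- each reading is written to both tables at ingest time
BEGIN BATCH
  INSERT INTO sensor_per_row (sensor_id, ts, value)
  VALUES ('Sensor 2', '2019-01-04 22:21:04+0000', 52.7);
  INSERT INTO sensor_per_day (day, sensor_id, ts, value)
  VALUES ('2019-01-04', 'Sensor 2', '2019-01-04 22:21:04+0000', 52.7);
APPLY BATCH;

-- the daily average then comes from a single partition
SELECT sensor_id, avg(value) FROM sensor_per_day
WHERE day = '2019-01-04' AND sensor_id = 'Sensor 2';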
I hope this gives you the idea.

No-SQL (Cassandra) data modelling of user data

How do you model user data in Cassandra?
A single table for user data, partitioned by user-ID, with different components reading/writing to different columns?
Multiple tables (one per component) with the same key structure, that occasionally need to be "joined" together on partition key?
We have various data and metadata associated with a customer, that we currently hold in separate tables with the same partitioning & clustering keys.
This leads to fetching bits of information for a user from different tables (e.g. for analytics), effectively "joining" two or more Cassandra tables on their partition keys.
On the positive side, inserts into the tables are done independently.
Is there a race condition when concurrently updating data under the same partition key but different columns? Or are the deltas gracefully merged in the SSTables?
Is having multiple tables with the same partition (and clustering) keys usual or an anti-pattern?
To make this more concrete, let's say we have:
CREATE TABLE example (
    pk text PRIMARY KEY,
    col_a text,
    col_b text
)
Assume that for a given partition key (pk), initially both col_a and col_b have some value (i.e. not null), and two concurrent inserts each update one of them. Is any race condition possible there? Could one of the two updates be lost, despite the writes touching separate columns?
Summary
Write conflicts are something you shouldn't need to worry about. All INSERTs/UPDATEs/DELETEs are upserts in Cassandra, and everything in Cassandra is column-based.
Cassandra uses a last-write-wins strategy to manage conflicts. As you can see in the example below, whenever you change a value, the timestamp associated with that column is updated. Since your concurrent updates touch different columns (one thread updates col_a and the other updates col_b), neither write can overwrite the other.
Example
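The table used in these examples isn't defined in the original answer; a minimal sketch matching the question's schema would be:

CREATE TABLE race_condition_test (
    pk text PRIMARY KEY,
    col_a text,
    col_b text);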
Initial Insert
cqlsh:test_keyspace> insert into race_condition_test (pk, col_a, col_b ) VALUES ( '1', 'deckard', 'Blade Runner');
cqlsh:test_keyspace> select * from race_condition_test ;
 pk | col_a   | col_b
----+---------+--------------
  1 | deckard | Blade Runner
(1 rows)
Timestamps are the same in the initial insert
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;
 pk | col_a   | writetime(col_a) | col_b        | writetime(col_b)
----+---------+------------------+--------------+------------------
  1 | deckard | 1526916970412357 | Blade Runner | 1526916970412357
(1 rows)
Once col_b is updated, its timestamp changes to reflect the change.
cqlsh:test_keyspace> insert into race_condition_test (pk, col_b ) VALUES ( '1', 'Rick');
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;
 pk | col_a   | writetime(col_a) | col_b | writetime(col_b)
----+---------+------------------+-------+------------------
  1 | deckard | 1526916970412357 | Rick  | 1526917272641682
(1 rows)
After col_a is updated, it too gets its timestamp updated to the new value.
cqlsh:test_keyspace> insert into race_condition_test (pk, col_a) VALUES ( '1', 'bounty hunter');
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;
 pk | col_a         | writetime(col_a) | col_b | writetime(col_b)
----+---------------+------------------+-------+------------------
  1 | bounty hunter | 1526917323082217 | Rick  | 1526917272641682
(1 rows)
Recommendation
My recommendation is that you use a single table that serves your query needs. If you need to query by pk, then create one table with all the columns you need. This way you will have a single wide row that can be read back efficiently, as part of a single query.
The data model you describe in option 2 is a bit too relational, and is not optimal for Cassandra. You cannot perform joins natively in Cassandra, and you should avoid performing joins on the client side.
Data Model Rules:
Rule 1: Spread data evenly across the cluster
You will want to create a partition key that ensures the data is evenly distributed across the cluster, so that you don't have any hotspots.
Rule 2: Minimize the number of partitions read
Each partition may reside on a different node, so you should try to create a scenario where your queries ideally go to only one node, for performance's sake.
Rule 3: Model around your queries
Determine what queries you need to support.
Create a table that satisfies each query (meaning that you should use one table per query pattern).
If you need to support more query patterns, then denormalize your data into additional tables that serve those queries, as in the sketch below. Avoid secondary indexes and materialized views, as they are not stable at the moment, and the former can create major performance issues as your cluster grows.
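As a hypothetical sketch of rule 3, reusing the question's column names: if rows are looked up both by pk and by col_a, each query pattern gets its own table, and the application writes to both:

-- serves: SELECT ... WHERE pk = ?
CREATE TABLE example_by_pk (
    pk text PRIMARY KEY,
    col_a text,
    col_b text);

-- serves: SELECT ... WHERE col_a = ?
-- a denormalized copy written alongside example_by_pk,
-- instead of a secondary index or materialized view
CREATE TABLE example_by_col_a (
    col_a text,
    pk text,
    col_b text,
    PRIMARY KEY (col_a, pk));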
If you want to read a little bit more about this, I suggest this DataStax page:
Basic Rules of Cassandra Data Modeling

How to get last inserted row in Cassandra?

I want to get last inserted row in Cassandra table. How to get it? Any idea?
I am developing a project in which I am replacing MySQL with Cassandra. I want to get rid of all the SQL queries and rewrite them for Cassandra.
Just to impart a little understanding...
As with all Cassandra query problems, the query needs to be served by a model specifically designed for it. This is known as query-based modeling. Querying the last inserted row is not an intrinsic capability built into every table; you would need to design your model to support that ahead of time.
For instance, let's say I have a table storing data for users.
CREATE TABLE users (
    username TEXT,
    email TEXT,
    firstname TEXT,
    lastname TEXT,
    PRIMARY KEY (username));
If I were to run a SELECT * FROM users LIMIT 1 on this table, my result set would contain a single row. That row would be the one containing the lowest hashed value of username (my partition key), because that's how Cassandra stores data in the cluster. I would have no way of knowing if it was the last one added or not, so this wouldn't be terribly useful to you.
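You can see this for yourself by selecting the token alongside the row, as in the token() example near the top of this page:

SELECT username, token(username) FROM users LIMIT 1;
-- returns the row whose hashed token sorts first,
-- not the row that was written most recently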
On the other hand, let's say I had a table designed to track updates that users had made to their account info.
CREATE TABLE userUpdates (
    username TEXT,
    lastUpdated TIMEUUID,
    email TEXT,
    firstname TEXT,
    lastname TEXT,
    PRIMARY KEY (username,lastUpdated))
WITH CLUSTERING ORDER BY (lastUpdated DESC);
Next I'll upsert 3 rows:
> INSERT INTO userUpdates (username,lastUpdated,email,firstname,lastname)
  VALUES ('bkerman',now(),'bkerman@ksp.com','Bob','Kerman');
> INSERT INTO userUpdates (username,lastUpdated,email,firstname,lastname)
  VALUES ('jkerman',now(),'jkerman@ksp.com','Jebediah','Kerman');
> INSERT INTO userUpdates (username,lastUpdated,email,firstname,lastname)
  VALUES ('bkerman',now(),'bobkerman@ksp.com','Bob','Kerman');
> SELECT username, email, dateof(lastUpdated) FROM userupdates;

 username | email             | system.dateof(lastupdated)
----------+-------------------+----------------------------
  jkerman | jkerman@ksp.com   |   2016-02-17 15:31:39+0000
  bkerman | bobkerman@ksp.com |   2016-02-17 15:32:22+0000
  bkerman | bkerman@ksp.com   |   2016-02-17 15:31:38+0000

(3 rows)
If I just run SELECT username, email, dateof(lastUpdated) FROM userupdates LIMIT 1, I'll get Jebediah Kerman's data, which is not the most-recently updated. However, if I limit my partition to username='bkerman', a LIMIT 1 will give me the most-recent row for Bob Kerman.
> SELECT username, email, dateof(lastUpdated) FROM userupdates WHERE username='bkerman' LIMIT 1;

 username | email             | system.dateof(lastupdated)
----------+-------------------+----------------------------
  bkerman | bobkerman@ksp.com |   2016-02-17 15:32:22+0000

(1 rows)
This works, because I specified a clustering order of descending on lastUpdated:
WITH CLUSTERING ORDER BY (lastUpdated DESC);
In this way, results within each partition will be returned with the most-recently upserted row at the top, hence LIMIT 1 becomes the way to query the most-recent row.
In summary, it is important to understand that:
Cassandra orders data in the cluster by the hashed value of a partition key. This helps ensure more-even data distribution.
Cassandra CLUSTERING ORDER enforces on-disk sort order of data within a partition key.
While you won't be able to get the most-recently upserted row for each table, you can design models to return that row to you for each partition.
tl;dr: Querying in Cassandra is MUCH different from querying MySQL or any RDBMS. If querying the last upserted row (for a partition) is something you need to do, there are probably ways in which you can model your table to support it.
I want to get last inserted row in Cassandra table. How to get it? Any idea?
It is not possible; what you are requesting is a queue pattern (give me the last message in), and queues are a known anti-pattern for Cassandra.

Get Date Range for Cassandra - Select timeuuid with IN returning 0 rows

I'm trying to get data for a date range from Cassandra; the table is like this:
CREATE TABLE test6 (
    time timeuuid,
    id text,
    checked boolean,
    email text,
    name text,
    PRIMARY KEY ((time), id)
)
But when I select a date range I get nothing:
SELECT * FROM test6 WHERE time IN ( minTimeuuid('2013-01-01 00:05+0000'), now() );
(0 rows)
How can I get a date range from a Cassandra Query?
The IN condition is used to specify multiple keys for a SELECT query. To run a date range query on your table you're close, but you'll want to use greater-than and less-than instead.
Of course, you can't run a greater-than/less-than query on a partition key, so you'll need to flip your keys for this to work. This also means that you'll need to specify your id in the WHERE clause, as well:
CREATE TABLE teste6 (
    time timeuuid,
    id text,
    checked boolean,
    email text,
    name text,
    PRIMARY KEY ((id), time)
)
INSERT INTO teste6 (time,id,checked,email,name)
VALUES (now(),'B26354',true,'rdeckard@lapd.gov','Rick Deckard');
SELECT * FROM teste6
WHERE id='B26354'
AND time >= minTimeuuid('2013-01-01 00:05+0000')
AND time <= now();

 id     | time                                 | checked | email             | name
--------+--------------------------------------+---------+-------------------+--------------
 B26354 | bf0711f0-b87a-11e4-9dbe-21b264d4c94d |    True | rdeckard@lapd.gov | Rick Deckard

(1 rows)
Now while this will technically work, partitioning your data by id might not work for your application. So you may need to put some more thought into your data model and come up with a better partition key.
Edit:
Remember with Cassandra, the idea is to get a handle on what kind of queries you need to be able to fulfill. Then build your data model around that. Your original table structure might work well for a relational database, but in Cassandra that type of model actually makes it difficult to query your data in the way that you're asking.
Take a look at the modifications that I have made to your table (basically, I just reversed your partition and clustering keys). If you still need help, Patrick McFadin (DataStax's Chief Evangelist) wrote a really good article called Getting Started with Time Series Data Modeling. He has three examples that are similar to yours. In fact his first one is very similar to what I have suggested for you here.
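For example, one common refinement from that style of time-series modeling (a sketch, not part of the original answer) is to bucket each id's data by day, so that no single partition grows without bound:

CREATE TABLE teste6_by_day (
    id text,
    day date,  -- the bucket, e.g. '2013-01-01'
    time timeuuid,
    checked boolean,
    email text,
    name text,
    PRIMARY KEY ((id, day), time));

SELECT * FROM teste6_by_day
WHERE id='B26354' AND day='2013-01-01'
AND time >= minTimeuuid('2013-01-01 00:05+0000')
AND time <= maxTimeuuid('2013-01-01 23:59+0000');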
