Group data and extract average in Cassandra cqlsh - cassandra

Let's say we have a keyspace named sensors and a table named sensor_per_row.
This table has the following structure:
sensor_id | ts | value
In this case sensor_id is the partition key and ts (the timestamp at which the record was created) is the clustering key.
select sensor_id, value, TODATE(ts) as day, ts from sensors.sensor_per_row
The output of this select is:
sensor_id | value | day | ts
-----------+-------+------------+---------------
Sensor 2 | 52.7 | 2019-01-04 | 1546640464138
Sensor 2 | 52.8 | 2019-01-04 | 1546640564376
Sensor 2 | 52.9 | 2019-01-04 | 1546640664617
How can I group the data by ts, or more specifically by date, and return the daily average value for each sensor using cqlsh? For instance:
sensor_id | system.avg(value) | day
-----------+-------------------+------------
Sensor 2 | 52.52059 | 2018-12-11
Sensor 2 | 42.52059 | 2018-12-10
Sensor 3 | 32.52059 | 2018-12-11
One way, I guess, is to use a UDF (user-defined function), but a UDF runs on only one row at a time. Is it possible to select data inside a UDF?
Another way is using Java etc., with multiple queries for each day, or processing the data at some other contact point such as a REST web service, but I don't know how efficient that would be... any suggestions?

NoSQL Limitations
While working with NoSQL, we generally have to give up:
Some ACID guarantees.
Consistency from CAP.
Shuffling operations: JOIN, GROUP BY.
You may perform the above operations by reading the rows from the table and aggregating them in your application.
You can also refer to the answer MAX(), DISTINCT and group by in Cassandra
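If the aggregation can be limited to one partition and a bounded time range, the built-in avg() aggregate can also be used directly in cqlsh. A minimal sketch, assuming ts is a timestamp column (if it is stored as epoch milliseconds, substitute the corresponding numeric bounds):
-- Average for one sensor over one day; restricting by partition key and a
-- clustering range keeps the aggregation on a single partition.
SELECT sensor_id, avg(value)
FROM sensors.sensor_per_row
WHERE sensor_id = 'Sensor 2'
  AND ts >= '2019-01-04 00:00:00+0000'
  AND ts <  '2019-01-05 00:00:00+0000';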

So I found the solution; I will post it in case somebody else has the same question.
From what I have read, data modeling seems to be the answer. Which means:
In Cassandra we have partition keys and clustering keys. Cassandra can handle multiple inserts simultaneously, which gives us the possibility of inserting the data into more than one table at the same time. That pretty much means we can create different tables for the same data-collection application, which are used much like materialized views (as in MySQL).
For instance, let's say we have the log schema {sensor_id, region, value}.
The first thing that comes to mind is to generate a table called sensor_per_row like:
sensor_id | value | region | ts
-----------+-------+------------+---------------
This is a very efficient way of storing the data for a long time, but with Cassandra's built-in functions alone it is not that simple to visualize the data and gain analytics from it.
Because of that, we can create additional tables with a TTL (TTL stands for time to live), which simply defines how long the data will be stored.
For instance, if we want the daily measurements of a specific sensor, we can create a table with day and sensor_id as the partition key and the timestamp as the clustering key in descending order.
If we add a TTL value of 24*60*60 = 86400 seconds, which stands for one day, we can store our daily data.
So creating, let's say, a table sensor_per_day with the above format and TTL will actually give us the daily measurements. Old measurements expire out of the table as the day rolls over, while the full history remains stored in the previous table sensor_per_row.
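A minimal sketch of what that could look like in CQL (the column types and the exact table options are my assumptions, not part of the original post):
-- Per-day table keyed by (day, sensor_id), newest measurements first,
-- rows expiring after 24 hours via the table-level default TTL.
CREATE TABLE sensors.sensor_per_day (
    day        date,
    sensor_id  text,
    ts         timestamp,
    value      double,
    PRIMARY KEY ((day, sensor_id), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND default_time_to_live = 86400;

-- The daily average then comes from a single partition:
SELECT sensor_id, avg(value) FROM sensors.sensor_per_day
WHERE day = '2019-01-04' AND sensor_id = 'Sensor 2';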
I hope that gives you the idea.

Related

Time Serie with delta time travel in databricks

I'm storing the prices of products in a Delta table. The schema of the table is like this:
id | price | updated
1 | 3 | 2022-03-21
2 | 4 | 2022-03-20
3 | 3 | 2022-03-20
I upsert rows using the id field as the primary key, updating the price and updated fields.
I'm trying to get the series of prices over time using Databricks time travel. But looking at the documentation, apparently I can only compare two versions of a table, like this:
%sql
SELECT count(distinct id) - (
SELECT count(distinct id)
FROM table TIMESTAMP AS OF date_sub(current_date(), 7))
FROM table
Is there a way to select the distinct prices across all versions?
I would really not recommend using time travel for that, for the following reasons:
If your data is updated frequently, then you will have a lot of versions, and your performance will degrade over time, as handling a huge number of versions (tens of thousands) will put a lot of pressure on the driver.
It's very hard to do historical analysis, as you can already see - for each version you will need subqueries and then union the data.
Instead, you can use two tables - the first with the actual data, and the second with the historical data, ideally built as SCD Type 2 (Slowly Changing Dimensions) with markers for which period each price was active. You can build that second table using the Change Data Feed (CDF) functionality to pull changes from the first table and apply them to the second table using a MERGE operation. The Databricks documentation includes an example of using MERGE to build SCD Type 2 (although without CDF).
With this approach it will be easy for you to perform historical analysis, as all data will be in the same table and you don't need to use time travel.

Cassandra returns Unordered result set for numeric values

I am new to NoSQL and just started learning Cassandra. I have the following question: I have created a simple table with one column to understand Cassandra partitioning and clustering, and I am trying to query all the values after insertion.
My table structure
create table if not exists music_library(custno int, primary key(custno))
I inserted the following values in sequential order:
insert into music_library(custno) values (11)
insert into music_library(custno) values (12)
insert into music_library(custno) values (13)
insert into music_library(custno) values (14)
Then I queried this table:
select * from music_library
It returns the values in the following order:
13
11
14
12
but I was expecting:
11
12
13
14
Why is it behaving like that?
I ran your exact statements and produced the same result. But I also adjusted your query to run the token function, and this is what it produced:
aaron#cqlsh:stackoverflow> select custno,token(custno) from music_library;
custno | system.token(custno)
--------+----------------------
13 | -5034495173465742853
11 | -4156302194539278891
14 | 4279681877540623768
12 | 8582886034424406875
(4 rows)
Why is it behaving like that?
Simply put, because Cassandra cannot order results by the values of the partition keys.
As your table has a single primary key of custno, your rows are partitioned by the hashed token value of custno, and written to the nodes responsible for those token ranges. When you run an unbound query in Cassandra (query without a WHERE clause), the results are returned ordered by the hashed token values of their partition keys.
Using ORDER BY won't work here, either. ORDER BY can only sort data within a partition, and even then only on clustering keys. To get the custno values to order properly, you will need to find a new partition key, and then specify custno as a clustering key in an ascending direction.
Edit 20190916 - follow-up clarifications
Will this tokenization happen for all the columns?
No. The partition keys are hashed into a token to determine their placement in the cluster (which node(s) they are written to). Individual column values are written within a partition.
How can I return the inserted numbers in order?
You cannot alter the order of this table without changing the model. Simply put, you'll have to find a way to organize the values you expect to return (with your query) together (find another partition key). Exactly how that looks depends on your business/query requirements.
For example, let's say that I wanted to track which customers purchased specific music albums. I might create a table that looks like this:
CREATE TABLE customers_by_album (
album TEXT,
band TEXT,
custno INT,
PRIMARY KEY (album,custno))
WITH CLUSTERING ORDER BY (custno ASC);
After inserting some data, the following query returns results ordered by custno:
aaron#cqlsh:stackoverflow> SELECT album,token(album),band,custno FROM
customers_by_album WHERE album='Moving Pictures';
album | system.token(album) | band | custno
-----------------+---------------------+------+--------
Moving Pictures | 7819329704333693835 | Rush | 11
Moving Pictures | 7819329704333693835 | Rush | 12
Moving Pictures | 7819329704333693835 | Rush | 13
Moving Pictures | 7819329704333693835 | Rush | 14
(4 rows)
This works, because I am querying data by a partition (album), and then I am "clustering" on custno which leverages the on-disk sort order. This is also the order the data was written to disk in, so Cassandra just reads it from the partition sequentially.
I wrote an article on this topic for DataStax a few years ago, and it's still quite relevant. Give it a read if you get a chance: https://www.datastax.com/dev/blog/we-shall-have-order

No-SQL (Cassandra) data modelling of user data

How do you model user data in Cassandra?
A single table for user data, partitioned by user-ID, with different components reading/writing to different columns?
Multiple tables (one per component) with the same key structure, that occasionally need to be "joined" together on partition key?
We have various data and metadata associated with a customer, that we currently hold in separate tables with the same partitioning & clustering keys.
This leads to fetching bits of information for a user from different tables (e.g. for analytics), effectively "joining" two or more Cassandra tables on their partition keys.
On the positive side, inserting to tables is done independently.
Is there a race condition when concurrently updating data under the same partition key but different columns? Or are the deltas gracefully merged in the SSTables?
Is having multiple tables with the same partition (and clustering) keys usual or an anti-pattern?
To make this more concrete, let's say we have:
CREATE TABLE example (
pk text PRIMARY KEY,
col_a text,
col_b text
)
Assume that for a given partition key (pk), initially both col_a and col_b have some value (i.e. not null), and two concurrent inserts update each of them. Is there any race condition possible there? Could one of the two updates be lost, despite writing to separate columns?
Summary
Write conflicts are something you shouldn't need to worry about here. All INSERTs/UPDATEs/DELETEs are upserts in Cassandra, and everything in Cassandra is column-based.
Cassandra uses a last-write-wins strategy to manage conflicts. As you can see in the example below, whenever you change a value, the timestamp associated with that column is updated. Since one of your concurrent updates touches only col_a and the other touches only col_b, each write affects only its own column's value and timestamp, so neither update is lost.
Example
Initial Insert
cqlsh:test_keyspace> insert into race_condition_test (pk, col_a, col_b ) VALUES ( '1', 'deckard', 'Blade Runner');
cqlsh:test_keyspace> select * from race_condition_test ;
pk | col_a | col_b
----+---------+--------------
1 | deckard | Blade Runner
(1 rows)
Timestamps are the same in the initial insert
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;
pk | col_a | writetime(col_a) | col_b | writetime(col_b)
----+---------+------------------+--------------+------------------
1 | Deckard | 1526916970412357 | Blade Runner | 1526916970412357
(1 rows)
Once col_b is updated, its timestamp changes to reflect the change.
cqlsh:test_keyspace> insert into race_condition_test (pk, col_b ) VALUES ( '1', 'Rick');
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;
pk | col_a | writetime(col_a) | col_b | writetime(col_b)
----+---------+------------------+-------+------------------
1 | Deckard | 1526916970412357 | Rick | 1526917272641682
(1 rows)
After col_a is updated, it too gets its timestamp updated to the new value.
cqlsh:test_keyspace> insert into race_condition_test (pk, col_a) VALUES ( '1', 'bounty hunter');
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;
pk | col_a | writetime(col_a) | col_b | writetime(col_b)
----+---------------+------------------+-------+------------------
1 | bounty hunter | 1526917323082217 | Rick | 1526917272641682
(1 rows)
Recommendation
My recommendation is that you use one single table that serves your query needs. If you need to query by pk, then create one single table with all the columns you need. This way you will have a single wide row that can be read back efficiently, as part of a single query.
The data model you describe in option 2 is a bit too relational and is not optimal for Cassandra. You cannot perform joins natively in Cassandra, and you should avoid performing joins on the client side.
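A minimal sketch of that single-table approach (the column names and types below are purely illustrative, not taken from the question):
-- One wide table partitioned by user id; each component reads/writes its own
-- columns, and a single query returns the whole profile.
CREATE TABLE user_profile (
    user_id    text PRIMARY KEY,
    name       text,
    email      text,
    prefs      map<text, text>,
    last_login timestamp
);

SELECT * FROM user_profile WHERE user_id = 'u-123';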
Data Model Rules:
Rule 1: Spread data Evenly across the cluster
You will want to create a partition key that will ensure the data is evenly distributed across the cluster and you don't have any hotspots.
Rule 2: Minimize the number of partitions Read
Each partition may reside on different nodes, so you should try to create a scenario where your queries ideally go to only one node, for performance's sake.
Rule 3: Model around your queries
Determine what queries to support
Create a table that satisfies your query (meaning that you should use one table per query pattern).
If you need to support more query patterns, then denormalize your data into additional tables that serve those queries. Avoid secondary indexes and materialized views, as they are not stable at the moment, and the first can create major performance issues as your cluster grows.
If you want to read a little bit more about this I suggest this datastax page:
Basic Rules of Cassandra Data Modeling

Cassandra - do group by and join in the right way

I know - Cassandra does not support GROUP BY. But how can I achieve a similar result on a big collection of data?
Let's say I have a table with 1 million rows of clicks, another 1 million with shares, and a user_profile table. clicks and shares store one operation per row, with a created_at column. On a dashboard I would like to show results grouped by day, for example:
2016-06-01 - 2016-07-01
+-------------+--------+------+
|user_profile | like |share |
+-------------+--------+------+
| John | 34 | 12 |
| Adam | 12 | 4 |
| Bruce | 4 | 2 |
+-------------+--------+------+
The question is, which is the right way to do this:
Create table user_likes_shares with counter by date
Create UDF to group by each column and join them in the code by merging arrays by key
Select data from 3 tables group and join them in the code by merging arrays by key
Another option
If you use code to join the results, do you use Apache Spark SQL? Is Spark the right way to go in this case?
Assuming that your dashboard page will show all historical results, grouped by day:
1. 'Group by' in a table: The denormalised approach is the accepted way of doing things in Cassandra, as writes and disk space are cheap. If you can structure your data model (and application writes) to support this, then this is the best approach (see the counter-table sketch after this list).
2. 'Group by' in a UDA: In this blog post, the author notes that all rows are pulled back to the coordinator, reconciled and aggregated there (for CL>1). So even if your clicks and shares tables are partitioned by date, Cassandra will still have to pull all rows for that date back to the coordinator, store them in the JVM heap and then process them. So this approach has reduced scalability.
3. Merging in code: This will be a much slower approach as you will have to transfer a lot more data from the coordinator to your application server.
4. Spark: This is a good approach if you have to make ad-hoc queries (e.g. analyzing data, rather than populating a web page) and can be simplified by running your Spark jobs through a notebook application (e.g. Apache Zeppelin). However, in your use case, you have the complexity of having to wait for that job to finish, write the output somewhere and then display it on a web page.
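A minimal sketch of option 1 with per-day counters (the table and column names are illustrative assumptions):
-- All non-counter columns must be part of the primary key; one partition per
-- day keeps the dashboard query bounded to a single partition.
CREATE TABLE user_actions_by_day (
    day      date,
    username text,
    likes    counter,
    shares   counter,
    PRIMARY KEY (day, username)
);

-- Bump the appropriate counter on every click/share event:
UPDATE user_actions_by_day SET likes = likes + 1
WHERE day = '2016-06-01' AND username = 'John';

-- Dashboard query for one day (a month view issues one such query per day):
SELECT username, likes, shares FROM user_actions_by_day WHERE day = '2016-06-01';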

Are there any major disadvantages to having multiple clustering columns in cassandra?

I'm designing a Cassandra table where I need to be able to retrieve rows by their geohash. I have something that works, but I'd like to avoid range queries more than I'm currently able to.
The current table schema is this, with geo_key containing the first five characters of the geohash string. I query using the geo_key, then range-filter on the full geohash, allowing me to prefix-search on any geohash of length 5 or greater:
CREATE TABLE georecords (geo_key text, geohash text, data text, PRIMARY KEY (geo_key, geohash))
My idea is that I could instead store the characters of the geohash as separate columns, allowing me to specify as many characters as I want for a prefix match on the geohash. My concern is what impact using multiple clustering columns might have:
CREATE TABLE georecords (g1 text, g2 text, g3 text, g4 text, g5 text, g6 text, g7 text, g8 text, geohash text, pid int, data text, PRIMARY KEY (g1, g2, g3, g4, g5, g6, g7, g8, geohash, pid))
(I'm not really concerned about the cardinality of the partition key - g1 would have minimum 30 values, and I have other workarounds for it as well)
Other that cardinality of the partition key, and extra storage requirements, what should I be aware of if I used the many cluster column approach?
Other that cardinality of the partition key, and extra storage requirements, what should I be aware of if I used the many cluster column approach?
This seemed like an interesting problem to help out with, so I built a few CQL tables of differing PRIMARY KEY structure and options. I then used http://geohash.org/ to come up with a few endpoints, and inserted them.
aploetz#cqlsh:stackoverflow> SELECT g1, g2, g3, g4, g5, g6, g7, g8, geohash, pid, data FROm georecords3;
g1 | g2 | g3 | g4 | g5 | g6 | g7 | g8 | geohash | pid | data
----+----+----+----+----+----+----+----+--------------+------+---------------
d | p | 8 | 9 | v | c | n | e | dp89vcnem4n | 1001 | Beloit, WI
d | p | 8 | c | p | w | g | v | dp8cpwgv3 | 1003 | Harvard, IL
d | p | c | 8 | g | e | k | t | dpc8gektg8w7 | 1002 | Sheboygan, WI
9 | x | j | 6 | 5 | j | 5 | 1 | 9xj65j518 | 1004 | Denver, CO
(4 rows)
As you know, Cassandra is designed to return data with a specific, precise key. Using multiple clustering columns helps in that approach, in that you are helping Cassandra quickly identify the data you wish to retrieve.
The only thing I would think about changing, is to see if you can do without either geohash or pid in the PRIMARY KEY. My gut says to get rid of pid, as it really isn't anything that you would query by. The only value it provides is that of uniqueness, which you will need if you plan on storing the same geohashes multiple times.
Including pid in your PRIMARY KEY leaves you with one non-key column, and that allows you to use the WITH COMPACT STORAGE directive. Really the only true edge that gets you, is in saving disk space as the clustering column names are not stored with the value. This becomes apparent when looking at the table from within the cassandra-cli tool:
Without compact storage:
[default#stackoverflow] list georecords3;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: d
=> (name=p:8:9:v:c:n:e:dp89vcnem4n:1001:, value=, timestamp=1428766191314431)
=> (name=p:8:9:v:c:n:e:dp89vcnem4n:1001:data, value=42656c6f69742c205749, timestamp=1428766191314431)
=> (name=p:8:c:p:w:g:v:dp8cpwgv3:1003:, value=, timestamp=1428766191382903)
=> (name=p:8:c:p:w:g:v:dp8cpwgv3:1003:data, value=486172766172642c20494c, timestamp=1428766191382903)
=> (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002:, value=, timestamp=1428766191276179)
=> (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002:data, value=536865626f7967616e2c205749, timestamp=1428766191276179)
-------------------
RowKey: 9
=> (name=x:j:6:5:j:5:1:9xj65j518:1004:, value=, timestamp=1428766191424701)
=> (name=x:j:6:5:j:5:1:9xj65j518:1004:data, value=44656e7665722c20434f, timestamp=1428766191424701)
2 Rows Returned.
Elapsed time: 217 msec(s).
With compact storage:
[default#stackoverflow] list georecords2;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: d
=> (name=p:8:9:v:c:n:e:dp89vcnem4n:1001, value=Beloit, WI, timestamp=1428765102994932)
=> (name=p:8:c:p:w:g:v:dp8cpwgv3:1003, value=Harvard, IL, timestamp=1428765717512832)
=> (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002, value=Sheboygan, WI, timestamp=1428765102919171)
-------------------
RowKey: 9
=> (name=x:j:6:5:j:5:1:9xj65j518:1004, value=Denver, CO, timestamp=1428766022126266)
2 Rows Returned.
Elapsed time: 39 msec(s).
But, I would recommend against using WITH COMPACT STORAGE for the following reasons:
You cannot add or remove columns after table creation.
It prevents you from having multiple non-key columns in the table.
It was really intended to be used in the old (deprecated) thrift-based approach to column family (table) modeling, and really shouldn't be used/needed anymore.
Yes, it saves you disk space, but disk space is cheap so I'd consider this a very small benefit.
I know you said "other than cardinality of the partition key", but I am going to mention it here anyway. You'll notice in my sample data set, that almost all of my rows are stored with the d partition key value. If I were to create an application like this for myself, tracking geohashes in the Wisconsin/Illinois stateline area, I would definitely have the problem of most of my data being stored in the same partition (creating a hotspot in my cluster). So knowing my use case and potential data, I would probably combine the first three or so columns into a single partition key.
The other issue with storing everything in the same partition key is that each partition can store a max of about 2 billion columns. So it would also make sense to put some thought into whether or not your data could ever eclipse that mark. And obviously, the higher the cardinality of your partition key, the less likely you are to run into this issue.
By looking at your question, it appears to me that you have looked at your data and you understand this...definite "plus." And 30 unique values in a partition key should provide sufficient distribution. I just wanted to spend some time illustrating how big of a deal that could be.
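If I were combining the first three or so characters into a single partition key, as suggested above, the table might look something like this (a sketch only; the names are illustrative):
-- A geohash prefix becomes the partition key; the full geohash and pid remain
-- clustering columns so rows stay unique and prefix ranges still work.
CREATE TABLE georecords_by_prefix (
    geo_prefix text,   -- first three characters of the geohash, e.g. 'dp8'
    geohash    text,
    pid        int,
    data       text,
    PRIMARY KEY (geo_prefix, geohash, pid)
);

-- Prefix search within one partition: everything starting with 'dp8c'
SELECT geohash, pid, data FROM georecords_by_prefix
WHERE geo_prefix = 'dp8' AND geohash >= 'dp8c' AND geohash < 'dp8d';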
Anyway, I also wanted to add a "nicely done," as it sounds like you are on the right track.
Edit
The still unresolved question for me is which approach will scale better, in which situations.
Scalability is tied more to how many replicas (R) you have across your N nodes. Cassandra scales linearly: the more nodes you add, the more transactions your application can handle. Purely from a data-distribution standpoint, your first model will have a higher-cardinality partition key, so it will distribute much more evenly than the second. However, the first model is much more restrictive in terms of query flexibility.
Additionally, if you are doing range queries within a partition (which I believe you said you are) then the second model will allow for that in a very performant manner. All data within a partition is stored on the same node. So querying multiple results for g1='d' AND g2='p'...etc...will perform extremely well.
I may just have to play with the data more and run test cases.
That is a great idea. I think you will find that the second model is the way to go (in terms of query flexibility and querying for multiple rows). If there is a performance difference between the two when it comes to single row queries, my suspicion is that it should be negligible.
Here's the best Cassandra modeling guide I've found: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
I've used composite columns (6 of them) successfully for very high write/read loads. There is no significant performance penalty when using compact storage (http://docs.datastax.com/en/cql/3.0/cql/cql_reference/create_table_r.html).
Compact storage means the data is stored internally in a single row, with the limitation that you can only have one data column. That seems to suit your application well, regardless of which data model you choose, and would make maximal use of your geo_key filtering.
Another aspect to consider is that the columns are sorted in Cassandra. Having more clustering columns will improve the sorting speed and potentially the lookup.
However, in your case, I'd start with having the geohash as a row key and turn on row cache for fast lookup (http://www.datastax.com/dev/blog/row-caching-in-cassandra-2-1). If the performance is lacking there, I'd run performance tests on different data representations.
