Cassandra - do group by and join in the right way - apache-spark

I know Cassandra does not support GROUP BY. But how can I achieve a similar result on a big collection of data?
Let's say I have a table with 1 million rows of clicks, 1 million rows of shares, and a user_profile table. clicks and shares store one operation per row with a created_at column. On a dashboard I would like to show results grouped by day, for example:
2016-06-01 - 2016-07-01
+--------------+------+-------+
| user_profile | like | share |
+--------------+------+-------+
| John         |   34 |    12 |
| Adam         |   12 |     4 |
| Bruce        |    4 |     2 |
+--------------+------+-------+
The question is, how can I do this in the right way:
Create a user_likes_shares table with counters, keyed by date
Create a UDF to group by each column, then join the results in code by merging arrays by key
Select data from the 3 tables, then group and join them in code by merging arrays by key
Another option?
If you use code to join the results, do you use Apache Spark SQL? Is Spark the right tool in this case?

Assuming that your dashboard page will show all historical results, grouped by day:
1. 'Group by' in a table: The denormalised approach is the accepted way of doing things in Cassandra, as writes and disk space are cheap. If you can structure your data model (and application writes) to support this, then this is the best approach (see the sketch after this list).
2. 'Group by' in a UDA: In this blog post, the author notes that all rows are pulled back to the coordinator, reconciled and aggregated there (for CL>1). So even if your clicks and shares tables are partitioned by date, Cassandra will still have to pull all rows for that date back to the coordinator, store them in the JVM heap and then process them. So this approach has reduced scalability.
3. Merging in code: This will be a much slower approach as you will have to transfer a lot more data from the coordinator to your application server.
4. Spark: This is a good approach if you have to make ad-hoc queries (e.g. analyzing data, rather than populating a web page) and can be simplified by running your Spark jobs through a notebook application (e.g. Apache Zeppelin). However, in your use case, you have the complexity of having to wait for that job to finish, write the output somewhere and then display it on a web page.
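To make option 1 concrete, here is a minimal sketch of the per-day counter-table approach using the Python driver. The table and column names (user_actions_by_day, user_name, likes, shares), keyspace and contact point are all placeholders, not anything from the original question:
from datetime import date
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # placeholder contact point and keyspace

# Denormalised counter table, one row per (day, user):
#   CREATE TABLE user_actions_by_day (
#       day date,
#       user_name text,
#       likes counter,
#       shares counter,
#       PRIMARY KEY (day, user_name)
#   );

def record_like(user_name, day=None):
    # Bump the counter once per click event (counter updates are not idempotent).
    session.execute(
        "UPDATE user_actions_by_day SET likes = likes + 1 "
        "WHERE day = %s AND user_name = %s",
        (day or date.today(), user_name),
    )

def dashboard_rows(day):
    # A single-partition read returns the pre-aggregated dashboard for that day.
    return session.execute(
        "SELECT user_name, likes, shares FROM user_actions_by_day WHERE day = %s",
        (day,),
    )
The dashboard for a date range then becomes one small partition read per day instead of an aggregation over millions of click and share rows.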

Related

Time series with Delta time travel in Databricks

I'm storing the prices of products in a Delta table. The schema of the table is like this:
id | price | updated
---+-------+-----------
 1 |     3 | 2022-03-21
 2 |     4 | 2022-03-20
 3 |     3 | 2022-03-20
I upsert rows using the id field as the primary key, updating the price and updated fields.
I'm trying to get the series of prices over time using Databricks time travel, but looking at the documentation it seems I can only compare two versions of a table, like this:
%sql
SELECT count(DISTINCT id) - (
  SELECT count(DISTINCT id)
  FROM table TIMESTAMP AS OF date_sub(current_date(), 7)
)
FROM table
Is there a way to select the distinct prices across all versions?
I would really not recommend using time travel for that, for the following reasons:
If your data is updated frequently you will have a lot of versions, and your performance will degrade over time, as handling a huge number of versions (tens of thousands) puts a lot of pressure on the driver.
It's very hard to do historical analysis, as you can already see - for each version you need subqueries and to union the data.
Instead, you can use two tables: the first with the actual data, and the second with the historical data, ideally built as SCD Type 2 (Slowly Changing Dimensions) with markers for which period each price was active. You can build that second table using the Change Data Feed (CDF) functionality to pull changes from the first table and apply them to the second table using a MERGE operation. The Databricks documentation includes an example of using MERGE to build SCD Type 2 (although without CDF).
With this approach it will be easy to perform historical analysis, as all data will be in the same table and you don't need time travel.
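For illustration, a minimal sketch of the CDF + MERGE step. It assumes a source table prices with Change Data Feed enabled, a history table prices_history with hypothetical is_current / valid_to columns, and a placeholder last-processed version; spark is the SparkSession a Databricks notebook provides:
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Read the latest changes from the source table's Change Data Feed,
# keeping only inserts and post-update images.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 10)          # placeholder: last version already processed
    .table("prices")
    .filter(F.col("_change_type").isin("insert", "update_postimage"))
)

history = DeltaTable.forName(spark, "prices_history")

# Close the currently open history row for each changed id; a second pass
# (not shown) appends the new open rows - the usual SCD Type 2 pattern.
(
    history.alias("h")
    .merge(changes.alias("c"), "h.id = c.id AND h.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false", "valid_to": "c.updated"})
    .execute()
)
The SCD Type 2 example in the Databricks documentation shows the full two-step MERGE in detail.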

Cassandra DB Query for System Date

I have a table customer_info in a Cassandra DB and it contains a column billing_due_date, which is a date field (dd-MMM-yy, e.g. 17-AUG-21). I need to fetch certain fields from the customer_info table based on billing_due_date, where billing_due_date should be equal to the system date + 1.
Can anyone suggest a Cassandra DB query for this?
fetch certain fields from the customer_info table based on billing_due_date
transaction_id is the primary key; it is just generated through uuid()
Unfortunately, there really isn't going to be a good way to do this. Right now, the data in the customer_info table is distributed across all nodes in the cluster based on a hash of the transaction_id. Essentially, any query based on something other than transaction_id is going to read from multiple nodes, which is a query anti-pattern in Cassandra.
In Cassandra, you need to design your tables based on the queries that they need to support. For example, choosing transaction_id as the sole primary key may distribute well, but it doesn't offer much in the way of query flexibility.
Therefore, the best way to solve for this query is to create a query table containing the data from customer_info, with a key definition of PRIMARY KEY (billing_due_date, transaction_id). Then, a query like this should work:
> SELECT * FROM customer_info_by_date
WHERE billing_due_date = toDate(now()) + 2d;
billing_due_date | transaction_id | name
------------------+--------------------------------------+---------
2021-08-20 | 2fe82360-e314-4d5b-aa33-5deee9f03811 | Rinzler
2021-08-20 | 92cb9ee5-dee6-47fe-b372-0829f2e384cd | Clu
(2 rows)
Note that for this example, I am using the system date plus 2 days out. So in your case, you'll want to adjust the "duration" aspect from 2d down to 1d. Cassandra 4.0 allows date arithmetic, so this should work just fine if you are on that version. If you are not, you'll have to do the "system date plus one" calculation on the app side.
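If you do the date arithmetic on the app side, a minimal sketch with the Python driver looks like this; the contact point and keyspace are placeholders, and name simply stands in for whichever customer_info fields you need:
from datetime import date, timedelta
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Query table keyed for the "due tomorrow" lookup; add the remaining
# customer_info columns you need to display.
session.execute("""
    CREATE TABLE IF NOT EXISTS customer_info_by_date (
        billing_due_date date,
        transaction_id uuid,
        name text,
        PRIMARY KEY (billing_due_date, transaction_id)
    )
""")

# "System date plus one" computed on the application side, so it works on
# any Cassandra version.
tomorrow = date.today() + timedelta(days=1)
rows = session.execute(
    "SELECT name, transaction_id FROM customer_info_by_date WHERE billing_due_date = %s",
    (tomorrow,),
)
for row in rows:
    print(row.transaction_id, row.name)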
Another way to go about this would be to create a secondary index on billing_due_date, but I don't recommend that path as it will query multiple nodes to build the result set.

Bulk copy from Cassandra table column to a file

I have a requirement to copy a Cassandra table column into a file.
The database has 15 million records with the columns below. I want to copy the payment column data into a file. Since it is a production environment, this could put stress on the Cassandra cluster.
userid | contract | payment | createdDate
Any suggestions?
Out of the 15 million payment details, we want to modify a few (based on some condition) and insert them into a different Cassandra table.
Copy to a file -> process it -> write it to a new database table; that is the plan. But first of all, how do we get a copy of the column from the Cassandra database?
Regards
Kiran
You can use Spark + the Spark Cassandra Connector (SCC) to perform data loading, modification and writing back. SCC has a number of knobs you can use to tune throughput so as not to overload the cluster when reading and writing.
If you don't have Spark, you can still use a similar approach when fetching data: don't issue select * from table (this will overload the node that handles the request); instead, load the data by specific token ranges, so the queries go to different servers and don't overload any of them too much. You can find a code example that does a scan by token ranges here.
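As a rough sketch of the Spark + SCC route - the keyspace, table names, contact point and the modification condition below are all placeholders:
from pyspark.sql import SparkSession, functions as F

# Assumes the job is submitted with the spark-cassandra-connector package on the
# classpath; SCC also exposes read/write throttling settings worth tuning so a
# production cluster isn't overloaded.
spark = (
    SparkSession.builder
    .appName("payment-migration")
    .config("spark.cassandra.connection.host", "10.0.0.1")  # placeholder contact point
    .getOrCreate()
)

payments = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="ks", table="payments")             # placeholder keyspace/table
    .load()
)

# Placeholder condition: pick out the rows that need changing and modify them.
modified = (
    payments.filter(F.col("payment") < 0)
    .withColumn("payment", F.abs(F.col("payment")))
)

(
    modified.write.format("org.apache.spark.sql.cassandra")
    .options(keyspace="ks", table="payments_modified")     # placeholder target table
    .mode("append")
    .save()
)
This also skips the intermediate file entirely, since Spark reads, transforms and writes back in a single job.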

Group data and extract average in Cassandra cqlsh

Let's say we have a keyspace named sensors and a table named sensor_per_row.
This table has the following structure:
sensor_id | ts | value
In this case sensor_id represents the partition key and ts (the date the record was created) represents the clustering key.
select sensor_id, value, TODATE(ts) as day, ts from sensors.sensor_per_row
The outcome of this select is:
sensor_id | value | day | ts
-----------+-------+------------+---------------
Sensor 2 | 52.7 | 2019-01-04 | 1546640464138
Sensor 2 | 52.8 | 2019-01-04 | 1546640564376
Sensor 2 | 52.9 | 2019-01-04 | 1546640664617
How can I group the data by ts, more specifically group it by date, and return the daily average value for each sensor using cqlsh? For instance:
sensor_id | system.avg(value) | day
-----------+-------------------+------------
Sensor 2 | 52.52059 | 2018-12-11
Sensor 2 | 42.52059 | 2018-12-10
Sensor 3 | 32.52059 | 2018-12-11
One way, I guess, is to use a UDF (user-defined function), but such a function runs on only one row. Is it possible to select data inside a UDF?
Another way is using Java etc., with multiple queries for each day, or processing the data at some other contact point such as a REST web service, but I don't know how efficient that would be... any suggestions?
NoSQL Limitations
While working with NoSQL, we generally have to give up:
Some ACID guarantees.
Consistency from CAP.
Shuffling operations: JOIN, GROUP BY.
You may perform the above operations by reading the data (rows) from the table and aggregating it in your application (see the sketch below).
You can also refer to the answer MAX(), DISTINCT and group by in Cassandra.
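A minimal client-side sketch of that read-and-aggregate approach with the Python driver, using the keyspace and table from the question (the contact point and the chosen sensor are placeholders):
from collections import defaultdict
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("sensors")

# Pull one sensor's rows and average them per day on the client.
sums, counts = defaultdict(float), defaultdict(int)
for row in session.execute(
    "SELECT sensor_id, toDate(ts) AS day, value FROM sensor_per_row WHERE sensor_id = %s",
    ("Sensor 2",),
):
    sums[(row.sensor_id, row.day)] += row.value
    counts[(row.sensor_id, row.day)] += 1

for (sensor_id, day), total in sums.items():
    print(sensor_id, day, total / counts[(sensor_id, day)])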
So I found the solution; I will post it in case somebody else has the same question.
From what I read, data modeling seems to be the answer, which means:
In a Cassandra DB we have partition keys and clustering keys. Cassandra has the ability to handle multiple inserts simultaneously. That gives us the possibility of inserting the data into more than one table at the same time, which pretty much means we can create different tables for the same data-collection application, used in much the same way as materialized views (MySQL).
For instance, let's say we have the log schema {sensor_id, region, value}.
The first thing that comes to mind is to generate a table called sensor_per_row like:
sensor_id | value | region | ts
-----------+-------+------------+---------------
This is a very efficient way of storing the data for a long time, but given Cassandra's functions it is not that simple to visualize the data and gain analytics out of it.
Because of that, we can create additional tables with a TTL (TTL stands for time to live), which simply determines how long the data will be stored.
For instance, if we want the daily measurements of a specific sensor, we can create a table with day and sensor_id as the partition key and the timestamp as a clustering key in descending order.
If we also add a TTL value of 24*60*60 (86,400 seconds, i.e. one day), the table keeps only one day's worth of data.
So creating, let's say, a table sensor_per_day with the above format and TTL will actually give us the daily measurements. And at the end of the day the older rows expire, while the data remains stored in the previous table sensor_per_row.
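For reference, a minimal sketch of such a daily table; the column types are assumptions based on the sample output earlier in this question:
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("sensors")  # placeholder contact point

# Partitioned by (day, sensor_id), newest reading first; rows expire after one
# day because default_time_to_live is set in seconds.
session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_per_day (
        day date,
        sensor_id text,
        ts timestamp,
        value double,
        PRIMARY KEY ((day, sensor_id), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
      AND default_time_to_live = 86400
""")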
I hope that gives you the idea.

Are there any major disadvantages to having multiple clustering columns in cassandra?

I'm designing a Cassandra table where I need to be able to retrieve rows by their geohash. I have something that works, but I'd like to avoid range queries more than I'm currently able to.
The current table schema is below, with geo_key containing the first five characters of the geohash string. I query using the geo_key, then range filter on the full geohash, allowing me to prefix search on a geohash of length 5 or greater:
CREATE TABLE georecords (geo_key text, geohash text, data text, PRIMARY KEY (geo_key, geohash))
My idea is that I could instead store the characters of the geohash as separate columns, allowing me to specify as many characters as I want in order to do a prefix match on the geohash. My concern is what impact using multiple clustering columns might have:
CREATE TABLE georecords (g1 text, g2 text, g3 text, g4 text, g5 text, g6 text, g7 text, g8 text, geohash text, pid int, data text, PRIMARY KEY (g1, g2, g3, g4, g5, g6, g7, g8, geohash, pid))
(I'm not really concerned about the cardinality of the partition key - g1 would have minimum 30 values, and I have other workarounds for it as well)
Other than the cardinality of the partition key, and extra storage requirements, what should I be aware of if I used the many-clustering-columns approach?
Other than the cardinality of the partition key, and extra storage requirements, what should I be aware of if I used the many-clustering-columns approach?
This seemed like an interesting problem to help out with, so I built a few CQL tables of differing PRIMARY KEY structure and options. I then used http://geohash.org/ to come up with a few endpoints, and inserted them.
aploetz#cqlsh:stackoverflow> SELECT g1, g2, g3, g4, g5, g6, g7, g8, geohash, pid, data FROm georecords3;
g1 | g2 | g3 | g4 | g5 | g6 | g7 | g8 | geohash | pid | data
----+----+----+----+----+----+----+----+--------------+------+---------------
d | p | 8 | 9 | v | c | n | e | dp89vcnem4n | 1001 | Beloit, WI
d | p | 8 | c | p | w | g | v | dp8cpwgv3 | 1003 | Harvard, IL
d | p | c | 8 | g | e | k | t | dpc8gektg8w7 | 1002 | Sheboygan, WI
9 | x | j | 6 | 5 | j | 5 | 1 | 9xj65j518 | 1004 | Denver, CO
(4 rows)
As you know, Cassandra is designed to return data with a specific, precise key. Using multiple clustering columns helps in that approach, in that you are helping Cassandra quickly identify the data you wish to retrieve.
The only thing I would think about changing, is to see if you can do without either geohash or pid in the PRIMARY KEY. My gut says to get rid of pid, as it really isn't anything that you would query by. The only value it provides is that of uniqueness, which you will need if you plan on storing the same geohashes multiple times.
Including pid in your PRIMARY KEY leaves you with one non-key column, and that allows you to use the WITH COMPACT STORAGE directive. Really the only true edge that gets you, is in saving disk space as the clustering column names are not stored with the value. This becomes apparent when looking at the table from within the cassandra-cli tool:
Without compact storage:
[default#stackoverflow] list georecords3;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: d
=> (name=p:8:9:v:c:n:e:dp89vcnem4n:1001:, value=, timestamp=1428766191314431)
=> (name=p:8:9:v:c:n:e:dp89vcnem4n:1001:data, value=42656c6f69742c205749, timestamp=1428766191314431)
=> (name=p:8:c:p:w:g:v:dp8cpwgv3:1003:, value=, timestamp=1428766191382903)
=> (name=p:8:c:p:w:g:v:dp8cpwgv3:1003:data, value=486172766172642c20494c, timestamp=1428766191382903)
=> (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002:, value=, timestamp=1428766191276179)
=> (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002:data, value=536865626f7967616e2c205749, timestamp=1428766191276179)
-------------------
RowKey: 9
=> (name=x:j:6:5:j:5:1:9xj65j518:1004:, value=, timestamp=1428766191424701)
=> (name=x:j:6:5:j:5:1:9xj65j518:1004:data, value=44656e7665722c20434f, timestamp=1428766191424701)
2 Rows Returned.
Elapsed time: 217 msec(s).
With compact storage:
[default#stackoverflow] list georecords2;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: d
=> (name=p:8:9:v:c:n:e:dp89vcnem4n:1001, value=Beloit, WI, timestamp=1428765102994932)
=> (name=p:8:c:p:w:g:v:dp8cpwgv3:1003, value=Harvard, IL, timestamp=1428765717512832)
=> (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002, value=Sheboygan, WI, timestamp=1428765102919171)
-------------------
RowKey: 9
=> (name=x:j:6:5:j:5:1:9xj65j518:1004, value=Denver, CO, timestamp=1428766022126266)
2 Rows Returned.
Elapsed time: 39 msec(s).
But, I would recommend against using WITH COMPACT STORAGE for the following reasons:
You cannot add or remove columns after table creation.
It prevents you from having multiple non-key columns in the table.
It was really intended to be used in the old (deprecated) thrift-based approach to column family (table) modeling, and really shouldn't be used/needed anymore.
Yes, it saves you disk space, but disk space is cheap so I'd consider this a very small benefit.
I know you said "other than cardinality of the partition key", but I am going to mention it here anyway. You'll notice in my sample data set, that almost all of my rows are stored with the d partition key value. If I were to create an application like this for myself, tracking geohashes in the Wisconsin/Illinois stateline area, I would definitely have the problem of most of my data being stored in the same partition (creating a hotspot in my cluster). So knowing my use case and potential data, I would probably combine the first three or so columns into a single partition key.
The other issue with storing everything in the same partition key, is that each partition can store a max of about 2 billion columns. So it would also make sense to put some thought into whether or not your data could ever eclipse that mark. And obviously, the higher the cardinality of your partition key, the less likely you are to run into this issue.
By looking at your question, it appears to me that you have looked at your data and you understand this...definite "plus." And 30 unique values in a partition key should provide sufficient distribution. I just wanted to spend some time illustrating how big of a deal that could be.
Anyway, I also wanted to add a "nicely done," as it sounds like you are on the right track.
Edit
The still unresolved question for me is which approach will scale better, in which situations.
Scalability is tied more to how many replicas (R) you have across your nodes (N). Cassandra scales linearly: the more nodes you add, the more transactions your application can handle. Purely from a data-distribution standpoint, your first model has a higher-cardinality partition key, so it will distribute much more evenly than the second. However, the first model is much more restrictive in terms of query flexibility.
Additionally, if you are doing range queries within a partition (which I believe you said you are) then the second model will allow for that in a very performant manner. All data within a partition is stored on the same node. So querying multiple results for g1='d' AND g2='p'...etc...will perform extremely well.
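For illustration, a quick sketch of such a prefix query with the Python driver, against the georecords3 table from the sample above (the contact point is a placeholder):
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("stackoverflow")

# Restricting g1 (the partition key) plus a leading prefix of the clustering
# columns is a single-partition slice, so it performs well.
rows = session.execute(
    "SELECT geohash, pid, data FROM georecords3 WHERE g1 = %s AND g2 = %s AND g3 = %s",
    ("d", "p", "8"),
)
for row in rows:
    print(row.geohash, row.pid, row.data)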
I may just have to play with the data more and run test cases.
That is a great idea. I think you will find that the second model is the way to go (in terms of query flexibility and querying for multiple rows). If there is a performance difference between the two when it comes to single row queries, my suspicion is that it should be negligible.
Here's the best Cassandra modeling guide I've found: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
I've used composite columns (6 of them) successfully for very high write/read loads. There is no significant performance penalty when using compact storage (http://docs.datastax.com/en/cql/3.0/cql/cql_reference/create_table_r.html).
Compact storage means the data is stored internally in a single row, with the limitation that you can only have one data column. That seems to suit your application well, regardless of which data model you choose, and would make maximal use of your geo_key filtering.
Another aspect to consider is that the columns are sorted in Cassandra. Having more clustering columns will improve the sorting speed and potentially the lookup.
However, in your case, I'd start with having the geohash as a row key and turn on row cache for fast lookup (http://www.datastax.com/dev/blog/row-caching-in-cassandra-2-1). If the performance is lacking there, I'd run performance tests on different data representations.
