Time series with Delta time travel in Databricks - apache-spark

I'm storing in a delta table the prices of products. The schema of the table is like this:
id | price | updated
1 | 3 | 2022-03-21
2 | 4 | 2022-03-20
3 | 3 | 2022-03-20
I upsert rows using the id field as the primary key, updating the price and updated fields.
I'm trying to get the series of prices over time using Databricks time travel. But from the documentation, it looks like I can only compare two versions of a table, like this:
%sql
SELECT count(distinct id) - (
SELECT count(distinct id)
FROM table TIMESTAMP AS OF date_sub(current_date(), 7))
FROM table
Is there a way to select the different prices across all versions? For example, the distinct prices.

I would really not recommend using time travel for this, for the following reasons:
If your data is updated frequently, you will have a lot of versions, and performance will degrade over time, because handling a huge number of versions (tens of thousands) puts a lot of pressure on the driver.
It's very hard to do historical analysis, as you can already see: for each version you need a subquery, and then you have to union the data.
Instead, you can use two tables: the first with the current data, and the second with historical data, ideally built as an SCD Type 2 (Slowly Changing Dimension) table with markers for the period during which each price was active. You can build the second table using the Change Data Feed (CDF) functionality to pull changes from the first table, and apply them to the second table using the MERGE operation. The Databricks documentation includes an example of using MERGE to build SCD Type 2 (although without CDF).
With this approach it will be easy to perform historical analysis, as all the data will be in the same table and you won't need time travel.
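To illustrate what the SCD Type 2 history table ends up holding, here is a minimal sketch in plain Python rather than Spark. It is not the Databricks MERGE example. The PriceVersion class, the apply_change function and the valid_from / valid_to column names are all hypothetical, chosen only for illustration. Each price change closes the currently open row for that id and opens a new one.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class PriceVersion:
    # Hypothetical SCD Type 2 row: one row per (id, price) validity period.
    product_id: int
    price: int
    valid_from: date
    valid_to: Optional[date] = None  # None means "still the current price"


def apply_change(history: List[PriceVersion], product_id: int, price: int, updated: date) -> None:
    """Apply one upsert from the main table to the history table."""
    current = next(
        (r for r in history if r.product_id == product_id and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.price == price:
            return  # price unchanged, nothing to record
        current.valid_to = updated  # close the previously active period
    history.append(PriceVersion(product_id, price, updated))


history: List[PriceVersion] = []
apply_change(history, 1, 2, date(2022, 3, 19))
apply_change(history, 1, 3, date(2022, 3, 21))
apply_change(history, 2, 4, date(2022, 3, 20))

# All prices product 1 ever had, without touching time travel.
distinct_prices_for_1 = sorted({r.price for r in history if r.product_id == 1})
```

With the history in one table, "distinct prices per id" becomes an ordinary query over that table.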

Related

Cassandra DB Query for System Date

I have a table customer_info in a Cassandra DB, and it contains a column billing_due_date, which is a date field (dd-MMM-yy, e.g. 17-AUG-21). I need to fetch certain fields from the customer_info table based on billing_due_date, where billing_due_date should be equal to the system date + 1.
Can anyone suggest a Cassandra DB query for this?
fetch the certain fields from customer_info table based on billing_due_date
transaction_id is the primary key; it is just generated through uuid()
Unfortunately, there really isn't going to be a good way to do this. Right now, the data in the customer_info table is distributed across all nodes in the cluster based on a hash of the transaction_id. Essentially, any query based on something other than transaction_id is going to read from multiple nodes, which is a query anti-pattern in Cassandra.
In Cassandra, you need to design your tables based on the queries that they need to support. For example, choosing transaction_id as the sole primary key may distribute well, but it doesn't offer much in the way of query flexibility.
Therefore, the best way to solve for this query is to create a query table containing the data from customer_info with a key definition of PRIMARY KEY (billing_due_date, transaction_id). Then, a query like this should work:
> SELECT * FROM customer_info_by_date
WHERE billing_due_date = toDate(now()) + 2d;
billing_due_date | transaction_id | name
------------------+--------------------------------------+---------
2021-08-20 | 2fe82360-e314-4d5b-aa33-5deee9f03811 | Rinzler
2021-08-20 | 92cb9ee5-dee6-47fe-b372-0829f2e384cd | Clu
(2 rows)
Note that for this example, I am using the system date plus 2 days out. So in your case, you'll want to adjust the "duration" aspect from 2d down to 1d. Cassandra 4.0 allows date arithmetic, so this should work just fine if you are on that version. If you are not, you'll have to do the "system date plus one" calculation on the app side.
Another way to go about this would be to create a secondary index on billing_due_date, but I don't recommend that path, as it will query multiple nodes to build the result set.
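If you are on a version without date arithmetic, the "system date plus one" calculation on the app side could look roughly like the sketch below. The function names are hypothetical, and the table name customer_info_by_date is taken from the example query above. The date is rendered as a 'yyyy-mm-dd' literal. In a real application you would more likely pass it as a bind parameter through your driver.

```python
from datetime import date, timedelta
from typing import Optional


def due_date_plus(days: int, today: Optional[date] = None) -> date:
    """Return the system date (or the supplied date) shifted by `days`."""
    base = today if today is not None else date.today()
    return base + timedelta(days=days)


def build_query(today: Optional[date] = None) -> str:
    """Build the CQL text for 'billing_due_date = system date + 1'."""
    target = due_date_plus(1, today)
    return (
        "SELECT * FROM customer_info_by_date "
        f"WHERE billing_due_date = '{target.isoformat()}';"
    )


# Example with a fixed "today" so the result is reproducible.
example_query = build_query(today=date(2021, 8, 17))
```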

Retrieving bucketing value in WITH statement for subsequent SELECT

I have several tables with bucketing applied. It works great when I specify the bucket/partition parameter upfront in my SELECT query. However, when I retrieve the bucket value I need from a different table, within a WITH select statement, Hive/Athena no longer seems to use the optimisation and searches the entire database instead. I would like to learn if there is a way to write my query so that the optimisation is kept.
For a simple example, I have two tables:
Table1
category | categoryid
---------+-----------
mass | 1
Table2
categoryid | index | value
-----------+-------+------
1 | 0 | 15
1 | 1 | 10
1 | 2 | 7
The bucketed/clustered column is categoryid. I have a single category ('mass') and would like to retrieve the values that correspond to that category. So I have designed my SELECT like this:
WITH dataset AS (
SELECT categoryid
FROM Table1
WHERE category='mass'
)
SELECT index,value
FROM Table2, dataset
WHERE Table2.categoryid=dataset.categoryid
This will run, but it seems to search the entire database, presumably because Hive doesn't know the categoryid for bucketing before commencing the search. If I swap out the final Table2.categoryid=dataset.categoryid for Table2.categoryid=1, then it searches only a fraction of the db.
So is there some way of writing this query to ensure Hive doesn't search more buckets in the second table than it has to?
Athena is based on Presto. Unless there is some modification in Athena in this area (and I think there currently isn't), this cannot be made to work in single query.
Recommended workaround: issue one query to gather the dataset.categoryid values, then pass them as constants to your main query:
WITH dataset AS (
SELECT categoryid
FROM Table1
WHERE category='mass'
)
SELECT index,value
FROM Table2, dataset
WHERE Table2.categoryid = dataset.categoryid
AND Table2.categoryid IN ( <all possible values> );
This is going to be improved with the addition of Dynamic Filtering in Presto, which the Presto community is currently working on.
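The two-step workaround can be sketched as follows. This is only an illustration of the flow: it uses Python's built-in sqlite3 as a stand-in for Athena, so it has no bucketing at all. The column named index is renamed to idx here because index is a reserved word in SQLite. The point is that the second query contains the ids as literal constants, which is what lets the engine prune buckets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Table1 (category TEXT, categoryid INTEGER);
    CREATE TABLE Table2 (categoryid INTEGER, idx INTEGER, value INTEGER);
    INSERT INTO Table1 VALUES ('mass', 1), ('length', 2);
    INSERT INTO Table2 VALUES (1, 0, 15), (1, 1, 10), (1, 2, 7), (2, 0, 99);
    """
)

# Step 1: a separate query gathers the categoryid values.
ids = [
    row[0]
    for row in conn.execute(
        "SELECT categoryid FROM Table1 WHERE category = ?", ("mass",)
    )
]


def in_list(values):
    """Render integer ids as a comma-separated list of SQL constants."""
    if not values:
        raise ValueError("no ids found; skip the main query")
    return ", ".join(str(int(v)) for v in values)


# Step 2: the main query receives the ids as constants.
main_query = (
    "SELECT idx, value FROM Table2 "
    f"WHERE categoryid IN ({in_list(ids)}) ORDER BY idx"
)
rows = conn.execute(main_query).fetchall()
```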

Group data and extract average in Cassandra cqlsh

Let's say we have a keyspace named sensors and a table named sensor_per_row.
This table has the following structure:
sensor_id | ts | value
In this case sensor_id is the partition key and ts (the date the record was created) is the clustering key.
select sensor_id, value , TODATE(ts) as day ,ts from sensors.sensor_per_row
The outcome of this select is
sensor_id | value | day | ts
-----------+-------+------------+---------------
Sensor 2 | 52.7 | 2019-01-04 | 1546640464138
Sensor 2 | 52.8 | 2019-01-04 | 1546640564376
Sensor 2 | 52.9 | 2019-01-04 | 1546640664617
How can I group the data by ts, or more specifically by date, and return the daily average value for each sensor using cqlsh? For instance:
sensor_id | system.avg(value) | day
-----------+-------------------+------------
Sensor 2 | 52.52059 | 2018-12-11
Sensor 2 | 42.52059 | 2018-12-10
Sensor 3 | 32.52059 | 2018-12-11
One way, I guess, is to use a UDF (user-defined function), but such a function runs on only one row. Is it possible to select data inside a UDF?
Another way is to use Java or similar, with multiple queries for each day, or to process the data in some other contact point such as a REST web service, but I don't know how efficient that would be. Any suggestions?
NoSQL Limitations
While working with NoSQL, we generally have to give up:
Some ACID guarantees.
Consistency from CAP.
Shuffling operations: JOIN, GROUP BY.
You may perform the above operations by reading the rows from the table and aggregating them (for example, summing) on the client side.
You can also refer to the answer MAX(), DISTINCT and group by in Cassandra
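As a rough sketch of that client-side aggregation, the following Python groups rows by sensor and calendar day and averages the values. It assumes the rows have already been read from sensor_per_row as (sensor_id, ts, value) tuples, with ts in epoch milliseconds interpreted as UTC. The function name daily_averages is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_averages(rows):
    """Group (sensor_id, ts_ms, value) rows by (sensor_id, day) and average them."""
    sums = defaultdict(lambda: [0.0, 0])  # (sensor_id, day) -> [total, count]
    for sensor_id, ts_ms, value in rows:
        day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).date()
        acc = sums[(sensor_id, day)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Sample rows taken from the question's output.
sample_rows = [
    ("Sensor 2", 1546640464138, 52.7),
    ("Sensor 2", 1546640564376, 52.8),
    ("Sensor 2", 1546640664617, 52.9),
]
averages = daily_averages(sample_rows)
```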
So I found the solution; I will post it in case somebody else has the same question.
From what I read, data modeling seems to be the answer, which means:
In Cassandra we have partition keys and clustering keys. Cassandra can handle many inserts simultaneously, which makes it possible to insert the same data into more than one table at the same time. In other words, we can create different tables for the same data collection application and use them in a way similar to materialized views.
For instance, let's say we have the log schema {sensor_id, region, value}.
The first thing that comes to mind is to create a table called sensor_per_row like:
sensor_id | value | region | ts
-----------+-------+------------+---------------
This is a very efficient way of storing the data for a long time, but given the functions Cassandra offers, it is not that simple to visualize the data and get analytics out of it.
Because of that, we can create different tables with a TTL (time to live), which simply defines how long the data will be stored.
For instance, if we want the daily measurements of a specific sensor, we can create a table with day and sensor_id as the partition key and the timestamp as the clustering key in descending order.
If we also add a TTL value of 24*60*60 (86400 seconds), which stands for a day, we can store our daily data.
So creating, let's say, a table sensor_per_day with the above format and TTL will actually give us the daily measurements. At the end of the day the old rows expire and the table holds only the newer measurements, while all the data remains stored in the previous table, sensor_per_row.
I hope I gave you the idea.
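A minimal sketch of the "write to both tables" idea follows. It only prepares the two CQL statements and their parameters and does not talk to a cluster. The column lists and the helper names are assumptions, and the ? markers stand for bind parameters of a prepared statement. One day is 24*60*60 = 86400 seconds.

```python
from datetime import datetime, timezone

ONE_DAY_TTL_SECONDS = 24 * 60 * 60  # 86400 seconds


def day_bucket(ts_ms: int) -> str:
    """Return the UTC calendar day (yyyy-mm-dd) used as part of the partition key."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).date().isoformat()


def statements_for(sensor_id: str, region: str, value: float, ts_ms: int):
    """Return (cql, params) pairs for the long-term table and the daily table."""
    long_term = (
        "INSERT INTO sensors.sensor_per_row (sensor_id, ts, value, region) "
        "VALUES (?, ?, ?, ?)",
        (sensor_id, ts_ms, value, region),
    )
    daily = (
        "INSERT INTO sensors.sensor_per_day (day, sensor_id, ts, value) "
        f"VALUES (?, ?, ?, ?) USING TTL {ONE_DAY_TTL_SECONDS}",
        (day_bucket(ts_ms), sensor_id, ts_ms, value),
    )
    return [long_term, daily]


example = statements_for("Sensor 2", "north", 52.7, 1546640464138)
```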

Cassandra - do group by and join in the right way

I know that Cassandra does not support GROUP BY. But how can I achieve a similar result on a big collection of data?
Let's say I have a table with 1 million rows of clicks, another with 1 million rows of shares, and a table user_profile. clicks and shares store one operation per row, with a created_at column. On a dashboard I would like to show results grouped by day, for example:
2016-06-01 - 2016-07-01
+-------------+--------+------+
|user_profile | like |share |
+-------------+--------+------+
| John | 34 | 12 |
| Adam | 12 | 4 |
| Bruce | 4 | 2 |
+-------------+--------+------+
The question is, how can I do this in the right way:
Create table user_likes_shares with counter by date
Create UDF to group by each column and join them in the code by merging arrays by key
Select data from 3 tables group and join them in the code by merging arrays by key
Another option
If you use code to join the results, do you use Apache Spark SQL? Is Spark the right way in this case?
Assuming that your dashboard page will show all historical results, grouped by day:
1. 'Group by' in a table: The denormalised approach is the accepted way of doing things in Cassandra as writes and disk space are cheap. If you can structure your data model (and application writes) to support this, then this is the best approach.
2. 'Group by' in a UDA: In this blog post, the author notes that all rows are pulled back to the coordinator, reconciled and aggregated there (for CL>1). So even if your clicks and shares tables are partitioned by date, Cassandra will still have to pull all rows for that date back to the coordinator, store them in the JVM heap and then process them. So this approach has reduced scalability.
3. Merging in code: This will be a much slower approach as you will have to transfer a lot more data from the coordinator to your application server.
4. Spark: This is a good approach if you have to make ad-hoc queries (e.g. analyzing data, rather than populating a web page) and can be simplified by running your Spark jobs through a notebook application (e.g. Apache Zeppelin). However, in your use case, you have the complexity of having to wait for that job to finish, write the output somewhere and then display it on a web page.
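To make option 1 more concrete, here is a small in-memory sketch of the denormalised idea: every write also bumps a per-day, per-user counter, so the dashboard only reads pre-aggregated rows. A Python Counter stands in for a counter table such as user_likes_shares. The function names and the key layout are assumptions for illustration only.

```python
from collections import Counter
from datetime import date

# Stand-in for a counter table keyed by (day, user, kind).
daily_counts = Counter()


def record(event_day: date, user: str, kind: str) -> None:
    """Called on every click or share write, alongside the raw insert."""
    daily_counts[(event_day, user, kind)] += 1


def dashboard(start: date, end: date) -> dict:
    """Sum the pre-aggregated daily counters for an inclusive date range."""
    totals = {}
    for (day, user, kind), n in daily_counts.items():
        if start <= day <= end:
            totals.setdefault(user, {"click": 0, "share": 0})[kind] += n
    return totals


record(date(2016, 6, 1), "John", "click")
record(date(2016, 6, 1), "John", "click")
record(date(2016, 6, 2), "John", "share")
record(date(2016, 7, 5), "Adam", "click")  # outside the range queried below

report = dashboard(date(2016, 6, 1), date(2016, 7, 1))
```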

Excel pivot to calc averages of sums per each group

In Excel, I have a log of web requests that I need to analyze for bandwidth usage. I have parsed the log into a number of fields that I will group by in different ways for different reports. Each website page load fetches multiple resources, each being a separate line. The data structure:
RequestID | SIZE | IsImage | IsStatic | Language
A | 100 | TRUE | TRUE | EN
A | 110 | TRUE | FALSE | EN
A | 90 | FALSE | FALSE | EN
...
Report 1: I need the AVERAGE request size: AVERAGE( SELECT SUM(SIZE) GROUPBY RequestID ). I do not need to see the size of each individual request.
Report 2: More elaborate pivot table reports showing average request size broken down by isStatic / isImage / language / etc. This way I can check "average total images per request per language".
Is there a way to define a field/item "SUM(SIZE) GROUPBY RequestID" ?
As far as I know, this is not possible to achieve in a single pivot table, because you need to apply two separate aggregations to the same set of numbers based on a condition (RequestId).
It is possible to get what you are looking for using two pivot tables. I would not recommend it, but this is how you would do it.
Create the first pivot table on your base table, and add RequestId to the rows and Size to the values. This gives you an intermediate table with the sum of size per RequestId. Then build a second pivot table, this time using the first pivot table as the source. In this one you only add the 'sum of size' value and take its average.
Again, I would not recommend this approach for anything but the simplest analysis.
A better way to do this is to use PowerPivot, a separate yet related technology to the pivot tables you have used. You will need to import the table (I have assumed the name [Logs], with columns [RequestId] and [Size]) and then add a calculation:
AvarageSizeOfRequests:=AVERAGEX(SUMMARIZE(Logs;Logs[RequestId];"sumOfSize";CALCULATE(sum(Logs[Size])));[SumOfSize])
This gives you two measures. The first is the straight sum, which you already have; the second is the average, which is the same as the sum for each individual RequestID but aggregates differently at the total level.
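The same two-stage aggregation (sum per request, then average of those sums) can be checked outside Excel with a few lines of Python. This is only a sketch for verifying the numbers, not part of the PowerPivot solution. The function name is hypothetical, and request B is made-up sample data added alongside the rows from the question.

```python
from collections import defaultdict


def average_request_size(rows):
    """Average of per-request totals: AVERAGE(SUM(SIZE) GROUP BY RequestID)."""
    totals = defaultdict(int)
    for request_id, size in rows:
        totals[request_id] += size
    if not totals:
        raise ValueError("no rows to aggregate")
    return sum(totals.values()) / len(totals)


log_rows = [
    ("A", 100),
    ("A", 110),
    ("A", 90),
    ("B", 50),   # hypothetical second request
    ("B", 150),
]
avg_size = average_request_size(log_rows)
```

Here request A totals 300 and request B totals 200, so the average request size is 250, whereas a plain average over all rows would give 100.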
I guess I am not understanding your question, because I expect the group by for RequestID to be automatic (unavoidable in a PT with that as a Row label). Perhaps pick holes in the following and I might understand what I have misunderstood:
I have added i and s to your data just so it is clearer which column is which. It is possible it would be better to convert TRUE and FALSE into 1 and 0 so the PT might count or average these as well.
This seems vaguely along the right lines, so let's try a different PT layout. If RequestID is of little or no relevance for the required analysis, don't include it in the PT or, as here, park it as a Report Filter:
in which case however many millions of rows of data of the kind in the OP there are, the PT will always in effect be a 2x2 matrix at most (assuming Language is suited to Report Filter also). There is only one value per record (SIZE) and only two, boolean, variables. Language could make a difference but worst case is one such PT per Language (and bearing in mind only one such is shown in the example!...)
