partition by multiple columns in Spark SQL not working properly - apache-spark

I want to partition by three columns in my query:
user id
cancelation month year
retention month year
I used row_number and partition by as follows:
row_number() over (
    partition by
        user_id,
        cast(date_format(cancelation_date, 'yyyyMM') as integer),
        cast(date_format(retention_date, 'yyyyMM') as integer)
    order by
        cast(date_format(cancelation_date, 'yyyyMM') as integer) asc,
        cast(date_format(retention_date, 'yyyyMM') as integer) asc
) as row_count
Example of the output I got:
| user_id | cancelation_date | cancelation_month_year | retention_date | retention_month_year | row_count |
| ------- | ---------------- | ---------------------- | -------------- | -------------------- | --------- |
| 566     | 28-5-2020        | 202005                 | 20-7-2020      | 202007               | 1         |
| 566     | 28-5-2020        | 202005                 | 30-7-2020      | 202007               | 2         |
Example of the output I want to get:
| user_id | cancelation_date | cancelation_month_year | retention_date | retention_month_year | row_count |
| ------- | ---------------- | ---------------------- | -------------- | -------------------- | --------- |
| 566     | 28-5-2020        | 202005                 | 20-7-2020      | 202007               | 1         |
| 566     | 28-5-2020        | 202005                 | 30-7-2020      | 202007               | 1         |
Note that a user may have more than one cancelation month; for example, if he has also canceled in August, I want row_count = 2 for all dates in August, and so on.
It's not obvious why partition by is partitioning by the retention date instead of by the retention month-year.

I get the impression that row_number is not what you want; rather, you are interested in dense_rank, which would give you your expected output.
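For example, one way to apply it is to keep your month-year expressions but partition only by user_id and rank by the two month-year values (a sketch, not tested against your data):

dense_rank() over (
    partition by user_id
    order by
        cast(date_format(cancelation_date, 'yyyyMM') as integer) asc,
        cast(date_format(retention_date, 'yyyyMM') as integer) asc
) as row_count

With this window, both July retention rows for cancelation month 202005 get 1, and rows for a later cancelation month (August, say) would get 2.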

Related

SQLite Database selecting MAX(column) on two columns, while also selecting a Distinct value from a column

I'm attempting to SELECT information from an SQLite database I created (something similar to this):
Date     | Time  | Acc | TotalAcc | Proc | TotalProc
21-12-01 | 00:00 | 133 | 133      | 76   | 76
21-12-01 | 01:00 | 270 | 403      | 260  | 336
21-12-01 | 02:00 | 35  | 438      | 24   | 360
21-12-01 | 02:00 | 50  | 453      | 30   | 366
21-12-02 | 00:00 | 113 | 113      | 89   | 89
21-12-02 | 07:00 | 2   | 1290     | 6    | 1199
21-12-02 | 07:00 | 28  | 1316     | 17   | 1210
21-12-02 | 07:00 | 432 | 1720     | 384  | 1577
21-12-02 | 07:00 | 502 | 2222     | 403  | 1975
The information I'm looking to gather: a unique Date (only one from each day), with the max Time (in this case it would be 02:00 for 21-12-01, and 07:00 for 21-12-02).
The final metric I want for sorting (this is where I'm having trouble): I also want to select the row that contains the highest TotalAcc.
Currently, this is the SQL logic I'm using to pull data:
example = """
    SELECT DISTINCT Date, TotalAcc, TotalProc, MAX(Time)
    FROM table_name
    GROUP BY Date
    ORDER BY Date DESC, MAX(Time) DESC
"""
df = pd.read_sql_query(example, con)
print(df)
The data I'm looking to take from the database should look more like this:
Date | TotalAcc | TotalProc | MAX(Time)
0 | 21-12-02 | 2222 | 1975 | 07:00
1 | 21-12-01 | 453 | 366 | 02:00
I've tried using MAX(TotalAcc) instead of TotalAcc when selecting the data, but it returns me a number that's different from the actual max value in the column for that given time and date.
Setting example = 'SELECT MAX(TotalAcc) FROM table_name' returns a non-max value (1290, for example).
I apologize for not giving a fully reproducible example; I pull my data points from a source, which populates the table I create like this:
with con:
    con.execute('''
        CREATE TABLE table_name (
            Date TEXT,
            Time TEXT,
            Acc TEXT,
            TotalAcc TEXT,
            Proc TEXT,
            TotalProc TEXT
        );''')
Any and all ideas are appreciated; SQL logic seems a bit confusing at times.
I suspect that you want a query along these lines:
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Date
                                 ORDER BY Time DESC, TotalAcc DESC) rn
    FROM table_name
)
SELECT Date, Time, Acc, TotalAcc, Proc, TotalProc
FROM cte
WHERE rn = 1;
This will return one row per date, having the max time. Should two or more rows on the same date also have the same max time, then the row having the highest TotalAcc will be selected.
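If it helps, here is roughly how that query could drop into your existing pandas call (a sketch reusing the con connection from your snippet; window functions need SQLite 3.25+):

# The answer's query plugged into the existing pandas workflow (sketch).
# ROW_NUMBER() requires SQLite 3.25 or newer.
example = """
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Date
                                 ORDER BY Time DESC, TotalAcc DESC) rn
    FROM table_name
)
SELECT Date, Time, Acc, TotalAcc, Proc, TotalProc
FROM cte
WHERE rn = 1
ORDER BY Date DESC;
"""
df = pd.read_sql_query(example, con)
print(df)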

Cassandra: How to model data so that percentage change in time ranges can be calculated?

I have a huge amount of data, which I plan to store in Cassandra. I am new to Cassandra and am trying to find a data model that will work for me.
My data is various parameters for commodities gathered over irregular time intervals:
commodity_id | timestamp    | param1 | param2
c1           | '2018-01-01' | 5      | 15
c1           | '2018-01-03' | 7      | 15
c1           | '2018-01-08' | 8      | 10
c2           | '2018-01-01' | 100    | 13
c2           | '2018-01-02' | 140    | 13
c2           | '2018-01-05' | 130    | 13
c2           | '2018-01-06' | 150    | 13
I need to query the database, and get commodity IDs by "percentage change" in the params.
Ex. Find out all commodities whose param2 increased by more than 50% between '2018-01-02' and '2018-01-06'
CREATE TABLE "commodity" (
    commodity_id text,
    timestamp date,
    param1 int,
    param2 int,
    PRIMARY KEY (commodity_id, timestamp)
)
You should be fine with this table. You can expect daysPerYear entries per commodity partition, which is reasonably small, so you don't need any artificial keys. Even if you have a large number of commodities, you won't run out of partitions, as the murmur3 partitioner actually has a range of -2^63 to +2^63-1. That is 18,446,744,073,709,551,616 possible values.
I would pull the data from Cassandra and calculate the values in the app layer.
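For instance, the read for your example question could look roughly like this (a sketch against the table above; you would issue it per commodity_id, or with an IN list for a small set, and compute the percentage change in application code):

SELECT timestamp, param2
FROM commodity
WHERE commodity_id = 'c1'
  AND timestamp >= '2018-01-02'
  AND timestamp <= '2018-01-06';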

Understanding Cassandra static field

I'm learning Cassandra through its documentation. Right now I'm learning about batches and static fields.
In their example at the end of the page, they somehow managed to make balance have two different values (-200, -208) even though it's a static field.
Could someone explain to me how this is possible? I've read the whole page but I did not catch on.
In Cassandra, a static field is static within a partition.
Example: let's define a table:
CREATE TABLE static_test (
    pk int,
    ck int,
    d int,
    s int static,
    PRIMARY KEY (pk, ck)
);
Here pk is the partition key and ck is the clustering key.
Let's insert some data :
INSERT INTO static_test (pk , ck , d , s ) VALUES ( 1, 10, 100, 1000);
INSERT INTO static_test (pk , ck , d , s ) VALUES ( 2, 20, 200, 2000);
If we select the data
 pk | ck | s    | d
----+----+------+-----
  1 | 10 | 1000 | 100
  2 | 20 | 2000 | 200
Here, for partition key pk = 1 the static field s has the value 1000, and for partition key pk = 2 the static field s has the value 2000.
If we insert/update the static field s value for partition key pk = 1:
INSERT INTO static_test (pk , ck , d , s ) VALUES ( 1, 11, 101, 1001);
Then the static field s value will change for all rows of partition key pk = 1:
 pk | ck | s    | d
----+----+------+-----
  1 | 10 | 1001 | 100
  1 | 11 | 1001 | 101
  2 | 20 | 2000 | 200
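The same effect can also be achieved with an UPDATE that names only the partition key, since the static column belongs to the partition rather than to any single row (illustrative, against the static_test table above):

UPDATE static_test SET s = 1002 WHERE pk = 1;

 pk | ck | s    | d
----+----+------+-----
  1 | 10 | 1002 | 100
  1 | 11 | 1002 | 101
  2 | 20 | 2000 | 200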
In a table that uses clustering columns, non-clustering columns can be declared static in the table definition. Static columns are only static within a given partition.
Example:
CREATE TABLE test (
    partition_column text,
    static_column text STATIC,
    clustering_column int,
    PRIMARY KEY (partition_column, clustering_column)
);
INSERT INTO test (partition_column, static_column, clustering_column) VALUES ('key1', 'A', 0);
INSERT INTO test (partition_column, clustering_column) VALUES ('key1', 1);
SELECT * FROM test;
Results:
partition_column | clustering_column | static_column
-----------------+-------------------+--------------
key1             | 0                 | A
key1             | 1                 | A
Observation:
Once declared static, the column inherits the value from the given partition key (note that the second row shows 'A' even though its INSERT did not set static_column).
Now, let's insert another record:
INSERT INTO test (partition_column, static_column, clustering_column) VALUES ('key1', 'C', 2);
SELECT * FROM test;
Results:
partition_column | clustering_column | static_column
-----------------+-------------------+--------------
key1             | 0                 | C
key1             | 1                 | C
key1             | 2                 | C
Observation:
If you update the static column, or insert another record with an updated static column value, the new value is reflected across all the rows ==> static column values are static (constant) across a given partition.
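A related consequence: because the static value belongs to the partition, CQL allows reading it once per partition with SELECT DISTINCT restricted to partition key and static columns (a sketch against the test table above):

SELECT DISTINCT partition_column, static_column FROM test;

partition_column | static_column
-----------------+--------------
key1             | C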
Restriction (from the DataStax reference documentation below):
A table that does not define any clustering columns cannot have a static column. The table having no clustering columns has a one-row partition in which every column is inherently static.
A table defined with the COMPACT STORAGE directive cannot have a static column.
A column designated to be the partition key cannot be static.
Reference : DataStax Reference
In the example on the page you've linked they don't have different values at the same point in time.
They first have the static balance field set to -208 for the whole user1 partition:
user | expense_id | balance | amount | description | paid
-------+------------+---------+--------+-------------+-------
user1 | 1 | -208 | 8 | burrito | False
user1 | 2 | -208 | 200 | hotel room | False
Then they apply a batch update statement that sets the balance value to -200:
BEGIN BATCH
    UPDATE purchases SET balance=-200 WHERE user='user1' IF balance=-208;
    UPDATE purchases SET paid=true WHERE user='user1' AND expense_id=1 IF paid=false;
APPLY BATCH;
This updates the balance field for the whole user1 partition to -200:
user | expense_id | balance | amount | description | paid
-------+------------+---------+--------+-------------+-------
user1 | 1 | -200 | 8 | burrito | True
user1 | 2 | -200 | 200 | hotel room | False
The point of a static field is that you can update/change its value for the whole partition at once. So if I were to execute the following statement:
UPDATE purchases SET balance=42 WHERE user='user1'
I would get the following result:
user | expense_id | balance | amount | description | paid
-------+------------+---------+--------+-------------+-------
user1 | 1 | 42 | 8 | burrito | True
user1 | 2 | 42 | 200 | hotel room | False

retrieving data from cassandra database

I'm working on smart parking data stored in a Cassandra database, and I'm trying to get the last status of each device.
I'm working with a self-made dataset.
Here's the description of the table:
table description
select * from parking.meters
Need help please!
"trying to get the last status of each device"
In Cassandra, you need to design your tables according to your query patterns. Building a table, filling it with data, and then trying to fulfill a query requirement is a very backward approach. The point, is that if you really need to satisfy that query, then your table should have been designed to serve that query from the beginning.
That being said, there may still be a way to make this work. You haven't mentioned which version of Cassandra you are using, but if you are on 3.6+, you can use the PER PARTITION LIMIT clause on your SELECT.
If I build your table structure and INSERT some of your rows:
aploetz@cqlsh:stackoverflow> SELECT * FROM meters;
parking_id | device_id | date | status
------------+-----------+----------------------+--------
1 | 20 | 2017-01-12T12:14:58Z | False
1 | 20 | 2017-01-10T09:11:51Z | True
1 | 20 | 2017-01-01T13:51:50Z | False
1 | 7 | 2017-01-13T01:20:02Z | False
1 | 7 | 2016-12-02T16:50:04Z | True
1 | 7 | 2016-11-24T23:38:31Z | False
1 | 19 | 2016-12-14T11:36:26Z | True
1 | 19 | 2016-11-22T15:15:23Z | False
(8 rows)
And I consider your PRIMARY KEY and CLUSTERING ORDER definitions:
PRIMARY KEY ((parking_id, device_id), date, status)
) WITH CLUSTERING ORDER BY (date DESC, status ASC);
You are at least clustering by date (which should be an actual date type, not text), so that will order your rows in a way that helps you here:
aploetz@cqlsh:stackoverflow> SELECT * FROM meters PER PARTITION LIMIT 1;
parking_id | device_id | date | status
------------+-----------+----------------------+--------
1 | 20 | 2017-01-12T12:14:58Z | False
1 | 7 | 2017-01-13T01:20:02Z | False
1 | 19 | 2016-12-14T11:36:26Z | True
(3 rows)
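For reference, a complete table definition consistent with the key fragments quoted above might look like this (the column types are my assumptions from the sample output, not from the original post):

CREATE TABLE meters (
    parking_id int,
    device_id int,
    date timestamp,
    status boolean,
    PRIMARY KEY ((parking_id, device_id), date, status)
) WITH CLUSTERING ORDER BY (date DESC, status ASC);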

Sum of Max in PivotTable for Excel

I have a PivotTable that comes from the following table:
+---------+---+-----+
| A | B | C |
+-+---------+---+-----+
|1| Date |Id |Value|
+-+---------+---+-----+
|2|4/01/2013|1 |4 |
+-+---------+---+-----+
|3|4/01/2013|2 |5 |
+-+---------+---+-----+
|4|4/01/2013|1 |20 |
+-+---------+---+-----+
|5|4/02/2013|2 |20 |
+-+---------+---+-----+
|6|4/02/2013|1 |15 |
+-+---------+---+-----+
And I want to aggregate first by Id and then by Date, using Max to aggregate by Id and then Sum to aggregate by Date. The resulting table would look like this:
+---------+----------------+
| A | B |
+-+---------+----------------+
|1| Date |Sum(Max(Id,Date))|
+-+---------+----------------+
|2|4/01/2013|25 |
+-+---------+----------------+
|3|4/02/2013|35 |
+-+---------+----------------+
The 25 above comes from getting the Max per Id per Date (Max(1, 4/01/2013) -> 20 and Max(2, 4/01/2013) -> 5), so the Sum of those Maxes is 25.
I can do the two levels of aggregation easily by adding the Date and Id columns into the Rows section of the PivotTable, but when choosing an aggregation function for Value, I can either choose Max, getting a Max of Max, or Sum, getting a Sum of Sum. That is, I cannot get a Sum of Max.
Do you know how to achieve this? Ideally, the solution would not be to compute a PivotTable and then copy from there or get a formula, because that would break easily if I want to dynamically change fields.
Thanks!
This is how I would do it in SQL:
SELECT DATE, SUM(MAXED_VAL) AS SummedMaxedVal
FROM (
    SELECT DATE, ID, MAX(VALUE) AS MAXED_VAL
    FROM table
    GROUP BY DATE, ID
) t
GROUP BY DATE
