How to get row with largest value? - google-cloud-spanner

What I thought would work is:
SELECT *
FROM customer_sale
WHERE sale_date < '2019-02-01'
GROUP BY customer_id
HAVING sale_date = MAX(sale_date)
But running this results in an error:
HAVING clause expression references column sale_date which is
neither grouped nor aggregated
Is there another way to achieve this in Spanner? And more generally, why isn't the above allowed?
Edit
Example of data in customer_sale table:
customer_id   sale_date
-----------   ---------
1             Jan 15
1             Jan 30
1             Feb 2
1             Feb 4
2             Jan 15
2             Feb 2
And the expected result:
customer_id   sale_date
-----------   ---------
1             Jan 30
2             Jan 15

A HAVING clause in SQL specifies that a SELECT statement should
only return rows where aggregate values meet the specified conditions.
It was added to the SQL language because the WHERE keyword cannot
be used with aggregate functions.
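For example, a HAVING condition that compares an aggregate is accepted, whereas the original query's condition on the bare sale_date column is not. A minimal illustration against the customer_sale table from the question (the COUNT(*) filter is just an example, not part of the original post):

SELECT customer_id, COUNT(*) AS num_sales
FROM customer_sale
GROUP BY customer_id
HAVING COUNT(*) > 1;  -- filters on an aggregate, so this is allowed

In the original query, sale_date is neither a GROUP BY column nor wrapped in an aggregate, which is exactly what the error message reports.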
This is the test table I am using:
index  customer_id  sale_date
1      1            2017-08-25T07:00:00Z
2      1            2017-08-26T07:00:00Z
3      1            2017-08-27T07:00:00Z
4      1            2017-08-28T07:00:00Z
5      2            2017-08-29T07:00:00Z
6      2            2017-08-30T07:00:00Z
With this query:
Select customer_id, max(sale_date) as max_date
from my_test_table
group by customer_id;
I get this result:
customer_id  max_date
1            2017-08-28T07:00:00Z
2            2017-08-30T07:00:00Z
Also including a WHERE clause:
Select customer_id, max(sale_date) as max_date
from my_test_table
where sale_date < '2017-08-28'
group by customer_id;

I had the same problem and was able to solve it this way. If the table is quite big, it might take some time.
Basically, joining the normal table with a derived table that holds only the maximum sale_date per customer solves it.
SELECT c.*
FROM (SELECT * FROM customer_sale WHERE sale_date < '2019-02-01') c
INNER JOIN (
    SELECT customer_id, MAX(sale_date) AS max_sale_date
    FROM customer_sale
    WHERE sale_date < '2019-02-01'
    GROUP BY customer_id
) max_c
ON c.customer_id = max_c.customer_id AND c.sale_date = max_c.max_sale_date
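An alternative that avoids the self-join is an analytic (window) function; Cloud Spanner's GoogleSQL dialect supports ROW_NUMBER. This is a sketch under that assumption, not tested against your schema:

SELECT customer_id, sale_date
FROM (
    SELECT customer_id, sale_date,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY sale_date DESC) AS rn
    FROM customer_sale
    WHERE sale_date < '2019-02-01'
)
WHERE rn = 1

It scans the filtered rows once and returns exactly one row per customer, even when two sales share the same maximum date (ties are broken arbitrarily unless you add more ORDER BY columns).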

Related

Way to add same keys to delta table merge

I have a delta table. This delta table contains duplicate keys. For example:
id age
1 22
1 23
1 25
2 22
2 11
When merging a new table that looks like this into the delta table:
id age
1 23
1 24
1 23
2 21
2 12
Using this function:
def upsertToDelta(microBatchOutputDF):
    (student_table.alias("t").merge(
        microBatchOutputDF.alias("s"),
        "s.id = t.id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())
It throws an error:
Cannot perform Merge as multiple source rows matched and attempted to modify the same
I understand why this is happening, but what I'd like to know is how I can remove the old keys and insert the new keys even though the ids are the same. So the resulting table should look like this:
id age
1 23
1 24
1 23
2 21
2 12
Is there a way to do this?
This looks like an SCD Type 1 change, where we overwrite the old data with the new. To handle this, you must have at least one unique column to act as the merge key. A simple row_number can also be sufficient in your case, like this:
Before Merge:
Add a row_number, partitioned by the id column, to the new data. This is handled inside the merge statement below.
Merge SQL:
MERGE INTO student_table AS target
USING (
    SELECT id AS merge_key, id, age
    FROM microBatchOutputDF
    WHERE id IN (SELECT DISTINCT id FROM student_table)
    UNION ALL
    SELECT NULL AS merge_key, id, age
    FROM microBatchOutputDF
    WHERE id IN (SELECT DISTINCT id FROM student_table)
) AS source
ON target.id = source.id
AND target.id = source.merge_key
WHEN MATCHED THEN
    DELETE
WHEN NOT MATCHED AND source.merge_key IS NULL THEN
    INSERT (target.id, target.row_num, target.age)
    VALUES (source.id, 1, source.age)
;
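If the goal is simply "delete every old row for the incoming ids, then insert all the new rows," a two-statement variant inside the same foreachBatch call can also work. This is only a sketch: it assumes the micro-batch DataFrame has been registered as a temporary view named updates (e.g. with createOrReplaceTempView), that student_table has exactly the columns id and age, and that your Delta/Spark version supports subqueries in DELETE conditions:

-- `updates` is an assumed temp view over microBatchOutputDF
DELETE FROM student_table
WHERE id IN (SELECT DISTINCT id FROM updates);

INSERT INTO student_table
SELECT id, age FROM updates;

Because the delete and the insert commit as separate transactions, the MERGE-based approach above is preferable when readers must never observe the table between the two steps.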

Cassandra where clause as a tuple

Table12
CustomerId  CampaignID
1           1
1           2
2           3
1           3
4           2
4           4
5           5
val CustomerToCampaign = ((1,1),(1,2),(2,3),(1,3),(4,2),(4,4),(5,5))
Is it possible to write a query like
select CustomerId, CampaignID from Table12 where (CustomerId, CampaignID) in (CustomerToCampaign_1, CustomerToCampaign_2)
???
So the input is a tuple, but the columns are not a tuple but rather individual columns.
Sure, it's possible. But only on the clustering keys. That means I need to use something else as a partition key or "bucket." For this example, I'll assume that marketing campaigns are time sensitive and that we'll get a good distribution and ease of querying by using "month" as the bucket (partition).
CREATE TABLE stackoverflow.customertocampaign (
    campaign_month int,
    customer_id int,
    campaign_id int,
    customer_name text,
    PRIMARY KEY (campaign_month, customer_id, campaign_id)
);
Now, I can INSERT the data described in your CustomerToCampaign variable. Then, this query works:
aploetz#cqlsh:stackoverflow> SELECT campaign_month, customer_id, campaign_id
FROM customertocampaign WHERE campaign_month=202004
AND (customer_id,campaign_id) = (1,2);
campaign_month | customer_id | campaign_id
----------------+-------------+-------------
202004 | 1 | 2
(1 rows)
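To match several (customer_id, campaign_id) pairs at once, which is what the question's CustomerToCampaign tuples suggest, the same multi-column syntax also works with IN on the clustering columns. A sketch (the tuple list here is illustrative, not from the original answer):

SELECT campaign_month, customer_id, campaign_id
FROM customertocampaign
WHERE campaign_month = 202004
AND (customer_id, campaign_id) IN ((1, 1), (1, 2));

Each tuple must list the clustering columns in their declared order, and the partition key (campaign_month) still has to be restricted separately.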

Cassandra query max of a particular column for a particular ID

I am trying to write a Cassandra query, and my use case is as follows.
Let's say the table is
ID | Version
1 | 1
1 | 2
2 | 1
2 | 2
2 | 3
Now what I want is to get the latest version for all the IDs.
So the query should give me 2 rows: the first with ID 1, Version 2, and the second with ID 2, Version 3.
I tried a query like Select * from table where ID=1 and Version=MAX(Version) but it's not valid syntax.
Can anybody help in this?
SELECT * FROM table WHERE ID = 1 LIMIT 1 would give you the highest version if your clustering key is Version, ordered descending:
CREATE TABLE table (
    id int,
    version int,
    PRIMARY KEY (id, version)
) WITH CLUSTERING ORDER BY (version DESC);
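To get the latest version for every ID in one query (rather than one ID at a time), Cassandra 3.6 and later support PER PARTITION LIMIT, which returns only the first N clustering rows of each partition. A sketch under that version assumption, using a hypothetical table name my_versions for the table defined above:

SELECT id, version
FROM my_versions
PER PARTITION LIMIT 1;  -- with version clustered DESC, the first row per partition is the highest version

Without a partition-key restriction this still scans every partition, so treat it as a convenience for small tables rather than a hot-path query.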

Get distinct value from multiple rows based on key from other column

I have two tables with a relation to the attribute Sys_ID in Excel PowerPivot.
Sys_Value in Table2 is a Lookup from Table1 ( =Related(Table1[Sys_Value]) )
Table1
Sys_ID  Sys_Value
Sys-1   10
Sys-2   20
Table2
ID  Org_ID  Sys_ID_FK  Sys_ValueLookUp
1   Org-1   Sys-1      10
2   Org-2   Sys-1      10
3   Org-3   Sys-1      10
4   Org-2   Sys-2      20
5   Org-3   Sys-2      20
In a PowerPivot chart, I need Sys_ID_FK and Sys_Value_LookUp, and to be able to filter on Org_ID.
I am getting the following result in the pivot chart/table:
Filter: Not set (all)
Result:
Sys-1 30
Sys-2 40
This is wrong and the correct result should be:
Filter: Not set (all)
Result:
Sys-1 10
Sys-2 20
or second example
Filter: Org-1
Result:
Sys-1 10
How can I get a result that is counting only one value per "Sys"?
Or is there a way to apply the Org-filter from table2 to table1?
The pivot table is summing the Sys_Value_Lookup for all selected rows. If you don't want that, then you can switch the aggregation to Max instead of Sum under the Value Field Settings.

Query optimization in Cassandra

I have a Cassandra database that I need to query.
My table looks like this:
Cycle  Parameters  Value
1      a           999
1      b           999
1      c           999
2      a           999
2      b           999
2      c           999
3      a           999
3      b           999
3      c           999
4      a           999
4      b           999
4      c           999
I need to get the values for parameters "a" and "b" for two cycles, no matter which cycles they are.
Example results:
Cycle  Parameters  Value
1      a           999
1      b           999
2      a           999
2      b           999
or
Cycle  Parameters  Value
1      a           999
1      b           999
3      a           999
3      b           999
Since the database is quite huge, every query optimization is welcome.
My requirements are:
I want to do everything in one query
An answer with no nested query would be a plus
So far, I was able to accomplish these requirements with something like this:
select * from table where Parameters in ('a','b') sort by cycle, parameters limit 4
However, this query needs a "sort by" operation that causes huge processing in the database...
Any clues on how to do it? ....limit by partition maybe?
EDIT:
The table schema is:
CREATE TABLE cycle_data (
    cycle int,
    parameters text,
    value double,
    PRIMARY KEY (parameters, cycle)
)
"parameters" is the partition key and "cycle" is the clustering column
You can't query like this without ALLOW FILTERING. Don't use ALLOW FILTERING in production; only use it for development!
Read the DataStax doc about using ALLOW FILTERING: https://docs.datastax.com/en/cql/3.3/cql/cql_reference/select_r.html?hl=allow,filter
I assume your current schema is:
CREATE TABLE data (
    cycle int,
    parameters text,
    value double,
    PRIMARY KEY (cycle, parameters)
)
And you need another table, or need to change your table schema, to query like this:
CREATE TABLE cycle_data (
    cycle int,
    parameters text,
    value double,
    PRIMARY KEY (parameters, cycle)
)
Now you can query:
SELECT * FROM cycle_data WHERE parameters in ('a','b');
The results will automatically be sorted in ascending order by cycle for every parameter.
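To also cap the result at two cycles per parameter without any client-side sorting, Cassandra 3.6 and later support PER PARTITION LIMIT; a sketch against the cycle_data table above, assuming that version:

-- parameters is the partition key, so each parameter is its own partition;
-- rows within a partition are already ordered by the cycle clustering column
SELECT * FROM cycle_data
WHERE parameters IN ('a', 'b')
PER PARTITION LIMIT 2;

This returns at most two cycles for 'a' and two for 'b', matching the example output in the question, though the two cycles are not guaranteed to be the same for both parameters if their data differs.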
