Way to add same keys to delta table merge - databricks

I have a delta table. This delta table contains duplicate keys. For example:
id age
1 22
1 23
1 25
2 22
2 11
When merging a new table to the delta table that looks like this:
id age
1 23
1 24
1 23
2 21
2 12
Using this function:
def upsertToDelta(microBatchOutputDF):
    (student_table.alias("t")
        .merge(microBatchOutputDF.alias("s"), "s.id = t.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
It throws an error:
Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table in possibly conflicting ways.
I understand why this is happening, but what I'd like to know is how I can remove the old keys and insert the new keys even though the ids are the same. So the resulting table should look like this:
id age
1 23
1 24
1 23
2 21
2 12
Is there a way to do this?

This looks like an SCD Type 1 change, where the old data is overwritten with the new data. To handle it you need at least one unambiguous key to merge on; a row_number() partitioned by the id column could serve as such a surrogate key. A simpler option in your case is to derive a merge_key in the source: every incoming row whose id already exists in the target is emitted with merge_key = id (it matches the old rows, which are deleted), and every incoming row is also emitted once more with merge_key = NULL (it never matches, so it is inserted).
Merge SQL:
MERGE INTO student_table AS target
USING (
  -- Incoming rows whose id already exists in the target: merge_key = id,
  -- so they match below and the old target rows are deleted
  SELECT id AS merge_key, id, age
  FROM microBatchOutputDF
  WHERE id IN (
    SELECT DISTINCT id
    FROM student_table
  )
  UNION ALL
  -- Every incoming row again, with a NULL merge_key, so it never matches
  -- the ON condition and is inserted
  SELECT NULL AS merge_key, id, age
  FROM microBatchOutputDF
) AS source
ON target.id = source.id
AND target.id = source.merge_key
WHEN MATCHED
THEN
  DELETE
WHEN NOT MATCHED AND source.merge_key IS NULL
THEN
  INSERT (id, age)
  VALUES (source.id, source.age)
;
The result: the table now contains only the five incoming rows, matching the expected output in the question.
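For completeness, here is a minimal sketch (not part of the original answer) of how this delete-and-insert merge could be wired into the streaming upsert function via foreachBatch, assuming the micro-batch is registered as a temp view named microBatchOutputDF; the way the SparkSession is obtained inside foreachBatch varies by Spark/DBR version:

def upsertToDelta(microBatchOutputDF, batchId):
    # Expose the micro-batch so the MERGE SQL above can refer to it by name.
    microBatchOutputDF.createOrReplaceTempView("microBatchOutputDF")
    # Use the session attached to the micro-batch DataFrame (on older versions
    # this may be microBatchOutputDF._jdf.sparkSession() instead).
    microBatchOutputDF.sparkSession.sql("""
        MERGE INTO student_table AS target
        USING (
          SELECT id AS merge_key, id, age
          FROM microBatchOutputDF
          WHERE id IN (SELECT DISTINCT id FROM student_table)
          UNION ALL
          SELECT NULL AS merge_key, id, age
          FROM microBatchOutputDF
        ) AS source
        ON target.id = source.id AND target.id = source.merge_key
        WHEN MATCHED THEN DELETE
        WHEN NOT MATCHED AND source.merge_key IS NULL THEN
          INSERT (id, age) VALUES (source.id, source.age)
    """)

The function would then be passed to .foreachBatch(upsertToDelta) on the write stream.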

Related

Looping a string list and getting a no-record count from a table

I have string values obtained from a table using listagg(column, ',').
I want to loop over this string list and use each value in the where clause of a query against another table,
then get a count of how many of them return no records in that table (number of times with no record).
I'm writing this inside a PL/SQL procedure.
order_id  name
10        test1
20        test2
22        test3
25        test4

col_id  product  order_id
1       pro1     10
2       pro2     30
3       pro2     38
Expected result: count = 3 (the number of times there is no record in the 2nd table), because there are no records for order_ids 20, 22 and 25 in the 2nd table; only order_id 10 has a record.
My queries:
SELECT listagg(ord.order_id, ',')
  INTO wk_orderids
  FROM orders ord
 WHERE ord.id_no = wk_id_no;
loop
  -- do my stuff
end loop;
wk_orderids value = '10,20,22,25'
I want to loop over wk_orderids, setting the values one by one into the where clause of a select query, and then get the count (number of times with no record).
If you want to count ORDER_IDs in the 1st table that don't exist in the ORDER_ID column of the 2nd table, then your current approach looks as if you were given a task to do that in the most complicated way: aggregating values, looping through them, injecting values into a where clause (which then requires dynamic SQL)... OK, but why? Why not simply
select count(*)
from (select order_id from first_table
minus
select order_id from second_table
);

Update the dataframe if the record exists and extract unmatched records in pyspark

I have two dataframes. The 1st dataframe has the full records and the 2nd dataframe has incremental data. I need logic so that new records are inserted into the parent dataframe and existing records are replaced by the incremental records.
Example:
parent_df:
id src_nm
10 resource_mg
15 accessible
17 nominated
18 emerging
19 deploying
Increment_df:
id src_nm
18 accessible
19 production
23 migration
25 running
Below is the result I want:
parent_df should be updated/inserted from increment_df, and unmatched records should be those records which are present in parent_df but not in increment_df. Thanks in advance!!
unmatched_df:
id src_nm
10 resource_mg
15 accessible
17 nominated
parent_df:
id src_nm
10 resource_mg
15 accessible
17 nominated
18 accessible
19 production
23 migration
25 running
When you want data that exists on one side but not on the other, matched on a certain lookup column, you can use a left_anti join.
unmatched_df = parent_df.join(increment_df, on='id', how='left_anti')
For parent_df, you need one more step than just joining. You want all data from both sides, with the overlap updated; in this case, you first join with outer to get all records from both, then use coalesce to prefer the incremental value.
from pyspark.sql import functions as F

parent_df = (parent_df.join(increment_df, on='id', how='outer')
             .select('id', F.coalesce(increment_df.src_nm, parent_df.src_nm).alias('src_nm')))
Ref:
coalesce(col1, col2, col3)
# If col1 is not null, take value from col1.
# If col1 is null and col2 is not null, take value from col2.
# If col1 and col2 are null, take value from col3.
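Putting both steps together, here is a minimal self-contained sketch using the sample data from the question (it assumes a running SparkSession; the show() calls are only for inspection):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

parent_df = spark.createDataFrame(
    [(10, 'resource_mg'), (15, 'accessible'), (17, 'nominated'),
     (18, 'emerging'), (19, 'deploying')],
    ['id', 'src_nm'])
increment_df = spark.createDataFrame(
    [(18, 'accessible'), (19, 'production'), (23, 'migration'), (25, 'running')],
    ['id', 'src_nm'])

# Records present in parent_df but not in increment_df
unmatched_df = parent_df.join(increment_df, on='id', how='left_anti')

# All ids from both sides; where both sides have a value, prefer the incremental one
updated_parent_df = (parent_df.join(increment_df, on='id', how='outer')
                     .select('id', F.coalesce(increment_df.src_nm,
                                              parent_df.src_nm).alias('src_nm')))

unmatched_df.orderBy('id').show()        # ids 10, 15, 17
updated_parent_df.orderBy('id').show()   # ids 10, 15, 17, 18, 19, 23, 25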

I have an employees table and a tasks table (postgresql). I need to assign many tasks to one employee

I am trying to make a join table with two columns, employee id and task id. I select 5 tasks (ids) and one employee on the client side, and send an array of 5 task ids and an employee identifier to the server. But these 5 elements are not being distributed across different rows.
I want such a result:
Employee_id Task_id
1 4
1 3
1 2
1 1
1 5
Node.js with Express.
Pass the task list as an array, then use the unnest function to break it into individual ids.
insert into employee_task(employee_id, task_id)
select 1, unnest(array[4,3,2,1,5]);

How to get row with largest value?

What I thought would work is:
SELECT *
FROM customer_sale
WHERE sale_date < '2019-02-01'
GROUP BY customer_id
HAVING sale_date = MAX(sale_date)
But running this results in an error
HAVING clause expression references column sale_date which is
neither grouped nor aggregated
Is there another way to achieve this in Spanner? And more generally, why isn't the above allowed?
Edit
Example of data in customer_sale table:
customer_id sale_date
-------------------------------
1 Jan 15
1 Jan 30
1 Feb 2
1 Feb 4
2 Jan 15
2 Feb 2
And the expected result:
customer_id sale_date
-------------------------------
1 Jan 30
2 Jan 15
A HAVING clause in SQL specifies that an SQL SELECT statement should
only return rows where aggregate values meet the specified conditions.
It was added to the SQL language because the WHERE keyword could not
be used with aggregate functions
This is the test table I am using:
index, customer_id, sale_date
1 1 2017-08-25T07:00:00Z
2 1 2017-08-26T07:00:00Z
3 1 2017-08-27T07:00:00Z
4 1 2017-08-28T07:00:00Z
5 2 2017-08-29T07:00:00Z
6 2 2017-08-30T07:00:00Z
With this query:
Select customer_id, max(sale_date) as max_date
from my_test_table
group by customer_id;
I get this result:
customer_id max_date
1 2017-08-28T07:00:00Z
2 2017-08-30T07:00:00Z
Also including a where statement:
Select customer_id, max(sale_date) as max_date
from my_test
where sale_date < '2017-08-28'
group by customer_id;
I had the same problem and this is how I was able to solve it. If you have a quite big table it might take some time.
Basically, joining your table with a derived table that holds the maximum sale_date per customer solves it.
select c.*
from (select * from customer_sale where sale_date < '2019-02-01') c
inner join
     (select customer_id, max(sale_date) as max_sale_date
      from customer_sale
      where sale_date < '2019-02-01'
      group by customer_id) max_c
  on c.customer_id = max_c.customer_id
 and c.sale_date = max_c.max_sale_date

Query optimization in Cassandra

I have a cassandra database that I need to query
My table looks like this:
Cycle Parameters Value
1 a 999
1 b 999
1 c 999
2 a 999
2 b 999
2 c 999
3 a 999
3 b 999
3 c 999
4 a 999
4 b 999
4 c 999
I need to get values for parameters "a" and "b" for two cycles, no matter which cycles they are.
Example results:
Cycle Parameters Value
1 a 999
1 b 999
2 a 999
2 b 999
or
Cycle Parameters Value
1 a 999
1 b 999
3 a 999
3 b 999
Since the database is quite huge, every query optimization is welcome.
My requirements are:
I want to do everything in 1 query
An answer with no nested query would be a plus
So far, I was able to accomplish these requirements with something like this:
select * from table where Parameters in ('a','b') sort by cycle, parameters limit 4
However, this query needs a "sortby" operation that causes huge processing in the database...
Any clues on how to do it? ....limit by partition maybe?
EDIT:
The table schema is:
CREATE TABLE cycle_data (
cycle int,
parameters text,
value double,
primary key(parameters,cycle)
)
"parameters" is the partition key and "cycle" is the clustering column
You can't query like this without ALLOW FILTERING. Don't use ALLOW FILTERING in production; only use it for development!
Read the datastax doc about using ALLOW FILTERING https://docs.datastax.com/en/cql/3.3/cql/cql_reference/select_r.html?hl=allow,filter
I assume your current schema is :
CREATE TABLE data (
cycle int,
parameters text,
value double,
primary key(cycle, parameters)
)
You need another table, or to change your table schema, to query like this:
CREATE TABLE cycle_data (
cycle int,
parameters text,
value double,
primary key(parameters,cycle)
)
Now you can query:
SELECT * FROM cycle_data WHERE parameters in ('a','b');
The results will automatically be sorted in ascending order by cycle for every parameter.
