How to REPLACE INTO using MERGE INTO for upsert in Delta Lake? - apache-spark

The recommended way of doing an upsert in a Delta table is the following:
MERGE INTO users
USING updates
ON users.userId = updates.userId
WHEN MATCHED THEN
UPDATE SET address = updates.address
WHEN NOT MATCHED THEN
INSERT (userId, address) VALUES (updates.userId, updates.address)
Here updates is a table. My question is: how can we do an upsert directly, that is, without using a source table? I would like to provide the values myself directly.
In SQLite, we could simply do the following.
REPLACE INTO table(column_list)
VALUES(value_list);
Is there a simple way to do that for Delta tables?

A source table can be a subquery, so the following should give you what you're after.
MERGE INTO events
USING (VALUES(...)) -- round brackets are required to denote a subquery
ON false -- an artificial merge condition
WHEN NOT MATCHED THEN INSERT *
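Putting that together for the users table from the question, a fuller upsert might look like the sketch below. The literal values, the alias and the column names are just placeholders, and depending on your Spark/Delta version you may need to spell the inline source as a SELECT subquery instead of VALUES.
MERGE INTO users
USING (VALUES (42, '123 Main St')) AS updates(userId, address) -- inline rows instead of a source table
ON users.userId = updates.userId -- real match condition, so this is a true upsert
WHEN MATCHED THEN
  UPDATE SET address = updates.address
WHEN NOT MATCHED THEN
  INSERT (userId, address) VALUES (updates.userId, updates.address)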

Related

Power Query how to make a Table with multiple values a parameter that uses OR

I have a question regarding Power Query and tables as parameters in Excel.
Right now I can create a table and use it as a parameter for Power Query via drill down.
But I'm unsure how I would proceed with a table that has multiple values. How can a table with multiple "values" be recognized as a parameter?
For example:
I have the following raw data and parameter tables:
(screenshot: raw data and parameter tables)
Now, if I wanted to filter on Value2 with a parameter table, I would do a drill down of the parameter table and load it into Excel.
After that I have two tables, so I can filter Value2 with an OR condition by 1 and 2.
Is it possible to somehow combine this into one table so that it still uses an OR condition to search Value2?
I'm asking because I want it to be possible to just add more and more parameters to the table without creating a new table every time. Basically, just copy and paste some parameters into the parameter table and be done with it.
Thanks for any help in advance
Assuming you use Parameters only for filtering: there are other ways, but this one looks the best from a performance point of view.
You may create a Parameters table, so you have tables like these:
Note that it's handy to have the same name (Value2) for the key column in both tables; otherwise Table.Join will create additional column(s) after merging the tables.
Add a similar step to filter the RawData table:
join = Table.Join(RawData, "Value2", Parameters, "Value2")
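As a minimal sketch (assuming both tables are loaded from the current workbook as RawData and Parameters, and that Parameters holds nothing but a single Value2 column), the whole query could look like this; the inner join keeps exactly the RawData rows whose Value2 appears in the parameter table, which behaves like an OR over all parameter values:
let
    // table names are assumptions; adjust to your workbook
    RawData    = Excel.CurrentWorkbook(){[Name="RawData"]}[Content],
    Parameters = Excel.CurrentWorkbook(){[Name="Parameters"]}[Content],
    // keep only RawData rows whose Value2 matches any row in Parameters
    Filtered   = Table.Join(RawData, "Value2", Parameters, "Value2")
in
    Filtered
Adding more values to the Parameters table then widens the filter without touching the query.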

How can a CQL collection CONTAINS query match alternative values?

I have a question about querying a Cassandra collection.
I want to make a query that works with collection search.
CREATE TABLE rd_db.test1 (
testcol3 frozen<set<text>> PRIMARY KEY,
testcol1 text,
testcol2 int
)
This is the table structure, and this is the table's contents.
In this situation, I want to write a CQL query with alternative option values on the set column.
If it were SQL and testcol3 weren't a collection, I could write:
select * from rd_db.test1 where testcol3 = 4 or testcol3 = 5
But it is CQL and a collection, so I tried:
select * from test1 where testcol3 contains '4' OR testcol3 contains '5' ALLOW FILTERING ;
select * from test1 where testcol3 IN ('4','5') ALLOW FILTERING ;
but these two queries didn't work...
Please help...
This won't work for you for multiple reasons:
there is no OR operation in CQL
you can only do a full match on the value of the partition key (testcol3)
although you may create secondary indexes for fields with collection types, it's impossible to create an index on the values of the partition key
You need to change the data model, and you need to know the queries that you're executing in advance. From a brief look at your data model, I would suggest rolling the set field out into multiple rows, with individual values corresponding to individual partitions, as sketched below.
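For example (a sketch only, with made-up table and column names), each element of the original set could become the partition key of its own row, which turns the alternative-value lookup into a plain IN query on the partition key:
-- one row per element of the original frozen set
CREATE TABLE rd_db.test1_by_value (
    setval   text,
    testcol1 text,
    testcol2 int,
    PRIMARY KEY (setval, testcol1)
);
-- alternative values become an IN restriction on the partition key
SELECT * FROM rd_db.test1_by_value WHERE setval IN ('4', '5');
The application then has to write one row per set element when inserting.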
I also want to suggest taking the DS201 & DS220 courses on the DataStax Academy site for a better understanding of how Cassandra works and how to model data for it.

U-SQL view to merge duplicates via ranking

I have data lying in multiple files with the naming convention {year}/{month}/{date}, which contain duplicates (daily deltas where records may get updated every day).
I want to create a view that returns the records with the duplicates merged/squashed.
The duplicates are ranked and only the latest updated record for each primary key is returned.
But the use of rowsets in a view does not seem to be supported. Basically I want something like this:
CREATE VIEW viewname AS
@sourceData = EXTRACT //schema
FROM //filenamePattern (regex)
USING Extractors.Tsv();
@sourceData = SELECT *,
ROW_NUMBER() OVER(PARTITION BY primary_Key ORDER BY timestamp DESC) AS RowNumber FROM @sourceData;
SELECT //schema
FROM @sourceData WHERE RowNumber == 1;
So that when I do
select * from viewname
I get the merged data directly from the underlying files. How can I achieve this?
It is possible to have multiple EXTRACT statements in a view, stacked together with a UNION statement, which would implicitly remove duplicates. However, is there any particular reason you need to use a view? It will limit your options, as you will have to code within the limitations of views (e.g. they can't be parameterised). You could also use a table-valued function, a stored procedure, or just a plain old script. That would give you many more options, especially if your de-duplication logic is complex. A simple example:
DROP VIEW IF EXISTS vw_removeDupes;
CREATE VIEW vw_removeDupes
AS
EXTRACT someVal int
FROM "/input/input59a.txt"
USING Extractors.Tsv()
UNION
EXTRACT someVal int
FROM "/input/input59b.txt"
USING Extractors.Tsv();
I think it can be solved with a table-valued function. Have you tried using one?
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/u-sql-functions
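A table-valued function for the de-duplication could look roughly like the sketch below; the schema, the file-set pattern and the column names (primary_Key, payload, ts) are placeholders based on the question, not a tested implementation.
DROP FUNCTION IF EXISTS dbo.GetLatestRecords;
CREATE FUNCTION dbo.GetLatestRecords()
RETURNS @result TABLE(primary_Key string, payload string, ts DateTime)
AS
BEGIN
    // read every daily delta file via a virtual-column file set
    @sourceData =
        EXTRACT primary_Key string,
                payload string,
                ts DateTime,
                year string,
                month string,
                date string
        FROM "/data/{year}/{month}/{date}.tsv"
        USING Extractors.Tsv();
    // rank rows per primary key, newest first
    @ranked =
        SELECT primary_Key, payload, ts,
               ROW_NUMBER() OVER(PARTITION BY primary_Key ORDER BY ts DESC) AS RowNumber
        FROM @sourceData;
    // keep only the latest row per primary key
    @result =
        SELECT primary_Key, payload, ts
        FROM @ranked
        WHERE RowNumber == 1;
END;
A script can then consume it like any other rowset, e.g. @latest = dbo.GetLatestRecords();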

node.js and postgres bulk upsert or another pattern?

I am using Postgres, NodeJS and Knex.
I have the following situation:
A database table with a unique field.
In NodeJS I have an array of objects and I need to:
a. Insert a new row, if the table does not contain the unique id, or
b. Update the remaining fields, if the table does contain the unique id.
To my knowledge I have three options:
Do a query for each item to check whether it exists in the database and, based on the response, do an update or an insert. This costs resources because there's a call for each array item plus an insert or update.
Delete all rows whose id is in the array and then perform an insert. This would mean only two operations, but the autoincrement field will keep growing.
Perform an upsert, since Postgres 9.5 supports it. Bulk upsert seems to work and there's only one call to the database.
Looking through the options I am aware of, upsert seems the most reasonable one, but does it have any drawbacks?
Upsert is a common way.
Another way is to use separate insert/update operations, which will most likely be faster:
Find the existing rows:
select id from t where id in (object-ids) (*)
Update the existing rows using the (*) result.
Filter the array by (*) and bulk insert the new rows.
See more details for the same question here.
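For reference, a minimal bulk-upsert sketch in plain PostgreSQL (table and column names are made up); from Knex you could send it with knex.raw, and newer Knex versions also expose insert(...).onConflict(...).merge() for the same purpose:
-- requires a unique constraint or index on unique_id
INSERT INTO items (unique_id, name, price)
VALUES
    ('a1', 'first item',  10),
    ('a2', 'second item', 20)
ON CONFLICT (unique_id)
DO UPDATE SET name  = EXCLUDED.name,
              price = EXCLUDED.price;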

Cassandra : Using output of one query as input to another query

I have two tables: one is users and the other is expired_users.
users columns-> id, name, age
expired_users columns -> id, name
I want to execute the following query:
delete from users where id in (select id from expired_users);
This query works fine with SQL databases. I want to find a solution for this in Cassandra.
PS: I don't want to add any extra columns to the tables.
When designing a Cassandra data model, we cannot think exactly like in an RDBMS.
Design it like this:
create table users (
id int,
name text,
age int,
expired boolean static,
primary key (id,name)
);
To mark a user as expired, just insert the same row again:
insert into users (id,name,age,expired) values (100,'xyz',80,true);
You don't have to update or delete the row; just insert it again, and the previous column values will get overridden.
What you want is to use a join as a filter for your delete statement, and this is not what the Cassandra model is built for.
AFAIK there is no way to perform this using CQL. If you want to perform this action without changing the schema, run an external script in any language that has a driver for Cassandra.
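As an illustration only (Python with the DataStax cassandra-driver; the contact point and keyspace name are placeholders), such a script would read the ids from expired_users and issue the deletes against users one by one:
from cassandra.cluster import Cluster

# contact point and keyspace are placeholders
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

# fetch the ids of expired users, then delete each of them from users
for row in session.execute('SELECT id FROM expired_users'):
    session.execute('DELETE FROM users WHERE id = %s', (row.id,))

cluster.shutdown()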
