I am trying to MERGE two tables using Spark SQL and getting an error with the statement.
The tables are created as external tables pointing to Azure ADLS storage. The SQL is executed using Databricks.
Table 1:
Name,Age,Sex
abc,24,M
bca,25,F
Table 2:
Name,Age,Sex
abc,25,M
acb,25,F
Table 1 is the target table and Table 2 is the source table.
Table 2 contains one insert and one update record which need to be merged into target Table 1.
Query:
MERGE INTO table1 USING table2 ON (table1.name = table2.name)
WHEN MATCHED AND table1.age <> table2.age AND table1.sex <> table2.sex
THEN UPDATE SET table1.age = table2.age, table1.sex = table2.sex
WHEN NOT MATCHED
THEN INSERT (name, age, sex) VALUES (table2.name, table2.age, table2.sex)
Does Spark SQL support MERGE, or is there another way of achieving this?
Thanks
Sat
To use MERGE you need the Delta Lake option (and the associated jars); then MERGE works.
Otherwise, SQL MERGE is not supported by Spark, and you need the DataFrame Writer APIs with your own upsert logic; there are a few different ways to do this. Even with ORC ACID tables, Spark will not work this way.
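For illustration, a minimal sketch of the same upsert against Delta tables, assuming table1 and table2 are (re)created with USING DELTA (names follow the question; note the OR so a change in either column triggers the update, and the comma-separated SET clauses):

-- Requires Delta Lake; both tables must be Delta tables, e.g.
-- CREATE TABLE table1 (name STRING, age INT, sex STRING) USING DELTA LOCATION '<adls path>';
MERGE INTO table1
USING table2
ON table1.name = table2.name
WHEN MATCHED AND (table1.age <> table2.age OR table1.sex <> table2.sex)
  THEN UPDATE SET table1.age = table2.age, table1.sex = table2.sex
WHEN NOT MATCHED
  THEN INSERT (name, age, sex) VALUES (table2.name, table2.age, table2.sex)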
Related
I have a table in SQL that is copied to ADLS. After the copy, new rows were inserted into the SQL table, and I want to pick up only those new rows.
I tried to use a Join transformation but couldn't get the expected output. What is the way to achieve this?
Refer to this link. Using this approach you can get newly added rows from SQL into Data Lake Storage. I reproduced the scenario on my side and was able to get the newly added records from the pipeline.
I created two tables in the SQL database named data_source_table and watermarktable.
data_source_table is the one holding the actual data, and watermarktable is used for tracking new records based on a date (the watermark).
I created the pipeline as shown below.
In Lookup1, the watermarktable is selected (it returns the previously stored watermark, WatermarkValue).
In Lookup2, select Query and use the following to get the new watermark:
select MAX(LastModifytime) as NewWatermarkvalue from data_source_table;
Then in the Copy activity, the source and sink are configured as shown in the images below.
SOURCE:
Query in Source:
select * from data_source_table where LastModifytime > '@{activity('Lookup1').output.firstRow.WatermarkValue}' and LastModifytime <= '@{activity('Lookup2').output.firstRow.NewWatermarkvalue}'
SINK:
The pipeline ran successfully and the data in the SQL table was loaded into a file in Data Lake Storage.
I then inserted new rows into data_source_table and was able to pick up only those new records on the next pipeline run, driven by the Lookup activities.
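For completeness, a hedged sketch of the supporting objects this pattern typically relies on; the table and procedure names here are assumptions based on the incremental-copy tutorial, not taken from the screenshots above:

-- Watermark tracking table (one row per source table)
CREATE TABLE watermarktable
(
    TableName varchar(255),
    WatermarkValue datetime
);
GO

-- Called from the pipeline after the Copy activity to advance the watermark
CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)
AS
BEGIN
    UPDATE watermarktable
    SET WatermarkValue = @LastModifiedtime
    WHERE TableName = @TableName;
END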
We have a Hive managed table (it is both partitioned and bucketed, with 'transactional' = 'true').
We are using Spark (version 2.4) to interact with this Hive table.
We are able to successfully ingest data into the table using the following:
sparkSession.sql("insert into table values(''))
But we are not able to delete a row from this table. We are attempting to delete using the command below:
sparkSession.sql("delete from table where col1 = '' and col2 = ''")
We are getting an operationNotAccepted exception.
Do we need to do anything specific to be able to perform this action?
Thanks
Anuj
Unless it is a Delta table, this is not possible from Spark.
Deletes on ORC Hive bucketed (transactional) tables are not supported from Spark. See https://github.com/qubole/spark-acid
HUDI on AWS could also be an option.
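As an illustrative sketch of the Delta route on Databricks, assuming the data can be rewritten as a Delta table (table and column names are placeholders):

-- One-time copy of the Hive table into a Delta table; row-level deletes then work from Spark SQL
CREATE TABLE my_table_delta USING DELTA AS SELECT * FROM my_hive_table;

DELETE FROM my_table_delta WHERE col1 = 'x' AND col2 = 'y';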
I have an Excel file as the source that needs to be copied into an Azure SQL database using Azure Data Factory.
The ADF pipeline should copy a row from the Excel source to the SQL database only if it does not already exist in the database; if it already exists, no action should be taken.
I am looking for the most optimized solution.
You can achieve this using an Azure Data Factory data flow: join the source and sink data, then filter so that only rows that do not already exist in the sink database are inserted.
Example:
Connect the Excel source to a source transformation in the data flow.
Source preview:
Optionally, you can transform the source data using a Derived Column transformation.
Add another source transformation and connect it to the sink dataset (Azure SQL database). In the Source options, you can select a table if you are comparing all columns of the sink dataset with the source dataset, or you can select Query and write a query that selects only the matching columns (a minimal example follows these steps).
Source2 output:
Join the source1 and source2 streams using a Join transformation with the join type set to Left outer, and add the join conditions based on your requirement.
Join output:
Using a Filter transformation, filter out the existing rows from the join output.
Filter condition: isNull(source2@Id)==true()
Filter output:
Using a Select transformation, you can remove the duplicate columns (such as the source2 columns). You can also do this in the sink mapping by manually deleting the duplicate mappings.
Add a sink and connect it to the sink dataset (Azure SQL database) to get the required output.
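For the source2 query mentioned above, a minimal hedged example; the table and column names are placeholders, not from the question:

-- Select only the key column(s) used in the join so the lookup stream stays small
SELECT Id FROM dbo.TargetTable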
You should do this using a Copy activity with a stored procedure as the Sink. Write code in the stored procedure (e.g. MERGE, or INSERT ... WHERE NOT EXISTS ...) to handle whether the record exists or not.
An example of a MERGE proc from the documentation:
CREATE PROCEDURE usp_OverwriteMarketing
    @Marketing [dbo].[MarketingType] READONLY,
    @category varchar(256)
AS
BEGIN
    MERGE [dbo].[Marketing] AS target
    USING @Marketing AS source
    ON (target.ProfileID = source.ProfileID and target.Category = @category)
    WHEN MATCHED THEN
        UPDATE SET State = source.State
    WHEN NOT MATCHED THEN
        INSERT (ProfileID, State, Category)
        VALUES (source.ProfileID, source.State, source.Category);
END
This article runs through the process in more detail.
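Note that the procedure takes a table-valued parameter, so a matching user-defined table type must also exist. A hedged sketch, with the column types assumed from the columns the procedure uses:

CREATE TYPE [dbo].[MarketingType] AS TABLE (
    ProfileID varchar(256),
    State varchar(256),
    Category varchar(256)
);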
I am connecting to a Delta table in an Azure Gen2 data lake by mounting it in Databricks and creating a table (USING DELTA). I am then connecting to this table in Power BI using the Databricks connector.
Firstly, I am unclear as to the relationship between the data lake and the Spark table in Databricks. Is it correct that the Spark table retrieves the latest snapshot from the data lake (delta lake) every time it is itself queried? Is it also the case that it is not possible to effect changes in the data lake via operations on the Spark table?
Secondly, what is the best way to reduce the columns in the Spark table (ideally before it is read into Power BI)? I have tried creating the Spark table with a specified subset of columns but get a 'cannot change schema' error. Instead, I can create another Spark table that selects from the first one, but this seems pretty inefficient and (I think) would need to be recreated frequently in line with the refresh schedule of the Power BI report. I don't know if it's possible to have a Spark Delta table that references another Spark Delta table so that the former also reflects the latest snapshot when queried?
As you can tell, my understanding of this is limited (as is the documentation!) but any pointers very much appreciated.
Thanks in advance and for reading!
A table in Spark is just metadata that specifies where the data is located. When you read the table, Spark under the hood looks up the metastore for information about where the data is stored, its schema, etc., and accesses that data. Changes made on ADLS will also be reflected in the table. It is also possible to modify the table from your tools, but that depends on what access rights are available to the Spark cluster that processes the data - you can set permissions either at the ADLS level or using table access control.
For the second part, you just need to create a view over the original table that selects only a limited set of columns - the data is not copied, and the latest updates to the original table will always be available for querying. Something like:
CREATE OR REPLACE VIEW myview
AS SELECT col1, col2 FROM mytable
P.S. If you're only accessing the data via Power BI or other BI tools, you may want to look at Databricks SQL (once it is in public preview), which is heavily optimized for BI use cases.
I am trying to ingest data from a Sybase source into Azure Data Lake. I am ingesting several tables using a watermark table that holds the table names from the Sybase source. The process works fine for a full import; however, we are trying to import the tables every 15 minutes to feed a dashboard, and we don't need to ingest the whole table since we don't need all the data from it.
The tables don't have a dateModified column or any kind of incremental ID to drive an incremental load. The only way of filtering out unwanted data is to join to another lookup table at the source and then apply a filter value in the WHERE clause.
Is there a way to do this in Azure Data Factory? I have attached a screenshot of my current pipeline to make it a bit clearer.
Many thanks for looking into this. I have managed to find a solution. I was using a watermark table to ingest about 40 tables with one pipeline; my only issue was how to use a join and a WHERE filter in my query without hard-coding them in the pipeline. I achieved this by adding "Join" and "Where" columns to my watermark table and then passing them into the Query as @{item().Join} @{item().Where}. It worked like magic.
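For illustration, a hedged sketch of what one iteration's composed source query could resolve to; the table, join, and filter values below are made-up placeholders, not the actual Sybase schema:

-- @{item().Join} supplies the join, @{item().Where} supplies the filter
SELECT t.*
FROM dbo.Orders t
JOIN dbo.OrderStatus s ON s.OrderId = t.OrderId   -- value of item().Join
WHERE s.StatusCode = 'OPEN'                       -- value of item().Where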