Executing DDL statements inside Spark is slow and not in order - apache-spark

I'm trying to execute multiple commands with cqlSession inside Spark, but the commands are very slow, don't respect the order, and the table creation fails. I need to wait for the DROP to finish before creating the new table.
delete:
connector.withSessionDo(cqlSession => cqlSession.execute("DROP TABLE IF EXISTS person"))
create:
connector.withSessionDo(cqlSession => cqlSession.execute("CREATE TABLE IF NOT EXISTS person(id text PRIMARY KEY)"))
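For comparison, here is a minimal sketch of running the same two DDL statements strictly in sequence outside Spark, using the DataStax Python driver (the contact point and keyspace are placeholders). Each execute() call is synchronous, and after a schema-changing statement the driver waits for cluster-wide schema agreement (bounded by max_schema_agreement_wait) before returning, so the CREATE only starts once the DROP has completed:

from cassandra.cluster import Cluster

# Placeholder contact point and keyspace; adjust to your cluster.
cluster = Cluster(["127.0.0.1"], max_schema_agreement_wait=30)
session = cluster.connect("my_keyspace")

# Synchronous, ordered execution: the CREATE is only sent after the DROP
# has returned and the cluster has agreed on the new schema.
session.execute("DROP TABLE IF EXISTS person")
session.execute("CREATE TABLE IF NOT EXISTS person (id text PRIMARY KEY)")

cluster.shutdown()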

Related

Spark - nondeterministic expressions not allowed using Delete FROM and exists

The goal of this query is to delete the records that have the same keys as the new dataframe.
This is the query that I'm executing:
val op = spark.sql(s"""
  DELETE FROM TABLE1 AS t
  WHERE EXISTS (
    SELECT 1
    FROM TABLE2 AS s
    WHERE t.DAY = s.DAY
      AND t.DATA_STREAM = s.DATA_STREAM
  )""")
The error that I'm getting is:
AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate, Window, or Generate, but found:
exists(t.DAY, t.DATA_STREAM)
in operator DeleteCommandEdge com.databricks.sql.transaction.tahoe.DeltaLog#49e1c8fd, exists#107963 [DAY#108138 && DATA_STREAM#108193]
;
DeleteCommandEdge com.databricks.sql.transaction.tahoe.DeltaLog#49e1c8fd, exists#107963 [DAY#108138 && DATA_STREAM#108193]
As far as I know exists is deterministic. Is there another way to achieve the same result with another query?
The query is running on a Databricks cluster with Spark 3.2.1
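One workaround that avoids the subquery inside DELETE is to express the same logic as a Delta MERGE with a delete-only matched clause. A minimal sketch (shown via spark.sql from Python, though the same SQL works from the Scala snippet above; it assumes TABLE1 is a Delta table and that DAY and DATA_STREAM are the matching keys, as in the question):

# Sketch: rewrite the correlated DELETE as a MERGE that deletes matched rows.
# SELECT DISTINCT keeps a single source row per key, so the merge cannot
# match the same target row more than once.
spark.sql("""
    MERGE INTO TABLE1 AS t
    USING (SELECT DISTINCT DAY, DATA_STREAM FROM TABLE2) AS s
    ON t.DAY = s.DAY AND t.DATA_STREAM = s.DATA_STREAM
    WHEN MATCHED THEN DELETE
""")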

Running multiple SQL statements using Boto3 and AWS Glue

I would like to run multiple SQL statements in a single AWS Glue script using boto3.
The first query creates a table from an S3 bucket (Parquet files):
import boto3
client = boto3.client('athena')
config = {'OutputLocation': 's3://LOGS'}
client.start_query_execution(QueryString =
"""CREATE EXTERNAL TABLE IF NOT EXISTS my_database_name.my_table (
`apples` string,
`oranges` string,
`price` int
) PARTITIONED BY (
update_date string
)
STORED AS PARQUET
LOCATION 's3://LOCATION'
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');""",
QueryExecutionContext = {'Database': 'my_database_name'},
ResultConfiguration = config)
This only creates the table. Then I have to run the following query in order to load the partitions and make the data queryable.
client.start_query_execution(QueryString =
"""MSCK REPAIR TABLE my_database_name.my_table;""",
QueryExecutionContext = {'Database': 'my_database_name'},
ResultConfiguration = config)
Unfortunately, when I run the above statements in a single Glue script, the partitions are not updated (only the table is created). I have to separate them into two jobs.
Is it possible to have a single script that can execute multiple queries in sequence?
Using Glue Crawlers is not an option.
You should explore the alternative of using partition projection, which removes the need to load partitions via MSCK REPAIR TABLE or crawlers. See the docs: https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html
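Separately from partition projection, note that start_query_execution is asynchronous: it only submits the query, so in a single script the MSCK REPAIR can start before the CREATE has finished. A minimal sketch of sequencing the two calls by polling get_query_execution (reusing the client, config and DDL from the question; wait_for_query is an illustrative helper, not a boto3 API):

import time
import boto3

client = boto3.client('athena')
config = {'OutputLocation': 's3://LOGS'}

def wait_for_query(client, query_execution_id, poll_seconds=2):
    # Poll Athena until the query reaches a terminal state.
    while True:
        response = client.get_query_execution(QueryExecutionId=query_execution_id)
        state = response['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return state
        time.sleep(poll_seconds)

create_ddl = """CREATE EXTERNAL TABLE IF NOT EXISTS my_database_name.my_table (
  `apples` string, `oranges` string, `price` int
) PARTITIONED BY (update_date string)
STORED AS PARQUET
LOCATION 's3://LOCATION'
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');"""

create_id = client.start_query_execution(
    QueryString=create_ddl,
    QueryExecutionContext={'Database': 'my_database_name'},
    ResultConfiguration=config)['QueryExecutionId']

# Only repair the partitions once the table has actually been created.
if wait_for_query(client, create_id) == 'SUCCEEDED':
    repair_id = client.start_query_execution(
        QueryString="MSCK REPAIR TABLE my_database_name.my_table;",
        QueryExecutionContext={'Database': 'my_database_name'},
        ResultConfiguration=config)['QueryExecutionId']
    wait_for_query(client, repair_id)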

Jdbc update statement in spark

I am connected to a database using JDBC and I am trying to run an update query. First I write the query, then I execute it (the same way I run my SELECT queries, which work perfectly fine).
caseoutputUpdateQuery = "(UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES') alias_output "
spark.read.jdbc(url=jdbcUrl, table=caseoutputUpdateQuery, properties=connectionProperties)
When I run this I have the following error:
A nested INSERT, UPDATE, DELETE, or MERGE statement must have an OUTPUT clause.
I tried to fix this in different ways but there is always another error. For example, I tried to rewrite the query in the following way:
caseoutputUpdateQuery = "(UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES' OUTPUT DELETED.*, INSERTED.* FROM dbo.CASEOUTPUT_TEST) alias_output "
but I encounter this error:
A nested INSERT, UPDATE, DELETE, or MERGE statement is not allowed in a SELECT statement that is not the immediate source of rows for an INSERT statement.
The other way I tried to rewrite it was:
caseoutputUpdateQuery = "(INSERT INTO dbo.UpdateOutput(OldCaseID,NotifiedOld) SELECT * FROM( UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES' OUTPUT deleted.OldCaseID,DELETED.NotifiedOld ) AS tbl) alias_output "
but I've got this error:
A nested INSERT, UPDATE, DELETE, or MERGE statement is not allowed inside another nested INSERT, UPDATE, DELETE, or MERGE statement.
I've literally tried everything I found on the internet but without luck. Do you have any suggestion on how I can fix this and run my update statement?
I think Spark is not designed for that UPDATE statement use case. That's not a scenario where Spark can help deal with an RDBMS. I suggest using a direct JDBC connection from the code you are writing (I mean calling that JDBC driver directly). If you are using Scala you can do it as suggested here (for example; there are multiple other ways), or from Python as explained here. Those samples target an Oracle engine, but change the driver/connector if you are using MySQL, SQL Server, Postgres or any other RDBMS.
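A minimal sketch of that direct approach from Python, assuming the target is SQL Server (suggested by the dbo. schema in the question) and using pyodbc (ODBC rather than JDBC, but the same idea of bypassing spark.read); the connection details are placeholders:

import pyodbc

# Placeholder connection string; adjust server, database and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;UID=myuser;PWD=mypassword"
)
cursor = conn.cursor()

# Run the UPDATE directly instead of wrapping it in spark.read.jdbc.
cursor.execute("UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES'")
conn.commit()
print(cursor.rowcount, "rows updated")

cursor.close()
conn.close()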
spark.read under the covers does a SELECT * from the source JDBC table. If you pass a query, Spark translates it to
SELECT <columns>
FROM ( <your query> )
so SQL complains, because you are effectively trying to run an UPDATE inside a "SELECT * FROM (...)" subquery.

ADF copy data activity - check for duplicate records before inserting into SQL db

I have a very simple ADF pipeline to copy data from a local MongoDB (self-hosted integration runtime) to an Azure SQL database.
My pipeline is able to copy the data from MongoDB and insert it into the SQL db.
Currently, if I run the pipeline multiple times, it inserts duplicate data.
I have made the _id column unique in the SQL database, and now running the pipeline throws an error because the SQL constraint won't let it insert the record.
How do I check for a duplicate _id before inserting into the SQL db?
Should I use a pre-copy script / stored procedure?
Some guidance / directions on where to add the extra steps would be helpful. Thanks.
Azure Data Factory Data Flow can help you achieve that.
You can follow these steps:
Add two sources: the Cosmos DB table (source1) and the SQL database table (source2).
Use a Join transformation to get all the data from the two tables (left join/full join/right join) on Cosmos table.id = SQL table.id.
Use an Alter Row transformation to filter out duplicate _id values: if a row is not a duplicate, insert it.
Then map the non-duplicate columns to the sink SQL database table.
Hope this helps.
You should implement your SQL logic to eliminate duplicates in the pre-copy script.
For now I went with a solution using a stored procedure, which looks like a lot less work as far as this requirement is concerned.
I have followed this article:
https://www.cathrinewilhelmsen.net/2019/12/16/copy-sql-server-data-azure-data-factory/
I created a table type and used it in the stored procedure to check for duplicates.
My sproc is very simple, as shown below:
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[spInsertIntoDb]
    (@sresults dbo.targetSensingResults READONLY)
AS
BEGIN
    MERGE dbo.sensingresults AS target
    USING @sresults AS source
    ON (target._id = source._id)
    WHEN NOT MATCHED THEN
        INSERT (_id, sensorNumber, applicationType, place, spaceType, floorCode, zoneCountNumber, presenceStatus, sensingTime, createdAt, updatedAt, _v)
        VALUES (source._id, source.sensorNumber, source.applicationType, source.place, source.spaceType, source.floorCode,
                source.zoneCountNumber, source.presenceStatus, source.sensingTime, source.createdAt, source.updatedAt, source._v);
END
I think using a stored proc should do for now, and it will also help in the future if I need to do more transformations.
Please let me know if using a sproc in this case has any potential risks down the road.
To remove the duplicates you can use the pre-copy script. Alternatively, you can land the incremental/new data in a temp table using the copy activity, then use a stored procedure to delete from the main table only the IDs that are present in the temp table, insert the temp table data into the main table, and finally drop the temp table.

Is there a way to apply a loop in cql file cassandra?

I have to run the following insert query 1000 times in a BATCH just after loading a schema.
INSERT INTO keyspace.messages (messageid, message) VALUES
(uuid(), 'random');
My current implementation is a radom.cql file, which has 1000 entries like the script below. I then use the SOURCE command to apply it after my schema upload.
BEGIN BATCH
INSERT INTO keyspace.messages (messageid, message) VALUES (uuid(), 'random');
INSERT INTO keyspace.messages (messageid, message) VALUES (uuid(), 'random');
INSERT INTO keyspace.messages (messageid, message) VALUES (uuid(), 'random');
...till 1000 times
APPLY BATCH;
Is there any better way to achieve the same result?
Cassandra doesn't have any PL/SQL constructs or stored procedures yet, so this isn't possible in a plain .cql file.
You have to do it from the application side; a batch doesn't help in this scenario and is a misuse of the feature, since Cassandra batches are meant for atomic writes (ideally within a single partition), not for bulk loading.
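As an example of doing it from the application side, here is a minimal sketch with the DataStax Python driver (the contact point is a placeholder; the keyspace and table names are taken from the question). It prepares the statement once and runs 1000 individual inserts, no batch required:

import uuid
from cassandra.cluster import Cluster

# Placeholder contact point; adjust to your cluster.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Prepare once, then issue 1000 separate inserts.
insert = session.prepare(
    "INSERT INTO keyspace.messages (messageid, message) VALUES (?, ?)"
)
for _ in range(1000):
    session.execute(insert, (uuid.uuid4(), 'random'))

cluster.shutdown()

For higher throughput, the driver's cassandra.concurrent.execute_concurrent_with_args helper can run the same prepared insert with many parameter sets in parallel.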
