Spark - nondeterministic expressions not allowed using Delete FROM and exists - apache-spark

The goal of this query is to delete the records that have the same keys of the new dataframe.
This is the query that I'm executing:
val op=spark.sql(s"""
DELETE
FROM TABLE1 AS t
WHERE EXISTS (
SELECT 1
FROM TABLE2 AS s
WHERE t.DAY=s.DAY
AND t.DATA_STREAM=s.DATA_STREAM
)""")
The error that I'm getting is:
AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate, Window, or Generate, but found:
exists(t.DAY, t.DATA_STREAM)
in operator DeleteCommandEdge com.databricks.sql.transaction.tahoe.DeltaLog#49e1c8fd, exists#107963 [DAY#108138 && DATA_STREAM#108193]
;
DeleteCommandEdge com.databricks.sql.transaction.tahoe.DeltaLog#49e1c8fd, exists#107963 [DAY#108138 && DATA_STREAM#108193]```
As far as I know exists is deterministic. Is there another way to achieve the same result with another query?
The query is running on a Databricks cluster with Spark 3.2.1

Related

Cosmos DB spatial query using Spark

I would like to query a cosmos db collection using a spatial query. Specifically the ST_DISTANCE query. This query works as intended using the azure-cosmos Python SDK.
I am looking to use this query via Apache Spark for a more complex query pattern. However, using the ST_DISTANCE query in a SQL cell in a notebook results in the following error.
Error in SQL statement: AnalysisException: Undefined function: 'ST_DISTANCE'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
The notebook is initialized as follows.
# Configure Catalog Api to be used
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", cosmosEndpoint)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", cosmosMasterKey)
from pyspark.sql.functions import col
df = spark.read.format("cosmos.oltp").options(**cfg)\
.option("spark.cosmos.read.inferSchema.enabled", "true")\
.load()
df.createOrReplaceTempView("outlets")
_______________________________________________________________________
%sql
SELECT * FROM outlets f WHERE ST_DISTANCE(f.boundary, POINT(0,0)) < 600
Based on what I understand from the Cosmos DB Spark connector github repo[1], not all Cosmos DB filter queries are supported via the connector (yet?). So the ST_DISTANCE and other filter functions in the spatial family aren't going to work as those aren't predicates that are natively supported by Spark to be pushed down to the database.
Found something that will help sail past this issue at least temporarily. The query config[2] allows sending a custom query directly to Cosmos DB. A temporary view can be built and queried over. This will not work for all use cases, but this solved my issue where I need a single view with distance filtering done. Rest can be handled via Spark SQL.
Refer spark.cosmos.read.customQuery[2] in below sample.
outlets_cfg = {
"spark.cosmos.accountEndpoint" : cosmosEndpoint,
"spark.cosmos.accountKey" : cosmosMasterKey,
"spark.cosmos.database" : cosmosDatabaseName,
"spark.cosmos.container" : cosmosContainerName,
"spark.cosmos.read.customQuery" : "SELECT * FROM c WHERE ST_DISTANCE(c.location,{\"type\":\"Point\",\"coordinates\": [12.832489, 18.9553242]}) < 1000"
}
df = spark.read.format("cosmos.oltp").options(**outlets_cfg)\
.option("spark.cosmos.read.inferSchema.enabled", "true")\
.load()
df.createOrReplaceTempView("outlets")
[1] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/
[2] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/docs/configuration-reference.md#query-config

Jdbc update statement in spark

I am connected to a database using JDBC and I am trying to run an update query. First I am typing the query, then I am executing it (in the same way I do the SELECT which works perfectly fine).
caseoutputUpdateQuery = "(UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES') alias_output "
spark.read.jdbc(url=jdbcUrl, table=caseoutputUpdateQuery, properties=connectionProperties)
When I run this I have the following error:
A nested INSERT, UPDATE, DELETE, or MERGE statement must have an OUTPUT clause.
I tried to fix this in different ways but there is always another error. For example, I tried to rewrite the query in the following way:
caseoutputUpdateQuery = "(UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES' OUTPUT DELETED.*, INSERTED.* FROM dbo.CASEOUTPUT_TEST) alias_output "
but I encounter this error:
A nested INSERT, UPDATE, DELETE, or MERGE statement is not allowed in a SELECT statement that is not the immediate source of rows for an INSERT statement.
The other way I tried to rewrite it was:
caseoutputUpdateQuery = "(INSERT INTO dbo.UpdateOutput(OldCaseID,NotifiedOld) SELECT * FROM( UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES' OUTPUT deleted.OldCaseID,DELETED.NotifiedOld ) AS tbl) alias_output "
but I've got this error:
A nested INSERT, UPDATE, DELETE, or MERGE statement is not allowed inside another nested INSERT, UPDATE, DELETE, or MERGE statement.
I've literally tried everything I found on the internet but without luck. Do you have any suggestion on how I can fix this and run my update statement?
I think Spark is not designed for that UPDATE statement use case. That's not the scenario where Spark can help to deal with RDBMS. I suggest to use a direct connection using a JDBC from the code you are writing (I mean calling that JDBC directly). If you are using Scala you can use as suggested here (for example, but there are other multiple ways) or from Python as explained here. Those samples reach Oracle engine, but please change the driver/connector if you are using MySQL, SQL Server, Postgres or any other RDMBS
spark.read under the covers does a select * from the source jdbc table. if you pass a query, spark translates it to
select your query
from ( their query select *)
Sql complains because you are trying to do an update on a view "select * from"

Why does 'get_json_object' return different results when run in spark and sql tool

I have developed a hive query that uses lateral views and get_json_object to unpack some json. The query works well enough using a jdbc client (dbvisualizer) against a hive database but when run as spark sql from a java application, on the same data, it returns nothing.
I have tracked down the problem to differences in what the function 'get_json_object' returns.
The issue can be illustrated by this type of query
select concat_ws( "|", get_json_object('{"product_offer":[
{"productName":"Plan A"},
{"productName":"Plan B"}]}',
'$.product_offer.productName') )
When run in dbvisualizer against a Hive database I get an array of the 2 product names in the json array: ["Plan A","Plan B"].
When the same query is run as spark sql from a java application, null is returned.
I have noticed another difference: the path '$.product_offer[0].productName' returns 'Plan A' in db visualizer and nothing in spark.
The path to extract the array of product names is
select concat_ws( "|", get_json_object('{"product_offer":[{"productName":"Plan A"},{"productName":"Plan B"}]}', '$.product_offer[*].productName'
which works both in spark dbvisualizer.

Rebuild index failed on Hive on Azure HDInsight with Tez

I try to create indexes on Hive on Azure HDInsight with Tez enabled.
I can successfully create indexes but I can't rebuild them : the job failed with this output :
Map 1: -/- Reducer 2: 0/1
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1421234198072_0091_1_01, diagnostics=[Vertex Input: measures initializer failed.]
Vertex killed, vertexName=Reducer 2, vertexId=vertex_1421234198072_0091_1_00, diagnostics=[Vertex > received Kill in INITED state.]
DAG failed due to vertex failure. failedVertices:1 killedVertices:1
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
I have created my table and indexes with the following job :
DROP TABLE IF EXISTS Measures;
CREATE TABLE Measures(
topology string,
val double,
date timestamp,
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE LOCATION 'wasb://<mycontainer>#<mystorage>.blob.core.windows.net/';
CREATE INDEX measures_index_topology ON TABLE Measures (topology) AS 'COMPACT' WITH DEFERRED REBUILD;
CREATE INDEX measures_index_date ON TABLE Measures (date) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX measures_index_topology ON Measures REBUILD;
ALTER INDEX measures_index_date ON Measures REBUILD;
Where am I wrong ? And why my rebuilding index fail ?
Best regards
It looks like Tez might have a problem with generating an index on an empty table. I was able to get the same error as you (without using the JSON SerDe), and if you look at the application logs for the DAG that fails, you might see something like:
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:254)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:299)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getSplits(TezGroupedSplitsInputFormat.java:68)
at org.apache.tez.mapreduce.hadoop.MRHelpers.generateOldSplits(MRHelpers.java:263)
at org.apache.tez.mapreduce.common.MRInputAMSplitGenerator.initialize(MRInputAMSplitGenerator.java:139)
at org.apache.tez.dag.app.dag.RootInputInitializerRunner$InputInitializerCallable$1.run(RootInputInitializerRunner.java:154)
at org.apache.tez.dag.app.dag.RootInputInitializerRunner$InputInitializerCallable$1.run(RootInputInitializerRunner.java:146)
...
If you populate the table with a single dummy record, it seems to work fine. I used:
INSERT INTO TABLE Measures SELECT market,0,0 FROM hivesampletable limit 1;
After that, the index rebuild was able to run without error.

Nested query not working in Cassandra

USE users_tracking;
SELECT user_name FROM visits
where port_name IN
(SELECT port_name FROM ports where location = 'NY' )//as temp;
It gives an error
mismatched input 'SELECT' expecting RULE_T_R_PAREN
Is there any way I can store the inner query in a variable and then use that?
I tried using set#varname := query but it does not recognize the set command.
Nested queries are not allowed in Cassandra CQL. For this kind of complex querying feature you'll need to use Hive or SparkSQL.
Here is a full CQL reference,
http://cassandra.apache.org/doc/cql3/CQL.html

Resources