ARRAY_AGG function does not work in Spark SQL - apache-spark

I am trying to use the ARRAY_AGG function in Spark SQL. When I use it, it throws this error:
Undefined function: 'array_agg'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
Dataset<Row> finalDS1 = sparkSession.sql("select array_agg(company_private_id) from TEMP_COMPANY_PRIVATE_VIEW");
Does anyone know how to solve it? I am trying to compare one array with another column, and for that I am using ARRAY_AGG.
"select cp.array_column & (select array_agg(int_column) from getCompanyPrivateDS ds1) as filtered_data from getCompanyPrivateDS cp"

I think this is a documentation error on Spark's part. They clearly list array_agg() in their SQL function reference: https://spark.apache.org/docs/latest/api/sql/index.html#array_agg
but I have also found that this function does not work on Spark 3.1.2.
collect_set() and collect_list() should work for your purposes: the former dedupes results, while the latter doesn't.
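For the comparison use case above, a minimal Scala sketch (assuming a SparkSession named spark and the temp view from the question) that swaps array_agg for collect_list:
import org.apache.spark.sql.DataFrame

// collect_list keeps duplicates; use collect_set instead if you want them removed.
val aggregated: DataFrame = spark.sql(
  "select collect_list(company_private_id) as company_private_ids " +
  "from TEMP_COMPANY_PRIVATE_VIEW")
aggregated.show(false)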

Related

Cosmos DB spatial query using Spark

I would like to query a Cosmos DB collection using a spatial query, specifically the ST_DISTANCE query. This query works as intended using the azure-cosmos Python SDK.
I am looking to use this query via Apache Spark for a more complex query pattern. However, using the ST_DISTANCE query in a SQL cell in a notebook results in the following error.
Error in SQL statement: AnalysisException: Undefined function: 'ST_DISTANCE'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
The notebook is initialized as follows.
# Configure Catalog Api to be used
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", cosmosEndpoint)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", cosmosMasterKey)
from pyspark.sql.functions import col
df = spark.read.format("cosmos.oltp").options(**cfg)\
  .option("spark.cosmos.read.inferSchema.enabled", "true")\
  .load()
df.createOrReplaceTempView("outlets")
%sql
SELECT * FROM outlets f WHERE ST_DISTANCE(f.boundary, POINT(0,0)) < 600
Based on what I understand from the Cosmos DB Spark connector GitHub repo[1], not all Cosmos DB filter queries are supported via the connector (yet?). So ST_DISTANCE and the other spatial filter functions aren't going to work, because they aren't predicates that Spark natively supports pushing down to the database.
I found something that helps get past this issue, at least temporarily. The query config[2] allows sending a custom query directly to Cosmos DB; a temporary view can then be built over the result and queried. This will not work for all use cases, but it solved my issue, where I need a single view with the distance filtering already done. The rest can be handled via Spark SQL.
See spark.cosmos.read.customQuery[2] in the sample below.
outlets_cfg = {
  "spark.cosmos.accountEndpoint" : cosmosEndpoint,
  "spark.cosmos.accountKey" : cosmosMasterKey,
  "spark.cosmos.database" : cosmosDatabaseName,
  "spark.cosmos.container" : cosmosContainerName,
  "spark.cosmos.read.customQuery" : "SELECT * FROM c WHERE ST_DISTANCE(c.location,{\"type\":\"Point\",\"coordinates\": [12.832489, 18.9553242]}) < 1000"
}
df = spark.read.format("cosmos.oltp").options(**outlets_cfg)\
  .option("spark.cosmos.read.inferSchema.enabled", "true")\
  .load()
df.createOrReplaceTempView("outlets")
[1] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/
[2] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/docs/configuration-reference.md#query-config

Jdbc update statement in spark

I am connected to a database using JDBC and I am trying to run an update query. First I write the query, then I execute it (the same way I run my SELECT queries, which work perfectly fine).
caseoutputUpdateQuery = "(UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES') alias_output "
spark.read.jdbc(url=jdbcUrl, table=caseoutputUpdateQuery, properties=connectionProperties)
When I run this I have the following error:
A nested INSERT, UPDATE, DELETE, or MERGE statement must have an OUTPUT clause.
I tried to fix this in different ways but there is always another error. For example, I tried to rewrite the query in the following way:
caseoutputUpdateQuery = "(UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES' OUTPUT DELETED.*, INSERTED.* FROM dbo.CASEOUTPUT_TEST) alias_output "
but I encounter this error:
A nested INSERT, UPDATE, DELETE, or MERGE statement is not allowed in a SELECT statement that is not the immediate source of rows for an INSERT statement.
The other way I tried to rewrite it was:
caseoutputUpdateQuery = "(INSERT INTO dbo.UpdateOutput(OldCaseID,NotifiedOld) SELECT * FROM( UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES' OUTPUT deleted.OldCaseID,DELETED.NotifiedOld ) AS tbl) alias_output "
but I've got this error:
A nested INSERT, UPDATE, DELETE, or MERGE statement is not allowed inside another nested INSERT, UPDATE, DELETE, or MERGE statement.
I've literally tried everything I found on the internet, but with no luck. Do you have any suggestions on how I can fix this and run my update statement?
I think Spark is not designed for that UPDATE statement use case. That is not a scenario where Spark helps when dealing with an RDBMS. I suggest using a direct JDBC connection from the code you are writing (that is, calling JDBC directly). If you are using Scala you can do it as suggested here (for example; there are multiple other ways), or from Python as explained here. Those samples target an Oracle engine, but just change the driver/connector if you are using MySQL, SQL Server, Postgres or any other RDBMS.
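As a rough illustration of that suggestion, a minimal Scala sketch that runs the UPDATE directly over JDBC instead of through spark.read (jdbcUrl, user and password are placeholders for your own connection details):
import java.sql.DriverManager

// Issue the UPDATE straight to the database, outside Spark.
val connection = DriverManager.getConnection(jdbcUrl, user, password)
try {
  val statement = connection.createStatement()
  val rowsUpdated = statement.executeUpdate("UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES'")
  println(s"Updated $rowsUpdated rows")
} finally {
  connection.close()
}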
spark.read under the covers does a SELECT * from the source JDBC table. If you pass a query, Spark translates it to roughly
select <columns>
from ( <your query> ) alias
SQL Server complains because you are trying to run an UPDATE inside what is effectively a "select * from" view.
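In other words, the table argument only accepts something that is valid inside Spark's own SELECT wrapper. A read-only subquery like this Scala sketch (same jdbcUrl and connectionProperties as in the question) works, while an UPDATE in that position cannot:
// Spark wraps this subquery in its own SELECT, which is legal SQL.
val notified = spark.read.jdbc(
  jdbcUrl,
  "(SELECT * FROM dbo.CASEOUTPUT_TEST WHERE NOTIFIED = 'YES') alias_output",
  connectionProperties)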

Why does 'get_json_object' return different results when run in spark and sql tool

I have developed a Hive query that uses lateral views and get_json_object to unpack some JSON. The query works well enough using a JDBC client (DbVisualizer) against a Hive database, but when run as Spark SQL from a Java application, on the same data, it returns nothing.
I have tracked the problem down to differences in what the function get_json_object returns.
The issue can be illustrated by this type of query:
select concat_ws( "|", get_json_object('{"product_offer":[
{"productName":"Plan A"},
{"productName":"Plan B"}]}',
'$.product_offer.productName') )
When run in DbVisualizer against a Hive database, I get an array of the two product names in the JSON array: ["Plan A","Plan B"].
When the same query is run as Spark SQL from a Java application, null is returned.
I have noticed another difference: the path '$.product_offer[0].productName' returns 'Plan A' in DbVisualizer and nothing in Spark.
The path to extract the array of product names is
select concat_ws( "|", get_json_object('{"product_offer":[{"productName":"Plan A"},{"productName":"Plan B"}]}', '$.product_offer[*].productName') )
which works in both Spark and DbVisualizer.
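For reference, a quick Scala sketch of that working path run through spark.sql (with the [*] wildcard, get_json_object returns the matched values as a JSON array string):
val names = spark.sql(
  """select get_json_object(
    |  '{"product_offer":[{"productName":"Plan A"},{"productName":"Plan B"}]}',
    |  '$.product_offer[*].productName') as product_names""".stripMargin)
names.show(false)  // expected: ["Plan A","Plan B"]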

Spark DataFrame Filter using Binary (Array[Bytes]) data

I have a DataFrame from a JDBC table hitting MySQL, and I need to filter it using a UUID. The data is stored in MySQL using binary(16), and when queried out in Spark it is converted to Array[Byte] as expected.
I'm new to Spark and have been trying various ways to pass a variable of type UUID into the DataFrame's filter method.
I've tried statements like:
val id: UUID = // other logic that looks this up
df.filter(s"id = $id")
df.filter("id = " convertToByteArray(id))
df.filter("id = " convertToHexString(id))
All of these error with different messages.
I just need to somehow pass in Binary types but can't seem to put my finger on how to do so properly.
Any help is greatly appreciated.
After reviewing even more sources online, I found a way to accomplish this without using the filter method.
When I'm reading from my sparkSession, I just use an ad hoc subquery instead of the table name, as follows:
sparkSession.read.jdbc(connectionString, s"(SELECT id, {other cols omitted} FROM MyTable WHERE id = 0x$id) AS MyTable", props)
This pre-filters the results for me, and then I just work with the DataFrame as I need.
If anyone knows of a solution using filter, I'd still love to know it as that would be useful in some cases.
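The 0x$id literal above assumes the UUID is rendered as 32 hex characters with no dashes. One possible (hypothetical) Scala implementation of the convertToHexString helper mentioned earlier:
import java.util.UUID

// Render a UUID as the 32-character hex string MySQL expects after the 0x prefix.
def convertToHexString(id: UUID): String =
  f"${id.getMostSignificantBits}%016x${id.getLeastSignificantBits}%016x"

val id = UUID.fromString("123e4567-e89b-12d3-a456-426614174000")
val hex = convertToHexString(id)  // "123e4567e89b12d3a456426614174000"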

Can't access delete method on Slick query

This is very, very frustrating. I have been trying to pick up Slick for a while, and obstacles just keep coming. The concept behind Slick is really awesome, but it is very difficult to learn, and unlike Scala it doesn't have "beginner", "intermediate", and "advanced" levels where people at all stages can use it easily.
I'm using Play-Slick (Slick 2.0.0) https://github.com/freekh/play-slick, following its Multi-DB cake example: https://github.com/freekh/play-slick/tree/master/samples/play-slick-cake-sample/app
For some reason, first, ddl does not belong to TableQuery, contrary to the claim in the documentation: "The TableQuery's ddl method creates DDL". This shows in the scaladoc: http://slick.typesafe.com/doc/2.0.0/api/#scala.slick.lifted.TableQuery There is no ddl method there.
Second, my slick.lifted.Query doesn't get a delete method. It works fine with list, but not with delete.
val S3Files = TableQuery[S3Files]
S3Files.where(_.url === url).delete
This wouldn't work...then I tried:
val query = (for(s <- S3Files if s.url === url) yield s)
query.list //this works
query.delete //ehh?? can't find the method
val query2 = (for(s <- S3Files if s.url === url))
query2.delete //still won't work
Well...since Slick uses a very complicated (at least to newbies) implicit type conversion system, I don't really know what went wrong.
I tried it by simply adding
Cats.ddl.create
Cats.filter(_.name===cat.name).delete
to play-slick-cake-sample/app/controllers/Application.scala. Works fine for me.
Looks like you are using the wrong imports. Look at https://github.com/freekh/play-slick/blob/master/samples/play-slick-sample/app/controllers/Application.scala and mimic the imports.
I was on play-slick 0.8.1 with Slick 2.1.0 and had the same issue.
The reason delete is not available on the query is that the play-slick driver's Query does not contain an equivalent of the delete method from Slick's Query.
I solved this problem by switching to the original Slick driver:
//import play.api.db.slick.Config.driver.simple._ //play-slick extensional Driver
import slick.driver.PostgresDriver.simple._ //original slick Driver
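Putting the answers together, a minimal Scala sketch under the plain Slick 2.x driver import (assuming a Database instance named db, plus the S3Files table class and url value from the question):
import slick.driver.PostgresDriver.simple._

val s3Files = TableQuery[S3Files]

// delete is added to the query by the driver's "simple" API and needs an implicit Session.
db.withSession { implicit session: Session =>
  s3Files.filter(_.url === url).delete
}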
