Why is SparkSession.sql() not for SELECT queries? - apache-spark

I'm curious why the SparkSession documentation says sql() is not for SELECT queries.
Is there any problem if I insist on doing so?
Thanks for your reply!
/**
 * Executes a SQL query using Spark, returning the result as a `DataFrame`.
 * This API eagerly runs DDL/DML commands, but not for SELECT queries.
 *
 * @since 2.0.0
 */
def sql(sqlText: String): DataFrame = withActive {
  val tracker = new QueryPlanningTracker
  val plan = tracker.measurePhase(QueryPlanningTracker.PARSING) {
    sessionState.sqlParser.parsePlan(sqlText)
  }
  Dataset.ofRows(self, plan, tracker)
}

I think the docs mean that sql() eagerly runs DDL/DML commands but does not eagerly run SELECT queries. That's the nature of Spark's lazy evaluation: a SELECT query is a transformation, so Spark only records it in a query plan and does not execute it until you call an action.
DDL/DML commands, however, behave like actions, so they are run eagerly.
So, to answer your question, it's totally fine to use spark.sql to run SELECT queries. It returns a DataFrame for the results of the query.
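The distinction can be sketched with a toy model in plain Python. This is not Spark's implementation: `ToySession`, `ToyFrame`, and the SELECT prefix check are hypothetical names invented for illustration. The session runs any non-SELECT statement before `sql()` returns, while a SELECT only records a plan that runs when an action such as `collect()` is called.

```python
class ToyFrame:
    """Holds a deferred statement, loosely like a DataFrame's query plan."""

    def __init__(self, statement, log):
        self.statement = statement
        self._log = log

    def collect(self):
        # An "action": only now is the statement recorded as executed.
        self._log.append(self.statement)
        return self.statement


class ToySession:
    """Hypothetical stand-in for SparkSession; not Spark's real API."""

    def __init__(self):
        self.executed = []  # statements that have actually run

    def sql(self, statement):
        frame = ToyFrame(statement, self.executed)
        is_query = statement.lstrip().upper().startswith("SELECT")
        if not is_query:
            # Commands (DDL/DML) run eagerly, before sql() returns.
            frame.collect()
        return frame


session = ToySession()
session.sql("CREATE TABLE t (id INT)")
after_ddl = list(session.executed)      # the command has already run

df = session.sql("SELECT * FROM t")
after_select = list(session.executed)   # the query has not run yet

df.collect()
after_action = list(session.executed)   # the query ran at the action
```

In real Spark the same pattern shows up as `spark.sql("SELECT ...")` returning immediately and the work happening at `show()`, `collect()`, or a write.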

Related

ARRAY_AGG function does not work in Spark SQL

I am trying to use the ARRAY_AGG function in Spark SQL. When I use it, it throws this error:
Undefined function: 'array_agg'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
Dataset<Row> finalDS1 = sparkSession.sql("select array_agg(company_private_id) from TEMP_COMPANY_PRIVATE_VIEW");
Does anyone know how to solve it? I am trying to compare one array with another column, which is why I am using ARRAY_AGG.
"select cp.array_column & (select array_agg(int_column) from getCompanyPrivateDS ds1) as filtered_data from getCompanyPrivateDS cp"
This looks like a version mismatch rather than a missing function. Spark clearly shows array_agg() in its function list: https://spark.apache.org/docs/latest/api/sql/index.html#array_agg
but that page documents the latest release, and array_agg() appears to have been added after 3.1.2. I have also found that the function doesn't work on Spark 3.1.2.
collect_set() and collect_list() should work for your purposes: the former deduplicates results, while the latter doesn't.

Cosmos DB spatial query using Spark

I would like to query a Cosmos DB collection using a spatial query, specifically ST_DISTANCE. This query works as intended using the azure-cosmos Python SDK.
I am looking to use this query via Apache Spark for a more complex query pattern. However, using the ST_DISTANCE query in a SQL cell in a notebook results in the following error.
Error in SQL statement: AnalysisException: Undefined function: 'ST_DISTANCE'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
The notebook is initialized as follows.
# Configure Catalog Api to be used
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", cosmosEndpoint)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", cosmosMasterKey)
from pyspark.sql.functions import col
df = spark.read.format("cosmos.oltp").options(**cfg)\
    .option("spark.cosmos.read.inferSchema.enabled", "true")\
    .load()
df.createOrReplaceTempView("outlets")
_______________________________________________________________________
%sql
SELECT * FROM outlets f WHERE ST_DISTANCE(f.boundary, POINT(0,0)) < 600
Based on what I understand from the Cosmos DB Spark connector GitHub repo[1], not all Cosmos DB filter queries are supported via the connector (yet?). ST_DISTANCE and the other functions in the spatial family won't work, because Spark does not natively support them as predicates that can be pushed down to the database.
I found a workaround that gets past this issue, at least temporarily. The query config[2] allows sending a custom query directly to Cosmos DB, and a temporary view can then be built over the result and queried. This will not work for all use cases, but it solved my issue, where I needed a single view with the distance filtering already done. The rest can be handled via Spark SQL.
See spark.cosmos.read.customQuery[2] in the sample below.
outlets_cfg = {
    "spark.cosmos.accountEndpoint" : cosmosEndpoint,
    "spark.cosmos.accountKey" : cosmosMasterKey,
    "spark.cosmos.database" : cosmosDatabaseName,
    "spark.cosmos.container" : cosmosContainerName,
    "spark.cosmos.read.customQuery" : "SELECT * FROM c WHERE ST_DISTANCE(c.location,{\"type\":\"Point\",\"coordinates\": [12.832489, 18.9553242]}) < 1000"
}
df = spark.read.format("cosmos.oltp").options(**outlets_cfg)\
    .option("spark.cosmos.read.inferSchema.enabled", "true")\
    .load()
df.createOrReplaceTempView("outlets")
[1] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/
[2] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/docs/configuration-reference.md#query-config
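The hand-escaped GeoJSON literal in the custom query above is easy to get wrong. As a minimal sketch, the query string could be assembled with `json.dumps` instead. The helper name `build_distance_query` and its parameters are hypothetical, the coordinate order is assumed to be longitude then latitude as in GeoJSON, and the field name is assumed to come from trusted code rather than user input.

```python
import json


def build_distance_query(field, longitude, latitude, max_distance_m):
    """Return a Cosmos DB SQL string that filters on ST_DISTANCE from a point."""
    point = {"type": "Point", "coordinates": [longitude, latitude]}
    # json.dumps emits the GeoJSON literal, so no manual quote escaping is needed.
    return (
        f"SELECT * FROM c WHERE ST_DISTANCE(c.{field}, {json.dumps(point)}) "
        f"< {max_distance_m}"
    )


custom_query = build_distance_query("location", 12.832489, 18.9553242, 1000)

# Only the query entry is shown; the account, database, and container
# settings from the sample above would be added alongside it.
outlets_cfg = {"spark.cosmos.read.customQuery": custom_query}
```

The resulting dictionary can be passed to `options(**outlets_cfg)` in the same way as in the sample.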

Pyspark trying to write to DB2 table - truncate overwrite

I am trying to write the data to IBM DB2 (10.5 fix pack 11) using Pyspark (2.4).
When I execute the code below
(df.write.format("jdbc")
    .mode('overwrite')
    .option("url", 'jdbc:db2://<host>:<port>/<DB>')
    .option("driver", 'com.ibm.db2.jcc.DB2Driver')
    .option('sslConnection', 'true')
    .option('sslCertLocation', '</location/***_ssl.crt>')
    .option("numPartitions", 1)
    .option("batchsize", 1000)
    .option('truncate', 'true')
    .option("dbtable", '<TABLE>')
    .option("user", '<user>')
    .option("password", '<PW>')
    .save())
the job throws the following exception:
File "/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o97.save.
: com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, SQLSTATE=42601, SQLERRMC=END-OF-STATEMENT;ABLE<SEHEMA.TABLE>;IMMEDIATE, DRIVER=4.19.80
at com.ibm.db2.jcc.am.b5.a(b5.java:747)
The job is trying to perform a truncate, but it seems DB2 is expecting the IMMEDIATE keyword.
In the code above I only pass the table name as dbtable. Is there a way to pass the IMMEDIATE keyword as well?
Also, on the DB2 side, is there a way to set this when opening the session?
Just FYI, my code works without truncate, but then Spark drops the table, recreates it, and reloads it, which I don't want to do in a prod environment.
Any thoughts on how to solve this issue are highly appreciated.
DB2Dialect in Spark 2.4 doesn't override the default JdbcDialect implementation of TRUNCATE TABLE. The comments in the code suggest overriding this method to return a statement that suits your database engine.
/**
 * The SQL query that should be used to truncate a table. Dialects can override this method to
 * return a query that is suitable for a particular database. For PostgreSQL, for instance,
 * a different query is used to prevent "TRUNCATE" affecting other tables.
 * @param table The table to truncate
 * @param cascade Whether or not to cascade the truncation
 * @return The SQL query to use for truncating a table
 */
@Since("2.4.0")
def getTruncateQuery(
    table: String,
    cascade: Option[Boolean] = isCascadingTruncateTable): String = {
  s"TRUNCATE TABLE $table"
}
Perhaps in the DB2 case you can extend DB2Dialect itself, add your own getTruncateQuery() implementation, and bind it to a "custom" JDBC protocol, "jdbc:mydb2" for example. You can then use this protocol in the JDBC connection URL: .option("url",'jdbc:mydb2://<host>:<port>/<DB>').
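The moving parts of that suggestion can be sketched in plain Python. This is only an analogue: a real implementation would be Scala or Java code that extends Spark's dialect class and is registered with Spark before the write. The class names, `can_handle`, `pick_dialect`, and the example host and port are hypothetical, and the IMMEDIATE suffix is taken from the error message in the question.

```python
class DefaultDialect:
    """Toy analogue of the default truncate behaviour quoted above."""

    def can_handle(self, url):
        return url.startswith("jdbc:")

    def get_truncate_query(self, table):
        return f"TRUNCATE TABLE {table}"


class MyDb2Dialect(DefaultDialect):
    """Hypothetical dialect bound to the custom 'jdbc:mydb2' prefix."""

    def can_handle(self, url):
        return url.startswith("jdbc:mydb2")

    def get_truncate_query(self, table):
        # DB2 rejected the plain statement and expected IMMEDIATE.
        return f"TRUNCATE TABLE {table} IMMEDIATE"


def pick_dialect(url, dialects):
    # The first matching dialect wins, so the most specific one goes first.
    return next(d for d in dialects if d.can_handle(url))


dialects = [MyDb2Dialect(), DefaultDialect()]
custom = pick_dialect("jdbc:mydb2://host:50000/DB", dialects)
plain = pick_dialect("jdbc:db2://host:50000/DB", dialects)

custom_sql = custom.get_truncate_query("SCHEMA.TABLE")
plain_sql = plain.get_truncate_query("SCHEMA.TABLE")
```

Only URLs with the custom prefix pick up the modified statement, so other JDBC writes in the same application keep the default behaviour.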

Jdbc update statement in spark

I am connected to a database using JDBC and I am trying to run an update query. I build the query string and then execute it, the same way I run SELECT queries, which work perfectly fine.
caseoutputUpdateQuery = "(UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES') alias_output "
spark.read.jdbc(url=jdbcUrl, table=caseoutputUpdateQuery, properties=connectionProperties)
When I run this I have the following error:
A nested INSERT, UPDATE, DELETE, or MERGE statement must have an OUTPUT clause.
I tried to fix this in different ways but there is always another error. For example, I tried to rewrite the query in the following way:
caseoutputUpdateQuery = "(UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES' OUTPUT DELETED.*, INSERTED.* FROM dbo.CASEOUTPUT_TEST) alias_output "
but I encounter this error:
A nested INSERT, UPDATE, DELETE, or MERGE statement is not allowed in a SELECT statement that is not the immediate source of rows for an INSERT statement.
The other way I tried to rewrite it was:
caseoutputUpdateQuery = "(INSERT INTO dbo.UpdateOutput(OldCaseID,NotifiedOld) SELECT * FROM( UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES' OUTPUT deleted.OldCaseID,DELETED.NotifiedOld ) AS tbl) alias_output "
but I've got this error:
A nested INSERT, UPDATE, DELETE, or MERGE statement is not allowed inside another nested INSERT, UPDATE, DELETE, or MERGE statement.
I've literally tried everything I found on the internet but without luck. Do you have any suggestion on how I can fix this and run my update statement?
I think Spark is not designed for the UPDATE statement use case; that is not the kind of RDBMS work Spark is meant to help with. I suggest using a direct JDBC connection from the code you are writing (I mean calling JDBC directly). If you are using Scala you can do it as suggested here (for example, but there are multiple other ways), or from Python as explained here. Those samples target the Oracle engine, so change the driver/connector if you are using MySQL, SQL Server, Postgres, or any other RDBMS.
Under the covers, spark.read.jdbc runs a SELECT * against whatever you pass as the source JDBC table. If you pass a query, Spark wraps it as a subquery, roughly:
SELECT * FROM (<your query>) alias_output
SQL Server complains because you are trying to run an UPDATE inside that wrapping SELECT, as if the UPDATE were a table or view.
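Both answers can be illustrated with a small, self-contained sketch that uses Python's built-in `sqlite3` as a stand-in for the real database. The table layout and the `wrap_as_table` helper are assumptions made for illustration, and the exact error text differs between engines. Wrapping the UPDATE in a SELECT fails, while sending it over a direct connection works.

```python
import sqlite3

# In-memory SQLite database standing in for the real RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE caseoutput_test (case_id INTEGER, notified TEXT)")
conn.executemany(
    "INSERT INTO caseoutput_test VALUES (?, ?)", [(1, "NO"), (2, "NO")]
)

update_sql = "UPDATE caseoutput_test SET notified = 'YES'"


def wrap_as_table(query, alias="alias_output"):
    # Approximates how a JDBC reader treats the "table" argument.
    return f"SELECT * FROM ({query}) {alias}"


try:
    conn.execute(wrap_as_table(update_sql))
    wrapped_ok = True
except sqlite3.OperationalError:
    # An UPDATE is not valid as a derived table inside a SELECT.
    wrapped_ok = False

# Direct connection: send the UPDATE as its own statement.
cursor = conn.execute(update_sql)
conn.commit()
updated_rows = cursor.rowcount

values = [row[0] for row in conn.execute("SELECT notified FROM caseoutput_test")]
```

With a production database the same idea applies: open a connection with the appropriate driver and execute the UPDATE there, keeping Spark's JDBC reader for SELECT-style reads.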

Query Spark SQL from Node.js server

I'm currently using npm's cassandra-driver to query my Cassandra database from a Node.js server. Since I want to be able to write more complex queries, I'd like to use Spark SQL instead of CQL. Is there any way to create a RESTful API (or something else) so that I can use Spark SQL the same way that I currently use CQL?
In other words, I want to be able to send a Spark SQL query from my Node.js server to another server and get a result back.
Is there any way to do this? I've been searching for solutions to this problem for a while and haven't found anything yet.
Edit: I'm able to query my database with Scala and Spark SQL from the Spark shell, so that bit is working. I just need to connect Spark and my Node.js server somehow.
I had a similar problem, and I solved it by using Spark-JobServer.
The usual approach with Spark-JobServer (SJS) is to create a special job that extends their SparkSQLJob, as in the following example:
object ExecuteQuery extends SparkSQLJob {
  override def validate(sqlContext: SQLContext, config: Config): SparkJobValidation = {
    // Code to validate the parameters received in the request body
  }
  override def runJob(sqlContext: SQLContext, jobConfig: Config): Any = {
    // Assuming your request sent a { "query": "..." } in the body:
    val df = sqlContext.sql(jobConfig.getString("query"))
    createResponseFromDataFrame(df) // You should implement this
  }
}
However, for this approach to work well with Cassandra, you have to use the spark-cassandra-connector and then, to load the data you will have two options:
1) Before calling this ExecuteQuery via REST, you have to transfer the full data you want to query from Cassandra to Spark. For that, you would do something like (code adapted from the spark-cassandra-connector documentation):
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map( "table" -> "words", "keyspace" -> "test"))
  .load()
And then register it as a table so that Spark SQL can access it, using one of the following:
df.registerTempTable("myTable") // As a temporary table
df.write.saveAsTable("myTable") // As a persistent Hive Table
Only after that will you be able to use ExecuteQuery to query myTable.
2) As the first option can be inefficient in some use cases, there is another option.
The spark-cassandra-connector has a special CassandraSQLContext that can be used to query C* tables directly from Spark. It can be used like:
val cc = new CassandraSQLContext(sc)
val df = cc.sql("SELECT * FROM keyspace.table ...")
However, to use a different type of context with Spark-JobServer, you need to extend SparkContextFactory and use it at context creation time (which can be done with a POST request to /contexts). An example of a special context factory can be seen on the SJS GitHub. You also have to create a SparkCassandraJob that extends SparkJob (but this part is very easy).
Finally, the ExecuteQuery job has to be adapted to use the new classes. It would be something like:
object ExecuteQuery extends SparkCassandraJob {
  override def validate(cc: CassandraSQLContext, config: Config): SparkJobValidation = {
    // Code to validate the parameters received in the request body
  }
  override def runJob(cc: CassandraSQLContext, jobConfig: Config): Any = {
    // Assuming your request sent a { "query": "..." } in the body:
    val df = cc.sql(jobConfig.getString("query"))
    createResponseFromDataFrame(df) // You should implement this
  }
}
After that, the ExecuteQuery job can be executed via REST with a POST request.
Conclusion
Here I use the first option because I need the advanced queries available in the HiveContext (window functions, for example), which are not available in the CassandraSQLContext. However, if you don't need those kinds of operations, I recommend the second approach, even though it needs some extra coding to create a new ContextFactory for SJS.

Resources