How to use Airflow SparkSQLOperator? - apache-spark

I'd like to run a query using Spark SQL in Airflow; the SparkSQLOperator looks perfect for this (https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/spark_sql_operator.py).
However, I can't figure out how the connection must be configured.
In DB Visualizer, I can connect to the Hive database using:
driver : jdbc
database url : jdbc:hive2://myserver.com:10000/default
database userid : me
database password : mypassword
Applying these settings to the spark_sql_default connection gives me:
[2017-12-12 11:35:33,774] {models.py:1462} ERROR - Cannot execute
on hive2://myserver.com:10000/default. Error code is: 1. Output: ,
Stderr:
Any ideas?

Spark isn't a database, so you configure your Spark connection just like you would if you were submitting regular Spark jobs, rather than with the JDBC-style parameters you're used to.
If you look at the signature of the operator:
def __init__(self,
             sql,
             conf=None,
             conn_id='spark_sql_default',
             executor_cores=None,
             executor_memory=None,
             keytab=None,
             master='yarn',
             name='default-name',
             num_executors=None,
             yarn_queue='default',
             *args,
             **kwargs)
You need to create a connection (see models.Connection) that specifies the Spark master, which would be one of 'yarn', 'local[x]', or 'spark://hostname:port'. That should really be it, since the operator defaults the rest for you and the underlying Spark config likely handles it.
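For example, a minimal sketch of registering such a connection programmatically (this assumes, as with the stock spark_default connection, that the host field carries the master URL; creating it through the Airflow UI works just as well):

from airflow import settings
from airflow.models import Connection

# Hypothetical sketch: point spark_sql_default at the YARN master.
conn = Connection(
    conn_id='spark_sql_default',
    conn_type='spark_sql',
    host='yarn')  # or 'local[*]' / 'spark://hostname:port'

session = settings.Session()
session.add(conn)
session.commit()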
Your SQL or HQL code/script can be passed into the first parameter.
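In a DAG that looks roughly like this (a sketch; the query, schedule and task names are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.spark_sql_operator import SparkSqlOperator

dag = DAG('spark_sql_example',
          start_date=datetime(2017, 12, 1),
          schedule_interval=None)

# Placeholder query; any Spark SQL / HiveQL statement or script can go here.
run_query = SparkSqlOperator(
    task_id='run_spark_sql',
    sql='SELECT COUNT(*) FROM default.my_table',
    master='yarn',
    conn_id='spark_sql_default',
    dag=dag)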

Related

Spark SQL persistent view over jdbc data source

I want to create a persistent (global) view in spark sql that gets data from an underlying jdbc database connection. It works fine when I use a temporary (session-scoped) view as shown below but fails when trying to create a regular (persistent and global) view.
I don't understand why the latter should not work, and I couldn't find any docs or hints, as all examples are done with temporary views. Technically, I cannot see why it shouldn't work: the data is properly retrieved from the jdbc source in the temporary view, so it should not matter whether I "store" the query in a persistent view so that every call to the view retrieves data directly from the jdbc source.
Config.
tbl_in = 'myjdbctable'
tbl_out = 'myview'
db_user = 'myuser'
db_pw = 'mypw'
jdbc_url = 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
This works.
query = f"""
create or replace temporary view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> DataFrame[]
This does not work.
query = f"""
create or replace view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> ParseException:
Error.
ParseException:
mismatched input 'using' expecting {'(', 'UP_TO_DATE', 'AS', 'COMMENT', 'PARTITIONED', 'TBLPROPERTIES'}(line 3, pos 0)
== SQL ==
create or replace view myview
using jdbc
^^^
options(
dbtable 'myjdbctable',
user 'myuser',
password '[REDACTED]',
url 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
)
TL;DR: A spark sql table over jdbc source behaves like a view and so can be used like one.
It seems my assumptions about jdbc tables in spark sql were flawed. It turns out that a sql table with a jdbc source (i.e. created via USING jdbc) is actually a live query against the jdbc source (and not a one-off jdbc query during table creation, as I had assumed). In my mind it actually behaves like a view then. That means if the underlying jdbc source changes (e.g. new entries in a column), this is reflected in the spark sql table on read (e.g. select from) without having to re-create the table.
It follows that the spark sql table over a jdbc source satisfies my requirement of having an always up-to-date reflection of the underlying table/sql object in the jdbc source. Usually, I would use a view for that. Maybe this is the reason why there is no persistent view over a jdbc source but only temporary views (which of course still make sense as they are session-scoped). It should be noted that the spark sql jdbc table does not behave exactly like a view in every respect, which may be surprising (see the sketch after this list); in particular:
if you add a column to the underlying jdbc table, it will not show up in the spark sql table
if you remove a column from the underlying jdbc table, an error will occur when the spark sql table is accessed (assuming the removed column was present during spark sql table creation)
if you remove the underlying jdbc table, an error will occur when the spark sql table is accessed
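As a rough illustration of that behavior, a sketch (reusing the connection variables from the question; the table name mytable is a hypothetical placeholder):

# Persistent table backed by a live jdbc query.
spark.sql(f"""
create table if not exists mytable
using jdbc
options(
  dbtable '{tbl_in}',
  user '{db_user}',
  password '{db_pw}',
  url '{jdbc_url}'
)
""")

# Each read goes back to the jdbc source, so new rows show up immediately...
spark.sql("select count(*) from mytable").show()
# ...but the schema is fixed at creation time: columns added to the
# underlying jdbc table afterwards will not appear here.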
The input of spark.sql should be DML (Data Manipulation Language). Its output is a dataframe.
In terms of best practices, you should avoid using DDL (Data Definition Language) with spark.sql. Even if some statements may work, it is not meant to be used that way.
If you want to use DDL, simply connect to your DB using Python packages.
If you want to create a temp view in Spark, do it using the Spark API, e.g. createTempView (a sketch follows below).
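For example, a minimal sketch of that route, reusing the connection variables from the question:

# Read the jdbc table into a DataFrame, then expose it as a temp view.
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", tbl_in)
      .option("user", db_user)
      .option("password", db_pw)
      .load())

df.createOrReplaceTempView(tbl_out)
spark.sql(f"select * from {tbl_out}").show()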

Cosmos DB spatial query using Spark

I would like to query a cosmos db collection using a spatial query. Specifically the ST_DISTANCE query. This query works as intended using the azure-cosmos Python SDK.
I am looking to use this query via Apache Spark for a more complex query pattern. However, using the ST_DISTANCE query in a SQL cell in a notebook results in the following error.
Error in SQL statement: AnalysisException: Undefined function: 'ST_DISTANCE'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
The notebook is initialized as follows.
# Configure Catalog Api to be used
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", cosmosEndpoint)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", cosmosMasterKey)
from pyspark.sql.functions import col
df = spark.read.format("cosmos.oltp").options(**cfg)\
    .option("spark.cosmos.read.inferSchema.enabled", "true")\
    .load()
df.createOrReplaceTempView("outlets")
_______________________________________________________________________
%sql
SELECT * FROM outlets f WHERE ST_DISTANCE(f.boundary, POINT(0,0)) < 600
Based on what I understand from the Cosmos DB Spark connector GitHub repo [1], not all Cosmos DB filter queries are supported via the connector (yet?). So ST_DISTANCE and the other filter functions in the spatial family aren't going to work, as those aren't predicates that Spark natively supports for pushdown to the database.
Found something that helps sail past this issue, at least temporarily. The query config [2] allows sending a custom query directly to Cosmos DB. A temporary view can be built over it and queried. This will not work for all use cases, but it solved my issue, where I need a single view with the distance filtering done; the rest can be handled via Spark SQL.
Refer to spark.cosmos.read.customQuery [2] in the sample below.
outlets_cfg = {
    "spark.cosmos.accountEndpoint": cosmosEndpoint,
    "spark.cosmos.accountKey": cosmosMasterKey,
    "spark.cosmos.database": cosmosDatabaseName,
    "spark.cosmos.container": cosmosContainerName,
    "spark.cosmos.read.customQuery": "SELECT * FROM c WHERE ST_DISTANCE(c.location,{\"type\":\"Point\",\"coordinates\": [12.832489, 18.9553242]}) < 1000"
}
df = spark.read.format("cosmos.oltp").options(**outlets_cfg)\
    .option("spark.cosmos.read.inferSchema.enabled", "true")\
    .load()
df.createOrReplaceTempView("outlets")
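The view now contains only the documents returned by the custom query (i.e. those within 1000 m of the point above), so any further filtering or joins can stay in Spark SQL, for example:

# Sketch: the heavy spatial filter already ran server-side in Cosmos DB.
spark.sql("SELECT COUNT(*) FROM outlets").show()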
[1] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/
[2] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/docs/configuration-reference.md#query-config

How to execute Snowflake Stored Procedure from Python?

I have created a stored procedure in Snowflake which executes fine in the Snowflake UI and also from the server using snowsql. Now I want to execute the procedure from a Python program. I tried to execute it from Python; here are the steps that I followed:
establish the connection to Snowflake (successfully able to connect)
cs = ctx.cursor()
used the appropriate role, warehouse, database and schema
tried to execute the procedure like this:
cs.execute("call test_proc('value1', 'value2')")
x = cs.fetchall()
print(x)
But I am getting an error:
snowflake.connector.errors.ProgrammingError: 002140 (42601): SQL
compilation error: Unknown function test_proc
Can you please help me resolve this problem?
Thanks,
When connecting to Snowflake using the Python connector, you could define the DATABASE/SCHEMA:
import snowflake.connector

conn = snowflake.connector.connect(
    user=USER,
    password=PASSWORD,
    account=ACCOUNT,
    warehouse=WAREHOUSE,
    database=DATABASE,
    schema=SCHEMA
)
Once you have it set up, you can call your stored procedure without using a fully-qualified name:
cs.execute("call test_proc('value1', 'value2')")
An alternative way is:
Using the Database, Schema, and Warehouse
Specify the database and schema in which you want to create tables. Also specify the warehouse that will provide resources for executing DML statements and queries.
For example, to use the database testdb, schema testschema and warehouse tiny_warehouse (created earlier):
conn.cursor().execute("USE WAREHOUSE tiny_warehouse_mg")
conn.cursor().execute("USE DATABASE testdb_mg")
conn.cursor().execute("USE SCHEMA testdb_mg.testschema_mg")
Actually, I had to use a fully-qualified name, like this:
cs.execute("call yourdbname.schemaname.test_proc('value1', 'value2')")
and it is working as expected.
Thanks

AnalysisException when dropping a hive database using spark

Environment
spark 3.0.0
hive metastore (standalone) 3.0.0
mysql 8 as the metastore db
Problem
Every time I try to drop a database in the metastore via Spark, I get an AnalysisException, and I don't know what is causing it or whether the drop operation is succeeding in its entirety.
Example
spark.sql(f"CREATE DATABASE IF NOT EXISTS myDb LOCATION 'shared-metastore-location/myDb.db'")
################
# DB creation succeeds and I can view the db in the metastore, add tables etc
################
spark.sql(f"DROP DATABASE myDb CASCADE")
################
AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to clean up java.sql.SQLException: The table does not comply with the requirements by an external plugin.\n\tat com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129)\n\tat com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)\n\tat com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)\n\tat com.mysql.cj.jdbc.StatementImpl.executeUpdateInternal(StatementImpl.java:1335)\n\tat com.mysql.cj.jdbc.StatementImpl.executeLargeUpdate(StatementImpl.java:2108)\n\tat com.mysql.cj.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1245)\n\tat com.zaxxer.hikari.pool.ProxyStatement.executeUpdate(ProxyStatement.java:117)\n\tat com.zaxxer.hikari.pool.HikariProxyStatement.executeUpdate(HikariProxyStatement.java)\n\tat org.apache.hadoop.hive.metastore.txn.TxnHandler.cleanupRecords(TxnHandler.java:2741)\n\tat org.apache.hadoop.hive.metastore.AcidEventListener.onDropDatabase(AcidEventListener.java:52)\n\tat org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier$21.notify(MetaStoreListenerNotifier.java:85)\n\tat org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:264)\n\tat org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:326)\n\tat org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:364)\n\tat org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_database_core(HiveMetaStore.java:1537)\n\tat org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_database(HiveMetaStore.java:1575)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:147)\n\tat org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:108)\n\tat com.sun.proxy.$Proxy32.drop_database(Unknown Source)\n\tat org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_database.getResult(ThriftHiveMetastore.java:14352)\n\tat org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_database.getResult(ThriftHiveMetastore.java:14336)\n\tat org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)\n\tat org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:111)\n\tat org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:107)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)\n\tat org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:119)\n\tat org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n);'
Despite the exception, the database does disappear after I run this code. And if I try to run the drop command a second time, I get a different exception saying that the database doesn't exist. But I have no idea whether the operation has succeeded in its entirety or whether it's leaving a mess behind. I'm not familiar enough with Hive to know what should be deleted in the metastore to completely delete a db.
I seem to get the same result whether I have tables in the db or not. I've also tried dropping the table without cascade. Same result

Why does 'get_json_object' return different results when run in spark and sql tool

I have developed a hive query that uses lateral views and get_json_object to unpack some json. The query works well enough using a jdbc client (dbvisualizer) against a hive database, but when run as spark sql from a java application, on the same data, it returns nothing.
I have tracked down the problem to differences in what the function 'get_json_object' returns.
The issue can be illustrated by this type of query
select concat_ws( "|", get_json_object('{"product_offer":[
{"productName":"Plan A"},
{"productName":"Plan B"}]}',
'$.product_offer.productName') )
When run in dbvisualizer against a Hive database I get an array of the 2 product names in the json array: ["Plan A","Plan B"].
When the same query is run as spark sql from a java application, null is returned.
I have noticed another difference: the path '$.product_offer[0].productName' returns 'Plan A' in db visualizer and nothing in spark.
The path to extract the array of product names is '$.product_offer[*].productName':
select concat_ws( "|", get_json_object('{"product_offer":[{"productName":"Plan A"},{"productName":"Plan B"}]}', '$.product_offer[*].productName') )
which works in both Spark and dbvisualizer.
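A quick way to verify from PySpark (a sketch; the expected output is the array reported in the question):

json_doc = '{"product_offer":[{"productName":"Plan A"},{"productName":"Plan B"}]}'

# With the [*] wildcard, get_json_object returns the whole array as a JSON string.
spark.sql(
    f"select get_json_object('{json_doc}', '$.product_offer[*].productName') as names"
).show(truncate=False)
# Expected: ["Plan A","Plan B"]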
