Cosmos DB spatial query using Spark

Cosmos DB spatial query using Spark - apache-spark

I would like to query a cosmos db collection using a spatial query. Specifically the ST_DISTANCE query. This query works as intended using the azure-cosmos Python SDK.
I am looking to use this query via Apache Spark for a more complex query pattern. However, using the ST_DISTANCE query in a SQL cell in a notebook results in the following error.
Error in SQL statement: AnalysisException: Undefined function: 'ST_DISTANCE'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
The notebook is initialized as follows.
# Configure Catalog Api to be used
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", cosmosEndpoint)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", cosmosMasterKey)
from pyspark.sql.functions import col
df = spark.read.format("cosmos.oltp").options(**cfg)\
.option("spark.cosmos.read.inferSchema.enabled", "true")\
.load()
df.createOrReplaceTempView("outlets")
_______________________________________________________________________
%sql
SELECT * FROM outlets f WHERE ST_DISTANCE(f.boundary, POINT(0,0)) < 600

Based on what I understand from the Cosmos DB Spark connector github repo[1], not all Cosmos DB filter queries are supported via the connector (yet?). So the ST_DISTANCE and other filter functions in the spatial family aren't going to work as those aren't predicates that are natively supported by Spark to be pushed down to the database.
Found something that will help sail past this issue at least temporarily. The query config[2] allows sending a custom query directly to Cosmos DB. A temporary view can be built and queried over. This will not work for all use cases, but this solved my issue where I need a single view with distance filtering done. Rest can be handled via Spark SQL.
Refer spark.cosmos.read.customQuery[2] in below sample.
outlets_cfg = {
"spark.cosmos.accountEndpoint" : cosmosEndpoint,
"spark.cosmos.accountKey" : cosmosMasterKey,
"spark.cosmos.database" : cosmosDatabaseName,
"spark.cosmos.container" : cosmosContainerName,
"spark.cosmos.read.customQuery" : "SELECT * FROM c WHERE ST_DISTANCE(c.location,{\"type\":\"Point\",\"coordinates\": [12.832489, 18.9553242]}) < 1000"
}
df = spark.read.format("cosmos.oltp").options(**outlets_cfg)\
.option("spark.cosmos.read.inferSchema.enabled", "true")\
.load()
df.createOrReplaceTempView("outlets")
[1] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/
[2] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/docs/configuration-reference.md#query-config

Related

Using Spark UDF externally in Tableau

I am trying to write a Spark UDF and use in Tableau. Tableau is using a Databricks connection.
%python
import time
#udf
def delay_seconds(x):
time.sleep(x)
return x
spark.udf.register("delayUDF", delay_seconds)
I want to run this query in Tableau using Custom SQL
select delayUDF(10)
The function is undefined and does not exist on the default database. How do I scope the UDF so it is registered against the database, or modify my select query to correctly locate it?

Spark SQL persistent view over jdbc data source

I want to create a persistent (global) view in spark sql that gets data from an underlying jdbc database connection. It works fine when I use a temporary (session-scoped) view as shown below but fails when trying to create a regular (persistent and global) view.
I don't understand why the latter should not work but couldn't find any docs/hints as all examples are always done with temporary views. Technically, I cannot see why it shouldn't work as the data is properly retrieved from jdbc source in the temporary view and thus it should not matter if I wanted to "store" the query in a persistent view so that whenever calling the view it would retrieve data directly from jdbc source.
Config.
tbl_in = myjdbctable
tbl_out = myview
db_user = 'myuser'
db_pw = 'mypw'
jdbc_url = 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
This works.
query = f"""
create or replace temporary view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> DataFrame[]
This does not work.
query = f"""
create or replace view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> ParseException:
Error.
ParseException:
mismatched input 'using' expecting {'(', 'UP_TO_DATE', 'AS', 'COMMENT', 'PARTITIONED', 'TBLPROPERTIES'}(line 3, pos 0)
== SQL ==
create or replace view myview
using jdbc
^^^
options(
dbtable 'myjdbctable',
user 'myuser',
password '[REDACTED]',
url 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
)

TL;DR: A spark sql table over jdbc source behaves like a view and so can be used like one.
It seems my assumptions about jdbc tables in spark sql were flawed. It turns out that a sql table with a jdbc source (i.e. created via using jdbc) is actually a live query against the jdbc source (and not a one-off jdbc query during table creation as I assumed). In my mind it actually behaves like a view then. That means if the underlying jdbc source changes (e.g. new entries in a column) this is reflected in the spark sql table on read (e.g. select from) without having to re-create the table.
It follows that the spark sql table over jdbc source satisfies my requirements of having an always up2date reflection of the underlying table/sql object in the jdbc source. Usually, I would use a view for that. Maybe this is the reason why there is no persistent view over a jdbc source but only temporary views (which of course still make sense as they are session-scoped). It should be noted that the spark sql jdbc table behaves like a view which may be surprising, in particular:
if you add a column in underlying jdbc table, it will not show up in spark sql table
if you remove a column from underlying jdbc table, an error will occur when spark sql table is accessed (assuming the removed column was present during spark sql table creation)
if you remove the underlying jdbc table, an error will occur when spark sql table is accessed

The input of spark.sql should be DML (Data Manipulation Language). Its output is a dataframe.
In terms of best practices, you should avoid using DDL (Data Definition Language) with spark.sql. Even if some statements may work, that's not meant to be used this way.
If you want to use DDL, simply connect to your DB using python packages.
If you want to create a temp view in spark, do it using spark syntaxe createTempView

Unable to get scalar value of a query on cosmos db in azure data factory

I am trying to get the count of all records present in cosmos db in a lookup activity of azure data factory. I need this value to do a comparison with other value activity outputs.
The query I used is SELECT VALUE count(1) from c
When I try to preview the data after inserting this query I get an error saying
One or more errors occurred. Unable to cast object of type
'Newtonsoft.Json.Linq.JValue' to type 'Newtonsoft.Json.Linq.JObject'
as shown in the below image:
snapshot of my azure lookup activity settings
Could someone help me in resolving this error and if this is the limitation of azure data factory how can I get the count of all the rows of the cosmos db document using some other ways inside azure data factory?

I reproduce your issue on my side exactly.
I think the count result can't be mapped as normal JsonObject. As workaround,i think you could just use Azure Function Activity(Inside Azure Function method ,you could use SDK to execute any sql as you want) to output your desired result: {"number":10}.Then bind the Azure Function Activity with other activities in ADF.
Here is contradiction right now:
The query sql outputs a scalar array,not other things like jsonObject,or even jsonstring.
However, ADF Look Up Activity only accepts JObject,not JValue. I can't use any convert built-in function here because the query sql need to be produced with correct syntax anyway. I already submitted a ticket to MS support team,but get no luck with this limitation.
I also tried select count(1) as num from c which works in the cosmos db portal. But it still has limitation because the sql crosses partitions.
So,all i can do here is trying to explain the root cause of issue,but can't change the product behaviours.
2 rough ideas:
1.Try no-partitioned collection to execute above sql to produce json output.
2.If the count is not large,try to query columns from db and loop the result with ForEach Activity.

You can use:
select top 1 column from c order by column desc

Why does 'get_json_object' return different results when run in spark and sql tool

I have developed a hive query that uses lateral views and get_json_object to unpack some json. The query works well enough using a jdbc client (dbvisualizer) against a hive database but when run as spark sql from a java application, on the same data, it returns nothing.
I have tracked down the problem to differences in what the function 'get_json_object' returns.
The issue can be illustrated by this type of query
select concat_ws( "|", get_json_object('{"product_offer":[
{"productName":"Plan A"},
{"productName":"Plan B"}]}',
'$.product_offer.productName') )
When run in dbvisualizer against a Hive database I get an array of the 2 product names in the json array: ["Plan A","Plan B"].
When the same query is run as spark sql from a java application, null is returned.
I have noticed another difference: the path '$.product_offer[0].productName' returns 'Plan A' in db visualizer and nothing in spark.

The path to extract the array of product names is
select concat_ws( "|", get_json_object('{"product_offer":[{"productName":"Plan A"},{"productName":"Plan B"}]}', '$.product_offer[*].productName'
which works both in spark dbvisualizer.

how to execute query on simple vertical partitioning in sqlalchemy using python

i am trying to work with multiple database and schemas using simple vertical partitioning in sqlalchemy and python .
Have create two database engines and configured successfully to sessionmaker()
Session = sessionmaker()
Session.configure(binds={BaseA:engine1, BaseB:engine2})
Able to get the required sql query generated successfully
driverssql = session.query(drivers)
but when i execute the above query to fetch the requslts i get the follwing error :
resultset=session.execute(driversql)
sqlalchemy.exc.UnboundExecutionError: Could not locate a bind configured on SQL expression or this Session (how can i associate the correct engine with execute statement)

I'm seeing here two variants:
You can create 2 sessionmakers here and use them separately according to an engine.
You can choose necessary engine when executing a query:
engine1 = create_engine(first_db)
engine2 = create_engine(second_db)
session.execute(drivers, bind=engine1)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Cosmos DB spatial query using Spark - apache-spark

Related

Using Spark UDF externally in Tableau

Spark SQL persistent view over jdbc data source

Unable to get scalar value of a query on cosmos db in azure data factory

Why does 'get_json_object' return different results when run in spark and sql tool

how to execute query on simple vertical partitioning in sqlalchemy using python

Categories

Resources