Databricks: Spark read from Azure SQL DW error

My notebook currently uses
spark.read \
    .format("com.databricks.spark.sqldw") \
    .option("url", myurl)
to read data from Azure SQL DW. But now there is suddenly this error:
Caused by: java.lang.IllegalArgumentException: Unexpected version returned: Microsoft Azure Synapse SQL Analytics - 10.0.10887.0 Dec 18 2019 21:47:50 Copyright (c) Microsoft Corporation Make sure your JDBC url includes a "database=" option and that it points to a valid Azure SQL Data Warehouse name. This connector cannot be used for interacting with any other systems than DW (e.g. Azure SQL Databases).
What's happening?
When I use
spark.read \
    .format("jdbc") \
    .option("url", myurl)
it works fine. Why is that?

The problem occurred because Microsoft renamed Azure SQL Data Warehouse to Azure Synapse Analytics, which changed the version string the connector validates. I faced the same issue today and raised it with Microsoft Support. They are working on a rollback at the moment. Hope to see a resolution soon.
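Until that fix lands, one workaround (as the question itself notes) is to fall back to the generic JDBC reader, which bypasses the com.databricks.spark.sqldw connector and its version check. A minimal sketch, assuming myurl is a standard SQL Server JDBC URL and that dbo.my_table, username and password are placeholders for your own values:

df = (spark.read
    .format("jdbc")
    # e.g. jdbc:sqlserver://<server>.database.windows.net:1433;database=<dw-name>
    .option("url", myurl)
    .option("dbtable", "dbo.my_table")
    .option("user", username)
    .option("password", password)
    .load())

Note that the generic reader does not stage data through PolyBase/COPY, so it is slower for large tables, but it is unaffected by the version-string check.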

Related

Issue loading big data using Apache Spark Connector for SQL Server to Azure SQL

I'm trying to load a PySpark dataframe into Azure SQL DB using the Apache Spark Connector for SQL Server and Azure SQL, in an Azure Databricks environment.
[Environment] - Azure Databricks
DBR: 9.1 LTS
Driver and worker nodes: DS3_V2
No. of workers: 2 to 8 [autoscaling]
[Dataset] - NYC Yellow Taxi Dataset
It works fine for a data size of around 30M, but for data sizes around 90M I get the issue below:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 5 in stage 20.0 failed 4 times, most recent failure: Lost task 5.3 in stage 20.0 (TID 381) (10.139.64.7 executor 5):
com.microsoft.sqlserver.jdbc.SQLServerException: Database '[database]' on server '[servername]' is not currently available. Please retry the connection later. If the problem persists, contact customer support, and provide them the session tracing ID of [some id]
The code that I use:
try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("truncate", True) \
        .option("url", url) \
        .option("dbtable", "dbo.nyc_yellow_trip_test_2017") \
        .option("user", username) \
        .option("password", password) \
        .save()
except ValueError as error:
    print("Connector write failed", error)
Sometimes that error is the result of intermittent failures in specific regions.
You can check Resource health in the left vertical panel in the Azure portal.
In the cloud environment you'll find that failed and dropped database connections happen periodically. That's partly because you're going through more load balancers compared to the on-premises environment where your web server and database server have a direct physical connection. Also, sometimes when you're dependent on a multi-tenant service you'll see calls to the service get slower or time out because someone else who uses the service is hitting it heavily. In other cases you might be the user who is hitting the service too frequently, and the service deliberately throttles you – denies connections – in order to prevent you from adversely affecting other tenants of the service.
Refer - https://learn.microsoft.com/en-us/answers/questions/212108/database-x-on-server-y-is-not-currently-available.html
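Because these failures are transient, a common mitigation is to retry the write with a short backoff instead of failing the job outright. A minimal sketch, reusing the df, url, username and password from the question; the attempt count and sleep interval are arbitrary choices, not anything the connector requires:

import time

max_attempts = 3
for attempt in range(1, max_attempts + 1):
    try:
        # Same write as in the question
        df.write \
            .format("com.microsoft.sqlserver.jdbc.spark") \
            .mode("overwrite") \
            .option("truncate", True) \
            .option("url", url) \
            .option("dbtable", "dbo.nyc_yellow_trip_test_2017") \
            .option("user", username) \
            .option("password", password) \
            .save()
        break  # success, stop retrying
    except Exception as error:
        if attempt == max_attempts:
            raise
        print(f"Write attempt {attempt} failed, retrying:", error)
        time.sleep(60 * attempt)  # simple linear backoff

Catching the broad Exception (rather than ValueError, as in the question) matters here, because the failure surfaces as a Spark/Py4J exception rather than a ValueError.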

Is it possible to read from Azure Log Analytics workspace using Apache Spark?

I would like to know if a connector exists to access Azure Log Analytics workspaces from Apache Spark. I know that azure-kusto-spark can access a Kusto cluster from Azure Data Explorer, but can the same connector be used to connect to Log Analytics workspaces? I was under the impression that Log Analytics was built on top of Data Explorer...
I tried to use azure-kusto-spark, but the only configuration it seems to support is cluster-based, nothing about workspace names like it would be for Log Analytics.
Here is an example, in case anyone is looking for one like I was:
df = spark.read.format("com.microsoft.kusto.spark.datasource") \
    .option("kustoCluster", "https://ade.loganalytics.io/subscriptions/xxxxxxx/resourceGroups/my_resource_group/providers/microsoft.operationalinsights/workspaces/my-log-anlytics-workspace-name") \
    .option("kustoDatabase", "my-log-anlytics-workspace-name") \
    .option("kustoQuery", "ADXQuery |take 10") \
    .option("kustoAadAppId", "service-principal-app-client-id") \
    .option("kustoAadAppSecret", dbutils.secrets.get(scope="key-vault-secrets", key="service-principal-secret")) \
    .option("kustoAadAuthorityID", "72f988bf-86f1-41af-91ab-2d7cd011db47") \
    .load()

Databricks - transfer data from one databricks workspace to another

How can I transform my data in databricks workspace 1 (DBW1) and then push it (send/save the table) to another databricks workspace (DBW2)?
On the DBW1 I installed this JDBC driver.
Then I tried:
(df.write
    .format("jdbc")
    .options(
        url="jdbc:spark://<DBW2-url>:443/default;transportMode=http;ssl=1;httpPath=<http-path-of-cluster>;AuthMech=3;UID=<uid>;PWD=<pat>",
        driver="com.simba.spark.jdbc.Driver",
        dbtable="default.fromDBW1"
    )
    .save()
)
However, when I run it I get:
java.sql.SQLException: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.parser.ParseException:
How to do this correctly?
Note: each DBW is in a different subscription.
From my point of view, the more scalable way would be to write directly into ADLS instead of using JDBC. This needs to be done as follows:
You need a separate storage account for your data. In any case, using the DBFS root to store the actual data isn't recommended, as it's not accessible from outside the workspace, which makes things like migration more complicated.
You need a way to access that storage account (ADLS or Blob storage). You can access the data directly (via abfss:// or wasbs:// URLs).
In the target workspace you then create a table over the data you have written, a so-called unmanaged (external) table. Just do (see the docs):
CREATE TABLE <name>
USING delta
LOCATION 'path_or_url_to_data'
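For completeness, the write side in DBW1 could look like the sketch below. This assumes a hypothetical storage account and container, and that the abfss:// access configuration (service principal or access key) has already been set on the cluster:

# In DBW1: write the transformed dataframe as Delta directly to ADLS.
# Storage account, container and path are hypothetical placeholders.
output_path = "abfss://shared@mystorageaccount.dfs.core.windows.net/fromDBW1"

(df.write
    .format("delta")
    .mode("overwrite")
    .save(output_path))

In DBW2, the CREATE TABLE statement above then points LOCATION at the same abfss:// path, and the data is queryable without any JDBC hop between the workspaces.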

Azure Databricks : Mount delta table used in another workspace

Currently I have an Azure Databricks instance where I have the following:
myDF.withColumn("created_on", current_timestamp()) \
    .writeStream \
    .format("delta") \
    .trigger(processingTime=triggerDuration) \
    .outputMode("append") \
    .option("checkpointLocation", "/mnt/datalake/_checkpoint_Position") \
    .option("path", "/mnt/datalake/DeltaData") \
    .partitionBy("col1", "col2", "col3", "col4", "col5") \
    .table("deltadata")
This is saving the data into a storage account as blobs.
Now I'm trying to connect to this table from another Azure Databricks workspace, and my first "move" is to mount the Azure storage account:
dbutils.fs.mount(
    source = sourceString,
    mountPoint = "/mnt/data",
    extraConfigs = Map(confKey -> sasKey)
)
Note: sourceString, confKey and sasKey are not shown for obvious reasons; in any case, the mount works fine.
And then I try to create the table, but I get an error:
CREATE TABLE delta_data USING DELTA LOCATION '/mnt/data/DeltaData/'
Error in SQL statement: AnalysisException:
You are trying to create an external table `default`.`delta_data`
from `/mnt/data/DeltaData` using Databricks Delta, but the schema is not specified when the
input path is empty.
According to the documentation, the schema should be picked up from the existing data, correct?
Also, I'm trying to do this from a different workspace because the idea is to give people read-only access.
It seems my issue was the mount. It did not give any error when I created it, but it was not working correctly. I discovered this after trying:
dbutils.fs.ls("/mnt/data/DeltaData")
which showed nothing. I unmounted it, reviewed all the configs, and after that it worked.
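For anyone hitting the same thing, the fix boils down to unmounting, remounting with the corrected configuration, and verifying the mount before creating the table. A minimal sketch in Python (the mount in the question is Scala); sourceString, confKey and sasKey are the same placeholders as above:

# Drop the broken mount, remount with the reviewed settings, then verify.
dbutils.fs.unmount("/mnt/data")

dbutils.fs.mount(
    source=sourceString,              # abfss:// or wasbs:// URL of the container
    mount_point="/mnt/data",
    extra_configs={confKey: sasKey}   # the SAS/OAuth config, double-checked for typos
)

# This should now list the Delta table's files (_delta_log, partition folders, ...).
display(dbutils.fs.ls("/mnt/data/DeltaData"))

Once the listing shows the existing Delta files, the CREATE TABLE delta_data USING DELTA LOCATION '/mnt/data/DeltaData/' statement from the question can infer the schema from them.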

Permission denied while inserting data from Azure Databricks to Synapse in production environment

We have a scenario in our project where we insert data from Databricks dataframes into Azure Synapse. While we could do this without issues in the Dev environment with admin access, we could not run it in the higher environments, where only INSERT permission on the schema is provided.
The error message I get…
Py4JJavaError: An error occurred while calling o2445.save. :
com.databricks.spark.sqldw.SqlDWSideException: SQL DW failed to execute the JDBC query produced by the connector. Underlying SQLException(s):
- com.microsoft.sqlserver.jdbc.SQLServerException: User does not have permission to perform this action. [ErrorCode = 15247] [SQLState = S0001]
Assuming you took this approach, you will need CONTROL permission on the database (db_owner) in Synapse, because it is currently required for Databricks to run CREATE DATABASE SCOPED CREDENTIAL.
Though this feedback item is related to Azure Data Factory, if it were completed, more granular permissions could be used. So please vote for it and see my comment.
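For context, the write path in question typically looks like the sketch below (all names and paths are placeholders). When forwardSparkAzureStorageCredentials is enabled, the connector creates a database scoped credential in Synapse for the staging location, which is why plain INSERT permission on the schema is not enough:

# Hypothetical placeholders for URL, target table and ADLS staging folder.
(df.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<dw>;user=<user>;password=<pwd>")
    .option("dbTable", "dbo.target_table")
    .option("tempDir", "abfss://staging@<storageaccount>.dfs.core.windows.net/tmp")
    # This setting makes the connector run CREATE DATABASE SCOPED CREDENTIAL
    # in Synapse, which needs CONTROL/db_owner rather than just INSERT.
    .option("forwardSparkAzureStorageCredentials", "true")
    .mode("append")
    .save())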
