Facing Issue in writing PySpark Dataframe to Azure Synapse

I have a PySpark DataFrame in Azure Databricks that I want to write to Azure Synapse, but I am getting the error below.
com.microsoft.sqlserver.jdbc.SQLServerException: The statement failed. Column 'ETL_TableName' has a data type that cannot participate in a columnstore index.
I checked the connection to Synapse; everything works fine and I am able to read data, but the write fails with this error. Could anyone please help me handle it?
Code for writing data into Synapse:
dataFrame.repartition(1).write.format("jdbc")\
.option("url", azureurl)\
.option("tempDir", tempDir) \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", dbTable)\
.option("append", "true")\
.save()

A couple of things need to be changed.
The format should change from .format("jdbc") to .format("com.databricks.spark.sqldw").
Add the "tableOptions" option to your write statement. It takes the place of the WITH() clause of the CREATE TABLE (AS) statement:
.option("tableOptions", "heap,distribution=MY_DISTRIBUTION")
The code should look like this:
dataFrame.repartition(1).write.format("com.databricks.spark.sqldw")\
.option("tableOptions", "heap,distribution=HASH(rownum)") \
.option("url", azureurl)\
.option("tempDir", tempDir) \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", dbTable)\
.option("append", "true")\
.save()
Reference:
Azure Databricks - Azure Synapse Analytics
Choose a value for MY_DISTRIBUTION based on the following guidance:
Guidance for designing distributed tables in Synapse SQL pool

This may not be the exact answer to the issue above, but it may help someone get past it.
I still have no clue about the reason behind the issue, but I noticed it occurs when writing into an Azure Synapse warehouse. As I had no strict reason for sticking with a Synapse warehouse, and the whole priority was to write data from Databricks to Azure in a structured format, I replaced the Azure Synapse warehouse with an Azure SQL Server database, and it is working much better. I will update this answer once I find the actual reason behind the issue.

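Following on from the answer above, here is a minimal, untested sketch of a plain JDBC write from Databricks to an Azure SQL database. The server, database, and credential values are placeholders (not from the question); dataFrame and dbTable are the variables used in the question's code.
# Untested sketch: plain JDBC write from Databricks to an Azure SQL database.
# Placeholders: <your-server>, <your-database>, <your-user>, <your-password>.
jdbc_url = (
    "jdbc:sqlserver://<your-server>.database.windows.net:1433;"
    "database=<your-database>;encrypt=true"
)

dataFrame.write \
    .format("jdbc") \
    .mode("append") \
    .option("url", jdbc_url) \
    .option("dbtable", dbTable) \
    .option("user", "<your-user>") \
    .option("password", "<your-password>") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .save()
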
Related

Spark integration with Vertica Failing

We are using Vertica Community Edition "vertica_community_edition-11.0.1-0" and Spark 3.2 with a local[*] master. When we try to save data to the Vertica database using the following:
member.write()
.format("com.vertica.spark.datasource.VerticaSource")
.mode(SaveMode.Overwrite)
.option("host", "192.168.1.25")
.option("port", "5433")
.option("user", "Fred")
.option("db", "store")
.option("password", "password")
//.option("dbschema", "store")
.option("table", "Test")
// .option("staging_fs_url", "hdfs://172.16.20.17:9820")
.save();
We are getting the following exception:
com.vertica.spark.util.error.ConnectorException: Fatal error: spark context did not exist
at com.vertica.spark.datasource.VerticaSource.extractCatalog(VerticaDatasourceV2.scala:76)
at org.apache.spark.sql.connector.catalog.CatalogV2Util$.getTableProviderCatalog(CatalogV2Util.scala:363)
Kindly let us know how to solve the exception.
We had a similar case. The root cause was that SparkSession.getActiveSession() returned None, because the Spark session had been registered on another thread of the JVM. We could still get to the single session we had using SparkSession.getDefaultSession() and manually register it with SparkSession.setActiveSession(...).
Our case happened in a Jupyter kernel where we were using pyspark.
The workaround code was:
sp = sc._jvm.SparkSession.getDefaultSession().get()
sc._jvm.SparkSession.setActiveSession(sp)
I can't try Scala or Java; I suppose it should look like this:
SparkSession.setActiveSession(SparkSession.getDefaultSession())
Vertica doesn't officially support Spark 3.2 with Vertica 11.0. Please see the documentation link below.
https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/SupportedPlatforms/SparkIntegration.htm
Please try using the Spark connector V2 with a supported version of Spark, and try running the examples from GitHub:
https://github.com/vertica/spark-connector/tree/main/examples
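If it helps, a PySpark equivalent of the write in the question might look like the untested sketch below; member_df is assumed to stand in for the Java member dataset, and the host, credentials, and table name are simply the ones from the question.
# Untested PySpark sketch mirroring the Java write above, using the Vertica Spark connector V2.
# member_df is an assumed DataFrame holding the same data as the Java example.
member_df.write \
    .format("com.vertica.spark.datasource.VerticaSource") \
    .mode("overwrite") \
    .option("host", "192.168.1.25") \
    .option("port", "5433") \
    .option("user", "Fred") \
    .option("password", "password") \
    .option("db", "store") \
    .option("table", "Test") \
    .save()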

Setting up Azure SQL External Metastore for Azure Databricks — Invalid column name ‘IS_REWRITE_ENABLED’

I'm attempting to set up an external Hive metastore for Azure Databricks. The metastore is in Azure SQL and the Hive version is 1.2.1 (included with Azure HDInsight 3.6).
I have followed the setup instructions on the “External Apache Hive metastore” page in Azure documentation.
I can see all of the databases and tables in the metastore but if I look at a specific table I get the following.
Caused by: javax.jdo.JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.OWNER,A0.RETENTION,A0.IS_REWRITE_ENABLED,A0.TBL_NAME,A0.TBL_TYPE,A0.TBL_ID FROM TBLS A0 LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID WHERE A0.TBL_NAME = ? AND B0."NAME" = ?
NestedThrowables:
com.microsoft.sqlserver.jdbc.SQLServerException: Invalid column name 'IS_REWRITE_ENABLED'.
I was expecting to see errors relating to the underlying storage but this appears to be a problem with the metastore.
Anybody have any idea what’s wrong?
The error message seems to suggest that the column IS_REWRITE_ENABLED doesn't exist on table TBLS with alias A0.
Looking at the hive-schema script for the Derby DB shows that the column in question does exist there:
metastore script definition
If you've got admin access to the Azure SQL db, you can alter the table and add the column:
ALTER TABLE TBLS
ADD IS_REWRITE_ENABLED char(1) NOT NULL DEFAULT 'N';
I don't believe this is the actual fix but it does work around the error.

Remote Database not found while Connecting to remote Hive from Spark using JDBC in Python?

I am using a pyspark script to read data from a remote Hive through the JDBC driver. I have tried another method using enableHiveSupport and hive-site.xml, but that technique is not possible for me due to some limitations (access to launch YARN jobs from outside the cluster is blocked). Below is the only way I can connect to Hive.
from pyspark.sql import SparkSession
spark=SparkSession.builder \
.appName("hive") \
.config("spark.sql.hive.metastorePartitionPruning", "true") \
.config("hadoop.security.authentication" , "kerberos") \
.getOrCreate()
jdbcdf = spark.read.format("jdbc") \
.option("url", "urlname") \
.option("driver", "com.cloudera.hive.jdbc41.HS2Driver") \
.option("user", "username") \
.option("dbtable", "dbname.tablename") \
.load()
spark.sql("show tables from dbname").show()
This gives me the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o31.sql.
: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'vqaa' not found;
Could someone please explain how I can access the remote db/tables using this method? Thanks.
Add .enableHiveSupport() to your SparkSession in order to access the Hive catalog.
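An untested sketch, based on the session builder from the question, of what that would look like:
from pyspark.sql import SparkSession

# Untested sketch: same builder as in the question, with enableHiveSupport()
# added so that spark.sql() resolves databases and tables from the Hive catalog.
spark = SparkSession.builder \
    .appName("hive") \
    .config("spark.sql.hive.metastorePartitionPruning", "true") \
    .config("hadoop.security.authentication", "kerberos") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("show tables from dbname").show()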

How can I connect Azure Databricks to Cosmos DB using MongoDB API?

I have created an Azure Cosmos DB account using the MongoDB API. I need to connect Cosmos DB (MongoDB API) to an Azure Databricks cluster in order to read and write data from Cosmos.
How do I connect the Azure Databricks cluster to the Cosmos DB account?
Here is the pyspark piece of code I use to connect to a CosmosDB database using the MongoDB API from Azure Databricks (runtime 5.2 ML Beta, which includes Apache Spark 2.4.0 and Scala 2.11, with the MongoDB connector org.mongodb.spark:mongo-spark-connector_2.11:2.4.0):
from pyspark.sql import SparkSession
my_spark = SparkSession \
.builder \
.appName("myApp") \
.getOrCreate()
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource") \
.option("uri", CONNECTION_STRING) \
.load()
With a CONNECTION_STRING that looks like this:
"mongodb://USERNAME:PASSWORD@testgp.documents.azure.com:10255/DATABASE_NAME.COLLECTION_NAME?ssl=true&replicaSet=globaldb"
I tried a lot of other options (adding the database and collection names as options or config of the SparkSession) without success.
Tell me if it works for you...
After adding the org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 package, this worked for me:
import json
query = {
'$limit': 100,
}
query_config = {
'uri': 'myConnectionString',
'database': 'myDatabase',
'collection': 'myCollection',
'pipeline': json.dumps(query),
}
df = spark.read.format("com.mongodb.spark.sql") \
.options(**query_config) \
.load()
I do, however, get this error with some collections:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 10.139.64.6, executor 0): com.mongodb.MongoInternalException: The reply message length 10168676 is less than the maximum message length 4194304
Answering the same way I did for my own question.
Using Maven as the source, I installed the right library on my cluster using the coordinate
org.mongodb.spark:mongo-spark-connector_2.11:2.4.0
(for Spark 2.4).
An example of the code I used is as follows (for those who want to try it):
# Read Configuration
readConfig = {
"URI": "<URI>",
"Database": "<database>",
"Collection": "<collection>",
"ReadingBatchSize" : "<batchSize>"
}
pipelineAccounts = "{'$sort' : {'account_contact': 1}}"
# Connect via azure-cosmosdb-spark to create Spark DataFrame
accountsTest = (spark.read.
format("com.mongodb.spark.sql").
options(**readConfig).
option("pipeline", pipelineAccounts).
load())
accountsTest.select("account_id").show()
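Since the question also asks about writing data back to Cosmos, here is an untested sketch of a write through the same MongoDB connector; the connection string, database, and collection names are placeholders.
# Untested sketch: write a DataFrame to Cosmos DB (MongoDB API) with the MongoDB Spark connector.
# The uri/database/collection values are placeholders.
write_config = {
    "uri": "<your Cosmos DB MongoDB API connection string>",
    "database": "myDatabase",
    "collection": "myCollection",
}

df.write \
    .format("com.mongodb.spark.sql") \
    .mode("append") \
    .options(**write_config) \
    .save()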

Connection from Spark to snowflake

I am writing this not to ask a question but to share the knowledge.
I was using Spark to connect to Snowflake, but I could not access Snowflake. It seemed like there was something wrong with the internal JDBC driver in Databricks.
Here is the error I got.
java.lang.NoClassDefFoundError:net/snowflake/client/jdbc/internal/snowflake/common/core/S3FileEncryptionMaterial
I tried many versions of the Snowflake JDBC driver and the Snowflake Spark connector. It seemed like I could not match the correct one.
Answer as given by the asker (I just extracted it from the question for better site usability):
Step 1: Create cluster with Spark version - 2.3.0. and Scala Version - 2.11
Step 2: Attached snowflake-jdbc-3.5.4.jar to the cluster.
https://mvnrepository.com/artifact/net.snowflake/snowflake-jdbc/3.5.4
Step 3: Attached spark-snowflake_2.11-2.3.2 driver to the cluster.
https://mvnrepository.com/artifact/net.snowflake/spark-snowflake_2.11/2.3.2
Here is the sample code.
val SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
val sfOptions = Map(
"sfURL" -> "<snowflake_url>",
"sfAccount" -> "<your account name>",
"sfUser" -> "<your account user>",
"sfPassword" -> "<your account pwd>",
"sfDatabase" -> "<your database name>",
"sfSchema" -> "<your schema name>",
"sfWarehouse" -> "<your warehouse name>",
"sfRole" -> "<your account role>",
"region_id"-> "<your region name, if you are out of us region>"
)
val df: DataFrame = sqlContext.read
.format(SNOWFLAKE_SOURCE_NAME)
.options(sfOptions)
.option("dbtable", "<your table>")
.load()
If you are using Databricks, there is a Databricks Snowflake connector created jointly by Databricks and Snowflake people. You just have to provide a few items to create a Spark dataframe (see below -- copied from the Databricks document).
# snowflake connection options
options = dict(sfUrl="<URL for your Snowflake account>",
sfUser=user,
sfPassword=password,
sfDatabase="<The database to use for the session after connecting>",
sfSchema="<The schema to use for the session after connecting>",
sfWarehouse="<The default virtual warehouse to use for the session after connecting>")
df = spark.read \
.format("snowflake") \
.options(**options) \
.option("dbtable", "<The name of the table to be read>") \
.load()
display(df)
As long as you are accessing your own databases with all the access rights granted correctly, this only takes a few minutes, even on a first attempt.
Good luck!
You need to set the CLASSPATH variable to point to the jars, as below. Besides PYTHONPATH, you also need to set up SPARK_HOME and SCALA_HOME.
export CLASSPATH=/snowflake-jdbc-3.8.0.jar:/spark-snowflake_2.11-2.4.14-spark_2.4.jar
You can also load the jars in memory in your code to resolve this issue:
spark = SparkSession \
.builder \
.config("spark.jars", "file:///app/snowflake-jdbc-3.9.1.jar,file:///app/spark-snowflake_2.11-2.5.3-spark_2.2.jar") \
.config("spark.repl.local.jars",
"file:///app/snowflake-jdbc-3.9.1.jar,file:///app/spark-snowflake_2.11-2.5.3-spark_2.2.jar") \
.config("spark.sql.catalogImplementation", "in-memory") \
.getOrCreate()
Please update to the latest version of the Snowflake JDBC driver (3.2.5); that should resolve this issue. Thanks!
