Error connecting a Databricks Scala application to Azure Table Storage

First of all, thank you for your time. :)
I am trying to connect a Databricks Scala application to Azure Table Storage, but I am getting the following error:
Error:
NoSuchMethodError: reactor.netty.http.client.HttpClient.resolver(Lio/netty/resolver/AddressResolverGroup;)Lreactor/netty/transport/ClientTransport;
  at com.azure.core.http.netty.NettyAsyncHttpClientBuilder.build(NettyAsyncHttpClientBuilder.java:94)
  at com.azure.core.http.netty.NettyAsyncHttpClientProvider.createInstance(NettyAsyncHttpClientProvider.java:18)
  at com.azure.core.implementation.http.HttpClientProviders.createInstance(HttpClientProviders.java:58)
  at com.azure.core.http.HttpClient.createDefault(HttpClient.java:50)
  at com.azure.core.http.HttpClient.createDefault(HttpClient.java:40)
  at com.azure.core.http.HttpPipelineBuilder.build(HttpPipelineBuilder.java:62)
  at com.azure.data.tables.BuilderHelper.buildPipeline(BuilderHelper.java:122)
  at com.azure.data.tables.TableServiceClientBuilder.buildAsyncClient(TableServiceClientBuilder.java:161)
  at com.azure.data.tables.TableServiceClientBuilder.buildClient(TableServiceClientBuilder.java:93)
I attach the code:
val clientCredential: ClientSecretCredential = new ClientSecretCredentialBuilder()
  .tenantId(tenantID)
  .clientId(client_san_Id)
  .clientSecret(client_san_Secret)
  .build()
val tableService = new TableServiceClientBuilder()
  .endpoint("https://<Resource-Table>.table.core.windows.net")
  .credential(clientCredential)
  .buildClient()
Thank you very much for your time!

First you need to mount the storage on Azure Databricks. Then use the code below to create the mount:
dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = Map("<conf-key>" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")))
Access the mounted storage using the code below:
// Scala
val df = spark.read.text("/mnt/<mount-name>/...")
// or, equivalently, via the dbfs: scheme
val df = spark.read.text("dbfs:/mnt/<mount-name>/...")
You can refer to this notebook.
Also refer to this article by Gauri Mahajan.

Related

How to load data from Azure Databricks SQL to GCP Databricks SQL

Is there an easy way to load data from Azure Databricks Spark DB to GCP Databricks Spark DB?
Obtain the JDBC details from the Azure instance and use them in GCP to pull data just as from any other JDBC source.
// This is run in the GCP instance
val someTable = spark.read
  .format("jdbc")
  .option("url", "jdbc:databricks://adb-xxxx.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/xxxx;AuthMech=3;UID=token;PWD=xxxx")
  .option("dbtable", "some_table")
  .load()
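Once pulled, the data can be persisted on the GCP side like any other DataFrame; a minimal sketch, where the Delta target table name is illustrative rather than from the original answer:
// Sketch: persist the data pulled over JDBC as a Delta table in the GCP workspace
// (the target table name "copied_some_table" is a placeholder)
someTable.write
  .format("delta")
  .mode("overwrite")
  .saveAsTable("copied_some_table")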
Assuming the Azure data is stored in Blob/ADLSv2 storage, mount it in the GCP instance's DBFS and read the data directly.
// This is run in the GCP instance
// Assuming ADLSv2 on the Azure side
val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-id>",
  "fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope = "<scope-name>", key = "<service-credential-key-name>"),
  "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)
val someData = spark.read
  .format("delta")
  .load("/mnt/<mount-name>/<some_schema>/<some_table>")

How to use Scala to read a file from Azure Blob storage?

I'm trying to use the Scala code below to read a CSV file from Azure Blob storage.
val containerName = "azsqlshackcontainer"
val storageAccountName = "cloudshell162958911"
val sas = "?sv=2021-06-08&ss=bfqt&srt=sco&sp=rwdlacupiyx&se=2022-11-16T17:11:59Z&st=2022-11-16T09:11:59Z&spr=https&sig=ZAy5PeZu5jbICr5B%2BFTLLn6C5TMBxrU5WmbLCRfzNu8%3D"
val config = "fs.azure.sas." + containerName+ "." + storageAccountName + ".blob.core.windows.net"
dbutils.fs.mount(
source = "wasbs://azsqlshackcontainer#cloudshell162958911.blob.core.windows.net/FoodSales.csv",
mountPoint = "/mnt/myfile",
extraConfigs = Map(config -> sas))
When I run this code, I get an error.
The container name, storage account name, and file names are correct.
I replicated the steps given here: https://www.sqlshack.com/accessing-azure-blob-storage-from-azure-databricks/
I'm not sure what I'm missing. I am able to do it using Python, but the Scala code is not working.
Try this alternative approach with an access key. I tried to reproduce it in my environment and got the results below:
dbutils.fs.mount(
  source = "wasbs://<container>@<storage_account_name>.blob.core.windows.net/",
  mountPoint = "/mnt/<mount_name>",
  extraConfigs = Map("fs.azure.account.key.<storage_account_name>.blob.core.windows.net" -> "<access_key>"))
Now you can check: I can read the CSV data from the mount path.
%scala
val df1 = spark.read.format("csv").option("inferSchema", "true").option("Header","true").load("/mnt/dsff")
display(df1)
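One extra note: if the mount point is already in use from a previous attempt, dbutils.fs.mount fails, so a common pattern is to unmount first. A small sketch, reusing the <mount_name> placeholder above:
// Sketch: unmount if the mount point already exists, then mount again as shown above
if (dbutils.fs.mounts().exists(_.mountPoint == "/mnt/<mount_name>"))
  dbutils.fs.unmount("/mnt/<mount_name>")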

In an Azure Synapse Analytics (PySpark) notebook, using the Spark Context Hadoop FileSystem, I'm not able to move/copy or rename files

In an Azure Synapse Analytics (PySpark) notebook, using the Spark Context Hadoop FileSystem, I'm able to delete a folder or file,
but I'm not able to move/copy or rename files; I keep getting an error.
Below is the snippet I used:
from pyspark.sql import SparkSession
# prepare spark session
spark = SparkSession.builder.appName('filesystemoperations').getOrCreate()
# spark context
sc = spark.sparkContext
# File path declaration
containerName = "ZZZZZZZ"
fullRootPath = "abfss://{cName}@{cName}.dfs.core.windows.net".format(cName=containerName)
tablePath = "/ws/tables/"
tempPath = "/ws/temp/"  # placeholder; assumed value, not defined in the original snippet
eventName = 'abcd'
tableFilename = "Game_" + eventName + ".kql"
tableFile = fullRootPath + tablePath + tableFilename
tempTableFile = fullRootPath + tablePath + tempPath + tableFilename
# empty the paths
sc._jsc.hadoopConfiguration().set('fs.defaultFS', fullRootPath)
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.delete(sc._jvm.org.apache.hadoop.fs.Path(fullRootPath + tablePath), True)
## This one worked fine.
# dfStringFinalF contains a string
spark.sparkContext.parallelize([dfStringFinalF]).coalesce(1).saveAsTextFile(tempTableFile)
fs.copy(tempTableFile + '/part-00000', tableFile)
# Copy, rename, cp, mv: nothing works on fs
Please help
It seems Azure Synapse Analytics has limitations with the Spark Context and shutil approaches.
The mssparkutils library helps with copying/moving files in the blob containers.
Here is a code sample:
from notebookutils import mssparkutils
fullRootPath = "abfss://{cName}@{cName}.dfs.core.windows.net".format(cName="zzzz")
fsmounting = mssparkutils.fs.mount(
    fullRootPath,
    "/ws",
    {"linkedService": "abcd"}
)
fsMove = mssparkutils.fs.mv(tempTableFile + '/part-00000', tableFile)
More reference in the link below:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-file-mount-api

How to list the files in Azure Data Lake using Spark from PyCharm (local IDE) connected using databricks-connect

I am working on some code on my local machine in PyCharm.
The execution is done on a Databricks cluster, while the data is stored on Azure Data Lake.
Basically, I need to list the files in an Azure Data Lake directory and then apply some reading logic to them; for this I am using the code below:
sc = spark.sparkContext
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path('adl://<Account>.azuredatalakestore.net/<path>')
for f in fs.get(conf).listStatus(path):
    print(f.getPath(), f.getLen())
The above code runs fine in Databricks notebooks, but when I try to run the same code through PyCharm using databricks-connect I get the following error:
"Wrong FS expected: file:///....."
On some digging, it turns out that the code is looking on my local drive to find the path.
I had a similar issue with Python libraries (os, pathlib).
I have no issue running other code on the cluster.
I need help figuring out how to run this so that it searches the data lake and not my local machine.
Also, the azure-datalake-store client is not an option due to certain restrictions.
You may use this.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, FileSystem}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import java.net.URI
def listFiles(basep: String, globp: String): Seq[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  val fs = FileSystem.get(new URI(basep), conf)
  def validated(path: String): Path = {
    if (path startsWith "/") new Path(path)
    else new Path("/" + path)
  }
  val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
    paths = SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
    hadoopConf = conf,
    filter = null,
    sparkSession = spark)
  fileCatalog.flatMap(_._2.map(_.path))
}
val root = "/mnt/{path to your file directory}"
val globp = "[^_]*"
val files = listFiles(root, globp)
files.toDF("path").show()
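Since the original code pointed at an adl:// URI rather than a mount, the same helper can also be pointed at that base path; a small sketch reusing the placeholders from the question:
// Sketch: list files under the ADLS path from the question (placeholders as in the question)
val adlRoot = "adl://<Account>.azuredatalakestore.net/<path>"
val adlFiles = listFiles(adlRoot, "[^_]*")
adlFiles.take(20).foreach(println)
Because FileSystem.get(new URI(basep), conf) resolves the filesystem from the URI scheme, the listing runs against the data lake instead of the local drive.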

Spark dataframe returning only structure when connected to Phoenix query server

I am connecting to HBase (ver 1.2) via the Phoenix (4.11) query server from Spark 2.2.0, but the dataframe returns only the table structure with empty rows, though data is present in the table.
Here is the code I am using to connect to the query server:
// --jars phoenix-4.11.0-HBase-1.2-thin-client.jar
val prop = new java.util.Properties
prop.setProperty("driver", "org.apache.phoenix.queryserver.client.Driver")
val url = "jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF"
val d1 = spark.sqlContext.read.jdbc(url,"TABLE1",prop)
d1.show()
Can anyone please help me solve this issue? Thanks in advance.
If you are using Spark 2.2, the better approach would be to load directly via Phoenix as a dataframe. This way you provide only the ZooKeeper URL, and you can supply a predicate so that you load only the data required rather than the entire table.
import org.apache.phoenix.spark._
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
val configuration = new Configuration()
configuration.set("hbase.zookeeper.quorum", "localhost:2181")
val spark = SparkSession.builder().master("local").enableHiveSupport().getOrCreate()
val df = spark.sqlContext.phoenixTableAsDataFrame("TABLE1", Seq("COL1", "COL2"), predicate = Some("\"COL1\" = 1"), conf = configuration)
Read this for more info on getting a table as an RDD and on saving DataFrames and RDDs.
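For the saving direction mentioned above, a DataFrame can be written back to Phoenix through the same connector. A sketch, where the target table name OUTPUT_TABLE and the ZooKeeper quorum are assumptions rather than values from the question:
// Sketch: write a DataFrame back to a Phoenix table via the phoenix-spark connector
// ("OUTPUT_TABLE" and the zkUrl value are placeholders)
df.write
  .format("org.apache.phoenix.spark")
  .mode("overwrite")
  .option("table", "OUTPUT_TABLE")
  .option("zkUrl", "localhost:2181")
  .save()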
