How to use Scala to read a file from Azure Blob Storage?

I'm trying to use the Scala code below to read a CSV file from Azure Blob Storage.
val containerName = "azsqlshackcontainer"
val storageAccountName = "cloudshell162958911"
val sas = "?sv=2021-06-08&ss=bfqt&srt=sco&sp=rwdlacupiyx&se=2022-11-16T17:11:59Z&st=2022-11-16T09:11:59Z&spr=https&sig=ZAy5PeZu5jbICr5B%2BFTLLn6C5TMBxrU5WmbLCRfzNu8%3D"
val config = "fs.azure.sas." + containerName + "." + storageAccountName + ".blob.core.windows.net"

dbutils.fs.mount(
  source = "wasbs://azsqlshackcontainer@cloudshell162958911.blob.core.windows.net/FoodSales.csv",
  mountPoint = "/mnt/myfile",
  extraConfigs = Map(config -> sas))
When I run this code, I get an error.
The container name, storage account name, and file name are correct.
I replicated the steps given here: https://www.sqlshack.com/accessing-azure-blob-storage-from-azure-databricks/
I'm not sure what I'm missing. I am able to do this using Python, but the Scala code is not working.

Try this alternative approach with an access key. I tried to reproduce it in my environment and got the results below:
dbutils.fs.mount(
  source = "wasbs://<container>@<storage_account_name>.blob.core.windows.net/",
  mountPoint = "/mnt/<mount_name>",
  extraConfigs = Map("fs.azure.account.key.<storage_account_name>.blob.core.windows.net" -> "<access_key>"))
Now you can check: I can read CSV data from the mount path.
%scala
val df1 = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("/mnt/dsff")

display(df1)
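If you prefer not to mount at all, the SAS token can also be set on the Spark session configuration and the file read directly over wasbs. A minimal sketch in Scala, assuming the container, storage account, and the sas value from the question, and that the token is still valid:

%scala
// Session-scoped SAS configuration for direct wasbs access (no mount).
// Assumes the sas val from the question; if authentication fails, try the token without the leading "?".
spark.conf.set(
  "fs.azure.sas.azsqlshackcontainer.cloudshell162958911.blob.core.windows.net",
  sas)

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("wasbs://azsqlshackcontainer@cloudshell162958911.blob.core.windows.net/FoodSales.csv")

display(df)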

Related

Access a file in Azure Data Lake sensitive storage from Databricks

I'm accessing files in the normal storage with the following method:
input_path = "my_path"
file = "file.mp3"
path = os.path.join(input_path, file)
full_path = '/dbfs/' + path

with open(full_path, mode='rb') as file:  # b is important -> binary
    fileContent = file.read()
I am not able to use the same method with the sensitive storage.
I am aware that the sensitive storage has another way to access data:
path_sensitive_storage = 'mypath_sensitive'
If I use Spark it works perfectly, but I am interested in opening the file directly rather than using spark.read:
input_df = (spark.read
            .format("binaryFile")
            .option("header", "true")
            .option("encoding", "UTF-8")
            .csv(full_path)
            )
Is there a way to do that?
Since you are using Azure Data Lake as the source, you need to mount the container in Databricks DBFS using the OAuth method. Once the container is mounted, you can use it.
Use the code below to mount the container.
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "ba219eb4-0250-4780-8bd3-d7f3420dab6d",
"fs.azure.account.oauth2.client.secret": "0wP8Q~qWUwGSFrjyByvwK-.HjrHx2EEvG06X9cmy",
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/token",
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(
source = "abfss://sample11#utrolicstorage11.dfs.core.windows.net/",
mount_point = "/mnt/sampledata11",
extra_configs = configs)
Once mounted, you can use the code below to list the files in the mounted location.
dbutils.fs.ls("/mnt/sampledata11/")
Finally, use the with open statement to read the file:
with open("/dbfs/mnt/sampledata11/movies.csv", mode='rb') as file: # b is important -> binary
fileContent = file.read()
print(fileContent)
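For a Scala equivalent of the same direct file access, the mounted file can be read through the driver-local /dbfs FUSE path as well. A minimal sketch, assuming the /mnt/sampledata11 mount point and movies.csv file from the answer above:

// Read the mounted file from Scala via the local /dbfs path.
// Mount point and file name are assumed from the answer above.
import java.nio.file.{Files, Paths}

val fileBytes: Array[Byte] = Files.readAllBytes(Paths.get("/dbfs/mnt/sampledata11/movies.csv"))
println(s"Read ${fileBytes.length} bytes")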

In an Azure Synapse Analytics (PySpark) notebook, using the Spark context Hadoop file system, I'm not able to move/copy or rename files

In an Azure Synapse Analytics (PySpark) notebook, using the Spark context Hadoop file system, I'm able to delete a folder or file,
but I'm not able to move/copy or rename files; I keep getting an error.
Below is the snippet I used:
from pyspark.sql import SparkSession

# Prepare the Spark session
spark = SparkSession.builder.appName('filesystemoperations').getOrCreate()

# Spark context
sc = spark.sparkContext

# File path declaration
containerName = "ZZZZZZZ"
fullRootPath = "abfss://{cName}@{cName}.dfs.core.windows.net".format(cName=containerName)
tablePath = "/ws/tables/"
eventName = 'abcd'
tableFilename = "Game_" + eventName + ".kql"
tableFile = fullRootPath + tablePath + tableFilename
tempTableFile = fullRootPath + tablePath + tempPath + tableFilename

# Empty the paths
sc._jsc.hadoopConfiguration().set('fs.defaultFS', fullRootPath)
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.delete(sc._jvm.org.apache.hadoop.fs.Path(fullRootPath + tablePath), True)
## This one worked fine.

# dfStringFinalF contains a string
spark.sparkContext.parallelize([dfStringFinalF]).coalesce(1).saveAsTextFile(tempTableFile)
fs.copy(tempTableFile + '/part-00000', tableFile)
# copy, rename, cp, mv: nothing works on fs
Please help
It seems Azure Synapse Analytics has limitations with the Spark context and shutil libraries.
The mssparkutils library helps with copying/moving files in the blob containers.
Here is a code sample:
from notebookutils import mssparkutils

fullRootPath = "abfss://{cName}@{cName}.dfs.core.windows.net".format(cName="zzzz")

fsmounting = mssparkutils.fs.mount(
    fullRootPath,
    "/ws",
    {"linkedService": "abcd"}
)

fsMove = mssparkutils.fs.mv(tempTableFile + '/part-00000', tableFile)
More reference is available at the link below:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-file-mount-api
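As a side note on why the original fs.copy call fails: the Hadoop FileSystem class exposes rename for moves and the FileUtil.copy helper for copies, but has no copy method of its own. If you prefer to stay on the Hadoop API instead of mssparkutils, a hedged Scala sketch (the path values below are placeholders standing in for the question's variables):

// Move/copy with the plain Hadoop FileSystem API from a Scala cell.
// All path values are illustrative placeholders.
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val fullRootPath  = "abfss://<container>@<account>.dfs.core.windows.net"
val tempTableFile = fullRootPath + "/ws/tables/<temp-dir>/Game_abcd.kql"
val tableFile     = fullRootPath + "/ws/tables/Game_abcd.kql"

val conf = spark.sparkContext.hadoopConfiguration
val fs   = FileSystem.get(new URI(fullRootPath), conf)

// Move (rename) the single part file written by saveAsTextFile to the target name.
val moved = fs.rename(new Path(tempTableFile + "/part-00000"), new Path(tableFile))

// Or copy it instead, keeping the source (deleteSource = false).
val copied = FileUtil.copy(fs, new Path(tempTableFile + "/part-00000"),
                           fs, new Path(tableFile), false, conf)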

Error Databricks Scala Application with Azure Table Storage

First of all, thank you for your time for the following question :)
I am trying to connect a Databricks Scala application with Azure Table Storage; however, I am getting the following error:
NoSuchMethodError: reactor.netty.http.client.HttpClient.resolver(Lio/netty/resolver/AddressResolverGroup;)Lreactor/netty/transport/ClientTransport;
    at com.azure.core.http.netty.NettyAsyncHttpClientBuilder.build(NettyAsyncHttpClientBuilder.java:94)
    at com.azure.core.http.netty.NettyAsyncHttpClientProvider.createInstance(NettyAsyncHttpClientProvider.java:18)
    at com.azure.core.implementation.http.HttpClientProviders.createInstance(HttpClientProviders.java:58)
    at com.azure.core.http.HttpClient.createDefault(HttpClient.java:50)
    at com.azure.core.http.HttpClient.createDefault(HttpClient.java:40)
    at com.azure.core.http.HttpPipelineBuilder.build(HttpPipelineBuilder.java:62)
    at com.azure.data.tables.BuilderHelper.buildPipeline(BuilderHelper.java:122)
    at com.azure.data.tables.TableServiceClientBuilder.buildAsyncClient(TableServiceClientBuilder.java:161)
    at com.azure.data.tables.TableServiceClientBuilder.buildClient(TableServiceClientBuilder.java:93)
I attach the code:
val clientCredential: ClientSecretCredential = new ClientSecretCredentialBuilder()
  .tenantId(tenantID)
  .clientId(client_san_Id)
  .clientSecret(client_san_Secret)
  .build()

val tableService = new TableServiceClientBuilder()
  .endpoint("https://<Resource-Table>.table.core.windows.net")
  .credential(clientCredential)
  .buildClient()
Thank you very much for your time!
First, you need to mount the storage on Azure Databricks.
Use the code below to mount it:
dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = Map("<conf-key>" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")))
Access the mounted storage using the code below:
// scala
val df = spark.read.text("/mnt/<mount-name>/...")
val df = spark.read.text("dbfs:/<mount-name>/...")
You can refer to this notebook.
Also refer to this article by Gauri Mahajan.

FileUtils write method does not work on Azure Databricks

I have trouble writing a file on my Databricks cluster's driver (as a temp file). I have a Scala notebook on my company's Azure Databricks which contains these lines of code:
val xml: String = Controller.requestTo(url)
val bytes: Array[Byte] = xml.getBytes
val path: String = "dbfs:/data.xml"
val file: File = new File(path)
FileUtils.writeByteArrayToFile(file, bytes)
dbutils.fs.ls("dbfs:/")
val df = spark.read.format("com.databricks.spark.xml")
  .option("rowTag", "generic:Obs")
  .load(path)
df.show
file.delete()
However, it crashes with org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: dbfs:/data.xml. When I run an ls on the root of the DBFS, it doesn't show the file data.xml, so for me FileUtils is not doing its job. What puts me even more in trouble is that the following code works when run on the same cluster, same Azure resource group, and same instance of Databricks, but in another notebook:
val path: String = "mf-data.grib"
val file: File = new File(path)
FileUtils.writeByteArrayToFile(file, bytes)
I tried restarting the cluster, removing "dbfs:/" from the path, putting the file in the dbfs:/tmp/ directory, and using FileUtils.writeStringToFile(file, xml, StandardCharsets.UTF_8) instead of FileUtils.writeByteArrayToFile, but none of those solutions worked, even when combining them.
If you're using local APIs, like File, you need to use the corresponding local file access: instead of using dbfs:/ you need to prefix the path with /dbfs/, so your code will look like the following:
val file: File = new File(path.replaceFirst("dbfs:", "/dbfs"))
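Putting it together, a minimal sketch of the corrected flow under the question's own assumptions (Controller.requestTo, url, and the spark-xml library come from the original snippet): write through the driver-local /dbfs/ path, then read back with the dbfs:/ URI:

import java.io.File
import org.apache.commons.io.FileUtils

val xml: String = Controller.requestTo(url)   // from the original snippet
val bytes: Array[Byte] = xml.getBytes

// Writing through the local /dbfs/ FUSE path puts the file on DBFS.
val localFile = new File("/dbfs/data.xml")
FileUtils.writeByteArrayToFile(localFile, bytes)

// Spark can then read the same file through the dbfs:/ URI.
val df = spark.read.format("com.databricks.spark.xml")
  .option("rowTag", "generic:Obs")
  .load("dbfs:/data.xml")

df.show()
localFile.delete()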
Try removing the dbfs prefix here: val path: String = "dbfs:/data.xml". For understanding purposes, I have given three different magic command cells: %sh, %fs, and %scala. You can refer here.

How to list the files in Azure Data Lake using Spark from PyCharm (local IDE) connected using databricks-connect

I am working on some code on my local machine in PyCharm.
The execution is done on a Databricks cluster, while the data is stored on Azure Data Lake.
Basically, I need to list the files in an Azure Data Lake directory and then apply some reading logic to the files. For this I am using the code below:
sc = spark.sparkContext
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path('adl://<Account>.azuredatalakestore.net/<path>')
for f in fs.get(conf).listStatus(path):
    print(f.getPath(), f.getLen())
The above code runs fine in Databricks notebooks, but when I try to run the same code through PyCharm using databricks-connect I get the following error:
"Wrong FS expected: file:///....."
On some digging, it turns out that the code is looking on my local drive to find the "path".
I had a similar issue with Python libraries (os, pathlib).
I have no issue running other code on the cluster.
I need help figuring out how to run this so that it searches the data lake and not my local machine.
Also, the azure-datalake-store client is not an option due to certain restrictions.
You may use this.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, FileSystem}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import java.net.URI

def listFiles(basep: String, globp: String): Seq[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  val fs = FileSystem.get(new URI(basep), conf)

  def validated(path: String): Path = {
    if (path startsWith "/") new Path(path)
    else new Path("/" + path)
  }

  val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
    paths = SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
    hadoopConf = conf,
    filter = null,
    sparkSession = spark)

  fileCatalog.flatMap(_._2.map(_.path))
}

val root = "/mnt/{path to your file directory}"
val globp = "[^_]*"

val files = listFiles(root, globp)
files.toDF("path").show()
