Access a file from Azure Data Lake sensitive storage with Databricks - azure

I'm accessing files in the normal storage with the following method:
import os

input_path = "my_path"
file_name = "file.mp3"
path = os.path.join(input_path, file_name)
full_path = '/dbfs/' + path
with open(full_path, mode='rb') as f:  # b is important -> binary
    fileContent = f.read()
I am not able to use the same method with the sensitive storage.
I am aware that the sensitive storage has a different way to access the data:
path_sensitive_storage = 'mypath_sensitive'
If I use Spark it works perfectly, but I am interested in not using spark.read and instead opening the file directly:
input_df = (spark.read
    .format("binaryFile")
    .load(full_path)
)
Is there a way to do that?

Since you are using Azure Data Lake Storage as the source, you need to mount the container in Databricks DBFS using the OAuth method. Once the container is mounted, you can access it with local file APIs.
Use the code below to mount the container.
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "ba219eb4-0250-4780-8bd3-d7f3420dab6d",
"fs.azure.account.oauth2.client.secret": "0wP8Q~qWUwGSFrjyByvwK-.HjrHx2EEvG06X9cmy",
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/token",
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(
source = "abfss://sample11#utrolicstorage11.dfs.core.windows.net/",
mount_point = "/mnt/sampledata11",
extra_configs = configs)
Once mounted, you can use the code below to list the files in the mounted location.
dbutils.fs.ls("/mnt/sampledata11/")
Finally, use a with open statement to read the file:
with open("/dbfs/mnt/sampledata11/movies.csv", mode='rb') as file: # b is important -> binary
fileContent = file.read()
print(fileContent)
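As a side note, dbutils.fs.mount raises an error if the mount point is already in use, so it can help to check the existing mounts first. A minimal sketch, reusing the configs dictionary and mount source from above:
mount_point = "/mnt/sampledata11"

# unmount first if this mount point already exists
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)

dbutils.fs.mount(
    source = "abfss://sample11@utrolicstorage11.dfs.core.windows.net/",
    mount_point = mount_point,
    extra_configs = configs)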

Related

Databricks/Spark read custom metadata from Parquet file

I created a Parquet file with custom metadata at the file level.
Now I'm trying to read that metadata from the Parquet file in (Azure) Databricks, but when I run the following code I don't get any of the metadata that I know is present there.
storageaccount = 'zzzzzz'
containername = 'yyyyy'
access_key = 'xxxx'
spark.conf.set(f'fs.azure.account.key.{storageaccount}.blob.core.windows.net', access_key)
path = f"wasbs://{containername}#{storageaccount}.blob.core.windows.net/generated_example_10m.parquet"
data = spark.read.format('parquet').load(path)
data.printSchema()
I tried to reproduce the same thing in my environment and got this output.
Use the code below with .select("*", "_metadata"):
path = "wasbs://<container>#<storage_account_name>.blob.core.windows.net/<file_path>.parquet"
data = spark.read.format('parquet').load(path).select("*", "_metadata")
display(data)
Or specify your schema and load the path with .select("*", "_metadata"):
df = spark.read \
.format("parquet") \
.schema(schema) \
.load(path) \
.select("*", "_metadata")
display(df)
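If what you are after is the custom key/value metadata written into the Parquet footer itself (rather than the file path/size information exposed by the _metadata column), a minimal PyArrow sketch can read it directly. This assumes pyarrow is available on the cluster and that the file is reachable through a /dbfs/ or mounted local path; the path below is hypothetical:
import pyarrow.parquet as pq

# hypothetical local path; point it at wherever the Parquet file is mounted under /dbfs/
local_path = "/dbfs/mnt/<mount_name>/generated_example_10m.parquet"

# key/value pairs stored in the Parquet footer, including custom file-level metadata
footer_metadata = pq.ParquetFile(local_path).metadata.metadata
print({k.decode(): v.decode(errors="replace") for k, v in (footer_metadata or {}).items()})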

How to use Scala to read a file from Azure Blob Storage?

I'm trying to use the Scala code below to read a CSV file from Azure Blob Storage.
val containerName = "azsqlshackcontainer"
val storageAccountName = "cloudshell162958911"
val sas = "?sv=2021-06-08&ss=bfqt&srt=sco&sp=rwdlacupiyx&se=2022-11-16T17:11:59Z&st=2022-11-16T09:11:59Z&spr=https&sig=ZAy5PeZu5jbICr5B%2BFTLLn6C5TMBxrU5WmbLCRfzNu8%3D"
val config = "fs.azure.sas." + containerName+ "." + storageAccountName + ".blob.core.windows.net"
dbutils.fs.mount(
source = "wasbs://azsqlshackcontainer#cloudshell162958911.blob.core.windows.net/FoodSales.csv",
mountPoint = "/mnt/myfile",
extraConfigs = Map(config -> sas))
When I run this code, I get the error below.
The container name, storage account name, and file name are correct.
I replicated the steps given here: https://www.sqlshack.com/accessing-azure-blob-storage-from-azure-databricks/
I'm not sure what I'm missing. I am able to do it using Python, but the Scala code is not working.
Try this alternative approach with an access key. I tried to reproduce it in my environment and got the results below:
dbutils.fs.mount(
  source = "wasbs://<container>@<storage_account_name>.blob.core.windows.net/",
  mountPoint = "/mnt/<mount_name>",
  extraConfigs = Map("fs.azure.account.key.<storage_account_name>.blob.core.windows.net" -> "<access_key>"))
Now you can check: I can read the CSV data through the mount path.
%scala
val df1 = spark.read.format("csv").option("inferSchema", "true").option("Header","true").load("/mnt/dsff")
display(df1)
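One thing worth checking in the original code is that the mount source points at a single blob (FoodSales.csv) rather than the container or a folder. For comparison, and since the question mentions the Python route already works, here is a minimal Python sketch of the same SAS-based mount; the SAS token is a placeholder and the container/account names are simply those from the question:
container = "azsqlshackcontainer"
account = "cloudshell162958911"
sas = "<sas_token>"

dbutils.fs.mount(
    source = f"wasbs://{container}@{account}.blob.core.windows.net/",
    mount_point = "/mnt/myfile",
    extra_configs = {f"fs.azure.sas.{container}.{account}.blob.core.windows.net": sas})

df = spark.read.option("header", "true").option("inferSchema", "true").csv("/mnt/myfile/FoodSales.csv")
display(df)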

In an Azure Synapse Analytics (PySpark) notebook, using the Spark Context Hadoop file system, I'm not able to move/copy or rename files

In an Azure Synapse Analytics (PySpark) notebook, using the Spark Context Hadoop file system, I'm able to delete a folder or file,
but I'm not able to move/copy or rename files; I keep getting an error.
Below is the snippet I used:
from pyspark.sql import SparkSession

# prepare spark session
spark = SparkSession.builder.appName('filesystemoperations').getOrCreate()

# spark context
sc = spark.sparkContext

# file path declaration
containerName = "ZZZZZZZ"
fullRootPath = "abfss://{cName}@{cName}.dfs.core.windows.net".format(cName=containerName)
tablePath = "/ws/tables/"
eventName = 'abcd'
tableFilename = "Game_" + eventName + ".kql"
tableFile = fullRootPath + tablePath + tableFilename
tempTableFile = fullRootPath + tablePath + tempPath + tableFilename

# empty the paths
sc._jsc.hadoopConfiguration().set('fs.defaultFS', fullRootPath)
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.delete(sc._jvm.org.apache.hadoop.fs.Path(fullRootPath + tablePath), True)
## This one worked fine.

# dfStringFinalF contains a string
spark.sparkContext.parallelize([dfStringFinalF]).coalesce(1).saveAsTextFile(tempTableFile)
fs.copy(tempTableFile + '/part-00000', tableFile)
# Copy, rename, cp, mv - nothing works on fs
Please help
It seems Azure Synapse Analytics has limitations with the Spark Context and shutil libraries.
The mssparkutils library helps with copying/moving files in the blob containers.
Here is a code sample:
from notebookutils import mssparkutils

fullRootPath = "abfss://{cName}@{cName}.dfs.core.windows.net".format(cName=containerName)

fsmounting = mssparkutils.fs.mount(
    fullRootPath,
    "/ws",
    {"linkedService": "abcd"}
)

fsMove = mssparkutils.fs.mv(tempTableFile + '/part-00000', tableFile)
More details in the link below:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-file-mount-api
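If mounting through a linked service is not required, mssparkutils.fs.cp and mssparkutils.fs.mv also accept full abfss:// paths directly. A minimal sketch, with placeholder container/account names:
from notebookutils import mssparkutils

# placeholder paths for illustration; substitute your own container, account, and file names
src = "abfss://<container>@<account>.dfs.core.windows.net/ws/tables/temp/Game_abcd.kql/part-00000"
dst = "abfss://<container>@<account>.dfs.core.windows.net/ws/tables/Game_abcd.kql"

mssparkutils.fs.cp(src, dst)   # copy the single part file to its final name
mssparkutils.fs.rm(src, True)  # optionally remove the temporary output afterwards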

FileUtils write method does not work on Azure Databricks

I have trouble writing a file on my Databricks cluster's driver (as a temp file). I have a Scala notebook on my company's Azure Databricks which contains these lines of code:
import java.io.File
import org.apache.commons.io.FileUtils

val xml: String = Controller.requestTo(url)
val bytes: Array[Byte] = xml.getBytes
val path: String = "dbfs:/data.xml"
val file: File = new File(path)
FileUtils.writeByteArrayToFile(file, bytes)
dbutils.fs.ls("dbfs:/")

val df = spark.read.format("com.databricks.spark.xml")
  .option("rowTag", "generic:Obs")
  .load(path)
df.show
file.delete()
However, it crashes with org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: dbfs:/data.xml. When I run an ls on the root of the DBFS, it doesn't show the file data.xml, so to me FileUtils is not doing its job. What puts me even more in trouble is that the following code works when run on the same cluster, same Azure resource group, same instance of Databricks, but in another notebook:
val path: String = "mf-data.grib"
val file: File = new File(path)
FileUtils.writeByteArrayToFile(file, bytes)
I tried restarting the cluster, removing "dbfs:/" from the path, putting the file in the dbfs:/tmp/ directory, and using FileUtils.writeStringToFile(file, xml, StandardCharsets.UTF_8) instead of FileUtils.writeByteArrayToFile, but none of those solutions worked, even when combined.
If you're using local APIs, like File, you need to use the corresponding local file access: instead of using dbfs:/ you need to prefix the path with /dbfs/, so your code will look as follows:
val file: File = new File(path.replaceFirst("dbfs:", "/dbfs"))
Try removing the dbfs: prefix here: val path: String = "dbfs:/data.xml". For understanding purposes I have given three different magic command cells: %sh, %fs, and %scala.
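For a quick sanity check of the dbfs:/ versus /dbfs/ distinction, a minimal Python sketch (data.xml is just the example name from the question, and /tmp is an arbitrary target folder):
# local-file APIs (open, java.io.File, ...) need the /dbfs/ fuse prefix
payload = b"example bytes"
with open("/dbfs/tmp/data.xml", "wb") as f:
    f.write(payload)

# Spark and dbutils APIs use the dbfs:/ URI for the same object
display(dbutils.fs.ls("dbfs:/tmp/"))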

Rename written CSV file Spark

I'm running Spark 2.1 and I want to write a CSV with the results to Amazon S3.
After repartitioning, the CSV file has kind of a long, cryptic name and I want to change that to a specific filename.
I'm using the databricks lib for writing to S3.
dataframe
.repartition(1)
.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("folder/dataframe/")
Is there a way to rename the file afterwards, or even save it directly with the correct name? I've already looked for solutions and haven't found much.
Thanks
You can use the code below to rename the output file.
dataframe.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("folder/dataframe/")
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
val filePath = "folder/dataframe/"
val fileName = fs.globStatus(new Path(filePath+"part*"))(0).getPath.getName
fs.rename(new Path(filePath+fileName), new Path(filePath+"file.csv"))
The code as you have it here returns Unit. You would need to confirm that your Spark application has completed its run (assuming this is a batch case) and then rename:
dataframe
.repartition(1)
.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("folder/dataframe/")
You can rename the part files to any specific name using dbutils commands. Use the code below to rename the part-generated CSV file; this works fine for PySpark.
x = 'dbfs:/mnt/source_path/'       # your source path
y = 'dbfs:/mnt/destination_path/'  # your destination path

files = dbutils.fs.ls(x)

# moving/renaming the part-0000 CSV files to a normal or specific name
i = 0
for file in files:
    print(file.name)
    i = i + 1
    if file.name[-4:] == '.csv':  # you can use any file extension, like parquet, JSON, etc.
        dbutils.fs.mv(x + file.name, y + 'OutputData-' + str(i) + '.csv')  # you can provide any specific name here

dbutils.fs.rm(x, True)  # optionally remove the source path afterwards, once all the part files have been renamed
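A slightly more direct variant of the same idea, if repartition(1) leaves exactly one part file to rename (reusing the x and y paths defined above):
# pick out the single part-* CSV and give it a fixed name
part_files = [f for f in dbutils.fs.ls(x) if f.name.startswith('part-') and f.name.endswith('.csv')]
dbutils.fs.mv(part_files[0].path, y + 'file.csv')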
