I am trying to connect to abfss directly (without mounting to DBFS) and to open a JSON file using open() in Databricks - Azure

I am trying to connect to abfss directly (without mounting to DBFS) and to open a JSON file using the open() method in Databricks.
json_file = open("abfss://#.dfs.core.windows.net/test.json") Databricks is unable to open the file in the Azure blob container and I get the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'abfss://#.dfs.core.windows.net/test.json'
I have done all the configuration settings using a service principal. Please suggest another way of opening the file using the direct abfss path.

The open method works only with local files; it doesn't know anything about abfss or other cloud storage. You have the following choices:
Use dbutils.fs.cp to copy the file from ADLS to the local disk of the driver node, and then work with it, like: dbutils.fs.cp("abfss:/....", "file:/tmp/my-copy")
Copy the file from ADLS to the driver node using the Azure SDK
The first method is easier to use than the second; see the sketch below.
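For example, a minimal Python sketch of the first option (the container and storage account names are placeholders, and the service-principal configuration from the question is assumed to already be in place):

# Copy the file from ADLS to the driver's local disk, then open it with open()
dbutils.fs.cp("abfss://container@account.dfs.core.windows.net/test.json",
              "file:/tmp/test.json")

with open("/tmp/test.json") as f:
    data = f.read()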

Related

Can't access files via Local file API on Databricks

I'm trying to access a small text file stored directly on DBFS using the local file API.
I'm getting the following error:
No such file or directory
My code:
import scala.io.Source

val filename = "/dbfs/test/test.txt"
for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}
At the same time, I can access this file without any problems using dbutils or load it into an RDD via the Spark context.
I've tried specifying the path starting with dbfs:/ or /dbfs/ or just with the test folder name, both in Scala and Python, and I get the same error each time. I'm running the code from a notebook. Is it a problem with the cluster configuration?
Check if your cluster has Credential Passthrough enabled. If so, the local file API is not available.
https://docs.azuredatabricks.net/data/databricks-file-system.html#local-file-apis
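As a workaround while passthrough is enabled, you can read the file through the DBFS APIs instead of the local file API. A minimal Python sketch using the path from the question:

# Works: dbutils and Spark go through DBFS rather than the driver's local filesystem
print(dbutils.fs.head("dbfs:/test/test.txt"))
df = spark.read.text("dbfs:/test/test.txt")
# Does not work with passthrough enabled: open("/dbfs/test/test.txt")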

Azure SSIS IR - working with files in the temp folder of the IR node

I have set up a custom SSIS IR; however, I'm having problems reading files from the current working directory or temp folder on the IR node.
https://learn.microsoft.com/en-us/sql/integration-services/lift-shift/ssis-azure-files-file-shares?view=sql-server-2017
The workflow of my test package is:
Load compressed file to Azure file share
Unzip file
Modify the file, saving it to the current working folder on the IR node (the path .\testfile.json)
Load file to Azure SQL DB
The last step is where I'm having issues; I receive the error message below. It looks like it may be related to security, but I have no idea how to access the SSIS IR node to check this.
Execute SQL Task:Error: Executing the query "DECLARE @request
VARCHAR(MAX) SELECT @request =..." failed with the following error:
"Cannot bulk load because the file ".\testfile.json" could not be
opened. Operating system error code (null).". Possible failure
reasons: Problems with the query, "ResultSet" property not set
correctly, parameters not set correctly, or connection not established
correctly.
How can I fix this issue?
From just the error message, it looks like you're using BULK INSERT in an Execute SQL Task to load data into Azure SQL DB. BULK INSERT into Azure SQL DB can only work from Azure Blob Storage, not from file systems/SSIS IR nodes. To load data from the current working directory of the SSIS IR node into Azure SQL DB, you can use a Data Flow Task with a Flat File Source and an ADO.NET Destination.

Azure Databricks - Unable to read simple blob storage file from notebook

I've set up a cluster with Databricks Runtime version 5.1 (includes Apache Spark 2.4.0, Scala 2.11) and Python 3. I also installed the hadoop-azure library (hadoop-azure-3.2.0) on the cluster.
I'm trying to read a blob stored in my blob storage account, which is just a text file containing some numeric data delimited by spaces, for example. I used the template generated by Databricks for reading blob data:
spark.conf.set(
"fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
storage_account_access_key)
df = spark.read.format(file_type).option("inferSchema", "true").load(file_location)
where file_location is my blob file (https://xxxxxxxxxx.blob.core.windows.net).
I get the following error:
No filesystem named https
I tried using sc.textFile(file_location) to read it into an RDD and got the same error.
Your file_location should be in the format:
"wasbs://<your-container-name>#<your-storage-account-name>.blob.core.windows.net/<your-directory-name>"
See: https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html
You need to mount the blob storage with an external location to access it via Azure Databricks, as in the sketch below.
Reference: https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html#mount-azure-blob-storage-containers-with-dbfs
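For reference, a hedged sketch of what such a mount looks like with dbutils (STORAGE_ACCOUNT, CONTAINER, the key, and the file name are placeholders; see the linked docs for the exact options):

# Mount the blob container under /mnt, then read through the mount point
dbutils.fs.mount(
  source = "wasbs://CONTAINER@STORAGE_ACCOUNT.blob.core.windows.net",
  mount_point = "/mnt/CONTAINER",
  extra_configs = {"fs.azure.account.key.STORAGE_ACCOUNT.blob.core.windows.net": "BIG_KEY"})

df = spark.read.text("/mnt/CONTAINER/your-file.txt")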
These three lines of code worked for me:
spark.conf.set("fs.azure.account.key.STORAGE_ACCOUNT.blob.core.windows.net","BIG_KEY")
df = spark.read.csv("wasbs://CONTAINER@STORAGE_ACCOUNT.blob.core.windows.net/")
df.select('*').show()
NOTE that line 2 ends with .net/ because I do not have a sub-folder.

Load props file in EMR Spark Application

I am trying to load custom properties in my Spark application using:
command-runner.jar,spark-submit,--deploy-mode,cluster,--properties-file,s3://spark-config-test/myprops.conf,--num-executors,5,--executor-cores,2,--class,com.amazon.Main,#{input.directoryPath}/SWALiveOrderModelSpark-1.0-super.jar
However, I am getting the following exception:
Exception in thread "main" java.lang.IllegalArgumentException: Invalid properties file 's3://spark-config-test/myprops.conf'.
  at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
  at org.apache.spark.launcher.AbstractCommandBuilder.loadPropertiesFile(AbstractCommandBuilder.java:284)
  at org.apache.spark.launcher.AbstractCommandBuilder.getEffectiveConfig(AbstractCommandBuilder.java:264)
  at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:233)
  at org
Is this the correct way to load a file from S3?
You can't load a properties file directly from S3. Instead, you will need to download the properties file to somewhere on your master node, then submit the Spark job referencing the local path on that node. You can do the download by using command-runner.jar to run the AWS CLI utility, as sketched below.
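For example, a sketch of that two-step approach (the local path /home/hadoop/myprops.conf is just an illustrative choice), first copying the file down with the AWS CLI and then referencing the local copy in spark-submit:
command-runner.jar,aws,s3,cp,s3://spark-config-test/myprops.conf,/home/hadoop/myprops.conf
command-runner.jar,spark-submit,--deploy-mode,cluster,--properties-file,/home/hadoop/myprops.conf,--num-executors,5,--executor-cores,2,--class,com.amazon.Main,#{input.directoryPath}/SWALiveOrderModelSpark-1.0-super.jar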

AzCopy blob download throwing errors on local machine

I am running the following command while learning how to use AzCopy.
azcopy /Source:https://storeaccountname.blob.core.windows.net/container /Dest:C:\container\ /SourceKey:Key /Pattern:"tdx" /S /V
Some files are downloaded, but most files result in an error like the following. I have no idea why this is happening and wondered if somebody has encountered this and knows the cause and the fix.
[2016/05/31 21:27:13][ERROR] tdx/logs/site-visit/archive/1463557944558/visit-1463557420000: Failed to open file C:\container\tdx\logs\site-visit\archive\1463557944558\visit-1463557420000: Access to the path 'C:\container\tdx\logs\site-visit\archive\1463557944558\visit-1463557420000' is denied..
My ultimate goal is to back up the blobs in a container of one storage account to a container of another storage account, so I am starting out with the basics, which seem to fail.
Here is a list of folder names from an example path pulled from Azure Portal:
storeaccountname > Blob service > container > app-logs > hdfs > logs
application_1461803569410_0008
application_1461803569410_0009
application_1461803569410_0010
application_1461803569410_0011
application_1461803569410_0025
application_1461803569410_0027
application_1461803569410_0029
application_1461803569410_0031
application_1461803569410_0033
application_1461803569410_0035
application_1461803569410_0037
application_1461803569410_0039
application_1461803569410_0041
application_1461803569410_0043
application_1461803569410_0045
There is an error in the log for each one of these folders that looks like this:
[2016/05/31 21:29:18.830-05:00][VERBOSE] Transfer FAILED: app-logs/hdfs/logs/application_1461803569410_0008 => app-logs\hdfs\logs\application_1461803569410_0008.
[2016/05/31 21:29:18.834-05:00][ERROR] app-logs/hdfs/logs/application_1461803569410_0008: Failed to open file C:\container\app-logs\hdfs\logs\application_1461803569410_0008: Access to the path 'C:\container\app-logs\hdfs\logs\application_1461803569410_0008' is denied..
The folder application_1461803569410_0008 contains two files. Those two files were successfully downloaded. From the logs:
[2016/05/31 21:29:19.041-05:00][VERBOSE] Finished transfer: app-logs/hdfs/logs/application_1461803569410_0008/10.2.0.5_30050 => app-logs\hdfs\logs\application_1461803569410_0008\10.2.0.5_30050
[2016/05/31 21:29:19.084-05:00][VERBOSE] Finished transfer: app-logs/hdfs/logs/application_1461803569410_0008/10.2.0.4_30050 => app-logs\hdfs\logs\application_1461803569410_0008\10.2.0.4_30050
So it appears that the problem is related to copying folders, which are themselves blobs, but I can't be certain yet.
There are several known issues when using AzCopy, such as the one below, which will cause an error:
If there are two blobs named “a” and “a/b” under a storage container, copying the blobs under that container with /S will fail, because Windows will not allow the creation of a folder named “a” and a file named “a” under the same parent folder.
Refer to https://blogs.msdn.microsoft.com/windowsazurestorage/2012/12/03/azcopy-uploadingdownloading-files-for-windows-azure-blobs/ and scroll down to the bottom for the details under Known Issues.
In my container con2, there is a folder named abc.pdf and also a file abc.pdf; when executing the AzCopy download command with /S, it prompts an error message.
Please check your container to see whether there are folders with the same name as a file.
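If you prefer to check programmatically rather than by eye, a hedged Python sketch using the azure-storage-blob SDK (the connection string and container name are placeholders):

# Flag any blob whose name is also used as a "folder" prefix of another blob,
# which is the situation the known issue describes.
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string("<connection-string>", container_name="container")
names = {b.name for b in container.list_blobs()}
for name in names:
    prefix = name + "/"
    if any(other.startswith(prefix) for other in names):
        print(f"Conflict: blob '{name}' is also used as a folder prefix")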

Resources