Reading (txt, csv) file from Azure Blob Storage using PySpark

I am trying to read a file from Azure Blob Storage into an Azure HDInsight cluster using PySpark and I am getting this error:

From the Azure portal, try opening the session again. If you are using a Jupyter notebook, try reopening it and re-running the commands; it should work fine.
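For reference, an HDInsight cluster exposes its attached blob container through the wasbs:// scheme, so the read itself typically looks like the minimal sketch below; the container, storage account, and file path are placeholders, not values from the question:

# Minimal sketch: read a CSV from the cluster's attached blob container with PySpark.
# <container>, <storageaccount>, and the file path are placeholders.
df = spark.read.csv(
    "wasbs://<container>@<storageaccount>.blob.core.windows.net/path/to/file.csv",
    header=True,
    inferSchema=True,
)
df.show(5)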

Related

Mounting an on-prem datastore to Azure Databricks DBFS

I am looking for a way to mount local storage that sits on an on-premises Hadoop cluster directly onto dbfs:/ for use with Databricks, instead of loading the data into Azure Blob Storage and then mounting that to Databricks. Any advice here would be helpful. Thank you.
I am still at the research stage and have not figured out a solution. I am not even sure whether it is possible without an Azure storage account.
Unfortunately, mounting an on-prem datastore to Azure Databricks is not supported.
You can try these alternative methods:
Method 1:
Connect to local files on a remote Databricks Spark cluster and access them through DBFS. Refer to this MS document.
Method 2:
Alternatively, use the Databricks CLI or REST API to push the local data to a location on DBFS, where it can be read into Spark from within a Databricks notebook (a minimal sketch follows below).
For more information, refer to this blog by Vikas Verma.
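The sketch below illustrates the REST API variant of Method 2, assuming a Databricks workspace URL and personal access token in environment variables; the local file and DBFS paths are placeholders:

# Minimal sketch: push a local file to DBFS via the Databricks REST API 2.0.
# DATABRICKS_HOST / DATABRICKS_TOKEN and both paths are placeholders, not values
# from the question.
import base64
import os
import requests

host = os.environ["DATABRICKS_HOST"]        # e.g. https://adb-<workspace-id>.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]      # personal access token
headers = {"Authorization": f"Bearer {token}"}

local_path = "/data/onprem/sample.csv"      # placeholder local file
dbfs_path = "/FileStore/onprem/sample.csv"  # placeholder DBFS target

with open(local_path, "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

# /api/2.0/dbfs/put accepts inline base64 content up to about 1 MB; larger files
# need the create/add-block/close streaming calls or the Databricks CLI instead.
resp = requests.post(
    f"{host}/api/2.0/dbfs/put",
    headers=headers,
    json={"path": dbfs_path, "contents": content, "overwrite": True},
)
resp.raise_for_status()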

Connect Pentaho to Azure Blob Storage or ADLS

Is there any way to connect Pentaho to Azure Blob Storage or ADLS? I am not able to find any option.
Go to the file in Azure Blob Storage, generate a SAS token and URL, and copy the URL only. In PDI, select the Hadoop file input step. Double-click the Hadoop file input, select Local for the Environment, and insert the Azure URL in the File/Folder field, and that's it. You should see the file in PDI.
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/authentication-wasb.html
This article explains how to connect the Pentaho Server to a Microsoft Azure HDInsight cluster. Pentaho supports both the HDFS file system and the WASB (Windows Azure Storage BLOB) extension for Azure HDInsight.

Move Files from Azure Files to ADLS Gen 2 and Back using Databricks

I have a Databricks process which currently generates a bunch of text files that get stored in Azure Files. These files need to be moved to ADLS Gen 2 on a scheduled basis, and back to the file share.
How can this be achieved using Databricks?
Installing the azure-storage-file-share package and using the Azure Files SDK for Python on Azure Databricks is the only way to access files in Azure Files (a minimal sketch follows this answer).
Install the library: azure-storage-file-share, https://pypi.org/project/azure-storage-file-share/
Note: pip install only installs the package on the driver node, so the data must first be handled on the driver (for example with pandas). The library must be deployed as a Databricks library before it can be used by Spark worker nodes.
Python - Load file from Azure Files to Azure Databricks - Stack Overflow
An alternative could be to copy the data from Azure File Storage to ADLS Gen 2 via Azure Data Factory using the Copy activity: Copy data from/to Azure File Storage - Azure Data Factory & Azure Synapse | Microsoft Docs
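As a rough illustration of the SDK route, the sketch below downloads one file from Azure Files on the driver and copies it to an ADLS Gen 2 mount from a Databricks notebook; the secret scope, share name, file paths, and mount point are placeholder assumptions:

# Minimal sketch (Databricks notebook): copy one file from Azure Files to ADLS Gen 2.
# Assumes azure-storage-file-share is installed as a cluster library; the secret
# scope, share name, file paths, and mount point are placeholders.
from azure.storage.fileshare import ShareFileClient

conn_str = dbutils.secrets.get("my-scope", "files-connection-string")  # hypothetical secret
file_client = ShareFileClient.from_connection_string(
    conn_str, share_name="myshare", file_path="reports/output.txt"
)

# The SDK runs on the driver, so download to the driver's local disk first.
local_tmp = "/tmp/output.txt"
with open(local_tmp, "wb") as f:
    f.write(file_client.download_file().readall())

# Then copy from the driver's filesystem to an ADLS Gen 2 mount (or abfss:// path).
dbutils.fs.cp(f"file:{local_tmp}", "dbfs:/mnt/adls/reports/output.txt")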

Create an Azure Databricks notebook from a storage account

We have a Python script stored as a blob in an Azure storage account. We want to deploy/create this Python script (as a notebook) in the Azure Databricks workspace, so that later an Azure Data Factory pipeline can execute the notebook created/deployed in Databricks.
We want to create/deploy this script only once, as and when it becomes available in the blob.
I have tried searching the web but couldn't find a proper solution for this.
Is it possible to deploy/create a notebook from a storage account? If yes, how?
Thank you.
You can import a notebook into Databricks using a URL, but I expect that you won't want to make that notebook public.
Another solution would be to use a combination of the azcopy tool with the Databricks CLI (workspace sub-command). Something like this:
azcopy cp "https://[account].blob.core.windows.net/[container]/[path/to/script.py]" .
databricks workspace import -l PYTHON script.py '<location_on_databricks>'
You can also do it completely in a notebook, combining the dbutils.fs.cp command with the Databricks Workspace REST API, but that could be more complicated as you need to get a personal access token, base64-encode the notebook, etc.
We can use the Databricks REST API 2.0 to import a Python script into the Databricks workspace (a minimal sketch is shown below).
Here is the API definition: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/workspace#--import
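The sketch below shows one way such an import could look, assuming the script is reachable through a SAS URL; the SAS URL, workspace URL, token, and target notebook path are placeholders:

# Minimal sketch: import a Python script from blob storage as a Databricks notebook
# via the Workspace API 2.0. The SAS URL, workspace URL, token, and target path are
# placeholders.
import base64
import requests

host = "https://<databricks-instance>.azuredatabricks.net"  # placeholder workspace URL
token = "<personal-access-token>"                           # placeholder token
headers = {"Authorization": f"Bearer {token}"}

# Download the script from blob storage (here via a placeholder SAS URL).
sas_url = "https://<account>.blob.core.windows.net/<container>/script.py?<sas-token>"
script = requests.get(sas_url).content

# POST /api/2.0/workspace/import with the base64-encoded source.
resp = requests.post(
    f"{host}/api/2.0/workspace/import",
    headers=headers,
    json={
        "path": "/Shared/script",  # target notebook path in the workspace
        "format": "SOURCE",
        "language": "PYTHON",
        "content": base64.b64encode(script).decode("utf-8"),
        "overwrite": True,
    },
)
resp.raise_for_status()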

HDInsight: Spark - how to upload a file?

I have a new HDInsight Spark cluster that I spun up.
I want to upload a file via the Ambari portal, but I don't see the HDFS option:
What am I missing? How can I get my .csv up to the server so I can start using it in the Python notebook?
HDInsight clusters do not work off local HDFS; they use Azure Blob Storage instead. So upload the file to the storage account that was attached to the cluster during its creation (a minimal sketch follows the link below).
More info:
https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage
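As an illustration, the sketch below uploads a CSV to the cluster's attached blob container with the azure-storage-blob SDK and then reads it back through wasbs:// in PySpark on the cluster; the storage account, container, key, and file names are placeholders:

# Minimal sketch: upload a CSV to the cluster's attached blob container, then read it
# from PySpark on the cluster. The account, container, key, and file names are
# placeholders.
from azure.storage.blob import BlobClient

blob = BlobClient(
    account_url="https://<storageaccount>.blob.core.windows.net",
    container_name="<cluster-container>",
    blob_name="data/mydata.csv",
    credential="<storage-account-key>",
)
with open("mydata.csv", "rb") as f:
    blob.upload_blob(f, overwrite=True)

# From a notebook on the HDInsight cluster, the same blob is visible under wasbs://.
df = spark.read.csv(
    "wasbs://<cluster-container>@<storageaccount>.blob.core.windows.net/data/mydata.csv",
    header=True,
)
df.show(5)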
