How to submit a custom Spark application on Azure Databricks?

I have created a small application that submits a Spark job at certain intervals and creates some analytical reports. These jobs can read data from a local filesystem or a distributed filesystem (the fs could be HDFS, ADLS or WASB). Can I run this application on an Azure Databricks cluster?
The application works fine on an HDInsight cluster, as I was able to access the nodes: I kept my deployable jar in one location and started it using a start script; similarly, I could stop it using a stop script that I prepared.
One thing I found is that Azure Databricks has its own file system, ADFS. I can add support for this file system too, but will I then be able to deploy and run my application the way I did on the HDInsight cluster? If not, is there a way I can submit jobs to an Azure Databricks cluster from an edge node, my HDInsight cluster, or any other on-prem cluster?

Have you looked at Jobs? https://docs.databricks.com/user-guide/jobs.html. You can submit jars through spark-submit just like on HDInsight.
The Databricks file system is DBFS; ABFS is the driver used for Azure Data Lake Storage. You should not need to modify your application for these: the file paths will be handled by Databricks.
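For illustration, here is a minimal sketch of submitting an existing jar through the Jobs API (runs/submit) as a spark_submit_task. The workspace URL, token, cluster settings, main class and jar path are all placeholders, not values from the question:

import requests

# Placeholders: substitute your workspace URL, a personal access token,
# and the DBFS path where the application jar was uploaded.
DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

payload = {
    "run_name": "analytics-report",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",  # example runtime version
        "node_type_id": "Standard_DS3_v2",   # example Azure VM size
        "num_workers": 2,
    },
    # Equivalent of: spark-submit --class com.example.ReportJob report-app.jar
    "spark_submit_task": {
        "parameters": ["--class", "com.example.ReportJob", "dbfs:/jars/report-app.jar"]
    },
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
response.raise_for_status()
print("Submitted run:", response.json()["run_id"])

If the runs are on a fixed interval, a regular job with a schedule can replace the external scheduler entirely.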

Related

How to log custom Python application logs in Databricks and move them to Azure

I have a requirement to develop an application in Python. The Python application will interact with any database and execute SQL statements against it. It can also interact with a Databricks instance and query the tables in Databricks.
The requirement is that the Python application should be platform independent, so it is developed in such a way that the Spark-specific code within the application is triggered only when it runs on Databricks; if it runs on a standalone node, that code is skipped. The Python program interacts with Azure Blob Storage to access some files/folders. The Python application is deployed on the standalone node/Databricks as a wheel.
The issue here is with custom logging. I have implemented custom logging in the Python application. There are two scenarios, based on where the application is run:
Standalone Node
Databricks Cluster.
If the code is run on the standalone node, the custom log is initially written to a local OS folder, and after the application completes successfully or fails, it is moved to Azure Blob Storage. If for some reason it fails to move the log file to Azure Storage, it is still available on the local filesystem of the standalone node.
If the same approach is followed on Databricks and the application fails to upload the log file to blob storage, we cannot recover it, because the Databricks OS storage is volatile. I tried to write the log to DBFS, but it doesn't allow appending.
Is there a way to get the application logs from Databricks? Is there a possibility that Databricks can record my job execution and store the logs? As I mentioned, the Python application is deployed as a wheel and contains very limited Spark code.
Is there a way to get the application logs from Databricks? Is there a possibility that Databricks can record my job execution and store the logs?
I think you are able to do that now, but once the cluster is shut down (to minimize cost) the logs will be gone. Thank you for sharing that logs in DBFS cannot be appended to; I was not aware of that.
Is your standalone application open to the internet? If yes, then maybe you can explore the option of writing the logs to Azure Event Hubs. You can write to Event Hubs from both Azure Databricks and the standalone application, and then move the events to blob storage etc. for further visualization.
This tutorial should get you started:
https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-python-get-started-send
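A rough sketch of that approach with the azure-eventhub Python package follows; the connection string, hub name and log lines are placeholders. The same few lines work from both the standalone node and the Databricks job:

from azure.eventhub import EventHubProducerClient, EventData

# Placeholders: use your own Event Hubs connection string and hub name.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-namespace-connection-string>",
    eventhub_name="<event-hub-name>",
)

with producer:
    batch = producer.create_batch()
    # Each log record is sent as one event; a downstream consumer or
    # Event Hubs Capture can later land these in blob storage.
    batch.add(EventData("INFO my_app: job started"))
    batch.add(EventData("INFO my_app: job finished"))
    producer.send_batch(batch)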
HTH

Customizing nodes of an Azure Synapse Workspace Spark Cluster

When creating a Spark cluster within an Azure Synapse workspace, is there a means to install arbitrary files and directories onto its cluster nodes and/or onto the nodes' underlying distributed filesystem?
By arbitrary files and directories, I literally mean arbitrary files and directories, not just extra Python libraries as demonstrated here.
Databricks smartly provided a means to do this on its cluster nodes (described in this document). Now I'm trying to see if there's a means to do the same on an Azure Synapse workspace Spark cluster.
Thank you.
Unfortunately, Azure Synapse Analytics doesn't support arbitrary binary installs or writing to Spark local storage.
I would suggest providing feedback on this:
https://feedback.azure.com/forums/307516-azure-synapse-analytics
All of the feedback you share in these forums will be monitored and reviewed by the Microsoft engineering teams responsible for building Azure.

How does an HDInsight cluster map to Azure Storage as HDFS?

I have a fair idea of how Hadoop works, as I have studied the on-premise model, since that's how everyone learns. In that sense the top-level idea is fairly straightforward: we have a set of machines (nodes), we run certain processes on each of them, and then we configure those processes in such a way that the entire thing starts behaving as a single logical entity that we call a Hadoop (YARN) cluster. Here HDFS is a logical layer on top of the individual storage of all the machines in the cluster.

But when we start thinking of the same cluster in the cloud, this becomes a little confusing. Taking the case of an HDInsight Hadoop cluster: let's say I already have an Azure Storage account with lots of text data and I want to do some analysis, so I go ahead and spin up a Hadoop cluster in the same region as the storage account. Now the whole idea behind Hadoop is processing closest to where the data exists. In this case, when we create the Hadoop cluster, a bunch of Azure virtual machines start behind the scenes with their own underlying storage (though in the same region). But then, while creating the cluster, we specify a default storage account and a few other storage accounts to be attached, where the data to be processed lies. So ideally the data to be processed needs to exist on the disks of the virtual machines. How does this work in Azure? I guess the virtual machines create disks that are actually pointers to Azure Storage accounts (default + attached)?

This part is not really explained well and is quite cloudy, so a lot of people, including myself, are always in the dark when they learn the classic on-premise Hadoop model academically and then start using cloud-based clusters in the real world. If we could see more information about these virtual machines right from the cluster Overview page in the Azure portal, it would help the understanding. I know it's visible from Ambari, but Ambari is blind to Azure; it's an independent component, so that is not very helpful.
There is an underlying driver that works as a bridge, mapping Azure Storage as HDFS for the other services running in HDInsight.
You can read more about this driver's functionality on the official page below:
https://hadoop.apache.org/docs/current/hadoop-azure/index.html
If your Azure Storage account is of type ADLS Gen2 (Azure Data Lake Storage Gen2), then a different driver is used, documented on the following official page. It offers some advanced ADLS Gen2 capabilities to beef up your HDInsight performance.
https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
Finally, just like your on-prem Hadoop installation, HDInsight also has a local HDFS deployed across the HDInsight cluster VMs' hard drives. You can access this local HDFS using a URI as below.
hdfs://mycluster/
For example, you can issue the following to view your local HDFS root-level content.
hdfs dfs -ls hdfs://mycluster/
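As an illustration of the driver mapping described above, here is a short sketch of addressing an attached storage account from Spark. The account, container and filesystem names are placeholders, and the credentials are assumed to have been configured by HDInsight when the accounts were attached:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("azure-storage-example").getOrCreate()

# Blob storage addressed through the wasb driver (hadoop-azure)
blob_df = spark.read.text("wasbs://mycontainer@mystorageaccount.blob.core.windows.net/data/sample.txt")

# ADLS Gen2 addressed through the abfs driver
adls_df = spark.read.text("abfss://myfilesystem@mystorageaccount.dfs.core.windows.net/data/sample.txt")

blob_df.show(5)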

CDAP with Azure Databricks

Has anyone tried using Azure Databricks as the Spark cluster for CDAP job processing? The CDAP documentation details how to add it to Azure HDInsight, but I'm just wondering whether there is a way to configure CDAP to point to a Databricks Spark cluster; is it even possible? Or does this kind of integration need a specific Databricks client connector jar? If anyone has any insights, that would be helpful.
There is no out-of-the-box support for Databricks Spark on Azure. That said, you can develop a new cloud runtime that is capable of submitting the jobs to a Databricks Spark cluster. There are examples of how to write a runtime extension for Cloud Dataproc and EMR.

Specify Azure key in Spark 2.x

I'm trying to access a wasb (Azure Blob Storage) file in Spark and need to specify the account key.
How do I specify the account key in the spark-env.sh file?
fs.azure.account.key.test.blob.core.windows.net
EC5sNg3qGN20qqyyr2W1xUo5qApbi/zxkmHMo5JjoMBmuNTxGNz+/sF9zPOuYA==
When I try this, it throws the following error:
fs.azure.account.key.test.blob.core.windows.net: command not found
From your description, it is not clear whether the Spark you are using is on Azure or running locally.
For Spark running locally, refer to this blog post, which introduces how to access Azure Blob Storage from Spark. The key is that you need to configure the Azure Storage account as HDFS-compatible storage in the core-site.xml file and add the two jars hadoop-azure & azure-storage to your classpath, so the storage can be accessed via the wasb[s] protocol.
For Spark running on Azure, the only difference is that you access the storage with wasb; all of this configuration is done by Azure when the HDInsight cluster with Spark is created.
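As a rough sketch (the account name test matches the question, but the container, path and key are placeholders), the key can also be supplied through the Spark configuration rather than spark-env.sh; spark.hadoop.* properties are forwarded into the underlying Hadoop configuration, which is where the wasb driver looks up account keys:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("wasb-key-example")
    # spark.hadoop.* settings are copied into the Hadoop configuration.
    .config(
        "spark.hadoop.fs.azure.account.key.test.blob.core.windows.net",
        "<storage-account-key>",  # placeholder; never hard-code a real key
    )
    .getOrCreate()
)

df = spark.read.text("wasbs://mycontainer@test.blob.core.windows.net/some/file.txt")
df.show(5)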
