I have this situation to solve and I am not sure how to proceed.
We have a Databricks instance running on Microsoft Azure. It hosts multiple databases that can be accessed through Databricks' Apache Spark client (PySpark). We have working code that runs fine when executed directly in Databricks.
How do we connect our local clients (or pipelines in Azure DevOps?) so that they execute the code in Databricks?
The idea is to come up with a way to run the code, which can only be run in Databricks, from local clients or from the pipeline.
I have looked into Databricks Connect, but it does not seem to be a library that could solve our issue.
Thank you
Related
Is there any way to create a Databricks job cluster through Databricks Connect?
We have been using an all-purpose cluster so far. To reduce Databricks cost we plan to move to job clusters, but unfortunately I couldn't find a way to create a job cluster through Databricks Connect.
Alternatively, is there any way to bypass Databricks Connect and create a job cluster from an IDE (PyCharm)?
Databricks Connect is designed to work only with interactive clusters, since it is itself intended for interactive work, so you can't use it with job clusters.
Alternatively, you can look at the dbx tool from Databricks Labs, which lets you develop code in a local IDE and then run tests on job or interactive clusters. Keep in mind, though, that it won't let you interactively debug code running on Databricks; you can only do that against local Spark.
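If the goal is simply to launch code on a job cluster from an IDE, another route (not covered in the answer above) is to call the Jobs REST API directly. Below is a minimal Python sketch using the requests package; the workspace URL, token, cluster sizing and script path are placeholder assumptions.

# Submit a one-off run on a new job cluster via the Jobs runs-submit API.
import requests

host = "https://<your-workspace>.azuredatabricks.net"   # placeholder workspace URL
token = "<personal-access-token>"                        # placeholder PAT

payload = {
    "run_name": "ad-hoc run from PyCharm",
    "new_cluster": {                      # a job cluster created just for this run
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
    "spark_python_task": {"python_file": "dbfs:/scripts/main.py"},  # placeholder script
}

resp = requests.post(
    f"{host}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print("Started run:", resp.json()["run_id"])

The cluster defined under new_cluster exists only for that run and is terminated when it finishes, which gives the cost behaviour the question is after.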
I have a requirement to develop an application in Python. The application will interact with any database and execute SQL statements against it. It can also interact with a Databricks instance and query the tables in Databricks.
The requirement is that the Python application should be platform independent. It is therefore built so that the Spark-specific code is triggered only when it runs on Databricks; when it runs on a standalone node, that code is skipped. The application interacts with Azure Blob Storage to access some files/folders, and it is deployed to the standalone node / Databricks as a wheel.
The issue here is with custom logging. I have implemented custom logging in the Python application, and there are two scenarios depending on where the application runs:
Standalone Node
Databricks Cluster.
If the code runs on the standalone node, the custom log is first written to a local OS folder, and after the application completes (successfully or not) it is moved to Azure Blob Storage. If for some reason the move to Azure Storage fails, the log file is still available on the standalone node's local file system.
If the same approach is followed on Databricks and the application fails to upload the log file to Blob Storage, we cannot recover it, because the Databricks OS storage is volatile. I tried writing the log to DBFS, but it doesn't allow appending.
Is there a way to get the application logs from Databricks? Is there a possibility that Databricks can record my job execution and store the logs? As mentioned, the Python application is deployed as a wheel and contains very limited Spark code.
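For reference, the "move the log file to Blob Storage" step described above could look roughly like the sketch below, using the azure-storage-blob package; the connection string, container, blob and local file names are placeholders.

# Upload the locally written log file to Azure Blob Storage at the end of the run.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="app-logs", blob="run-2021-06-01.log")

with open("/tmp/app.log", "rb") as log_file:            # placeholder local log path
    blob.upload_blob(log_file, overwrite=True)           # push the whole file in one go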
Is there a way to get the application logs from databricks? Is there a possibility that the databricks can record my job execution and store the logs?
I think you are able to do that now, but once the cluster is shut down (to minimize cost) the logs will be gone. Thank you for sharing that logs on DBFS can't be appended to; I wasn't aware of that.
Is your standalone application open to the internet? If yes, then maybe you can explore the option of writing the logs to Azure Event Hubs. You can write to the event hub from both Azure Databricks and the standalone application, and then land the events in Blob Storage etc. for further visualization.
This tutorial should get you started:
https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-python-get-started-send
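As a rough illustration of that approach, a log line could be forwarded to an event hub with the azure-eventhub package as sketched below; the connection string, hub name and message are placeholders.

# Send one application log line to Azure Event Hubs.
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-namespace-connection-string>",   # placeholder
    eventhub_name="<event-hub-name>",                       # placeholder
)

with producer:
    batch = producer.create_batch()
    batch.add(EventData("2021-06-01 12:00:00 INFO job step finished"))
    producer.send_batch(batch)       # events can then be captured to Blob Storage

Both the Databricks job and the standalone application could call the same helper, and Event Hubs Capture (or a small consumer) can persist the events to Blob Storage.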
HTH
Databricks secrets can be accessed within notebooks using dbutils; however, since dbutils is not available outside notebooks, how can one access secrets in PySpark/Python jobs, especially if they are run using MLflow?
I have already tried "How to load databricks package dbutils in pyspark", which does not work for remote jobs or MLflow project runs.
In raw PySpark you cannot do this. However, if you are developing a PySpark application specifically for Databricks, then I strongly recommend you look at Databricks Connect.
It allows access to parts of dbutils, including secrets, from an IDE. It also simplifies how you access storage so that it aligns with how the code will run in production.
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect
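With databricks-connect installed and configured, getting at secrets from local code typically looks like the sketch below; the scope and key names are placeholders.

# Obtain a dbutils handle through databricks-connect and read a secret.
from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils   # ships with databricks-connect

spark = SparkSession.builder.getOrCreate()   # connects to the configured cluster
dbutils = DBUtils(spark)

db_password = dbutils.secrets.get(scope="my-scope", key="db-password")  # placeholder names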
I have a requirement to parse a lot of small files and load them into a database in a flattened structure. I would prefer to use ADF v2 and SQL Database to accomplish this. The file parsing logic already exists as a Python script, and I want to orchestrate it in ADF. I can see an option of using the Python notebook connector to Azure Databricks in ADF v2. Can I just run a plain Python script in Azure Databricks through ADF? If I do so, will the script run only on the Databricks cluster's driver and therefore not utilize the cluster's full capacity? I am also considering calling Azure Functions. Please advise which option is more appropriate in this case.
Here are some ideas for your reference.
Firstly, you are talking about notebooks and Databricks, which means ADF's own Copy activity and Data Flow can't meet your needs; as far as I know, ADF only offers a simple flatten capability. If you missed that, please try it first.
Secondly, if you do have requirements beyond ADF's features, why not just leave ADF out? Notebooks and Databricks don't have to be used with ADF, so why pay the extra cost? For a notebook, you have to install packages yourself, such as pysql or pyodbc. For Azure Databricks, you can mount Azure Blob Storage and access those files as a file system. In addition, I suppose you don't need many workers for the cluster, so just cap it at 2.
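Mounting the Blob Storage container could look roughly like this; the account, container, secret scope and mount point names are placeholders.

# Mount a Blob Storage container so the small files can be read like a local file system.
dbutils.fs.mount(
    source="wasbs://<container>@<storage-account>.blob.core.windows.net",
    mount_point="/mnt/input-files",                     # placeholder mount point
    extra_configs={
        "fs.azure.account.key.<storage-account>.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope>", key="<storage-key>")  # placeholder secret
    },
)

files = dbutils.fs.ls("/mnt/input-files")               # the files to parse are now visible here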
I think Databricks is better suited to being managed as a job.
An Azure Function could also be an option. You could create a blob trigger and load the files into one container. Of course, you have to learn the basics of Azure Functions if you are not familiar with them; however, an Azure Function could be more economical.
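A blob-triggered function for this scenario could be sketched as below (using the Python v2 programming model, which may be newer than what was available when this was asked); the container name and connection setting are placeholders, and the parsing/loading logic is left as a stub.

# Azure Function that fires for every file dropped into the "incoming" container.
import azure.functions as func

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob", path="incoming/{name}",
                  connection="AzureWebJobsStorage")      # placeholder storage connection
def parse_file(blob: func.InputStream):
    raw = blob.read()                                    # raw file content
    # ... reuse the existing Python parsing logic here, then write rows to SQL Database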
Recently, Databricks launched Databricks Connect, which
allows you to write jobs using Spark native APIs and have them execute remotely on an Azure Databricks cluster instead of in the local Spark session.
It works fine except when I try to access files in Azure Data Lake Storage Gen2. When I execute this:
spark.read.json("abfss://...").count()
I get this error:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
Does anybody know how to fix this?
Further information:
databricks-connect version: 5.3.1
If you mount the storage rather than use a service principal, you should find this works: https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake-gen2.html
I posted some notes on the limitations of Databricks Connect here: https://datathirst.net/blog/2019/3/7/databricks-connect-limitations
Likely too late, but for completeness' sake, there's one issue to look out for here. If you have this Spark conf set, you'll see that exact error (which is pretty hard to unpack):
fs.abfss.impl org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem
So double-check the Spark configs and make sure you have permission to access ADLS Gen2 directly using the storage account access key.
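For reference, direct access with the storage account key is usually configured like the sketch below; the account, container and key values are placeholders.

# Configure direct ADLS Gen2 access with the storage account access key, then read.
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<storage-account-access-key>",                      # placeholder key
)

df = spark.read.json(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/data"
)
print(df.count())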