Is there any way to create Databricks jobCluster through Databricks connect? - apache-spark

Is there any way to create Databricks jobCluster through Databricks connect?
We have been using all-purpose clusters so far. To reduce Databricks cost we are planning to move to job clusters, but unfortunately I couldn't find a way to create a job cluster through Databricks Connect.
Or is there an alternate way to bypass Databricks Connect and create a job cluster from the IDE (PyCharm)?

Databricks Connect is designed to work only with interactive clusters, as it is itself intended for interactive work, so you can't use it with job clusters.
Alternatively, you can look at the dbx tool from Databricks Labs, which lets you develop code in a local IDE and then run tests on job or interactive clusters. But take into account that it won't let you interactively debug code running on Databricks; you can do that only against local Spark.
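If the goal is simply to launch work on a job cluster from the IDE instead of an all-purpose cluster, one option is to call the Jobs API directly rather than going through Databricks Connect. Below is a minimal sketch of submitting a one-time run on a new job cluster via the runs/submit endpoint; the host, token, notebook path and cluster spec are placeholders you would adjust for your workspace.

import requests

# Assumed placeholders: adjust host, token, notebook path and cluster spec for your workspace.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

payload = {
    "run_name": "one-off-job-cluster-run",
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Repos/me/my_notebook"},
            # A new job cluster is created just for this run and terminated afterwards.
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print("Submitted run:", resp.json()["run_id"])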

Related

How to run Python code in Azure Databricks from local client

I have this situation to solve and I am not sure how to proceed.
We have a Databricks instance running on Microsoft Azure. On Databricks there are also multiple databases which can be accessed with Databricks' Apache Spark client (PySpark). We have working code that runs fine when executed directly in Databricks.
How do we connect our local clients (or pipelines in Microsoft Azure Development?) to execute the code in Databricks?
The idea is to come up with a solution for running the code, which can only be run in Databricks, from local clients or from a pipeline.
I have looked into Databricks Connect but that does not seem like a library which could solve our issue.
Thank you

How to run spark sql queries using Databricks Cluster through Linux?

I want to execute Spark SQL commands on a Databricks cluster from a Linux machine. Is there any way to achieve this?
I have a set of Spark SQL commands in a .sql file and want to execute this file against the Databricks cluster from the Linux machine.
I am looking for something analogous to SQL*Plus, where we make a connection to the DB and execute SQL; in the same way, is there any utility/solution to execute Spark SQL against a Databricks cluster?
You can connect to a Databricks cluster using the ODBC, JDBC, HTTP or Thrift protocol. In every case you will need an access token with sufficient permissions.
I am using IntelliJ DataGrip to connect via JDBC. I had to configure the Databricks JDBC driver and used this URI:
jdbc:spark://mycompany.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<MY-DATABRICKS-ORGANIZATION-ID>/<MY-DATABRICKS-CLUSTER-ID>;AuthMech=3;UID=token;PWD=<MY-DATABRICKS-TOKEN>
I believe any modern SQL client should be able to connect, since Databricks exposes standard interfaces.
This is the official documentation from Databricks:
https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html
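As an alternative to a JDBC client, here is a minimal sketch that runs the statements from a .sql file against a cluster using the databricks-sql-connector Python package; the hostname, HTTP path, token and file name are placeholders, and the naive split on ';' assumes simple statements.

from databricks import sql  # pip install databricks-sql-connector

# Assumed placeholders: fill in your workspace hostname, cluster HTTP path and token.
HOSTNAME = "mycompany.cloud.databricks.com"
HTTP_PATH = "sql/protocolv1/o/<ORG-ID>/<CLUSTER-ID>"
TOKEN = "<MY-DATABRICKS-TOKEN>"

with open("queries.sql") as f:
    # Naive split on ';' -- adjust if your file contains more complex statements.
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

with sql.connect(server_hostname=HOSTNAME, http_path=HTTP_PATH, access_token=TOKEN) as conn:
    with conn.cursor() as cursor:
        for stmt in statements:
            cursor.execute(stmt)
            # Only SELECT statements return rows to fetch.
            if stmt.lower().startswith("select"):
                print(cursor.fetchall())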

Local instance of Databricks for development

I am currently working on a small team that is developing a Databricks based solution. For now we are small enough to work off of cloud instances of Databricks. As the group grows this will not really be practical.
Is there a "local" install of Databricks that can be installed for development purposes (it doesn't need to be a scalable version but does need to be essentially fully featured)? In other words, is there a way each developer can create their own development instance of Databricks on their local machine?
Is there another way to provide a dedicated Databricks environment for each developer?
Databricks, as a cloud-deployed platform, leverages many cloud technologies in its deployment. For example, Auto Loader incrementally ingests new data files as they arrive, using EventBridge, SNS and S3 on AWS, while on Azure it uses Event Hubs, Notification Hubs and ADLS. Databricks aims to provide a seamless look and feel across AWS, Azure and GCP, but it can do this only in the cloud.
For local deployment, you may be able to use Apache Spark and MLflow to create a similar experience, but the notebook experience isn't open source, and the Databricks workflow is proprietary, even though Databricks has open-sourced many of its technologies, such as Delta Lake. Local Spark and MLflow may suffice for some, who can then use the cloud sparingly, but the seamless workflow offered by Databricks is challenging to replicate outside of the leading cloud vendors.
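As an illustration of the local Spark route mentioned above, here is a minimal sketch of a local SparkSession with Delta Lake enabled (assuming pyspark and the delta-spark package are installed); it approximates part of the runtime, but not the notebook or workspace experience.

from delta import configure_spark_with_delta_pip  # pip install delta-spark
from pyspark.sql import SparkSession

# Local development session with Delta Lake support (no Databricks workspace features).
builder = (
    SparkSession.builder.appName("local-dev")
    .master("local[*]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write and read a small Delta table locally.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/demo_delta")
spark.read.format("delta").load("/tmp/demo_delta").show()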

Access databricks secrets in pyspark/python job

Databricks secrets can be accessed within notebooks using dbutils; however, since dbutils is not available outside notebooks, how can one access secrets in PySpark/Python jobs, especially if they are run using MLflow?
I have already tried How to load databricks package dbutils in pyspark,
which does not work for remote jobs or MLflow project runs.
In raw PySpark you cannot do this. However, if you are developing a PySpark application specifically for Databricks, then I strongly recommend you look at Databricks Connect.
It allows access to parts of dbutils, including secrets, from an IDE. It also simplifies how you access storage so that it aligns with how the code will run in production.
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect
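For reference, a minimal sketch of reading a secret through Databricks Connect; it assumes Databricks Connect is already configured for the workspace and that the scope and key names (placeholders here) exist.

from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils  # available when running via Databricks Connect

# Assumes Databricks Connect has already been configured (databricks-connect configure).
spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)

# Hypothetical scope/key names -- replace with your own secret scope and key.
password = dbutils.secrets.get(scope="my-scope", key="my-key")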

Custom Script in Azure Data Factory & Azure Databricks

I have a requirement to parse a lot of small files and load them into a database in a flattened structure. I prefer to use ADF V2 and SQL Database to accomplish this. The file-parsing logic is already available as a Python script, and I want to orchestrate it in ADF. I can see an option of using the Python notebook connector to Azure Databricks in ADF V2. Will I be able to just run a plain Python script in Azure Databricks through ADF? If I do so, will the script run only on the Databricks cluster's driver and therefore not utilize the cluster's full capacity? I am also thinking of calling Azure Functions. Please advise which one is more appropriate in this case.
Just to offer some ideas for your reference.
Firstly, you are talking about notebooks and Databricks, which means ADF's own Copy activity and Data Flow can't meet your needs, since as far as I know ADF supports only a simple flatten feature. If you haven't tried that, please try it first.
Secondly, if you do have requirements beyond ADF's features, why not leave ADF out? Notebooks and Databricks don't have to be used with ADF, so why pay the extra cost? For a notebook, you have to install packages yourself, such as pysql or pyodbc. With Azure Databricks you can mount Azure Blob Storage and access those files as a file system. In addition, I suppose you don't need many workers for the cluster, so configure it with a maximum of 2.
Databricks is more suitable for managing this as a job, I think.
Azure Functions could also be an option. You could create a blob trigger and load the files into one container. Of course, you will have to learn the basics of Azure Functions if you are not familiar with them; however, Azure Functions could be more economical.
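If the Azure Functions route sounds appealing, here is a minimal sketch of a Python blob-triggered function (v1 programming model) that could host the parsing logic; the binding name and the container it watches are assumptions defined in the accompanying function.json, and the parsing/loading code is left as a placeholder.

import logging

import azure.functions as func

# Blob trigger entry point (v1 programming model); the binding name "myblob"
# and the monitored container are defined in the accompanying function.json.
def main(myblob: func.InputStream) -> None:
    logging.info("Processing blob: %s (%d bytes)", myblob.name, myblob.length)
    content = myblob.read()
    # ... parse the small file here and write the flattened rows to SQL Database,
    # e.g. with pyodbc (placeholder -- parsing/loading logic not shown).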
