How to run Spark SQL queries on a Databricks cluster from Linux?

I want to execute Spark SQL commands from a Linux machine against a Databricks cluster. Is there any way to achieve this?
I have a set of Spark SQL commands in a .sql file and want to execute this file against a Databricks cluster from a Linux machine.
I am looking for something analogous to SQL*Plus, where we connect to a database and execute SQL; is there a similar utility/solution for executing Spark SQL against a Databricks cluster?

You can connect to a Databricks cluster using the ODBC, JDBC, HTTP, or Thrift protocols. In every case you will need an access token with sufficient permissions.
I am using IntelliJ DataGrip to connect via JDBC. I had to configure the Databricks JDBC driver and used this URI:
jdbc:spark://mycompany.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<MY-DATABRICKS-ORGANIZATION-ID>/<MY-DATABRICKS-CLUSTER-ID>;AuthMech=3;UID=token;PWD=<MY-DATABRICKS-TOKEN>
I believe any modern SQL client should be able to connect, since Databricks exposes standard interfaces.
Here is the official documentation from Databricks:
https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html
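For the SQL*Plus-style workflow asked about above (run a .sql file from a Linux shell), a minimal sketch using the databricks-sql-connector Python package could look like the following; the hostname, HTTP path, token, and the naive statement splitting are placeholders and assumptions, not a vetted tool:

    # run_sql_file.py - execute the statements of a .sql file on a Databricks cluster.
    # Requires: pip install databricks-sql-connector
    # Hostname, HTTP path and token below are placeholders.
    import sys
    from databricks import sql

    connection = sql.connect(
        server_hostname="mycompany.cloud.databricks.com",
        http_path="sql/protocolv1/o/<MY-ORGANIZATION-ID>/<MY-CLUSTER-ID>",
        access_token="<MY-DATABRICKS-TOKEN>",
    )
    cursor = connection.cursor()

    # Naive split on ';' -- good enough for simple scripts without
    # semicolons inside string literals.
    for statement in open(sys.argv[1]).read().split(";"):
        if statement.strip():
            cursor.execute(statement)
            if cursor.description:  # only SELECT-like statements return rows
                for row in cursor.fetchall():
                    print(row)

    cursor.close()
    connection.close()

Invoked as python run_sql_file.py my_queries.sql, this gives roughly the SQL*Plus-style batch behaviour the question describes.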

Related

Why is there no Spark connector for Databricks?

I would like to read data stored in Databricks with Spark running outside of Databricks, but it looks like there is no Spark connector for Databricks available. The Snowflake Connector for Spark is an example; I am looking for something similar for Databricks.
May I know what you mean by connectors?
Databricks itself runs the Spark clusters, and you can attach notebooks or run a Spark job on top of them.
https://docs.databricks.com/notebooks/index.html
If you want to connect your local machine to a Databricks cluster and do development there, you can try Databricks Connect:
https://docs.databricks.com/dev-tools/databricks-connect.html
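As a rough illustration (not part of the answer above): once databricks-connect is installed and configured with your workspace details, the standard SparkSession entry point targets the remote cluster; the table name below is a placeholder:

    # With databricks-connect configured (pip install databricks-connect,
    # then `databricks-connect configure`), this SparkSession executes
    # against the remote Databricks cluster, not a local Spark install.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Runs on the Databricks cluster; "default.my_table" is a placeholder.
    spark.sql("SELECT COUNT(*) FROM default.my_table").show()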

Is there a way to access the internal metastore of Azure HDInsight to run queries on the Hive metastore tables?

I am trying to access the internal Hive metastore tables like HIVE.SDS, HIVE.TBLS, etc.
I have an HDInsight Hadoop cluster running with the default internal metastore. From the Ambari screen, I got the advanced settings required for connections:
javax.jdo.option.ConnectionDriverName, javax.jdo.option.ConnectionURL, javax.jdo.option.ConnectionUserName, as well as the password.
When I try connecting to the SQL Server instance (the internal Hive metastore) from a local machine, I get a message asking me to add my IP address to the allowed list. However, since this Azure SQL server is not visible in the list of Azure SQL databases in the portal, it is not possible for me to whitelist my IP.
So I tried logging in to the cluster via the secure shell user (sshuser) and accessing the HIVE database from within the cluster using the metastore credentials provided in Ambari. I am still not able to access it. I am using sqlcmd to connect to SQL Server.
Does HDInsight prevent direct access to internal metastores? Is an external metastore the only way to move ahead? Any leads would be helpful.
Update: I created an external SQL Server instance, used it as an external metastore, and was able to access it programmatically.
No luck with the internal one yet.
There is no way to access the internal metastores of an HDInsight cluster. The internal metastores live in an internal subscription that only Microsoft product groups (PGs) are able to access.
If you want more control over your metastore, it is recommended to bring your own "external" metastore.
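To make that recommendation concrete: with your own external Azure SQL metastore, the HIVE.TBLS / HIVE.SDS tables from the question become directly queryable. A hedged sketch using pyodbc (server, database, and credentials are placeholders; the internal HDInsight metastore stays off limits either way):

    # Query Hive metastore system tables (TBLS, SDS, ...) on an external
    # Azure SQL metastore. Server, database and credentials are placeholders.
    # Requires: pip install pyodbc (plus the Microsoft ODBC driver installed).
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=<my-server>.database.windows.net;"
        "DATABASE=<my-metastore-db>;"
        "UID=<user>;PWD=<password>"
    )
    cursor = conn.cursor()
    # TBL_ID, TBL_NAME and TBL_TYPE are standard columns of the Hive
    # metastore TBLS table.
    for row in cursor.execute("SELECT TBL_ID, TBL_NAME, TBL_TYPE FROM TBLS"):
        print(row)
    conn.close()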

Common metastore across Databricks clusters

I have 3-4 clusters in my Databricks instance on the Azure cloud platform. I want to maintain a common metastore for all the clusters. Let me know if anyone has implemented this.
I recommend configuring an external Hive metastore. By default, Databricks spins up its own metastore behind the scenes, but you can create your own database (Azure SQL works, as do MySQL and Postgres) and specify it during cluster startup.
Here are detailed steps:
https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore
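To sketch what "specify it during cluster startup" amounts to: the cluster's Spark config carries the JDO connection settings for the shared database. The snippet below creates such a cluster through the Databricks clusters REST API; every value (workspace URL, JDBC URL, credentials, versions, node type) is a placeholder, and the linked page above is the authoritative reference for the keys:

    # Sketch: create a cluster whose Spark config points at an external
    # Hive metastore (Azure SQL). All values below are placeholders.
    # Requires: pip install requests
    import requests

    spark_conf = {
        # JDO settings the Hive client uses to reach the external metastore DB.
        "spark.hadoop.javax.jdo.option.ConnectionURL":
            "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>",
        "spark.hadoop.javax.jdo.option.ConnectionDriverName":
            "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "spark.hadoop.javax.jdo.option.ConnectionUserName": "<user>",
        "spark.hadoop.javax.jdo.option.ConnectionPassword": "<password>",
        # Hive metastore client version/jars; must match your metastore schema.
        "spark.sql.hive.metastore.version": "2.3.7",
        "spark.sql.hive.metastore.jars": "builtin",
    }

    resp = requests.post(
        "https://<my-workspace>.azuredatabricks.net/api/2.0/clusters/create",
        headers={"Authorization": "Bearer <MY-DATABRICKS-TOKEN>"},
        json={
            "cluster_name": "shared-metastore-cluster",
            "spark_version": "9.1.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
            "spark_conf": spark_conf,
        },
    )
    print(resp.status_code, resp.json())
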
Things to be aware of:
The Data tab in Databricks lets you pick a cluster and browse the metastore it is connected to.
To avoid using a SQL username and password, look at managed identities: https://learn.microsoft.com/en-us/azure/stream-analytics/sql-database-output-managed-identity
Automate the external Hive metastore connection by using initialization scripts for your cluster.
Manage permissions on your sources; in the case of ADLS Gen2, consider using credential passthrough.

Apache Superset connecting to Databricks Delta Lake

I am trying to read data from Databricks Delta Lake via Apache Superset. I can connect to Delta Lake with a JDBC connection string supplied by the cluster, but Superset seems to require a SQLAlchemy string, so I'm not sure what I need to do to get this working. Thank you, anything helps.
[screenshot: Superset database setup]
Have you tried this?
https://flynn.gg/blog/databricks-sqlalchemy-dialect/
Thanks to contributions by Evan Thomas, the Python databricks-dbapi package now supports using Databricks as a SQL dialect within SQLAlchemy. This is particularly useful for hooking up Databricks to a dashboard frontend application like Apache Superset. It provides compatibility with both standard Databricks and Azure Databricks.
Just use pyhive and you should be ready to connect to the Databricks Thrift JDBC server.
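For reference, what Superset's "SQLAlchemy URI" field expects with this approach is a databricks+pyhive URL. A hedged sketch of the engine side (the exact URI shape follows the databricks-dbapi README of that era; host, token, database, and cluster name are placeholders):

    # Sketch of the SQLAlchemy engine that Superset builds internally.
    # Requires: pip install databricks-dbapi[sqlalchemy]
    # URI shape per the databricks-dbapi README; all values are placeholders.
    from sqlalchemy import text
    from sqlalchemy.engine import create_engine

    engine = create_engine(
        "databricks+pyhive://token:<MY-DATABRICKS-TOKEN>"
        "@mycompany.cloud.databricks.com:443/default",
        connect_args={"cluster": "<MY-CLUSTER-NAME>"},
    )

    with engine.connect() as conn:
        for row in conn.execute(text("SHOW TABLES")):
            print(row)

Pasting the same databricks+pyhive://... string into Superset's database URI field should then be enough, provided databricks-dbapi is installed in Superset's environment.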

Connecting a SparkSession to Azure Databricks

I'm using Maven and Scala to create a Spark application that needs to connect to a cluster on Azure Databricks.
How can I point my SparkSession at the Databricks cluster?
I saw databricks-connect, but it loads some jar files using sbt.
I don't understand how it achieves that connectivity exactly.
My use case requires running a Spark job programmatically on the Databricks cluster upon request, so I need to be able to connect to it from outside.
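databricks-connect aside, one common way to "run a Spark job on the cluster upon request" is to submit the packaged JAR through the Databricks Jobs API rather than pointing a local SparkSession at the cluster. A hedged sketch against the Jobs 2.1 runs/submit endpoint (workspace URL, cluster ID, class name, and JAR path are placeholders):

    # Sketch: trigger a one-off run of a Spark JAR on an existing
    # Databricks cluster via the Jobs API (runs/submit).
    # All identifiers below are placeholders.
    # Requires: pip install requests
    import requests

    resp = requests.post(
        "https://<my-workspace>.azuredatabricks.net/api/2.1/jobs/runs/submit",
        headers={"Authorization": "Bearer <MY-DATABRICKS-TOKEN>"},
        json={
            "run_name": "on-demand-spark-job",
            "tasks": [{
                "task_key": "main",
                "existing_cluster_id": "<MY-CLUSTER-ID>",
                "spark_jar_task": {
                    "main_class_name": "com.mycompany.MyJob",
                    "parameters": ["--date", "2021-01-01"],
                },
                # The application JAR must already be uploaded, e.g. to DBFS.
                "libraries": [{"jar": "dbfs:/jars/my-job.jar"}],
            }],
        },
    )
    print(resp.status_code, resp.json())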
