We are working with a combination of Azure ML and ADLS. Since our HDInsight cluster runs over ADLS, we are trying to use the Hive query and HDFS route and are running into problems.
We would appreciate help with reading data from a Hive query and writing it to HDFS. Below is the error URL for reference:
https://studioapi.azureml.net/api/sharedaccess?workspaceId=025ba20578874d7086e6c495cc49a3f2&signature=ZMUCNMwRjlrksrrmsrx5SaGedSgwMmO%2FfSHvq190%2F1I%3D&sharedAccessUri=https%3A%2F%2Fesprodussouth001.blob.core.windows.net%2Fexperimentoutput%2Fccf9a206-730d-4773-b44e-a2dd8c6e87b9%2Fccf9a206-730d-4773-b44e-a2dd8c6e87b9.txt%3Fsv%3D2015-02-21%26sr%3Db%26sig%3DHkuFm8B2Ba1kEWWIwanqlv%2FcQPWVz0XYveSsZnEa0Wg%3D%26st%3D2017-10-16T18%3A31%3A06Z%26se%3D2017-10-17T18%3A36%3A06Z%26sp%3Dr
Azure Machine Learning supports Hive but not over ADLS.
Related
I would like to read data that lives in Databricks with Spark running outside of Databricks, but it looks like there is no Spark connector for Databricks available. The Snowflake Connector for Spark is an example; I am looking for something similar for Databricks.
May I know what you mean by connectors?
Databricks itself has the Spark clusters running in it.
You can attach notebooks or run a Spark job on top of it.
https://docs.databricks.com/notebooks/index.html
If you want to connect your local machine to a Databricks cluster and do development, you can try Databricks Connect (see below):
https://docs.databricks.com/dev-tools/databricks-connect.html
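As a rough sketch of what that looks like once you have run `pip install databricks-connect` and `databricks-connect configure` against your workspace (the table name below is just a placeholder):

```python
# Minimal sketch of using Databricks Connect from a local machine.
# Assumes databricks-connect is installed and configured with your workspace
# URL, personal access token and cluster ID.
from pyspark.sql import SparkSession

# With Databricks Connect, this SparkSession is backed by the remote
# Databricks cluster rather than a local Spark instance.
spark = SparkSession.builder.getOrCreate()

# Example: read a table registered in the workspace metastore (placeholder
# name) and bring a small sample back to the local machine.
df = spark.table("default.my_table")
print(df.limit(10).toPandas())
```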
I am trying to install the Apache Spark Connector for SQL Server and Azure SQL to use transactional data in big data analytics and persist results for ad hoc queries or reporting. The connector allows you to use any SQL database, on-premises or in the cloud, as an input data source or output data sink for Spark jobs.
The Spark SQL connector is located here: https://github.com/microsoft/sql-spark-connector
Can someone let me know how to import it in Azure Synapse Apache Spark?
As per the conversation with the Synapse Product Group:
You don't need to add the Apache Spark connector JAR files or the com.microsoft.sqlserver.jdbc.spark package to your Synapse Spark pool. The connector is there out of the box for Spark 2.4, and for Spark 3.1 it will most likely be in production in the upcoming weeks.
For more details, refer to the Microsoft Q&A thread that addresses a similar issue.
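For reference, using the built-in connector from a Synapse Spark notebook looks roughly like the sketch below. The server, database, table and credential values are placeholders, and SQL authentication is assumed here (Azure AD options also exist):

```python
# Minimal sketch of reading and writing Azure SQL from a Synapse Spark pool
# using the built-in SQL Server / Azure SQL connector.
# `spark` is the SparkSession predefined in Synapse notebooks.
server = "myserver.database.windows.net"            # placeholder
database = "mydb"                                   # placeholder
url = f"jdbc:sqlserver://{server};databaseName={database}"

# Read a table into a DataFrame via the connector.
df = (spark.read
      .format("com.microsoft.sqlserver.jdbc.spark")
      .option("url", url)
      .option("dbtable", "dbo.my_table")
      .option("user", "sql_user")
      .option("password", "sql_password")
      .load())

# Write results back to another table (append mode).
(df.write
   .format("com.microsoft.sqlserver.jdbc.spark")
   .mode("append")
   .option("url", url)
   .option("dbtable", "dbo.my_results")
   .option("user", "sql_user")
   .option("password", "sql_password")
   .save())
```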
If I read a file from ADLS into a PySpark DataFrame and write it back to another ADLS folder in a different file format, will that lineage be captured in the Hive metastore? Can lineage be shown for this kind of operation?
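For concreteness, the kind of operation being asked about might look like the following sketch (storage account, container and folder names are placeholders):

```python
# Hypothetical example of the operation in question: read CSV from one
# ADLS Gen2 folder and write it back as Parquet to another folder.
# `spark` is the notebook's SparkSession; all paths are placeholders.
src = "abfss://raw@mystorageaccount.dfs.core.windows.net/input/sales.csv"
dst = "abfss://curated@mystorageaccount.dfs.core.windows.net/output/sales_parquet"

df = spark.read.option("header", "true").csv(src)
df.write.mode("overwrite").parquet(dst)
```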
Currently this lineage won't show up out of the box. However, Purview uses Atlas behind the scenes, so you can probably capture this lineage using the API.
Here's an example of where Spline was used to track lineage from notebooks:
https://intellishore.dk/data-lineage-from-databricks-to-azure-purview/
This article talks about how to get started with the Purview REST API:
https://techcommunity.microsoft.com/t5/azure-architecture-blog/exploring-purview-s-rest-api-with-python/ba-p/2208058
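As a rough sketch of calling the Purview (Atlas) REST API from Python, along the lines of that article: the account name below is a placeholder, and the token scope and endpoint shape follow the public Purview data-plane pattern, so verify them against the API reference for your setup.

```python
# Rough sketch: authenticate with azure-identity and call the Purview
# catalog (Atlas v2) REST API. Account name is a placeholder.
import requests
from azure.identity import DefaultAzureCredential

account = "my-purview-account"                       # placeholder
endpoint = f"https://{account}.purview.azure.com/catalog/api/atlas/v2"

credential = DefaultAzureCredential()
token = credential.get_token("https://purview.azure.net/.default").token
headers = {"Authorization": f"Bearer {token}"}

# Example: list the Atlas type definitions registered in the catalog.
resp = requests.get(f"{endpoint}/types/typedefs", headers=headers)
resp.raise_for_status()
print(resp.json().keys())
```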
You can use the OpenLineage-based Databricks to Purview Solution Accelerator to ingest the lineage provided by Databricks. By deploying the solution accelerator, you'll have a set of Azure Functions and a Databricks cluster that can extract the logical plan from a Databricks notebook or job and transform it automatically into Apache Atlas / Microsoft Purview entities.
Supports table-level lineage from Spark notebooks and jobs for the following data sources:
- Azure SQL
- Azure Synapse Analytics
- Azure Data Lake Gen 2
- Azure Blob Storage
- Delta Lake
Supports Spark 3.1 and 3.0 (interactive and job clusters) and Spark 2.x (job clusters).
Databricks Runtimes between 6.4 and 10.3 are currently supported.
Can be configured per cluster or for all clusters as a global configuration.
Once configured, it does not require any code changes to notebooks or jobs.
I am trying to read data from Databricks Delta Lake via Apache Superset. I can connect to Delta Lake with a JDBC connection string supplied by the cluster, but Superset seems to require a SQLAlchemy string, so I'm not sure what I need to do to get this working. Thank you, anything helps.
superset database setup
Have you tried this?
https://flynn.gg/blog/databricks-sqlalchemy-dialect/
Thanks to contributions by Evan Thomas, the Python databricks-dbapi
package now supports using Databricks as a SQL dialect within
SQLAlchemy. This is particularly useful for hooking up Databricks to a
dashboard frontend application like Apache Superset. It provides
compatibility with both standard Databricks and Azure Databricks.
Just use PyHive and you should be ready to connect to the Databricks Thrift JDBC server.
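As a rough sketch of what that connection can look like with the databricks-dbapi SQLAlchemy dialect (install with `pip install "databricks-dbapi[sqlalchemy]"`): the host, token, database and cluster values below are placeholders, and the exact query parameter to use (cluster name vs. HTTP path) is described in the linked blog post and the databricks-dbapi README, so verify it for your workspace.

```python
# Rough sketch of a SQLAlchemy connection to Databricks via databricks-dbapi.
# All credential and workspace values are placeholders.
from sqlalchemy import create_engine, text

token = "<personal-access-token>"
host = "<workspace-host>.azuredatabricks.net"
uri = f"databricks+pyhive://token:{token}@{host}:443/default?cluster=<cluster-name>"

engine = create_engine(uri)

# The same URI string can be pasted into Superset's "SQLAlchemy URI" field
# when registering the database.
with engine.connect() as conn:
    for row in conn.execute(text("SHOW TABLES")):
        print(row)
```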
The docs say "Every Databricks deployment has a central Hive metastore...", in addition to supporting an external metastore for existing Hive installations.
I have an Azure Databricks workspace with an underlying Spark cluster, and data files stored on DBFS and Blob Storage. Do I need an HDInsight cluster with an external metastore to be able to create and use Hive tables? Or can I use the above-mentioned central metastore to create Hive tables on data stored on DBFS or Blob Storage?
@Gadam no, you do not. Azure Databricks provisions its own Hive metastore, but if you are already using one with HDInsight, Databricks can be configured to use it as well (as an external metastore).
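As a minimal sketch of using the built-in central metastore over data on DBFS or Blob Storage (database, table and path names are placeholders):

```python
# Minimal sketch: register Hive tables in Databricks' built-in metastore
# over files on DBFS / mounted Blob Storage. All names and paths are placeholders.
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

# Option 1: write a DataFrame as a managed table in the central metastore.
df = spark.read.parquet("dbfs:/mnt/raw/events")
df.write.mode("overwrite").saveAsTable("analytics.events")

# Option 2: register an external (unmanaged) table over existing files.
spark.sql("""
  CREATE TABLE IF NOT EXISTS analytics.events_ext
  USING PARQUET
  LOCATION 'dbfs:/mnt/raw/events'
""")
```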