Creating Spark job in Data Lake instead of a U-SQL job - azure

Is it possible to create a Spark job in Data Lake instead of a U-SQL job?

Here are the options for big data services in Azure:
Azure Data Lake Analytics currently has only U-SQL jobs and not Spark.
Azure HDInsight supports Spark jobs.
Azure Databricks supports Spark jobs.
**Cloudera on Azure** supports Spark jobs.
**Hortonworks on Azure** supports Spark jobs.

Related

Azure synapse equivalent commands

I am looking for the equivalents of the commands below in an Azure Synapse Analytics notebook. The commands are from Databricks:
spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
These are Delta cache and auto-optimization features that apply to Databricks only. Databricks is code oriented, while Synapse allows UI-based analytics.
Synapse is a separate data analytics service provided by Azure, with a different feature set. Databricks and Synapse are not required to share the same features.
Synapse Apache Spark allows you to optimize components like data serialization, joins, shuffle, job execution, etc. These components differ from those in Databricks, but the objectives are quite similar.
You can learn more in "Optimize Apache Spark jobs in Azure Synapse Analytics".
Got the solution
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1 * 1024 * 1024 * 1024)
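A minimal sketch of applying several session-level settings at once, in case more than the broadcast threshold above needs tuning. The helper names (`mib_to_bytes`, `apply_spark_conf`) and the `spark.sql.shuffle.partitions` value are illustrative assumptions, not part of the original answer; `spark` is the session a notebook already provides.

```python
def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to bytes."""
    return mib * 1024 * 1024


def apply_spark_conf(spark, settings: dict) -> None:
    """Set each key/value pair on the session's runtime configuration."""
    for key, value in settings.items():
        # Runtime configuration values are stored as strings.
        spark.conf.set(key, str(value))


# Values chosen only for illustration.
SETTINGS = {
    # Same 1 GiB threshold as in the answer above.
    "spark.sql.autoBroadcastJoinThreshold": mib_to_bytes(1024),
    # Assumed example value; tune for the actual workload.
    "spark.sql.shuffle.partitions": 200,
}

# In a notebook, where `spark` already exists:
# apply_spark_conf(spark, SETTINGS)
```

Keeping the settings in one dictionary makes it easier to see which tuning knobs a notebook depends on.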

Install sql-spark-connector library to Azure Synapse Apache Spark

I am trying to install the Apache Spark Connector for SQL Server and Azure SQL to use transactional data in big data analytics and persist results for ad-hoc queries or reporting. The connector allows you to use any SQL database, on-premises or in the cloud, as an input data source or output data sink for Spark jobs.
The Spark SQL connector is located here: https://github.com/microsoft/sql-spark-connector
Can someone let me know how to import it in Azure Synapse Apache Spark?
As per the conversation with Synapse Product Group:
You don't need to add the Apache Spark connector jar files or the com.microsoft.sqlserver.jdbc.spark package to your Synapse Spark pool. The connector is available out of the box for Spark 2.4; for Spark 3.1 it will most likely be in production in the upcoming weeks.
For more details, refer to the Microsoft Q&A thread that addresses a similar issue.
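Once the connector is on the pool, using it might look like the sketch below. It assumes the data source name `com.microsoft.sqlserver.jdbc.spark` mentioned above and the usual JDBC-style options (`url`, `dbtable`, `user`, `password`); the helper names and the server, database and table values are hypothetical placeholders.

```python
# Data source name of the connector, as mentioned in the answer above.
CONNECTOR_FORMAT = "com.microsoft.sqlserver.jdbc.spark"


def sql_connector_options(server, database, table, user, password):
    """Build the option map passed to the Spark reader or writer."""
    return {
        "url": f"jdbc:sqlserver://{server}:1433;databaseName={database};",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_table(spark, options):
    """Load a SQL table into a DataFrame through the connector."""
    return spark.read.format(CONNECTOR_FORMAT).options(**options).load()


def write_table(df, options, mode="append"):
    """Write a DataFrame to a SQL table through the connector."""
    df.write.format(CONNECTOR_FORMAT).mode(mode).options(**options).save()


# Hypothetical usage in a notebook where `spark` already exists:
# opts = sql_connector_options("myserver.database.windows.net", "sales",
#                              "dbo.orders", "<user>", "<password>")
# orders = read_table(spark, opts)
```

In practice the credentials would come from a secret store rather than being written into the notebook.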

Databricks notebooks lineage in Azure Purview

If I read a file from ADLS into a PySpark data frame and write it back to another ADLS folder in a different file format, will that lineage be captured in the Hive metastore? Can lineage be shown for this kind of operation?
Currently this lineage won't show up out of the box - however, Purview uses Atlas behind the scenes, thus you can probably capture this lineage using the API.
Here's an example of where Spline was used to track lineage from notebooks:
https://intellishore.dk/data-lineage-from-databricks-to-azure-purview/
This article talks about how to get started with the Purview REST API:
https://techcommunity.microsoft.com/t5/azure-architecture-blog/exploring-purview-s-rest-api-with-python/ba-p/2208058
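As a rough sketch of what "capturing lineage using the API" can mean: in Apache Atlas, lineage is expressed as a process entity whose `inputs` and `outputs` reference data entities. The function below only builds such a request body. The type names (`Process`, `azure_datalake_gen2_path`), the helper names and the qualified names are assumptions for illustration; the endpoint and authentication are not shown here.

```python
import json


def entity_ref(qualified_name, type_name="azure_datalake_gen2_path"):
    """Reference an existing catalog entity by its qualified name.

    The default type name is an assumption; use the type that your
    catalog actually assigns to the scanned asset.
    """
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def lineage_payload(name, qualified_name, inputs, outputs, type_name="Process"):
    """Build an Atlas-style bulk entity body linking inputs to outputs."""
    return {
        "entities": [
            {
                "typeName": type_name,
                "attributes": {
                    "name": name,
                    "qualifiedName": qualified_name,
                    "inputs": [entity_ref(q) for q in inputs],
                    "outputs": [entity_ref(q) for q in outputs],
                },
            }
        ]
    }


# Hypothetical example: a notebook converting CSV input to Parquet output.
payload = lineage_payload(
    name="csv_to_parquet",
    qualified_name="notebook://csv_to_parquet",
    inputs=["https://<account>.dfs.core.windows.net/raw/in.csv"],
    outputs=["https://<account>.dfs.core.windows.net/curated/out.parquet"],
)
print(json.dumps(payload, indent=2))
```

The input and output entities must already exist in the catalog under the same qualified names for the lineage edge to resolve.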
You can use the OpenLineage based Databricks to Purview Solution Accelerator to ingest the lineage provided by Databricks. By deploying the solution accelerator, you'll have a set of Azure Functions and a Databricks cluster that can extract the logical plan from a Databricks notebook / job and transform it automatically to Apache Atlas / Microsoft Purview entities.
Supports table level lineage from Spark Notebooks and jobs for the following data sources:
Azure SQL
Azure Synapse Analytics
Azure Data Lake Gen 2
Azure Blob Storage
Delta Lake
Supports Spark 3.1 and 3.0 (Interactive and Job clusters) / Spark 2.x (Job clusters)
Databricks Runtimes between 6.4 and 10.3 are currently supported
Can be configured per cluster or for all clusters as a global configuration
Once configured, does not require any code changes to notebooks or jobs

How to connect Spark Structured Streaming to blob/file creation events from Azure Data Lake Storage Gen2 or Blob Storage

I am new to Spark Structured Streaming and its concepts. I was reading through the documentation for Azure HDInsight clusters here, and it mentions that structured streaming applications run on an HDInsight cluster and connect to streaming data from .. Azure Storage, or Azure Data Lake Storage. I was looking at how to get started with a stream that listens for new-file-created events from the storage account or ADLS. The Spark documentation does provide an example, but I am looking for how to tie the stream to the blob/file creation event, so that I can store the file content in a queue from my Spark job. It would be great if anyone could help me out with this.
Happy to help you with this, but can you be more precise about the requirement? Yes, you can run Spark Structured Streaming jobs on Azure HDInsight. Basically, mount the Azure Blob Storage to the cluster and then you can directly read the data available in the blob.
val df = spark.read.option("multiLine", true).json("PATH OF BLOB")
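The read above is a one-off batch read. For picking up files as they are created, Structured Streaming's file source can watch a directory and process each new file it finds. The PySpark sketch below shows that approach under stated assumptions: the helper names, the container and account names, and the schema are hypothetical, and the `abfss://` scheme only works when the cluster has the ABFS driver available.

```python
def abfss_uri(container, account, path=""):
    """Build an ADLS Gen2 URI for the given container and storage account."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def stream_new_json_files(spark, source_uri, schema_ddl, max_files_per_trigger=1):
    """Return a streaming DataFrame over JSON files arriving in a directory.

    The file source needs an explicit schema; a DDL string is used here
    so that no extra imports are required.
    """
    return (
        spark.readStream
        .schema(schema_ddl)
        .option("maxFilesPerTrigger", max_files_per_trigger)
        .option("multiLine", True)
        .json(source_uri)
    )


# Hypothetical usage in a notebook where `spark` already exists:
# source = abfss_uri("landing", "mystorageaccount", "events/")
# events = stream_new_json_files(spark, source, "id STRING, body STRING")
# query = (events.writeStream
#          .format("console")
#          .option("checkpointLocation", "<checkpoint path>")
#          .start())
```

This polls the directory for new files rather than subscribing to storage events, and the console sink stands in for whatever queue the job should write to.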
Azure Data Lake Gen2 (ADL2) has been released for Hadoop 3.2 only. Open source Spark 2.4.x supports Hadoop 2.7 and, if you compile it yourself, Hadoop 3.1. Spark 3 will support Hadoop 3.2, but it is not released yet (only a preview release).
Databricks offers support for ADL2 natively.
My solution to tackle this problem was to manually patch and compile Spark 2.4.4 with Hadoop 3.2 to be able to use the ADL2 libs from Microsoft.

CDAP with Azure Databricks

Has anyone tried using Azure Databricks as the Spark cluster for CDAP job processing? The CDAP documentation details how to add it to Azure HDInsight, but I am wondering whether there is a way to configure CDAP to point to a Databricks Spark cluster. Is it even possible, or does this kind of integration need a specific Databricks client connector jar? Any insights would be helpful.
There is no out-of-the-box support for Databricks Spark on Azure. That said, you can develop a new Cloud Runtime that is capable of submitting jobs to a Databricks Spark cluster. Here is an example of how to write a runtime extension for Cloud Dataproc and EMR.
