Azure Synapse spark read from default storage - apache-spark

We are working on an Azure Synapse Analytics project with a CI/CD pipeline. I want to read data with a serverless Spark pool from a storage account, but without specifying the storage account name. Is this possible? We are using the default storage account, but a separate container for the data lake data.
I can read data with spark.read.parquet('abfss://{container_name}@{account_name}.dfs.core.windows.net/filepath.parquet'), but since the name of the storage account differs between dev, test, and prod, this would need to be parameterized, and I would like to avoid that if possible. Is there any native Spark way to do this? I found some documentation about doing this with pandas and fsspec, but not with Spark alone.
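One possible workaround (a sketch under assumptions, not something stated in the thread) is to read the workspace's default filesystem from the Hadoop configuration at runtime and build the abfss path from it, so the account name never has to be hard-coded per environment. This assumes the Synapse Spark pool exposes the primary ADLS Gen2 account via fs.defaultFS; the container name below is hypothetical.

# Sketch: derive the default storage account from the Hadoop configuration
# instead of hard-coding it per environment (dev/test/prod).
# Assumes fs.defaultFS looks like abfss://container@account.dfs.core.windows.net
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
default_fs = hadoop_conf.get("fs.defaultFS")

account_host = default_fs.split("@")[1].rstrip("/")   # account.dfs.core.windows.net
container_name = "datalake"                           # hypothetical container name

df = spark.read.parquet(f"abfss://{container_name}@{account_host}/filepath.parquet")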

Related

Is it possible to use an existing storage account to create a Databricks workspace?

We need to create a shared storage account for Synapse and Databricks. However, while we can use existing storage accounts in Synapse, Databricks creates a separate resource group on its own and there is no option to use an existing one. Also, why does the managed resource group created by Databricks have locks on it?
There are two things regarding storage accounts & Databricks:
Databricks automatically creates a storage account for each workspace to hold the so-called DBFS Root. This storage account is meant to keep only temporary data, libraries, cluster logs, models, etc. It's not designed to hold production data, as this storage account isn't accessible outside of the Databricks workspace.
Databricks can work with storage accounts created outside of the workspace (documentation) - just create a dedicated storage account to keep your data and access it using the abfss protocol as described in the documentation, or mount it into the workspace (although this is no longer recommended). You can then access that storage account from Synapse and other tools as well.
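As an illustration of the abfss approach (my own sketch, not code from the original answer), the snippet below reads from an externally created ADLS Gen2 account using a service principal. The storage account, container, secret scope/key, client id, and tenant id are all placeholders.

# Sketch: access an externally created ADLS Gen2 account from Databricks via abfss,
# authenticating with a service principal (OAuth). All names below are placeholders.
storage_account = "mydatalake"   # hypothetical storage account
container = "data"               # hypothetical container

client_secret = dbutils.secrets.get(scope="my-scope", key="sp-secret")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               "<application-client-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

df = spark.read.parquet(f"abfss://{container}@{storage_account}.dfs.core.windows.net/path/to/data")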

Is there a simple way to ETL from Azure Blob Storage to Snowflake EDW?

I have the following ETL requirements for Snowflake on Azure and would like to implement the simplest possible solution because of timeline and technology constraints.
Requirements :
Load CSV data (only a few MBs) from Azure Blob Storage into Snowflake Warehouse daily into a staging table.
Transform the loaded data above within Snowflake itself where transformation is limited to just a few joins and aggregations to obtain a few measures. And finally, park this data into our final tables in a Datamart within the same Snowflake DB.
Lastly, automate the above pipeline using a schedule OR using an event based trigger (i.e. steps to kick in as soon as file lands in Blob Store).
Constraints :
We cannot use Azure Data Factory to achieve this simple design.
We cannot use Azure Functions to deploy Python Transformation scripts and schedule them either.
Also, I found that transformation using Snowflake SQL is a limited feature: it only allows certain operations as part of the COPY INTO command and does not support JOINs and GROUP BY. Furthermore, although the following thread suggests that scheduling SQL is possible, it doesn't address my transformation requirement.
Regards,
Roy
Attaching the following Idea diagram for more clarity.
https://community.snowflake.com/s/question/0D50Z00009Z3O7hSAF/how-to-schedule-jobs-from-azure-cloud-for-loading-data-from-blobscheduling-snowflake-scripts-since-dont-have-cost-for-etl-tool-purchase-for-scheduling
https://docs.snowflake.com/en/user-guide/data-load-transform.html#:~:text=Snowflake%20supports%20transforming%20data%20while,columns%20during%20a%20data%20load.
You can create a Snowpipe on Azure Blob Storage. Once the Snowpipe is created on top of your Azure Blob Storage container, it will monitor the container and load each file into your staging table as soon as a new file arrives. After the data is copied into the staging table, you can schedule the transformation SQL using a Snowflake task.
You can refer to the Snowpipe creation steps for Azure Blob Storage in the link below:
Snowpipe on Microsoft Azure Blob Storage
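As a rough illustration of that flow (my own sketch, not from the original answer), the snippet below uses the Snowflake Python connector to create an auto-ingest pipe and a scheduled task. The stage, notification integration, tables, columns, warehouse, and schedule are all hypothetical, and the external stage plus Azure notification integration are assumed to exist already.

# Sketch: Snowpipe + Task setup via the Snowflake Python connector.
# All object names (stage, integration, tables, columns, warehouse) are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="<password>",
    warehouse="MY_WH", database="MY_DB", schema="STAGING",
)
cur = conn.cursor()

# Auto-ingest pipe: loads new blobs from an external Azure stage into the staging table.
cur.execute("""
    CREATE PIPE IF NOT EXISTS csv_pipe
      AUTO_INGEST = TRUE
      INTEGRATION = 'AZURE_NOTIFICATION_INT'
      AS COPY INTO staging_table
         FROM @azure_blob_stage
         FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
""")

# Scheduled task: runs the joins/aggregations inside Snowflake and writes to the datamart.
cur.execute("""
    CREATE TASK IF NOT EXISTS daily_transform
      WAREHOUSE = MY_WH
      SCHEDULE = 'USING CRON 0 6 * * * UTC'
      AS INSERT INTO datamart.final_measures
         SELECT d.category_name, SUM(s.amount) AS total_amount
         FROM staging_table s
         JOIN dim_category d ON d.category_id = s.category_id
         GROUP BY d.category_name
""")
cur.execute("ALTER TASK daily_transform RESUME")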

SPARK : How to access AzureFileSystemInstrumentation when using azure blob storage with spark cluster?

I am working on a Spark project where the storage sink is Azure Blob Storage. I write data in Parquet format. I need some metrics around storage, e.g. numberOfFilesCreated, writtenBytes, etc. While searching for this online I came across a metrics class in the hadoop-azure package called AzureFileSystemInstrumentation. I am not sure how to access it from Spark and can't find any resources on it. How would one access this instrumentation for a given Spark job?
Based on my experience, I think there are three solutions that can be used in your current scenario, as below.
Directly use the Hadoop API for HDFS to get HDFS metrics data in Spark, because hadoop-azure just implements the HDFS APIs on top of Azure Blob Storage. Please see the official Hadoop documentation on Metrics to find the particular metrics you want, such as CreateFileOps or FilesCreated, to get numberOfFilesCreated (see the sketch after this list). There is also a similar SO thread, How do I get HDFS bytes read and write for Spark applications?, which you can refer to.
Directly use the Azure Storage SDK for Java (or another language you use) to write a program that computes statistics over the files stored as blobs in Azure Blob Storage, ordered by creation timestamp or other properties. Please refer to the official document Quickstart: Azure Blob storage client library v8 for Java to learn how to use the SDK.
Use an Azure Function with a Blob Trigger to monitor file-created events in Azure Blob Storage, and write code that computes statistics on every blob-created event. Please refer to the official document Create a function triggered by Azure Blob storage to learn how to use the Blob Trigger. From the Blob Trigger function you can also send whichever metrics you want to Azure Table Storage, Azure SQL Database, or other services for later analysis.
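For the first option, here is a minimal PySpark sketch (my own illustration, not from the original answer) that computes file count and total bytes for an output path through the Hadoop FileSystem API. It does not read AzureFileSystemInstrumentation itself, but derives equivalent numbers directly; the wasbs container/account names and output path are placeholders.

# Sketch: use the Hadoop FileSystem API from PySpark to get file count and
# total bytes under an output path. Container/account names are placeholders.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
out_path = spark._jvm.org.apache.hadoop.fs.Path(
    "wasbs://mycontainer@myaccount.blob.core.windows.net/output/")
fs = out_path.getFileSystem(hadoop_conf)

summary = fs.getContentSummary(out_path)   # walks the path and aggregates stats
print("numberOfFilesCreated:", summary.getFileCount())
print("writtenBytes:", summary.getLength())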

Connection between Azure Data Factory and Databricks

I'm wondering what the most appropriate way is to access Databricks from Azure Data Factory.
Currently I've got Databricks as a linked service, which I access via a generated token.
What do you want to do?
Do you want to trigger a Databricks notebook from ADF?
Do you want to supply Databricks with data? (blob storage or Azure Data Lake store)
Do you want to retrieve data from Databricks? (blob storage or Azure Data Lake store)

Copy Data Factory selected pipelines from Azure one subscription to other

I have a data pipeline in Azure Data Factory which copies files from an AWS S3 bucket to Azure Data Lake Gen 2. To build this pipeline I created various resources: Azure Data Lake Gen 2 storage, a file system in ADLS with specific permissions, a Data Factory, a source dataset which connects to the S3 bucket, and a target dataset which connects to an ADLS Gen2 folder.
All of these were created in a Dev subscription in Azure, but now I want to deploy these resources to a Prod subscription with the least manual effort. I tried the ARM template approach, which does not allow me to selectively choose the pipelines for migration; it copies everything present in the data factory, which I don't want, since I may have other pipelines that are still in development and that I do not want migrated to Prod. I tried the PowerShell approach too, which also has some limitations.
I would need expert advice on what the best way is to migrate the code from one subscription to the other.
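One option worth illustrating (my own sketch, not an answer given in the thread) is to copy individual pipeline definitions programmatically with the azure-mgmt-datafactory Python SDK. The subscription IDs, resource groups, factory names, and pipeline name below are placeholders, and any linked services or datasets the pipeline depends on would need to be copied the same way or pre-created in Prod.

# Sketch: selectively copy a single ADF pipeline across subscriptions using the
# azure-mgmt-datafactory SDK. All names are placeholders; dependent datasets and
# linked services must already exist (or be copied similarly) in the target factory.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

cred = DefaultAzureCredential()
dev = DataFactoryManagementClient(cred, subscription_id="<dev-subscription-id>")
prod = DataFactoryManagementClient(cred, subscription_id="<prod-subscription-id>")

pipeline = dev.pipelines.get("dev-rg", "dev-datafactory", "s3_to_adls_pipeline")
prod.pipelines.create_or_update("prod-rg", "prod-datafactory",
                                "s3_to_adls_pipeline", pipeline)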
