Azure Data Factory: Ingest - from Delta table to Postgres

I am currently creating an ingest pipeline to copy data from a delta table to a postgres table. When selecting the sink, I am asked to enable staging.
Direct copying data from Azure Databricks Delta Lake is only supported when sink dataset is DelimitedText, Parquet or Avro with Azure Blob Storage linked service or Azure Data Lake Storage Gen2, for other dataset or linked service, please enable staging
This will turn my pipeline into a 2-step process where my delta table data is copied to a staging location and then inserted into postgres from there. How can I take the delta table data and load it directly into postgres using an ingest pipeline in ADF without staging? Is this possible?

As suggested by Karthikeyan Rasipalay Durairaj in the comments, you can copy data directly from Databricks to PostgreSQL.
To copy data from Azure Databricks to PostgreSQL, use the code below:
df.write.option('driver', 'org.postgresql.Driver').jdbc(url_connect, table, mode, properties)
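For a fuller picture, here is a minimal sketch run from a Databricks notebook (all connection details, secret scope, and table names are hypothetical): it reads the Delta table and writes it straight to Postgres over JDBC, so no ADF staging is involved.

# Minimal sketch (hypothetical names throughout); `spark` and `dbutils` are
# provided by the Databricks notebook environment.
df = spark.read.table("my_schema.my_delta_table")   # or spark.read.format("delta").load("/mnt/path")

jdbc_url = "jdbc:postgresql://myserver.postgres.database.azure.com:5432/mydb"

(df.write
   .format("jdbc")
   .option("driver", "org.postgresql.Driver")
   .option("url", jdbc_url)
   .option("dbtable", "public.my_target_table")
   .option("user", dbutils.secrets.get("my-scope", "pg-user"))
   .option("password", dbutils.secrets.get("my-scope", "pg-password"))
   .mode("append")
   .save())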
Staged copy from delta lake
When your sink data store or format does not match the direct copy criteria, the service enables a built-in staged copy using an interim Azure storage instance. The staged copy feature also gives you better throughput. The service exports data from Azure Databricks Delta Lake into the staging storage, then copies the data to the sink, and finally cleans up your temporary data from the staging storage.
Direct copy from delta lake
If your sink data store and format meet the criteria described below, you can use the Copy activity to copy directly from an Azure Databricks Delta table to the sink.
• The sink linked service is Azure Blob storage or Azure Data Lake Storage Gen2. The account credential should be pre-configured in Azure Databricks cluster configuration.
• The sink data format is Parquet, delimited text, or Avro, and it points to a folder instead of a file.
• In the Copy activity source, additionalColumns is not specified.
• If copying data to delimited text, in the copy activity sink, fileExtension needs to be ".csv".
Refer to this documentation.

Related

Keep staging Blobs in Data Lake after copy activity

I've been copying data into Synapse using the copy data functionality in Azure Data Factory (PolyBase), with staging enabled to stage the data in our Azure data lake. However, once the copy into Synapse is complete, the staging files in our Azure data lake get deleted.
Is there any way to keep the staged files in the data lake after the copy activity has finished, rather than deleting them?
As per the document, the staged data gets cleaned up after the data movement to the sink is completed.
Copy activity performance optimization features - Azure Data Factory & Azure Synapse | Microsoft Docs
If you want to keep a copy of the data in Data Lake Storage, you can add another copy activity that stores the data in Data Lake Storage as the sink data store.

Copy Data from Azure Data Lake to Snowflake without a stage using Azure Data Factory

All the Azure Data Factory examples of copying data from Azure Data Lake Gen2 to Snowflake use a storage account as a stage. If the stage is not configured (as shown in the picture), I get this error in Data Factory even when my source is a CSV file in Azure Data Lake: "Direct copying data to Snowflake is only supported when source dataset is DelimitedText, Parquet, JSON with Azure Blob Storage or Amazon S3 linked service, for other dataset or linked service, please enable staging".
At the same time, the Snowflake documentation says the external stage is optional. How can I copy data from Azure Data Lake to Snowflake using Data Factory's Copy Data activity without having an external storage account as a stage?
If staging storage is needed to make it work, we shouldn't say that copying data from Data Lake to Snowflake is supported. It works only when the Data Lake data is first copied into a storage blob and then into Snowflake.
Though Snowflake supports Blob storage, Data Lake Storage Gen2, and general-purpose v1 & v2 storage accounts, loading data into Snowflake is supported through Blob storage only.
The source linked service must be Azure Blob storage with shared access signature (SAS) authentication. If you want to copy data directly from Azure Data Lake Storage Gen2 in a supported format, you can create an Azure Blob storage linked service with SAS authentication against your ADLS Gen2 account, to avoid using staged copy to Snowflake.
Select Azure Blob storage as the linked service and provide the SAS URI details of the Azure Data Lake Gen2 source file.
Blob storage linked service with Data Lake Gen2 file:
You'll have to configure Blob storage and use it as staging. As an alternative, you can use an external stage: create a FILE FORMAT and a STORAGE INTEGRATION, point them at ADLS, and load the data into Snowflake using the COPY command. Let me know if you need more help on this.
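For the external-stage route, a minimal sketch using the snowflake-connector-python package, assuming a storage integration, an external stage named adls_stage, and a file format named my_csv_format have already been created against the ADLS account (all names are hypothetical):

# Minimal sketch (hypothetical names): bulk-load files already sitting in ADLS
# into Snowflake via an external stage and the COPY command, with no ADF staging blob.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.west-europe.azure",  # hypothetical account locator
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
try:
    conn.cursor().execute("""
        COPY INTO my_target_table
        FROM @adls_stage/path/to/files/
        FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
    """)
finally:
    conn.close()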

Is it possible to access Databricks DBFS from Azure Data Factory?

I am trying to use the Copy Data Activity to copy data from Databricks DBFS to another place on the DBFS, but I am not sure if this is possible.
When I select Azure Databricks Delta Lake as a dataset source or sink, I am able to access the tables in the cluster and preview the data, but when validating it says that the tables are not Delta tables (which they aren't, but I don't seem to be able to access the persistent data on DBFS).
Furthermore, what I want to access is the DBFS, not the cluster tables. Is there an option for this?

How to load array<string> data type from parquet file stored in Amazon S3 to Azure Data Warehouse?

I am working with parquet files stored on Amazon S3. These files need to be extracted, and the data from them needs to be loaded into Azure Data Warehouse.
My plan is:
Amazon S3 -> Use SAP BODS to move parquet files to Azure Blob -> Create External tables on those parquet files -> Staging -> Fact/ Dim tables
Now the problem is that in one of the parquet files there is a column that is stored as an array<string>. I am able to create an external table on it using the varchar data type for that column, but if I perform any SQL query operation (i.e. SELECT) on that external table, it throws the error below.
Msg 106000, Level 16, State 1, Line 3
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered
filling record reader buffer: ClassCastException: optional group
status (LIST) {
  repeated group bag {
    optional binary array_element (UTF8);
  }
} is not primitive
I have tried different data types but am unable to run a SELECT query on that external table.
Please let me know if there are any other options.
Thanks
On Azure, there is a service named Azure Data Factory which I think can be used in your current scenario, as the document Parquet format in Azure Data Factory says below.
Parquet format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP.
And you can try to follow the tutorial Load data into Azure SQL Data Warehouse by using Azure Data Factory to set Amazon S3 with Parquet format as the source and copy data directly to Azure SQL Data Warehouse. Because it reads the data from the Parquet file with automatic schema parsing, your task should be straightforward with Azure Data Factory.
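If the nested array<string> column still blocks the load (PolyBase external tables only handle primitive Parquet types, which is what the ClassCastException above is complaining about), one workaround outside the pure ADF route is to flatten that column with Spark before staging the files. A minimal sketch, assuming a hypothetical column named status and hypothetical paths:

# Minimal sketch (hypothetical paths/column names): rewrite the Parquet files
# with the array<string> column collapsed into a single delimited string so it
# fits a varchar column in the external table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws, col

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("wasbs://raw@mystorageaccount.blob.core.windows.net/input/")
flattened = df.withColumn("status", concat_ws(";", col("status")))
flattened.write.mode("overwrite").parquet(
    "wasbs://raw@mystorageaccount.blob.core.windows.net/flattened/"
)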
Hope it helps.

Is it possible to read an Azure Databricks table from Azure Data Factory?

I have a table in an Azure Databricks cluster, and I would like to replicate this data into an Azure SQL Database to let other users analyze it from Metabase.
Is it possible to access Databricks tables through Azure Data Factory?
No, unfortunately not. Databricks tables are typically temporary and last as long as your job/session is running. See here.
You would need to persist your Databricks table to some storage in order to access it. Change your Databricks job to dump the table to Blob storage as its final action. In the next step of your Data Factory job, you can then read the dumped data from the storage account and process it further.
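A minimal sketch of that final dump step (storage path and table name are hypothetical), run as the last action of the Databricks job:

# Minimal sketch (hypothetical names): last step of the Databricks job, dumping
# the table to Blob storage as Parquet so a later Data Factory step can read it.
df = spark.table("my_database.my_table")
df.write.mode("overwrite").parquet(
    "wasbs://exports@mystorageaccount.blob.core.windows.net/my_table/"
)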
Another option may be Databricks Delta, although I have not tried this yet...
If you register the table in the Databricks Hive metastore, then ADF could read from it using the ODBC source in ADF, though this would require a self-hosted integration runtime (IR).
Alternatively, you could write the table to external storage such as Blob or Data Lake; ADF can then read that file and push it to your SQL database.
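A minimal sketch of the metastore route (hypothetical names): register the data as a managed table in the Databricks Hive metastore so it can then be queried from ADF via the ODBC source.

# Minimal sketch (hypothetical names): persist a DataFrame as a managed table
# in the Hive metastore; ADF's ODBC source (through a self-hosted IR with the
# Databricks/Spark ODBC driver) can then query it with plain SQL.
df = spark.table("my_temp_view")          # whatever currently holds the data
df.write.mode("overwrite").saveAsTable("analytics.my_table")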
