Is it possible to access Databricks DBFS from Azure Data Factory?

I am trying to use the Copy Data Activity to copy data from Databricks DBFS to another place on the DBFS, but I am not sure if this is possible.
When I select Azure Databricks Delta Lake as a dataset source or sink, I am able to access the tables in the cluster and preview the data, but validation says that the tables are not Delta tables (which they aren't, but I don't seem to be able to access the persistent data on DBFS).
Furthermore, what I want to access is the DBFS, not the cluster tables. Is there an option for this?

Related

Azure Data Factory: Ingest - from Delta table to Postgres

I am currently creating an ingest pipeline to copy data from a delta table to a postgres table. When selecting the sink, I am asked to enable staging.
Direct copying data from Azure Databricks Delta Lake is only supported when sink dataset is DelimitedText, Parquet or Avro with Azure Blob Storage linked service or Azure Data Lake Storage Gen2, for other dataset or linked service, please enable staging
This will turn my pipeline into a 2-step process where my delta table data is copied to a staging location and then from there it is inserted into postgres. How can I take the delta table data and load it directly into postgres using an ingest pipeline in ADF without staging? Is this possible?
As suggested by @Karthikeyan Rasipalay Durairaj in the comments, you can copy data directly from Databricks to PostgreSQL.
To copy data from Azure Databricks to PostgreSQL, use the code below:
df.write.option('driver', 'org.postgresql.Driver').jdbc(url_connect, table, mode, properties)
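A slightly fuller sketch of that JDBC write, runnable from a Databricks notebook (where spark already exists); the connection URL, table name, credentials, and Delta path below are all hypothetical placeholders:

    # Placeholder connection details -- substitute your own PostgreSQL server and credentials.
    url_connect = "jdbc:postgresql://<your-server>.postgres.database.azure.com:5432/<your-database>"
    table = "public.my_target_table"   # hypothetical target table
    mode = "append"                    # or "overwrite"
    properties = {
        "user": "<your-user>",
        "password": "<your-password>",
        "driver": "org.postgresql.Driver",
    }

    # Read the persisted Delta data and push it straight to PostgreSQL over JDBC.
    df = spark.read.format("delta").load("dbfs:/mnt/data/my_delta_table")  # hypothetical path
    df.write.jdbc(url=url_connect, table=table, mode=mode, properties=properties)

This keeps the copy inside Databricks, so no ADF staging step is involved.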
Staged copy from delta lake
When your sink data store or format does not match the direct copy criteria, the service enables the built-in staged copy using an interim Azure storage instance. The staged copy feature also gives you better throughput. The service exports data from Azure Databricks Delta Lake into staging storage, then copies the data to the sink, and finally cleans up your temporary data from the staging storage.
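For reference, the staged version of the original delta-to-Postgres copy would look roughly like the activity definition below, written here as a Python dict mirroring the Copy activity JSON; the dataset and linked service names are hypothetical, and the source/sink type names are as I recall them from the ADF copy activity schema:

    # Sketch of a Copy activity with staging enabled, assuming a blob storage
    # linked service named "StagingBlobLS" and datasets "DeltaLakeDS" / "PostgresDS".
    copy_activity = {
        "name": "CopyDeltaToPostgres",
        "type": "Copy",
        "inputs": [{"referenceName": "DeltaLakeDS", "type": "DatasetReference"}],
        "outputs": [{"referenceName": "PostgresDS", "type": "DatasetReference"}],
        "typeProperties": {
            "source": {"type": "AzureDatabricksDeltaLakeSource"},
            "sink": {"type": "AzurePostgreSqlSink"},
            "enableStaging": True,
            "stagingSettings": {
                "linkedServiceName": {
                    "referenceName": "StagingBlobLS",
                    "type": "LinkedServiceReference",
                },
                "path": "staging-container/delta-export",  # interim location, cleaned up afterwards
            },
        },
    }

For the direct-copy path described next, the same activity would drop enableStaging and stagingSettings and use a Parquet, delimited text, or Avro sink dataset (pointing to a folder) on Azure Blob Storage or ADLS Gen2.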
Direct copy from delta lake
If your sink data store and format meet the criteria described below, you can use the Copy activity to directly copy from Azure Databricks Delta table to sink.
• The sink linked service is Azure Blob storage or Azure Data Lake Storage Gen2. The account credential should be pre-configured in Azure Databricks cluster configuration.
• The sink data format is Parquet, delimited text, or Avro, and the dataset points to a folder instead of a file.
• In the Copy activity source, additionalColumns is not specified.
• If copying data to delimited text, in the copy activity sink, fileExtension needs to be ".csv".
Refer to this documentation for details.

Moving data from Teradata to Snowflake

I am trying to move data from Teradata to Snowflake. I have created a process to run TPT scripts for each table to generate files.
The files are also split to achieve concurrency when running COPY INTO in Snowflake.
I need to understand the best way to move those files from an on-prem Linux machine to Azure ADLS, considering the files are terabytes in size.
Does Azure provide any mechanism to move these files or can we directly create files on ADLS from Teradata?
The best approach is to load data into Snowflake via an external table if you have Azure Blob Storage or ADLS Gen2: load the data to blob storage, create an external table over it, and then load the data into Snowflake.
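As a sketch of that last step: instead of an external table, a named stage plus COPY INTO (which the question already mentions) works the same way. The snippet below uses the snowflake-connector-python package; the account, credentials, container URL, SAS token, file format, and table name are all hypothetical placeholders:

    import snowflake.connector

    # Placeholder credentials -- substitute your own account, user, warehouse, etc.
    conn = snowflake.connector.connect(
        account="<your-account>",
        user="<your-user>",
        password="<your-password>",
        warehouse="LOAD_WH",
        database="MY_DB",
        schema="PUBLIC",
    )
    cur = conn.cursor()

    # Point a stage at the ADLS Gen2 / Blob container holding the exported TPT files.
    cur.execute("""
        CREATE OR REPLACE STAGE teradata_stage
          URL = 'azure://<your-account>.blob.core.windows.net/<container>/exports/'
          CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
          FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|')
    """)

    # The split files in the stage are loaded in parallel by COPY INTO.
    cur.execute("COPY INTO my_table FROM @teradata_stage")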

How to perform data factory transformations on large datasets in Azure data warehouse

We have data warehouse tables on which we perform transformations using ADF.
If I have a group of ADW tables, and I need to perform transformations on them to land them back onto ADW, should I save the transformed data into Azure Blob Storage, or go directly into the target table?
The ADW tables are in excess of 100 million records.
Is it an acceptable practice to use Blob Storage as the middle piece?
I can think of two ways to do this (they do not require moving the data into blob storage):
Do the transformation within SQL DW using a stored procedure and use ADF to orchestrate the stored procedure call (a sketch follows this list).
Use ADF's Data Flow to read from SQL DW, apply the transformation, and write back to SQL DW.
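A rough illustration of the first option: an ADF Stored Procedure activity definition, written here as a Python dict mirroring the activity JSON; the linked service, procedure, and parameter names are hypothetical:

    # Sketch of an ADF Stored Procedure activity orchestrating an in-DW transformation.
    # "AzureSqlDwLinkedService" and "dbo.usp_TransformSales" are placeholder names.
    stored_proc_activity = {
        "name": "TransformInsideSqlDw",
        "type": "SqlServerStoredProcedure",
        "linkedServiceName": {
            "referenceName": "AzureSqlDwLinkedService",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "storedProcedureName": "dbo.usp_TransformSales",
            "storedProcedureParameters": {
                "RunDate": {"value": "2021-01-01", "type": "String"}
            },
        },
    }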
Yes, you had better use Blob Storage as the middle piece.
You cannot copy the tables from SQL DW (source) to the same SQL DW (sink) directly! If you try this, you will run into these problems:
Copy Data tool: errors in data mapping; it copies data into the same table instead of creating new tables.
Copy activity: "Table is required for Copy activity."
If you want to copy the data from SQL DW tables to new tables with Data Factory, you need at least two steps:
Copy the data from the SQL DW tables to Blob storage (creating CSV files).
Load these CSV files into SQL DW and create new tables.
Reference tutorials:
Copy and transform data in Azure Synapse Analytics (formerly Azure SQL Data Warehouse) by using Azure Data Factory
Copy and transform data in Azure Blob storage by using Azure Data Factory
Data Factory is good at transferring big data. See Copy performance of Data Factory for reference. I think it may be faster than the SELECT - INTO Clause (Transact-SQL).
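For comparison, the SELECT INTO alternative mentioned above runs entirely inside the SQL DW; a minimal sketch using pyodbc, where the server, database, credentials, and table names are placeholders:

    import pyodbc

    # Placeholder connection values -- substitute your own SQL DW / Synapse details.
    conn = pyodbc.connect(
        "Driver={ODBC Driver 17 for SQL Server};"
        "Server=<your-server>.database.windows.net;"
        "Database=<your-dw>;Uid=<your-user>;Pwd=<your-password>",
        autocommit=True,
    )

    # SELECT ... INTO creates the new table and copies the rows in a single statement,
    # without the data ever leaving the warehouse.
    conn.cursor().execute(
        "SELECT * INTO dbo.my_table_transformed FROM dbo.my_table WHERE load_date >= '2020-01-01'"
    )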
Hope this helps.

Is it possible to read an Azure Databricks table from Azure Data Factory?

I have a table in an Azure Databricks cluster. I would like to replicate this data into an Azure SQL Database, to let other users analyze this data from Metabase.
Is it possible to access Databricks tables through Azure Data Factory?
No, unfortunately not. Databricks tables are typically temporary and last as long as your job/session is running. See here.
You would need to persist your Databricks table to some storage in order to access it. Change your Databricks job to dump the table to Blob storage as its final action. In the next step of your Data Factory job, you can then read the dumped data from the storage account and process it further.
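A minimal sketch of that final dump step inside the Databricks job, assuming a hypothetical storage account, container, and table name (and that the storage account credentials are already configured on the cluster):

    # Write the Databricks table out to Blob storage as Parquet so ADF can pick it up.
    # The container, storage account, and table names below are placeholders.
    spark.table("my_databricks_table").write.mode("overwrite").parquet(
        "wasbs://<container>@<storageaccount>.blob.core.windows.net/exports/my_databricks_table"
    )

ADF can then point a Parquet dataset at the same folder and copy it into Azure SQL Database.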
Another option may be Databricks Delta, although I have not tried this yet...
If you register the table in the Databricks Hive metastore, then ADF could read from it using the ODBC source in ADF, though this would require an IR (integration runtime).
Alternatively, you could write the table to external storage such as Blob or Data Lake. ADF can then read that file and push it to your SQL database.

How to read/write data from/to Azure Table using Hadoop?

I'd like to have a Hadoop job which reads data from Azure Table storage and writes data back into it. How can I do that?
I'm mostly interested in writing data into Azure tables from HDInsight.
