Move data from Azure Data Lake to Big Query

Move data from Azure Data Lake to Big Query - azure

I wanted to move some data on daily basis from azure data lake to Big query using Azure Data Factory. However, ADF does not support Big Query as sink. What would you suggest? Any GCP service analogue to ADF to perform this task?
Thanks!

However, ADF does not support Big Query as sink.
Yes, ADF can only support Google Big Query as the source. So this means ADF can not achieve your requirement.
Any GCP service analogue to ADF to perform this task?
It seems that there is no ready-made tool, maybe you can write code to get data from datalake and copy it?

Related

Can I ingest data into tables in azure data explorer through Databricks?

I have to ingest data from ADLS gen2 to tables in ADX.
I first tried to use ADF, but considering the run time and everything, this was an really inefficient way to do it.
I think it would be so much easier if there is a way to ingest data into ADX via databricks.
I only can access ADLS through Service Pricipals so I can't directly connect to the storage account from ADX.
Any help will be much appreciated!

If it is for continuous ingestion consider creating an Event Grid data connection
If it is for one-time ingestion consider using the "lightIngest" tool, and provide the applicable connection string for your use case.

I would also recommend the Event Grid Data connection. Please note that you can set up this data connection using managed identities.

Is there a simple way to ETL from Azure Blob Storage to Snowflake EDW?

I have the following ETL requirements for Snowflake on Azure and would like to implement the simplest possible solution because of timeline and technology constraints.
Requirements :
Load CSV data (only a few MBs) from Azure Blob Storage into Snowflake Warehouse daily into a staging table.
Transform the loaded data above within Snowflake itself where transformation is limited to just a few joins and aggregations to obtain a few measures. And finally, park this data into our final tables in a Datamart within the same Snowflake DB.
Lastly, automate the above pipeline using a schedule OR using an event based trigger (i.e. steps to kick in as soon as file lands in Blob Store).
Constraints :
We cannot use use Azure Data Factory to achieve this simplest design.
We cannot use Azure Functions to deploy Python Transformation scripts and schedule them either.
And, I found that Transformation using Snowflake SQL is a limited feature where it only allows certain things as part of COPY INTO command but does not support JOINS and GROUP BY. Furthermore, although the following THREAD suggests that scheduling SQL is possible, but that doesn't address my Transformation requirement.
Regards,
Roy
Attaching the following Idea diagram for more clarity.
https://community.snowflake.com/s/question/0D50Z00009Z3O7hSAF/how-to-schedule-jobs-from-azure-cloud-for-loading-data-from-blobscheduling-snowflake-scripts-since-dont-have-cost-for-etl-tool-purchase-for-scheduling
https://docs.snowflake.com/en/user-guide/data-load-transform.html#:~:text=Snowflake%20supports%20transforming%20data%20while,columns%20during%20a%20data%20load.

You can create snowpipe on Azure blob storage, Once snowpipe created on top of your azure blob storage, It will monitor bucket and file will be loaded into your stage table as soon as new file comes in. After copied the data into stage table you can schedule transformation SQL using snowflake task.
You can refer snowpipe creation step for azure blob storage in below link:
Snowpipe on microsoft Azure blob storage

When to use Data Factory (copy) over direct pull in SQL synapse

I am just going through some Microsoft Document and doing handOn for Data engineering related things.
I have couple of queries for a scenrerio - "copy CSV file(s) from Blob storage to Synapse analytics (stage table(s)):
I read that we can do direct data pull in Synapse with the process of creating external tables. (https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/load-data-wideworldimportersdw)
If above is possible, then in what cases we do use Azure Data factory Copy or data flow method?
While working with Azure data factory, is it a good idea to use Polybase, because it will use Blob storage again as staging in this scenrerio (i.e. I am copying file from Blob only and again using blob for staging)?
I searched for answers to my queries but haven't found any satisfactory answer yet.

If you're just straight loading data from CSV into DW, use Copy. Polybase is recommended, but not always needed for small files.
If you need to transform that data or perform updates, then use data flows.

Reading data from lake

I need to read data from azure data from azure data lake and apply some joins in sql and show in Web UI.
Data is around 300 gb and migrating data from azure data factory to azure sql database is happening at the speed of 4Mbps.
I have also tried to use sql server 2019 which has polybase support but that is also taking 12-13 hours to copy data.
Also tried cosmos db for storing data from lake but seems it is taking large amount of time.
Any other way we can read data from lake.
One way can be azure data warehouse,but that is too costly and support only 128 concurrent transactions.
Can databricks be used,but its a computation engine and we need it to be available 24*7 for UI Queries

I still suggest you using Azure Data Factory. As you said, your data is around 300 gb.
Here's the Copy performance and scalability achievable using ADF:
I agree with David Makogon. The performance of your Data Factory is very slowly( 4Mbps). Please reference this document Copy activity performance and scalability guide.
It will help you improve the Data Factory data copy performance, give more suggestions about Data Factory settings or Database settings.
Hope this helps.

I had a very similar situation, just more data +-900GB.
If you need to show it in ui, you will still need to load data to Azure SQL, as DWH is not very good at handling parallel load and its costy.
We ended up using bulk insert from blob storage.
I created sp to call bulk insert with parameters (source file, target table) and ADF to orchestrate and run in parallel.
Could not find anything faster than that.
https://learn.microsoft.com/en-us/sql/relational-databases/import-export/examples-of-bulk-access-to-data-in-azure-blob-storage?view=sql-server-ver15

Do you have to use Azure Data Factory or can you just Databricks as your ETL tool from your multiple sources?

...Or do i need to add the data into a data lake using data factory first and then use databricks as an ELT?

Depends.
Databricks can connect to datasources and ingest data. However Azure Data Factory(ADF) have more connectors than databricks. So it depends on what you need. If using ADF, you need to land the data somewhere (i.e. Azure storage) so that databricks can pick it up.
Moreover, another main feature of ADF is to orchestrate data movement or activity. Databricks do have Job feature to schedule notebooks or JAR, however it is limited within databricks. If you want to schedule anything outside of databricks (e.g. drop file to SFTP or email on completion or terminate databricks cluster etc...) then ADF is the way to go.

Indeed it depends to the scenario I think. If you have a wide variety of datascources you need to connect to then adf is probably the better option.
If your sources are datafiles (in any format) you could consider using databricks for etl.
I use databricks as a pure etl tool (without adf) by mounting a notebook to a storage container in a blobstorage, take huge xml data from there and write the data to a dataframe in databricks. Then I parse the shape of the dataframe and then writing the data into an azure sql database. Fair to say I’m not really using it for the “e” in etl, as the data has already been extracted from the real source system.
Big advantage is the power you have at your disposal to parse the files.
Best regards.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string