How to move data from Azure Data Lake to Cosmos DB on a schedule?

I have data that needs to be moved from Azure Data Lake to Cosmos DB. The data is small, maybe < 1000 records per day, and each record is maybe < 5 KB. I need this data to be exported from Azure Data Lake and imported into Cosmos DB as a timed job, once per day, and ideally this would be configurable to run many times a day. Right now I am considering a function app that spins up on a schedule and performs the export/import, but this feels wrong; I feel like there must be a better way. What is the correct way to solve this problem?

You can use the Azure Data Factory Copy Data tool with a schedule trigger, using Azure Data Lake as the source and Cosmos DB as the sink.
See: Copy Data tool
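For what it's worth, if you do stick with the Function approach from the question, a timer-triggered function is a perfectly workable fallback. Below is a minimal sketch assuming the Python v2 programming model; the schedule, connection settings, container names, file layout (newline-delimited JSON with an "id" field per record), and database/container names are all my assumptions, not anything prescribed by the Copy Data tool:

```python
# Hypothetical sketch only: timer-triggered Azure Function (Python v2 model) that
# copies one day's export from Data Lake into Cosmos DB. Names below are assumptions.
import json
import os

import azure.functions as func
from azure.cosmos import CosmosClient
from azure.storage.filedatalake import DataLakeServiceClient

app = func.FunctionApp()

@app.timer_trigger(schedule="0 0 2 * * *", arg_name="timer")  # NCRONTAB: daily at 02:00 UTC
def export_to_cosmos(timer: func.TimerRequest) -> None:
    # Source: an ADLS Gen2 filesystem holding the daily export (assumed layout).
    lake = DataLakeServiceClient.from_connection_string(os.environ["DATALAKE_CONN"])
    export = lake.get_file_system_client("exports").get_file_client("daily/records.json")

    # Sink: a Cosmos DB container; each record is assumed to already carry an "id".
    cosmos = CosmosClient(os.environ["COSMOS_URL"], credential=os.environ["COSMOS_KEY"])
    container = cosmos.get_database_client("mydb").get_container_client("records")

    # The export is assumed to be newline-delimited JSON, < 1000 small records.
    for line in export.download_file().readall().decode("utf-8").splitlines():
        if line.strip():
            container.upsert_item(json.loads(line))
```

Running it several times a day is just a matter of changing the NCRONTAB schedule; the ADF schedule trigger in the Copy Data approach is configurable in the same way.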

Related

Looking for an alternative solution to processing tens of thousands of JSONs from Azure Blob to Azure SQL DB

I currently have pipelines developed that leverage Azure Data Factory for orchestration and Azure Databricks for its compute to perform the following actions: I receive tens of thousands of single-record JSON files into Azure Blob on a real-time basis, and every 15 minutes I check the folders for any new files. Once found, I load them into a dataframe using Databricks and load them into a single table in SQL DB, before having other ADF jobs trigger stored procedures which then transform my data into the final SQL tables. We are looking to move away from Databricks, as we are not using its true capabilities but are of course paying the Databricks costs. I'm looking for ideas on other solutions to load tens of thousands of JSONs into SQL DB (with minimal to no transformations) on a periodic (i.e. 15-minute) basis. We are a Microsoft shop, so we're not looking to necessarily move away from Azure tools.
Here are a few ideas:
Use Azure Functions + Blob Trigger / Event Grid to process the JSON files in real time (every time a new JSON file arrives, it will trigger your function). You could then insert either into the final table or into a temporary staging table; a sketch follows this list.
Another idea would be to combine Azure Functions + Blob Trigger / Event Grid to sink the data to a data lake, and then use ADF to sink it into the final SQL tables.
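A minimal sketch of the first idea, assuming the Azure Functions Python v2 programming model; the container name, staging table, and columns are placeholders of mine, and at tens of thousands of files you may prefer an Event Grid trigger over classic blob-trigger polling:

```python
# Hypothetical sketch only: blob-triggered Azure Function (Python v2 model) that loads
# each arriving single-record JSON file into a staging table. Names are assumptions.
import json
import os

import azure.functions as func
import pyodbc

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob", path="incoming-json/{name}",
                  connection="AzureWebJobsStorage")
def load_json_to_sql(blob: func.InputStream) -> None:
    record = json.loads(blob.read())

    # Insert into a staging table; the existing stored procedures can keep doing the
    # transform into the final tables, much as the current ADF jobs do.
    with pyodbc.connect(os.environ["SQL_CONN"]) as conn:
        conn.execute(
            "INSERT INTO dbo.StagingRecords (SourceFile, Payload) VALUES (?, ?)",
            blob.name, json.dumps(record),
        )
        conn.commit()
```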
Azure SQL DB is actually pretty capable as far as JSON goes, so you could just use OPENROWSET to import the data directly from blob storage and OPENJSON to shred it. You could then use a Logic App running on a schedule to call the proc, say, every 15 minutes; you wouldn't even need ADF as part of the solution.
I've worked up a couple of similar answers previously (e.g. here and here), but let me know if you want to progress further down this route and we can work up something more detailed.
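To make the OPENROWSET/OPENJSON route concrete, here is a rough sketch. The T-SQL string is the essential part (a Logic App would normally call a stored procedure containing it on a schedule); the table, columns, blob path, and external data source name are my assumptions, and the external data source (CREATE EXTERNAL DATA SOURCE ... TYPE = BLOB_STORAGE) must already exist. The pyodbc wrapper is only there to make the snippet runnable:

```python
# Hypothetical sketch only: shred a JSON blob into a SQL table with OPENROWSET/OPENJSON.
# Table, columns, blob path, and the external data source name are assumptions.
import os

import pyodbc

SHRED_JSON = """
INSERT INTO dbo.FinalTable (Id, CustomerName, Amount)
SELECT j.Id, j.CustomerName, j.Amount
FROM OPENROWSET(
        BULK 'incoming-json/data1.json',
        DATA_SOURCE = 'MyBlobStore',      -- CREATE EXTERNAL DATA SOURCE ... TYPE = BLOB_STORAGE
        SINGLE_CLOB) AS src
CROSS APPLY OPENJSON(src.BulkColumn)
    WITH (
        Id           INT             '$.id',
        CustomerName NVARCHAR(100)   '$.customerName',
        Amount       DECIMAL(18, 2)  '$.amount'
    ) AS j;
"""

with pyodbc.connect(os.environ["SQL_CONN"]) as conn:
    conn.execute(SHRED_JSON)
    conn.commit()
```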

Reading data from lake

I need to read data from Azure Data Lake, apply some joins in SQL, and show the results in a web UI.
The data is around 300 GB, and migrating it to Azure SQL Database with Azure Data Factory runs at about 4 Mbps.
I have also tried SQL Server 2019, which has PolyBase support, but that is also taking 12-13 hours to copy the data.
I also tried Cosmos DB for storing the data from the lake, but it seems to take a large amount of time.
Is there any other way we can read data from the lake?
One option could be Azure SQL Data Warehouse, but that is too costly and supports only 128 concurrent transactions.
Could Databricks be used? It's a computation engine, though, and we need it to be available 24*7 for UI queries.
I still suggest using Azure Data Factory. As you said, your data is around 300 GB.
I agree with David Makogon: the performance of your Data Factory copy is very slow (4 Mbps). Please reference this document: Copy activity performance and scalability guide. It shows the copy performance and scalability achievable with ADF, and it will help you improve the Data Factory copy performance, with more suggestions about Data Factory settings and database settings.
Hope this helps.
I had a very similar situation, just with more data, around 900 GB.
If you need to show it in a UI, you will still need to load the data into Azure SQL, as the DWH is not very good at handling parallel load and is costly.
We ended up using BULK INSERT from blob storage.
I created a stored procedure to call BULK INSERT with parameters (source file, target table) and used ADF to orchestrate it and run the loads in parallel.
I could not find anything faster than that.
https://learn.microsoft.com/en-us/sql/relational-databases/import-export/examples-of-bulk-access-to-data-in-azure-blob-storage?view=sql-server-ver15
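For illustration, here is a sketch of that pattern driven from Python rather than ADF; the stored procedure name, file list, and table names are my assumptions, and the procedure is assumed to build a dynamic BULK INSERT ... WITH (DATA_SOURCE = ...) statement internally:

```python
# Hypothetical sketch only: call a parameterised bulk-load stored procedure once per
# source file, in parallel. The procedure and all names are assumptions; in the answer
# above, ADF's orchestration played this role instead of Python threads.
import os
from concurrent.futures import ThreadPoolExecutor

import pyodbc

FILES_TO_TABLES = [
    ("exports/orders.csv", "dbo.Orders"),
    ("exports/customers.csv", "dbo.Customers"),
]

def bulk_load(source_file: str, target_table: str) -> None:
    # One connection per worker; pyodbc connections should not be shared across threads.
    with pyodbc.connect(os.environ["SQL_CONN"]) as conn:
        conn.execute("EXEC dbo.usp_BulkInsertFromBlob @SourceFile = ?, @TargetTable = ?",
                     source_file, target_table)
        conn.commit()

with ThreadPoolExecutor(max_workers=4) as pool:
    for source_file, target_table in FILES_TO_TABLES:
        pool.submit(bulk_load, source_file, target_table)
```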

Azure IoT data warehouse updates

I am building an Azure IoT solution for my BI project. For now I have an application that, once per set time window, sends a .csv blob to Azure Blob Storage with an incremental number in the name, so after some time my storage will contain files such as 'data1.csv', 'data2.csv', 'data3.csv', etc.
Now I need to load these data into a database, which will be my warehouse, using an Azure Stream Analytics job. The issue is that the .csv files will have overlapping data: they will be sent every 4 h and contain data for the past 24 h. I need to always read only the last file (the one with the highest number) and prepare a lookup so it properly updates the data in the warehouse. What would be the best approach to make Stream Analytics read only the latest file and to update records in the DB?
EDIT:
To clarify: I am fully aware that ASA is not capable of acting as an ETL job. My question is what the best approach for my case would be, using IoT tools.
I would suggest one of these two ways:
Use ASA to write to a temporary SQL table, and then use a SQL trigger to update the main table of the DW with the diff (a sketch of such a trigger follows below).
Or remove duplicates by adding a unique constraint as described here:
https://blogs.msdn.microsoft.com/streamanalytics/2017/01/13/how-to-achieve-exactly-once-delivery-for-sql-output/
Thanks,
JS - Azure Stream Analytics
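A rough sketch of the first option's trigger; the staging and warehouse table names and columns are entirely made up, and the pyodbc call just deploys the trigger once:

```python
# Hypothetical sketch only: AFTER INSERT trigger on the ASA staging table that merges
# rows into the warehouse table, overwriting the overlap. All names are made up.
import os

import pyodbc

CREATE_TRIGGER = """
CREATE TRIGGER dbo.trg_MergeStagingReadings ON dbo.StagingReadings
AFTER INSERT AS
BEGIN
    SET NOCOUNT ON;
    MERGE dbo.Readings AS tgt
    USING inserted AS src
        ON tgt.DeviceId = src.DeviceId AND tgt.ReadingTime = src.ReadingTime
    WHEN MATCHED THEN
        UPDATE SET tgt.Value = src.Value                  -- overlapping rows: overwrite
    WHEN NOT MATCHED THEN
        INSERT (DeviceId, ReadingTime, Value)
        VALUES (src.DeviceId, src.ReadingTime, src.Value);
END;
"""

# One-off deployment of the trigger (DDL, so autocommit).
with pyodbc.connect(os.environ["SQL_CONN"], autocommit=True) as conn:
    conn.execute(CREATE_TRIGGER)
```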

Azure Data Sync - Copy Each SQL Row to Blob

I'm trying to understand the best way to migrate a large set of data - ~ 6M text rows from (an Azure Hosted) SQL Server to Blob storage.
For the most part, these records are archived records, and are rarely accessed - blob storage made sense as a place to hold these.
I have had a look at Azure Data Factory and it seems to be the right option, but I am unsure whether it fulfills the requirements.
Simply put, the scenario is: for each row in the table, I want to create a blob whose contents are the value of one column from that row.
I see the tutorial (i.e. https://learn.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-tutorial-using-azure-portal) is good at explaining a bulk-to-bulk data pipeline, but I would like to migrate a bulk-to-many dataset.
I hope that makes sense and someone can help?
As of now, Azure Data Factory does not have anything built in like a For Each loop in SSIS. You could use a custom .NET activity to do this, but it would require a lot of custom code.
I would ask, if you were transferring this to another database, would you create 6 million tables all with the same structure? What is to be gained by having the separate items?
Another alternative might be converting it to JSON which would be easy using Data Factory. Here is an example I did recently moving data into DocumentDB.
Copy From OnPrem SQL server to DocumentDB using custom activity in ADF Pipeline
Another option is SSIS 2016 with the Azure Feature Pack, which gives you Azure tasks such as the Azure Blob Upload Task and the Azure Blob Destination. You might be better off using this; an OLE DB Command or a For Each Loop container with an Azure Blob destination could be another option.
Good luck!
Azure Data Factory now has a ForEach activity, which can be placed after a Lookup or Get Metadata activity to copy each row from SQL to a blob.
See: ForEach
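If you end up doing this outside ADF/SSIS, the row-per-blob pattern itself is simple. Here is a sketch under assumed table, column, and container names; for ~6M rows you would want batching, parallelism, and a resumable high-water mark rather than a single loop:

```python
# Hypothetical sketch only: one blob per row, named after the row's key, holding the
# contents of a single text column. Table, column, and container names are assumptions.
import os

import pyodbc
from azure.storage.blob import BlobServiceClient

container = BlobServiceClient.from_connection_string(
    os.environ["BLOB_CONN"]).get_container_client("archived-rows")

with pyodbc.connect(os.environ["SQL_CONN"]) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT Id, BodyText FROM dbo.ArchiveRows")
    for row in cursor:                      # streams rows rather than loading all 6M at once
        container.upload_blob(name=f"{row.Id}.txt",
                              data=row.BodyText or "",
                              overwrite=True)
```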

best design solution to migrate data from SQL Azure to Azure Table

In our service, we are using SQL Azure as the main storage and Azure Table as the backup storage. Every day about 30 GB of data is collected and stored in SQL Azure. Since the data is no longer valid from the next day, we want to migrate it from SQL Azure to Azure Table every night.
The question is: what would be the most efficient way to migrate the data from SQL Azure to Azure Table?
The naive idea I came up with is to leverage the producer/consumer concept using an IDataReader. That is, first get a data reader by executing "select * from TABLE" and put the data into a queue. At the same time, a set of threads grab data from the queue and insert it into Azure Table.
Of course, the main disadvantage of this approach (I think) is that we need to keep the connection open for a long time (possibly several hours).
Another approach is to first copy the data from the SQL Azure table to local storage on Windows Azure and use the same producer/consumer concept. With this approach we can disconnect as soon as the copy is done.
At this point, I'm not sure which one is better, or even whether either of them is a good design to implement. Could you suggest a good design solution for this problem?
Thanks!
I would not recommend using local storage, primarily because:
It is transient storage.
You're limited by the size of local storage (which in turn depends on the size of the VM).
Local storage is local only, i.e. accessible only to the VM in which it is created, which prevents you from scaling out your solution.
I like the idea of using queues, however I see some issues there as well:
Assuming you're planning on storing each row in a queue as a message, you would be performing a lot of storage transactions. If we assume that your row size is 64KB, to store 30 GB of data you would be doing about 500000 write transactions (and similarly 500000 read transactions) - I hope I got my math right :). Even though the storage transactions are cheap, I still think you'll be doing a lot of transactions which would slow down the entire process.
Since you're doing so many transactions, you may get hit by storage thresholds. You may want to check into that.
Yet another limitation is the maximum size of a message. Currently a maximum of 64KB of data can be stored in a single message. What would happen if your row size is more than that?
I would actually recommend throwing blob storage in the mix. What you could do is read a chunk of data from SQL table (say 10000 or 100000 records) and save that data in blob storage as a file. Depending on how you want to put the data in table storage, you could store the data in CSV, JSON or XML format (XML format for preserving data types if it is needed). Once the file is written in blob storage, you could write a message in the queue. The message will contain the URI of the blob you've just written. Your worker role (processor) will continuously poll this queue, get one message, fetch the file from blob storage and process that file. Once the worker role has processed the file, you could simply delete that file and the message.
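A compressed sketch of that chunk-to-blob-to-queue pattern using the current Python SDKs (the original answer predates them and assumes worker roles); the chunk size, names, and entity shape are my assumptions, and error handling, poison-message handling, and visibility timeouts are omitted:

```python
# Hypothetical sketch only: chunk rows from SQL Azure into JSON blobs, enqueue each blob
# name, and have workers load each blob into Azure Table storage. All names, the chunk
# size, and the entity shape are assumptions; retries and poison handling are omitted.
import json
import os

import pyodbc
from azure.data.tables import TableClient
from azure.storage.blob import BlobServiceClient
from azure.storage.queue import QueueClient

CHUNK = 10_000
STORAGE = os.environ["STORAGE_CONN"]

def produce() -> None:
    """Read the SQL table in chunks, write each chunk to a blob, enqueue the blob name."""
    blobs = BlobServiceClient.from_connection_string(STORAGE).get_container_client("backup-chunks")
    queue = QueueClient.from_connection_string(STORAGE, "backup-chunks")
    with pyodbc.connect(os.environ["SQL_CONN"]) as conn:
        cursor = conn.cursor()
        cursor.execute("SELECT Id, DeviceId, Payload FROM dbo.DailyData")
        columns = [c[0] for c in cursor.description]
        chunk_no = 0
        while True:
            rows = cursor.fetchmany(CHUNK)
            if not rows:
                break
            name = f"chunk-{chunk_no:06d}.json"
            blobs.upload_blob(name,
                              json.dumps([dict(zip(columns, r)) for r in rows], default=str),
                              overwrite=True)
            queue.send_message(name)          # the message carries only the blob name
            chunk_no += 1

def consume() -> None:
    """Worker: take a blob name off the queue and upsert its rows into the backup table."""
    blobs = BlobServiceClient.from_connection_string(STORAGE).get_container_client("backup-chunks")
    queue = QueueClient.from_connection_string(STORAGE, "backup-chunks")
    table = TableClient.from_connection_string(STORAGE, "DailyBackup")  # table assumed to exist
    for msg in queue.receive_messages():
        for row in json.loads(blobs.download_blob(msg.content).readall()):
            table.upsert_entity({"PartitionKey": str(row["DeviceId"]),
                                 "RowKey": str(row["Id"]),
                                 "Payload": row["Payload"]})
        blobs.delete_blob(msg.content)
        queue.delete_message(msg)
```

Because each queue message is only a blob name, the transaction count stays proportional to the number of chunks rather than the number of rows, which is the point of the blob-plus-queue design described above.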
