Moving a DocumentDB Collection to Azure Data Lake Storage

I was wondering what the best practice is for moving a DocumentDB collection to Azure Data Lake Storage.
Should I create a file for each document in the collection, or move the entire DocumentDB database?
Also, I couldn't find much information on how to access DocumentDB using U-SQL.
Input would be appreciated.

You currently cannot use U-SQL to access data in DocumentDB (now called Cosmos DB). There is a feature request here. Please feel free to add your vote.
If you move the data over, the right organization depends on how you want to manage the data (delete all of it, or only parts?), how it is structured (keep similarly structured data together, either in the same file or the same folder), how you use it (do you always need all of it, or only parts?), and what gives you the best access performance (larger files are normally better, but if they are JSON, also make sure the extraction process can handle them).

You can use Azure Data Factory to connect to DocumentDB and land your data in Data Lake Store.
After that you can query the data directly from the Data Lake using U-SQL.

Related

Azure Data Lake for Structured Data

We've been reviewing the Modern Data Warehouse architectures from Microsoft (link here), which references using Azure Data Factory to pull structured and unstructured data into the Azure Data Lake. I've attended a lot of presentations on the subject as well, but most people are split on whether the Data Lake is a good home for structured data. What I am trying to determine is if importing data into the Data Lake is a good strategy if the only source we will be utilizing is on-prem SQL Server databases? And, what would be the advantage / disadvantages of that strategy?
For context's sake, we're looking for a single pane of glass for consumption - whether it's end users' reporting with Power BI, or fodder for Azure Data Warehouse / an on-prem Data Warehouse. We want one container that is the source for all of these systems, which is not the source OLTP system (i.e. OLTP database --> (Azure Data Factory) --> Data Lake --> everything else).
I appreciate any guidance on the subject. Thank you.
You have not mentioned the data size, and I think for moving to ADL the data volume is a very strong parameter. In your case the data is very much structured. If you had unstructured and massive data, and you wanted to use ADB or Hadoop or any other technology to process it later, I think ADL would be a good candidate.
You should also consider that the data is encrypted in motion using SSL. You can authorize users and groups with fine-grained POSIX-based ACLs for all data in the store, enabling role-based access control.
The only real value in taking structured data, flattening it and loading it into a data lake is to save cost and decouple the data from any proprietary tool/compute. In your scenario, it will be less expensive to store the data in a data lake store vs. Azure SQL Database.
However, there is a complexity cost to flattening the data. You will need to restructure the data (i.e. load it back into a database, or wrap a logical structure around it) when you need to consume it. Formats such as Parquet will help with this, but it is more complex for users to query data in a data lake than it is to connect to a relational database. Almost all analysts and data consumers will know how to query a relational database, especially if the data is already in SQL Server.
Look at the volume of data and the use cases for consumption to make that decision. A "logical data lake" can include structured data in a relational database, semi-structured data flattened in a storage account, and unstructured data saved to a storage account.

Reading data from lake

I need to read data from Azure Data Lake, apply some joins in SQL, and show the results in a web UI.
The data is around 300 GB, and migrating it with Azure Data Factory to Azure SQL Database is happening at a speed of about 4 Mbps.
I have also tried SQL Server 2019, which has PolyBase support, but that is also taking 12-13 hours to copy the data.
I also tried Cosmos DB for storing the data from the lake, but it seems to take a large amount of time as well.
Is there any other way we can read data from the lake?
One option could be Azure SQL Data Warehouse, but that is too costly and supports only 128 concurrent transactions.
Could Databricks be used? It's a computation engine, though, and we need it to be available 24x7 for UI queries.
I still suggest using Azure Data Factory. As you said, your data is around 300 GB.
Here's the copy performance and scalability achievable using ADF (see the reference chart in the Copy activity performance and scalability guide).
I agree with David Makogon. The performance of your Data Factory copy is very slow (4 Mbps). Please reference this document: Copy activity performance and scalability guide.
It will help you improve the Data Factory copy performance and gives more suggestions about Data Factory settings and database settings.
Hope this helps.
I had a very similar situation, just with more data (around 900 GB).
If you need to show it in a UI, you will still need to load the data into Azure SQL, as the DWH is not very good at handling parallel load and it's costly.
We ended up using BULK INSERT from blob storage.
I created a stored procedure to call BULK INSERT with parameters (source file, target table), and used ADF to orchestrate it and run the loads in parallel.
Could not find anything faster than that.
https://learn.microsoft.com/en-us/sql/relational-databases/import-export/examples-of-bulk-access-to-data-in-azure-blob-storage?view=sql-server-ver15
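As a rough sketch of that pattern (this is not the original poster's code; the credential, data source, procedure and file names are placeholders, and the WITH options assume a CSV layout):

    -- One-time setup: credential + external data source pointing at the blob container.
    -- Assumes a database master key already exists.
    CREATE DATABASE SCOPED CREDENTIAL LakeBlobCredential
    WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
         SECRET = '<sas-token>';

    CREATE EXTERNAL DATA SOURCE LakeExport
    WITH (TYPE = BLOB_STORAGE,
          LOCATION = 'https://<account>.blob.core.windows.net/<container>',
          CREDENTIAL = LakeBlobCredential);
    GO

    -- Parameterised wrapper so ADF can call one load per file in parallel.
    CREATE PROCEDURE dbo.LoadFromBlob
        @SourceFile nvarchar(400),   -- path relative to the container, e.g. 'export/part-0001.csv'
        @TargetTable sysname         -- destination table
    AS
    BEGIN
        DECLARE @sql nvarchar(max) =
            N'BULK INSERT ' + QUOTENAME(@TargetTable) +
            N' FROM ''' + REPLACE(@SourceFile, N'''', N'''''') + N'''' +
            N' WITH (DATA_SOURCE = ''LakeExport'', FORMAT = ''CSV'', FIRSTROW = 2, TABLOCK);';
        EXEC sys.sp_executesql @sql;
    END;

An ADF ForEach activity (or a Lookup over the list of exported files) can then invoke the procedure once per file to get the parallelism described above.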

Can I store txt files in Azure SQL Database?

Hi, is it possible to store .txt files inside an Azure SQL Database? If not, which service provides this?
You can write them into an nvarchar(max) column if you'd like. If they are CSV files, you may want to shred them into columns in a table. However, if you have a lot of text file data, you may find it cheaper to use Azure Storage instead. It is generally better for colder data that you don't need to process, if the text data is all you want to store or if it is the majority of what you are trying to store.
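For the nvarchar(max) option, a minimal sketch (the table and column names are only illustrative):

    -- One row per text file; the whole file body goes into Content.
    CREATE TABLE dbo.TextFiles
    (
        FileId     int IDENTITY(1,1) PRIMARY KEY,
        FileName   nvarchar(260) NOT NULL,
        Content    nvarchar(max) NOT NULL,
        UploadedAt datetime2     NOT NULL DEFAULT SYSUTCDATETIME()
    );

    INSERT INTO dbo.TextFiles (FileName, Content)
    VALUES (N'notes.txt', N'...full text of the file...');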
hope that helps
You can have them first uploaded/stored in an Azure Storage account and have them automatically processed and loaded into a table in an Azure SQL Database using Azure Functions, as explained here. Azure Functions has triggers that respond to events such as a new file being uploaded.

Are we able to use SnappyData to update a record in Azure Data Lake, or is Azure Data Lake append-only?

I am currently working on Azure Data Lake with SnappyData integration. I have a question about SnappyData: are we able to update data written from SnappyData to Azure Data Lake Storage, or can we only append to Azure Data Lake Storage? I searched the forum but couldn't find a proper solution. If anyone knows the answer, please share it. Thank you.
Azure Data Lake Store, much like HDFS, is an append-only store. You can append to a file or replace it altogether. There is no way to update an existing file.
I've achieved MERGE-style behaviour in U-SQL by using an Azure Data Lake table as the middle ground between input and output. Check out my blog post with the code showing how I did it with a series of joins.
https://www.purplefrogsystems.com/paul/2016/12/writing-a-u-sql-merge-statement/
This will give you append behaviour in your output.

Azure Data Sync - Copy Each SQL Row to Blob

I'm trying to understand the best way to migrate a large set of data - around 6 million text rows - from (an Azure-hosted) SQL Server to Blob storage.
For the most part, these records are archived records and are rarely accessed - blob storage made sense as a place to hold them.
I have had a look at Azure Data Factory and it seems to be the right option, but I am unsure whether it fulfils the requirements.
Simply put, the scenario is: for each row in the table, I want to create a blob with the contents of one column from that row.
I see the tutorial (i.e. https://learn.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-tutorial-using-azure-portal) is good at explaining a bulk-to-bulk data pipeline, but I would like to migrate a bulk dataset into many separate blobs.
Hope that makes sense and someone can help?
As of now, Azure Data Factory does not have anything built in like a For Each loop in SSIS. You could use a custom .NET activity to do this, but it would require a lot of custom code.
I would ask, if you were transferring this to another database, would you create 6 million tables all with the same structure? What is to be gained by having the separate items?
Another alternative might be converting it to JSON, which would be easy using Data Factory. Here is an example I did recently moving data into DocumentDB.
Copy From OnPrem SQL server to DocumentDB using custom activity in ADF Pipeline
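Purely as an illustration of the row-to-JSON idea (the answer above does the conversion in Data Factory; the table and column names here are made up), each row can also be rendered as its own JSON document on the SQL side:

    -- Illustrative only: one JSON document per archived row.
    SELECT r.RecordId,
           (SELECT r.RecordId AS id, r.BodyText AS body
            FOR JSON PATH, WITHOUT_ARRAY_WRAPPER) AS JsonDocument
    FROM dbo.ArchivedRecords AS r;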
SSIS 2016 with the Azure Feature Pack gives you Azure tasks such as the Azure Blob Upload Task and the Azure Blob Destination. You might be better off using this; an OLE DB Command or a Foreach Loop with an Azure Blob Destination could be another option.
Good luck!
Azure Data Factory has a ForEach activity which can be placed after a Lookup or Get Metadata activity to copy each row from SQL to a blob.
ForEach
