What is the max writeBatchSize for REST as a sink in Azure Data Factory

We are using Azure Data Factory to copy data from an on-premises SQL table to a REST endpoint, for example, Google Cloud Storage. Our source table has more than 3 million rows. According to the documentation at https://learn.microsoft.com/en-us/azure/data-factory/connector-rest#copy-activity-properties, the default value for writeBatchSize (the number of records written to the REST sink per batch) is 10000. I tried increasing it to 1,000,000 and to 5,000,000 and noticed that the final file size is the same in both cases, which suggests that not all of the 3M records were written to GCS. Does anyone know the maximum value for writeBatchSize? Pagination seems to apply only when REST is used as a source. Is there any workaround for my case?
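To make the batch sizes in the question concrete, here is a minimal sketch of the batching arithmetic. It assumes a table of exactly 3,000,000 rows and assumes each batch is sent as a single request; it does not reproduce how Azure Data Factory batches internally, and the function names are made up for illustration.

```python
import math
from typing import Iterator, List, Sequence


def batch_count(total_rows: int, write_batch_size: int) -> int:
    """Number of batches needed to send total_rows in chunks of write_batch_size."""
    if write_batch_size <= 0:
        raise ValueError("write_batch_size must be positive")
    return math.ceil(total_rows / write_batch_size)


def split_into_batches(rows: Sequence, write_batch_size: int) -> Iterator[List]:
    """Yield consecutive slices of rows, each at most write_batch_size long."""
    if write_batch_size <= 0:
        raise ValueError("write_batch_size must be positive")
    for start in range(0, len(rows), write_batch_size):
        yield list(rows[start:start + write_batch_size])


if __name__ == "__main__":
    total = 3_000_000  # assumed row count, rounded from "more than 3 million"
    for size in (10_000, 1_000_000, 5_000_000):
        print(f"writeBatchSize={size}: {batch_count(total, size)} batch(es)")
```

Under these assumptions, a batch size larger than the table puts every row into one request, so whether the endpoint accepts a payload of that size becomes the limiting factor.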

Related

Azure data factory - Copy Data activity Sink - Max rows per file property

I would like to split a large file into smaller chunks inside Blob storage using the ADF Copy Data activity. I am trying to do this with the Max rows per file property in the Copy activity sink, but the file is not being split into smaller files; I get the same large file as the result. Can anyone share any useful information?
I tested it, and it works fine. The settings I checked, in order:
1. Source dataset
2. Source setting
3. Sink dataset
4. Sink setting
[Screenshots of each step and of the result are not preserved.]

Is it possible to limit the file size or count when creating external table in azure data warehouse?

When creating external tables in Azure Data Warehouse,
files are generated in the location we set.
We have noticed that the file size and count are determined by the scale level of the data warehouse.
Is it possible to set those values ourselves, for example to generate no more than 100 files,
or to keep each file no larger than 5 GB?
Thanks.
So far it is not possible to set those values; they are decided automatically by Azure DW.

Best method to transfer and transform a large amount of data from a SQL Server to an Azure SQL Server: Azure Data Factory, HDInsight, etc.

I would like to find the best method of transferring 20 GB of SQL data from a SQL Server database installed on a customer's on-site server (Client) to our Azure SQL Server (Source), which runs on an S4 tier with 200 DTUs for $320 a month. For the initial setup, we created an Azure Data Factory that copies the 20 GB via multiple table copies, e.g., Client Table A's content to Source Table A, Client Table B's content to Source Table B, etc. Then we run many Extractor stored procedures that join the Source tables together (e.g., Source A joined to Source B) and insert the results into Stage tables. After that the copies are incremental, but the initial setup takes forever.
Currently, copying on an S4 takes around 12 hours and extraction takes 4 hours. Moving up to an S9 with 1600 DTUs for $2400 a month cuts the copy to 6 hours and the extraction to 2 hours, but at a much higher cost.
I was wondering whether there are other Azure methods. Would setting up an HDInsight cluster with Hadoop or Spark be more cost-efficient than scaling the Azure SQL DB up to an S9 or higher? An S9 at $2400 for a 31-day month is about $3.23 an hour. A memory-optimized D14 v2 node in an Azure HDInsight cluster is $1.496 per hour, so it would be cheaper than an S9. However, how does it compare in terms of performance? Would the copying or the extraction be quicker?
I am not used to Big Data methods yet. Thank you for all the help.
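The hourly comparison can be checked with a few lines of arithmetic. This is only a sketch: it assumes a 31-day month, assumes the database tier is billed by the hour so that it can be scaled up just for the initial load, and treats the quoted $1.496 as the price of a single HDInsight node (a real cluster has several nodes, so the cluster total would be higher).

```python
HOURS_PER_DAY = 24


def hourly_cost(monthly_cost: float, days_in_month: int = 31) -> float:
    """Convert a monthly price into an hourly price."""
    return monthly_cost / (days_in_month * HOURS_PER_DAY)


def run_cost(hourly_rate: float, hours: float) -> float:
    """Cost of keeping a resource at hourly_rate for the given number of hours."""
    return hourly_rate * hours


if __name__ == "__main__":
    s4 = hourly_cost(320)    # S4, 200 DTUs
    s9 = hourly_cost(2400)   # S9, 1600 DTUs
    print(f"S4: ${s4:.2f}/hour, S9: ${s9:.2f}/hour")

    # Initial load durations from the question: copy time plus extraction time.
    print(f"Initial load on S4 (12 h + 4 h): ${run_cost(s4, 12 + 4):.2f}")
    print(f"Initial load on S9 (6 h + 2 h):  ${run_cost(s9, 6 + 2):.2f}")
```

Seen per load rather than per month, the higher tier costs more per hour but runs for half as long, so the comparison depends on how long the database has to stay scaled up.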
Azure Data Factory Copy Activity delivers a first-class secure, reliable, and high-performance data loading solution. It enables you to copy tens of terabytes of data every day across a rich variety of cloud and on-premises data stores. Copy Activity offers a highly optimized data loading experience that is easy to configure and set up.
You can see the performance reference table for Copy Activity in the documentation linked at the end of this answer.
The table shows the copy throughput in MBps for the given source and sink pairs in a single copy activity run, based on in-house testing.
If you want the data to be transferred more quickly with Azure Data Factory Copy Activity, Azure provides three ways to achieve higher throughput:
Data integration units. A Data Integration Unit (DIU) (formerly known as Cloud Data Movement Unit or DMU) is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Data Factory. You can achieve higher throughput by using more DIUs. You are charged based on the total time of the copy operation. The total duration you are billed for data movement is the sum of the duration across DIUs.
Parallel copy. You can use the parallelCopies property to indicate the parallelism that you want Copy Activity to use. For each Copy Activity run, Data Factory determines the number of parallel copies to use to copy data from the source data store to the destination data store.
Staged copy. When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an interim staging store.
You can use these three options to tune the performance of your Data Factory service with Copy Activity.
For more details about Azure Data Factory Copy Activity performance, please see:
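As a rough illustration of where these three settings live, the sketch below builds the tuning part of a Copy Activity's typeProperties as a Python dictionary and prints it as JSON. The property names (dataIntegrationUnits, parallelCopies, enableStaging, stagingSettings) are given as I understand them from the Copy Activity documentation and should be verified there; the helper function, the numeric values, the linked service name, and the staging path are all hypothetical.

```python
import json
from typing import Optional


def copy_activity_tuning(
    diu: int,
    parallel_copies: int,
    staging_linked_service: Optional[str] = None,
    staging_path: Optional[str] = None,
) -> dict:
    """Build the performance-related part of a Copy Activity's typeProperties."""
    props = {
        "dataIntegrationUnits": diu,        # more DIUs: more CPU, memory, network
        "parallelCopies": parallel_copies,  # requested degree of parallelism
    }
    if staging_linked_service is not None:
        # Staged copy: route the data through an interim Blob storage location.
        props["enableStaging"] = True
        props["stagingSettings"] = {
            "linkedServiceName": {
                "referenceName": staging_linked_service,
                "type": "LinkedServiceReference",
            },
            "path": staging_path,
        }
    return props


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the shape of the fragment.
    tuning = copy_activity_tuning(32, 8, "MyStagingBlobStorage", "staging/initial-load")
    print(json.dumps(tuning, indent=2))
```

The resulting fragment would be merged into the activity's typeProperties alongside its source and sink definitions.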
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance#data-integration-units

How to speed up copy from Azure Data Lake to Cosmos DB

I'm using Azure Data Factory to copy data from Azure Data Lake Store to a collection in Cosmos DB. We will have a few thousand JSON files in the data lake, and each JSON file is approx. 3 GB. I'm using Data Factory's copy activity, and in the initial run one file took 3.5 hours to load with the collection set to 10000 RU/s and Data Factory using default settings. I have now scaled the collection up to 50000 RU/s and set cloudDataMovementUnits to 32 and writeBatchSize to 10 to see whether that improved the speed, and the same file now takes 2.5 hours to load. At that rate, loading thousands of files will still take far too long.
Is there a better way to do this?
You say you are inserting "millions" of JSON documents per 3 GB batch file. Such lack of precision is not helpful when asking this type of question.
Let's run the numbers for 10 million docs per file.
That implies about 300 bytes per JSON doc, which in turn implies quite a lot of fields per doc to index on each CosmosDb insert.
If each insert costs 10 RUs, then at your budgeted 10,000 RU per second the insert rate would be 1,000 docs per second, i.e. 1,000 x 3,600 (seconds per hour) = 3.6 million doc inserts per hour.
At that rate, 10 million docs would take roughly 2.8 hours, so your observation of 3.5 hours to insert 3 GB of data representing an assumed 10 million docs is broadly consistent with your purchased CosmosDb throughput.
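The back-of-the-envelope estimate above can be reproduced as follows. Both the 10 million documents per file and the 10 RUs per insert are assumptions carried over from the answer, not measured values.

```python
SECONDS_PER_HOUR = 3600


def bytes_per_doc(file_bytes: float, docs: int) -> float:
    """Average document size when a file of file_bytes holds docs documents."""
    return file_bytes / docs


def inserts_per_hour(ru_per_sec: float, ru_per_insert: float) -> float:
    """Insert rate sustainable within a provisioned RU/s budget."""
    return ru_per_sec / ru_per_insert * SECONDS_PER_HOUR


def hours_to_load(docs: int, ru_per_sec: float, ru_per_insert: float) -> float:
    """Time to insert docs documents if RU throughput is the only limit."""
    return docs / inserts_per_hour(ru_per_sec, ru_per_insert)


if __name__ == "__main__":
    docs = 10_000_000     # assumed documents per 3 GB file
    ru_per_insert = 10    # assumed cost of a single insert

    print(f"{bytes_per_doc(3e9, docs):.0f} bytes per doc")
    for ru in (10_000, 50_000):
        print(f"{ru} RU/s: {hours_to_load(docs, ru, ru_per_insert):.2f} hours per file")
```

Under the same assumptions, the estimate for 50,000 RU/s comes out well below the observed 2.5 hours, which would suggest that something other than provisioned throughput limits the second run.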
This document https://learn.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-performance illustrates that the Data Lake to CosmosDb cloud sink performs poorly compared to other options. My guess is that the poor performance can be attributed to CosmosDb's default index-everything policy.
Does your application need everything indexed? Does the CosmosDb cloud sink use a less strict consistency level, such as eventual consistency, when performing bulk inserts?
You ask whether there is a better way. The performance table in the linked MS document shows that Data Lake to Azure Data Warehouse via PolyBase is 20,000 times more performant.
One final thought. Does the increased concurrency of your second test trigger CosmosDb throttling? The MS performance doc warns about monitoring for these events.
The bottom line is that trying to copy millions of JSON files will take time. If it were gigabytes of organized data, you could get away with shorter batch transfers, but not with millions of separate files.
I don't know whether you plan to transfer this type of file from Data Lake often, but a good strategy could be to write an application dedicated to doing that. Using the Microsoft.Azure.DocumentDB client library, you can easily create a C# web app that manages your transfers.
This way you can automate those transfers, throttle them, schedule them, etc. You can also host this app on a VM or App Service and never really have to think about it.
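The throttling idea can be sketched independently of any client library. The example below is written in Python rather than C# and does not use the Microsoft.Azure.DocumentDB library; insert_doc is a hypothetical stand-in for whatever call actually writes one document, and the rate limit is a simple fixed interval between inserts.

```python
import time
from typing import Callable, Iterable


def throttled_transfer(
    docs: Iterable,
    insert_doc: Callable[[object], None],
    max_docs_per_sec: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Insert docs one by one, never faster than max_docs_per_sec.

    insert_doc is a placeholder for the real write call (hypothetical).
    sleep and clock are injectable so the pacing can be tested without waiting.
    Returns the number of documents sent.
    """
    if max_docs_per_sec <= 0:
        raise ValueError("max_docs_per_sec must be positive")
    min_interval = 1.0 / max_docs_per_sec
    next_allowed = clock()
    sent = 0
    for doc in docs:
        now = clock()
        if now < next_allowed:
            # Too early for the next insert: wait out the rest of the interval.
            sleep(next_allowed - now)
        insert_doc(doc)
        sent += 1
        next_allowed = max(now, next_allowed) + min_interval
    return sent


if __name__ == "__main__":
    written = []
    count = throttled_transfer(({"id": i} for i in range(5)), written.append, 10.0)
    print(f"sent {count} documents")
```

A real transfer tool would also need retries for throttled requests and some form of batching or parallelism; this sketch only shows the pacing.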

Limited results when using Excel BigQuery connector

I've pulled a data extract from BigQuery using the Excel connector but my results have been limited to 230,000 records.
Is this a limitation of the connector or something I have not done properly?
BigQuery does have a maximum response size of 64MB (compressed). So, depending on the size of your rows, it's quite possible that 230,000 is the maximum size response BigQuery can return.
See more info on quotas here:
https://developers.google.com/bigquery/docs/quota-policy
What's the use case, and how many rows are you expecting to be returned? Generally BigQuery is used for large aggregate analysis rather than for queries that return huge numbers of unaggregated rows. You can also dump the entire table as a CSV into Google Cloud Storage if you're looking for your raw dataset.
Also, you may want to try running the query in the UI at:
https://bigquery.cloud.google.com/
