Method for retrying failed batch copies in batch processing - Azure

I have an Azure Data Factory pipeline that copies data from a SQL database to an Azure Blob. It works by querying the BatchRecord table (a sort of metadata table) and finding all records that have the same projectID and UniqueID (two IDs used to identify a customer and an account). Once all these matching records are found, they are grouped together to form a batch. This batch of records is then used to pull records from a second table and move them to the Azure Blob. There are many records that have the same projectID and UniqueID (representing multiple transactions made by the same client on different dates).
Each copy activity (and some other code) is contained inside a ForEach loop. The ForEach loop is driven by a Lookup activity that uses the projectID and UniqueID to determine how many times it needs to loop.
Because the number of records is quite large, some of the copy activities in each batch fail because of network issues. My question is this: Is there some way to retry only these failed copy activities after all the records have been attempted once? That is, I want to process all records (regardless of failures) then I want to re-attempt only the failed records. If possible, I would also like to be able to control how many times a retry is attempted (maybe two retries, maybe ten depending on what each client needs). How can I do this?

The Copy Activity in ADF has retry and retry interval settings. By default, retry is set to 0, so the copy is not retried if it fails. You can increase the number of retries on the copy activity.
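For reference, those settings live on the copy activity's policy block in the pipeline JSON. A minimal sketch is below; the activity name and the specific values are only placeholders, so pick a retry count per client as needed:

{
    "name": "CopyBatchToBlob",
    "type": "Copy",
    "policy": {
        "timeout": "7.00:00:00",
        "retry": 3,
        "retryIntervalInSeconds": 60
    }
}

Note that this built-in retry re-attempts a failed copy right after it fails rather than waiting until every record has been attempted once, but it is the simplest option, and the retry count can be set per copy activity.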

Related

Cache Lookup Properties in Azure Data Factory

I have a requirement wherein I have a source file containing the table name(s) in a Mapping Data Flow. Based on the table name in the file, there needs to be a dynamic query where column metadata, along with some other properties, is retrieved from the data dictionary tables and inserted into a different sink table. The table name from the file would be used as a WHERE condition filter.
Since there can be multiple tables listed in the input file (let's assume it's a CSV with only one column containing the table names), if we decide to use a cache sink for the file:
Is it possible to use the results of that cached sink in the Source transformation query in the same mapping data flow, as a lookup (from where the column metadata is being retrieved)? If yes, how?
What would be the best way to restrict data from the metadata table query based on this table name?
I thought of alternatively achieving this with a pipeline that uses a ForEach activity, passing the table name as a parameter to the data flow, but in this case, if there are 100 tables in the file, there would be 100 iterations and the cluster would need to be spun up 100 times. Please advise if this is wrong or if there are better ways to achieve this.
You would need to use the third option: loop through the table names and pass each one in as a parameter to the data flow to set the table name in the dataset.
ADF handles the cluster creation and teardown. All you have to worry about is whether you want to execute each iteration sequentially or in parallel, and how many to run at once. There are concurrency limits in ADF, so you should consider a batch count of 20 if you run in parallel.
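As a rough illustration, the ForEach in the pipeline JSON could look something like the fragment below. The lookup activity name, data flow name, parameter name, and the TableName column are all assumptions here, so substitute your own:

{
    "name": "ForEachTableName",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@activity('LookupTableNames').output.value",
            "type": "Expression"
        },
        "isSequential": false,
        "batchCount": 20,
        "activities": [
            {
                "name": "RunMetadataDataFlow",
                "type": "ExecuteDataFlow",
                "typeProperties": {
                    "dataflow": {
                        "referenceName": "MetadataDataFlow",
                        "type": "DataFlowReference",
                        "parameters": {
                            "tableName": {
                                "value": "'@{item().TableName}'",
                                "type": "Expression"
                            }
                        }
                    }
                }
            }
        ]
    }
}

Setting isSequential to false with a batchCount of 20 runs the tables in parallel while staying within the ForEach concurrency limit mentioned above.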

Azure Data Factory DataFlow Filter is taking a lot of time

I have an ADF pipeline which executes a Data Flow.
The Data Flow has:
a Source (table A), which has around 1 million rows;
a Filter with a query to select only yesterday's records from the source table;
Alter Row settings which use upsert;
and a Sink, which is an archival table where the records get upserted.
This whole pipeline takes around 2 hours or so, which is not acceptable. In fact, only around 3,000 records are actually being transferred/upserted.
The core count is 16. I tried partitioning with round robin and 20 partitions.
A similar archival doesn't take more than 15 minutes for another table which has around 100K records.
I thought of creating a source which would select only yesterday's records, but in the dataset we can only select a table.
Please suggest if I am missing anything to optimize it.
The table of the Data Set really doesn't matter. Whichever activity you use to access that Data Set can be toggled to use a query instead of the whole table, so that you can pass in a value to select only yesterday's data from the database.
Of course, if you have the ability to create a stored procedure on the source, you could also do that.
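For example, a source query along these lines could replace the full table read. The table and column names here are hypothetical; use whatever date or modified column your source actually has:

-- Hypothetical source query: pull only yesterday's rows instead of the whole table.
-- Replace dbo.SourceTable and LastModifiedDate with your actual table and date column.
SELECT *
FROM   dbo.SourceTable
WHERE  LastModifiedDate >= DATEADD(DAY, -1, CAST(GETDATE() AS date))
  AND  LastModifiedDate <  CAST(GETDATE() AS date);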
When migrating really large sets of data, you'll get much better performance by using a Copy activity to stage the data into an Azure Storage blob before using another Copy activity to pull from that blob into the sink. But for what you're describing here, that doesn't seem necessary.

Azure Table Storage duration before being able to retrieve data

I'm inserting around 6,000 values into Azure Table Storage. I'm inserting the values, 100 at a time, in TableBatchOperations. The values are inserted in an async method which is awaited.
Now many of my integration tests fail. They're trying to retrieve previously inserted values, but instead of the 6K values, only 1K or 2K values are returned. If I insert a Task.Delay of multiple seconds in my test, it succeeds.
So table.ExecuteBatchAsync() runs to completion for all of my 60 batches. Does anybody know why there's still so much time between the (completed) insertion and being able to retrieve the data?
Note: you can reproduce this behavior with Microsoft Azure Table Explorer. During insertion, hit the refresh button for the table.
Note 2: I've searched a lot for this phenomenon but can't really find any specs from Microsoft stating the time between insertion and being able to retrieve the data. I also couldn't find any similar posts on Stack Overflow.

How can I copy dynamic data from on-prem SQL Server to Azure Data Warehouse

I have created a linked service that takes the data from on-prem and stores it in an Azure blob, but my data is dynamic. How can I build a pipeline that takes the updated table into the blob, and then takes that blob and transfers it into the Azure data warehouse? I need this in such a way that all my tables stay in real-time sync with the Azure data warehouse.
What you are probably looking for is incrementally loading data into your data warehouse.
The procedure described below is documented here. It assumes you have periodic snapshots of your whole source table in blob storage.
You need to elect a column to track changes in your table.
If you are only appending and never changing existing rows, the primary key will do the job.
However, if you have to cope with changes in existing rows, you need a way to track those changes (for instance, with a column named "timestamp-of-last-update", or any better, more succinct name).
Note: if you don't have such a column, you will not be able to track changes and therefore will not be able to load data incrementally.
For a given snapshot, we are interested in the rows added or updated in the source table. This content is called the delta associated with the snapshot. Once the delta is computed, it can be upserted into your table with a Copy Activity that invokes a stored procedure. Here you can find details on how this is done.
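As an illustration of that upsert step: on sinks that support stored procedure invocation from the Copy Activity (such as Azure SQL Database), the procedure typically receives the delta through a table-valued parameter and merges it into the target. All of the type, procedure, table, and column names below are made up for the sketch:

-- Hypothetical table type matching the shape of the delta rows.
CREATE TYPE dbo.SourceTableType AS TABLE
(
    Id             INT,
    SomeValue      NVARCHAR(100),
    LastUpdateTime DATETIME2
);
GO

-- Hypothetical upsert procedure invoked by the Copy Activity's sink.
CREATE PROCEDURE dbo.spUpsertSourceTable
    @delta dbo.SourceTableType READONLY
AS
BEGIN
    MERGE dbo.TargetTable AS t
    USING @delta AS s
        ON t.Id = s.Id
    WHEN MATCHED THEN
        UPDATE SET t.SomeValue      = s.SomeValue,
                   t.LastUpdateTime = s.LastUpdateTime
    WHEN NOT MATCHED THEN
        INSERT (Id, SomeValue, LastUpdateTime)
        VALUES (s.Id, s.SomeValue, s.LastUpdateTime);
END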
Assuming the values of the elected column only grow as rows are added/updated in the source table, it is necessary to keep track of its maximum value across the snapshots. This tracked value is called the watermark. This page describes a way to persist the watermark in SQL Server.
Finally, you need to be able to compute the delta for a given snapshot given the last stored watermark. The basic idea is to select the rows where the elected column is greater than the stored watermark. You can do so using SQL Server (as described in the referenced documentation), or you can use Hive on HDInsight to do this filtering.
Do not forget to update the watermark with the maximum value of the elected column once delta is upserted into your datawarehouse.
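Putting the last two steps together, a minimal sketch of the delta query and the watermark update might look like this (the source table, watermark table, and column names are illustrative):

-- 1. Compute the delta: rows added or changed since the last stored watermark.
--    dbo.WatermarkTable is assumed to hold one row per tracked table.
SELECT *
FROM   dbo.SourceTable
WHERE  LastUpdateTime > (SELECT WatermarkValue
                         FROM   dbo.WatermarkTable
                         WHERE  TableName = 'SourceTable');

-- 2. After the delta has been upserted into the warehouse, advance the watermark.
UPDATE dbo.WatermarkTable
SET    WatermarkValue = (SELECT MAX(LastUpdateTime) FROM dbo.SourceTable)
WHERE  TableName = 'SourceTable';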

Timeout when creating an index on a large table in Azure

I have a table that contains more than 50 million records in Azure. I'm trying to create a nonclustered index on it using the following statement:
create nonclustered index market_index_1 on MarketData(symbol, currency) with(online=on)
But I get an error message:
Msg -2, Level 11, State 0, Line 0 Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
Any suggestions would be greatly appreciated.
Check out the Azure SQL Database Resource Limits document, then compare the error code with the error codes listed in that document.
With data of that size, I believe the only way to create a new index on that table would be to:
Create a new table with the same structure and only one clustered index
Copy the data from the original table into the new one
Truncate the original table
Create the desired indexes
Copy the data back into the original table
Note that moving the data between the tables can once again exceed the resource limits, so you might have to do these operations in chunks. A rough sketch of these steps follows.
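This is only an illustration using the MarketData table from the question; the holding table name is made up, SELECT ... INTO creates a heap (so create a clustered index on the holding table explicitly if you want one), and if MarketData has an identity column you will also need SET IDENTITY_INSERT when copying back:

-- 1. Copy the data out into a holding table (name is illustrative).
SELECT * INTO MarketData_holding FROM MarketData;

-- 2. Empty the original table.
TRUNCATE TABLE MarketData;

-- 3. Create the desired index while the table is empty.
CREATE NONCLUSTERED INDEX market_index_1 ON MarketData (symbol, currency);

-- 4. Copy the data back, ideally in chunks (for example by key range) to stay under the resource limits.
INSERT INTO MarketData SELECT * FROM MarketData_holding;

-- 5. Drop the holding table once the data has been verified.
DROP TABLE MarketData_holding;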
Another possible approach is to upgrade the database server to the new preview version of Azure SQL Database (note: you cannot downgrade the server later!).
