Why is my DB locked when an ADF Copy activity is running? - azure

I just want to know why an Azure Data Factory (ADF) Copy activity that's transferring data from the cloud to an on-prem DB locks the whole table.
When I click on the + button of the tables, a timeout error message appears. I'm dealing with a huge amount of data.
I'm not using a ForEach activity, but all the Copy activities run in parallel.

If you are using a ForEach activity, check the Sequential box to make sure the tasks are not running in parallel and locking the tables. See this documentation for more info.
If "isSequential" is set to False, ensure that there is a correct configuration to run multiple executables. Otherwise, this property should be used with caution to avoid incurring write conflicts. For more information, see the Parallel execution section.

Related

Designing a timer-triggered processor which relies on data from events

I am trying to design a timer-triggered processor (all in Azure) that will process a set of records laid out for it to consume. It will group them based on a column, create files out of them, and dump those files in a blob container. The records it consumes are generated based on an event: when the event is raised it carries a key, which can be used to query the data for the record (the data/record being generated is pulled from different services).
This is what I am thinking currently
Event is raised to event-grid-topic
An Azure Function (ConsumerApp) is event triggered; it reads the key, calls a service API to get all the data, and stores that record in a storage table with a flag marking it ready to be consumed.
An Azure Function (ProcessorApp) is timer triggered; it reads from the storage table, groups the records based on another column, and creates and dumps them as files. It can then mark the records as processed, if they were not already updated by ConsumerApp.
Apart from suggestions for any different, better way to do this, some of my questions are:
The table storage is going to fill up quickly, which will in turn slow down reading the 'ready' cases, so is there a better approach to store this intermediate and temporary data? One thing I thought of was to regularly flush the table, or to delete the record from the consumer app instead of marking it as 'processed'.
The service API is being called for each event, which might increase the strain on that service and its database. Should I group the calls for the records into a single API call, since the processor will run only after a set interval, or is there a better approach here?
Any feedback on this approach or a new design will be appreciated.
If you don't have to process the data in Step 2 individually, you can try saving it in a blob as well, and only record the blob path in Azure Table Storage to keep the row count minimal.
Azure Table Storage has partitions that you can use to partition your data and keep your read operations fast. A partition scan is faster than a table scan. In addition, Azure Table Storage is cheap, but if you have pricing concerns you can write a clean-up function to periodically remove the processed rows. Keeping the processed rows around for a reasonable time is usually a good idea, because you may need them for debugging issues.
By batching multiple calls into a single call, you can reduce network I/O delay, but resource contention will remain at the service level. You can try moving that API to a separate service, if possible, so it can be scaled separately.
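As a rough sketch of that blob-plus-pointer idea using the Python storage SDKs (azure-storage-blob and azure-data-tables); the container, table, and field names below are illustrative and assumed to already exist:

from azure.data.tables import TableClient
from azure.storage.blob import BlobClient

conn = "<storage-connection-string>"
key = "evt-123"                       # key carried by the event (illustrative)
group_value = "2024-01-15"            # value of the grouping column (illustrative)
payload_json = '{"id": "evt-123"}'    # record assembled from the service API (illustrative)

# ConsumerApp: put the full payload in a blob, keep only a small pointer row in Table Storage.
blob = BlobClient.from_connection_string(conn, container_name="staging", blob_name=f"records/{key}.json")
blob.upload_blob(payload_json, overwrite=True)

table = TableClient.from_connection_string(conn, table_name="readyrecords")
table.create_entity({
    "PartitionKey": group_value,      # partition on the grouping column so reads are partition scans
    "RowKey": key,
    "BlobPath": blob.url,
    "Status": "ready",
})

# ProcessorApp: read one partition at a time instead of scanning the whole table.
ready = table.query_entities("PartitionKey eq @pk and Status eq 'ready'",
                             parameters={"pk": group_value})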

One-time jobs on Azure workers

What is the best way to do a one-time job on Azure?
Say we want to extend a table in the associated database with a double column. All the new entries will have this value computed by the worker(s) at insertion, but somebody has to take care of the entries that are already in the table. I thought of two alternatives:
a method called by the worker only if a database entry (say, "JobRun") is set to true, and the method would flip the entry to false.
a separate app that does the job, and which is downloaded and run manually using remote desktop (I cannot connect the local app to the Azure SQL server).
The first alternative is messy (how should I deal with the code at the next deployment? delete it? comment it out? leave it there? also, what if I have another job in the future? create a new database entry "Job2Run"?). The second one looks like a cheap hack. I am sure there is a better way that I could not think of.
If you want to run a job once you'll need to take into account the following:
Concurrency: While the job is running, make sure no other worker picks up the job and runs it at the same time (you can use leases for this; more info here).
Completion: Once the job is done, you'll need to record (in Table Storage, SQL Azure, ...) that it completed successfully. The next time a worker tries to pick up the job, it will look in Table Storage / SQL Azure / ..., see that the job already completed, and skip it.
Failure: Your worker might crash during the job, in which case another worker should be able to pick up the job without any issue.
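A minimal sketch of those three points using a blob lease plus a Table Storage marker (Python storage SDKs; the job name, container, and table are illustrative and assumed to already exist):

from azure.core.exceptions import ResourceExistsError
from azure.data.tables import TableClient
from azure.storage.blob import BlobClient

conn = "<storage-connection-string>"
job_id = "backfill-double-column"   # illustrative one-time job name

def run_one_time_job():
    pass  # the actual work, e.g. computing the new column for existing rows (illustrative)

# Concurrency: take a short lease on a well-known blob so only one worker runs the job at a time.
lock_blob = BlobClient.from_connection_string(conn, container_name="locks", blob_name=job_id)
try:
    lock_blob.upload_blob(b"", overwrite=False)      # create the lock blob on first run
except ResourceExistsError:
    pass
lease = lock_blob.acquire_lease(lease_duration=60)   # a lease, not a permanent lock: fails if another
                                                     # worker holds it, and expires if that worker crashes

# Completion: skip the job if a previous worker already recorded it as done.
jobs = TableClient.from_connection_string(conn, table_name="jobs")
done = list(jobs.query_entities("PartitionKey eq 'jobs' and RowKey eq @id", parameters={"id": job_id}))
if not done:
    run_one_time_job()
    jobs.create_entity({"PartitionKey": "jobs", "RowKey": job_id, "Status": "done"})

lease.release()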
In your specific use case I would also consider using a tool like DbUp to manage updates to your schema and existing data with SQL scripts. Tools like these keep track of which scripts have been executed by recording them in a table in the database.

SQL Azure distributing heavy read queries for reporting

We are using SQL Azure for our application and need some input on how to handle queries that scan a lot of data for reporting. Our application is both read and write intensive, so we don't want the report queries to block the rest of the operations.
To avoid connection pooling issues caused by long-running queries, we put the code that queries the DB for reporting onto a worker role. This still does not prevent the database from getting hit with a bunch of read-only queries.
Is there something we are missing here? Could we set up a read-only replica that all the reporting calls hit?
Any suggestions would be greatly appreciated.
Have a look at SQL Azure Data Sync. It will allow you to incrementally update your reporting database.
Here are a couple of links to get you started:
http://msdn.microsoft.com/en-us/library/hh667301.aspx
http://social.technet.microsoft.com/wiki/contents/articles/1821.sql-data-sync-overview.aspx
I think it is still in CTP though.
How about this:
Create a separate connection string for reporting, for example using a different Application Name.
For your reporting queries, use SET TRANSACTION ISOLATION LEVEL SNAPSHOT.
This should prevent your long-running queries from blocking your operational queries, and it will also give your reports a consistent read.
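A small sketch of those two points from Python with pyodbc (server, database, and credentials are placeholders; snapshot isolation is enabled by default on SQL Azure):

import pyodbc

# Reporting gets its own connection string with a distinct Application Name (APP=...),
# so these sessions are easy to tell apart from the operational workload.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"     # placeholder server
    "Database=mydb;Uid=reporting_user;Pwd=<password>;"   # placeholder credentials
    "APP=ReportingRole;Encrypt=yes;"
)
cursor = conn.cursor()
cursor.execute("SET TRANSACTION ISOLATION LEVEL SNAPSHOT;")
# Long-scanning report queries now read a consistent snapshot and don't block writers.
cursor.execute("SELECT CustomerId, SUM(Total) AS Total FROM dbo.Orders GROUP BY CustomerId;")  # illustrative query
rows = cursor.fetchall()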
Since you're talking about reporting I'm assuming you don't need real time data. In that case, you can consider creating a copy of your production database at a regular interval (every 12 hours for example).
In SQL Azure it's very easy to create a copy:
-- Execute on the master database.
-- Start copying.
CREATE DATABASE Database1B AS COPY OF Database1A;
Your reporting would happen on Database1B without impacting the actual production database (Database1A).
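One thing to keep in mind: the copy runs asynchronously, so you would wait for it to finish before pointing the reports at it. A quick way to check is to poll the database state from the master database; a sketch using pyodbc (server name and credentials are placeholders):

import time
import pyodbc

master = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=master;Uid=admin_user;Pwd=<password>;Encrypt=yes;"
)
cursor = master.cursor()
while True:
    cursor.execute("SELECT state_desc FROM sys.databases WHERE name = 'Database1B';")
    row = cursor.fetchone()
    if row and row[0] == 'ONLINE':   # reads 'COPYING' while the copy is still in progress
        break
    time.sleep(30)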
You are saying you have a lot of read-only queries... any possibility of caching them? (That's a perfect fit, since they are read-only.)
What reporting tool are you using? You can output-cache the query results as well if needed.

Creating a multi-threaded Windows service

I need guidance on creating a multi-threaded Windows service.
The service needs to read records from a database table and save them to another service (ServiceB). The table might contain thousands of records; the service needs to read 100 at a time, do something with them, and save them to ServiceB, then take the next 100 records from the table, process them, and save them to ServiceB. The process should continue like that until it finishes all the records.
At the same time, this service will get the results back from ServiceB and update the table with them.
So this service needs to do two things simultaneously.
Any idea of the best way to do this, or any other help, is appreciated.
I think you need to look into the Thread class. Microsoft has a good tutorial for Thread.
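The answer above points at .NET's Thread class; as a language-agnostic sketch of the two-loop pattern (shown here in Python, with placeholder functions standing in for the database table and ServiceB):

import threading
import queue

BATCH_SIZE = 100
results = queue.Queue()   # results coming back from ServiceB

def read_batch(offset, size):
    return []   # placeholder: read `size` rows from the table starting at `offset`

def send_to_service_b(batch):
    return []   # placeholder: push a batch to ServiceB and return its per-record results

def update_table(result):
    pass        # placeholder: write one ServiceB result back to the table

def pusher():
    # Loop 1: read 100 records at a time and hand them to ServiceB until the table is exhausted.
    offset = 0
    while True:
        batch = read_batch(offset, BATCH_SIZE)
        if not batch:
            break
        for r in send_to_service_b(batch):
            results.put(r)
        offset += BATCH_SIZE
    results.put(None)   # sentinel: no more results are coming

def updater():
    # Loop 2: simultaneously write ServiceB results back to the table.
    while True:
        result = results.get()
        if result is None:
            break
        update_table(result)

threads = [threading.Thread(target=pusher), threading.Thread(target=updater)]
for t in threads:
    t.start()
for t in threads:
    t.join()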

How do I auto-start an Azure queue?

I want to build an Azure application that has two worker roles and NO web roles. When the worker roles first start up I want ONLY ONE of the roles to do the following a single time:
Download and parse a master file, then enqueue multiple "child" tasks based on the contents of the master file
Enqueue a single master file download "child" task to run the next day
Each of the "child" tasks would then be done by both of the workers until the task queue was exhausted. Think of the whole things as "priming the pump"
This sort of thing is really easy if I add the the first "master" task manually in a queue by calling a web role but seems to be really hard to do in an auto-start mode.
Any help in this regard would be greatly appreciated!
Thanks.....
One possibility: instead of calling a web role, just load the queue directly. (It sounds like this is the sort of application you'll want to automatically spin up to do some work and then shut down again... if you're automating that, it should be trivial to also automate loading the queue.)
A (perhaps) better option: Use some sort of locking mechanism to make sure only one worker instance does the initialization work. One way to do this is to try to create the queue (or a blob, or an entity in a table). If it already exists, then the other instance is handling initialization. If the create succeeds, then it's this instance's job.
Note that it's always better to use a lease than a lock, in case the instance that's doing the initialization fails. Consider using a timeout (e.g. storing a timestamp in table storage or in the metadata of the blob or in the name of the queue...).
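A minimal sketch of that "whoever creates the marker wins" approach with the Python storage SDKs (the container, queue, and file-parsing function are illustrative and assumed to already exist):

from datetime import datetime, timezone
from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobClient
from azure.storage.queue import QueueClient

conn = "<storage-connection-string>"

def parse_master_file():
    # Placeholder: download and parse the master file, return one payload per child task.
    return ["child-task-1", "child-task-2"]

# One marker per day; whichever instance creates it first does the initialization.
marker = BlobClient.from_connection_string(conn, container_name="init", blob_name="2024-01-15-master")
try:
    marker.upload_blob(b"", overwrite=False,
                       metadata={"started": datetime.now(timezone.utc).isoformat()})
    tasks = QueueClient.from_connection_string(conn, queue_name="child-tasks")
    for child in parse_master_file():
        tasks.send_message(child)
    # The follow-up "download the master file again" task, hidden for ~a day.
    tasks.send_message("download-master-file", visibility_timeout=86400)
except ResourceExistsError:
    pass   # the other worker instance is (or already was) handling initialization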
We ended up with the exact same sort of problem; that's why we introduced an O/C mapper (object to cloud). Basically, you want to introduce two types of cloud services:
A QueueService that consumes messages whenever they are available.
A ScheduledService that triggers operations on a schedule.
Then, as others suggested, in the cloud you really want to use leases instead of locks, to avoid your cloud app ending up frozen forever due to a temporary hardware (or infrastructure) issue.
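A bare-bones sketch of the queue-consuming half with the Python queue SDK (the queue name and handler are illustrative); the ScheduledService counterpart would run its operation on a timer instead of polling the queue:

import time
from azure.storage.queue import QueueClient

tasks = QueueClient.from_connection_string("<storage-connection-string>", queue_name="child-tasks")

def handle(task):
    pass   # placeholder: do the actual work for one "child" task

while True:
    for msg in tasks.receive_messages(visibility_timeout=300):   # the message stays leased for ~5 minutes
        handle(msg.content)
        tasks.delete_message(msg)   # only delete once the work succeeded
    time.sleep(10)   # nothing to do; back off before polling again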
