I have a DWH running on Azure Synapse dedicated pool.
In addition to existing nightly/daily ETL processes, I need to add another in parallel that will kill performance of the current instance. That process is required to be run only 1 week per month during day time.
Similar to a Snowflake approach, is it possible to set up independent Azure Synapse compute to process the same data as the first instance? Not a copy of data, but the same data in the same files.
Or should I simply change the instance size twice a day for 1 week per month? (This requires pausing all activity.)
Any advice will be appreciated!
Thanks!
I agree that scaling up or using a serverless SQL pool is a good option.
Before implementing, I would also evaluate whether the process you are adding (and/or the existing ones) is properly optimized for MPP. First validate that you are co-locating data as much as possible by using common HASH distributions. ETL written first for SQL Server (SMP) often needs some refactoring to truly leverage the power of MPP.
Look at query plans for long-running jobs: is there excessive data broadcasting or shuffling? If so, fix it by updating the table distributions.
Are statistics available and up to date?
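To make the co-location point concrete, here is a minimal sketch that simulates hash distribution in plain Python. It does not talk to Synapse. The table and column names are made up for illustration, and `distribution_of` is a toy stand-in for the engine's real hash function. The only fact borrowed from the product is that a dedicated pool spreads each table over 60 distributions.

```python
DISTRIBUTIONS = 60  # a dedicated SQL pool spreads each table over 60 distributions


def distribution_of(key: int) -> int:
    """Toy stand-in for the engine's hash function (an assumption, not the real one)."""
    return key % DISTRIBUTIONS


def rows_to_move(fact_rows, dim_rows, fact_dist_col, dim_dist_col, join_col):
    """Count fact rows that are not on the distribution holding their join partner."""
    dim_location = {row[join_col]: distribution_of(row[dim_dist_col]) for row in dim_rows}
    moved = 0
    for row in fact_rows:
        where_fact_is = distribution_of(row[fact_dist_col])
        where_partner_is = dim_location[row[join_col]]
        if where_fact_is != where_partner_is:
            moved += 1
    return moved


# Hypothetical tables: 1,000 customers and 10,000 orders.
customers = [{"customer_id": i, "region_id": i % 7} for i in range(1000)]
orders = [{"order_id": i, "customer_id": i % 1000} for i in range(10_000)]

# Both tables hash-distributed on the join column: every join is local.
colocated = rows_to_move(orders, customers, "customer_id", "customer_id", "customer_id")

# Orders distributed on order_id instead: many rows would have to be shuffled.
mismatched = rows_to_move(orders, customers, "order_id", "customer_id", "customer_id")

print(colocated, mismatched)
```

When both tables are distributed on the join column, no rows need to move. When the fact table is distributed on a different column, a large share of rows sits on the wrong distribution, which is the shuffling you would see in the query plan.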
I am trying to design a timer-triggered processor (all in Azure) that will process a set of records set aside for it to consume. It will group them by a column, create files from the groups, and drop the files in a blob container. The records it consumes are generated in response to an event: when the event is raised, it contains a key that can be used to query the data for the record (the data for each record is pulled from different services).
This is what I am thinking currently
Event is raised to event-grid-topic.
Azure Function(ConsumerApp) is event triggered, reads the key, calls a service API to get all the data, stores that record in storage table, with flag ready to be consumed.
Azure Function(ProcessorApp) is timer triggered, will read from the storage table, group based on another column, create and dump them as files. This can then mark the records as processed, if not updated already by ConsumerApp.
Some of my questions on these, apart from any way we can do it in a different better way are -
The table storage is going to fill up quickly, which will again decrease the speed to read the 'ready cases' so is there any better approach to store this intermediate & temporary data? One thing which I thought was to regularly flush the table or delete the record from the consumer app instead of marking it as 'processed'
The service API is being called for each event, which might increase the strain on that service and its database. Should I group the calls for records into a single API call, since the processor will only run after a set interval, or is there a better approach here?
Any feedback on this approach or a new design will be appreciated.
If you don't have to process data on Step 2 individually, you can try saving it in a blob too and add a record with the blob path in Azure Table Storage to keep the row count minimal.
Azure Table Storage has partitions that you can use to partition your data and keep your read operations fast. A partition scan is faster than a table scan. In addition, Azure Table Storage is cheap, but if you have pricing concerns, you can write a cleanup function to periodically remove the processed rows. Keeping the processed rows around for a reasonable time is usually a good idea, because you may need them for debugging issues.
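As a rough sketch of the partitioning and cleanup idea, the snippet below uses an in-memory dict in place of the real table. The hourly bucket, the `Group` and `Status` property names, and the helper names are all assumptions for illustration. Only `PartitionKey` and `RowKey` are actual Table Storage concepts.

```python
from datetime import datetime, timedelta, timezone


def partition_key_for(ts: datetime) -> str:
    """Hourly bucket, e.g. '2024-01-15T10' (the bucket size is an assumption)."""
    return ts.strftime("%Y-%m-%dT%H")


def make_entity(record_key: str, group: str, ts: datetime) -> dict:
    """Build a row whose partition is derived from the event time."""
    return {
        "PartitionKey": partition_key_for(ts),
        "RowKey": record_key,
        "Group": group,
        "Status": "ready",
    }


def partitions_between(start: datetime, end: datetime) -> list:
    """Partition keys the processor has to read for one timer interval."""
    keys = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current <= end:
        keys.append(partition_key_for(current))
        current += timedelta(hours=1)
    return keys


def purge_before(table: dict, cutoff: datetime) -> int:
    """Drop whole partitions older than the cutoff; returns the rows removed."""
    cutoff_key = partition_key_for(cutoff)
    old = [pk for pk in table if pk < cutoff_key]
    return sum(len(table.pop(pk)) for pk in old)


# In-memory stand-in for the table: partition key -> list of entities.
now = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
table = {}
for i in range(6):
    ts = now - timedelta(hours=i)
    entity = make_entity(f"rec-{i}", "A" if i % 2 else "B", ts)
    table.setdefault(entity["PartitionKey"], []).append(entity)

# The processor reads only the partitions covering its last interval.
wanted = partitions_between(now - timedelta(hours=1), now)
ready = [e for pk in wanted for e in table.get(pk, []) if e["Status"] == "ready"]

# The cleanup function removes anything older than three hours.
removed = purge_before(table, now - timedelta(hours=3))
```

With a time-based partition key, the processor never scans the whole table, and cleanup can work partition by partition while recent rows stay available for debugging.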
By batching multiple calls in a single call, you can decrease network I/O delay. But resource contention will remain at service level. You can try moving that API to a separate service if possible to scale it separately.
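A minimal sketch of the batching idea, assuming the service could expose a bulk endpoint. `fetch_many` is a hypothetical callable standing in for that endpoint, and the batch size of 50 is arbitrary.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def chunked(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` keys."""
    batch: List[str] = []
    for key in keys:
        batch.append(key)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def fetch_all(
    keys: Iterable[str],
    fetch_many: Callable[[List[str]], Dict[str, dict]],
    batch_size: int = 50,
) -> Dict[str, dict]:
    """Fetch records for all keys with one call per batch instead of one per key."""
    records: Dict[str, dict] = {}
    for batch in chunked(keys, batch_size):
        records.update(fetch_many(batch))
    return records


# Fake bulk endpoint that records how many keys each call received.
calls: List[int] = []


def fake_fetch_many(batch: List[str]) -> Dict[str, dict]:
    calls.append(len(batch))
    return {key: {"key": key} for key in batch}


pending_keys = [f"key-{i}" for i in range(120)]
records = fetch_all(pending_keys, fake_fetch_many, batch_size=50)
print(calls)
```

Here 120 pending keys turn into three calls rather than 120. The service still does the same amount of database work, so this mainly saves round trips.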
In Azure Cosmos DB, is it possible to create multiple read replicas at a database / container / partition key level to increase read throughput? I have several containers that will need more than 10K RU/s per logical partition key, and re-designing my partition key logic is not an option right now. Thus, I'm thinking of replicating data (eventual consistency is fine) several times.
I know Azure offers global distribution with Cosmos DB, but what I'm looking for is replication within the same region and ideally not a full database replication but a container replication. A container-level replication will be more cost effective since I don't need to replicate most containers and I need to replicate the others up to 10 times.
A few options are available, though:
There is no replication option within the same region, but you could use the Change Feed to replicate to another DB, built with the re-design in mind, that serves only read queries.
It might be a better idea, though, to use either the serverless option, which is in preview, or the autoscale option.
You can also look at provisioned throughput and reserve the provisioned RUs for 1 or 3 years, paying monthly just as you would in the PAYG model but with a large discount.
Another option is to run a latency test from a VM in the region where your main DB and app are running, find the closest region in terms of latency (ms), and, if the latency is bearable, use global replication to that region. I use this tool for latency tests, but run it from a VM within the region where your app/DB is running.
My guess is your queries are all cross-partition and each query is consuming a ton of RU/s. Unfortunately there is no feature in Cosmos DB to help in the way you're asking.
Your options are to create more containers and use change feed to replicate data to them in-region and then add some sort of routing mechanism to route requests in your app. You will of course only get eventual consistency with this so if you have high concurrency needs this won't work. Your only real option then is to address the scalability issues with your design today.
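The "routing mechanism" could be as simple as round-robin over the replica containers. The sketch below only picks container names. It does not use the Cosmos DB SDK, and the class name and container names are hypothetical.

```python
import itertools
from typing import List


class ReadRouter:
    """Send writes to the primary container and spread reads over replicas."""

    def __init__(self, primary: str, replicas: List[str]):
        self.primary = primary
        # Fall back to the primary if no replicas are configured.
        self._cycle = itertools.cycle(replicas or [primary])

    def container_for_write(self) -> str:
        return self.primary

    def container_for_read(self) -> str:
        """Round-robin: each call returns the next replica in turn."""
        return next(self._cycle)


router = ReadRouter("orders", ["orders-r1", "orders-r2", "orders-r3"])
reads = [router.container_for_read() for _ in range(6)]
print(reads)
```

Because the replicas are filled from the change feed, a read may return slightly stale data, which matches the eventual consistency caveat above.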
Something that can help is this live data migrator. You can use it to keep a second container in sync with your original, which will allow you to eventually migrate from your first design to one that scales better.
I have an executable that performs long calculations and I want to run those calculations on Azure. What would be the optimal service - batch or VM perhaps?
Azure Batch or VM scale sets. Azure Batch is built on top of scale sets and is designed specifically for tasks/jobs, while VM scale sets help with scaling generic VMs.
Use cases for Batch:
Batch is a managed Azure service that is used for batch processing or batch computing--running a large volume of similar tasks to get some desired result. Batch computing is most commonly used by organizations that regularly process, transform, and analyze large volumes of data.
Batch works well with intrinsically parallel (also known as "embarrassingly parallel") applications and workloads. Intrinsically parallel workloads are easily split into multiple tasks that perform work simultaneously on many computers.
More info here for batch: https://azure.microsoft.com/en-us/documentation/articles/batch-technical-overview/
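To illustrate what "intrinsically parallel" means, here is a local sketch using only the Python standard library. It is not Azure Batch code. `long_calculation` is a placeholder for your executable, and the inputs are made up. The point is that each task depends only on its own input, so the tasks can be handed to any number of workers.

```python
from concurrent.futures import ThreadPoolExecutor


def long_calculation(n: int) -> int:
    """Placeholder for one independent unit of work."""
    return sum(i * i for i in range(n))


# Each input is a separate task; no task needs another task's result.
inputs = [10, 100, 1000, 10_000]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() keeps results in the same order as the inputs.
    results = list(pool.map(long_calculation, inputs))

print(results)
```

If your calculation can be split this way, Batch fits well, since each input would become one task in a job. If it is a single long sequential run, a plain VM may be all you need.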
If you can change the doctype to multi-part and you're able to suspend your long job every minute or so and update progress, that will make it more interactive for the user and stop the HTTP connection from timing out. You could also add a cancel job button. Or is the question about something else?
I am planning on running SharePoint Foundation on one VM of size A3 and SQL Server on another of size A6. As far as I understand, this is not enough to achieve the SLA, and I should use 2 more instances, one for SharePoint and one for SQL Server, configured in 2 separate availability sets.
Can I use scaling (by CPU usage) to turn off one instance and leave only one running at a time in an availability set? This would reduce the costs but I wonder if this solution will be good enough to achieve Azure's SLA. The way I see it one instance is running at a time while other one is shut down so I am billed for one instance. When there is an update or failure going on, the instance that until then has been running is shut down and the other one comes online. Is this the way it works? Can I cut costs of availability sets like this?
No, the SLA requires two running instances. However, if you want to control your costs, the approach you describe will work. Just keep in mind that the duration of a disruption will depend on how quickly you detect that the primary VM has failed and how fast you can start the secondary VM. And depending on the nature of the service disruption, it may not be possible for you to start the secondary. So it's a risk.
We are using SQL Azure for our application and need some input on how to handle queries that scan a lot of data for reporting. Our application is both read and write intensive, so we don't want the report queries to block the rest of the operations.
To avoid connection pooling issues caused by long running queries we put the code that queries the DB for reporting onto a worker role. This still does not avoid the database getting hit with a bunch of read only queries.
Is there something we are missing here? Could we set up a read-only replica that all the reporting calls hit?
Any suggestions would be greatly appreciated.
Have a look at SQL Azure Data Sync. It will allow you to incrementally update your reporting database.
Here are a couple of links to get you started:
http://msdn.microsoft.com/en-us/library/hh667301.aspx
http://social.technet.microsoft.com/wiki/contents/articles/1821.sql-data-sync-overview.aspx
I think it is still in CTP though.
How about this:
Create a separate connection string for reporting, for example use a different Application Name
For your reporting queries use SET TRANSACTION ISOLATION LEVEL SNAPSHOT
This should prevent your long running queries blocking your operational queries. This will also allow your reports to get a consistent read.
Since you're talking about reporting I'm assuming you don't need real time data. In that case, you can consider creating a copy of your production database at a regular interval (every 12 hours for example).
In SQL Azure it's very easy to create a copy:
-- Execute on the master database.
-- Start copying.
CREATE DATABASE Database1B AS COPY OF Database1A;
Your reporting would happen on Database1B without impacting the actual production database (Database1A).
You are saying you have a lot of read-only queries...any possibility of caching them? (perfect since it is read-only)
What reporting tool are you using? You can output cache the queries as well if needed.
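If caching is an option, a small time-based cache in front of the report queries is one way to do it. This is a sketch only: the `TTLCache` class, the 60-second lifetime and the `run_report` loader are all made up for illustration, and the loader stands in for whatever actually queries the database.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache values for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        """Return the cached value, or call loader() if it is missing or expired."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = loader()
        self._entries[key] = (now + self._ttl, value)
        return value


# A fake clock keeps the demonstration deterministic.
fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
loads = []


def run_report() -> dict:
    """Stand-in for the expensive reporting query."""
    loads.append(1)
    return {"total_orders": 42}


first = cache.get_or_load("daily", run_report)   # runs the query
second = cache.get_or_load("daily", run_report)  # served from the cache
fake_now[0] = 61.0
third = cache.get_or_load("daily", run_report)   # entry expired, runs the query again
```

The database sees the report query at most once per cache lifetime, however many users open the report.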