I'm trying to provision an Azure SQL Data Warehouse (Gen2), but I'm not sure what DWU level I need to set.
After analyzing the source systems, the DWH should receive nearly 1.5 million records per day on average, inserted or updated across different sets of tables.
Given that number of records, is it possible to determine the DWU that should be set at the DWH level?
Please advise.
While the volume of inserts is a useful number, it is more important to know the volume of data that will be frequently queried. We call this the active data set.
Let's say that you have 10 TB of data. If most of your queries address that whole 10 TB, then your active data set is 10 TB. However, if most of your queries only deal with 10% of your data, your active data set is 1 TB.
Some general guideline examples for DWUc by active data set:
1 TB: 500c
3 TB: 1000c
10 TB: 3000c
That said, in my experience the 1 TB/500c recommendation is a little small. That is because you're still working on a single node below 1000c, and your number of concurrent queries is limited to 20, with 20 concurrency slots. I like to see customers start at 1000c and only use a lower DWU for dev/test, or during very quiet periods when the DW can't otherwise be paused.
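If it helps, the DWU can be changed at any time once you've picked a starting point. A minimal T-SQL sketch, run against the master database (the database name here is just an example):

-- Scale the warehouse to Gen2 DW1000c; 'MyDW' is an example name
ALTER DATABASE MyDW MODIFY (SERVICE_OBJECTIVE = 'DW1000c');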
Context
Note: To be precise, I have multiple data models on the same AAS instance; however, judging by the size of those models and the usage graphs, they don't seem to be impacting memory usage by any significant amount. The discussion below therefore focuses on the single data model that seems most correlated with the observed spikes.
I have a data model held in an Azure Analysis Services instance (the data model itself is a database inside the AAS instance), deployed to the instance using Visual Studio. The model is essentially built from data pulled straight from a SQL Server database (queries and stored procedures are used to create the tables under the hood).
Note: Within this data model there are 16 tables in total. The largest 2 (as defined by % of model occupied and other metrics, which can be viewed via DAX Studio's VertiPaq Analyzer) are partitioned day-wise with 60 daily partitions each (2022-04-11, 2022-04-12, ...) and handled via the partitioning automation procedure outlined in the Resources section below. The remaining 14 tables haven't been partitioned in that sense and are "fully processed" each time the refresh function triggers a refresh (effectively each of those 14 tables consists of 1 single large partition, i.e. the whole table).
E.g.: every hour, when our refresh function triggers, the latest 3 partitions of our 2 large tables are re-processed and each of the remaining 14 tables is re-processed fully (since each of those tables is a single big partition).
The refreshes of the data model are performed by a function app whose functions refresh the latest 3 day-wise partitions of the 2 largest tables in the data model, while the other tables are processed whole every time during a refresh.
Currently the function controlling the refresh execution is triggered every hour, at which point it performs a refresh of the data as described above.
The issue we have been facing is that when we observe our memory usage dashboard (see the screenshot below) we tend to get massive spikes in memory usage, which seem to occur during this refresh phase.
In light of this observation we began testing what seems to be causing these periodic spikes and observed the following interesting points:
Spikes align almost perfectly with the scheduled hourly refreshes of our data model.
Leading us to believe the spikes are related to the refresh process in some way.
In between these refreshes the memory usage drops significantly
Further making us believe the spikes are caused by some part of the refresh activity and not by general usage.
Increasing the number of partitions (time window of data for the 2 main tables) from 30 days to 60 days and vice versa causes significantly visible changes in the spikes
If we go up from 30 to 60 days the spike amplitudes increase, and going back down causes them to decrease.
Performing "Defragmentation process" as outlined in the white paper viewable via the link in the Resources section temporarily reduces the usage by a little.
By nature of this process its something that would be need to be performed on a regular basis to ensure continued benefits.
The tables that are fully processed each time (all tables in the main data model barring the 2 that only refresh the last 3 daily partitions) don't seem to have a high impact on the memory usage spikes.
We manually processed some of the largest tables one after another in-between the refreshes and didn't notice a huge jump in the graphs.
Reducing the 3 day-wise partitions to 3 hourly partitions for the 2 main tables' refresh didn't seem to cause a big change either.
We noticed a small drop in memory usage (about 1-2 GB during the hourly refresh), but it didn't have as large an impact as we expected (i.e. proportional to the data reduction). This makes us think that the actual amount of data might not be the primary issue.
Screenshots
Here are some more details on the metrics used. Definitions:
Turquoise Line: Hard memory limit max (same as the max cache size of our AAS tier).
Dark Blue Line: High memory limit max (approx 80% of our Hard Memory limit).
Orange Line: Memory Usage max. More details can be found in the following links: AAS Metrics, Memory Usage Forum Post
Questions
Based on our scenario (described above), what could be the cause of the memory usage spikes during refresh, and how can we reduce and/or manage them in a nice way (ideally removing them entirely or as much as possible)? Basically, keeping usage always well below the turquoise and dark blue lines.
We feel that if we can figure this out it may allow us to stay within our current pricing tier and also potentially allow us to bring in more data (90-120 days of partitions) without the worry of hitting "Out of Memory" health alerts for our instance (which we have been receiving up until now with 60 days).
Note: Barring the current hourly refreshes, we are well within the tier limits in terms of memory usage (orange line much lower than the threshold turquoise and blue lines). Thus solving this could free us to make better use of our AAS resources.
Current Thoughts
We do have calculated columns in our data model. Could this be causing the issue?
What would be the best way to test this?
Resources
Useful links to documentation are placed in this section; hopefully they aid in understanding the context.
GitHub link for the repo containing the tabular model refresh logic we based our process on:
https://github.com/microsoft/Analysis-Services/tree/master/AsPartitionProcessing
In the README.md make sure to click the link to the white paper which provides more detail.
Please try setting MaxParallelism to some low value like 2 or 3 in the ModelConfiguration table. This will reduce the number of parallel tables and partitions it will process at once. This alone probably won’t solve the memory spike issue but it should lower the spike a little at the expense of longer refresh times. If you can deal with this tradeoff and it spikes memory less this may be a workaround.
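As a sketch of what that change might look like (assuming the ModelConfiguration table and MaxParallelism column from the AsPartitionProcessing sample schema; the ID column and value are illustrative, so adjust names to your deployment):

-- Hypothetical example against the AsPartitionProcessing configuration database
UPDATE ModelConfiguration
SET MaxParallelism = 2           -- process at most 2 tables/partitions in parallel
WHERE ModelConfigurationID = 1;  -- the row for your model (example ID)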
Please set IsAvailableInMDX to false on any hidden columns or hidden measure columns which are not put on an axis or referenced directly in an MDX query. This should reduce your memory footprint during processing because it will not build attribute hierarchies for those columns. On high cardinality columns the savings could be significant.
The next thing to try would be to split the tables/partitions into separate ModelConfiguration rows in the database. Then configure it to process one ModelConfiguration then the other sequentially. The goal here would be to process some tables in one transaction and other tables in a separate transaction. That should cause the memory usage required for each transaction to be less. Of course this may impact users in that half of the data will be stale after the first transaction so you will have to judge whether this is feasible.
A more complex optimization would be to scale out AAS and have a dedicated processing node. You could then run a ProcessClear on the model before you fully process it. That should reduce the memory requirements the most. Once processing is done you run the Synchronize command. You could even scale back in, removing the processing node, to save cost for the rest of the hour.
Another option to consider would be to deploy the models to Power BI Premium Gen2. The very interesting nuance with Gen2 is that a P1 capacity allows each dataset to be up to 25GB unlike Gen1 and unlike AAS S1 where the total of all datasets must be less than 25GB. If your organization already owns Power BI Premium capacities this should be a good option. If not then the cost probably won’t make sense at the moment. Or you could license each user with a Power BI Premium Per User license and deploy the model to that Premium Per User capacity. If you have under 70 users this may be a more cost effective option for you to try.
I have 5 questions about the Data Factory V2 Copy Data activity.
Question 1
Should I use a Parquet file or SQL Server (with 500 DTU)? I want to transfer data fast to a staging table or staging Parquet file.
Question 2
For the Copy Data activity's data integration units, should I use Auto or 32?
Question 3
What is the benefit of using degree of copy parallelism, and should I use Auto or 32? Again, I want to transfer everything as quickly as possible; I have around 50 million rows every day.
Question 4
For the Data Flow integration runtime, should I use General Purpose, Compute Optimized, or Memory Optimized? As mentioned, we have 50 million rows every day, so we want to process the data as quickly as possible, and somewhat cheaply if we can, in Data Flow.
Question 5
Is a bulk insert better in Data Factory, or in a Data Flow sink?
I think you have too many questions about too many topics, the answers to which will depend entirely on your desired end result. Even so, I will do my best to briefly address your situation.
If you are dealing with large volume and/or frequency, Data Flow (ADFDF) would probably be better than the Copy activity. ADFDF runs on Spark via Databricks and is built from the ground up to run parallel workloads. Parquet is also built to support parallel workloads. If your SQL is an Azure Synapse (SQL DW) instance, then ADFDF will use PolyBase to manage the upload, which is very fast because it is also built for parallel workloads. I'm not sure how this differs for Azure SQL, and there is no way to tell you what DTU level will work best for your task.
If having Parquet as your end result is acceptable, then that would probably be the easiest and least expensive to configure, since it is just blob storage. ADFDF works just fine with Parquet, as either Source or Sink. For ETL workloads, Compute Optimized is the most likely IR configuration. The good news is it is the least expensive of the three. The bad news is I have no way to know what the core count should be; you'll just have to find out through trial and error. 50 million rows may sound like a lot, but it really depends on the row size (byte count and column count) and frequency. If the process is running many times a day, then you can include a "Time to live" value in the IR configuration. This will keep the cluster warm while it waits for another job, thus potentially reducing startup time (but incurring more run time cost).
I need to capture time-series sensor data in Cassandra. The best practices for handling time-series data in DynamoDB are as follows:
Create one table per time period, provisioned with write capacity less than 1,000 write capacity units (WCUs).
Before the end of each time period, prebuild the table for the next period.
As soon as a table is no longer being written to, reduce its provisioned write capacity. Also reduce the provisioned read capacity of earlier tables as they age, and archive or delete the ones whose contents will rarely or never be needed.
Now I am wondering how I can implement the same concept in Cassandra! Is there any way to manually configure write/read capacity in Cassandra as well?
This really depends on your own requirements, which you need to discuss with development, etc.
There are several ways to handle time-series data in Cassandra:
Have one table for everything. As Chris mentioned, just include the time component in the partition key, like a day, and store data per sensor/day. If the data won't be updated, and you know in advance how long it will be kept so you can set a TTL on the data, then you can use TimeWindowCompactionStrategy. The advantage of this approach is that you have only one table and don't need to maintain multiple tables, which makes development and maintenance easier.
The same approach as you described - create a separate table per period of time, like a month, and write data into it. In this case you can efficiently drop the whole table when the data "expires". Using this approach you can update data if necessary and aren't required to set a TTL on the data. But it requires more work from the development and ops teams, as you need to reach multiple tables. Also, take into account that there are some limits on the number of tables in a cluster - it's recommended not to have more than about 200 tables, as every table requires memory to keep metadata, etc. Although some things, like the bloom filter, can be tuned to occupy less memory for tables that are rarely read.
For Cassandra, just make a single table but include some time period in the partition key (so the partitions do not grow indefinitely and get too large). There is no table maintenance, and read/write capacity really depends more on workload, schema, cluster size, etc., and shouldn't need to be worried about except for sizing the cluster.
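A minimal CQL sketch of that single-table design, assuming day-bucketed partitions, immutable data, and a known retention (all table, column names and values here are examples):

-- One partition per sensor per day; rows clustered by timestamp
CREATE TABLE sensor_data (
    sensor_id text,
    day       date,
    ts        timestamp,
    value     double,
    PRIMARY KEY ((sensor_id, day), ts)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 1}
  AND default_time_to_live = 5184000;  -- 60 days; only appropriate if data is never updated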
This started off as this question, but it now seems more appropriate to ask it specifically, since I realised it is a DTU-related question.
Basically, running:
select count(id) from mytable
EDIT: Adding a where clause does not seem to help.
Is taking between 8 and 30 minutes to run (whereas the same query on a local copy of SQL Server takes about 4 seconds).
Below is a screenshot of the MONITOR tab in the Azure portal when I run this query. Note I did this after not touching the database for about a week, with Azure reporting I had only used 1% of my DTUs.
A couple of extra things:
In this particular test, the query took 8 minutes 27 seconds to run.
While it was running, the above chart actually showed the DTU line at 100% for a period.
The database is configured Standard Service Tier with S1 performance level.
The database is about 3.3GB and this is the largest table (the count is returning approx 2,000,000).
I appreciate it might just be my limited understanding but if somebody could clarify if this is really the expected behaviour (i.e. a simple count taking so long to run and maxing out my DTUs) it would be much appreciated.
From the query stats in your previous question we can see:
300ms CPU time
8000 physical reads
8:30 is about 500sec. We certainly are not CPU bound. 300ms CPU over 500sec is almost no utilization. We get 16 physical reads per second. That is far below what any physical disk can deliver. Also, the table is not fully cached as evidenced by the presence of physical IO.
I'd say you are throttled. S1 corresponds to
934 transactions per minute
for some definition of transaction. That's about 15 transactions/sec. Maybe you are hitting a limit of one physical IO per transaction?! 15 and 16 are suspiciously similar numbers.
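One way to check which resource is actually saturating (on current Azure SQL Database) is the sys.dm_db_resource_stats DMV, which keeps roughly an hour of 15-second samples; a quick look while or shortly after the query runs:

-- Recent resource usage for this database (Azure SQL Database DMV)
SELECT end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;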
Test this theory by upgrading the instance to a higher scale factor. You might find that SQL Azure Database cannot deliver the performance you want at an acceptable price.
You also should find that repeatedly scanning half of the table results in a fast query because the allotted buffer pool seems to fit most of the table (just not all of it).
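If you want to test the throttling theory, scaling up can be done from the portal or with T-SQL against the master database; a minimal sketch (the target tier and database name are examples):

-- Move the database from S1 to S2 to see whether the query speeds up
ALTER DATABASE mydb MODIFY (SERVICE_OBJECTIVE = 'S2');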
I had the same issue. Updating the statistics with fullscan on the table solved it:
update statistics mytable with fullscan
select count should perform a clustered index scan if one is available and it's up to date. Azure SQL should update statistics automatically, but it does not rebuild indexes automatically if they are completely out of date.
If there's a lot of INSERT/UPDATE/DELETE traffic on that table, I suggest manually rebuilding the indexes every once in a while.
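A minimal sketch of such a rebuild (the table name is an example; ONLINE rebuilds are supported in Azure SQL Database):

-- Rebuild all indexes on the table while keeping it available
ALTER INDEX ALL ON mytable REBUILD WITH (ONLINE = ON);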
For more info, see this post on fragmentation in SQL Azure: http://blogs.msdn.com/b/dilkushp/archive/2013/07/28/fragmentation-in-sql-azure.aspx and the SO post "SQL Azure and Indexes".
I'm looking for details of the cloud services popping up (e.g. Amazon/Azure) and am wondering if they would be suitable for my app.
My application basically has a single table database which is about 500GB. It grows by 3-5 GB/Day.
I need to extract text data from it, about 1 million rows at a time, filtering on about 5 columns. This extracted data is usually about 1-5 GB and zips up to 100-500MB and then made available on the web.
There are some details of my existing implementation here
One 400GB table, One query - Need Tuning Ideas (SQL2005)
So, my question:
Would the existing cloud services be suitable to host this type of app? What would the cost be to store this amount of data and bandwidth (bandwidth usage would be about 2GB/day)?
Are the persistence systems suitable for storing large flat tables like this, and do they offer the ability to search on a number of columns?
My current implementation runs on sub $10k hardware so it wouldn't make sense to move if costs are much higher than, say, $5k/yr.
Given the large volume of data and the rate at which it's growing, I don't think that Amazon would be a good option. I'm assuming that you'll want to be storing the data on persistent storage, but with EC2 you need to allocate a given amount of storage and attach it as a disk. Unless you want to allocate a really large amount of space (and then pay for unused disk space), you will have to constantly be adding more disks. I did a quick back-of-the-envelope calculation and I estimate it will cost between $2,500 and $10,000 per year for hosting. It's difficult for me to estimate accurately because of all the variable things that Amazon charges for (instance uptime, storage space, bandwidth, disk I/O, etc.). Here's the EC2 pricing.
Assuming that this is non-relational data (can't do relational data on a single table) you could consider using Azure Table Storage which is a storage mechanism designed for non-relational structured data.
The problem you will have here is that Azure Tables only have a primary index and therefore cannot be indexed by 5 columns as you require - unless you store the data 5 times, indexed each time by the column you wish to filter on. Not sure that would work out to be very cost-effective, though.
Costs for Azure Table storage start from as little as 8 US cents per GB per month, depending on how much data you store. There are also charges per transaction and charges for egress data.
For more info on pricing check here; http://www.windowsazure.com/en-us/pricing/calculator/advanced/
Where do you need to access this data from?
How is it written to?
Based on this there could be other options to consider too, like Azure Drives etc.