Database design issue for multi-tenant application - Azure

We have an application that does a lot of data-heavy work on the server for a multi-tenant workspace.
Here is what it does:
It loads data from files in different formats.
It executes idempotence rules based on the logic defined.
It executes processing logic, such as adding discounts based on a user's country or calculating tax amounts. These rules are specific to each tenant.
It generates refreshed data for bulk edit.
Now, after this processing is done, the tenant goes to the interface, makes some bulk-edit overrides to users, and finally downloads the result in some format.
We have tried several solutions before, such as:
Doing it in one SQL database where each tenant is separated by a tenant ID.
Doing it in Azure Blob Storage.
Loading it from files on the file system.
But none of these performed well. So the current design is:
We have a central database which keeps track of all the customers' databases.
We have a number of database elastic pools in Azure.
When a new tenant comes in, we create a database, do all the processing for its users, and notify the tenant to do the manual job.
When they have downloaded all the data, we keep the database for future use.
Now, as you know, elastic pools have a limit on the number of databases, which led us to create multiple elastic pools, and this keeps increasing our Azure cost immensely, while 90% of the databases are not in use at any given point in time. We already have more than 10 elastic pools, each containing 500 databases.
Proposed changes:
As we are gradually incurring more and more Azure cost, we are thinking about how to reduce it.
What I am proposing is:
We create one elastic pool with a 500-database limit and enough DTUs.
In this pool, we create blank databases.
When a customer comes in, their data is loaded into one of the blank databases.
It does all the calculations and notifies the tenant for the manual job.
When the manual job is done, we keep the database for the next 7 days.
After 7 days, we back up the database to Azure Blob Storage and run the cleanup job on the database.
Finally, if the same customer comes in again, we restore the backup onto a blank database and continue. (This step might take 15-20 minutes to set up, which is fine for us, but if we can reduce it, even better.)
What do you think is best suited for this kind of problem?
Our objective is to reduce Azure cost while providing the best solution to our customers. Please suggest any architecture that you think would be best suited to this scenario.
Each customer can have millions of records... we even see customers with 50-100 GB databases... and each tenant has a different workload.
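To make the setup concrete, the central catalog can be sketched roughly like this (a simplified illustration; the table and column names are not our actual schema):
CREATE TABLE dbo.TenantCatalog
(
    TenantId        int            NOT NULL PRIMARY KEY,
    ServerName      nvarchar(128)  NOT NULL,
    DatabaseName    sysname        NOT NULL,      -- which pooled database currently holds this tenant
    Status          varchar(20)    NOT NULL,      -- e.g. 'Processing', 'Active', 'Archived'
    LastDownloadUtc datetime2      NULL,          -- drives the "archive after 7 days" job
    BackupBlobUri   nvarchar(400)  NULL           -- set once the backup has been copied to Blob Storage
);
-- Example lookup when a tenant comes back:
-- SELECT ServerName, DatabaseName, Status FROM dbo.TenantCatalog WHERE TenantId = @TenantId;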

Here is where the problem starts:
"[...] When they have downloaded all the data we keep the Database for future."
This is very wrong because it leads to:
"[...] keeping on increasing the Azure Cost immensely, while 90% of the databases are not in use at a given point of time. We already have more than 10 elastic pools each consisting of 500 databases."
This is not only a problem of costs but also a problem with security compliance.
How long should you store those data?
Are these data complying with what county policy?
Here is my 2 solution:
It goes by itself that if you don't need those data you just have to delete those databases. You will lower your costs immediately
If you cannot delete them, because they are not in use, switch from Elastic Pool to Serverless
EDIT:
Azure SQL Database gets expensive only when you use them.
If they are unused they will cost nothing. But "unused" means no connections to it. If you have some internal tool that wakes them up ever hours they will never fall in serverless state so you will pay a lot.
HOW TO TEST SERVERLESS:
Take a database that you you know it's unused and put it in serverless state for 1 week; you will see how the cost of that database drop on the Cost Management. And of course, take it off from the Elastc Pool.
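Moving a single database out of the pool and into the serverless tier can be done with T-SQL; a minimal example (the database name is a placeholder and GP_S_Gen5_1 is just one of the serverless General Purpose service objectives):
ALTER DATABASE [Tenant_1234]
MODIFY (EDITION = 'GeneralPurpose', SERVICE_OBJECTIVE = 'GP_S_Gen5_1');
As far as I know, the auto-pause delay itself is configured from the portal, PowerShell, or the REST API rather than from T-SQL.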
You can run this query on the master database:
DECLARE @StartDate date = DATEADD(day, -30, GETDATE()) -- last 30 days; note that sys.resource_stats only retains about 14 days of history
SELECT
     @@SERVERNAME AS ServerName
    ,database_name AS DatabaseName
    ,sysso.edition
    ,sysso.service_objective
    ,(SELECT TOP 1 dtu_limit FROM sys.resource_stats AS rs3 WHERE rs3.database_name = rs1.database_name ORDER BY rs3.start_time DESC) AS DTU
    /*,(SELECT TOP 1 storage_in_megabytes FROM sys.resource_stats AS rs2 WHERE rs2.database_name = rs1.database_name ORDER BY rs2.start_time DESC) AS StorageMB */
    /*,(SELECT TOP 1 allocated_storage_in_megabytes FROM sys.resource_stats AS rs4 WHERE rs4.database_name = rs1.database_name ORDER BY rs4.start_time DESC) AS Allocated_StorageMB*/
    ,avcon.AVG_Connections_per_Hour
    ,CAST(MAX(storage_in_megabytes) / 1024 AS DECIMAL(10, 2)) AS StorageGB
    ,CAST(MAX(allocated_storage_in_megabytes) / 1024 AS DECIMAL(10, 2)) AS Allocated_StorageGB
    ,MIN(end_time) AS StartTime
    ,MAX(end_time) AS EndTime
    ,CAST(AVG(avg_cpu_percent) AS DECIMAL(4, 2)) AS Avg_CPU
    ,MAX(avg_cpu_percent) AS Max_CPU
    ,(COUNT(database_name) - SUM(CASE WHEN avg_cpu_percent >= 40 THEN 1 ELSE 0 END) * 1.0) / COUNT(database_name) * 100 AS [CPU Fit %]
    ,CAST(AVG(avg_data_io_percent) AS DECIMAL(4, 2)) AS Avg_IO
    ,MAX(avg_data_io_percent) AS Max_IO
    ,(COUNT(database_name) - SUM(CASE WHEN avg_data_io_percent >= 40 THEN 1 ELSE 0 END) * 1.0) / COUNT(database_name) * 100 AS [Data IO Fit %]
    ,CAST(AVG(avg_log_write_percent) AS DECIMAL(4, 2)) AS Avg_LogWrite
    ,MAX(avg_log_write_percent) AS Max_LogWrite
    ,(COUNT(database_name) - SUM(CASE WHEN avg_log_write_percent >= 40 THEN 1 ELSE 0 END) * 1.0) / COUNT(database_name) * 100 AS [Log Write Fit %]
    ,CAST(AVG(max_session_percent) AS DECIMAL(4, 2)) AS [Average % of sessions]
    ,MAX(max_session_percent) AS [Maximum % of sessions]
    ,CAST(AVG(max_worker_percent) AS DECIMAL(4, 2)) AS [Average % of workers]
    ,MAX(max_worker_percent) AS [Maximum % of workers]
FROM sys.resource_stats AS rs1
INNER JOIN sys.databases AS dbs ON rs1.database_name = dbs.name
INNER JOIN sys.database_service_objectives AS sysso ON sysso.database_id = dbs.database_id
INNER JOIN
(
    SELECT t.name
          ,ROUND(AVG(CAST(t.Count_Connections AS FLOAT)), 2) AS AVG_Connections_per_Hour
    FROM (
        SELECT name
              --,database_name
              --,success_count
              --,start_time
              ,CONVERT(DATE, start_time) AS Dating
              ,DATEPART(HOUR, start_time) AS Houring
              ,SUM(CASE
                       WHEN name = database_name
                           THEN success_count
                       ELSE 0
                   END) AS Count_Connections
        FROM sys.database_connection_stats
        CROSS JOIN sys.databases
        WHERE start_time > @StartDate
            AND database_id != 1
        GROUP BY name
                ,CONVERT(DATE, start_time)
                ,DATEPART(HOUR, start_time)
    ) AS t
    GROUP BY t.name
) AS avcon ON avcon.name = rs1.database_name
WHERE start_time > @StartDate
GROUP BY database_name, sysso.edition, sysso.service_objective, avcon.AVG_Connections_per_Hour
ORDER BY database_name, sysso.edition, sysso.service_objective
The query returns statistics for all the databases on the server.
AVG_Connections_per_Hour: based on data from the last 30 days.
All AVG and MAX statistics: based on data from the last 14 days (sys.resource_stats only retains about that much history).

Pick one provider and host the workloads there. On demand, fan out to other cloud providers when needed.
This solution requires minimal data transfer.

You could perhaps denormalise the data you need and store it in ClickHouse. It is a fast column-oriented database for online analytical processing (OLAP), meaning that you can run queries which compute discounts on the fly, and it is very fast: millions to billions of rows per second. You query it with its SQL dialect, which is intuitive, powerful and can be extended with Python/C++.
You could try the approach you used before, but with ClickHouse and a distributed deployment:
"Doing it in one SQL database where each tenant is separated by a tenant ID"
A ClickHouse cluster can be deployed on Kubernetes using the Altinity operator; it is free and you only pay for the resources. Paid and managed options are also available.
ClickHouse also supports lots of integrations, which means that you can stream data into it from Kafka or RabbitMQ, or load it from local files / S3 files.
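As a rough sketch of what the single-table, tenant-id approach could look like in ClickHouse (table and column names are made up for illustration), ordering by tenant_id keeps each tenant's rows together on disk:
CREATE TABLE tenant_records
(
    tenant_id UInt32,
    user_id   UInt64,
    country   LowCardinality(String),
    amount    Decimal(18, 2),
    loaded_at DateTime
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(loaded_at)
ORDER BY (tenant_id, user_id);

-- Example of computing a discount on the fly for one tenant:
SELECT user_id,
       sum(amount) AS total,
       sum(amount) * if(country = 'DE', 0.10, 0.05) AS discount
FROM tenant_records
WHERE tenant_id = 42
GROUP BY user_id, country;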
I've been running a test ClickHouse cluster with 150M rows and 70 columns, mostly Int64 fields. A query with 140 filters across all the columns took about 7-8 seconds under light load and 30-50 seconds under heavy load. The cluster had 5 members (2 shards, 3 replicas).
Note: I'm not affiliated with ClickHouse, I just like the database. You could also look for another OLAP alternative on Azure.

Related

Azure Cosmos SQL queries very slow, possible to optimize performance?

I am querying a collection that holds about 250k items, and I need to pull back most of it. The items are partitioned by week number. I've tried slicing the query up into many smaller queries, but the fastest I can get data back is ~500 items/sec. I only want items with a date within the past 3 months. The querying is done client-side from my Vue web app. Any links or suggestions are welcome.
SELECT
c.weeknumber,
c.item_txt,
c.item,
c.trandate,
c.trandate_unix,
c.trandate_iso,
StringToNumber(c.TOTAL_QTY) AS total_qty,
StringToNumber(c.MA_TOTAL_QTY) AS ma_total_qty,
StringToNumber(c.IN_TOTAL_QTY) AS in_total_qty,
StringToNumber(c.CA_TOTAL_QTY) AS ca_total_qty
FROM c
WHERE c.trandate_unix >= ${threeMoAgo_start / 1000} AND c.trandate_unix <= ${threeMoAgo_end / 1000}

Inserting 1000 rows into Azure Database takes 13 seconds?

Can anyone please tell me why it might be taking 12+ seconds to insert 1000 rows into a SQL database hosted on Azure? I'm just getting started with Azure, and this is (obviously) absurd...
Create Table xyz (ID int primary key identity(1,1), FirstName varchar(20))
GO
create procedure InsertSomeRows as
set nocount on
Declare @StartTime datetime = getdate()
Declare @X int = 0;
While @X < 1000
Begin
    insert into xyz (FirstName) select 'john'
    Set @X = @X + 1;
End
Select count(*) as Rows, DateDiff(SECOND, @StartTime, GetDate()) as SecondsPassed
from xyz
GO
Exec InsertSomeRows
Exec InsertSomeRows
Exec InsertSomeRows
GO
Drop Table xyz
Drop Procedure InsertSomeRows
Output:
Rows SecondsPassed
----------- -------------
1000 11
Rows SecondsPassed
----------- -------------
2000 13
Rows SecondsPassed
----------- -------------
3000 14
It's likely the performance tier you are on that is causing this. With a Standard S0 tier you only have 10 DTUs (Database Throughput Units). If you haven't already, read up on the SQL Database service tiers. If you aren't familiar with DTUs, it is a bit of a shift from on-premises SQL Server: the amount of CPU, memory, log IO and data IO are all wrapped up in which service tier you select. Just like on premises, if you start to hit the upper bounds of what your machine can handle, things slow down, start to queue up and eventually start timing out.
Run your test again just as you have been doing, but use the Azure Portal to watch the DTU % used while the test is underway. If you see that the DTU % is getting maxed out, then the issue is that you've chosen a service tier that doesn't have enough resources to handle the load you've applied without slowing down. If the speed isn't acceptable, move up to the next service tier until it is. You pay more for more performance.
I'd recommend not paying too close attention to the service tier based on this test, but rather on the actual load you want to apply to the production system. This test will give you an idea and a better understanding of DTUs, but it may or may not represent the actual throughput you need for your production loads (which could be even heavier!).
Don't forget that in Azure SQL DB you can also scale your database as needed, so that you have the performance you need but can scale back down during times you don't. The database will be accessible during most of the scaling operation (though note it can take some time to complete, and there may be a second or two where you cannot connect).
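For what it's worth, the scale operation can be issued from T-SQL as well as from the portal; a minimal example (the database name and target tier here are just placeholders):
ALTER DATABASE [MyDatabase]
MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S2');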
Two factors made the biggest difference. First, I wrapped all the inserts into a single transaction. That got me from 100 inserts per second to about 2500. Then I upgraded the server to a PREMIUM P4 tier and now I can insert 25,000 per second (inside a transaction.)
It's going to take some getting used to using an Azure server and what best practices give me the results I need.
My theory: Each insert is one log IO. Here, this would be 100 IOs/sec. That sounds like a reasonable limit on an S0. Can you try with a transaction wrapped around the inserts?
So wrapping the inserts in a single transaction did indeed speed this up. Inside the transaction it can insert about 2,500 rows per second.
So that explains it. Now the results are no longer catastrophic. I would now advise looking at metrics such as the Azure dashboard DTU utilization and wait stats. If you post them here I'll take a look.
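For completeness, here is a sketch of the same test procedure with the loop wrapped in one explicit transaction (same xyz table as above), which is the change that took it from ~100 to ~2,500 inserts per second:
create procedure InsertSomeRowsTran as
set nocount on
Declare @StartTime datetime = getdate()
Declare @X int = 0
Begin Tran
While @X < 1000
Begin
    insert into xyz (FirstName) values ('john')
    Set @X = @X + 1
End
Commit
Select count(*) as Rows, DateDiff(SECOND, @StartTime, GetDate()) as SecondsPassed from xyz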
One way to improve performance is to look at the wait stats of the query.
Looking at wait stats will show you the exact bottleneck while a query is running. In your case, it turned out to be log IO. Look here to learn more about this approach: SQL Server Performance Tuning Using Wait Statistics
I also recommend changing the while loop to something set-based, if this is not a pseudo query and you are running it very often.
Set based solution:
create proc usp_test
(
    @n int
)
as
begin
    begin try
        begin tran
            insert into yourtable
            select n, 'John'
            from numbers
            where n < @n
        commit
    end try
    begin catch
        -- handle errors and roll back if needed
        if @@trancount > 0 rollback
    end catch
end
You will have to create a numbers table for this to work.
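One quick way to build such a numbers table (the table name and the 100,000-row count are just an example):
select top (100000) row_number() over (order by (select null)) as n
into numbers
from sys.all_objects a
cross join sys.all_objects b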
I had terrible performance problems with updates & deletes in Azure until I discovered a few techniques:
Copy data to a temporary table and make updates in the temp table, then copy back to the permanent table when done (see the sketch after this list).
Create a clustered index on the table being updated (partitioning didn't work as well)
For inserts, I am using bulk inserts and getting acceptable performance.
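A minimal sketch of that temp-table pattern, with illustrative table and column names (not taken from the question):
-- Stage the rows to be changed
SELECT Id, Price
INTO #staging
FROM dbo.TenantRecords
WHERE Country = 'DE';

-- Do the heavy updates off the main table
UPDATE #staging
SET Price = Price * 0.9;

-- Copy the results back in one pass
UPDATE t
SET t.Price = s.Price
FROM dbo.TenantRecords AS t
INNER JOIN #staging AS s ON s.Id = t.Id;

DROP TABLE #staging;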

Oracle: Is there a way to check which sql_ids were downgraded to serial or a lesser degree over a period of time

I would like to know if there is a way to check which sql_ids were downgraded to either serial or a lesser degree in an Oracle 4-node RAC data warehouse, version 11.2.0.3. I want to write a script to check which queries are being downgraded.
SELECT NAME, inst_id, VALUE FROM GV$SYSSTAT
WHERE UPPER (NAME) LIKE '%PARALLEL OPERATIONS%'
OR UPPER (NAME) LIKE '%PARALLELIZED%' OR UPPER (NAME) LIKE '%PX%'
NAME VALUE
queries parallelized 56083
DML statements parallelized 6
DDL statements parallelized 160
DFO trees parallelized 56249
Parallel operations not downgraded 56128
Parallel operations downgraded to serial 951
Parallel operations downgraded 75 to 99 pct 0
Parallel operations downgraded 50 to 75 pct 0
Parallel operations downgraded 25 to 50 pct 119
Parallel operations downgraded 1 to 25 pct 2
Does it ever refresh? What conclusions can be drawn from the above output? Is it for a day? A month? An hour? Since startup?
This information is stored as part of Real-Time SQL Monitoring. But it requires licensing the Diagnostics and Tuning packs, and it only stores data for a short period of time.
Oracle 12c can supposedly store SQL Monitoring data for longer periods of time. If you don't have Oracle 12c, or if you don't have those options licensed, you'll need to create your own monitoring tool.
Real-Time SQL Monitoring of Parallel Downgrades
select /*+ parallel(1000) */ * from dba_objects;
select sql_id, sql_text, px_servers_requested, px_servers_allocated
from v$sql_monitor
where px_servers_requested <> px_servers_allocated;
SQL_ID SQL_TEXT PX_SERVERS_REQUESTED PX_SERVERS_ALLOCATED
6gtf8np006p9g select /*+ parallel ... 3000 64
Creating a (Simple) Historical Monitoring Tool
Simplicity is the key here. Real-Time SQL Monitoring is deceptively simple and you could easily spend weeks trying to recreate even a tiny portion of it. Keep in mind that you only need to sample a very small amount of all activity to get enough information to troubleshoot. For example, just store the results of GV$SESSION or GV$SQL_MONITOR (if you have the license) every minute. If the query doesn't show up from sampling every minute then it's not a performance issue and can be ignored.
For example: create a table with create table downgrade_check(sql_id varchar2(100), total number), and create a DBMS_SCHEDULER job that runs insert into downgrade_check select sql_id, count(*) total from gv$session where sql_id is not null group by sql_id;. (The count from GV$SESSION will rarely be exactly the same as the DOP.)
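A minimal sketch of such a job (the job name and the one-minute interval are arbitrary choices):
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'DOWNGRADE_SAMPLER',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'begin insert into downgrade_check select sql_id, count(*) from gv$session where sql_id is not null group by sql_id; commit; end;',
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=1',
    enabled         => TRUE);
END;
/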
Other Questions
V$SYSSTAT is updated pretty frequently (every few seconds?), and represents the total number of events since the instance started.
It's difficult to draw many conclusions from those numbers. In my experience, having only 2% of your statements downgraded is a good sign. You likely have good (usually default) settings and not too many parallel jobs running at once.
However, some parallel queries run for seconds and some run for weeks. If the wrong job is downgraded even a single downgrade can be disastrous. Storing some historical session information (or using DBA_HIST_ACTIVE_SESSION_HISTORY) may help you find out if your critical jobs were affected.

Understanding Azure SQL Performance

The facts:
1 Azure SQL S0 instance
a few tables, one of them containing ~8.6 million rows and 1 PK
Running a COUNT query on this table takes nearly 30 minutes (!) to complete.
Upscaling the instance from S0 to S1 reduces the query time to 13 minutes.
Looking at the resource usage monitor in the Azure Portal (new version), usage sits at 100% for the duration of the query (screenshots omitted).
Questions:
Does anyone else consider even 13 minutes ridiculous for a simple COUNT()?
Does the second screenshot mean that during the 100% period my instance isn't responding to other requests?
Why are my metrics limited to 100% on both S0 and S1? (See the section "Which Service Tier is Right for My Database?", which states: "These values can be above 100% (a big improvement over the values in the preview that were limited to a maximum of 100).") I'd expect the S0 to be at 150% or so if the quoted statement is true.
I'm interested in other people's experiences with databases of more than 1.000 records or so. I don't see how an S*-scaled Azure SQL database at 22-55 € per month could help me with upscaling strategies at the moment.
Azure SQL Database editions provide increasing levels of DTUs from Basic -> Standard -> Premium levels (CPU,IO,Memory and other resources - see https://msdn.microsoft.com/en-us/library/azure/dn741336.aspx). Once your query reaches its limits of DTU (100%) in any of these resource dimensions, it will continue to receive these resources at that level (but not more) and that may increase the latency in completing the request. It looks like in your scenario above, the query is hitting its DTU limit (10 DTUs for S0 and 20 for S1). You can see the individual resource usage percentages (CPU, Data IO or Log IO) by adding these metrics to the same graph, or by querying the DMV sys.dm_db_resource_stats.
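For example, you can check the last half hour of resource usage from inside the database itself; sys.dm_db_resource_stats returns one row per 15-second interval:
SELECT TOP (120)
       end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;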
Here is a blog that provides more information on appropriately sizing your database performance levels. http://azure.microsoft.com/blog/2014/09/11/azure-sql-database-introduces-new-near-real-time-performance-metrics/
To your specific questions
1) As you have 8.6 million rows, the database needs to scan the index entries to get the count back, so it may be hitting the IO limit for the edition here.
2) If you have multiple concurrent queries running against your DB, they will be scheduled appropriately so as not to starve one request or the other. But latencies may increase further for all queries, since you will be hitting the available resource limits.
3) For the older Web/Business editions, you may see metric values going beyond 100% (they are normalized to the limits of an S2 level), as those editions don't have any specific limits and run in a resource-shared environment with other customer loads. For the new editions, metrics will never exceed 100%, because the system guarantees you resources up to 100% of that edition's limits, but no more. This provides a predictable, guaranteed amount of resources for your DB, unlike the Web/Business editions, where you may get very little or a lot more at different times depending on other customers' DB workloads running on the same machine.
Hope this helps.
-- Srini

Comparing the new SQL Azure tiers to old ones [closed]

Now that Microsoft made the new SQL Azure service tiers available (Basic, Standard, Premium) we are trying to figure out how they map to the existing ones (Web and Business).
Essentially, there are six performance levels in the new tier breakdown: Basic, S1, S2, P1, P2 and P3 (details here: http://msdn.microsoft.com/library/dn741336.aspx)
Does anyone know how the old database tiers map to those six levels? For instance, is Business the equivalent of an S1? An S2?
We need to be able to answer this question in order to figure out what service tiers/levels to migrate our existing databases to.
We just finished a performance comparison.
I can't publish our SQL queries, but we used 3 different test cases that match our normal activity. In each test case, we performed several queries with table joins and aggregate calculations (SUM, AVG, etc) for a few thousand rows. Our test database is modest - about 5GB in size with a few million rows.
A few notes: for each test case, we tested my local machine, a 5-year-old iMac running Windows/SQL Server in a virtual machine ("Local"), SQL Azure Business ("Business"), SQL Azure Premium P1, SQL Azure Standard S2, and SQL Azure Standard S1. The Basic tier seemed so slow that we didn't test it. All of these tests were done with no other activity on the system. The queries did not return data, so network performance was hopefully not a factor.
Here were our results:
Test One
Local: 1 second
Business: 2 seconds
P1: 2 seconds
S2: 4 seconds
S1: 14 seconds
Test Two
Local: 2 seconds
Business: 5 seconds
P1: 5 seconds
S2: 10 seconds
S1: 30 seconds
Test Three
Local: 5 seconds
Business: 12 seconds
P1: 13 seconds
S2: 25 seconds
S1: 77 seconds
Conclusions:
After working with the different tiers for a few days, our team concluded a few things:
P1 appears to perform at the same level as SQL Azure Business. (P1 is 10x the price)
Basic and S1 are way too slow for anything but a starter database.
The Business tier is a shared service, so performance depends on what other users are on your server. Our database shows a max of 4.01% CPU, 0.77% Data IO and 0.14% Log IO, yet we're experiencing major performance problems and timeouts. Microsoft Support confirmed that we are "just on a really busy server."
The Business tier delivers inconsistent service across servers and regions. In our case, we moved to a different server in a different region and our service is back to normal. (We view that as a temporary solution.)
The S1, S2 and P1 tiers seem to provide the same performance across regions. We tested West and North Central.
Considering the results above, we're generally worried about the future of SQL Azure. The Business tier has been great for us for a few years, but it's scheduled to go out of service in 12 months. The new tiers seem overpriced compared to the Business tier.
I'm sure there are 100 ways this could be more scientific, but I'm hoping those stats help others getting ready to evaluate.
UPDATE:
Microsoft Support sent us a very helpful query to assess your database usage:
SELECT
avg(avg_cpu_percent) AS 'Average CPU Percentage Used',
max(avg_cpu_percent) AS 'Maximum CPU Percentage Used',
avg(avg_physical_data_read_percent) AS 'Average Physical IOPS Percentage',
max(avg_physical_data_read_percent) AS 'Maximum Physical IOPS Percentage',
avg(avg_log_write_percent) AS 'Average Log Write Percentage',
max(avg_log_write_percent) AS 'Maximum Log Write Percentage',
--avg(avg_memory_percent) AS 'Average Memory Used Percentage',
--max(avg_memory_percent) AS 'Maximum Memory Used Percentage',
avg(active_worker_count) AS 'Average # of Workers',
max(active_worker_count) AS 'Maximum # of Workers'
FROM sys.resource_stats
WHERE database_name = 'YOUR_DATABASE_NAME' AND
start_time > DATEADD(day, -7, GETDATE())
The most useful part is that the percentages represent % of an S2 instance. According to Microsoft Support, if you're at 100%, you're using 100% of an S2, 200% would be equivalent to a P1 instance.
We're having very good luck with P1 instances now, although the price difference has been a shocker.
I am the author of the Azure SQL Database Performance Testing blog posts mentioned above.
Making IOPS to DTU comparisons is quite difficult for Azure SQL Database, which is why I focussed on row counts and throughput rates (in MB per second) in my tests.
I would be cautious about using the transaction rates quoted by Microsoft - their benchmark databases are rather small. E.g., for the Standard tier, which has a capacity of 250 GB, their benchmark databases for S1 and S2 are only 2 GB and 7 GB respectively. At these sizes I suggest SQL Server is caching much/most of the database, and as such their benchmark avoids the worst of the read throttling that is likely to impact real-world databases.
I have added a new post regarding the new Service Tiers hitting General Availability and making some estimates of the changes in performance around S0 and S1 at GA.
http://cbailiss.wordpress.com/2014/09/16/performance-in-new-azure-sql-database-performance-tiers/
There is not really any kind of mapping between the old and new offerings, as near as I can tell. With the old offerings, the only thing that was really different between the "Web" and "Business" editions was the maximum size the database was limited to.
However, with the new offerings each tier has performance metrics associated with it. So, in order to decide which offering to move your existing databases to, you need to figure out what kind of performance your application needs.
In terms of size, it appears Web and Business fall between Basic and S1. Here's a link with a chart comparing the new and old tiers. Honestly, it seems a little apples-to-oranges, so there isn't a direct mapping. Here's also a link specifically addressed to people currently on the Web and Business tiers.
Comparison of Tiers
Web and Business Edition Sunset FAQ
