Azure Cosmos SQL queries very slow, possible to optimize performance?

I'm querying a collection that holds about 250k items, and I need to pull back most of it. The items are partitioned by weeknumber. I've tried slicing the query up into many smaller queries, but it seems the fastest I can get data back is ~500 items/sec. I only want items with a date within the past 3 months. The querying is done client-side from my Vue web app. Any links or suggestions are welcome.
SELECT
c.weeknumber,
c.item_txt,
c.item,
c.trandate,
c.trandate_unix,
c.trandate_iso,
StringToNumber(c.TOTAL_QTY) AS total_qty,
StringToNumber(c.MA_TOTAL_QTY) AS ma_total_qty,
StringToNumber(c.IN_TOTAL_QTY) AS in_total_qty,
StringToNumber(c.CA_TOTAL_QTY) AS ca_total_qty
FROM c
WHERE c.trandate_unix >= ${threeMoAgo_start / 1000} AND c.trandate_unix <= ${threeMoAgo_end / 1000}
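For context, here is a minimal sketch of the kind of per-week slicing described above, using the @azure/cosmos JavaScript SDK; the endpoint, key, database, container, and function names are placeholders, and the date bounds are passed in as Unix seconds:

const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({ endpoint: "https://myaccount.documents.azure.com", key: "<key>" });
const container = client.database("mydb").container("items");

// One query per weeknumber partition for the requested date range, issued in parallel.
async function fetchRecentItems(weeknumbers, startUnix, endUnix) {
  const querySpec = {
    query: `SELECT c.weeknumber, c.item_txt, c.item, c.trandate, c.trandate_unix, c.trandate_iso,
                   StringToNumber(c.TOTAL_QTY) AS total_qty
            FROM c
            WHERE c.trandate_unix >= @start AND c.trandate_unix <= @end`,
    parameters: [
      { name: "@start", value: startUnix },
      { name: "@end", value: endUnix },
    ],
  };

  const results = await Promise.all(
    weeknumbers.map((week) =>
      container.items
        .query(querySpec, { partitionKey: week, maxItemCount: 1000 }) // single-partition query per week
        .fetchAll()
    )
  );
  return results.flatMap((r) => r.resources);
}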

Related

Database design issue for Multi-tenant application

We have an application that does a lot of data-heavy work on the server for a multi-tenant workspace.
Here is what it does:
It loads data from files in different file formats.
It executes idempotence rules based on the defined logic.
It executes processing logic, like adding country-based discounts for users, calculating tax amounts, etc. These are specific to each tenant.
It generates refreshed data for bulk editing.
After this processing is done, the tenant goes to the interface, applies some bulk-edit overrides to users, and finally downloads the data in some format.
We have tried a lot of solutions before, such as:
Doing it in one SQL database where each tenant is separated by tenant id.
Doing it in Azure blobs.
Loading it from file-system files.
But none of them performed well. So what is presently designed is:
We have a central database which keeps track of all the customers' databases.
We have a number of database elastic pools in Azure.
When a new tenant comes in, we create a database, do all the processing for its users, and notify the user to do the manual job.
When they have downloaded all the data, we keep the database for future use.
Now, as you know, an elastic pool has a limit on the number of databases, which led us to create multiple elastic pools and keeps driving our Azure cost up immensely, while 90% of the databases are not in use at any given point in time. We already have more than 10 elastic pools, each holding 500 databases.
Proposed Changes:
As we are gradually incurring more and more Azure cost, we are thinking about how to reduce it.
What I am proposing is:
We create one elastic pool, which has the 500-database limit, with enough DTUs.
In this pool, we will create blank databases.
When a customer comes in, the data is loaded into one of the blank databases.
It does all the calculations and notifies the tenant to do the manual job.
When the manual job is done, we keep the database for the next 7 days.
After 7 days, we back up the database to Azure Blob storage and run the cleanup job on the database.
Finally, if the same customer comes in again, we restore the backup onto a blank database and continue. (This step might take 15-20 minutes to set up, which is fine for us, but it would be even better if we could reduce it.)
What do you think is best suited for this kind of problem?
Our objective is to reduce the Azure cost while also providing the best solution to our customers. Please suggest any architecture that you think would be best suited to this scenario.
Each customer can have millions of records... we even see customers with 50-100 GB databases... and each tenant has a different workload.
Here is where the problem starts:
"[...] When they have downloaded all the data, we keep the database for future use."
This is very wrong because it leads to:
"[...] keeps driving our Azure cost up immensely, while 90% of the databases are not in use at any given point in time. We already have more than 10 elastic pools, each holding 500 databases."
This is not only a cost problem but also a security-compliance problem.
How long should you keep that data?
Does keeping it comply with the relevant country's data policies?
Here are my two solutions:
It goes without saying that if you don't need that data, you should just delete those databases. You will lower your costs immediately.
If you cannot delete them even though they are not in use, switch from the elastic pool to serverless.
EDIT:
A serverless Azure SQL database gets expensive only while you actually use it.
If it is unused it will cost next to nothing. But "unused" means no connections to it: if you have some internal tool that wakes the databases up every hour, they will never auto-pause, so you will pay a lot.
HOW TO TEST SERVERLESS:
Take a database that you know is unused and put it in the serverless tier for one week; you will see the cost of that database drop in Cost Management. And of course, take it out of the elastic pool first.
You can run this query on the master database:
DECLARE @StartDate date = DATEADD(day, -30, GETDATE()) -- 30 days; sys.resource_stats only retains ~14 days
SELECT
@@SERVERNAME AS ServerName
,database_name AS DatabaseName
,sysso.edition
,sysso.service_objective
,(SELECT TOP 1 dtu_limit FROM sys.resource_stats AS rs3 WHERE rs3.database_name = rs1.database_name ORDER BY rs3.start_time DESC) AS DTU
/*,(SELECT TOP 1 storage_in_megabytes FROM sys.resource_stats AS rs2 WHERE rs2.database_name = rs1.database_name ORDER BY rs2.start_time DESC) AS StorageMB */
/*,(SELECT TOP 1 allocated_storage_in_megabytes FROM sys.resource_stats AS rs4 WHERE rs4.database_name = rs1.database_name ORDER BY rs4.start_time DESC) AS Allocated_StorageMB*/
,avcon.AVG_Connections_per_Hour
,CAST(MAX(storage_in_megabytes) / 1024 AS DECIMAL(10, 2)) StorageGB
,CAST(MAX(allocated_storage_in_megabytes) / 1024 AS DECIMAL(10, 2)) Allocated_StorageGB
,MIN(end_time) AS StartTime
,MAX(end_time) AS EndTime
,CAST(AVG(avg_cpu_percent) AS decimal(4,2)) AS Avg_CPU
,MAX(avg_cpu_percent) AS Max_CPU
,(COUNT(database_name) - SUM(CASE WHEN avg_cpu_percent >= 40 THEN 1 ELSE 0 END) * 1.0) / COUNT(database_name) * 100 AS [CPU Fit %]
,CAST(AVG(avg_data_io_percent) AS decimal(4,2)) AS Avg_IO
,MAX(avg_data_io_percent) AS Max_IO
,(COUNT(database_name) - SUM(CASE WHEN avg_data_io_percent >= 40 THEN 1 ELSE 0 END) * 1.0) / COUNT(database_name) * 100 AS [Data IO Fit %]
,CAST(AVG(avg_log_write_percent) AS decimal(4,2)) AS Avg_LogWrite
,MAX(avg_log_write_percent) AS Max_LogWrite
,(COUNT(database_name) - SUM(CASE WHEN avg_log_write_percent >= 40 THEN 1 ELSE 0 END) * 1.0) / COUNT(database_name) * 100 AS [Log Write Fit %]
,CAST(AVG(max_session_percent) AS decimal(4,2)) AS 'Average % of sessions'
,MAX(max_session_percent) AS 'Maximum % of sessions'
,CAST(AVG(max_worker_percent) AS decimal(4,2)) AS 'Average % of workers'
,MAX(max_worker_percent) AS 'Maximum % of workers'
FROM sys.resource_stats AS rs1
inner join sys.databases dbs on rs1.database_name = dbs.name
INNER JOIN sys.database_service_objectives sysso on sysso.database_id = dbs.database_id
inner join
(SELECT t.name
,round(avg(CAST(t.Count_Connections AS FLOAT)), 2) AS AVG_Connections_per_Hour
FROM (
SELECT name
--,database_name
--,success_count
--,start_time
,CONVERT(DATE, start_time) AS Dating
,DATEPART(HOUR, start_time) AS Houring
,sum(CASE
WHEN name = database_name
THEN success_count
ELSE 0
END) AS Count_Connections
FROM sys.database_connection_stats
CROSS JOIN sys.databases
WHERE start_time > @StartDate
AND database_id != 1
GROUP BY name
,CONVERT(DATE, start_time)
,DATEPART(HOUR, start_time)
) AS t
GROUP BY t.name) avcon on avcon.name = rs1.database_name
WHERE start_time > @StartDate
GROUP BY database_name, sysso.edition, sysso.service_objective,avcon.AVG_Connections_per_Hour
ORDER BY database_name , sysso.edition, sysso.service_objective
The query returns statistics for all the databases on the server.
AVG_Connections_per_Hour: based on the last 30 days of connection data.
All AVG and MAX statistics: based on the last 14 days of data (the retention period of sys.resource_stats).
Pick one provider and host the workloads there; fan out to other cloud providers on demand, only when needed.
This solution requires minimal data transfer.
You could perhaps denormalise the data you need and store it in ClickHouse. It's a fast column-oriented database for online analytical processing (OLAP), meaning you can run queries that compute discounts on the fly, and it is very fast: millions to billions of rows per second. You query it using its SQL dialect, which is intuitive, powerful, and can be extended with Python/C++.
You can try the approach you used before, but with ClickHouse, opting for a distributed deployment:
"Doing it in one SQL database where each tenant is separated by tenant id"
The deployment of a ClickHouse cluster can be done on Kubernetes using the Altinity operator; it's free and you only pay for the resources. Paid and managed options are also available.
ClickHouse also supports lots of integrations, which means you can stream data into it from Kafka or RabbitMQ, or load it from local files / S3 files.
I've been running a test ClickHouse cluster with 150M rows and 70 columns, mostly Int64 fields. A query with 140 filters on all the columns took about 7-8 seconds under light load and 30-50 s under heavy load. The cluster had 5 members (2 shards, 3 replicas).
Note: I'm not affiliated with ClickHouse, I just like the database. You could also try to find another OLAP alternative on Azure.
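As a rough illustration of how an application could query such a denormalised table from Node.js, here is a minimal sketch assuming the official @clickhouse/client package; the connection URL, table, and column names are placeholders, and older client versions use a host option instead of url:

const { createClient } = require('@clickhouse/client');

// Placeholder connection details; older client versions use `host` instead of `url`.
const client = createClient({
  url: 'http://localhost:8123',
  username: 'default',
  password: '',
});

// Example: per-country totals with the discount applied on the fly,
// over a hypothetical denormalised table `tenant_users`.
async function discountedTotals(tenantId) {
  const result = await client.query({
    query: `
      SELECT country, sum(amount * (1 - discount)) AS total
      FROM tenant_users
      WHERE tenant_id = ${Number(tenantId)} -- coerced to a number to keep the example injection-safe
      GROUP BY country
    `,
    format: 'JSONEachRow',
  });
  return result.json();
}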

Running a repetitive task in Node.js for each row in a postgres table on a different interval for each row

What would be a good approach to running a repetitive task for each row in a large Postgres table, on a different per-row interval, in Node.js?
To give you some more context, here's a quick description of the application:
It's a chat based customer support app.
It consists of teams, which can be either a client team or a support team. Teams have users, which can be either client users or support users.
Client users send messages to a support team and wait for one of that team's users to answer their question.
When there's an unanswered client message waiting for a response, every agent for the receiving support team will receive a notification every n seconds (n being set on a per-team basis by the team admin).
So this task needs to infinitely loop through the rows in the teams table and send notifications if:
The team has messages waiting to be answered.
N seconds have passed since the last notification was sent (N being the number of seconds set by the team admin).
There might be a better approach to this condition altogether.
So my questions are:
What is an efficient way to infinitely loop through a postgres table with no upper limit on the number of rows?
Should I load 1 row at a time? Several at a time?
What would be a good way to do this in Node?
I'm using Knex. Does Knex provide a mechanism for lazy loading a table and iterating through the rows?
A) Running a repetitive task in Node can be done with the built-in JS function setInterval.
// run intervalFnc() every 5 seconds
const timerId = setInterval(intervalFnc, 5000);
function intervalFnc() { console.log("Hello"); }
// to quit running it:
clearInterval(timerId);
Then your interval function can do the actual work. An alternative would be to use cron (Linux), or some other OS process scheduler, to trigger the function. I would use setInterval if you want to run it every minute, and a cron job if you want to run it every hour (anything in between becomes more debatable).
B) An efficient way...
B-1) Retrieving a block of records from a DB will be more efficient than one at a time. Knex has .offset and .limit clauses to choose a group of records to retrieve. A sample from the knex doc:
knex.select('*').from('users').limit(10).offset(30)
B-2) Indexed database access is important for performance if your tables are very large. I would recommend including a status-flag field in your table to note which records are 'in-process', and also a "next-review-timestamp" field, with both fields indexed. Retrieve the records that have status_flag = 'in-process' AND next_review_timestamp <= now(). Sample:
knex('users').where('status_flag', 'in-process').whereRaw('next_review_timestamp <= now()')
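Putting A) and B) together, here is a minimal sketch of how the polling loop might look; the table and column names (teams, status_flag, next_review_timestamp, notify_interval_secs) and the notification stub are assumptions for illustration:

const knex = require('knex')({ client: 'pg', connection: process.env.DATABASE_URL });

// Placeholder for the app-specific notification logic.
async function sendNotifications(team) {
  console.log('notifying agents of team', team.id);
}

async function notifyDueTeams() {
  // An index on (status_flag, next_review_timestamp) keeps this lookup cheap.
  const dueTeams = await knex('teams')
    .where('status_flag', 'in-process')
    .whereRaw('next_review_timestamp <= now()')
    .limit(100); // process a block of records rather than one row at a time

  for (const team of dueTeams) {
    await sendNotifications(team);
    // Push the next review time forward by the team's own interval.
    await knex('teams')
      .where('id', team.id)
      .update({
        next_review_timestamp: knex.raw("now() + (notify_interval_secs || ' seconds')::interval"),
      });
  }
}

// Poll every 5 seconds; each run only touches the rows that are actually due.
const timerId = setInterval(() => notifyDueTeams().catch(console.error), 5000);

Because each team's interval is baked into its next_review_timestamp, one short polling interval serves every team regardless of its n.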
Hope this helps!

Every 'nth' document from a collection - MongoDB + NodeJS

I am looking for a method to return data at different resolutions that is stored in MongoDB. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.
I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.
To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.
If I want to look at a slightly longer trend, say 3 hours, I could retrieve the last 2160 entries (3 hours), but that is more time to pull from the server, and more time and memory to plot. When looking at the larger trends, the small movements are noise, so I would be better off retrieving the same number of documents (720) but only every 3rd one, still giving me 3 hours of results with the same resources used, for a minor sacrifice in detail.
This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).
My current code collects every single document (n = 1):
db.collection(collection).find().sort({$natural:-1}).limit(limit)
I could then loop through the returned array and remove every document when:
index % n != 0
This at least saves the client from dealing with all the data; however, it seems extremely inefficient, and I would rather the database handle this part.
Does anyone know a method to accomplish this?
Apparently, there is no built-in solution in Mongo to solve your problem.
The way forward would be to archive your data smartly, in fragments.
So you can store your data in a collection which will house no more than a week's or a month's worth of data. A new month/week means storing your data in a different collection. That way you won't be doing a full collection scan and won't be collecting every single document, as you mentioned in your problem. Your application code will decide which collection to query.
If I were in your shoes, I would use a different tool, as Mongo is more of a general-purpose database. Time-series data (storing something every 5 seconds) can be handled pretty well by a database like Cassandra, which handles frequent writes with ease, just as in your case.
Alternate fragmentation (update):
Always write your current data to collection "week0", and in the background run a weekly scheduler that moves the data from "week0" into history collections "week1", "week2" and so on. The fragmentation logic depends on your requirements; a rough sketch of the rotation is shown below.
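A minimal sketch of that weekly rotation with the Node.js driver, assuming the data lives in a database called sensors and archiving simply renames the hot collection into a dated history collection (all names are placeholders):

const { MongoClient } = require('mongodb');

async function rotateWeeklyCollection() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('sensors');

  // e.g. "week_2024_05_13" for the week being archived
  const archiveName = 'week_' + new Date().toISOString().slice(0, 10).replace(/-/g, '_');

  await db.renameCollection('week0', archiveName); // move the hot data aside
  await db.createCollection('week0');              // fresh collection for new writes

  await client.close();
}

Run this from whatever scheduler you already use (cron, a setInterval-based job, etc.); the read path then picks "week0" or one of the history collections based on the requested time range.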
I think the $bucketAuto stage might help you with it.
You can do something like,
db.collection.aggregate([
  {
    $bucketAuto: {
      groupBy: "$_id", // here you'll put the variable you need, in your example 'temperature'
      buckets: 5 // this is the number of documents you want to return, so if you want a sample of 500 documents, you can put 500 here
    }
  }
])
Each document in the result for the above query would be something like this,
{
  "_id": {
    "max": 3,
    "min": 1
  },
  "count": 2
}
If you had grouped by temperature, then each document would have the minimum and maximum temperature found in that bucket.
You might have another problem. Docs state not to rely on natural ordering:
This ordering is an internal implementation feature, and you should
not rely on any particular structure within it.
You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.
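A minimal sketch of that approach with the Node.js driver, assuming each document stores an indexed epoch field in seconds and that readings land on 5-second boundaries (database, collection, and field names are placeholders):

const { MongoClient } = require('mongodb');

// Return up to `limit` documents, keeping only every nth reading.
async function sampleEveryNth(n, limit = 720) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const coll = client.db('sensors').collection('temperature');

  const stepSeconds = n * 5; // readings arrive every 5 seconds
  const docs = await coll
    .find({ epoch: { $mod: [stepSeconds, 0] } }) // keep docs where epoch % step === 0
    .sort({ epoch: -1 })
    .limit(limit)
    .toArray();

  await client.close();
  return docs;
}

The filtering, sorting, and limiting all happen server-side, so the client only ever receives the downsampled documents.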

Bigquery search for limited records at each time

I am running below query in bigquery:
SELECT othermovie1.title,
(SUM(mymovie.rating*othermovie.rating) - (SUM(mymovie.rating) * SUM(othermovie.rating)) / COUNT(*)) /
(SQRT((SUM(POW(mymovie.rating,2)) - POW(SUM(mymovie.rating),2) / COUNT(*)) * (SUM(POW(othermovie.rating,2)) -
POW(SUM(othermovie.rating),2) / COUNT(*) ))) AS num_density
FROM [CFDB.CF] AS mymovie JOIN
[CFDB.CF] AS othermovie ON
mymovie.userId = othermovie.userId JOIN
[CFDB.CF] AS othermovie1 ON
othermovie.title = othermovie1.title JOIN
[CFDB.CF] AS mymovie1 ON
mymovie1.userId = othermovie1.userId
WHERE othermovie1.title != mymovie.title
AND mymovie.title = 'Truth About Cats & Dogs, The (1996)'
GROUP BY othermovie1.title
But BigQuery has been processing for quite a while now. Is there a way to paginate the query and request something like FETCH NEXT 10 ROWS ONLY WHERE othermovie1.title != mymovie.title AND num_density > 0 each time?
You won't find in BigQuery any concept of paginating results in order to increase processing performance.
Still, there are probably several things you can do to understand why it's taking so long and how to improve it.
For starters, I'd recommend using Standard SQL instead of Legacy SQL, as the former is capable of several plan optimizations that might help in your case, such as pushing filters down so they are applied before the joins rather than after, which is what happens for you with Legacy SQL.
You can also make use of the query plan explanation to diagnose more effectively where the bottleneck in your query is; lastly, make sure to follow the concepts discussed in the best-practices documentation, which might help you adapt your query to be more performant.
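For reference, a minimal sketch of running a Standard SQL query from Node with the @google-cloud/bigquery client; the client already defaults to Standard SQL, useLegacySql: false just makes it explicit, and the query below is a simplified stand-in for yours rather than the full density calculation:

const { BigQuery } = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function run() {
  // Standard SQL uses backtick-quoted table names instead of Legacy SQL's [brackets].
  const query = `
    SELECT othermovie.title, COUNT(*) AS shared_raters
    FROM \`CFDB.CF\` AS mymovie
    JOIN \`CFDB.CF\` AS othermovie
      ON mymovie.userId = othermovie.userId
    WHERE mymovie.title = 'Truth About Cats & Dogs, The (1996)'
      AND othermovie.title != mymovie.title
    GROUP BY othermovie.title
  `;

  const [rows] = await bigquery.query({ query, useLegacySql: false });
  console.log('returned rows:', rows.length);
}

run().catch(console.error);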

Cassandra - IN or TOKEN query for querying an entire partition?

I want to query a complete partition of my table.
My compound partition key consists of (id, date, hour_of_timestamp). id and date are strings, hour_of_timestamp is an integer.
I needed to add the hour_of_timestamp field to my partition key because of hotspots while ingesting the data.
Now I'm wondering what's the most efficient way to query a complete partition of my data?
According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
Is it better to use the TOKEN function and query the partition with two tokens? Such as SELECT * from mytable WHERE TOKEN(id,date,hour_of_timestamp) >= TOKEN('x','10-10-2016',0) AND TOKEN(id,date,hour_of_timestamp) <= TOKEN('x','10-10-2016',23);
So my question is:
Should I use the IN or TOKEN query for querying an entire partition of my data? Or should I use 24 queries (one for each value of hour_of_timestamp) and let the driver do the rest?
I am using Cassandra 3.0.8 and the latest Datastax Java Driver to connect to a 6 node cluster.
You say:
Now I'm wondering what's the most efficient way to query a complete
partition of my data? According to this blog, using SELECT * from
mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp
IN (0,1,...23); is causing a lot of overhead on the coordinator node.
but actually you'd query 24 partitions.
What you probably meant is that you had a design where a single partition was what now consists of 24 partitions, because you added the hour to avoid a hotspot during data ingestion. Noting that in both models (the old one with hotspots and this new one) data is still ordered by timestamp, you have three choices:
Run 1 query at a time.
Run 2 queries the first time, and then one at a time, to "prefetch" results.
Run 24 queries in parallel.
CASE 1
If you process data sequentially, the first choice is to run the query for hour 0, process the data and, when finished, run the query for hour 1, and so on... This is a straightforward implementation, and I don't think it deserves more than this.
CASE 2
If your queries take more time than your data processing, you could "prefetch" some data. So, the first time you could run 2 queries in parallel to get the data for both hour 0 and hour 1, and start processing the data for hour 0. In the meantime, the data for hour 1 arrives, so when you finish processing the data for hour 0 you can prefetch the data for hour 2 and start processing the data for hour 1. And so on... In this way you can speed up the data processing. Of course, depending on your timings (data processing and query times), you should optimize the number of "prefetch" queries.
Also note that the Java Driver does pagination for you automatically, and depending on the size of the retrieved partition, you may want to disable that feature to avoid blocking the data processing, or may want to fetch more data preemptively with something like this:
ResultSet rs = session.execute("your query");
for (Row row : rs) {
    if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
        rs.fetchMoreResults(); // this is asynchronous
    // Process the row ...
}
where you could tune that rs.getAvailableWithoutFetching() == 100 to better suit your prefetch requirements.
You may also want to prefetch more than one partition the first time, so that you ensure your processing won't wait on any data fetching part.
CASE 3
If you need to process data from different partitions together, e.g. you need the data for both hour 3 and hour 6, then you could try to group the queries by "dependency" (e.g. query both hour 3 and hour 6 in parallel).
If you need all of them, then you should run 24 queries in parallel and join the results at application level (you already know why you should avoid IN across multiple partitions); a sketch is shown below. Remember that your data is already ordered, so your application-level effort would be very small.
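A minimal sketch of that fan-out, shown with the Node.js cassandra-driver purely for illustration (the same pattern applies with the Java driver via executeAsync); contact points, keyspace, and names are placeholders:

const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'myks',
});

// Query the 24 hour-partitions of one logical day in parallel and
// concatenate the rows at application level.
async function readFullDay(id, date) {
  const query = 'SELECT * FROM mytable WHERE id = ? AND date = ? AND hour_of_timestamp = ?';
  const hours = Array.from({ length: 24 }, (_, h) => h);

  const results = await Promise.all(
    hours.map((h) => client.execute(query, [id, date, h], { prepare: true }))
  );

  // Rows inside each partition are already ordered by timestamp,
  // so concatenating in hour order keeps the overall ordering.
  return results.flatMap((r) => r.rows);
}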
