BigQuery: search for a limited number of records at a time - pagination

I am running the below query in BigQuery:
SELECT othermovie1.title,
(SUM(mymovie.rating*othermovie.rating) - (SUM(mymovie.rating) * SUM(othermovie.rating)) / COUNT(*)) /
(SQRT((SUM(POW(mymovie.rating,2)) - POW(SUM(mymovie.rating),2) / COUNT(*)) * (SUM(POW(othermovie.rating,2)) -
POW(SUM(othermovie.rating),2) / COUNT(*) ))) AS num_density
FROM [CFDB.CF] AS mymovie JOIN
[CFDB.CF] AS othermovie ON
mymovie.userId = othermovie.userId JOIN
[CFDB.CF] AS othermovie1 ON
othermovie.title = othermovie1.title JOIN
[CFDB.CF] AS mymovie1 ON
mymovie1.userId = othermovie1.userId
WHERE othermovie1.title != mymovie.title
AND mymovie.title = 'Truth About Cats & Dogs, The (1996)'
GROUP BY othermovie1.title
But BigQuery has been processing it for a while now. Is there a way to paginate the query and request something like FETCH NEXT 10 ROWS ONLY WHERE othermovie1.title != mymovie.title AND num_density > 0 each time?

You won't find in BigQuery any concept of paginating results in order to increase processing performance.
Still, there are probably several things you can do to understand why it's taking so long and how to improve it.
For starters, I'd recommend using Standard SQL instead of Legacy, as the former is capable of several plan optimizations that might help in your case, such as pushing filters down so they are applied before joins rather than after, which is what happens if you are using Legacy.
You can also make use of the query plan explanation to diagnose more effectively where the bottleneck in your query is; lastly, make sure to follow the concepts discussed in the best practices documentation, which might help you adapt your query to perform better.
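For illustration, here is a minimal sketch of the same query in Standard SQL (assuming the dataset is still CFDB in the default project); the table references swap the Legacy [brackets] for backticks and the rest of the logic is unchanged:
-- Hedged sketch: Standard SQL version of the Legacy query above,
-- assuming the CFDB.CF table from the original post.
SELECT
  othermovie1.title,
  (SUM(mymovie.rating * othermovie.rating)
     - SUM(mymovie.rating) * SUM(othermovie.rating) / COUNT(*))
  / SQRT(
      (SUM(POW(mymovie.rating, 2)) - POW(SUM(mymovie.rating), 2) / COUNT(*))
    * (SUM(POW(othermovie.rating, 2)) - POW(SUM(othermovie.rating), 2) / COUNT(*))
    ) AS num_density
FROM `CFDB.CF` AS mymovie
JOIN `CFDB.CF` AS othermovie
  ON mymovie.userId = othermovie.userId
JOIN `CFDB.CF` AS othermovie1
  ON othermovie.title = othermovie1.title
JOIN `CFDB.CF` AS mymovie1
  ON mymovie1.userId = othermovie1.userId
WHERE othermovie1.title != mymovie.title
  AND mymovie.title = 'Truth About Cats & Dogs, The (1996)'
GROUP BY othermovie1.title
The work done is still over the full table, so any gain comes from the optimizer pushing the filters down, not from pagination.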

Related

Database design issue for Multi-tenant application

We have an application that does a lot of data-heavy work on the server for a multi-tenant workspace.
Here are the things that it does:
It loads data from files in different file formats.
It executes idempotence rules based on the defined logic.
It executes processing logic such as adding country-based discounts for users, calculating tax amounts, etc. These are specific to each tenant.
It generates refreshed data for bulk edit.
Now after this processing is done, the tenant will go to the interface, do some bulk edit overrides on users, and finally download them in some format.
We have tried a lot of solutions before, like:
Doing it in one SQL database where each tenant is separated by a tenant id.
Doing it in Azure Blobs.
Loading it from file system files.
But none of them performed well. So what is presently designed is:
We have a central database which keeps track of all the customers' databases.
We have a number of database elastic pools in Azure.
When a new tenant comes in, we create a database, do all the processing for the users, and notify the user to do the manual job.
When they have downloaded all the data, we keep the database for future use.
Now, as you know, elastic pools have a limit on the number of databases, which led us to create multiple elastic pools, eventually increasing the Azure cost immensely, while 90% of the databases are not in use at any given point in time. We already have more than 10 elastic pools, each consisting of 500 databases.
Proposed Changes:
As we are gradually incurring more and more cost in our Azure account, we are thinking about how to reduce it.
What I was proposing is:
We create one elastic pool, which has a 500-database limit, with enough DTUs.
In this pool, we will create blank databases.
When a customer comes in, the data is loaded into one of the blank databases.
It does all the calculations and notifies the tenant for the manual job.
When the manual job is done, we keep the database for the next 7 days.
After 7 days, we back up the database to Azure Blob and do the cleanup job on the database.
Finally, if the same customer comes in again, we restore the backup onto a blank database and continue. (This step might take 15-20 minutes to set up, but that is fine for us; if we can reduce it, that would be even better.)
What do you think is best suited for this kind of problem?
Our objective is to reduce Azure cost while also providing the best solution to our customers. Please suggest any architecture that you think would be best suited in this scenario.
Each customer can have millions of records... we even see customers with 50-100 GB databases... and also with different workloads for each tenant.
Here is where the problem starts:
"[...] When they have downloaded all the data we keep the Database for future."
This is very wrong because it leads to:
"[...] keeping on increasing the Azure Cost immensely, while 90% of the databases are not in use at a given point of time. We already have more than 10 elastic pools each consisting of 500 databases."
This is not only a problem of costs but also a problem with security compliance.
How long should you store that data?
Does that data comply with the relevant country's policies?
Here are my 2 solutions:
It goes without saying that if you don't need that data, you just have to delete those databases. You will lower your costs immediately.
If you cannot delete them, then since they are not in use, switch them from the Elastic Pool to serverless.
EDIT:
Azure SQL databases in the serverless tier get expensive only when you use them.
If they are unused, they will cost almost nothing. But "unused" means no connections to them. If you have some internal tool that wakes them up every hour, they will never auto-pause, so you will pay a lot.
HOW TO TEST SERVERLESS:
Take a database that you know is unused and put it in the serverless tier for 1 week; you will see how the cost of that database drops in Cost Management. And of course, take it out of the Elastic Pool.
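As a minimal T-SQL sketch of that move (the database name and the specific serverless service objective below are assumptions; pick whatever matches your hardware generation and vCore needs):
-- Hedged sketch: take a (hypothetical) tenant database out of its elastic pool
-- and move it to a serverless compute tier; the objective name is an assumption.
ALTER DATABASE [TenantDb_001]
MODIFY (SERVICE_OBJECTIVE = 'GP_S_Gen5_1');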
You can run this query on the master database:
DECLARE @StartDate date = DATEADD(day, -30, GETDATE()) -- last 30 days (sys.resource_stats itself only retains about 14 days)
SELECT
@@SERVERNAME AS ServerName
,database_name AS DatabaseName
,sysso.edition
,sysso.service_objective
,(SELECT TOP 1 dtu_limit FROM sys.resource_stats AS rs3 WHERE rs3.database_name = rs1.database_name ORDER BY rs3.start_time DESC) AS DTU
/*,(SELECT TOP 1 storage_in_megabytes FROM sys.resource_stats AS rs2 WHERE rs2.database_name = rs1.database_name ORDER BY rs2.start_time DESC) AS StorageMB */
/*,(SELECT TOP 1 allocated_storage_in_megabytes FROM sys.resource_stats AS rs4 WHERE rs4.database_name = rs1.database_name ORDER BY rs4.start_time DESC) AS Allocated_StorageMB*/
,avcon.AVG_Connections_per_Hour
,CAST(MAX(storage_in_megabytes) / 1024 AS DECIMAL(10, 2)) StorageGB
,CAST(MAX(allocated_storage_in_megabytes) / 1024 AS DECIMAL(10, 2)) Allocated_StorageGB
,MIN(end_time) AS StartTime
,MAX(end_time) AS EndTime
,CAST(AVG(avg_cpu_percent) AS decimal(4,2)) AS Avg_CPU
,MAX(avg_cpu_percent) AS Max_CPU
,(COUNT(database_name) - SUM(CASE WHEN avg_cpu_percent >= 40 THEN 1 ELSE 0 END) * 1.0) / COUNT(database_name) * 100 AS [CPU Fit %]
,CAST(AVG(avg_data_io_percent) AS decimal(4,2)) AS Avg_IO
,MAX(avg_data_io_percent) AS Max_IO
,(COUNT(database_name) - SUM(CASE WHEN avg_data_io_percent >= 40 THEN 1 ELSE 0 END) * 1.0) / COUNT(database_name) * 100 AS [Data IO Fit %]
,CAST(AVG(avg_log_write_percent) AS decimal(4,2)) AS Avg_LogWrite
,MAX(avg_log_write_percent) AS Max_LogWrite
,(COUNT(database_name) - SUM(CASE WHEN avg_log_write_percent >= 40 THEN 1 ELSE 0 END) * 1.0) / COUNT(database_name) * 100 AS [Log Write Fit %]
,CAST(AVG(max_session_percent) AS decimal(4,2)) AS 'Average % of sessions'
,MAX(max_session_percent) AS 'Maximum % of sessions'
,CAST(AVG(max_worker_percent) AS decimal(4,2)) AS 'Average % of workers'
,MAX(max_worker_percent) AS 'Maximum % of workers'
FROM sys.resource_stats AS rs1
inner join sys.databases dbs on rs1.database_name = dbs.name
INNER JOIN sys.database_service_objectives sysso on sysso.database_id = dbs.database_id
inner join
(SELECT t.name
,round(avg(CAST(t.Count_Connections AS FLOAT)), 2) AS AVG_Connections_per_Hour
FROM (
SELECT name
--,database_name
--,success_count
--,start_time
,CONVERT(DATE, start_time) AS Dating
,DATEPART(HOUR, start_time) AS Houring
,sum(CASE
WHEN name = database_name
THEN success_count
ELSE 0
END) AS Count_Connections
FROM sys.database_connection_stats
CROSS JOIN sys.databases
WHERE start_time > @StartDate
AND database_id != 1
GROUP BY name
,CONVERT(DATE, start_time)
,DATEPART(HOUR, start_time)
) AS t
GROUP BY t.name) avcon on avcon.name = rs1.database_name
WHERE start_time > @StartDate
GROUP BY database_name, sysso.edition, sysso.service_objective,avcon.AVG_Connections_per_Hour
ORDER BY database_name , sysso.edition, sysso.service_objective
The query will return statistics for all the databases on the server.
AVG_Connections_per_Hour: based on data from the last 30 days.
All AVG and MAX statistics: based on data from the last 14 days (the retention period of sys.resource_stats).
Pick a provider and host the workloads there. On demand, provide fan-out across cloud providers when needed.
This solution requires minimal transfer.
You could perhaps denormalise the data you need and store it in ClickHouse. It's a fast column-oriented database for online analytical processing, which means you can run queries that compute discounts on the fly, and it's very fast: millions to billions of rows per second. You query it using its own SQL dialect, which is intuitive and powerful and can be extended with Python/C++.
You can try doing it like you did before, but with ClickHouse and a distributed deployment:
"Doing it in one SQL database where each tenant is separated by a tenant id"
A ClickHouse cluster can be deployed on Kubernetes using the Altinity operator; it's free and you only have to pay for the resources. Paid or managed options are also available.
ClickHouse also supports lots of integrations, which means you can stream data into it from Kafka or RabbitMQ, or load it from local files / S3 files.
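As a rough sketch of what such a denormalised, tenant-keyed table could look like (all table and column names below are made up for illustration), a MergeTree table ordered by tenant id keeps each tenant's rows together on disk:
-- Hedged sketch: a denormalised, tenant-keyed ClickHouse table; names are hypothetical.
CREATE TABLE user_charges
(
    tenant_id  UInt64,
    user_id    UInt64,
    country    LowCardinality(String),
    discount   Decimal(18, 2),
    tax_amount Decimal(18, 2),
    created_at DateTime
)
ENGINE = MergeTree
ORDER BY (tenant_id, user_id, created_at);
Queries that always filter on tenant_id then only read that tenant's slice of the primary key order.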
I've been running a test ClickHouse cluster with 150M rows and 70 columns, mostly Int64 fields. A query with 140 filters across all the columns took about 7-8 seconds under light load and 30-50 s under heavy load. The cluster had 5 members (2 shards, 3 replicas).
Note: I'm not affiliated with ClickHouse, I just like the database. You could try to find another OLAP alternative on Azure.

Azure Cosmos SQL queries very slow, possible to optimize performance?

I'm querying a collection that holds about 250k items and I need to pull back most of them. The items are partitioned by weeknumber. I've tried slicing the query up into many smaller queries, but it seems the fastest I can get data back is ~500 items/sec. I only want items with a date within the past 3 months. The querying is done client-side from my Vue web app. Any links or suggestions are welcome.
SELECT
c.weeknumber,
c.item_txt,
c.item,
c.trandate,
c.trandate_unix,
c.trandate_iso,
StringToNumber(c.TOTAL_QTY) AS total_qty,
StringToNumber(c.MA_TOTAL_QTY) AS ma_total_qty,
StringToNumber(c.IN_TOTAL_QTY) AS in_total_qty,
StringToNumber(c.CA_TOTAL_QTY) AS ca_total_qty
FROM c
WHERE c.trandate_unix >= ${threeMoAgo_start / 1000} AND c.trandate_unix <= ${threeMoAgo_end / 1000}

Node JS Postgres EXISTS and SELECT 1 Queries

I'm trying to determine whether a user already exists in my database. I know one way to do this is by running:
SELECT * FROM users WHERE email = $1
and checking whether the number of rows is greater than 0 or not. However, I know that a more efficient way to run this check is by using the "EXISTS" keyword, because it doesn't need to run through all the rows in the table. However, running
EXISTS (SELECT 1 FROM users WHERE email = $1)
yields
error: syntax error at or near "EXISTS"
I've also tried simply running
SELECT 1 FROM users WHERE email = $1
as this should have the same efficiency optimizations but it doesn't output any row data.
I'm using the "pg" driver. Any help is greatly appreciated. Thank you in advance!
EXISTS(...) is an operator, not a command:
SELECT EXISTS (SELECT 1 FROM users WHERE email = $1) AS it_does_exist;
EXISTS(...) yields a boolean; its argument is some kind of expression: (most often) a table-expression.
EXISTS(...) only checks its argument for non-emptiness; EXISTS(SELECT * FROM users WHERE email = $1) would give the same result.
You won't gain much performance by omitting columns from the select list unless your table has many columns or some of them contain a lot of data. A little benchmark on localhost using Python, averaged over 1000 query executions...
57 µs SELECT * FROM foo WHERE login=%s
48 µs SELECT EXISTS(SELECT * FROM foo WHERE login=%s)
40 µs SELECT 1 FROM foo WHERE login=%s
26 µs EXECUTE myplan(%s) -- using a prepared statement
And the same over a gigabit network:
499 µs SELECT * FROM foo WHERE login=%s
268 µs SELECT EXISTS(SELECT * FROM foo WHERE login=%s)
272 µs SELECT 1 FROM foo WHERE login=%s
278 µs EXECUTE myplan(%s)
Most of that is network latency, which is variable, so it's difficult to tell which query executes fastest on the database itself when benchmarking such small queries over a network. If the queries took longer, that would be another story.
It also shows that ORMs and libraries that use prepared statements and make several network round trips to prepare and then execute queries should be avoided like the plague. Postgres has a specific protocol call, PQexecParams, which sends bind+execute in a single message to avoid this and gives you the advantages of prepared queries (no SQL injection) with no speed penalty. The client library should use it. If you do a lot of small queries, this can be significant.
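For reference, the benchmark's prepared statement was not shown; a minimal sketch of what it could look like in plain SQL (the plan name "myplan" and the foo/login identifiers come from the benchmark rows above, the exact body is an assumption):
-- Hedged sketch: a prepared statement along the lines of the "myplan" used above
-- (exact definition assumed; foo and login are taken from the benchmark queries).
PREPARE myplan(text) AS
    SELECT EXISTS (SELECT 1 FROM foo WHERE login = $1);

EXECUTE myplan('someone@example.com');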

Unable to delete large number of rows from Spanner

I have a 3-node Spanner instance and a single table that contains around 4 billion rows. The DDL looks like this:
CREATE TABLE predictions (
name STRING(MAX),
...,
model_version INT64,
) PRIMARY KEY (name, model_version)
I'd like to set up a job to periodically remove some old rows from this table using the Python Spanner client. The query I'd like to run is:
DELETE FROM predictions WHERE model_version <> ?
According to the docs, it sounds like I would need to execute this as a Partitioned DML statement. I am using the Python Spanner client as follows, but am experiencing timeouts (504 Deadline Exceeded errors) due to the large number of rows in my table.
# this always throws a "504 Deadline Exceeded" error
database.execute_partitioned_dml(
    "DELETE FROM predictions WHERE model_version <> @model_version",
    params={"model_version": 104},
    param_types={"model_version": Type(code=INT64)},
)
My first intuition was to see if there was some sort of timeout I could increase, but I don't see any timeout parameters in the source :/
I did notice there was a run_in_transaction method in the Spanner lib that contains a timeout parameter, so I decided to deviate from the partitioned DML approach to see if using this method worked. Here's what I ran:
def delete_old_rows(transaction, model_version):
    delete_dml = "DELETE FROM predictions WHERE model_version <> {}".format(model_version)
    dml_statements = [
        delete_dml,
    ]
    status, row_counts = transaction.batch_update(dml_statements)

database.run_in_transaction(delete_old_rows,
    model_version=104,
    timeout_secs=3600,
)
What's weird about this is the timeout_secs parameter appears to be ignored, because I still get a 504 Deadline Exceeded error within a minute or 2 of executing the above code, despite a timeout of one hour.
Anyways, I'm not too sure what to try next, or whether or not I'm missing something obvious that would allow me to run a delete query in a timely fashion on this huge Spanner table. The model_version column has pretty low cardinality (generally 2-3 unique model_version values in the entire table), so I'm not sure if that would factor into any recommendations. But if someone could offer some advice or suggestions, that would be awesome :) Thanks in advance
The reason that setting timeout_secs didn't help is that the argument is unfortunately not the timeout for the transaction. It's the retry timeout for the transaction, i.e. it sets the deadline after which the transaction will stop being retried.
We will update the docs for run_in_transaction to explain this better.
The root cause was that the total timeout for the streaming RPC calls was set too low in the client libraries: it was set to 120 s for streaming APIs (e.g. ExecuteStreamingSQL, which is used by partitioned DML calls).
This has been fixed in the client library source code, changing it to a 60-minute timeout (which is the maximum), and will be part of the next client library release.
As a workaround, in Java, you can configure the timeouts as part of the SpannerOptions when you connect to your database. (I do not know how to set custom timeouts in Python, sorry.)
final RetrySettings retrySettings =
    RetrySettings.newBuilder()
        .setInitialRpcTimeout(Duration.ofMinutes(60L))
        .setMaxRpcTimeout(Duration.ofMinutes(60L))
        .setMaxAttempts(1)
        .setTotalTimeout(Duration.ofMinutes(60L))
        .build();
SpannerOptions.Builder builder =
    SpannerOptions.newBuilder()
        .setProjectId("[PROJECT]");
builder
    .getSpannerStubSettingsBuilder()
    .applyToAllUnaryMethods(
        new ApiFunction<UnaryCallSettings.Builder<?, ?>, Void>() {
          @Override
          public Void apply(UnaryCallSettings.Builder<?, ?> input) {
            input.setRetrySettings(retrySettings);
            return null;
          }
        });
builder
    .getSpannerStubSettingsBuilder()
    .executeStreamingSqlSettings()
    .setRetrySettings(retrySettings);
builder
    .getSpannerStubSettingsBuilder()
    .streamingReadSettings()
    .setRetrySettings(retrySettings);
Spanner spanner = builder.build().getService();
The first suggestion is to try gcloud instead.
https://cloud.google.com/spanner/docs/modify-gcloud#modifying_data_using_dml
Another suggestion is to pass a range of name as well, so as to limit the number of rows scanned. For example, you could add something like STARTS_WITH(name, 'a') to the WHERE clause to make sure each transaction touches a small number of rows; but first, you will need to know the domain of the name column values.
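As a rough sketch of that range restriction (the 'a' prefix is just an example, not taken from your data; you would iterate over whatever prefixes actually occur in name), each partitioned DML run then only has to scan a slice of the primary key:
-- Hedged sketch: limit each partitioned DML run to a slice of the primary key
-- ('a' is an example prefix, not taken from the original data).
DELETE FROM predictions
WHERE STARTS_WITH(name, 'a')
  AND model_version <> @version;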
The last suggestion is to try to avoid using '<>' if possible, as it is generally pretty expensive to evaluate.

Performance issues in Sybase 12.5 to Sybase 15 migration

We are in the process of migrating our DB to Sybase 15. The stored procedures which were working fine in Sybase 12.5 perform poorly in Sybase 15. However, when we add 'set merge_join off', Sybase 15 performs faster. Is there any way to use the Sybase 12.5 stored procs as they are in Sybase 15, or with minimal changes? Are there any alternatives apart from rewriting the whole stored proc?
I think this depends on how much time and energy you have to investigate Sybase 15 and use its new optimisers.
If this is a small app and you just want it working without reading up on some or all of the new optimisers, index statistics, datachange, and login triggers, then either use compatibility mode or, maybe better, restrict the optimiser to allrows_oltp, avoiding dss and mix (which would use hash joins and merge joins respectively).
If it's a big system and you have time, I think you should find out about the above, allow at least mix if not dss too, and make sure you:
have index statistics up to date (it is much more important to have stats on the 2nd and subsequent columns of indexes to optimise correctly for merge and hash joins);
understand DATACHANGE (to find tables that need stats updates);
understand login triggers (they can be very useful to configure some sessions/users down or up optimisation levels - see the Sypron website for Rob Verschoor's write-up);
make sure you've got access to sp_showplan (use a tool, get sa_role, or use Rob Verschoor's CIS technique to grant it).
The new optimisers are good, but I think it's true to say that they take time and energy to understand and make work. If you don't have time and energy and don't need the extra performance, just stick to allrows_oltp, or even compatibility mode (I don't have experience of the latter, but somehow it seems wrong to me.)
There is a compatibility mode in Sybase 15.
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc00967.1550/html/MigrationGuide/CBHJACAF.htm
I would say try to find the root cause of the issue. We too had an issue with one of our procs where the timing went up from 27 minutes to 40 minutes. Once diagnosed and fixed, the proc took just 6 minutes to complete (it used to take 27 minutes). The ASE 15 optimizer and query processing are much better than 12.5's.
If you don't have time, just set the compatibility mode at session level for this proc:
"set compatibility_mode on"
But do compare the results.
Additionally, if you have time, do try using DBCC traceflags 302 and 310 (with 3604 for output redirection) to understand why the optimizer is choosing a particular LAVA operator.
Excellent Article by Rob V
The Sybase 15 optimizer uses more join algorithms, i.e. merge join, hash join, nested loop join, etc.,
whereas in Sybase 12.5 the most used join algorithm is the nested loop join.
Apart from switching compatibility mode on (this will use the Sybase 12.5 optimizer and won't give you any of the benefits of the Sybase 15 optimizer), you can play with the various optimization goals.
In your case I suggest you set the optimization goal to "allrows_oltp" at server level, which will make your queries use only nested loop joins.
-- server-wide default:
sp_configure 'optimization goal', 0, 'allrows_oltp'
-- session-level setting (overrides server-wide setting):
set plan optgoal allrows_oltp
-- query-level setting (overrides server-wide and session-level settings):
select * from T1, T2 where T1.a = T2.b plan '(use optgoal allrows_oltp)'
allrows_oltp resembles the Sybase 12.5 behaviour very closely, and should be tried before any other optimization goal.
Note: after setting allrows_oltp, do proper testing to see whether any other query is affected by this.
More info about optimization goals can be found here
