Request rate is large for remove request - node.js

When I try to find some docs in DocumentDB, all is good:
collection.find({query})
But when I try to remove them, it fails:
[mongooseModel | collection].remove({same-query})
and I get
Request rate is large
The number of documents to remove is ~10,000. I have tested the queries in the Robomongo shell, which limits find results to 50 per page. My remove query also fails with mongoose. I can't understand this behavior: how can I run into the request limit when the remove query is a single request?
Update
A count with the same query raises the same error:
db.getCollection('taxonomies').count({query})

"Request rate is large" indicates that the application has exceeded the provisioned RU quota, and should retry the request after a small time interval.
Since you are using DocumentDB Node.js API, you could check out #Larry Maccherone's answer in Request rate is large on how to avoid this issue by handling retry behavior and logic in your application's error handling routines.
More on this:
Dealing with RequestRateTooLarge errors in Azure DocumentDB and testing performance
Request Units in Azure Cosmos DB
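As a rough illustration of that retry pattern in Node.js (a sketch only: the Taxonomy model name is just an example, and the error check assumes the Cosmos DB MongoDB API surfaces throttling as error code 16500 and/or a "Request rate is large" message; adjust it to whatever your driver actually returns):
// Retry a bulk delete when Cosmos DB reports "Request rate is large".
async function removeWithRetry(query, maxRetries = 10, baseDelayMs = 1000) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await Taxonomy.deleteMany(query); // modern replacement for .remove()
    } catch (err) {
      const throttled =
        err.code === 16500 || /request rate is large/i.test(err.message || "");
      if (!throttled || attempt === maxRetries) throw err;
      // Back off before retrying; the delay grows with every attempt.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * attempt));
    }
  }
}
// Usage: await removeWithRetry({ someField: "some-value" });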

This is an awful part of CosmosDB - I don't see how they intend to become a real player in the cloud database game with limitations like this. That being said - I came up with a hack to delete all records from a collection through the MongoDB API when you are bumping up against the "Rate Exceeded" error. See below:
var loop = true
while (loop) {
  try {
    db.grid.deleteMany({ "myField": "my-query" })
  }
  catch (err) {
    print(err)
    printjson(db.runCommand({ getLastRequestStatistics: 1 }))
    continue
  }
  loop = false
}
Just update the deleteMany query and run this script directly in your mongo shell and it should loop through and delete the records.

Related

AWS DocumentDB Performance Issue with Concurrency of Aggregations

I'm working with DocumentDB in AWS, and I've been having trouble when I try to read from the same collection simultaneously with different aggregation queries.
The issue is not that I cannot read from the database, but rather that the queries take a long time to complete. It doesn't matter if I trigger the queries simultaneously or one after the other.
I'm using a Lambda function with Node.js to run my code, and mongoose to handle the connection to the database.
Here's some sample code I put together to illustrate my problem:
function query1() {
  return Collection.aggregate([...])
}
function query2() {
  return Collection.aggregate([...])
}
function query3() {
  return Collection.aggregate([...])
}
It takes the same amount of time if I run them using Promise.all
Promise.all([ query1(), query2(), query3() ])
as if I run them waiting for the previous one to finish:
query1().then(result1 => query2().then(result2 => query3()))
Whereas if I run each query in a separate Lambda execution, each individual query takes significantly less time to finish (between 1 and 2 seconds).
So if they were really running in parallel, the whole execution should finish in roughly the time of the slowest query (2 seconds), and not take 7 seconds, as it does now.
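For reference, here is a simplified sketch of how the two runs can be timed (using the query functions above):
// Compare parallel vs. sequential wall-clock time for the three aggregations.
async function compare() {
  console.time("parallel");
  await Promise.all([query1(), query2(), query3()]);
  console.timeEnd("parallel");

  console.time("sequential");
  await query1();
  await query2();
  await query3();
  console.timeEnd("sequential");
}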
So my guess is that the DocumentDB instance is running the queries in sequence no matter how I send them. The collection holds around 19,000 documents with a total size of almost 25 MB.
When I check the metrics of the instance, the CPUUtilization is barely over 8% and the available RAM only drops by 20 MB. So I don't think the delay has to do with the size of the instance.
Do you know why DocumentDB is behaving like this? Is there a configuration that I can change to run the aggregations in parallel?

How to write more than 500 documents using batch in cloud firestore? [duplicate]

This question already has answers here:
How can I update more than 500 docs in Firestore using Batch?
(8 answers)
Closed 2 years ago.
I am trying to update more than 500 documents (1,000 to 2,000), and I only know how to use a batch for 500 documents. How can I update more than 500 documents using Cloud Firestore? Here is how I am updating 500 documents: I am trying to update ms_timestamp for 1,000 documents. Can anyone tell me how I can do it using a batched write?
const batch = db.batch();
const campSnapshot = await db
  .collection("camp")
  .where("status", "in", ["PENDING", "CONFIRMED"])
  .get();
await db.collection("camping").doc(getISO8601Date()).set({
  trigger: campSnapshot.docs.length,
});
campSnapshot.forEach((docs) => {
  const object = docs.data();
  object.ms_timestamp = momentTz().tz("Asia/Kolkata").valueOf();
  batch.set(
    db.collection("camp").doc(docs.get("campId")),
    object,
    { merge: true }
  );
});
await Promise.all([batch.commit()]);
Cloud Firestore imposes a limit of 500 documents per transaction or batched write, and you cannot change this, but a workaround can do the job.
I am not an expert in web dev, so I am sharing a suggestion based on my viewpoint as a mobile app developer.
Create a collection that stores a counter of how many documents are contained within a specific collection. Update the counter through Cloud Functions (or another approach) whenever an event (create, update, or delete) fires within that collection. The counter should be atomic and consistent, and you can leverage a Cloud Firestore transaction here.
Fetch the counter value before performing the batched writes. This tells you how many documents need to be updated.
Create an offset with an initial value of 0. The offset marks how far you have gotten through the data. A batched write can only cover up to 500 documents, so to batch-write documents 501-1000 the offset will be 500, and so on.
Call a method that performs the batched writes recursively using the offset, until it reaches the counter - 1.
I have not tested this since I don't have enough time right now, but I think it will work; a rough sketch of the chunking idea is shown below.
You can comment if you still don't understand, and I'll be glad to help further.
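Something along these lines may work (untested, and simplified: instead of a separate counter collection it just splits the already-fetched snapshot into chunks of at most 500 documents, reusing db, campSnapshot and momentTz from the question):
// Commit the update in batches of at most 500 documents (run inside an async function).
const BATCH_LIMIT = 500;
const docs = campSnapshot.docs;
for (let offset = 0; offset < docs.length; offset += BATCH_LIMIT) {
  const batch = db.batch();
  docs.slice(offset, offset + BATCH_LIMIT).forEach((doc) => {
    const object = doc.data();
    object.ms_timestamp = momentTz().tz("Asia/Kolkata").valueOf();
    batch.set(db.collection("camp").doc(doc.get("campId")), object, { merge: true });
  });
  await batch.commit(); // commit each chunk before building the next one
}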

Overcoming Azure Vision Read API Transactions-Per-Second (TPS) limit

I am working on a system where we call the Vision Read API to extract the contents of raster PDFs. The files are of different sizes, ranging from one page to several hundred pages.
The files are stored in Azure Blob Storage, and a function will push them to the Read API once all files have been uploaded to the blob. There could be hundreds of files.
Therefore, when the process starts, a large number of documents are expected to be sent for text extraction each second. But the Vision API has a limit of 10 transactions per second, including Read.
I am wondering what the best approach would be: some type of throttling, or a queue?
Is there any integration available (say, with a queue) from which the Read API can pull documents, and is there any push notification available to signal completion of the read operation? How can I prevent timeouts due to exceeding the 10 TPS limit?
Per my understanding, there are 2 key points you want to know:
How to overcome the 10 TPS limit when you have a lot of files to read.
The best approach to get the Read operation status and result.
Your question is a bit broad, so maybe I can provide you with some suggestions:
For Q1: generally, if you reach the TPS limit you will get an HTTP 429 response, and you must wait for some time before calling the API again, or the next call will also be refused. We usually retry the operation with something like an exponential back-off retry policy to handle the 429 error (a sketch follows this list):
1) Check the HTTP response code in your code.
2) When the HTTP response code is 429, retry the operation after N seconds, where you define N yourself, e.g. 10 seconds.
For example, the following is a 429 response. You could set your wait time to (26 + n) seconds (where you define n yourself, e.g. n = 5).
{
  "error": {
    "statusCode": 429,
    "message": "Rate limit is exceeded. Try again in 26 seconds."
  }
}
3) If the retry succeeds, continue with the next operation.
4) If the retry fails with 429 again, retry after N*N seconds (again, a value you define), which makes this an exponential back-off retry policy.
5) If that attempt also fails with 429, retry after N*N*N seconds, and so on.
6) Always wait for the current operation to succeed before moving on; the waiting time grows exponentially.
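A minimal sketch of such a back-off loop (callReadApi is only a placeholder for whatever function actually submits a document to the Read API, and the delays follow the N, N*N, N*N*N pattern described above):
// Retry with exponential back-off when the Read API answers with HTTP 429.
async function submitWithBackoff(document, maxAttempts = 4, baseDelaySec = 10) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await callReadApi(document); // placeholder for your actual Read API call
    } catch (err) {
      const status = err.statusCode || err.status; // adjust to how your client reports 429
      if (status !== 429 || attempt === maxAttempts - 1) throw err;
      // Wait N seconds, then N*N, then N*N*N, ... before the next attempt.
      const waitSec = Math.pow(baseDelaySec, attempt + 1);
      await new Promise((resolve) => setTimeout(resolve, waitSec * 1000));
    }
  }
}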
For Q2: as we know, we can use this API to get the Read operation status/result.
If you want a completion notification/result, you should poll each of your operations at intervals, e.g. send a check request every 10 seconds (see the sketch below). You can use an Azure Function or an Azure Automation runbook to create asynchronous tasks that check the Read operation status and, once it is done, handle the result according to your requirements.
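For example (a sketch only; getReadResult stands for whatever call you use to fetch the operation status, and the status values may need adjusting to what the API actually returns):
// Poll the Read operation every `intervalMs` milliseconds until it finishes.
async function waitForReadResult(operationId, intervalMs = 10000) {
  for (;;) {
    const result = await getReadResult(operationId); // placeholder status call
    if (result.status === "succeeded") return result;
    if (result.status === "failed") throw new Error("Read operation failed");
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}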
Hope it helps. If you have any further concerns, please feel free to let me know.

Unable to delete large number of rows from Spanner

I have a 3-node Spanner instance and a single table that contains around 4 billion rows. The DDL looks like this:
CREATE TABLE predictions (
  name STRING(MAX),
  ...,
  model_version INT64,
) PRIMARY KEY (name, model_version)
I'd like to set up a job to periodically remove some old rows from this table using the Python Spanner client. The query I'd like to run is:
DELETE FROM predictions WHERE model_version <> ?
According to the docs, it sounds like I would need to execute this as a Partitioned DML statement. I am using the Python Spanner client as follows, but am experiencing timeouts (504 Deadline Exceeded errors) due to the large number of rows in my table.
# this always throws a "504 Deadline Exceeded" error
database.execute_partitioned_dml(
    "DELETE FROM predictions WHERE model_version <> @model_version",
    params={"model_version": 104},
    param_types={"model_version": Type(code=INT64)},
)
My first intuition was to see if there was some sort of timeout I could increase, but I don't see any timeout parameters in the source :/
I did notice there was a run_in_transaction method in the Spanner lib that contains a timeout parameter, so I decided to deviate from the partitioned DML approach to see if using this method worked. Here's what I ran:
def delete_old_rows(transaction, model_version):
    delete_dml = "DELETE FROM predictions WHERE model_version <> {}".format(model_version)
    dml_statements = [
        delete_dml,
    ]
    status, row_counts = transaction.batch_update(dml_statements)

database.run_in_transaction(delete_old_rows,
    model_version=104,
    timeout_secs=3600,
)
What's weird about this is that the timeout_secs parameter appears to be ignored: I still get a 504 Deadline Exceeded error within a minute or two of executing the above code, despite a timeout of one hour.
Anyways, I'm not too sure what to try next, or whether or not I'm missing something obvious that would allow me to run a delete query in a timely fashion on this huge Spanner table. The model_version column has pretty low cardinality (generally 2-3 unique model_version values in the entire table), so I'm not sure if that would factor into any recommendations. But if someone could offer some advice or suggestions, that would be awesome :) Thanks in advance
The reason that setting timeout_secs didn't help is that the argument is unfortunately not the timeout for the transaction. It's the retry timeout for the transaction, i.e. it sets the deadline after which the transaction will stop being retried.
We will update the docs for run_in_transaction to explain this better.
The root cause was that the total timeout for the streaming RPC calls was set too low in the client libraries: 120s for streaming APIs (e.g. ExecuteStreamingSQL, which is used by partitioned DML calls).
This has been fixed in the client library source code, changing it to a 60-minute timeout (which is the maximum), and will be part of the next client library release.
As a workaround, in Java, you can configure the timeouts as part of the SpannerOptions when you connect your database. (I do not know how to set custom timeouts in Python, sorry)
final RetrySettings retrySettings =
    RetrySettings.newBuilder()
        .setInitialRpcTimeout(Duration.ofMinutes(60L))
        .setMaxRpcTimeout(Duration.ofMinutes(60L))
        .setMaxAttempts(1)
        .setTotalTimeout(Duration.ofMinutes(60L))
        .build();
SpannerOptions.Builder builder =
    SpannerOptions.newBuilder()
        .setProjectId("[PROJECT]");
builder
    .getSpannerStubSettingsBuilder()
    .applyToAllUnaryMethods(
        new ApiFunction<UnaryCallSettings.Builder<?, ?>, Void>() {
          @Override
          public Void apply(Builder<?, ?> input) {
            input.setRetrySettings(retrySettings);
            return null;
          }
        });
builder
    .getSpannerStubSettingsBuilder()
    .executeStreamingSqlSettings()
    .setRetrySettings(retrySettings);
builder
    .getSpannerStubSettingsBuilder()
    .streamingReadSettings()
    .setRetrySettings(retrySettings);
Spanner spanner = builder.build().getService();
The first suggestion is to try gcloud instead.
https://cloud.google.com/spanner/docs/modify-gcloud#modifying_data_using_dml
Another suggestion is to also pass a range on name so as to limit the number of rows scanned. For example, you could add something like STARTS_WITH(name, 'a') to the WHERE clause so that each transaction touches a small number of rows; first, though, you will need to know the domain of the name column values.
The last suggestion is to try to avoid using '<>' if possible, as it is generally pretty expensive to evaluate.

Bigquery API Intermittently returns http error 400 "Bad Request"

I am intermittently getting HTTP 400 errors for a particular query, yet when I examine the text of the query it appears to be correct, and if I copy the query to the BigQuery GUI and run it, it executes without any problems. The query is being constructed in node.js and submitted through the gcloud node.js API. The response I receive, which contains the text of the query, is too large to post here, but I do have the path name:
"pathname":"/bigquery/v2/projects/rising-ocean-426/queries/job_aSR9OCO4U_P51gYZ2xdRb145YEA"
The error seems to occur only if the live_seconds_viewed calculations are included in the query. If any part of the live_seconds_viewed calculation is included then the query fails intermittently.
The initial calculation of this field is:
CASE WHEN event = 'video_engagement'
AND range IS NULL
AND INTEGER(video_seconds_viewed) > 0
THEN 10
ELSE 0 END AS live_seconds_viewed,
Sometimes I can get the query to execute simply by changing the order of the expressions. But again, it is intermittent.
Any help with this would be greatly appreciated.
After long and arduous trial and error, I've determined that the query is failing simply because the query string is too long. When the query is executed from the GUI, the whitespace is apparently stripped, so the query is short enough to pass the size limit and executes.
When I manipulated the query to determine which part or parts were causing the problem, I would inadvertently reduce the size of the query below the critical limit, and the query would pass.
It would be great if the error response from BigQuery included some hint about what the problem is, rather than firing off a 400 Bad Request and calling it quits.
It would be even better if the BigQuery parser ignored whitespace when determining the size of the query. That way the behavior in the GUI would match the behavior when submitting the query through the API.
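As a client-side workaround, the query string can be compacted before it is submitted, for example (a sketch, assuming the query is built as a plain string and contains no string literals whose whitespace matters):
// Collapse runs of whitespace so formatting does not count against the length limit.
function compactQuery(sql) {
  return sql.replace(/\s+/g, " ").trim();
}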
