Count(*) on a SimpleDB table of millions of entries - amazon

How long should it take to get a response for the statement SELECT count(*) FROM db_name on a SimpleDB table of millions of entries? (Currently my table has >16M entries.)
Shouldn't there be some sort of "pagination" using the next_token parameter if the operation takes too long? (It's been hanging for minutes now!)

There's something wrong. No count query will take more than 5 seconds, because after 5 seconds it cuts off and gives you a next token.
If the count request takes more than five seconds, Amazon SimpleDB returns the number of items that it could count and a next token to return additional results. The client is responsible for accumulating the partial counts.
http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/CountingDataSelect.html
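So the client-side loop is just: keep calling select with the returned next token and sum the partial counts. A minimal sketch in Python (simpledb_select is a hypothetical helper standing in for whatever SimpleDB client call you use; it is assumed to return the parsed items plus the next token):

def count_domain(domain_name, simpledb_select):
    # simpledb_select is a hypothetical callable:
    #   (select_expression, next_token) -> (items, next_token)
    # wrapping your actual SimpleDB client.
    expression = "SELECT count(*) FROM `%s`" % domain_name
    total, next_token = 0, None
    while True:
        items, next_token = simpledb_select(expression, next_token)
        # Each partial response contains a single item whose "Count" attribute
        # holds the number of items counted in that <=5-second slice.
        for item in items:
            total += int(item["Count"])
        if not next_token:  # no more partial results to accumulate
            break
    return total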

SimpleDB responses are typically under 200ms, not counting data transfer speed (from Amazon's server to yours, which is less than 50ms if you're on EC2).
However, the total size of a SimpleDB response cannot exceed 2,500 rows or 1MB, whichever is smaller.
See "Limit" here
http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/index.html?UsingSelect.html

Related

AWS Maximum BadRequestException retries reached for query Using Data API to Query RDS Serverless Aurora

I have created a Lambda function that uses the awswrangler Data API to read data from an RDS Serverless Aurora PostgreSQL database with a query. The query contains a conditional that is a list of IDs. If the query has fewer than 1K IDs it works great; if over 1K I get this message:
Maximum BadRequestException retries reached for query
An example query is:
"""select * from serverlessDB where column_name in %s""" % ids_list
I adjusted the serverless RDS instance to force scaling and increased the concurrency on the Lambda function. Is there a way to fix this issue?
With PostgreSQL, 1000+ items in a WHERE IN clause should be just fine.
I believe you are running into a Data API limit.
There isn't a fixed upper limit on the number of parameter sets. However, the maximum size of the HTTP request submitted through the Data API is 4 MiB. If the request exceeds this limit, the Data API returns an error and doesn't process the request. This 4 MiB limit includes the size of the HTTP headers and the JSON notation in the request. Thus, the number of parameter sets that you can include depends on a combination of factors, such as the size of the SQL statement and the size of each parameter set.
The response size limit is 1 MiB. If the call returns more than 1 MiB of response data, the call is terminated.
The maximum number of requests per second is 1,000.
Either your request exceeds 4 MiB or the response of your query exceeds 1 MiB. I suggest you split your query into multiple smaller queries (see the sketch below).
Reference: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/data-api.html
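One way to do the split, sketched in Python with the awswrangler Data API helpers the question already uses (the ARNs, database name, and chunk size are placeholders you will need to adjust; numeric IDs are assumed):

import awswrangler as wr
import pandas as pd

# Placeholder ARNs / database name - substitute your own.
con = wr.data_api.rds.connect(
    resource_arn="arn:aws:rds:<region>:<account>:cluster:<cluster>",
    database="mydb",
    secret_arn="arn:aws:secretsmanager:<region>:<account>:secret:<secret>",
)

def query_in_chunks(ids, chunk_size=500):
    # Split the big IN-list into smaller queries so each request stays well
    # under the Data API's 4 MiB request / 1 MiB response limits.
    frames = []
    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i + chunk_size]
        placeholders = ", ".join(str(int(x)) for x in chunk)  # assumes numeric IDs
        sql = f"SELECT * FROM serverlessDB WHERE column_name IN ({placeholders})"
        frames.append(wr.data_api.rds.read_sql_query(sql=sql, con=con))
    return pd.concat(frames, ignore_index=True)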

Calculating limit in Cosmos DB [duplicate]

I have a Cosmos DB Gremlin API account set up with 400 RU/s. If I have to run a query that needs 800 RU, does it mean that this query takes 2 seconds to execute? If I increase the throughput to 1600 RU/s, does this query execute in half a second? I am not seeing any significant changes in query performance by playing around with the RUs.
As I explained in a different, but somewhat related answer here, Request Units are allocated on a per-second basis. In the event a given query will cost more than the number of Request Units available in that one-second window:
The query will be executed
You will now be in "debt" by the overage in Request Units
You will be throttled until your "debt" is paid off
Let's say you had 400 RU/sec, and you executed a query that cost 800 RU. It would complete, but then you'd be in debt for around 2 seconds (400 RU per second, times two seconds). At this point, you wouldn't be throttled anymore.
The speed at which a query executes does not depend on the number of RU allocated. Whether you had 1,000 RU/second or 100,000 RU/second, a query would run in the same amount of time (aside from any throttle time preventing the query from running initially). So, aside from throttling, your 800 RU query would run consistently, regardless of RU count.
A single query is charged a given amount of request units, so it's not quite accurate to say "query needs 800 RU/s". A 1KB doc read is 1 RU, and writing is more expensive starting around 10 RU each. Generally you should avoid any requests that would individually be more than say 50, and that is probably high. In my experience, I try to keep the individual charge for each operation as low as possible, usually under 20-30 for large list queries.
The upshot is that 400/s is more than enough to at least complete 1 query. It's when you have multiple attempts that combine for overage in the timespan that Cosmos tells you to wait some time before being allowed to succeed again. This is dynamic and based on a more or less black box formula. It's not necessarily a simple division of allowance by charge, and no individual request would be faster or slower based on the limit.
You can see if you're getting throttled by inspecting the response, or by checking the Azure dashboard metrics.
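For completeness, this is roughly how you inspect the charge and spot throttling from code. The sketch below uses the azure-cosmos Python SDK for the SQL API (the Gremlin client surfaces equivalent status attributes); account, database, and container names are placeholders, and the header names are as I recall them, so treat this as a sketch rather than the definitive pattern:

from azure.cosmos import CosmosClient, exceptions

# Placeholder endpoint, key and names - substitute your own.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("mycontainer")

try:
    items = list(container.query_items(
        query="SELECT * FROM c WHERE c.pk = @pk",
        parameters=[{"name": "@pk", "value": "some-partition"}],
        partition_key="some-partition",
    ))
    # The RU charge of the last call is reported in a response header.
    charge = container.client_connection.last_response_headers.get("x-ms-request-charge")
    print(f"Fetched {len(items)} items for {charge} RU")
except exceptions.CosmosHttpResponseError as err:
    if err.status_code == 429:
        # 429 means the request was throttled: the RU budget for this second
        # was exhausted, not that the query itself became slower.
        print("Throttled, back off and retry")
    else:
        raise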

How do MongoDB queries work in Node.js?

I have a very simple search query in Node.js/Express.js with MongoDB and Mongoose:
await Model.find({}).limit(10);
My question is how this works internally. Does it first get all the Model's data and then limit it to 10, or does it select only 10 items from the database in the first place? I mean the steps:
1. Find all data from Model and return it as a list (array) --> then take the first 10 items and remove the others from the list.
2. Find only the first 10 items and return them as a list (array).
The difference in performance is huge: with the first approach, if we have a million documents it will return a million items (taking perhaps 10-20 seconds) and only then limit them to 10, so we lose those 10-20 seconds, and with more users the server would go down. With the second approach, even with 100 million items it would always take about the same time.
The limit() method specifies the maximum number of documents a cursor will return. In the case of your example, the cursor will return only the first 10 items matching the query (option 2). You can find more information on how cursor.limit() works via the links below:
https://docs.mongodb.com/manual/reference/method/cursor.limit/
http://mongodb.github.io/node-mongodb-native/3.5/api/Cursor.html#limit
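For illustration, the same behavior is easy to verify in Python with pymongo (connection string, database, and collection names here are placeholders); the query plan shows the limit being applied by the server, not by the driver:

from pymongo import MongoClient

# Placeholder connection string / names.
client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["models"]

# The limit is sent to the server with the query, so at most 10 documents
# ever leave the database (option 2), regardless of collection size.
docs = list(coll.find({}).limit(10))

# The winning plan contains a LIMIT stage, confirming it is applied server-side.
plan = coll.find({}).limit(10).explain()
print(plan["queryPlanner"]["winningPlan"])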

Cassandra - IN or TOKEN query for querying an entire partition?

I want to query a complete partition of my table.
My compound partition key consists of (id, date, hour_of_timestamp). id and date are strings, hour_of_timestamp is an integer.
I needed to add the hour_of_timestamp field to my partition key because of hotspots while ingesting the data.
Now I'm wondering what's the most efficient way to query a complete partition of my data?
According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
Is it better to use the TOKEN function and query the partition with two tokens? Such as SELECT * from mytable WHERE TOKEN(id,date,hour_of_timestamp) >= TOKEN('x','10-10-2016',0) AND TOKEN(id,date,hour_of_timestamp) <= TOKEN('x','10-10-2016',23);
So my question is:
Should I use the IN or the TOKEN query for querying an entire partition of my data? Or should I use 24 queries (one for each value of hour_of_timestamp) and let the driver do the rest?
I am using Cassandra 3.0.8 and the latest Datastax Java Driver to connect to a 6 node cluster.
You say:
Now I'm wondering what's the most efficient way to query a complete partition of my data? According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
but actually you'd query 24 partitions.
What you probably meant is that you had a design where a single partition was what now consists of 24 partitions, because you added the hour to avoid a hotspot during data ingestion. Noting that in both models (the old one with hotspots and this new one) data is still ordered by timestamp, you have three choices:
Run one query at a time.
Run two queries the first time, and then one at a time, to "prefetch" results.
Run 24 queries in parallel.
CASE 1
If you process data sequentially, the first choice is to run the query for hour 0, process the data and, when finished, run the query for hour 1, and so on... This is a straightforward implementation, and I don't think it deserves more than this.
CASE 2
If your queries take more time than your data processing, you could "prefetch" some data. So, the first time you could run two queries in parallel to get the data for both hours 0 and 1, and start processing data for hour 0. In the meantime, data for hour 1 arrives, so when you finish processing data for hour 0 you can prefetch data for hour 2 and start processing data for hour 1. And so on... In this way you can speed up data processing. Of course, depending on your timings (data processing and query times), you should tune the number of "prefetch" queries.
Also note that the Java Driver does pagination for you automatically, and depending on the size of the retrieved partition, you may want to disable that feature to avoid blocking the data processing, or may want to fetch more data preemptively with something like this:
ResultSet rs = session.execute("your query");
for (Row row : rs) {
    if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
        rs.fetchMoreResults(); // this is asynchronous
    // Process the row ...
}
where you could tune that rs.getAvailableWithoutFetching() == 100 to better suit your prefetch requirements.
You may also want to prefetch more than one partition the first time, so that you ensure your processing won't wait on any data fetching part.
CASE 3
If you need to process data from different partitions together, e.g. you need data for both hours 3 and 6, then you could try to group data by "dependency" (e.g. query both hours 3 and 6 in parallel).
If you need all of them, then you should run 24 queries in parallel and join them at the application level (you already know why you should avoid IN across multiple partitions). Remember that your data is already ordered, so your application-level effort would be very small.
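For that last case, a sketch of the 24 parallel single-partition queries with the Python driver (the Java driver's executeAsync works the same way); the contact points and keyspace are placeholders, while the table and key columns are taken from the question:

from cassandra.cluster import Cluster

# Placeholder contact points / keyspace - adjust for your 6-node cluster.
cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect("mykeyspace")

prepared = session.prepare(
    "SELECT * FROM mytable WHERE id = ? AND date = ? AND hour_of_timestamp = ?"
)

# Fire one query per hour in parallel; each hits exactly one partition.
futures = [
    session.execute_async(prepared, ("x", "10-10-2016", hour))
    for hour in range(24)
]

# Join the results at the application level; rows within each partition
# are already ordered by the clustering key.
rows = [row for future in futures for row in future.result()]

The driver also ships cassandra.concurrent.execute_concurrent_with_args, which implements the same pattern with a bounded concurrency level.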

Strange data access time in Azure Table Storage while using .Take()

this is our situation:
We store user messages in Table Storage. The PartitionKey is the UserId and the RowKey is used as a message id.
When a user opens his message panel we want to just .Take(x) messages; we don't care about the sort order. But what we have noticed is that the time it takes to get the messages varies greatly with the number of messages we take.
We did some small tests:
We did 50 * .Take(X) and compared the differences:
So we did .Take(1) 50 times and .Take(100) 50 times etc.
To make an extra check we did the same test 5 times.
Here are the results:
As you can see there are some HUGE differences. The difference between 1 and 2 is very strange. The same for 199-200.
Does anybody have any clue how this is happening? The Table Storage is on a live server btw, not development storage.
Many thanks.
[Results chart: X axis = number of takes, Y axis = test number]
Update
The problem only seems to occur when I'm using a wireless network. When I'm using a cable, the times are normal.
Possibly the data is collected in batches of a certain size x. When you request x+1 rows, it would have to fetch two batches and then drop a certain number.
Try running your test with increments of 1 as the Take() parameter to confirm or dismiss this assumption.
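If you want to script that test rather than eyeball it, here is a rough sketch with the azure-data-tables Python SDK (the connection string, table name, and partition filter are placeholders):

import itertools
import time
from azure.data.tables import TableClient

# Placeholder connection string and table name.
table = TableClient.from_connection_string("<connection-string>", table_name="messages")

def time_take(n, runs=50):
    # Fetch the first n entities of one user's partition `runs` times and
    # return the average elapsed time in seconds.
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        entities = table.query_entities("PartitionKey eq 'user-123'", results_per_page=n)
        _ = list(itertools.islice(entities, n))  # client-side "Take(n)"
        total += time.perf_counter() - start
    return total / runs

# Increment the take size by 1, as suggested above, and look for jumps.
for n in range(1, 201):
    print(n, round(time_take(n), 4))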
