Mongodb count performance issues with Node js - node.js

I am having issues with doing counts on a single table with up to 1million records. I have a 32 core 244gb ram box that I am running my test on so hardware should not be an issue.
I have indexes set up on all of my queries that I am using to perform counts. I have enabled node max_old_space_size to 15gb.
The process I am following is basically looping through a huge array, creating 1000 promises, within each promise I am performing 12 counts, waiting for the promises to all resolve, and then continuing with the next one thousand batch.
As part of my test, I am doing inserts, updates, and reads as well. All of those, are showing great performance up to 20000/sec on each. However, when I get to the portion of my code doing the counts(), I can see via mongostat that there are only 20-30 commands being executed per second. I have not determined at this point, if my node code is only sending that many, or if mongo is queuing it up.
Meanwhile, in my node.js code, all 1000 promises are started and waiting to evaluate. I know this is a lot of info, so please let me know what more granular details I should provide to get some more insight into why the count performance is so slow.
So basically, for a batch of 1000 records, doing lets say 12 counts each, for a total of 12,000 counts, it is taking close to 10 minutes, on a table of 1million records.
MongoDB Native Client v2.2.1
Node v4.2.1
What I'd like to add is that I have tried changing the maxPoolSize on the driver from 100-1000 with no change in performance. I've tried changing my queries that I perform from yield/generator/promise to callbacks wrapped in promise, which has helped somewhat.
The strange thing is, when my program starts, even if i use just the default number of connections which I see as 7 when running mongostat, I can get around 2500 count() queries per second throughout. However, after a few seconds this goes back down to about 300-400. This leads me to believe that mongo can handle that many all the time, but my code is not able to send that many requests, even though I set maxPoolSize to 300 and start 10000 simultaneous promises resolving in parallel. So what gives, any ideas from anyone ?

Related

Invisible Delays between Spark Jobs

There are 4 major actions(jdbc write) with respect to application and few counts which in total takes around 4-5 minutes for completion.
But the total uptime of Application is around 12-13minutes.
I see there are certain jobs by name run at ThreadPoolExecutor.java : 1149. Just before this job being reflected on Spark UI, the invisible long delays occur.
I want to know what are the possible causes for these delays.
My application is reading 8-10 CSV files, 5-6 VIEWs from table. Number of joins are around 59, few groupBy with agg(sum) are there and 3 unions are there.
I am not able to reproduce the issue in DEV/UAT env since the data is not that much.
It's in the production where I get the app. executed run by my Manager.
If anyone has come across such delays in their job, please share your experience what could be the potential cause for this, currently I am working around the unions, i.e. caching the associated dataframes and calling count so as to get the benefit of cache in the coming union(yet to test, if union is the reason for delays)
Similarly, I tried the break the long chain of transformations with cache and count in between to break the long lineage.
The time reduced from initial 18 minutes to 12 minutes but the issue with invisible delays still persist.
Thanks in advance
I assume you don't have a CPU or IO heavy code between your spark jobs.
So it really sparks, 99% it is QueryPlaning delay.
You can use
spark.listenerManager.register(QueryExecutionListener) to check different metrics of query planing performance.

AWS DocumentDB Performance Issue with Concurrency of Aggregations

I'm working with DocumentDB in AWS, and I've been having troubles when I try to read from the same collection simultaneously from different aggregation queries.
The issue is not that I cannot read from the database, but rather that it takes a lot of time to complete the queries. It doesn't matter if I trigger the queries simultaneously or one after the other.
I'm using a Lambda Function with NodeJS to run my code. And I'm using mongoose to handle the connection with the database.
Here's a sample code that I put together to illustrate my problem:
query1() {
return Collection.aggregate([...])
}
query2() {
return Collection.aggregate([...])
}
query3() {
return Collection.aggregate([...])
}
It takes the same time if I run it using Promise.all
Promise.all([ query1(), query2(), query3() ])
Than if I run it waiting for the previous one to finish
query1().then(result1 => query2().then(result3 => query3()))
While if I run each query in different Lambda Executions, it takes significantly less time for each individual query to finish (Between 1 and 2 seconds).
So if they were running in parallel the execution should be finished with the time of the query that takes the most time (2 seconds), and not take 7 seconds, as it does now.
So my guessing is that the instance of DocumentDB is running the queries in sequence no matter how I send them. In the collection there are around 19,000 documents with a total size of almost 25Mb.
When I check the metrics of the instance, the CPUUtilization is barely over 8% and the RAM available only drops by 20Mb. So I don't think the problem of the delay has to do with the size of the instance.
Do you know why DocumentDB is behaving like this? Is there a configuration that I can change to run the aggregations in parallel?

CPU / DTUs getting maxed out on Azure SQL Database, but top queries less than 1% and database only a few MB

I just launched an Azure SQL Database, and the DTU and CPU usage is behaving strangely. The database is only receiving about 30 requests per minute, and the CPU/DTU will be extremely low for hours, and then jump up to 100% and stay there (with no increase in the number of requests that triggers this). When I click to view the top queries, none of them are above 1% cpu usage. I started out on a 5 DTU plan, and yesterday upgraded to 20 DTUs and the same behavior is occurring. Any idea what else might cause the DTU/CPU to get maxed out? See images below:
https://i.imgur.com/LdbYTPw.png
https://i.imgur.com/jlus3FM.png
Thanks in advance for any advice!
Joe
EDIT: I'm getting closer, I found these repeated entries in the error log. (about 8 - 10 per SECOND)
"The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request."
The thing is, the App Service that queries the database is only doing simple selects, updates, and inserts... none of which uses any complex WHERE IN statement. Furthermore, every query is wrapped in a try/catch block, and I'm never seeing an exception like this.
Where could these large queries be originating from?
You are only seeing the CPU component of the DTU graph, what about the "Data IO" and the "Log IO" components? Look at the top 5 queries on the 3 sections, and let me know if you find a query that start with "SELECT Statman ...". If you see that, then the Auto Update Statistics process is creating those DTU spikes.
I would suggest to install the sp_whoisactive script so that you can see what's going on more easily:
http://whoisactive.com/

Why does DynamoDB performance decrease with parallel reads?

With AWS-XRay tracing enabled on my lambda function i've found that as the number of parallel requests increases to dynamodb the performance of the read's decreases.
Here is an example of the XRay Traces:
Above you can see that the first set of GetItem requests execute in under 300ms. This set only has 6 async read requests running in parallel. The next set of read requests all execute in on average atleast 1.5 seconds - with 57 async read requests running in parallel.
Thoughts on what this could be due to:
this may be due to a "cold start" feature as dynamodb adds capacity to deal with parallel reads? (This dyanmodb instance is pay-by-request, not provisioned)
Additionally, i recognize that this may not be related parallel requests at all, but it may be a good place to start asking questions. Wondering if anyone knows what could be causing such a dramatic performance decrease.

Spark and 100000k of sequential HTTP calls: driver vs workers

I have to do 100000 sequential HTTP requests with Spark. I have to store responses into S3. I said sequential, because each request returns around 50KB of data, and I have to keep 1 second in order to not exceed API rate limits.
Where to make HTTP calls: from Spark Job's code (executed on driver/master node) or from dataset transformation (executed on worker node)?
Workarrounds
Make HTTP request from my Spark job (on Driver/Master node), create dataset of each HTTP response (each contains 5000 json items) and save each dataset to S3 with help of spark. You do not need to keep dataset after you saved it
Create dataset from all 100000 URLs (move all further computations to workers), make HTTP requests inside map or mapPartition, save single dataset to S3.
The first option
It's simpler and it represents a nature of my compurations - they're sequential, because of 1 second delay. But:
Is it bad to make 100_000 HTTP calls from Driver/Master node?
*Is it more efficient to create/save one 100_000 * 5_000 dataset than creating/saving 100_000 small datasets of size 5_000*
Each time I creating dataset from HTTP response - I'll move response to worker and then save it to S3, right? Double shuffling than...
Second option
Actually it won't benefit from parallel processing, since you have to keep interval of 1 second because request. The only bonus is to moving computations (even if they aren't too hard) from driver. But:
Is it worth of moving computations to workers?
Is it a good idea to make API call inside transformation?
Saving a file <32MB (or whatever fs.s3a.block.size is) to S3 is ~2xGET, 1xLIST and a PUT; you get billed a bit by AWS for each of these calls, plus storage costs.
For larger files, a POST to initiate multipart upload after that first block, one POST per 32 MB (of 32MB, obviously) and a final POST of a JSON file to complete. So: slightly more efficient
Where small S3 sizes matter is in the bills from AWS and followup spark queries: anything you use in spark, pyspark, SQL etc. many small files are slower: Theres a high cost in listing files in S3, and every task pushed out to a spark worker has some setup/commit/complete costs.
regarding doing HTTP API calls inside a worker, well, you can do fun things there. If the result isn't replicable then task failures & retries can give bad answers, but for a GET it should be OK. What is hard is throttling the work; I'll leave you to come up with a strategy there.
Here's an example of uploading files to S3 or other object store in workers; first the RDD of the copy src/dest operations is built up, then they are pushed out to workers. The result of the worker code includes upload duration length info, if someone ever wanted to try and aggregate the stats (though there you'd probably need timestamp for some time series view)
Given you have to serialize the work to one request/second, 100K requests is going to take over a day. if each request takes <1 second, you may as well run it on a single machine. What's important is to save the work incrementally so that if your job fails partway through you can restart from the last checkpoint. I'd personally focus on that problem: how could do this operation such that every 15-20 minutes of work was saved, and on a restart you can carry on from there.
Spark does not handle recovery of a failed job, only task failures. Lose the driver and you get to restart your last query. Break things up.
Something which comes to mind could be
* first RDD takes list of queries and some summary info about any existing checkpointed data, calculates the next 15 minutes of work,
* building up a list of GET calls to delegate to 1+ worker. Either 1 URL/row, or have multiple URLs in a single row
* run that job, save the results
* test recovery works with a smaller window and killing things.
* once happy: do the full run
Maybe also: recognise & react to any throttle events coming off the far end by
1. Sleeping in the worker
1. returning a count of throttle events in the results, so that the driver can initially collect aggregate stats and maybe later tune sleep window for subsequent tasks.

Resources