When I run a query through Robo 3T, the mongo shell, or pymongo from the command line, it takes 50 milliseconds to return the results. Running the same query using pymongo in AWS Lambda takes 15-16 seconds. Not all queries are this slow, just address queries in my case (queries for a name take under 1 second). I'm using Python 3.6, pymongo 3.7, and MongoDB 3.6.
I don't believe it's a cold-start issue, because I run two queries in a row, this being the second, and the first query still takes less than a second. I've also tried running it multiple times in a row and get the same results every time. The function uses only 57 MB of the 128 MB allotted, so I don't believe it's a resource issue, and increasing the memory allocation (which also increases CPU) didn't change the speed at all.
MongoDB query
db.getCollection('CA').find({'$and': [{'OWNER_ZIP': {'$regex': '^95120'}},{'OWNER_STREET_1': '123 MAIN STREET'}]})
pymongo query
cursor = list(db_unclaimed.find({'$and': [{'OWNER_ZIP': {'$regex': '^95120'}},{'OWNER_STREET_1': '123 MAIN STREET'}]}).skip(0).limit(50))
Python function I'm using
def searchAddress(zipcode, address, page_size, page_num):
    print('Searching by address...')
    print(address)
    skips = page_size * (int(page_num) - 1)
    cursor = list(db_unclaimed.find({'$and': [{'OWNER_ZIP': {'$regex': zipcode}}, {'OWNER_STREET_1': address}]}).skip(skips).limit(page_size))
    for document in cursor:
        print(document)
    return cursor
I would expect the query to take close to the same amount of time in Lambda as it does with the other three methods, even if it might be a bit slower. Does anyone have any ideas as to what could be causing this?
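For reference, here is a minimal sketch of two things worth ruling out, assuming a standard pymongo setup (the URI environment variable and database/collection names are placeholders): creating the MongoClient once outside the handler so warm invocations reuse the connection, and anchoring the zip-code regex with ^, as the shell query above does, so MongoDB can use an index prefix.
import os
from pymongo import MongoClient

client = MongoClient(os.environ['MONGO_URI'])    # created once per Lambda container, reused on warm starts
db_unclaimed = client['unclaimed']['CA']         # placeholder database/collection names

def search_address(zipcode, address, page_size, page_num):
    skips = page_size * (int(page_num) - 1)
    query = {
        'OWNER_ZIP': {'$regex': '^' + zipcode},  # anchored, so an index on OWNER_ZIP can be used
        'OWNER_STREET_1': address,
    }
    return list(db_unclaimed.find(query).skip(skips).limit(page_size))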
I'm working with DocumentDB in AWS, and I've been having trouble when I try to read from the same collection simultaneously from different aggregation queries.
The issue is not that I cannot read from the database, but rather that the queries take a long time to complete. It doesn't matter whether I trigger the queries simultaneously or one after the other.
I'm using a Lambda function with Node.js to run my code, and Mongoose to handle the connection with the database.
Here's a sample code that I put together to illustrate my problem:
query1() {
    return Collection.aggregate([...])
}
query2() {
    return Collection.aggregate([...])
}
query3() {
    return Collection.aggregate([...])
}
It takes the same time if I run them using Promise.all:
Promise.all([ query1(), query2(), query3() ])
as when I run them waiting for the previous one to finish:
query1().then(result1 => query2().then(result2 => query3()))
If instead I run each query in a different Lambda execution, each individual query takes significantly less time to finish (between 1 and 2 seconds).
So if they were really running in parallel, the whole execution should finish in roughly the time of the slowest query (about 2 seconds), not the 7 seconds it takes now.
So my guess is that the DocumentDB instance is running the queries in sequence no matter how I send them. The collection contains around 19,000 documents with a total size of almost 25 MB.
When I check the metrics of the instance, CPUUtilization is barely over 8% and the available RAM only drops by 20 MB, so I don't think the delay has to do with the size of the instance.
Do you know why DocumentDB is behaving like this? Is there a configuration that I can change to run the aggregations in parallel?
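One thing worth checking, shown here as a small Python/pymongo sketch rather than Mongoose (the connection string and pipeline are placeholders): if the driver's connection pool is capped at a single connection, the three aggregations have to be serialized no matter how they are dispatched, which would produce exactly the sequential timings described above.
import time
from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient

# maxPoolSize must be at least the number of queries you want in flight at once
client = MongoClient('mongodb://docdb-host:27017', maxPoolSize=10)
coll = client['mydb']['Collection']

pipeline = [{'$match': {}}]    # placeholder; the real pipelines are elided in the question

def run(p):
    return list(coll.aggregate(p))

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run, [pipeline, pipeline, pipeline]))
print('elapsed: {:.2f}s'.format(time.perf_counter() - start))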
Hi everyone, I am struggling with this problem: I am trying to insert a Python list of approximately 100k rows into an Azure SQL database, using this code:
list_of_rows = [...]
self.azure_cursor.fast_executemany = True
self.azure_cursor.executemany('''INSERT INTO table_name VALUES(?,?,?,?,?,?,?,?,?,?,?)''',list_of_rows)
The problem is that it takes ages (about 43 seconds for 100k rows, which amounts to less than 30 MB of data) and I don't know how to improve it, because I am already using fast_executemany and, as seen from the Azure dashboard, I don't reach the maximum DTU granted by my subscription plan (S1, 20 DTU).
I've also tried to see whether an index would help, but there was no advantage (and when running the query in SSMS, no index is recommended).
Finally, the problem is not the network connection, since I have a 1 Gb/s download/upload link.
Does anyone know how to improve this performance?
UPDATE
I tried the code below, as suggested in the page linked by Shiraz Bhaiji:
First, I create a pandas DataFrame from my list of rows, then set up the engine, register the event listener, and finally call df.to_sql:
import urllib.parse
import pandas as pd
import sqlalchemy
from sqlalchemy import event

self.df = pd.DataFrame(data=list_of_rows, columns=['A', 'B', 'C'])
params = 'DRIVER=driver;SERVER=server;PORT=1433;DATABASE=database;UID=username;PWD=password'
db_params = urllib.parse.quote_plus(params)
self.engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect={}".format(db_params))

@event.listens_for(self.engine, "before_cursor_execute")
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True

self.df.to_sql('table_name', self.engine, index=False, if_exists="append", schema="dbo")
The code above takes the same time as plain executemany. I tried removing the PK (there are no other indexes on the table) and it made the insert faster; it now takes 22 seconds, but that is still too much for 100k rows totalling less than 30 MB of data.
If you use the to_sql function instead, you can speed up the inserts.
See: https://medium.com/analytics-vidhya/speed-up-bulk-inserts-to-sql-db-using-pandas-and-python-61707ae41990
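For completeness, here is a sketch of the to_sql route using the fast_executemany flag that the SQLAlchemy mssql+pyodbc dialect accepts directly (SQLAlchemy 1.3+), which avoids the event-listener wiring. The connection-string values and chunksize are placeholders, and list_of_rows is the list from the question.
import urllib.parse
import pandas as pd
import sqlalchemy

params = urllib.parse.quote_plus(
    'DRIVER={ODBC Driver 17 for SQL Server};SERVER=server;PORT=1433;'
    'DATABASE=database;UID=username;PWD=password')
engine = sqlalchemy.create_engine(
    'mssql+pyodbc:///?odbc_connect={}'.format(params),
    fast_executemany=True)

df = pd.DataFrame(data=list_of_rows, columns=['A', 'B', 'C'])   # list_of_rows as defined above
# chunksize keeps each batch within the driver's parameter limits
df.to_sql('table_name', engine, index=False, if_exists='append',
          schema='dbo', chunksize=10000)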
I am using insertMany to insert about 300 documents at a time from AWS Lambda into MongoDB Atlas. We are using Node.js and Mongoose. The server maxes out at about 3% CPU, so I don't believe the problem described below is hardware related. There are no performance suggestions from Atlas either.
The issue we are having is that inserting 300 documents takes between 27 and 30+ seconds, and with AWS Lambda anything over 30 seconds causes a timeout.
I feel like we must be doing something incorrectly, as 30-plus seconds seems like a very long time. Each document is only 7 KB.
The collection is indexed on a timestamp and a string (like a URL), with a unique index on the timestamp-and-string combination.
There are 13,500 documents in the collection.
Any ideas on how to speed this up?
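One option worth trying is an unordered bulk insert. Sketched here with pymongo for illustration (the question uses Mongoose, whose insertMany accepts an equivalent { ordered: false } option); the URI and documents below are made up. With ordered=False the server does not have to apply the batch strictly one document at a time and can continue past individual failures, which can speed up large inserts.
from pymongo import MongoClient

client = MongoClient('mongodb+srv://user:password@cluster.example.net/test')   # placeholder URI
coll = client['mydb']['events']                                                # placeholder names

docs = [{'timestamp': i, 'url': 'https://example.com/{}'.format(i)} for i in range(300)]
result = coll.insert_many(docs, ordered=False)   # unordered batch insert
print(len(result.inserted_ids), 'inserted')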
I am having issues doing counts on a single collection with up to 1 million records. I am running my test on a box with 32 cores and 244 GB of RAM, so hardware should not be an issue.
I have indexes set up for all of the queries I use to perform counts. I have set Node's max_old_space_size to 15 GB.
The process I am following is basically: loop through a huge array, create 1,000 promises, perform 12 counts within each promise, wait for all of the promises to resolve, and then continue with the next batch of one thousand.
As part of my test I am doing inserts, updates, and reads as well. All of those show great performance, up to 20,000/sec each. However, when I get to the portion of my code doing the count() calls, I can see via mongostat that only 20-30 commands are being executed per second. I have not determined at this point whether my Node code is only sending that many or whether Mongo is queuing them up.
Meanwhile, in my Node.js code, all 1,000 promises are started and waiting to resolve. I know this is a lot of info, so please let me know what more granular details I should provide to get more insight into why the count performance is so slow.
So basically, for a batch of 1,000 records, doing let's say 12 counts each (12,000 counts in total), it takes close to 10 minutes against a collection of 1 million records.
MongoDB Native Client v2.2.1
Node v4.2.1
What I'd like to add is that I have tried changing the maxPoolSize on the driver from 100 to 1000 with no change in performance. I've also tried changing the queries I perform from yield/generator/promise to callbacks wrapped in a promise, which has helped somewhat.
The strange thing is that when my program starts, even if I use just the default number of connections (which I see as 7 when running mongostat), I can get around 2,500 count() queries per second. However, after a few seconds this goes back down to about 300-400. This leads me to believe that Mongo can handle that many all the time, but my code is not able to send that many requests, even though I set maxPoolSize to 300 and start 10,000 simultaneous promises resolving in parallel. So what gives? Any ideas?
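To make the batching pattern above concrete, here is a rough pymongo sketch (the question uses the Node driver; the collection and field names here are invented). The point it illustrates is bounding the number of in-flight counts to roughly the pool size, rather than starting 1,000 x 12 promises at once and letting most of them queue inside the driver.
from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017', maxPoolSize=100)
coll = client['test']['records']                      # placeholder collection

def counts_for(record):
    # 12 different counts per record in the original; one placeholder count shown here
    return coll.count_documents({'owner_id': record['owner_id']})

records = list(coll.find({}, {'owner_id': 1}).limit(1000))
with ThreadPoolExecutor(max_workers=50) as pool:      # bounded concurrency
    results = list(pool.map(counts_for, records))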
I am currently using MongoDB cursor's toArray() function to convert the database results into an array:
run = true;
count = 0;
var start = process.hrtime();
db.collection.find({}, {limit: 2000}).toArray(function(err, docs){
    var diff = process.hrtime(start);
    run = false;
    socket.emit('result', {
        result: docs,
        time: diff[0] * 1000 + diff[1] / 1000000,
        ticks: count
    });
    if(err) console.log(err);
});
This operation takes about 7 ms on my computer. If I remove the .toArray() call, the operation takes about 0.15 ms. Of course that won't work, because I need to forward the data, but I'm wondering what the function is doing, since it takes so long. Each document in the database simply consists of 4 numbers.
In the end I'm hoping to run this on a much smaller processor, like a Raspberry Pi, and here the operation where it fetches 500 documents from the database and converts it to an array takes about 230ms. That seems like a lot to me. Or am I just expecting too much?
Are there any alternative ways to get data from the database without using toArray()?
Another thing that I noticed is that the entire Node application slows remarkably down while getting the database results. I created a simple interval function that should increment the count value every 1 ms:
setInterval(function(){
    if(run) count++;
}, 1);
I would then expect the count value to be almost the same as the time, but for a time of 16 ms on my computer the count value was 3 or 4. On the Raspberry Pi the count value was never incremented. What is taking so much CPU usage? The monitor told me that my computer was using 27% CPU and the Raspberry Pi was using 92% CPU and 11% RAM, when asked to run the database query repeatedly.
I know that was a lot of questions. Any help or explanations are much appreciated. I'm still new to Node and MongoDB.
db.collection.find() returns a cursor, not results, and opening a cursor is pretty fast.
Once you start reading the cursor (using .toArray() or by traversing it using .each() or .next()), the actual documents are being transferred from the database to your client. That operation is taking up most of the time.
I doubt that using .each()/.next() (instead of .toArray(), which under the hood uses one of those two) will improve the performance much, but you could always try (who knows). Since .toArray() reads everything into memory, it may be worthwhile, although it doesn't sound like your data set is that large.
I really think that MongoDB on a Raspberry Pi (especially a Model 1) is not going to work well. If you don't depend too much on MongoDB's query features, you should consider using an alternative data store. Perhaps even an in-memory store (500 documents times 4 numbers doesn't sound like it requires much RAM).
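To illustrate the difference (in Python here; the Node driver's .each()/.next() follow the same idea), iterating the cursor streams documents in driver-sized batches instead of materializing all of them at once the way .toArray()/list() does. The connection details below are placeholders.
from pymongo import MongoClient

coll = MongoClient('mongodb://localhost:27017')['test']['samples']   # placeholder names

# Materialize everything at once (the equivalent of .toArray()):
docs = list(coll.find({}).limit(2000))

# Stream instead: documents arrive in batches as the cursor is consumed.
for doc in coll.find({}).limit(2000).batch_size(500):
    print(doc)   # stand-in for whatever per-document work you need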