Optimise Pymongo query time - python-3.x

I am running a query on MongoDB and am looking for ways to optimise the time it takes.
My query is like collection.find({'nameId':989080880,'Date':{'$gte':startDate}})
What I did is as below:
pd.DataFrame(collection.find({'nameId':989080880,'Date':{'$gte':startDate}}))
This query took x ms.
Then I tried:
document = []
for doc in collection.find({'nameId': 989080880, 'Date': {'$gte': startDate}}):
    document.append(doc)
but it gave only about a 15% improvement over x ms.
I cannot index, as 'nameId' is a long integer and indexing would require much more RAM, etc.
Looking forward to some suggestions.

First, build the list with a list comprehension, something like:
document = [doc for doc in collection.find({'nameId': 989080880, 'Date': {'$gte': startDate}})]
Second, if you don't need every field in your documents, use a projection: an object listing the fields you want returned.
With projection = { name: 1, address: 1 } in your find query, each returned document contains only the name and address fields, plus _id.
For more information, see the MongoDB documentation on projections.
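For example, in PyMongo the projection can be passed as the second argument to find. A minimal sketch, reusing collection and startDate from the question; name and address are placeholders for whichever fields you actually need:

import pandas as pd

cursor = collection.find(
    {'nameId': 989080880, 'Date': {'$gte': startDate}},
    {'name': 1, 'address': 1}  # projection: return only these fields (plus _id)
)
df = pd.DataFrame(cursor)

Returning fewer fields means less data has to be transferred and decoded per document, which is usually where the time goes for large result sets.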

Related

How to get Salesforce REST API to paginate?

I'm using the simple_salesforce Python wrapper for the Salesforce REST API. We have hundreds of thousands of records, and I'd like to split up the pull of the Salesforce data so all records are not pulled at the same time.
I've tried passing a query like:
results = salesforce_connection.query_all("SELECT my_field FROM my_model limit 2000 offset 50000")
to see records 50K through 52K but receive an error that offset can only be used for the first 2000 records. How can I use pagination so I don't need to pull all records at once?
You're looking to use salesforce_connection.query(query=SOQL) and then .query_more(nextRecordsUrl, True).
Since .query() only returns up to 2000 records, you need to use .query_more to get the next page of results.
From the simple-salesforce docs
SOQL queries are done via:
sf.query("SELECT Id, Email FROM Contact WHERE LastName = 'Jones'")
If, due to an especially large result, Salesforce adds a nextRecordsUrl to your query result, such as "nextRecordsUrl" : "/services/data/v26.0/query/01gD0000002HU6KIAW-2000", you can pull the additional results with either the ID or the full URL (if using the full URL, you must pass ‘True’ as your second argument)
sf.query_more("01gD0000002HU6KIAW-2000")
sf.query_more("/services/data/v26.0/query/01gD0000002HU6KIAW-2000", True)
Here is an example of using this:
data = []  # list to hold all the records
SOQL = "SELECT my_field FROM my_model"
results = sf.query(query=SOQL)  # initial API call

## loop through the results and add the records
for rec in results['records']:
    rec.pop('attributes', None)  # remove extra metadata
    data.append(rec)  # add the record to the list

## check the 'done' attribute in the response to see if there are more records
## while 'done' is False (more records to fetch), get the next page of records
while not results['done']:
    ## attribute 'nextRecordsUrl' holds the url to the next page of records
    results = sf.query_more(results['nextRecordsUrl'], True)
    ## repeat the loop of adding the records
    for rec in results['records']:
        rec.pop('attributes', None)
        data.append(rec)
Looping through the records and using the data:
## loop through the records and get their attribute values
for rec in data:
    # the attribute name will always be the same as the salesforce api name for that value
    print(rec['my_field'])
As the other answer says, though, this can start to use up a lot of resources. But it is what you're looking for if you want to achieve pagination.
Maybe create a more focused SOQL statement to get only the records needed for your use case at that specific moment.
LIMIT and OFFSET aren't really meant to be used like that: what if somebody inserts or deletes a record at an earlier position (not to mention you don't have an ORDER BY in there)? SF will open a proper cursor for you; use it.
The https://pypi.org/project/simple-salesforce/ docs for "Queries" say that you can either call query and then query_more, or you can go with query_all. query_all will loop and keep calling query_more until you exhaust the cursor - but this can easily eat your RAM.
Alternatively, look into the bulk query stuff; there's some magic in the API, but I don't know if it fits your use case. It'd be asynchronous calls and might not be implemented in the library. It's called PK Chunking. I wouldn't bother unless you have millions of records.
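If memory isn't a concern, the cursor handling can be collapsed into a single query_all call. A minimal sketch, reusing the field and object names from the question:

data = []
results = sf.query_all("SELECT my_field FROM my_model")  # follows nextRecordsUrl internally
for rec in results['records']:
    rec.pop('attributes', None)  # drop Salesforce metadata
    data.append(rec)

Under the hood this performs the same query/query_more loop shown above, so it trades RAM for convenience.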

mongoose query using sort and skip on populate is too slow

I'm using an AJAX request from the front end to load more comments on a post from the back end, which uses Node.js and Mongoose. I won't bore you with the front-end code and the route code, but here's the query code:
Post.findById(req.params.postId).populate({
    path: type, // type will either contain "comments" or "answers"
    populate: {
        path: 'author',
        model: 'User'
    },
    options: {
        sort: sortBy, // sortBy contains either "-date" or "-votes"
        skip: parseInt(req.params.numberLoaded), // how many are already shown
        limit: 25 // I only load this many new comments at a time
    }
}).exec(function(err, foundPost){
    console.log("query executed"); // code takes too long to get to this line
    if (err){
        res.send("database error, please try again later");
    } else {
        res.send(foundPost[type]);
    }
});
As mentioned in the title, everything works fine; my problem is just that this is too slow - the request takes about 1.5-2.5 seconds. Surely Mongoose has a way of doing this that loads faster. I poked around the Mongoose docs and Stack Overflow but didn't really find anything useful.
Using the skip-and-limit approach with MongoDB is slow by nature, because it normally needs to retrieve all matching documents, sort them, and only then return the desired segment of the results.
What you need to do to make it faster is to define indexes on your collections.
According to MongoDB's official documents:
Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement. If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must inspect.
-- https://docs.mongodb.com/manual/indexes/
Using indexes may increase the collection size, but they improve query efficiency a lot.
Indexes are commonly defined on fields which are frequently used in queries. In this case, you may want to define indexes on date and/or vote fields.
Read mongoose documentation to find out how to define indexes in your schemas:
http://mongoosejs.com/docs/guide.html#indexes
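If you want to experiment before changing the schema, the same indexes can be created directly on the underlying MongoDB collection. A minimal sketch using PyMongo; the database, collection, and field names are assumptions based on the question:

from pymongo import DESCENDING, MongoClient

client = MongoClient()
comments = client["mydb"]["comments"]  # assumed database and collection names

# index the fields the populate options sort on, so the sort can walk the index
# instead of sorting every matching document in memory
comments.create_index([("date", DESCENDING)])
comments.create_index([("votes", DESCENDING)])

The link above shows the equivalent schema-level definition in Mongoose (e.g. index: true on a path, or schema.index(...)).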

ArangoDB Slow Query

I'm new to ArangoDB, having trouble optimizing my queries and am hoping for some help.
The query I've provided below is a real example that I'm struggling with, 758.078 ms on my dev database but on staging, with a much larger dataset, it takes 531.511 s.
I'll also provide the size of each of the edge tables I'm traversing in dev and staging. Any help is really appreciated.
for doc in document
    filter doc._key == "my-key"
    for v, e, p in 3 any doc edge1, edge2, edge3
        options {uniqueVertices: 'global', bfs: true}
        filter DATE_ISO8601(p.vertices[2].date) > DATE_ISO8601("2017-09-04T00:00:01Z")
           and DATE_ISO8601(p.vertices[2].date) < DATE_ISO8601("2017-09-15T23:59:59Z")
        limit 1
        return {
            commit: p.vertices[2].hash,
            date: p.vertices[2].date,
            message: p.vertices[2].message,
            author: p.vertices[1].email,
            loc: p.vertices[3].stats.additions
        }
DEV
edge1: 2,638
edge2: 2,560
edge3: 386
STAGING
edge1: 5,438,811
edge2: 5,544,028
edge3: 423,545
The query is potentially slow because the filter condition
filter DATE_ISO8601(p.vertices[2].date) > DATE_ISO8601("2017-09-04T00:00:01Z")
   and DATE_ISO8601(p.vertices[2].date) < DATE_ISO8601("2017-09-15T23:59:59Z")
is not applied during the traversal but only afterwards.
Potentially this is due to the function calls (to DATE_ISO8601) in the filter conditions. If your date values are stored as numbers, can you check whether the following filter condition speeds up the query:
filter p.vertices[2].date > DATE_TIMESTAMP("2017-09-04T00:00:01Z")
   and p.vertices[2].date < DATE_TIMESTAMP("2017-09-15T23:59:59Z")
That modified filter condition should allow pulling the filter condition inside the traversal, so it is executed earlier.
You can verify the query execution plans using db._explain(<query string goes here>); in the ArangoShell or from the web interface's AQL editor.
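Putting the pieces together, here is a sketch of the rewritten query executed through the python-arango driver. The connection details are placeholders, the collection and edge-collection names come from the question, and the numeric comparison assumes the stored date field is a numeric timestamp as suggested above:

from arango import ArangoClient

client = ArangoClient()  # placeholder connection, defaults to http://localhost:8529
db = client.db('mydb', username='root', password='')  # placeholder database/credentials

aql = """
for doc in document
    filter doc._key == @key
    for v, e, p in 3 any doc edge1, edge2, edge3
        options {uniqueVertices: 'global', bfs: true}
        filter p.vertices[2].date > DATE_TIMESTAMP(@start)
           and p.vertices[2].date < DATE_TIMESTAMP(@end)
        limit 1
        return { commit: p.vertices[2].hash, date: p.vertices[2].date }
"""

cursor = db.aql.execute(aql, bind_vars={
    'key': 'my-key',
    'start': '2017-09-04T00:00:01Z',
    'end': '2017-09-15T23:59:59Z',
})
for row in cursor:
    print(row)

If the filter still isn't moved into the traversal, the execution plan from db._explain (or the web interface) will show where it is applied.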
Probably a bit late, but it may help someone. Using the DATE function makes the query much slower, so if possible remove the DATE function. For example:
[Screenshot: execution profiles of the same query with and without the DATE function in the filter.]
You can see that the query whose filter uses a Date function takes ~7 s; without the date function, the same query runs in ~0.5 s. Both variants query the data for 2018-09-29.

Speeding up my cloudant query

I was wondering whether someone could provide some advice on my Cloudant query below. It now takes upwards of 20 seconds to execute against a DB of 50,000 documents - I suspect I could be getting better speed than this.
The purpose of the query is to find all of my documents with the attribute "searchCode" equalling a specific value plus a further list of specific IDs.
Both searchCode and _id are indexed - any ideas why my query would be taking so long / what I could do to speed it up?
mydb.find({selector: {"$or": [{"searchCode": searchCode}, {"_id": {"$in": idList}}]}}, function (err, result) {
    if (!err){
        fulfill(result.docs);
    }
    else {
        console.error(err);
    }
});
Thanks,
James
You could try doing separate calls for the two queries:
find me documents where the searchCode = 'some value'
find me documents whose ids match a list of ids
The first can be achieved with a find call and a query like so:
{ selector: {"searchCode": searchCode} }
The second can be achieved by hitting the database's _all_docs endpoint, passing in the list of ids as a keys parameter, e.g.
GET /db/_all_docs?keys=["a","b","c"]
You might find that running both requests in parallel and merging the results gives you better performance.
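To make that concrete, here is a rough sketch of the two calls over plain HTTP with Python's requests; the account URL, credentials, database name, and id list are placeholders:

import requests

BASE = "https://ACCOUNT.cloudant.com/mydb"  # placeholder account and database
AUTH = ("username", "password")             # placeholder credentials

# 1) documents where searchCode equals a specific value (Cloudant Query _find endpoint)
by_code = requests.post(BASE + "/_find",
                        json={"selector": {"searchCode": "some value"}},
                        auth=AUTH).json()["docs"]

# 2) documents whose ids match a list (_all_docs endpoint with a keys parameter)
id_list = ["a", "b", "c"]
rows = requests.post(BASE + "/_all_docs",
                     params={"include_docs": "true"},
                     json={"keys": id_list},
                     auth=AUTH).json()["rows"]
by_id = [row["doc"] for row in rows if "doc" in row]

# merge the two result sets as needed

The idea is that each call can then use its own index: the searchCode selector uses the searchCode index, and the id lookup stays on the primary _id index.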

mongo: "2d" index and normal index

location: {
    lat: Number,
    lng: Number
}
location has a 2d index in my MongoDB and I have been using it for geospatial search, which is working fine.
Now if I need to search like db.find({lat: 12.121212, lng: 70.707070}), will it use the same index? Or do I need to define a new index? If so, how?
I am using mongoose driver in node.js
The 2d index used for the geospatial commands is not going to help with an equivalency match on the two fields. For that you will need to define a compound index on the two sub-document fields, something like this:
db.collection.ensureIndex({"location.lat" : 1, "location.lng" : 1})
This worked best for me with a test set of data - you can also define a normal index on the location field itself but that will be less efficient. You can test out the relative performance using hint and explain for any index combination. For example:
db.collection.find({"location.lat" : 179.45, "location.lng" : 90.23}).hint("location.lat_1_location.lng_1").explain()
You can do this for any index you wish in fact, though to check the results returned you will need to drop the .explain()
Please also bear in mind that a query can only use one index at a time, so if you are looking to combine the two (equivalency and a geospatial search) then the 2d index will be the only one used.
Note: all of the above examples are from the MongoDB JS shell, not node.js
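For completeness, a rough PyMongo equivalent of the same compound index and hinted equivalency query; the database and collection names are placeholders, and the explain output shape assumes a reasonably recent MongoDB server:

from pymongo import ASCENDING, MongoClient

client = MongoClient()
coll = client["mydb"]["places"]  # placeholder database and collection names

# compound index on the two sub-document fields
coll.create_index([("location.lat", ASCENDING), ("location.lng", ASCENDING)])

# equivalency match forced onto the compound index, with the query plan explained
plan = coll.find(
    {"location.lat": 179.45, "location.lng": 90.23}
).hint([("location.lat", ASCENDING), ("location.lng", ASCENDING)]).explain()
print(plan["queryPlanner"]["winningPlan"])

# drop .explain() to get the matching documents instead of the plan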
