Query with join on multiple collections with Python in MongoDB - python-3.x

I am new to MongoDB and Python (using pymongo 3.10.1). I can query a single collection, but I need to perform a join across two collections:
collection1 {
code
some other fields
}
collection2 {
code
some other fields
}
I would like to achieve:
select * from collection2 left inner join collection1 on collection2.code = collection1.code
I have found only basic examples of querying MongoDB from Python.
How can I achieve this with Python? Can I use .aggregate and $lookup from Python?

I finally got it working; here is the full code:
from pymongo import MongoClient
# Connect to MongoDB; replace "<< MONGODB URL >>" with your own connection string
client = MongoClient("<< MONGODB URL >>")
db = client.mydb
docs = db.collection1.aggregate([
    {"$lookup": {
        "from": "collection2",       # the other collection to join with
        "localField": "code",        # key field in collection1 (the collection being aggregated)
        "foreignField": "code",      # key field in collection2 (the "from" collection)
        "as": "linked_collections"   # name of the array field that holds the matched documents
    }},
    # $project selects the fields we need; it can be omitted.
    # Fields coming from collection2 live inside the "linked_collections" array.
    {"$project": {"code": 1, "field_from_collection1": 1, "linked_collections.field_from_collection2": 1}}
])
for doc in docs:
    print(doc)
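
The SQL in the question puts collection2 on the left side of the join, whereas the pipeline above starts from collection1. Below is a minimal sketch of the other direction, assuming both collections share the `code` field as in the question. The helper name `build_left_join_pipeline` and the `joined` field name are made up for illustration. $lookup behaves like a left outer join: documents without a match are kept and receive an empty array.

```python
def build_left_join_pipeline(from_collection, key, as_field="joined"):
    """Build a $lookup pipeline that left-joins `from_collection` on `key`.

    Hypothetical helper: the name and the default `as_field` are
    illustrative and not part of pymongo.
    """
    return [
        {"$lookup": {
            "from": from_collection,   # collection to join with
            "localField": key,         # field in the collection being aggregated
            "foreignField": key,       # field in `from_collection`
            "as": as_field,            # array field that receives the matches
        }},
        # Optional: emit one output document per match instead of an array.
        # preserveNullAndEmptyArrays keeps left-side documents without a match.
        {"$unwind": {"path": "$" + as_field, "preserveNullAndEmptyArrays": True}},
    ]


pipeline = build_left_join_pipeline("collection1", "code")

# With a live connection (assumed, not run here) this would be used as:
#   for doc in db.collection2.aggregate(pipeline):
#       print(doc)
```

Dropping the $unwind stage leaves the matches grouped in the `joined` array, which is often easier to work with when one document can match several others.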

I think there are two parts to your question:
How do you run more complicated queries with pymongo?
How do you do a join in MongoDB?
The first part is simple: you can express any query as a filter document and pass it to find({<your query>}). Here's an example from W3.
The second part, your main question, is more involved. Here's another Stack Overflow post where it is discussed in more detail. In short, since MongoDB 3.2 you can use the $lookup aggregation stage to perform joins.
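
For the first part, a filter is just a Python dict built from MongoDB query operators. Here is a small sketch, assuming hypothetical field names `price` and `status` alongside the `code` field from the question:

```python
# Hypothetical filter: documents whose code is one of two values,
# whose price is at least 10, and whose status is not "archived".
query = {
    "code": {"$in": ["A1", "B2"]},
    "price": {"$gte": 10},
    "status": {"$ne": "archived"},
}

# Projection: return only these fields (plus _id, which is included by default).
projection = {"code": 1, "price": 1}

# With a live connection (assumed, not run here):
#   for doc in db.collection1.find(query, projection):
#       print(doc)
```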

Related

Query to get all Cosmos DB documents referenced by another

Assume I have the following Cosmos DB container, which holds two document types in separate partitions:
{
"id": <string>,
"partitionKey": <string>, // Always "item"
"name": <string>
}
{
"id": <string>,
"partitionKey": <string>, // Always "group"
"items": <array[string]> // Always an array of ids for items in the "item" partition
}
I have the id of a "group" document, but I do not have the document itself. What I would like to do is perform a query which gives me all "item" documents referenced by the "group" document.
I know I can perform two queries: 1) Retrieve the "group" document, 2) Perform a query with IN clause on the "item" partition.
As I don't care about the "group" document other than getting the list of ids, is it possible to construct a single query to get me all the "item" documents I want with just the "group" document id?
You'll need to perform two queries, as there are no joins between separate documents. Even though there is support for subqueries, only correlated subqueries are currently supported (meaning, the inner subquery is referencing values from the outer query). Non-correlated subqueries are what you'd need.
Note that, even though you only need the list of ids, you don't have to retrieve the entire group document. You can project just the items property and then use it in your second query with array_contains(). Something like:
SELECT VALUE g.items
FROM g
WHERE g.id="1"
AND g.partitionKey="group"
SELECT VALUE i.name
FROM i
WHERE array_contains(<items-from-prior-query>,i.id)
AND i.partitionKey="item"
This documentation page clarifies the two subquery types and support for only correlated subqueries.
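
The two-step approach could look roughly like this from Python. This is a sketch, assuming `container` behaves like the `ContainerProxy` of the azure-cosmos SDK (its `query_items` method accepts `query`, `parameters` and `partition_key`) and that `partitionKey` is the container's partition key path. The function name is made up for illustration.

```python
def fetch_group_item_names(container, group_id):
    """Return the names of all items referenced by the group `group_id`.

    `container` is assumed to behave like azure.cosmos.ContainerProxy,
    i.e. to expose query_items(query=..., parameters=..., partition_key=...).
    """
    # Query 1: project only the `items` array of the group document.
    group_rows = container.query_items(
        query="SELECT VALUE g.items FROM g WHERE g.id = @id",
        parameters=[{"name": "@id", "value": group_id}],
        partition_key="group",
    )
    # SELECT VALUE yields one array per matching document; flatten them.
    item_ids = [item_id for row in group_rows for item_id in row]
    if not item_ids:
        return []

    # Query 2: fetch the referenced items from the "item" partition.
    item_rows = container.query_items(
        query="SELECT VALUE i.name FROM i WHERE ARRAY_CONTAINS(@ids, i.id)",
        parameters=[{"name": "@ids", "value": item_ids}],
        partition_key="item",
    )
    return list(item_rows)
```

Passing `partition_key` scopes each query to a single partition, so the explicit `partitionKey` condition in the WHERE clause is not repeated here.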

MongoDB- What is the efficient way to run same query on multiple collections, limit the results and sort them at the end

Assume I have 3 collections. All of them have a refId field and a few other common fields, but a given refId value may or may not be present in every collection. I want to find the documents with a given refId across all the collections, return the _id, refId, someCommonField and dateCreated fields, and at the end sort the combined results by dateCreated and limit them to x documents. Assume each collection grows by 100 documents per refId every day.
Example collections:
coll1
{_id:ObjectId("5349b4ddd2781d08c09890f3"), refId:ObjectId("54759eb3c090d83494e2d804"), color:"red",someCommonField:"whocares", dateCreated:ISODate("2020-06-02T03:21:08.870Z")},
{_id:ObjectId("5349b4ddd2781d08c09890f4"), refId:ObjectId("54759eb3c090d83494e2d804"), color:"green",someCommonField:"careswho",dateCreated:ISODate("2020-06-02T03:21:08.879Z")},
{_id:ObjectId("5349b4ddd2781d08c09890f5"), refId:ObjectId("54759eb3c090d83494e2d805"), color:"red", someCommonField:"random", dateCreated:ISODate("2020-06-02T03:21:08.876Z")}
coll2
{_id:ObjectId("5349b4ddd2781d08c09890f6"), refId:ObjectId("54759eb3c090d83494e2d804"), type:"book",someCommonField:"randomstring", dateCreated:ISODate("2020-06-02T03:21:08.873Z")},
{_id:ObjectId("5349b4ddd2781d08c09890f7"), refId:ObjectId("54759eb3c090d83494e2d806"), type:"video",someCommonField:"ignoreme",dateCreated:ISODate("2020-06-02T03:21:08.874Z")},
{_id:ObjectId("5349b4ddd2781d08c09890f8"), refId:ObjectId("54759eb3c090d83494e2d805"), type:"audio", someCommonField:"cool",dateCreated:ISODate("2020-06-02T03:21:08.875Z")}
coll3
{_id:ObjectId("5349b4ddd2781d08c09990f6"), refId:ObjectId("54759eb3c090d83494e2d804"), name:"tom", someCommonField:"mongo4",dateCreated:ISODate("2020-06-02T03:21:08.973Z")},
{_id:ObjectId("5349b4ddd2781d08c09990f7"), refId:ObjectId("54759eb3c090d83494e2d806"), name:"dom",someCommonField:"nodejs",dateCreated:ISODate("2020-06-02T03:21:08.974Z")},
{_id:ObjectId("5349b4ddd2781d08c09990f8"), refId:ObjectId("54759eb3c090d83494e2d805"), name:"pom",someCommonField:"static", dateCreated:ISODate("2020-06-02T03:21:08.975Z")}
Approach 1: Using Node.js Promise.all()
Let's assume I'm using Mongoose:
const promise1=db.coll1.find({refId:refId}).limit(10)
const promise2=db.coll2.find({refId:refId}).limit(10)
const promise3=db.coll3.find({refId:refId}).limit(10)
async function someRandomFn(){
    const [r1,r2,r3]=await Promise.all([promise1,promise2,promise3]);
    const results=[...r1,...r2,...r3];
    // Sort by dateCreated (oldest first), keep the first 10, and return only the common fields
    return results.sort((a,b)=>a.dateCreated-b.dateCreated).slice(0,10).map(({_id,refId,someCommonField,dateCreated})=>({_id,refId,someCommonField,dateCreated}));
}
Approach 2: Using MongoDB aggregation
Do a $match followed by $lookup, followed by $project. Once the same process is done on all 3 collections, sort and limit the pipeline.
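
One possible server-side variant, sketched here in Python, uses $unionWith instead of the $lookup idea described above. $unionWith requires MongoDB 4.4 or later. The helper name `build_union_pipeline` is made up for illustration. Each collection is sorted before it is limited, so that its 10 documents are the earliest 10 rather than an arbitrary 10.

```python
def build_union_pipeline(ref_id, other_collections, limit=10):
    """Build a pipeline that merges per-collection results for one refId.

    Hypothetical helper; run the result with aggregate() on the first
    collection and pass the remaining collection names here.
    """
    per_collection = [
        {"$match": {"refId": ref_id}},
        {"$sort": {"dateCreated": 1}},   # oldest first
        {"$limit": limit},               # at most `limit` documents per collection
        # _id is included by default.
        {"$project": {"refId": 1, "someCommonField": 1, "dateCreated": 1}},
    ]
    pipeline = list(per_collection)
    for name in other_collections:
        pipeline.append(
            {"$unionWith": {"coll": name, "pipeline": list(per_collection)}}
        )
    # Final ordering and cut-off over the combined results.
    pipeline.append({"$sort": {"dateCreated": 1}})
    pipeline.append({"$limit": limit})
    return pipeline


# Placeholder value; in real use this would be a bson.ObjectId.
ref_id = "54759eb3c090d83494e2d804"
pipeline = build_union_pipeline(ref_id, ["coll2", "coll3"])

# With a live connection (assumed, not run here):
#   results = list(db.coll1.aggregate(pipeline))
```

An index on refId and dateCreated in each collection would likely help the $match and $sort stages, though that depends on the actual data.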

Get the size of the result of aggregate method MongoDB

I have this aggregate query :
cr = db.last_response.aggregate([
{"$unwind": '$blocks'},
{"$match": {"sender_id": "1234", "page_id": "563921", "blocks.tag": "pay1"}},
{"$project": {"block": "$blocks.block"}}
])
Now I want to get the number of elements it returned (i.e. whether the cursor is empty or not).
This is how I did it.
I defined an empty list:
x = []
I iterated over the cursor and appended each result to x:
for i in cr:
    x.append(i['block'])
print("length of the result of aggregation cursor :",len(x))
My question is: is there a faster way to get the number of results of an aggregate query, like the count() method of a find() query?
Thanks
The faster way is to avoid transferring all the data from mongod to your application. To do this, you can add a final $group stage that counts the documents:
{"$group": {"_id": None, "count": {"$sum": 1}}},
This means that mongod runs the aggregation and returns only the document count as the result.
There is no way to get the count of the results without executing the aggregation pipeline.
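
A sketch of how this might be wired up in Python follows. It uses the $count stage (available since MongoDB 3.4) as a shorthand for the $group stage above. The helper names `with_count` and `read_count` are made up for illustration. When nothing matches, the server returns no document at all, so the empty case has to be mapped to zero.

```python
def with_count(pipeline, field="count"):
    """Return a copy of `pipeline` that ends in a $count stage."""
    return list(pipeline) + [{"$count": field}]


def read_count(cursor, field="count"):
    """Read the single count document; an empty result means zero matches."""
    first = next(iter(cursor), None)
    return 0 if first is None else first[field]


base = [
    # Filter first so that only relevant documents are unwound.
    {"$match": {"sender_id": "1234", "page_id": "563921", "blocks.tag": "pay1"}},
    {"$unwind": "$blocks"},
    # After unwinding, keep only the array elements with the wanted tag.
    {"$match": {"blocks.tag": "pay1"}},
]
counting_pipeline = with_count(base)

# With a live connection (assumed, not run here):
#   n = read_count(db.last_response.aggregate(counting_pipeline))
```

Placing a $match before $unwind is an optional change to the original pipeline; it lets the server discard non-matching documents before expanding the arrays.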

Mongodb upsert document or $addToSet

So I am trying to figure out the best way to do this. I have a dump of documents that I put into a collection. Each document includes an ID and a timestamp field that is an array. What I would like to accomplish is this: if there is a collision on the ID, push the new timestamp onto the array; otherwise, insert the entire document. I don't know if it changes anything, but I am using pymongo.
The $addToSet operator can be used together with upsert if you also want to keep the timestamps unique. Otherwise, the $push operator will append each timestamp to the end of the array when there is a collision on the ID field.
A sample query to achieve this is as follows:
from pymongo import MongoClient
from bson.objectid import ObjectId
from time import time
client = MongoClient()
db = client.experiments
id_ = ObjectId('5b6d2a8ed35b7caf9fde936f')
ts = time() + 300
db.sample.update_one(
{'ID': id_}, # filter
{'$addToSet': {'timestamp': ts}}, # update
upsert=True
)
More information is available in the documentation for pymongo.collection.Collection.update_one.
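
To also cover the "otherwise insert the entire document" part, the remaining fields can be placed under $setOnInsert, which is applied only when the upsert creates a new document. A minimal sketch follows, assuming the `ID` and `timestamp` field names from above. The helper `build_upsert` and the `source` field in the usage line are made up for illustration.

```python
def build_upsert(doc, id_field="ID", array_field="timestamp", unique=True):
    """Return (filter, update) for "append on collision, insert otherwise".

    Hypothetical helper: $addToSet keeps the array values unique,
    $push appends them unconditionally.
    """
    operator = "$addToSet" if unique else "$push"
    values = list(doc[array_field])
    # $each adds every value in the list rather than the list as one element.
    update = {operator: {array_field: {"$each": values}}}
    # Remaining fields are written only when the upsert inserts a new document.
    rest = {k: v for k, v in doc.items() if k not in (id_field, array_field)}
    if rest:
        update["$setOnInsert"] = rest
    return {id_field: doc[id_field]}, update


flt, upd = build_upsert({"ID": 1, "timestamp": [10, 20], "source": "dump"})

# With a live connection (assumed, not run here):
#   db.sample.update_one(flt, upd, upsert=True)
```

For a large dump, the same filter and update pairs could be wrapped in pymongo's `UpdateOne(flt, upd, upsert=True)` and sent in batches with `bulk_write`.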

not in query and select one field from second collection

My requirement is to count all the documents whose particular id is not in a reference collection. The equivalent SQL query would be as follows:
select count(*) from tbl1 where tbl1.arr.id not in (select id from tbl2)
I've tried the following, but got stuck on fetching a single field, i.e. id, from the 2nd query.
db.coll1.find(
{$not:
{"arr.id":
{$in:
{db.coll2.find()}//how would I fetch a single column from
//2nd coll2
}
}
}
).count()
Also, please note that arr.id is an ObjectId stored in collection coll1, and the same goes for collection coll2. Should special care be taken while fetching the id, e.g. ObjectId(id)?
Update: I am using MongoDB version 3.0.9.
I had to use $nin to check for the not-in condition and fetch the array in a different format, since the MongoDB version was 3.0.9. Below is how I did it.
db.coll1.find({"arr.id":{$nin:[db.coll2.find({},["id"])]}}).count()
For MongoDB v3.2 and later it would be as below:
db.coll1.find({"arr.id":{$nin:[db.coll2.find({},"id")]}}).count()
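
From pymongo, the same idea can be written as two explicit steps, since $nin expects a plain array of values. This is a sketch, assuming both arguments behave like pymongo `Collection` objects (`count_documents` needs pymongo 3.7 or later). The function name is made up for illustration. ObjectId values returned by `distinct` can be passed on unchanged, so no conversion is needed.

```python
def count_unreferenced(coll1, coll2):
    """Count coll1 documents with no arr.id present among coll2's id values.

    Both arguments are assumed to behave like pymongo Collection objects.
    """
    # Step 1: materialize the referenced ids as a plain list.
    referenced_ids = coll2.distinct("id")
    # Step 2: count the documents whose arr.id matches none of them.
    return coll1.count_documents({"arr.id": {"$nin": referenced_ids}})


# With a live connection (assumed, not run here):
#   n = count_unreferenced(db.coll1, db.coll2)
```

The list of ids travels to the client and back, so for a very large coll2 this approach may run into the 16 MB document size limit.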

Resources