I've searched all over the web and couldn't find a satisfactory answer for this, so I thought I would ask here.
Basically what I'm trying to do is build a full text search query with pagination, which returns results sorted by time.
The problem is, a naive sort like the following doesn't perform at all:
db.collection
    .find({ $text: { $search: "hello" } })
    .sort({ created_at: -1 })
    .limit(100)
    .toArray(function (err, docs) {
        // ...
    });
And yes, I've of course indexed it with created_at. And as you can see it's limited to 100 items.
So far what I gather is that the full text index in MongoDB doesn't let you sort by any arbitrary attribute in the collection AT ALL, and the only way to sort is by adding a $meta attribute that sorts on MongoDB's internal relevance score.
But that doesn't work for me, and I really want to sort this by created_at.
Maybe I'm misunderstanding the whole thing, but I refuse to believe that no one has come up with a solution for this very obvious use case. Am I missing something? Does anyone know how to sort a large text search result by a collection attribute? At this point I would appreciate ANY ray of light, even if it's a hack.
[EDIT] For example without the limit and sort, the response would look something like this:
[{
"msg": "hello world",
"created_at": 1000
}, {
"msg": "hello",
"created_at": 899
}, {
"msg": "hello hello",
"created_at": 1003
}, {
...
}]
But I want to limit it to only 100, sorted by created_at, AFTER having searched the database for the occurrence of "hello". I don't care about relevance and I only want to sort so that it's ordered by time.
[{
"msg": "hello hello",
"created_at": 1003
}, {
"msg": "hello world",
"created_at": 1000
}, {
"msg": "hello",
"created_at": 899
}, {
...
}]
Just to be clear, the query DOES work, but it takes a very long time even though I have indexed created_at. I don't have this issue when I do a similar find-sort-limit pattern with other queries (not full text search), so I think this is specific to full text search.
I am looking for a way to somehow make this query faster.
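For what it's worth, a minimal sketch of one such hack, assuming relevance really doesn't matter: search in descending time windows so the sort only ever has to order a bounded slice of matches, widening the window until enough results turn up. The "messages" collection name and the window sizes here are illustrative assumptions, not part of the original query:
// Sketch: query newest time windows first, widening until we have `limit` hits.
// Assumes created_at is a numeric timestamp in seconds, as in the examples above.
async function searchNewestFirst(db, term, limit) {
    const DAY = 24 * 60 * 60;
    const windows = [DAY, 7 * DAY, 30 * DAY, Infinity];
    const now = Math.floor(Date.now() / 1000);
    for (const span of windows) {
        const filter = { $text: { $search: term } };
        if (span !== Infinity) filter.created_at = { $gte: now - span };
        const docs = await db.collection("messages")
            .find(filter)
            .sort({ created_at: -1 })
            .limit(limit)
            .toArray();
        // Stop as soon as a window yields a full page (or we've searched everything).
        if (docs.length >= limit || span === Infinity) return docs;
    }
}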
This is a contrived / made up example that may not make sense practically, but I'm trying to paint a picture:
There is a web service / search API that supports Relay style pagination for searching for products
A user makes a search and requests 3 documents back (e.g. first: 3)
The service takes the request and passes it to a search index
The search index returns:
[
{
"name": "Shirt",
"sizes": ["S"]
},
{
"name": "Jacket",
"sizes": ["XL"]
},
{
"name": "Hat",
"sizes": ["S", "M"]
}
]
The result of this should be expanded, so that each product shows up as an individual record in the result set with one size per result record. The above example would split the Hat product into two results, so the final result would be:
[
{
"name": "Shirt",
"sizes": ["S"]
},
{
"name": "Jacket",
"sizes": ["XL"]
},
{
"name": "Hat",
"sizes": ["S"]
}
]
If the SECOND page was requested, it would actually start with the second Hat size (M):
[
...
{
"name": "Hat",
"sizes": ["M"]
},
...
]
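In code terms, the expand-then-page behavior I'm describing would be something like this sketch (the function and parameter names are made up for illustration):
// Flatten products so each (product, size) pair becomes its own result row,
// then slice out the requested page. With pageSize 3, page 0 ends at Hat/S
// and page 1 starts at Hat/M, matching the example above.
function expandAndPage(products, pageSize, pageIndex) {
    const rows = products.flatMap((p) =>
        p.sizes.map((size) => ({ name: p.name, sizes: [size] }))
    );
    return rows.slice(pageIndex * pageSize, (pageIndex + 1) * pageSize);
}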
I'm wondering if there is a common strategy for handling this, or common libraries that I might use to handle some of this logic.
I'm using [OpenSearch](https://opensearch.org), and Elasticsearch has a "collapse" and "expand" feature that sounds like it almost does what I'd want at the search backend level, but unfortunately I don't think this is actually the case.
In reality, what I want to do is likely not 100% possible, because if the search results change between queries you might not see the correct thing on a subsequent page, for example. But I still feel like this might be a common enough issue to have some discussion or solutions around it.
I'm thinking that one somewhat reliable way of handling this is to denormalize the data in the search index a bit, and just stick (for my example) a separate document in the index for both the S and M Hat products (even though the rest of the data would be the same). I'd just need to make sure to remove all of a product's documents when it changes, and would need to come up with unique identifiers in the index for the documents (so somehow encode the size in the indexed document's ID).
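As a minimal sketch of that denormalization (the client setup, the "products" index name, and the product.id field are all assumptions on my part):
// Index one document per (product, size) pair, encoding the size into the
// _id so that re-indexing a product cleanly overwrites its size documents.
const { Client } = require("@opensearch-project/opensearch");
const client = new Client({ node: "http://localhost:9200" });

async function indexProduct(product) {
    const body = product.sizes.flatMap((size) => [
        { index: { _index: "products", _id: `${product.id}:${size}` } },
        { name: product.name, sizes: [size] },
    ]);
    await client.bulk({ body });
}
Cleaning up stale size documents when a product's size list shrinks could then be a delete_by_query on the product's base ID before re-indexing, which is the "remove all documents" step mentioned above.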
Currently I am working on a mobile app. Basically, people can post their photos and their followers can like the photos, like Instagram. I use MongoDB as the database. Like Instagram, there might be a lot of likes for a single photo, so using a separate indexed document for each "like" seems unreasonable because it will waste a lot of memory. However, I'd like a user to be able to add a like quickly. So my question is: how do I model the "like"? Basically the data model is much like Instagram's, but using MongoDB.
No matter how you structure your overall document, there are basically two things you need: a property for a "count" and a "list" of those who have already posted their "like", in order to ensure there are no duplicates submitted. Here's a basic structure:
{
    "_id": ObjectId("54bb201aa3a0f26f885be2a3"),
    "photo": "imagename.png",
    "likeCount": 0,
    "likes": []
}
Whatever the case, there is a unique "_id" for your "photo post" and whatever information you want, but then the other fields as mentioned. The "likes" property here is an array, and that is going to hold the unique "_id" values from the "user" objects in your system. So every "user" has their own unique identifier somewhere, either in local storage or OpenId or something, but a unique identifier. I'll stick with ObjectId for the example.
When someone submits a "like" to a post, you want to issue the following update statement:
db.photos.update(
{
"_id": ObjectId("54bb201aa3a0f26f885be2a3"),
"likes": { "$ne": ObjectId("54bb2244a3a0f26f885be2a4") }
},
{
"$inc": { "likeCount": 1 },
"$push": { "likes": ObjectId("54bb2244a3a0f26f885be2a4") }
}
)
Now the $inc operation there will increase the value of "likeCount" by the number specified, so increase by 1. The $push operation adds the unique identifier for the user to the array in the document for future reference.
The main important thing here is to keep a record of the users who voted, and what is happening in the "query" part of the statement. Apart from selecting the document to update by its own unique "_id", the other important thing is to check the "likes" array to make sure the current voting user is not in there already.
The same is true for the reverse case or "removing" the "like":
db.photos.update(
{
"_id": ObjectId("54bb201aa3a0f26f885be2a3"),
"likes": ObjectId("54bb2244a3a0f26f885be2a4")
},
{
"$inc": { "likeCount": -1 },
"$pull": { "likes": ObjectId("54bb2244a3a0f26f885be2a4") }
}
)
The important thing here is the query conditions being used to make sure that no document is touched unless all conditions are met. So the count does not increase if the user had already voted, and does not decrease if their vote was not actually present at the time of the update.
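As a minimal sketch of how application code might act on that (the collection handle and variable names are my own assumptions), the write result tells you whether the conditions matched:
// If the user already liked the photo, the query part matches no document,
// so nothing changes and modifiedCount comes back as 0.
const result = await db.collection("photos").updateOne(
    { "_id": photoId, "likes": { "$ne": userId } },
    { "$inc": { "likeCount": 1 }, "$push": { "likes": userId } }
);
if (result.modifiedCount === 0) {
    // Duplicate like attempt; safe to ignore or report back to the client.
}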
Of course it is not practical to read an array with a couple of hundred entries in a document back in any other part of your application. But MongoDB has a very standard way to handle that as well:
db.photos.find(
    { "_id": ObjectId("54bb201aa3a0f26f885be2a3") },
    {
        "photo": 1,
        "likeCount": 1,
        "likes": {
            "$elemMatch": { "$eq": ObjectId("54bb2244a3a0f26f885be2a4") }
        }
    }
)
This usage of $elemMatch in projection will return the current user's entry only if they are present, and omit the "likes" field where they are not. This allows the rest of your application logic to be aware of whether the current user has already placed a vote.
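A small sketch of reading that from application code (variable names are assumptions):
const photo = await db.collection("photos").findOne(
    { "_id": photoId },
    { projection: { "photo": 1, "likeCount": 1, "likes": { "$elemMatch": { "$eq": userId } } } }
);
// The projected "likes" field is only present when the current user has liked the photo.
const alreadyLiked = Array.isArray(photo.likes) && photo.likes.length > 0;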
That is the basic technique and may work for you as is, but you should be aware that embedded arrays should not be extended indefinitely, and there is also a hard 16MB limit on BSON documents. So the concept is sound, but it just cannot be used on its own if you are expecting thousands of "like votes" on your content. There is a concept known as "bucketing" which is discussed in some detail in this example for Hybrid Schema design, allowing one solution to storing a high volume of "likes". You can look at that to use along with the basic concepts here as a way to do this at volume.
Here's my query as it stands:
"query":{
"fuzzy":{
"author":{
"value":query,
"fuzziness":2
},
"career_title":{
"value":query,
"fuzziness":2
}
}
}
This is part of a callback in Node.js. Query (which is being plugged in as a value to compare against) is set earlier in the function.
What I need it to be able to do is to check both the author and the career_title of a document, fuzzily, and return any documents that match in either field. The above statement never returns anything, and whenever I try to access the object it should create, it says it's undefined. I understand that I could write two queries, one to check each field, then sort the results by score, but I feel like searching every object for one field twice will be slower than searching every object for two fields once.
https://www.elastic.co/guide/en/elasticsearch/guide/current/fuzzy-match-query.html
As you can see here, in a multi_match query you can specify the fuzziness:
{
"query": {
"multi_match": {
"fields": [ "text", "title" ],
"query": "SURPRIZE ME!",
"fuzziness": "AUTO"
}
}
}
Something like this. Hope this helps.
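Adapted to the fields from the question (only author and career_title come from the original post; the rest is a sketch), that would look something like:
{
    "query": {
        "multi_match": {
            "fields": [ "author", "career_title" ],
            "query": query,
            "fuzziness": 2
        }
    }
}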
I'm trying to figure out the best way of sorting by rank using CouchDB. I have my documents set up in a players db like so:
{
"_id": "user2",
"_rev": "31-65e0e5bb1eba8d6a882aad29b63615a7",
"username": "testName",
"apps": {
"app1": {
"score": 1000
},
"app2": {
"score": 1000
},
"app3": {
"score": 1000
}
}
}
The player can have multiple scores for various apps. I'm trying to figure out the best way to pull say the top 50 scores for app1.
I think one idea could be to store the score of each user for each app separately, like so:
{"app": "app1", "user": "user_id", "score": 9000}
Then you could write a map function:
function (doc) {
    emit([doc.app, doc.score], { _id: doc.user, score: doc.score });
}
and query the view like
/view_name?include_docs=true&startkey=["app1",{}]&endkey=["app1"]&descending=true
With this arrangement you have a view sorted by app name and then score. Here are the results that you get:
{"total_rows":2,"offset":0,"rows":[
{"id":"61181c784df9e2db5cbb7837903b63f5","key":["app1",10000],"value":
{"_id":"5002539daa85a05d3fab16158a7861c1","score":10000},"doc":
{"_id":"5002539daa85a05d3fab16158a7861c1","_rev":"1-8441f2f5dbaaf22add8969cea5d83e1b","user":"jiwan"}},
{"id":"7f5d53b2da8ae3bea8e2b7ec74020526","key":["app1",9000],"value":
{"_id":"336c2619b052b04992947976095f56b0","score":9000},"doc":
{"_id":"336c2619b052b04992947976095f56b0","_rev":"3-3e4121c1831d7ecafc056e71a2502f3a","user":"akshat"}}
]}
You have the score in value and the user in doc.
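To get just the top 50 scores for app1, as in the question, the standard limit view parameter caps the rows returned:
/view_name?include_docs=true&startkey=["app1",{}]&endkey=["app1"]&descending=true&limit=50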
Edit
Oops! I mistyped the startkey and endkey :) Notice that it is not startKey but startkey, and the same for endkey. Also note that since descending is true, we reverse the order of the keys. It should work as expected now.
For more help check out
This answer and
This answer
I want to query Freebase and get a list of types for a string. For example, if I have the string "jordan" then I want a list of types, which could be country, basketball player, etc.
I would appreciate it if someone could point out the MQL query, as I don't know the type of the result yet.
Thanks
[{
"id": null,
"name": null,
"type": "/type/type",
"instance": {
"name~=": "jordan",
"id": null,
"name": null,
"limit": 1
}
}]
This asks for every type ("/type/type") that has at least one instance whose name matches "jordan". Note that MQL returns only the first 100 results by default; you'll either have to increase the limit or use cursors to get all the results.
While I'm aware it's not an MQL query directly, you may want to consider using the Freebase Search API rather than MQL to do this kind of thing - for example, do you want to find things with an alias of "Jordan" as well as things with the primary name?
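For illustration, a search request for the question's example string could be as simple as this (the endpoint is the classic Freebase search service; the response shape shown is from memory, so treat it as a sketch):
https://www.googleapis.com/freebase/v1/search?query=jordan
Each hit carries a "notable" field, e.g. {"name": "Jordan", "notable": {"name": "Country", "id": "/location/country"}}, which gives you the kind of type list you're after, and it matches aliases as well as primary names.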