npm package to build mongo query from URL query - node.js

I have the following mongo documents:
[{
"name": "Robert",
"title": "The art of war",
"description": "The art of war in the 20yh century"
},
{
"name": "Claadius",
"title": "The spring is back",
"description": "I love spring and all the seasons"
}
]
In my GET method, I have a query string that performs the search on one attribute alone, or on two or three together. For example: ?name=Robert&title=war&description=spring
How can I implement this?

This is almost exactly what query-to-mongo was meant for! It converts a query string like the one you show into mongo search criteria that can be passed into a mongo find. It handles a bunch of additional search operators (like >= and !=), which is where it gets complicated.
But if you're willing to trust it, here's an example of an express route that performs a find against a collection using a search query:
https://gist.github.com/pbatey/20d99ff772c29146897834d0f44d1c29
The query-to-mongo parser also handles paging into results with offset and limit.
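If it helps, here's a minimal sketch of such a route (not the linked gist). The criteria and options properties, and handing q2m the raw query string, follow my reading of query-to-mongo's README, so double-check them against the docs:
const express = require('express');
const q2m = require('query-to-mongo');
const { MongoClient } = require('mongodb');

const app = express();
const client = new MongoClient('mongodb://localhost:27017'); // assumed connection string

app.get('/api/books', async (req, res) => {
  // ?name=Robert&title=war is parsed into { criteria: { name: 'Robert', title: 'war' }, options: { ... } }
  const query = q2m(req.url.split('?')[1] || '');
  const docs = await client.db('test').collection('books')
    .find(query.criteria)
    .sort(query.options.sort || {})
    .skip(query.options.skip || 0)
    .limit(query.options.limit || 0)
    .toArray();
  res.json(docs);
});

client.connect().then(() => app.listen(3000));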

Related

Fine Tuning an OpenAI GPT-3 model on a collection of documents

According to the documentation at https://beta.openai.com/docs/guides/fine-tuning, the training data to fine-tune an OpenAI GPT-3 model should be structured as follows:
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
I have a collection of documents from an internal knowledge base that have been preprocessed into a JSONL file in a format like this:
{ "id": 0, "name": "Article Name", "description": "Article Description", "created_at": "timestamp", "updated_at": "timestamp", "answer": { "body_txt": "An internal knowledge base article with body text", }, "author": { "name": "First Last"}, "keywords": [], "url": "A URL to internal knowledge base"}
{ "id": 1, "name": "Article Name", "description": "Article Description", "created_at": "timestamp", "updated_at": "timestamp", "answer": { "body_txt": "An internal knowledge base article with body text", }, "author": { "name": "First Last"}, "keywords": [], "url": "A URL to internal knowledge base"}
{ "id": 2, "name": "Article Name", "description": "Article Description", "created_at": "timestamp", "updated_at": "timestamp", "answer": { "body_txt": "An internal knowledge base article with body text", }, "author": { "name": "First Last"}, "keywords": [], "url": "A URL to internal knowledge base"}
The documentation then suggests that a model can be fine-tuned on these articles using the command openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> -m <BASE_MODEL>.
Running this results in:
Error: Expected file to have JSONL format with prompt/completion keys. Missing prompt key on line 1. (HTTP status code: 400)
That isn't unexpected, given the documented file structure noted above. Indeed, if I run openai tools fine_tunes.prepare_data -f training-data.jsonl then I am told:
Your file contains 490 prompt-completion pairs
ERROR in necessary_column validator: prompt column/key is missing. Please make sure you name your columns/keys appropriately, then retry
Is this the right approach to fine-tuning a GPT-3 model on a collection of documents, such that questions can later be asked about their content? What would one put in the prompt and completion fields in this case, since I am not starting from a place where I have a collection of possible questions and ideal answers?
Have I fundamentally misunderstood the mechanism used to fine-tune a GPT-3 model? It does make sense to me that GPT-3 would need to be trained on possible questions and answers. However, given that the base models are already trained, and that this process is more about providing additional datasets which aren't in the public domain so that questions can be asked about them, I would have thought that what I want to achieve is possible. As a working example, I can indeed go to https://chat.openai.com/ and ask a question about these documents as follows:
Given the following document:
[Paste the text content of one of the documents]
Can you tell me XXX
And indeed it often gets the answer right. What I'm now trying to do is fine-tune the model on ~500 of these documents, so that one doesn't have to paste a whole document each time a question is asked, and so that the model might even be able to consider content across all ~500 documents rather than just the single one the user provided.
Fine-tuning is a process of modifying a pre-trained machine learning model to suit the needs of a particular task. It is not done to provide the model with an internal knowledge base. Instead of fine-tuning the model, you can create a database of embeddings for chunks of data from the knowledge base. This database can then be used to semantically search for the most relevant information in response to a query. When a query is received, the database is searched to find the chunk(s) of data most similar to the query, and that information is then fed to GPT-3, which generates the answer from it. With this approach you can easily update the knowledge by adding new chunks of data to the database.
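As a rough sketch of that retrieval approach in Node (the in-memory array here stands in for a real vector database, and the model names are only examples, not a recommendation):
// Rough sketch only; swap in whatever embedding/completion models and storage you actually use.
const OpenAI = require('openai');
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function embed(text) {
  const res = await openai.embeddings.create({ model: 'text-embedding-ada-002', input: text });
  return res.data[0].embedding;
}

// 1. Index: embed each knowledge-base chunk once and keep the vectors around.
async function buildIndex(chunks) {
  return Promise.all(chunks.map(async (text) => ({ text, vector: await embed(text) })));
}

// 2. Query: embed the question, pick the most similar chunks, hand them to the model.
async function answer(question, index) {
  const qVec = await embed(question);
  const top = index
    .map((c) => ({ ...c, score: cosine(qVec, c.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 3); // top 3 chunks as context
  const context = top.map((c) => c.text).join('\n---\n');
  const completion = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [
      { role: 'system', content: 'Answer using only the provided context.' },
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });
  return completion.choices[0].message.content;
}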

Querying mongoDB using pymongo (completely new to mongo/pymongo)

If this question seems too trivial, please let me know in the comments and I will do further research on how to solve it.
I have a collection called products where I store details of a particular product from different retailers. The schema of a document looks like this -
{
"_id": "uuid of a product",
"created_at": "timestamp",
"offers": [{
"retailer_id": 123,
"product_url": "url - of -a - product.com",
"price": "1"
},
{
"retailer_id": 456,
"product_url": "url - of -a - product.com",
"price": "1"
}
]
}
_id of a product is system generated. Consider a product like 'iPhone X'. This will be a single document with URLs and prices from multiple retailers like Amazon, eBay, etc.
Now, if a new URL comes into the system, I need to check whether this URL already exists in our database. The obvious way to do this is to iterate over every offer of every product document and see if the product_url field matches the input URL. That would require loading all the documents into memory and iterating through the offers of every product one by one. Now my questions arise:
Is there a simpler method to achieve this? Using pymongo?
Or does my database schema need to be changed, since this basic check_if_product_url_exists() is too complex?
MongoDB provides searching within arrays using dot notation.
So your query would be:
db.collection.find({'offers.product_url': 'url - of -a - product.com'})
The same syntax works in MongoDB shell or pymongo.

Unable to full text search in Solr

I have some data in Solr. I want to search for the document whose name is Chinmay Sahu. As you can see below, I get 3 results in the output, but I expected only 1, because the name was matched partially.
I want a full (exact) match on the name, so that only documents whose name is Chinmay Sahu are returned.
Output:
"docs": [
{
"id": "741fde46a654879949473b2cdc577913",
"content_id": "1277",
"name": "Chinmay Sahu",
"_version_": 1596995745829879800
},
{
"id": "4e98d680efaab3afe051f3ddc00dc5f2",
"content_id": "1825",
"name": "Chinmay Panda",
"_version_": 1596995745829879800
},
{
"id": "741fde46a654879949473b2cdc577913",
"content_id": "1259",
"name": "Sasmita Sahu",
"_version_": 1596995745829879800
}
]
Query:
name:Chinmay Sahu
Expected :
"docs": [
{
"id": "741fde46a654879949473b2cdc577913",
"content_id": "1277",
"name": "Chinmay Sahu",
"_version_": 1596995745829879800
},
]
Please help
Try doing this
name:"Chinmay Sahu"
You need to do a phrase query to match the exact name.
I am guessing that in your case the name field uses the standard tokenizer, which splits tokens on whitespace. So at index time, all 3 docs will contain a token "chinmay".
While you search using
name:Chinmay Sahu
Solr will interpret it like the query below, because when no field name is specified before a token, Solr automatically searches for it in the default field (note that the default field was removed in Solr 7.3, so it depends on which version of Solr you are using):
name:chinmay AND default_field:sahu
Since all three docs have chinmay as a token in the index, the query will match all 3 docs.
Now, I don't know what your default field is. Can you post your Solr schema? That way we can explain why you are seeing those 3 docs.
Since root545 already explained that field:foo bar will search for foo in field and bar in the default search field, I'll add that it seems like you don't want to concern yourself with the exact Lucene syntax for searching. The edismax query parser is well suited for separating the typed search string from which fields are being searched and whether you want all tokens to match.
The query in that case would be just Chinmay Sahu, while you'd set q.op=AND (all terms must match), defType=edismax (use the edismax query parser) and qf=name (search the name field):
q=Chinmay Sahu&q.op=AND&defType=edismax&qf=name
You can also tune the different phrase parameters to make sure that names with the tokens in the exact same sequence will be boosted higher than those that have them in the opposite sequence (i.e. Sahu Chinmay).
If this is a programmatic search where no user is actually typing in the suggestion, using a phrase search as suggested is the way to go (name:"Chinmay Sahu").
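For completeness, a small Node sketch that sends those edismax parameters to Solr's select handler (the host and core name are placeholders):
const params = new URLSearchParams({
  q: 'Chinmay Sahu',
  'q.op': 'AND',
  defType: 'edismax',
  qf: 'name',
  wt: 'json',
});

// Node 18+ has fetch built in; older versions need a fetch polyfill or an HTTP client.
fetch(`http://localhost:8983/solr/mycore/select?${params}`)
  .then((res) => res.json())
  .then((data) => console.log(data.response.docs));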
I would suggest using a query like
name:(Chinmay Sahu)
and making sure the default operator is AND, either in the settings or in the query string (q.op=AND).
With that approach you can use the user input much more easily, since you don't need to parse it much.

How to find an object which is at nth nested level in mongoDB? (single collection, single document)

I am trying to find an object nested at the nth level, using its '_id', within the same document.
Any suggestions or references or code samples would be appreciated.
For example, a document looks like this:
{
"_id": "xxxxx",
"name": "One",
"pocket": [{
"_id": "xxx123",
"name": "NestedOne",
"pocket": []
}, {
"_id": "xxx1234",
"name": "NestedTwo",
"pocket": [{
"_id": "xxx123456",
"name": "NestedTwoNested",
"pocket": [{"_id": "xxx123666",
"name": "NestedNestedOne",
"pocket": []
}]
}]
}]
}
A pocket can hold more pockets, and the nesting is dynamic.
Here, I would like to search the "pocket" entries by "_id", say "xxx123456", but without using a static path reference.
Thanks again.
I highly recommend you change your document structure to something easier to manage/search, as this will only become more of a pain to work with.
Why not use multiple collections, like explained in this answer?
Here's an easy way to think about this for your situation, which I hope is easier to reason about than me dropping some schema code on you...
Store all of your things as children in the same document. Give them unique _ids.
Store all of the contents of their pockets as collections. The collections simply hold all the ids that would normally be inside the pocket (instead of referencing the pockets themselves).
That way, a lot of your work can happen outside of DB calls. You just batch pull out the items you need when you need them, instead of searching nested documents!
However, if you can work with the entire document:
Looks like you want to do a recursive search a certain number of levels deep. I'll give you a sketch of the general idea, in hopes that you'll be able to adapt the rest yourself.
Say your function will be:
function SearchNDeep(obj, n, id, depth = 0) {
  if (obj._id === id) return obj;                  // found the object we were looking for
  if (depth >= n || !Array.isArray(obj.pocket)) {
    return undefined;                              // out of levels, or nothing nested here
  }
  for (const thing of obj.pocket) {                // check everything in this pocket...
    const found = SearchNDeep(thing, n, id, depth + 1); // ...one level deeper
    if (found) return found;
  }
  return undefined;                                // not found down this branch
}
That's the general idea: the function calls itself while passing a counter for how deep it is, and the base cases are running out of levels or finding the object. Remember to return undefined if you don't find it, and the right object if you do! Good luck!
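With the example document above, SearchNDeep(doc, 2, "xxx123456") would return the pocket named NestedTwoNested, while SearchNDeep(doc, 1, "xxx123456") would come back undefined because that pocket sits two levels down.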

Freebase batch search

I'm trying to use Freebase to search for multiple items at a time (using one API call). For example, if I have two items:
Robert Downey, Jr.
The Avengers
I want to query Freebase once and get back results for both items. Basically all I need is the mid for the top 3 or 4 results for both items. I would like to rely on Freebase's search API to provide disambiguation for topics. For example, I'd like to be able to search for "Robert Downey, Jr." with the abbreviation: "RDJ".
This is easy to do when searching one item at a time:
https://www.googleapis.com/freebase/v1/search?query=rdj
Making two calls like this would give me exactly what I'm looking for, but I would like to stay away from making these calls individually.
Reconciliation
I did run across the json-rpc call for reconciliation, and I have tried the following:
Endpoint: https://www.googleapis.com/rpc
POST body:
[
{
"method": "freebase.reconcile",
"apiVersion": "v1",
"params": {
"name": ["RDJ"],
"key": "api_key",
"limit":10
}
},
{
"method": "freebase.reconcile",
"apiVersion": "v1",
"params": {
"name": ["the avengers"],
"key": "api_key",
"limit":10
}
}
]
This works fairly well for Robert Downey, Jr. in that I get a result of type /film/actor, as I did using the search API. However, for The Avengers, I get a set of results with type /book/book rather than the 2012 film. These results don't seem to be prioritized the same way as the search results.
I tried something similar using json-rpc for a Freebase search method:
{
"method": "freebase.search",
"apiVersion": "v1",
"params": {
"name": ["RDJ"],
"key": "api_key",
"limit":10
}
}
But the "freebase.search" method didn't seem to exist.
One thing to note is that I will not know the expected type of the items I am looking for beforehand.
Long story short: I want the exact results the search API provides, but with multiple queries wrapped up into one call.
Am I missing something terribly simple like an OR operator for the search API?? I've been searching for days, but can't seem to find a good solution. I would appreciate any help at all!
Why not just make two calls asynchronously? That would give you the results you need with almost no penalty in latency.
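For example (a sketch; I'm assuming the response has the usual result array with mid fields, and api_key stands in for your key as in your reconcile example):
const search = (q) =>
  fetch(`https://www.googleapis.com/freebase/v1/search?query=${encodeURIComponent(q)}&limit=4&key=api_key`)
    .then((res) => res.json());

// Fire both searches in parallel and wait for both to finish.
Promise.all([search('rdj'), search('the avengers')]).then(([rdj, avengers]) => {
  console.log(rdj.result.map((r) => r.mid));      // top mids for Robert Downey, Jr.
  console.log(avengers.result.map((r) => r.mid)); // top mids for The Avengers
});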
A few relevant facts:
The Reconcile API is still experimental. It's intended for use in reconciling against a type at a minimum and usually scoring using additional property values.
The Search API isn't included in the RPC mechanism because its freeform output doesn't work with the assumptions of the RPC framework. Ditto for the Topic API, although that's not really relevant here.
The Search API has a fairly expressive S-expression language. You don't say if you want the queries scored independently or together, but if you want them ranked jointly, you can use a filter expression like [(any name:rdj name:"The Avengers")]
https://www.googleapis.com/freebase/v1/search?query=&limit=10&filter=%28any%20name:rdj%20name:%22the%20avengers%22%29
