Fine Tuning an OpenAI GPT-3 model on a collection of documents - openai-api

According to the documentation (https://beta.openai.com/docs/guides/fine-tuning), the training data used to fine-tune an OpenAI GPT-3 model should be structured as follows:
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
I have a collection of documents from an internal knowledge base that have been preprocessed into a JSONL file in a format like this:
{ "id": 0, "name": "Article Name", "description": "Article Description", "created_at": "timestamp", "updated_at": "timestamp", "answer": { "body_txt": "An internal knowledge base article with body text" }, "author": { "name": "First Last"}, "keywords": [], "url": "A URL to internal knowledge base"}
{ "id": 1, "name": "Article Name", "description": "Article Description", "created_at": "timestamp", "updated_at": "timestamp", "answer": { "body_txt": "An internal knowledge base article with body text" }, "author": { "name": "First Last"}, "keywords": [], "url": "A URL to internal knowledge base"}
{ "id": 2, "name": "Article Name", "description": "Article Description", "created_at": "timestamp", "updated_at": "timestamp", "answer": { "body_txt": "An internal knowledge base article with body text" }, "author": { "name": "First Last"}, "keywords": [], "url": "A URL to internal knowledge base"}
The documentation then suggests that a model could be fine-tuned on these articles using the command openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> -m <BASE_MODEL>.
Running this results in:
Error: Expected file to have JSONL format with prompt/completion keys. Missing prompt key on line 1. (HTTP status code: 400)
Which isn't unexpected given the documented file structure noted above. Indeed if I run openai tools fine_tunes.prepare_data -f training-data.jsonl then I am told:
Your file contains 490 prompt-completion pairs
ERROR in necessary_column validator: prompt column/key is missing. Please make sure you name your columns/keys appropriately, then retry
Is this the right approach to fine-tuning a GPT-3 model on a collection of documents, such that questions could later be asked about their content? What would one put in the prompt and completion fields in this case, since I am not starting from a collection of possible questions and ideal answers?
Have I fundamentally misunderstood the mechanism used to fine-tune a GPT-3 model? It does make sense to me that GPT-3 would need to be trained on possible questions and answers. However, given that the base models are already trained, and that this process is more about providing additional datasets which aren't in the public domain so that questions can be asked about them, I would have thought that what I want to achieve is possible. As a working example, I can indeed go to https://chat.openai.com/ and ask a question about these documents as follows:
Given the following document:
[Paste the text content of one of the documents]
Can you tell me XXX
And indeed it often gets the answer right. What I'm now trying to do is fine-tune the model on ~500 of these documents, so that one doesn't have to paste a whole document each time a question is asked, and so that the model might even be able to consider content across all ~500 documents rather than just the single one the user provided.

Fine-tuning is a process of modifying a pre-trained machine learning model to suit the needs of a particular task. It is not done to provide the model with an internal knowledge base. Instead of fine-tuning the model, you can create a database of embeddings for chunks of data from the knowledge base. This database can then be used to semantically search for the most relevant information in response to a query. When a query is received, the database is searched for the chunks of data that are most similar to the query. Those chunks can then be fed to GPT-3 together with the question, so that it answers from them. With this approach you can easily update the knowledge by adding new chunks of data to the database.
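To make the retrieval idea above concrete, here is a minimal sketch in Python. It is not a definitive implementation: the embed function is a toy bag-of-words stand-in for a real embedding model, and the sample chunks, the function names and the prompt layout (borrowed from the question) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector.
    In practice this would be replaced by a call to a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Store each chunk next to its vector; this list is the 'database'."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def search(index, query, top_k=1):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(context_chunks, question):
    """Assemble a prompt in the same shape the question uses manually."""
    context = "\n\n".join(context_chunks)
    return f"Given the following document:\n{context}\n{question}"


# Hypothetical knowledge-base chunks, invented for this example.
chunks = [
    "To reset your VPN password, open the self-service portal and choose Reset.",
    "Expense reports must be submitted by the fifth working day of each month.",
    "New laptops are ordered through the hardware request form.",
]
index = build_index(chunks)

question = "How do I reset my VPN password?"
best = search(index, question, top_k=1)
prompt = build_prompt(best, question)  # this string is what gets sent to the model
```

Adding knowledge later is just another entry in the index, for example index.append((new_chunk, embed(new_chunk))), with no retraining involved.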

Related

Azure Form Recognizer Table Not Being Properly Extracted

I am using https://learn.microsoft.com/en-us/azure/cognitive-services/form-recognizer/quickstarts/curl-train-extract to build a training model without using Labels.
The problem I am running into is when I run a file through the model (the file was used to train the model), it is not picking up the "table" part. What I mean is, there is no "tables" node.
From what I have seen, it should be able to build this as part of the JSON, but it's breaking it down into super granular OCR, such as
{
"key": {
"text": "__Tokens__34",
"boundingBox": null,
"elements": null
},
"value": {
"text": "2 X 3/4",
"boundingBox": [
3.1181,
3.7292,
3.5278,
3.7292,
3.5278,
3.8583,
3.1181,
3.8583
],
"elements": null
},
"confidence": 1.0
}
Am I missing a flag or something?
Thank you in advance.
It seems the table is not detected automatically with Train without labels. Can you please share an image of the table, with any PII removed? You can also try Train with labels or the Layout API to see if it recognizes the table automatically.
I had the same problem, but I've noticed it working when I enable Full Text

Creating multiple rules in Azure search Synonym map is not working

I am creating a synonym map like below,
{ "name": "country-synonym",
"format":"solr",
"synonyms": "germany, dl, deutschland\n
india, ind"
}
But when I queried the synonym map to view it, it had been created as shown below: instead of two rules, only one rule was created.
{
"@odata.context": "https://#############.search.windows.net/$metadata#synonymmaps/$entity",
"@odata.etag": "###########",
"name": "country-synonym",
"format": "solr",
"synonyms": "germany, dl, deutschland india, ind",
"encryptionKey": null
}
What am I doing wrong?
You answered it correctly in your comment. The search is working correctly, meaning the search terms were posted to the API correctly. The problem is the browser collapsing the newline. Look at the raw response in the Inspector and you should see the newline.
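As a rough illustration of the point about newlines, the sketch below builds the request body in Python and checks the rules by parsing the raw JSON rather than reading a rendered view. The rules list and variable names are assumptions for the example; only the name, format and synonyms keys come from the question.

```python
import json

# One Solr-format rule per list entry; the rules are joined with a real newline.
rules = ["germany, dl, deutschland", "india, ind"]

payload = {
    "name": "country-synonym",
    "format": "solr",
    "synonyms": "\n".join(rules),
}

# json.dumps escapes the newline as the two characters \n inside the string,
# so the serialized body stays on a single line.
body = json.dumps(payload)

# Parsing the raw JSON (instead of looking at a rendered page) recovers both rules.
stored_rules = json.loads(body)["synonyms"].split("\n")
```

Splitting the parsed synonyms value on the newline is a quick way to count how many rules the service actually stored.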

npm package to build mongo query from URL query

I have the following mongo documents:
[{
"name": "Robert",
"title": "The art of war",
"description": "The art of war in the 20th century"
},
{
"name": "Claadius",
"title": "The spring is back",
"description": "I love spring and all the seasons"
}
]
On my GET method, I want to search on one attribute alone, or on two or three together. For example: ?name=Robert&title=war&description=spring
How can I implement this?
This is almost exactly what query-to-mongo was meant for! It converts a query like the one you show into a mongo search criteria that can be passed into a mongo find. It handles a bunch of additional search operators (like >= and !=) which is where it gets complicated.
But if you're willing to trust it, here's an example of an express route that performs a find against a collection using a search query:
https://gist.github.com/pbatey/20d99ff772c29146897834d0f44d1c29
The query-to-mongo parser also handles paging into results with offset and limit.
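For readers who want to see the underlying idea without the package, here is a minimal Python sketch of turning such a query string into a Mongo filter document. It is not how query-to-mongo works internally; the field whitelist, the function name and the choice of case-insensitive substring matching (so that title=war matches "The art of war") are assumptions for illustration.

```python
import re
from urllib.parse import parse_qsl

# Hypothetical whitelist: only these fields may be searched.
ALLOWED_FIELDS = {"name", "title", "description"}


def build_filter(query_string):
    """Convert '?name=Robert&title=war' into a Mongo filter document.

    Each supplied field becomes a case-insensitive substring match.
    Unknown fields and empty values are ignored.
    """
    criteria = {}
    for field, value in parse_qsl(query_string.lstrip("?")):
        if field in ALLOWED_FIELDS and value:
            # Escape the value so user input is matched literally, not as a regex.
            criteria[field] = {"$regex": re.escape(value), "$options": "i"}
    return criteria


example = build_filter("?name=Robert&title=war")
```

The resulting dictionary has the shape of a Mongo filter, so it could be handed to a find call; all supplied fields must match because Mongo combines top-level keys with a logical AND.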

LUIS - understand any person name

We are building a product on LUIS / Microsoft Bot Framework, and one of the doubts we have is about understanding person names. The product can be used by anyone who signs up on our website, which means any company that signs up can have any number of employees with any names.
What we understood is that the user entity is not able to recognize all names. We have created a phrase list, but as far as we know there is a limit on phrase lists (10K, or even if it's 100K), and the set of names in the world has no limit. The other way we are thinking of is to not train the entity with utterances. However, if we have 100s of customers with 1000s of users each, the utterances will not be a good idea in that case.
I do not see any other way of handling this situation. Am I missing something here? Has anyone faced a similar problem, and how was it handled?
The worst case would be to create a separate LUIS instance for each customer, but that's a really big task to take on only because we can't handle names.
As you might already know, a person's name could literally be anything: e.g. an animal, car, month, or color. So, there isn't any definitive way to identify something as a name. The closest you can come is via text analysis parts of speech and either taking a guess or comparing to an existing list. LUIS or any other NLP tool is unlikely to help with this. Here's one approach that might work out better. Try something like Microsoft's Text Analytics cognitive service, with a POST to the Key Phrases endpoint, like this:
https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases
and the body:
{
"documents": [
{
"language": "en-us",
"id": "myid",
"text": "Please book a flight for John Smith at 2:30pm on Wednesday."
}
]
}
That returns:
{
"languageDetection": {
"documents": [
{
"id": "e4263091-2d54-4ab7-b660-d2b393c4a889",
"detectedLanguages": [
{
"name": "English",
"iso6391Name": "en",
"score": 1.0
}
]
}
],
"errors": []
},
"keyPhrases": {
"documents": [
{
"id": "e4263091-2d54-4ab7-b660-d2b393c4a889",
"keyPhrases": [
"John Smith",
"flight"
]
}
],
"errors": []
},
"sentiment": {
"documents": [
{
"id": "e4263091-2d54-4ab7-b660-d2b393c4a889",
"score": 0.5
}
],
"errors": []
}
}
Notice that you get "John Smith" and "flight" back as key phrases. "flight" is definitely not a name, but "John Smith" might be, giving you a better idea of what the name is. Additionally, if you have a database of customer names, you can compare the value to a customer name, either exact or soundex, to increase your confidence in the name.
Sometimes, the services don't give you a 100% answer and you have to be creative with work-arounds. Please see the Text Analytics API docs for more info.
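The suggestion to compare a key phrase against known customer names, exactly or by Soundex, could look roughly like the Python sketch below. The Soundex function follows the classic American Soundex rules as I understand them; the word-by-word comparison, the function names and the sample names are assumptions for illustration, not part of any Microsoft API.

```python
def soundex(word):
    """Classic American Soundex code (one letter plus three digits) for a word."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = codes.get(ch, "")  # vowels get no code and reset 'prev'
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


def name_key(full_name):
    """Soundex code for each word of a name, e.g. 'John Smith' -> ('J500', 'S530')."""
    return tuple(soundex(part) for part in full_name.split())


def match_name(candidate, known_names):
    """Return (matched_name, 'exact' or 'soundex'), or None if nothing matches."""
    for known in known_names:
        if known.lower() == candidate.lower():
            return known, "exact"
    for known in known_names:
        if name_key(known) == name_key(candidate):
            return known, "soundex"
    return None


# Hypothetical customer list; the candidates are the key phrases from the response above.
customers = ["Jon Smyth", "Alice Jones"]
name_hit = match_name("John Smith", customers)
other_hit = match_name("flight", customers)
```

A Soundex match is only a hint that raises confidence; it would still make sense to confirm the name with the user before acting on it.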
I have asked a few Microsoft people in my local region about this, and it seems there is no way LUIS can identify names at the moment.
That's not good: for an NLP tool, it should be able to handle such things :(
I found wit.ai to be the best so far at identifying names, and IBM Watson is also good up to some level. Let's see how they turn out in the future, but for now I have switched to https://wit.ai

RESTful API design - naming an "activity" resource

When designing the endpoints for an activity resource that provides information on the activity of other resources such as users and organisations we are struggling with naming conventions.
What would be more semantic:
/organisations/activity
/organisations/activity/${activityId}
/users/activity
/users/activity/${activityId}
OR
/activity/users/${activityId}
/activity/users
/activity/organisations/${activityId}
/activity/organisations
There's not a generic answer for this, especially since the mechanisms doing the lookup/retrieval at the other end, and associated back-ends vary so drastically, not to mention the use case purpose and intended application.
That said, assuming for all intents and purposes the "schema" (or ... endpoint convention from the point of view of the end user) was just going to be flat, I have seen many more of the latter activity convention, as that is the actual resource, which is what many applications and APIs are developed around.
I've come to expect the following style of representation from APIs today (how they achieve the referencing and mapping is a different story; this is from the point of view of the API reference):
{
"Activity": [
{
"date": "1970-01-01 08:00:00",
"some_other_resource_reference_uuid": "f1c4a41e-1639-4e35-ba98-e7b169d1c92d",
"user": "b3ababc4-461b-404a-a1a2-83b4ca8c097f",
"uuid": "0ccf1b41-aecf-45f9-a963-178128096c97"
}
],
"Users": [
{
"email": "johnanderson@mycompany.net",
"first": "John",
"last": "Anderson",
"user_preference_1": "somevalue",
"user_property_1": "somevalue",
"uuid": "b3ababc4-461b-404a-a1a2-83b4ca8c097f"
}
]
}
The StackExchange API also allows retrieving objects through multiple methods.
For example, the User type looks like this:
{
"view_count": 1000,
"user_type": "registered",
"user_id": 9999,
"link": "http://example.stackexchange.com/users/1/example-user",
"profile_image": "https://www.gravatar.com/avatar/a007be5a61f6aa8f3e85ae2fc18dd66e?d=identicon&r=PG",
"display_name": "Example User"
}
And on the Question type, the same user is shown underneath the owner object:
{
"owner": {
"user_id": 9999,
"user_type": "registered",
"profile_image": "https://www.gravatar.com/avatar/a007be5a61f6aa8f3e85ae2fc18dd66e?d=identicon&r=PG",
"display_name": "Example User",
"link": "https://example.stackexchange.com/users/1/example-user"
},
"is_answered": false,
"view_count": 31415,
"favorite_count": 1,
"down_vote_count": 2,
"up_vote_count": 3,
"answer_count": 0,
"score": 1,
"last_activity_date": 1494871135,
"creation_date": 1494827935,
"last_edit_date": 1494896335,
"question_id": 1234,
"link": "https://example.stackexchange.com/questions/1234/an-example-post-title",
"title": "An example post title",
"body": "An example post body"
}
On the Posts type reference (I'm using this as a separate example because there are only a handful of methods that reach this type), you'll see an example at the bottom:
Methods That Return This Type
  posts
  posts/{ids}
  users/{ids}/posts 2.2
  me/posts 2.2
So whilst you can access resources (or "types" as it is on StackExchange), through a number of ways including filters and complex queries, there still exists the ability to see the desired resource through a number of more direct transparent URI conventions.
Different applications will clearly have different requirements. For example, the Gmail API is user-based all the way. This makes sense from a user's point of view, given that in the context of the authenticated credential you're separating one user's objects from another's.
This doesn't mean Google uses the same convention for all of its APIs; the Activities API resource is all about the activity.
Even looking at the Twitter API, there is a Direct Messages endpoint resource that has sender and receiver objects within.
I've not seen many APIs at all that are limited to accessing resources purely via a user endpoint, unless the situation obviously calls for it, i.e. the Gmail example above.
Regardless of how flexible a REST API can be, the minimum I have come to expect is that some kind of Activity, location, physical object, or other entity is usually its own resource, and the user association is plugged in and referenced at various degrees of flexibility (at a minimum, the example given at the top of this post).
It should be pointed out that in a true REST API the URI holds no meaning. It's the link relationships from your organizations and users resources that matter.
Clients should just discover those URLs, and should also adapt to the new situation if you decide that you want a different URL structure after all.
That being said, it's nice to have a logical structure for this type of thing. However, either is fine. You're asking for an opinion, there is not really a standard or best practice. That said, I would choose option #1.
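The point about clients discovering URLs through link relations, rather than hard-coding a URL structure, can be sketched in Python as follows. Everything here is hypothetical: the in-memory RESOURCES table stands in for an HTTP server, the HAL-style _links layout is one common convention rather than a requirement, and "activity" is an invented relation name.

```python
# Hypothetical in-memory "server": maps URLs to HAL-style representations.
RESOURCES = {
    "/users/42": {
        "first": "John",
        "_links": {
            "self": {"href": "/users/42"},
            "activity": {"href": "/activity/users/42"},
        },
    },
    "/activity/users/42": {
        "items": [{"date": "1970-01-01 08:00:00", "user": "/users/42"}],
        "_links": {"self": {"href": "/activity/users/42"}},
    },
}


def fetch(url):
    """Stand-in for an HTTP GET that returns the parsed representation."""
    return RESOURCES[url]


def follow(resource, rel):
    """Client side: look up the link relation instead of building a URL."""
    return fetch(resource["_links"][rel]["href"])


def move_activity(old_url, new_url, owner_url):
    """Server side: switch URL structure and update the advertised link."""
    RESOURCES[new_url] = RESOURCES.pop(old_url)
    RESOURCES[new_url]["_links"]["self"]["href"] = new_url
    RESOURCES[owner_url]["_links"]["activity"]["href"] = new_url


before = follow(fetch("/users/42"), "activity")

# The server changes from /activity/users/{id} to /users/{id}/activity ...
move_activity("/activity/users/42", "/users/42/activity", "/users/42")

# ... and the same client call still works, because it only follows the link.
after = follow(fetch("/users/42"), "activity")
```

With this style of client, the choice between the two URL layouts in the question matters mostly for human readability, since the client code never spells out either one.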

Resources