I uploaded a file to OpenAI with the purpose "answers", but I can't figure out how to access it without getting an error.
This is the contents of my file:
{"text":"We are focusing on improving communication to the customers."}
{"text":"The emphasis on the customer is working well."}
{"text":"Services initiatives"}
{"text":"CMM and focus on customer service."}
{"text":"We are working on CMM to focus on the customer."}
{"text":"Our emphasis on making our customers number one."}
{"text":"Customer relationships are continuing to improve."}
This is the file's metadata when I query OpenAI:
{"object":"file","id":"file-02fdY5PuZ1aO4cnd2r1PxaB7","purpose":"answers","filename":"f3f1c105-5217-4132-8950-a040b6183ed7","bytes":458,"created_at":1648396701,"status":"processed","status_details":null}
Now I ask a question at https://api.openai.com/v1/answers with this POST:
{"file":"file-02fdY5PuZ1aO4cnd2r1PxaB7","question":"What is the main theme?","search_model": "davinci","model": "davinci","examples_context": "In 2017, U.S. life expectancy was 78.6 years.","examples": [["What is human life expectancy in the United States?","The life expectancy in the United States is 78 years."]],"max_tokens": 100,"stop": ["\n", "<|endoftext|>"]}
The server returns a 400 error. If I replace the "file" field in my request with a "documents" array containing the same data, the request succeeds.
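For reference, here is a minimal sketch of the workaround I'm using, assuming the JSONL file shown above is saved locally (the filename `answers.jsonl` and the helper name are mine, for illustration). It reads each line and builds the list I pass as the "documents" array:

```python
import json

def jsonl_to_documents(path):
    """Read a JSONL file (one {"text": ...} object per line) and
    return a plain list of strings for the "documents" field."""
    docs = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                docs.append(json.loads(line)["text"])
    return docs
```

The resulting list goes into the request body as `"documents": jsonl_to_documents("answers.jsonl")` in place of the `"file"` field.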
I don't know what else to look at. If anyone can suggest something, or sees an issue in what I am trying to do, I would deeply appreciate hearing from you.
Thanks
I am trying to write Python code with tweepy that will track all tweets from a specific country, since a given date, that contain some of my chosen keywords. I have chosen a lot of keywords, around 24-25.
My keywords are vigilance anticipation interesting ecstacy joy serenity admiration trust acceptance terror fear apprehensive amazement surprize distraction grief sadness pensiveness loathing disgust boredom rage anger annoyance.
For more context, my code so far is:
# Look up the place ID for Canada at country granularity
places = api.geo_search(query="Canada", granularity="country")
place_id = places[0].id

# Page through English tweets from that place since the given date
public_tweets = tweepy.Cursor(api.search,
                              q="place:" + place_id + " since:2020-03-01",
                              lang="en",
                              ).items(num_tweets)
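One way to fold the keywords into the query is to OR them together before adding the place and date operators. A sketch (the helper name `build_keyword_query` is mine; it assumes the standard search API's `OR` syntax, and note that the standard search endpoint caps query length, so a long keyword list may need to be split across several requests):

```python
def build_keyword_query(keywords, place_id, since):
    """Combine keywords into a single OR query, scoped to a place
    and restricted to tweets on or after the given date."""
    keyword_part = " OR ".join(keywords)
    return "({}) place:{} since:{}".format(keyword_part, place_id, since)

# Example: a short keyword list
q = build_keyword_query(["joy", "fear", "anger"], "abc123", "2020-03-01")
print(q)  # (joy OR fear OR anger) place:abc123 since:2020-03-01
```

The resulting string can then be passed as `q=` to `tweepy.Cursor(api.search, ...)` as in the snippet above.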
Please help me with this question as soon as possible.
Thank You
I have a CSV file of song lyrics that I took from Genius. Right now I'm preparing my data. I have two columns, "songs" and "artist". The "songs" column holds a lot of information: title, album, year, lyrics and URL. I need to split the "songs" column into 5 columns.
I tried to split the data by comma like this:
df = pd.read_csv('output.csv', header=None)
df = pd.DataFrame(df[0].str.split(',').tolist())
But with this code I got 122 columns, because every comma inside the lyrics created another column.
I guess I have to keep all my lyrics inside double quotes; then, if I split by comma, the full lyric will remain in a single column.
Does anyone know how I can do that?
Adding 1 sample of the data:
songs,artist
"{'title': 'Berzerk', 'album': 'The Marshall Mathers LP 2', 'year': '2013-08-27', 'lyrics': '[Verse 1]\nNow this shit\'s about to kick off, this party looks wack\nLet\'s take it back to straight hip-hop and start it from scratch\nI\'m \'bout to bloody this track up, everybody get back\nThat\'s why my pen needs a pad, \'cause my rhymes on the rag\nJust like I did with addiction, I\'m \'bout to kick it\nLike a magician, critics I turn to crickets\nGot \'em still on the fence whether to picket\nBut quick to get it impaled when I tell \'em, ""Stick it!""\nSo sick I\'m looking pale, wait, that\'s my pigment\n\'Bout to go ham, ya bish, shout out to Kendrick\nLet\'s bring it back to that vintage Slim, bitch!\nThe art of MCing mixed with da Vinci and MC Ren\nAnd I don\'t mean Stimpy\'s friend, bitch\nBeen Public Enemy since you thought PE was gym, bitch\n\n[Pre-Chorus]\nKick your shoes off, let your hair down\n(And go berserk) all night long\nGrow your beard out, just weird out\n(And go berserk) all night long\n\n[Chorus 1]\nWe\'re gonna rock this house until we knock it down\nSo turn the volume loud\n\'Cause it\'s mayhem \'til the A.M.\nSo, baby, make just like K-Fed\nAnd let yourself go, let yourself go\nSay ""Fuck it!"" before we kick the bucket\nLife\'s too short to not go for broke\nSo everybody, everybody, go berserk, grab your vial, yeah\n\n[Verse 2]\nGuess it\'s just the way that I\'m dressed, ain\'t it?\nKhakis pressed, Nike shoes crispy and fresh laced\nSo I guess it ain\'t that aftershave\nOr cologne that made \'em just faint\nPlus I showed up with a coat fresher than wet paint\nSo if love is a chess game, check mate\nBut girl, your body\'s bangin\', jump me in, dang, bang-bang\nYes siree \'Bob\', I was thinking the same thang\nSo come get on this Kid\'s rock, baw with da baw, dang-dang\nPow-p-p-p-pow, chica, pow, chica, wow-wow\nGot your gal blowin\' up a valve, valve-valve\nAin\'t slowin\' down, throw in the towel, towel-towel\nDumb it down, I don\'t know how, 
huh-huh, how-how\nAt least I know that I don\'t know\nQuestion is, are you bozos smart enough to feel stupid?\nHope so, now ho…\n\n[Pre-Chorus]\nKick your shoes off, let your hair down\n(And go berserk) all night long\nGrow your beard out, just weird out\n(And go berserk) all night long\n\n[Chorus 2]\nWe\'re gonna rock this house until we knock it down\nSo turn the volume loud\n\'Cause it\'s mayhem \'til the A.M.\nSo crank the bass up like crazy\nAnd let yourself go, let yourself go\nSay ""Fuck it!"" before we kick the bucket\nLife\'s too short to not go for broke\nSo everybody, everybody, go berzerk, get your vinyls!\n\n[Scratch]\n\n[Verse 3]\nThey say that love is powerful as cough syrup in styrofoam\nAll I know is I fell asleep and woke up in that Monte Carlo\nWith the ugly Kardashian, Lamar, oh\nSorry yo, we done both set the bar low\nFar as hard drugs are though, that\'s the past\nBut I done did enough codeine to knock Future into tomorrow\nAnd girl, I ain\'t got no money to borrow\nBut I am tryin\' to find a way to get you alone: car note\nOh, Marshall Mathers\nShithead with a potty mouth, get the bar of soap lathered\nKangol\'s and Carheartless Cargos\nGirl, you\'re fixin\' to get your heart broke\nDon\'t be absurd, ma\'am, you birdbrain, baby\nI ain\'t called anybody baby since Birdman, unless you\'re a swallow\nWord, Rick? 
(Word, man, you heard)\nBut don\'t get discouraged, girl\nThis is your jam, unless you got toe jam\n\n[Pre-Chorus]\nKick your shoes off, let your hair down\n(And go berserk) all night long\nGrow your beard out, just weird out\n(And go berserk) all night long\n\n[Chorus 1]\nWe\'re gonna rock this house until we knock it down\nSo turn the volume loud\n\'Cause it\'s mayhem \'til the A.M.\nSo, baby, make just like K-Fed\nAnd let yourself go, let yourself go\nSay ""Fuck it!"" before we kick the bucket\nLife\'s too short to not go for broke\nSo everybody, everybody, go berserk, grab your vial, yeah', 'image': 'https://images.genius.com/a47bb228d28fd8a0e6e73abfabef7832.1000x1000x1.jpg'}",Eminem
Try this. Each cell of the "songs" column is a Python dict literal stored as a string, so parse it with ast.literal_eval and expand the keys into columns:
import ast
import pandas as pd

raw = pd.read_csv("output.csv")
# Parse the dict literal stored as a string in each cell
raw["songs"] = raw["songs"].apply(ast.literal_eval)
# Expand each dict's keys (title, album, year, lyrics, image) into columns
songs = raw["songs"].apply(pd.Series)
result = pd.concat([raw[["artist"]], songs], axis=1)
result.head()
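To see why this works, here is a minimal, self-contained sketch (the shortened dict string below is made up for illustration):

```python
import ast
import pandas as pd

# A tiny stand-in for one cell of the "songs" column
cell = "{'title': 'Berzerk', 'year': '2013-08-27', 'lyrics': 'Now this...'}"

parsed = ast.literal_eval(cell)   # safely parses the Python dict literal
row = pd.Series(parsed)           # dict keys become the Series index

print(list(row.index))  # ['title', 'year', 'lyrics']
```

Splitting on commas can never work here, because the lyrics themselves contain commas; parsing the cell as the dict it actually is sidesteps the problem entirely.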
I was looking through the 1975 Oregon Trail Basic Code and found this line in it:
PRINT LIN(2)
I have searched quite a few places but can't find any reference to it.
Can anyone tell me what this means?
Edit:
Sorry, to be clear: I was asking what PRINT LIN(2) means. Does anyone know?
"Oregon Trail" source of 1975 at www.filfre.net/misc/oregon1975.bas was written in BASIC for a HP-2100 system.
This HP-2100 system was a series of minicomputers produced by Hewlett-Packard.
This system run an interpreted BASIC named "HP Time-Shared BASIC".
This is the reference manual of "TimeShared BASIC/2000 Level F".
About:
PRINT LIN(2)
Generates a carriage return and 2 (two) line feeds.
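For illustration, a tiny sketch (my own, not from the manual) showing the control characters this corresponds to:

```python
def lin(n):
    """Emulate HP Time-Shared BASIC's LIN(n): one carriage return
    followed by n line feeds."""
    return "\r" + "\n" * n

# PRINT LIN(2) thus emits a carriage return plus two line feeds,
# producing one blank line before the next output.
print(repr(lin(2)))  # '\r\n\n'
```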
"Oregon Trail" for year 1978 at www.filfre.net/misc/oregon1978.bas was written using BASIC for "CDC Cyber range of mainframe-class supercomputers of Control Data Corporation (CDC)".
Documentation
http://bitsavers.org/pdf/cdc/cyber/lang/basic/
19983900K_BASIC_Version_3_Reference_Manual_Aug84.pdf
I compare both sources (strip line number without reference by THEN, GOTO or GOSUB) at
Oregon Trail Compare
I'm having a problem inputting tab delimited files into the stanford classifier.
Although I was able to successfully walk through all the included stanford tutorials, including the newsgroup tutorial, when I try to input my own training and test data it doesn't load properly.
At first I thought the problem was that I was saving the data into a tab delimited file using an Excel spreadsheet and it was some kind of encoding issue.
But then I got exactly the same results when I did the following. First I literally typed the demo data below into gedit, making sure to use a tab between the politics/sports class and the ensuing text:
politics Obama today announced a new immigration policy.
sports The NBA all-star game was last weekend.
politics Both parties are eyeing the next midterm elections.
politics Congress votes tomorrow on electoral reforms.
sports The Lakers lost again last night, 102-100.
politics The Supreme Court will rule on gay marriage this spring.
sports The Red Sox report to spring training in two weeks.
sports Messi set a world record for goals in a calendar year in 2012.
politics The Senate will vote on a new budget proposal next week.
politics The President declared on Friday that he will veto any budget that doesn't include revenue increases.
I saved that as myproject/demo-train.txt and a similar file as myproject/demo-test.txt.
I then ran the following:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt
The good news: this actually ran without throwing any errors.
The bad news: since it doesn't extract any features, it can't actually estimate a real model and the probability defaults to 1/n for each item, where n is the number of classes.
So then I ran the same command but with two basic options specified:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -2.useSplitWords =2.splitWordsRegexp "\s+"
That yielded:
Exception in thread "main" java.lang.RuntimeException: Training dataset could not be processed
at edu.stanford.nlp.classify.ColumnDataClassifier.readDataset(ColumnDataClassifier.java:402)
at edu.stanford.nlp.classify.ColumnDataClassifier.readTrainingExamples(ColumnDataClassifier.java:317)
at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(ColumnDataClassifier.java:1652)
at edu.stanford.nlp.classify.ColumnDataClassifier.main(ColumnDataClassifier.java:1628)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatum(ColumnDataClassifier.java:670)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatumFromLine(ColumnDataClassifier.java:267)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatum(ColumnDataClassifier.java:396)
... 3 more
These are exactly the same results I got when I used the real data I saved from Excel.
What's more, I don't know how to make sense of the ArrayIndexOutOfBoundsException. When I used readline in Python to print out the raw strings for both the demo files I created and the tutorial files that worked, nothing about the formatting seemed different. So I don't know why this exception would be raised for one set of files but not the other.
Finally, one other quirk. At one point I thought maybe line breaks were the problem. So I deleted all line breaks from the demo files while preserving tab breaks and ran the same command:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -2.useSplitWords =2.splitWordsRegexp "\s+"
Surprisingly, this time no java exceptions are thrown. But again, it's worthless: it treats the entire file as one observation, and can't properly fit a model as a result.
I've spent 8 hours on this now and have exhausted everything I can think of. I'm new to Java but I don't think that should be an issue here -- according to Stanford's API documentation for ColumnDataClassifier, all that's required is a tab delimited file.
Any help would be MUCH appreciated.
One last note: I've run these same commands with the same files on both Windows and Ubuntu, and the results are the same in each.
Use a properties file. In the Stanford classifier example:
trainFile=20news-bydate-devtrain-stanford-classifier.txt
testFile=20news-bydate-devtest-stanford-classifier.txt
2.useSplitWords=true
2.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
2.splitWordsIgnoreRegexp=\\s+
The number 2 at the start of the last three lines signifies the (0-indexed) column in your TSV file that the option applies to. Your file has only two columns, class (column 0) and text (column 1), so in your case you would use
trainFile=20news-bydate-devtrain-stanford-classifier.txt
testFile=20news-bydate-devtest-stanford-classifier.txt
1.useSplitWords=true
1.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
1.splitWordsIgnoreRegexp=\\s+
or if you want to run with command line arguments
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -1.useSplitWords -1.splitWordsRegexp "\s+"
I've faced the same error as you.
Pay attention to the tabs in the text you are classifying.
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
This means that at some point the classifier expected an array of 3 elements after splitting a line on tabs.
I ran a method that counts the number of tabs in each line; any line that does not have exactly two of them is where the error comes from.
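A sketch of such a check (my own helper, not part of the classifier), which reports every line whose tab count differs from what the column options expect:

```python
def find_bad_lines(path, expected_tabs=2):
    """Return (line_number, tab_count) for every line whose tab count
    differs from the expected number of column separators."""
    bad = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            tabs = line.rstrip("\n").count("\t")
            if tabs != expected_tabs:
                bad.append((i, tabs))
    return bad
```

For a two-column file like the demo data above (class<TAB>text), you would call it with `expected_tabs=1`; any line it reports is one the classifier cannot split into the expected number of columns.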