I am trying to create a prompt for OpenAI engines. I want the engine to give me a text block (sentence, paragraph, story, short article, etc.) that uses a specific word I choose, in the specific sense of that word that I choose (vocabulary in context).
For example,
"Generate me a 4 sentence paragraph using the word 'present' where 'present' means 'in a particular place.'"
That should give me a short paragraph and use the word "present" at least once throughout it wherever or however it may be. And when it does use the word present, it should mean "in a particular place" given the context.
When given the above prompt, the Davinci-002 engine is able to satisfy it, but I am trying to get Curie-001 to satisfy it as well, even if the results are worse, because it is faster.
Curie-001 is not able to do that, and I don't think it's because the engine is incapable. I think it is bad phrasing in my prompt or some setting I am not using.
Is there anything I can do to make Curie-001, or any other engine, capable of this?
2 example outputs of each engine:
Davinci-002:
I am present in this room. You are present in that room. He is present in the other room. She is present in the last room.
I am present in my office. You are present in your home. He is present in his laboratory. She is present in her studio.
Curie-001:
I'm sitting in my living room, present in a particular place. My cat is sleeping on the couch next to me, present in a particular place. Outside of my window, the sun is shining and the birds are singing, present in a particular place.
I'm giving you a present. It's a surprise.
(The first one is really good, but the second one isn't. Very inconsistent.)
I want to create a chatbot. One of the intents is who-is which allows users to ask "Who is" for Arabic names to get information on a person. I have people stored in a database (firestore). I would like the user to say "Who is Saalih Uthaymeen?" But they may ask "Who is Shaykh Saalih al-'Uthaymeen?" as well. The names are different, but they're the same really. And they're close in spelling.
I noticed a synonyms feature in dialogflow, but I don't have the names stored in dialogflow so I don't know if it's useful. Since the synonyms have similar spelling, can't dialogflow help? Otherwise, I see the following options:
Option 1. Manually create every name and its synonyms in dialogflow.
This is manual hard work. Even if I programmatically enter every name, I have to manually enter every synonym. And I have 2 or 3 hundred names.
Option 2. Manually create synonyms in my database.
Basically, I have a table of people... so I would create a new table mapping every person to every synonym possible for his name. Since the spellings will be very similar, I'm confident this is a waste of work and time.
Is there any other option Dialogflow offers?
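One way to avoid maintaining a synonym table by hand is to normalize the spoken name and fuzzy-match it against the stored names in your own backend. The following is only a minimal sketch of that idea in Python using the standard `difflib` module; it is not a Dialogflow feature. The `KNOWN_NAMES` list, the `TITLES` set, the second name in the list, and the 0.8 cutoff are all assumptions made for illustration.

```python
import difflib
import re

# Hypothetical stand-in for the names stored in the database.
KNOWN_NAMES = ["Saalih Uthaymeen", "Abdullah Example"]

# Assumed honorifics to ignore when comparing names.
TITLES = {"shaykh", "sheikh"}


def normalize(name):
    """Lowercase, drop titles, strip a leading 'al-' and any non-letters."""
    words = []
    for word in name.lower().split():
        word = re.sub(r"^al-", "", word)
        word = re.sub(r"[^a-z]", "", word)
        if word and word not in TITLES:
            words.append(word)
    return " ".join(words)


def find_person(query, names=KNOWN_NAMES, cutoff=0.8):
    """Return the stored name closest to `query`, or None if nothing is close."""
    by_normalized = {normalize(n): n for n in names}
    hits = difflib.get_close_matches(
        normalize(query), list(by_normalized), n=1, cutoff=cutoff
    )
    return by_normalized[hits[0]] if hits else None
```

With this sketch, `find_person("Shaykh Saalih al-'Uthaymeen")` resolves to the stored "Saalih Uthaymeen" because both normalize to the same string. The cutoff would need tuning against real data.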
I am working with a document, where each row contains a description for a specific incident (fire incidents, where firefighters turn up and thereafter write a report).
The incidents/reports are written by several different people, so the language varies a lot, which makes it difficult to code for one specific context using one word: is.number(search(substring;text))
Because even if the word is in the text piece, the context is not related to what I am trying to analyse.
I want to make my word search more flexible by being able to "put" or "store" several different words/phrases in my "substring", so that I can get closer to the specific context I wish to analyse.
That way I would cover more data that is in fact related, but described differently in the individual incident reports.
I have tried to search for a solution myself, but am unsure on how to phrase this specific inquiry.
So far I have only been able to use the code piece above, which is a bit insufficient, when trying to comb through 2000 rows.
I hope that someone is able to help me!
Thank you
An example:
Store the following words: stopped fire, killed fire, fire was put out into: Killed fire
So that when I use Killed fire all the above wordings are included in my search.
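To make the request concrete, here is a minimal sketch of the same idea in Python rather than as a spreadsheet formula. The `SYNONYMS` table and the `matches` function are hypothetical names used only for illustration; the phrases are the ones from the example above.

```python
# Hypothetical table: one label mapped to every wording it should cover.
SYNONYMS = {
    "Killed fire": ["stopped fire", "killed fire", "fire was put out"],
}


def matches(label, text, synonyms=SYNONYMS):
    """Return True if any wording stored under `label` occurs in `text`.

    The comparison is case-insensitive, like a plain substring search.
    """
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in synonyms[label])
```

Calling `matches("Killed fire", row_text)` for each row then plays the role of the single-word search, but checks every stored wording.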
Is OpenNLP able to extract keywords from content?
If yes, how?
If no, which tool should I use?
I would like to tag content automatically.
For example.
Jessica Chastain has revealed that a meeting has taken place with Marvel over an undisclosed role, although the star has confirmed it is not Captain Marvel.
“We’ve talked about aligning our forces in the future,” Chastain told MTV of her relationship with the studio. “And here’s the thing with me… If you’re going to be in a superhero movie, you only get one chance.”
“You’re that character forever. So why do a superhero movie and play the boring civilian?” A possible reference to Maya Hansen there? Chastain had been attached to the Iron Man 3 character before eventually dropping out on account of scheduling difficulties…
“I don’t want to say too much,” continues the star, “but there was one thing, there was a possibility in the future of the character becoming… And I was like, ‘I understand that, but I want to do it now!’”
Just who that character might be is up for interpretation, although Chastain has moved to quash subsequent rumours that she is in line to play Captain Marvel.
It should be tagged as "superhero", "movie".
Is OpenNLP able to do this?
Thanks.
OpenNLP is able to extract Named entities for you. This means anything that is the name of a person, place, organization etc. would potentially be recognized by the system.
However, what you are looking for is keyword extraction, where you want to identify relevant keywords that describe a document in the general sense. I would recommend checking out Alchemyapi.com.
They have models to extract keywords, taxonomy, and named entities, amongst other things. The only issue is that the free version just gives you 1000 transactions per day (which might be enough for your task).
I have to write some code working with locales. Is there a good introduction to the subject to get me started?
First posted at Everything you need to know about Locales
A long time ago when I was a senior developer in the Windows group at Microsoft, I was sent to the Far East to help get the F.E. version of Windows 3.1 shipped. That was my introduction to localizing software – basically being pushed in to the deep end of the pool and told to learn how to swim. This is where I learned that localization is a lot more than translation.
Note: One interesting thing we hit - the infamous Blue Screen of Death switched the screen into text mode. You can't display Asian languages in text mode. So we (and by we I mean me) came up with a system where we put the screen in VGA mode, stored the 12 pt. courier bitmap at the resolution for just the characters used in BSoD messages, and rendered it that way. You kids today have it so easy :).
So keep in mind that taking locale into account can lead to some very unexpected work.
The Locale
Ok, so fast forward to today. What is a locale and what do you need to know? A locale is fundamentally the language and country a program is running under. (There can also be a variant added to the country but use of this is extremely rare.) The locale is this combination, and you can have any combination of these two parts. For example, a Spanish national in Germany would set es_DE so that their user interface is in Spanish (es) but their country settings are those of Germany (DE). Do not assume location based on language or vice-versa.
The language part of the locale is very simple - that's what language you want to display the text in your app in. If the user is a Spanish speaker, you want to display all text in Spanish. But what dialect of Spanish - it is quite different between Spain and Mexico (just as in America we spell color while in England it's colour). So the country can impact the language used, depending on the combination.
All languages that support locale specific resources (which is pretty much all of them today) use a fall-back system. They will first look for a resource for the language_country combination. While es_DE has probably never been done, there often is an es_MX and es_ES. So for a locale set to es_MX it will first look for the es_MX resource. If that is not found, it then looks for the es resource. This is the resource for that language, but not specific to any country. Generally this is copied from the largest country (economically) for that language. If that is not found, it then goes to the "general" resource which is almost always the native language the program was written in.
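The fall-back order described above can be sketched in a few lines of Python. This is only an illustration, not how any particular run-time implements it: the `RESOURCES` table and its strings are made up, the empty tag `""` is an assumed stand-in for the "general" resource, and the sketch falls back per key rather than per resource file.

```python
# Hypothetical resource tables, keyed by locale tag. "" is the general resource.
RESOURCES = {
    "es_MX": {"computer": "computadora"},
    "es": {"computer": "ordenador", "greeting": "Hola"},
    "": {"computer": "computer", "greeting": "Hello", "farewell": "Goodbye"},
}


def fallback_chain(locale_tag):
    """Return the lookup order for a tag, most specific first."""
    parts = locale_tag.split("_")
    chain = ["_".join(parts[:i]) for i in range(len(parts), 0, -1)]
    chain.append("")  # finally try the general resource
    return chain


def lookup(resources, locale_tag, key):
    """Return the first value found for `key` along the fall-back chain."""
    for tag in fallback_chain(locale_tag):
        table = resources.get(tag, {})
        if key in table:
            return table[key]
    raise KeyError(key)
```

For es_MX the chain is es_MX, then es, then the general resource, so a string missing from es_MX is taken from es, and a string missing from both still displays in the program's native language.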
The theory behind this fallback is that you only have to define different resources for the more specific locales - and that is very useful. But even more importantly, when new parts of the UI are made and you want to ship beta copies, or you release before you can get everything translated, the translated parts are localized and the untranslated parts still display - but in English. This annoys the snot out of users in other countries, but it does get them the program sooner. (Note: We use Sisulizer for translating our resources - good product.)
The second half is the country. This is used primarily for number and date/time settings. This runs the gamut from what the decimal and thousand separator symbols are (12,345.67 in the U.S. is 12 345,67 in Russia) to what calendar is in use. The way to handle this is by using the run-time classes available for all operations on these elements when interacting with a user. Classes exist both for parsing user-entered values and for displaying them.
Keep a clear distinction between values the user enters or are displayed to the user and values stored internally as data. A number is a string in an XML file but in the XML file it will be "12345.67" (unless someone did something very stupid). Keep your data strongly typed and only do the locale specific conversions when displaying or parsing text to/from the user. Storing data in a locale specific format will bite you in the ass sooner or later.
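As a rough illustration of that split, the Python sketch below keeps the value as a plain float and only applies locale rules at the edges. In real code you would use your run-time's locale classes; the hand-written `SEPARATORS` table here is an assumption so the example runs without any locale data installed, and it only covers the two formats mentioned above.

```python
# Hypothetical table: locale tag -> (thousands separator, decimal separator).
SEPARATORS = {
    "en_US": (",", "."),
    "ru_RU": (" ", ","),
}


def display_number(value, locale_tag):
    """Format an internal float for display in the given locale."""
    group, decimal = SEPARATORS[locale_tag]
    text = f"{value:,.2f}"  # locale-neutral form, e.g. '12,345.67'
    # Swap separators via a placeholder so the two replacements cannot collide.
    return text.replace(",", "\x00").replace(".", decimal).replace("\x00", group)


def parse_number(text, locale_tag):
    """Parse user-entered text in the given locale back into a float."""
    group, decimal = SEPARATORS[locale_tag]
    return float(text.replace(group, "").replace(decimal, "."))
```

The stored value is always the float 12345.67; only `display_number` and `parse_number` know that a Russian user sees and types "12 345,67".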
Chinese
Chinese does not have an alphabet but instead has a set of glyphs. The People's Republic of China several decades ago significantly revised how to draw the glyphs and this is called simplified. The Chinese glyphs used elsewhere continued with the original and that is called traditional. It is the exact same set of characters, but they are drawn differently. It is akin to our having both a text A and a script A - they both mean the same thing but are drawn quite differently.
This is more of a font issue than a translation issue, except that wording and usage has diverged a bit, in part due to the differences in approach between traditional and simplified Chinese. The end result is that you generally do want to have two Chinese language resources, one zh_CN (PRC) and one zh_TW (Taiwan). As to which should be the zh resource - that is a major geopolitical question and you're on your own (but keep in mind PRC has nukes - and you don't).
Strings with substituted values
So you need to display the message Display ("The operation had the error: " + msg); No, no, no! Because in another language the proper usage could be Display("The error: " + msg + " was caused by the operation"); Every modern run-time library has a construct where you can have a string resource "The operation had the error: {0}" and will then substitute in your msg at {0}. (Some use a syntax other than {0}, {1}, …)
You store these strings in a resource file that can be localized. Then when you need to display the message, you load it from the resources, substitute in the variables, and display it. The combination of this, plus the number & date/time formatters make it easy to build up these strings. And once you get used to them, you'll find it easier than the old approach. (If you are using Visual Studio - download and install ResourceRefactoringTool to make this trivial.)
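Here is a minimal Python sketch of that pattern, using `str.format` and its `{0}` placeholder. The `MESSAGES` table, the `op_error` key, and the locale tag `xx` are hypothetical; the two templates are the English wordings from the example above, with `xx` standing in for a language that needs the other word order.

```python
# Hypothetical localized templates. "" is the general (native language) resource.
MESSAGES = {
    "": {"op_error": "The operation had the error: {0}"},
    "xx": {"op_error": "The error: {0} was caused by the operation"},
}


def render(locale_tag, key, *args):
    """Load the template for the locale (or the general one) and fill it in."""
    table = MESSAGES.get(locale_tag, MESSAGES[""])
    template = table.get(key, MESSAGES[""][key])
    return template.format(*args)
```

The calling code is the same for every language, `render(tag, "op_error", msg)`, and the translator decides where `{0}` goes in the sentence.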
Arabic, Hebrew, and complex scripts.
Arabic & Hebrew are called bi-directional because parts of the text are right to left while other parts are left to right. The text in Arabic/Hebrew is written and read right to left. But when you get to Latin text or numbers, you then jump to the left-most part and read that left to right, then jump back to where that started and read right to left again. And then there is punctuation and other non-letter characters where the rules depend on where they are used.
Here's the bottom line - it is incredibly complex and there is no way you are going to learn how it works unless you take this on as a full-time job. But not to worry, again the run-time libraries for most languages have classes to handle this. The key to this is that the text for a line is stored in the order you read the characters. So in the computer memory it is in left to right order for the order you would read (not display) the characters. In this way everything works normally except when you display the text and work out how to move the caret.
Complex scripts like Indic scripts have a different problem. While they are read left to right, you can have cases where some combinations of letters are placed one above the other, so the string is no wider on the screen when the second letter is added. This tends to require a bit of care with caret movement but nothing more.
We even have cases like this in English where ae is sometimes rendered as a single æ character. (When the human race invented languages, they were not thinking computer friendly.)
Don't Over-Stress it
It seems like a lot but it's actually quite simple. In most cases you need to display text based on the closest resource you have. And you use the number & date/time classes for all locales, including your native one. No matter where you live, most computer users are in another country speaking another language - so localizing well significantly increases your potential market.
And if you're a small company, consider offering a free copy for people who translate your product. When I created Page 2 Stage I offered a free copy (list price $79.95) for translating it - and got 28 translations. I also met some very nice people online in the process. For an enterprise level product, many times a VAR in another country will translate it for you at a reduced rate or even free if they see a good market potential. But in these cases, do the first translation in-house to get the kinks worked out.
One resource I find very useful is the Microsoft Language Portal where you can put in text in English and if that text is in any of the Microsoft products, it will give you the translation Microsoft used for a given language. This can give you a fast high-quality translation for up to 80% of your program in many cases.
Удачи! (Good Luck)
My reading of this article suggests that a benefit of ReCAPTCHA is that it can have humans verify words not recognised in the OCR/digitization of books. It does this by using these words in "Are you human?" tests. So ReCAPTCHA kills two birds with one stone. Great!
But I don't get it. If the word can't be recognised by the digitization process, then what is the input entered by the supposed human being verified against? How does this work?
It shows two words. One of them the computer already knows; the other it doesn't. It assumes that if you get the known one right, you probably read the other one correctly too.
You don't know which of the two is already known, so in theory you can't trick it. Additionally, it will show the same unknown word to multiple people to get independent confirmation before sending it back to the source (newspaper company, book scanning group) as a valid answer.
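The logic described above can be sketched roughly as follows in Python. This is an illustration of the idea only, not reCAPTCHA's actual implementation: the function names and the `min_votes` threshold of 3 are assumptions.

```python
from collections import Counter


def submit(control_answer, typed_control, typed_unknown, votes):
    """Check the known word; if it is right, count the guess for the unknown word.

    Returns True when the user passes the test.
    """
    passed = typed_control.strip().lower() == control_answer.lower()
    if passed:
        votes[typed_unknown.strip().lower()] += 1
    return passed


def accepted_word(votes, min_votes=3):
    """Return the most common guess once enough people agree, else None."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None
```

Each unknown word gets its own `votes` counter, e.g. `votes = Counter()`. Guesses from users who fail the known word are discarded, and the unknown word is only treated as solved after several independent users have typed the same thing.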
But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
http://recaptcha.net/learnmore.html
Quoted from LEARN HOW reCAPTCHA WORKS