I'm getting started with AI chatbots and don't know where to begin.
What I've imagined is something like this:
An empty chatbot that doesn't know anything
It learns as the user asks questions; if the bot doesn't know the answer, it asks for it
It records everything it learns and recognizes synonymous questions
Example procedure:
User: what is the color of a ripe mango?
Bot: I don't know [to input answer add !#: at the start]
User: !#:yellow
User: do you know the color of ripe mango?
Bot: yellow
Chatbots, and conversational dialogue systems in general, have to generate natural language, and as you might expect, this is far from trivial. State-of-the-art approaches usually mine large collections of human-human conversations (for example from platforms like Facebook or Twitter, or even movie dialogue; basically anything that is available in large quantities and resembles natural conversation). These conversations are then, for example, labelled as question-answer pairs, possibly using pretrained word embeddings.
This is an active area of research in NLP. One commonly used family of systems is end-to-end sequence-to-sequence (seq2seq) models. However, basic seq2seq models tend to produce repetitive and therefore dull responses. More recent papers try to address this using reinforcement learning, as well as techniques like adversarial networks, to learn to choose responses. Another technique that improves the system is to extend the context of the conversation by letting the model see more prior turns, for example by using a hierarchical model.
If you don't really know where to start, I think you will find all the basics you need in this free chapter of "Speech and Language Processing" by Daniel Jurafsky & James H. Martin (August 2017). Good luck!
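For the much simpler learn-by-asking bot described in the question, a minimal sketch could look like the following. It is only an illustration: the class name, the stop-word list and the "bag of remaining words" trick for matching similar questions are my own assumptions, and real synonym handling would need far more than this.

```python
import re

TEACH_PREFIX = "!#:"  # prefix from the question that marks a taught answer

# Assumed filler words; dropping them lets differently phrased questions match.
STOP_WORDS = {"what", "is", "the", "a", "an", "of", "do", "you", "know"}


def normalize(question):
    """Reduce a question to a sorted bag of its content words."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return " ".join(sorted(w for w in words if w not in STOP_WORDS))


class LearningBot:
    def __init__(self):
        self.answers = {}    # normalized question -> taught answer
        self.pending = None  # last question the bot could not answer

    def reply(self, message):
        if message.startswith(TEACH_PREFIX):
            if self.pending is None:
                return "There is no open question to answer."
            self.answers[self.pending] = message[len(TEACH_PREFIX):].strip()
            self.pending = None
            return "Thanks, I will remember that."
        key = normalize(message)
        if key in self.answers:
            return self.answers[key]
        self.pending = key
        return "I don't know [to input answer add !#: at the start]"


bot = LearningBot()
print(bot.reply("what is the color of a ripe mango?"))
print(bot.reply("!#:yellow"))
print(bot.reply("do you know the color of ripe mango?"))
```

Storing answers under a normalized key is the crudest possible way to treat two phrasings as the same question; it breaks as soon as the wording differs in a content word.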
My current understanding is that it's possible to extract entities from a text document using toolkits such as OpenNLP or Stanford NLP.
However, is there a way to find relationships between these entities?
For example, consider the following text:
"As some of you may know, I spent last week at CERN, the European high-energy physics laboratory where the famous Higgs boson was discovered last July. Every time I go to CERN I feel a deep sense of reverence. Apart from quick visits over the years, I was there for three months in the late 1990s as a visiting scientist, doing work on early Universe physics, trying to figure out how to connect the Universe we see today with what may have happened in its infancy."
Entities: I (author), CERN, Higgs boson
Relationships:
- I "visited" CERN
- CERN "discovered" Higgs boson
Thanks.
Yes, absolutely. This is called relation extraction. Stanford has developed several useful tools for working on this problem.
Here is their website: http://deepdive.stanford.edu/relation_extraction
Here is the GitHub repository: https://github.com/philipperemy/Stanford-OpenIE-Python
In general, here is how the process works:
# Illustrative call; the exact function name depends on the wrapper you use.
results = extract_entity_relations("Barack Obama was born in Hawaii.")
print(results)
# [['Barack Obama', 'was born in', 'Hawaii']]
Note that only triples of the form (subject, predicate, object) are extracted.
You can extract verbs with their dependants using Stanford Parser, for example. E.g., you might get "dependency chains" like
"I :: spent :: at :: CERN".
It is a much tougher task to recognise that "I spent at CERN", "I visited CERN", "CERN hosted my visit", etc. denote the same kind of event. Going into how this can be done is beyond the scope of an SO question, but you can read up on the paraphrase recognition literature (here is one overview paper). There is also a related question on SO.
Once you can cluster similar chains, you'd need to find a way to label them. You could simply choose the verb of the most common chain in a cluster.
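The labelling step can be sketched in a few lines. This assumes the clustering has already been done and that each chain is a tuple of the form (subject, verb, ..., object); both the representation and the function name are made up for illustration.

```python
from collections import Counter


def label_cluster(chains):
    """Return the verb of the most frequent chain pattern in a cluster.

    Assumes each chain is a tuple (subject, verb, ..., object), so the
    pattern is everything between the two arguments.
    """
    patterns = Counter(chain[1:-1] for chain in chains)
    most_common_pattern, _ = patterns.most_common(1)[0]
    return most_common_pattern[0]


cluster = [
    ("I", "visited", "CERN"),
    ("She", "visited", "Fermilab"),
    ("I", "spent", "at", "CERN"),
]
print(label_cluster(cluster))  # visited
```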
If, however, you have a pre-defined set of relation types you want to extract and lots of texts manually annotated for these relations, then the approach could be very different, e.g., using machine learning to learn how to recognize a relation type based on annotated data.
Don't know if you're still interested but CoreNLP added a new annotator called OpenIE (Open Information Extraction), which should accomplish what you're looking for. Check it out: OpenIE
Similar to the Stanford parser, you can also use the Google Language API, where you send a string and get a dependency tree response.
You can test this API first to see if it works well with your corpus: https://cloud.google.com/natural-language/
The outcome here is a subject predicate object (SPO) triplet, where your predicate describes the relationship. You'll need to traverse the dependency graph and write a script to parse out the triplet.
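To give an idea of what such a traversal script involves, here is a rough sketch over a hand-written parse. The Token structure and the label names (ROOT, nsubj, dobj, compound) are assumptions chosen for the example; they are not the response format of the Google API or of the Stanford parser, so you would have to map the real output onto something like this first.

```python
from collections import namedtuple

# Hypothetical parse format: each token knows the index of its head and
# the dependency label linking it to that head. The root points at itself.
Token = namedtuple("Token", "index text head dep")


def phrase(token, tokens):
    """Join a token with its compound modifiers, in sentence order."""
    parts = [t for t in tokens if t.head == token.index and t.dep == "compound"]
    parts.append(token)
    return " ".join(t.text for t in sorted(parts, key=lambda t: t.index))


def extract_triples(tokens):
    """Collect (subject, predicate, object) triples around the root verb."""
    triples = []
    for verb in tokens:
        if verb.dep != "ROOT":
            continue
        children = [t for t in tokens if t.head == verb.index and t is not verb]
        subjects = [t for t in children if t.dep == "nsubj"]
        objects = [t for t in children if t.dep == "dobj"]
        for subj in subjects:
            for obj in objects:
                triples.append((phrase(subj, tokens), verb.text, phrase(obj, tokens)))
    return triples


# Hand-written parse of "CERN discovered the Higgs boson"
parse = [
    Token(0, "CERN", 1, "nsubj"),
    Token(1, "discovered", 1, "ROOT"),
    Token(2, "the", 4, "det"),
    Token(3, "Higgs", 4, "compound"),
    Token(4, "boson", 1, "dobj"),
]
print(extract_triples(parse))  # [('CERN', 'discovered', 'Higgs boson')]
```

This only handles a plain subject-verb-object clause; prepositional objects, passives and embedded clauses each need their own rules.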
There are many ways to do relation extraction. As others have mentioned, you need to know about NER and coreference resolution. Different techniques require different approaches. Nowadays distant supervision is the most common approach, and work in that area has used Freebase as the source of known relations between entities.
This is my first time dabbling in NLP so please excuse my ignorance. I'm looking for a method to extract interests/likes/hobbies from users' social profiles. Here is an example where all the interests/likes/hobbies are in bold:
"I consider myself a pretty diverse character... I'm a professional
wrestler, but I'd take a bullet for Wall•E. I train like a one-man genocide machine in the gym, but I cried at
"Armageddon." I'll head bang to AC/DC, and I'm seriously
considering getting a Legend of Zelda tattoo. I'm 420-friendly. I
like to party it up with the frat crowd one night, hang out with
my Burning Man friends the next, play Halo and World of
Warcraft the next, and jam with friends that aren't any younger than
40 the next. My youngest friend is 16, my oldest friend is 66. I'll
sing karaoke at the bars, and I'm my friends' collective
psychiatrist/shoulder."
The profiles are plain text. There are no meta tags or ids associated with any of it, it's just a paragraph of text.
My naive idea was to take each noun and match it against Freebase to see if it's an activity/artist/movie/book etc. The problem is that although most entities mentioned will be things the user likes, she will also mention things she doesn't like, and I have no means of distinguishing the two.
I have 2 questions:
What sub field of NLP should I be looking at? Some googleable algorithms/techniques/authors would be greatly appreciated.
How hard is this problem?
Thanks!
First, unless using NLP to do this is a particular objective for you, check your problem domain to see if you can avoid it completely.
For instance:
do these profiles have tags (supplied either by the site or by the user)?
what does the site's API make available (assuming that's how you are accessing this data; if you are scraping it, then of course this doesn't apply)? Facebook is a good example: if you read a user's posts, you'll see words like "wrestler", "karaoke", etc., but if you look at the fields exposed via the Graph API, you'll see that these activities nearly always have an associated FB ID.
I am not a specialist in this field, but I can recommend a couple of resources directed to NLP and which are accessible to the non-specialist or novice. The first is a text processing API. This simple web service uses REST and JSON IO. It is free and seems to have a fairly large rate limit.
This API appears to rely heavily on the excellent Natural Language Toolkit (NLTK), a mature, stable Python library that includes modules directed at the problem in your question, e.g., sentiment analysis, tagging and chunk extraction.
Which particular sub-domain is most relevant to solving the question in the OP? I don't know, but I suspect there's a module somewhere in NLTK that does what you need. Finding that module is hopefully just a matter of skimming the API documentation (which is organized by module) and reading the Getting Started section, which contains an excellent survey of NLTK's modules as well as demos for each of them.
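Before reaching for a library, it may help to see how crude a first baseline for the like/dislike problem can be. The sketch below is not NLTK and not a real sentiment model: the entity table stands in for a knowledge-base lookup, the cue words are invented for the example, and matching is plain substring search, so treat every name in it as hypothetical.

```python
import re

# Hypothetical stand-in for a knowledge-base lookup (e.g. Freebase).
KNOWN_ENTITIES = {
    "halo": "game",
    "world of warcraft": "game",
    "ac/dc": "artist",
    "karaoke": "activity",
}

# Assumed cue words that flip a whole sentence to "dislike".
NEGATIVE_CUES = ("hate", "dislike", "can't stand")


def extract_interests(profile):
    """Split a profile into sentences and sort known entities into
    liked/disliked depending on whether the sentence has a negative cue."""
    liked, disliked = [], []
    for sentence in re.split(r"(?<=[.!?])\s+", profile):
        lowered = sentence.lower()
        negative = any(cue in lowered for cue in NEGATIVE_CUES)
        for name, kind in KNOWN_ENTITIES.items():
            if name in lowered:
                (disliked if negative else liked).append((name, kind))
    return liked, disliked


liked, disliked = extract_interests(
    "I'll head bang to AC/DC and play Halo. I hate karaoke."
)
print(liked)     # [('halo', 'game'), ('ac/dc', 'artist')]
print(disliked)  # [('karaoke', 'activity')]
```

A sentence-level cue list will misfire on sarcasm, on negation that applies to only part of a sentence, and on substrings such as "halo" inside another word, which is where proper tagging, chunking and sentiment models come in.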
So, some background: I'm trying to train a ML system to answer questions about events, where both the event descriptions and questions are posed in natural language; the event descriptions are constrained to being single sentences.
So far the main problem with this has been locating a corpus that describes events with a limited enough vocabulary to pose similar questions across all of the events (e.g. if all of the events involved chess, I could reasonably ask 'what piece moved?' and an answer could be drawn from a decent percentage of the event description sentences).
With that in mind, I'm hoping to find a text source that is tightly focused around describing events within some fairly limited topic (more along the lines of chess commentary than a chess forum, for example).
While I've had some luck with a corpus of air-traffic controller dialogs, most of the sentences aren't typical English (they involve a lot of Charlie, Tango, etc.). However, if the format is as I've described, then the actual topic of focus is irrelevant, so long as it has one.
Since I plan on building my own corpus out of this text, no tagging is necessary.
The Reuters corpus has a fairly monotonous content (commercial news; CEO appointments, mergers and acquisitions, major deals, etc); I am more familiar with the multilingual v2 but IIRC the v1 corpus was monolingual English. These will be multiple-sentence news stories, but in keeping with journalistic conventions, you can expect the first sentence to form a reasonable gist of the full story. http://about.reuters.com/researchandstandards/corpus/
You might also look at other TREC and especially MUC competition materials; http://en.wikipedia.org/wiki/Message_Understanding_Conference
Have you considered Usenet? It has a bunch of idiosyncratic conventions of its own but something like rec.food.cooking would seem to broadly fit your description. http://groups.google.com/group/rec.food.cooking/ Have a look at e.g. rec.sports.hockey or rec.games.video.arcade as well. There is also the 20 Newsgroups corpus if you are looking for a canonical, well-known corpus, and it contains at least some sports-related newsgroup material. http://people.csail.mit.edu/jrennie/20Newsgroups/
(Maybe in your country the "general public" is comfortable with baseball. Over here it would be football, you know, the kind where you can't use your hands.)
Let's say I have a bunch of essays (thousands) that I want to tag, categorize, etc. Ideally, I'd like to train something by manually categorizing/tagging a few hundred, and then let the thing loose.
What resources (books, blogs, languages) would you recommend for undertaking such a task? Part of me thinks this would be a good fit for a Bayesian Classifier or even Latent Semantic Analysis, but I'm not really familiar with either other than what I've found from a few ruby gems.
Can something like this be solved by a bayesian classifier? Should I be looking more at semantic analysis/natural language processing? Or, should I just be looking for keyword density and mapping from there?
Any suggestions are appreciated (I don't mind picking up a few books, if that's what's needed)!
Wow, that's a pretty huge topic you are venturing into :)
There are definitely a lot of books and articles you can read about it, but I will try to provide a short introduction. I am not a big expert, but I have worked on some of this stuff.
First you need to decide whether you want to classify essays into predefined topics/categories (a classification problem) or you want the algorithm to decide on different groups on its own (a clustering problem). From your description it appears you are interested in classification.
Now, when doing classification, you first need to create enough training data. You need to have a number of essays that are separated into different groups. For example 5 physics essays, 5 chemistry essays, 5 programming essays and so on. Generally you want as much training data as possible but how much is enough depends on specific algorithms. You also need verification data, which is basically similar to training data but completely separate. This data will be used to judge quality (or performance in math-speak) of your algorithm.
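Keeping the two sets separate is easy to get wrong by hand, so here is a small sketch of one way to do the split. The function name, the 20% default and the fixed seed are arbitrary choices of mine, not a recommendation.

```python
import random


def split_dataset(examples, verification_fraction=0.2, seed=0):
    """Shuffle labelled examples and split them into training and
    verification sets that share no items."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_verification = round(len(shuffled) * verification_fraction)
    cut = len(shuffled) - n_verification
    return shuffled[:cut], shuffled[cut:]


# Hypothetical labelled essays: (text, category)
essays = [("essay %d" % i, "physics" if i % 2 else "chemistry") for i in range(10)]
training, verification = split_dataset(essays)
print(len(training), len(verification))  # 8 2
```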
Finally, the algorithms themselves. The two I am familiar with are Bayes-based and TF-IDF based. For Bayes, I am currently developing something similar for myself in ruby, and I've documented my experiences in my blog. If you are interested, just read this - http://arubyguy.com/2011/03/03/bayes-classification-update/ and if you have any follow up questions I will try to answer.
TF-IDF is short for term frequency-inverse document frequency. Basically, the idea is to find, for any given document, the documents in the training set that are most similar to it, and then figure out its category based on those. For example, if document D is similar to T1, which is physics, T2, which is physics, and T3, which is chemistry, you guess that D is most likely about physics and a little chemistry.
The way it's done is that you give the most importance to rare words and no importance to common words. For instance, 'nuclei' is a rare physics word, but 'work' is a very common, uninteresting word. (That's why it's called inverse document frequency.) If you can work with Java, there is a very good library, Lucene, which provides most of this out of the box. Look for the API for 'similar documents' and look into how it is implemented. Or just google 'TF-IDF' if you want to implement your own.
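If you do want to roll your own, the whole idea fits in a short script. The sketch below is one possible reading of the description above (TF-IDF weights, cosine similarity, majority vote among the k most similar training documents); the smoothed IDF formula, the whitespace tokenizer and the toy training set are my assumptions, and this is not how Lucene implements it.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_idf(documents):
    """Weight each word by how few training documents contain it."""
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    # Smoothed IDF: rarer words get larger weights.
    return {w: math.log((1 + n) / (1 + df)) + 1 for w, df in doc_freq.items()}


def tfidf_vector(text, idf):
    counts = Counter(tokenize(text))
    # Words never seen in training are ignored.
    return {w: c * idf[w] for w, c in counts.items() if w in idf}


def cosine(a, b):
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, training, k=3):
    """Vote among the k training documents most similar to text."""
    idf = build_idf([doc for doc, _ in training])
    query = tfidf_vector(text, idf)
    scored = sorted(
        ((cosine(query, tfidf_vector(doc, idf)), label) for doc, label in training),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy training set, invented for the example.
training = [
    ("the nuclei of heavy atoms decay and emit particles", "physics"),
    ("quantum particles and nuclei obey strange rules", "physics"),
    ("acids and bases react to form salts", "chemistry"),
    ("the reaction of acids with metals releases hydrogen", "chemistry"),
]
print(classify("how do nuclei emit particles", training))  # physics
```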
I've done something similar in the past (though it was for short news articles) using some vector-cluster algorithm. I don't remember it right now, it was what Google used in its infancy.
Using their paper I was able to have a prototype running in PHP in one or two days, then I ported it to Java for speed purposes.
http://en.wikipedia.org/wiki/Vector_space_model
http://www.la2600.org/talks/files/20040102/Vector_Space_Search_Engine_Theory.pdf
Is there a way to classify a particular sentence/paragraph as funny? There are very few pointers as to where one should go further on this.
There is research on this; it's called Computational Humor. It's an interdisciplinary area that takes elements from computational linguistics, psycholinguistics, artificial intelligence, machine learning, etc. Researchers are trying to find out what it is that makes stories or jokes funny (e.g. an unexpected connection, or using a taboo topic in a surprising way) and apply it to text (either to generate a funny story or to measure the 'funniness' of a text).
There are books and articles about it (e.g. by Graeme Ritchie).
Yes, you should use a training corpus to build a predictive model able to detect funny sentences. The same kind of text classification is used for what the literature calls "Sentiment Analysis". Take a look at this article about Sentiment Analysis with LingPipe.
If you can use Java, you can use their library (see license matrix). I found it very useful, though not in exactly the same context as yours.
The only way to pull this off is to get a couple of thousand people (monkeys won't do, sorry) to look through thousands of funny sentences/stories, rate them, and then build some sort of expert system/neural network out of it. Given the problem scope and the subjectivity of it (a thing funny to one person might not be funny - even offensive - to another), I'd say it's an impossible task.
You can use the same technique as spam filters. Instead of spam/non-spam you classify on funny/not-funny. Look into naive bayesian classifiers for more information.
http://en.wikipedia.org/wiki/Naive_Bayesian_classification
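To make the spam-filter analogy concrete, here is a bare-bones naive Bayes classifier with add-one smoothing. The class, the labels and the six training sentences are all invented for the example; a usable model would need thousands of rated sentences, and even then humor is much harder to pin down than spam.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of examples
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # log P(label) + sum of log P(word | label), with add-one smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, made up for illustration.
model = NaiveBayes()
for sentence in ["why did the chicken cross the road",
                 "a horse walks into a bar",
                 "knock knock who is there"]:
    model.train(sentence, "funny")
for sentence in ["the meeting starts at noon",
                 "please submit the report by friday",
                 "the train leaves at noon"]:
    model.train(sentence, "not funny")

print(model.classify("a chicken walks into a bar"))  # funny
print(model.classify("the report starts at noon"))   # not funny
```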
Also, try searching Google Scholar for "Computational Humor" if you're serious about getting into the field. Sentiment Analysis has been mentioned too; see Wikipedia on that.
Of course, this all depends on what your scope and aims are...