Making Brown clusters - nlp

I was playing around with P. Liang's Brown clustering code (link). To give it a try, I induced clusters on the text of "Pride and Prejudice". The clusters I got were not very good. Some examples:
"further agreeable attempting pleasing reference"
"exempt identification deductible Service returns"
"impertinence amazement amusing"
"addresses astonished openly insincere conceit impertinent"
Do I need to perform some preprocessing (like removing stopwords or lemmatizing) before inducing the clusters?

In my experience you get much better clusters if you use larger values of k. You can then use the path prefixes to trim down the number of clusters.
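To illustrate the prefix trick: Liang's implementation writes a paths file where (if I remember the format correctly) each line is a bit-string path, a word, and a count, separated by tabs. A rough sketch of merging clusters by truncating those paths; the file name and format are assumptions:

    from collections import defaultdict

    def merge_by_prefix(paths_file, prefix_len):
        # Group words whose bit-string paths share the same first
        # prefix_len bits, yielding coarser clusters.
        clusters = defaultdict(list)
        with open(paths_file, encoding="utf-8") as f:
            for line in f:
                bits, word, count = line.rstrip("\n").split("\t")
                clusters[bits[:prefix_len]].append(word)
        return clusters

    # Induce with a large k, then inspect coarser groupings:
    for prefix, words in merge_by_prefix("paths", 8).items():
        print(prefix, words[:10])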

ybisk has a good suggestion; when I tried to replicate your experiment, I got better results with a larger number of clusters. Some clusters are difficult to interpret, but there were a few with clear patterns, like this one for family relations:
own dear sister father mother friend sister, uncle sisters aunt
sister's daughter manners mother, former father, brother aunt,
daughters mother's dear, friends spirits cousin daughter, husband
Catherine, brother, sister. own, father's feelings, friend, ladyship
eldest thoughts friend. Catherine's sisters, side, marriage, opinion,
friends, acquaintance, daughters, dearest wife, daughter. vanity
cousin,
Lemmatizing and removing punctuation/capitalization would probably improve the clusters (I'm noticing a lot of duplicate words with trailing commas/periods in my results). I'm not sure removing stopwords would help; they can carry useful contextual information (e.g. day names will appear near words like "on" more often).
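To deal with the trailing punctuation, a minimal normalization pass before running the clusterer might look like this (file names are placeholders, and whether to keep apostrophes is a judgment call):

    import re

    def normalize(text):
        # Lowercase and strip punctuation so that "sister," and "sister."
        # collapse into the same token before cluster induction.
        text = text.lower()
        text = re.sub(r"[^\w\s']", " ", text)
        return " ".join(text.split())

    with open("pride_and_prejudice.txt", encoding="utf-8") as f:
        cleaned = normalize(f.read())
    with open("pride_and_prejudice.clean.txt", "w", encoding="utf-8") as f:
        f.write(cleaned)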

Related

NLP: Resolving coreference pronouns in blocks

I'm planning on executing my NLP pipeline on a corpus of books. Since resolving coreferences is an intensive process, I wouldn't be able to process an entire book, or maybe even an entire chapter, at a time, so I was planning on splitting the text into sizeable chunks and resolving coreferences chunk by chunk.
The issue I need help with is how to resolve pronouns from Group 2 when the noun they're referencing is located in Group 1. Is there a way to seed the dependencies from Group 1 into the following groups? If not, how is this typically handled?
For what it's worth I'm using CoreNLP, but I'm open to other options.
"Group 1": George was born in New York. George is 10.
"Group 2": He loves New York city.
This may be interesting to read: https://stanfordnlp.github.io/CoreNLP/memory-time.html
And here https://stanfordnlp.github.io/CoreNLP/coref.html they mention the maxMentionDistance setting. I remember modifying that at some point when I used CoreNLP for coref resolution. (But that was in Java directly; since you've tagged your question with NLTK, I'm not sure whether this setting is also exposed in the NLTK implementation.)
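If you're calling CoreNLP from Python rather than Java, one way to pass that property is through the CoreNLP server's HTTP interface. A sketch, assuming a server is already running locally (the property name is the one listed on the coref page; I haven't verified it against every coref annotator):

    import json
    import requests

    props = {
        "annotators": "tokenize,ssplit,pos,lemma,ner,parse,coref",
        "coref.maxMentionDistance": "50",  # the setting mentioned above
        "outputFormat": "json",
    }
    text = "George was born in New York. George is 10. He loves New York city."
    resp = requests.post("http://localhost:9000/",
                         params={"properties": json.dumps(props)},
                         data=text.encode("utf-8"))
    corefs = resp.json()["corefs"]  # mention chains, keyed by chain id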
I'd use common sense here and try to stick to conceptual blocks as much as possible, i.e. if chapters are too big, try (a couple of) paragraphs. Perhaps you could 'glue' the mention chains back together in post-processing, but I guess that would not be immediately straightforward.
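One way to make the gluing feasible is to let adjacent chunks overlap, so that a chain straddling a boundary shows up in both chunks and can be matched on its mention strings. A sketch of the splitting (window sizes are arbitrary):

    def overlapping_chunks(paragraphs, size=5, overlap=1):
        # Yield windows of `size` paragraphs, each sharing `overlap`
        # paragraphs with the previous window, so chains that straddle
        # a boundary appear in two windows and can be merged later.
        step = size - overlap
        for start in range(0, len(paragraphs), step):
            yield paragraphs[start:start + size]
            if start + size >= len(paragraphs):
                break

Chains from adjacent windows could then be merged whenever they contain the same mention span inside the shared paragraphs.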

How to determine if a piece of text mentions a product

I'm new to natural language processing, so I apologize if my question is unclear. I have read a book or two on the subject and done general research of various libraries to figure out how I should be doing this, but I'm not yet confident that I know what to do.
I'm playing with an idea for an application, and part of it is trying to find product mentions in unstructured text (e.g. tweets, Facebook posts, emails, websites, etc.) in real time. I won't go into what the products are, but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, I want to query the text to see if it mentions a product and receive some indication (a boolean or a confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe I could use a string similarity check to catch these.
Product names that are also ordinary English words would get caught, like mustang the horse versus Mustang the car.
Needing to keep a list of alternative names for products (e.g. "coke" for "coca-cola", etc.)
I don't really know where to start with this, but any help would be appreciated. I've already looked at NLTK and scikit-learn and didn't really glean how to do this from there. If you know of examples or papers that explain this, links would be helpful. I'm not tied to any language at this point: Java preferably, but Python and Scala are acceptable.
The answer that you chose does not really answer your question.
The best approach you can take is to use a Named Entity Recognizer (NER) and a POS tagger (grab NNP/NNPS, i.e. proper nouns). The underlying lexicon might be missing some new brands like Lyft (Uber's rival), but without developing your own proprietary database, the Stanford tagger will solve half of your immediate needs.
If you have time, I would build a dictionary that has every brand name and simply extract matches from the tweet strings.
http://www.namedevelopment.com/brand-names.html
If you know how to crawl, it's not a hard problem to solve.
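A bare-bones version of the dictionary approach, just to show its shape (the lexicon entries and alias choices here are mine, not a real database):

    import re

    # Hypothetical lexicon mapping surface forms and aliases to canonical names.
    PRODUCTS = {
        "snickers": "Snickers",
        "coke": "Coca-Cola",
        "coca-cola": "Coca-Cola",
        "pepsi": "Pepsi",
        "mustang": "Ford Mustang",
    }

    def mentioned_products(text):
        # Tokenize crudely, drop leading hashtags, look each token up.
        tokens = re.findall(r"[a-z0-9#'-]+", text.lower())
        return {PRODUCTS[t.lstrip("#")] for t in tokens if t.lstrip("#") in PRODUCTS}

    print(mentioned_products("Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri."))
    # -> {'Coca-Cola', 'Pepsi'}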
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they are nevertheless independent tasks.
Misspellings
In order to deal with potential misspellings of words, you need to associate these possible misspellings to their canonical (i.e. correct) form.
Phonetic similarity: A common reason for "misspellings" is opacity in the relationship between a word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). A good way to address this is therefore to index terms phonetically, so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results, which you would then have to address, because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
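A sketch combining the three ideas above, using the third-party jellyfish package for a Metaphone phonetic code and Levenshtein distance (the canonical vocabulary, the exception list, and the distance threshold are all placeholders):

    import jellyfish  # third-party: phonetic codes and edit distance

    CANONICAL = ["innovation", "chic"]   # placeholder vocabulary
    EXCEPTIONS = {"teh": "the"}          # hand-made mappings

    def canonicalize(word, max_dist=2):
        if word in EXCEPTIONS:           # 3. exceptions first
            return EXCEPTIONS[word]
        for c in CANONICAL:              # 1. phonetic match
            if jellyfish.metaphone(word) == jellyfish.metaphone(c):
                return c
        for c in CANONICAL:              # 2. form (edit-distance) match
            if jellyfish.levenshtein_distance(word, c) <= max_dist:
                return c
        return word

    print(canonicalize("innovashun"))  # -> "innovation" via the phonetic match
    print(canonicalize("chick"))       # also -> "chic": the noise problem above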
Word-sense disambiguation
Mustang the car and mustang the horse share the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we humans can't tell which one is meant unless we also know the word's context. One widely used way of modelling this context is distributional lexical semantics: defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and following them in text.
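A toy version of that idea: represent each sense by a handful of typical context words and score an occurrence by its overlap with the surrounding words. Real distributional models learn these context vectors from a corpus; the seed sets here are hand-picked purely for illustration:

    import re

    CAR_CONTEXT = {"drive", "bought", "engine", "dealership", "horsepower", "ford"}
    HORSE_CONTEXT = {"wild", "gallop", "herd", "graze", "mane", "ranch"}

    def mustang_sense(sentence):
        # Score each sense by how many of its typical context words
        # appear in the same sentence as the ambiguous token.
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        car, horse = len(words & CAR_CONTEXT), len(words & HORSE_CONTEXT)
        return "car" if car >= horse else "horse"

    print(mustang_sense("i just bought my dream car. a mustang!!!!"))  # -> car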
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room, etc. For simple meanings referring to generic entities like these, the variants can often be treated as alternate spellings, in the same way that "common misspellings" are, and mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can be brought in to disambiguate the meaning, so that you don't relate e.g. I'm in the throne room just now! to The throne room of Buckingham Palace is beautiful.
Conclusion
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.

How do I get the context of a sentence?

There is a questionnaire that we use to evaluate students' knowledge level (we do this manually, as with a test paper). It consists of the following parts:
Multiple choice
Comprehension questions (e.g.: Is a spider an insect?)
Now I have been given the task of making an expert system that will automate this. We have a proper answer key for each question, but my problem is the "comprehension questions": I need to compare the context of the student's answer to the context of the correct answer.
I already searched for an answer, but it seems like a really big task. What I have found so far is that I can do this through NLP, which is really new to me. Also, if I'm not mistaken, it seems that I have to find a dictionary of all the words that could possibly appear in an answer.
Am I on the right track? If not, please suggest what I should do (what should I study?) or give me some links to the materials I need. Also, should I make my own dictionary? The words that I will be using are in the Filipino language.
Update: Comprehension question
The comprehension section of the questionnaire contains one paragraph explaining a certain scenario. The questions are fairly simple. Here is an example:
Bonnie's uncle told her to pick apples from the tree. Picking up a stick, she poked the fruits so they would fall. In the middle of doing this, a strong gust of wind blew. Due to her fear of the fruits falling on top of her head, she stopped what she was doing. After this, though, she noticed that the wind had caused apples to fall from the tree. These fallen apples were what she brought home to her uncle.
The questions are:
What did Bonnie's uncle tell her to do?
What caused Bonnie to stop picking apples from the tree?
Is Bonnie a good fruit picker? Please explain your answer.
The possible answers that the answer key states are:
For number 1:
1.1 Bonnie's uncle told her to pick apples from the tree
1.2 Get apples
For number 2:
2.1 A strong gust of wind blew
2.2 She might get hit in the head by the fruits
For number 3:
3.1 No, because the apples she got were already on the ground
3.2 No, because the wind was what caused the fruits to fall
3.3 Yes, because it is difficult to pick fruits when it's windy.
3.4 Yes, because at least she tried
Those are the answers that were given to me. The system's job is to compare the context of the student's answer to the context of the right answers so that it can successfully grade the student's answer.
One simplistic way of doing this that I can think of (off the top of my head) is to use a similarity metric like cosine or Jaccard to identify whether certain keywords appear in both a test answer and the known correct answer.
Extracting these keywords automatically could be done with part-of-speech tagging using NLP. For example, you could extract all nouns (and possibly verbs). Then, representing each answer as a vector of keywords, you could compare the test vector with the known correct vector (see the sketch after the example below).
For example, in the second question, the vector for the two possible answers could be
gust, wind, blew
hit, head, fruits
An answer like "she picked up a stick" (keywords: picked, stick) would get a very low score compared to something like "afraid of fruit falling on her head" (keywords: fruit, falling, head).
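A sketch of the whole loop with NLTK (English-only, and it assumes the punkt tokenizer and tagger models have been downloaded):

    from nltk import pos_tag, word_tokenize

    def keywords(answer):
        # Keep nouns and verbs as the answer's keyword set
        # (stemming would also help, so "fruit" matches "fruits").
        tagged = pos_tag(word_tokenize(answer.lower()))
        return {word for word, tag in tagged if tag.startswith(("NN", "VB"))}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    keys = ["A strong gust of wind blew",
            "She might get hit in the head by the fruits"]

    def grade(student_answer):
        # Score against every acceptable answer and keep the best match.
        s = keywords(student_answer)
        return max(jaccard(s, keywords(k)) for k in keys)

    print(grade("afraid of fruit falling on her head"))  # small but nonzero
    print(grade("she picked up a stick"))                # 0.0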
Notes:
This can detect only wildly wrong answers. Wrong answers containing the right keywords would not be detected by this technique. :)
I'm not sure about non-English sentences. If yours are not in English, you might want to take every word in the answer as a keyword (after removing stopwords). This question might help as well.

NLP: alternate spelling identification

Help by editing my question title and tags is greatly appreciated!
Sometimes one participant in my corpus of "conversations" will refer to another participant using a nickname, usually an abbreviation or misspelling; hereafter I'll just say "nicknames". Let's say I'm willing to manually tell my software whether or not I think various possible nicknames are in fact nicknames, but I want the software to come up with a list of possible matches between the handles that identify people and the potential nicknames. How would I go about doing that?
Background on me and my corpus: I have no experience doing natural language processing, but I'm a competent data analyst with R. My data is produced by 70 teams, each forecasting the likelihood of 100 distinct events occurring some time in the future. The result is that I have 70 x 100 = 7000 text files containing the stream of forecasts participants make and the comments they include with their forecasts. I'll paste a very short snippet of one of these text files below; this one had to do with whether the Malian government would enter talks with the MNLA:
02/12/2013 20:10: past_returns answered Yes: (50%)
I hadn't done a lot of research when I put in my previous
placeholder... I'm bumping up a lot due to DougL's forecast
02/12/2013 19:31: DougL answered Yes: (60%)
Weak President Traore wants talks if MNLA drops territorial claims.
Mali's military may not want talks. France wants talks. MNLA sugggests
it just needs autonomy. But in 7 weeks?
02/12/2013 10:59: past_returns answered No: (75%)
placeholder forecast...
http://www.irinnews.org/Report/97456/What-s-the-way-forward-for-Mali
My initial thoughts: Obviously I can start by providing the names I'm looking to match things up with; in the above example they would be past_returns and DougL (though there is no use of nicknames in the above). I wouldn't think it'd be that hard to get a computer to guess at minor misspellings (though I wouldn't personally know where to start). I can imagine that other tricks could be used, like assuming that a string is more likely to be a nickname if it is used much more by one team than by other teams. A nickname is more likely to refer to someone who spoke recently than to someone who spoke long ago, or not at all, on this question. And nicknames should be used in sentences in a manner similar to the way the full name/screen name is typically used in the corpus. But I'm interested to hear about simple approaches, as well as ones that consider more sophisticated techniques.
This could get about as complicated as you want to make it. From the semi-linguistic side of things, research topics would include Levenshtein distance (for detecting minor misspellings of known names/nicknames) and Named Entity Recognition (for the task of detecting names/nicknames in the first place). NER is worth reading about, but existing systems might not help you much in your domain of forum handles and nicknames.
The first rough idea that comes to mind is that you could run a tokenized version of your corpus against an English dictionary (perhaps a dataset compiled from Wiktionary or something like WordNet) to find words that are candidates for names, then filter those through some heuristics (do they start with the same letters as known full names? Do they have a low Levenshtein distance from known names? Are they used more than once?).
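A first-pass sketch of those heuristics; difflib's ratio stands in for a proper Levenshtein distance, and a real version would first discard tokens found in an English word list:

    import re
    from difflib import SequenceMatcher

    HANDLES = ["past_returns", "DougL"]  # the known screen names

    def nickname_candidates(text, threshold=0.6):
        # Flag tokens that resemble a known handle but aren't an exact match;
        # a real version would first drop ordinary dictionary words.
        hits = []
        for tok in set(re.findall(r"[A-Za-z_][A-Za-z0-9_]+", text)):
            for handle in HANDLES:
                score = SequenceMatcher(None, tok.lower(), handle.lower()).ratio()
                if threshold <= score < 1.0:
                    hits.append((tok, handle, round(score, 2)))
        return hits

    print(nickname_candidates("bumping up a lot due to Dougie's forecast"))
    # -> [('Dougie', 'DougL', 0.73)]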
You could also try some clustering or supervised ML algorithms against the non-word tokens. That might reveal some non-"word" tokens that often occur in the same threads as a given username; again, heuristics could help rule out some false positives.
Good luck; sounds like a fun problem - hope I mentioned at least one thing you hadn't already thought of.

Fuzzy sentence search algorithms

Suppose I have a set of phrases (about 10,000) of average length (7-20 words) in which I want to find some given phrase. The phrase I am looking for could have some errors: for example, it might be missing one or two words, have some words misplaced, or contain some random extra words. For example, my database contains "As I was riding my red bike, I saw Christine", and I want it to match "As I was riding my blue bike, saw Christine" or "I was riding my bike, I saw Christine and Marion". What would be a good approach to this problem? I know about Levenshtein distance, and I also suppose that this problem may have no easy, good solution.
A good text search engine will provide capabilities such as you describe, fsh. A typical approach would be to create a query that matches if any of the words occurs, and to order the results using a weight based on the number of terms occurring in proximity to each other, weighted inversely to their probability of occurring, since uncommon words are less likely to co-occur by chance. There's a whole theory of this sort of thing called information retrieval, but maybe you know about that. Furthermore, you'd want to make sure that word-level fuzziness gets accounted for by normalizing case, punctuation and the like, applying some basic linguistic transformations (stemming), and in some cases introducing a dictionary of synonyms, especially when there is domain knowledge available to condition it.
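To make the term-weighting idea concrete, here is a minimal bag-of-words scorer with inverse-document-frequency weights; a real engine adds proximity, stemming, and fuzzy matching on top of this:

    import math
    import re
    from collections import Counter

    def tokens(s):
        return re.findall(r"[a-z']+", s.lower())

    phrases = ["As I was riding my red bike, I saw Christine",
               "The weather in Paris was lovely"]  # ... up to ~10,000 phrases

    # Uncommon words get higher weight, since they are less likely
    # to co-occur with the query by chance.
    df = Counter(t for p in phrases for t in set(tokens(p)))
    idf = {t: math.log(len(phrases) / df[t]) for t in df}

    def score(query, phrase):
        shared = set(tokens(query)) & set(tokens(phrase))
        return sum(idf.get(t, 0.0) for t in shared)

    query = "As I was riding my blue bike, saw Christine"
    best = max(phrases, key=lambda p: score(query, p))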
If you're interested in messing around with this stuff, try an open-source search engine. This article by Vik gives a reasonable survey from the perspective of 2009, and this one by Middleton and Baeza-Yates gives a good, detailed introduction to the topic.
