Are there any recourses on determining the countability of nouns? Either some way to work it out or a dictionary that records whether a noun is likely to countable or not countable?
I'm not interested in whether the noun can be countable but more is it likely to be countable. for instance rice can go to rices which means it can be countable but in most cases it wont be.
This is a tough one. Many English words can be both (beer, time, glass, language, etc etc) depending on the context/meaning.
Figuring out (un)countability from the word alone or from a regular dictionary is impossible or impractical.
You can try to figure it out from a large text corpus by seeing how the word is used:
if there's a plural form or not
if there's an indefinite article before it or none
if it's used with many/few, much/little, a piece of(?), etc
But many words can function as both nouns and adjectives and that complicates matters. For example in an air pump, air functions as an adjective and an refers to pump, not to air.
Likewise, many words can function as both nouns and verbs and have identical forms. For example, in she pressures him, pressures isn't a plural of pressure.
Also, some uncountable nouns can have an indefinite article before them when they are made more specific, e.g. knowledge vs a good practical knowledge.
You can gather statistics from an analyzed corpus and based on it judge whether or not a word is more likely to be countable or uncountable.
There are several existing English lexica that contain information about count/mass/etc. distinctions, none of which quite agree with each other because they focus on slightly different distinctions and it's a complicated task. Two are ComLex and CUVPlus (which I can't find a download link for at the moment, although you can find it mentioned in many places).
Check out the work by Timothy Baldwin and Francis Bond in 2003 on learning noun countability from corpora. If you have many occurrences of an unfamiliar noun in a corpus, you can do fairly well at the task of figuring out whether this noun can possibly be a count noun, can possibly be a mass noun, etc. however individual instances are still be quite difficult to classify. If you have the sentence "the wug was white" and according to your lexicon "wug" can be either count or mass, there's not enough information in the immediate context to help you classify it.
I'm not sure if there is an 'official' dictionary saying if a noun is likely to be countable or not, but I can come up with two ways you could go about this:
Either assuming that a noun is likely to be uncountable if somebody put it in a 'list of mass nouns' or 'list of uncountable nouns' (you find quite a lot if you google for those phrases, for example this).
Or make a little corpus study and see how often the word is used in which way: searching "rice" in the Corpus of contemporary american English gives us 22265 hits, while the word "rices" is only found 69 times.
It depends on the context and whether the noun may have plural on its own. Different senses of the same word may differ, e.g.:
expectation: the feeling vs. what is being expected
salt: table salt vs. a type of a chemical element
Our API, GlobalNLP, returns the countability of nouns (among other things) in a particular context in this method: https://nlp.linguasys.com/docs/services/53fccbb15cfea30d9c48f8d6/operations/542a6da01c78d80a3cd6692a
Related
I am looking into extracting the meaning of expressions used in everyday speaking. For an instance, it is apparent to a human that the sentence The meal we had at restaurant A tasted like food at my granny's. means that the food was tasty.
How can I extract this meaning using a tool or a technique?
The method I've found so far is to first extract phrases using Stanford CoreNLP POS tagging, and use a Word Sense Induction tool to derive the meaning of the phrase. However, as WSI tools are used to get the meaning of words when they have multiple meanings, I am not sure if it would be the best tool to use.
What would be the best method to extract the meanings? Or is there any tool that can both identify phrases and extract their meanings?
Any help is much appreciated. Thanks in advance.
The problem you pose is a difficult one. You should use tools from Sentiment Analysis to get a gist of the sentence emotional message. There are more sophisticated approaches which attempt at extracting what quality is assigned to what object in the sentence (this you can get from POS-tagged sentences + some hand-crafted Information Extraction rules).
However, you may want to also explore paraphrasing the more formal language to the common one and look for those phrases. For that you would need to a good (exhaustive) dictionary of common expressions to start with (there are sometimes slang dictionaries available - but I am not aware of any for English right now). You could then map the colloquial ones to some more formal ones which are likely to be caught by some embedding space (frequently used in Sentiment Analysis).
I'm new to natural language process so I apologize if my question is unclear. I have read a book or two on the subject and done general research of various libraries to figure out how i should be doing this, but I'm not confident yet that know what to do.
I'm playing with an idea for an application and part of it is trying to find product mentions in unstructured text (e.g. tweets, facebook posts, emails, websites, etc.) in real-time. I wont go into what the products are but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe i could use a string similarity check to catch these.
Product names that are also English words or things would get caught. Like mustang the horse versus mustang the car
Needing to keep a list of alternative names for products (e.g. "coke" for "coco-cola", etc.)
I don't really know where to start with this but any help would be appreciated. I've already looked at NLTK and SciKit and didn't really gleam how to do this from there. If you know of examples or papers that explain this, links would be helpful. I'm not specific to any language at this point. Java preferably but Python and Scala are acceptable.
The answer that you chose is not really answering your question.
The best approach you can take is using Named Entity Recognizer(NER) and POS tagger (grab NNP/NNPS; Proper nouns). The database there might be missing some new brands like Lyft (Uber's rival) but without developing your own prop database, Stanford tagger will solve half of your immediate needs.
If you have time, I would build the dictionary that has every brands name and simply extract it from tweet strings.
http://www.namedevelopment.com/brand-names.html
If you know how to crawl, it's not a hard problem to solve.
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they nevertheless are still independent tasks.
Misspellings
In order to deal with potential misspellings of words, you need to associate these possible misspellings to their canonical (i.e. correct) form.
Phonetic similarity: Many reasons for "misspellings" is opacity in the relationship between the word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results which you would have to address because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. For simple meanings referring to generic entities like this, they can often be considered to be variant spellings in the same way that "common misspellings" are and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. I'm in the throne room just now! to The throne room of the Buckingham Palace is beautiful.
Conclusion
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.
I have some texts in different languages and, potentially, with some typo or other mistake, and I want to retrieve their own vocabulary. I'm not experienced with NLP in general, so maybe I use some word improperly.
With vocabulary I mean a collection of words of a single language in which every word is unique and the inflections for gender, number, or tense are not considered (e.g. think, thinks and thought are are all consider think).
This is the master problem, so let's reduce it to the vocabulary retrieving of one language, English for example, and without mistakes.
I think there are (at least) three different approaches and maybe the solution consists of a combination of them:
search in a database of words stored in relation with each others. So, I could search for thought (considering the verb) and read the associated information that thought is an inflection of think
compute the "base form" (a word without inflections) of a word by processing the inflected form. Maybe it can be done with stemming?
use a service by any API. Yes, I accept also this approach, but I'd prefer to do it locally
For a first approximation, it's not necessary that the algorithm distinguishes between nouns and verbs. For instance, if in the text there were the word thought like both noun and verb, it could be considered already present in the vocabulary at the second match.
We have reduced the problem to retrieve a vocabulary of an English text without mistakes, and without consider the tag of the words.
Any ideas about how to do that? Or just some tips?
Of course, if you have suggestions about this problem also with the others constraints (mistakes and multi-language, not only Indo-European languages), they would be much appreciated.
You need lemmatization - it's similar to your 2nd item, but not exactly (difference).
Try nltk lemmatizer for Python or Standford NLP/Clear NLP for Java. Actually nltk uses WordNet, so it is really combination of 1st and 2nd approaches.
In order to cope with mistakes use spelling correction before lemmatization. Take a look at related questions or Google for appropriate libs.
About part of speech tag - unfortunately, nltk doesn't consider POS tag (and context in general), so you should provide it with the tag that can be found by nltk pos tagging. Again, it is already discussed here (and related/linked questions). I'm not sure about Stanford NLP here - I guess it should consider context, but I was sure that NLTK does so. As I can see from this code snippet, Stanford doesn't use POS tags, while Clear NLP does.
About other languages - google for lemmatization models, since algorithm for most languages (at least from the same family) is almost the same, differences are in training data. Take a look here for example of German; it is a wrapper for several lemmatizers, as I can see.
However, you always can use stemmer at cost of precision, and stemmer is more easily available for different languages.
Topic Word has become an integral part of the rising debate in the present world. Some people perceive that Topic Word (Synonyms) beneficial, while opponents reject this notion by saying that it leads to numerous problems. From my point of view, Topic Word (Synonyms) has more positive impacts than negative around the globe. This essay will further elaborate on both positive and negative effects of this trend and thus will lead to a plausible conclusion.
On the one hand, there is a myriad of arguments in favour of my belief. The topic has a plethora of merits. The most prominent one is that the Topic Word (Synonyms). According to the research conducted by Western Sydney University, more than 70 percentages of the users were in favour of the benefits provided by the Topic Word (Synonyms). Secondly, Advantage of Essay topic. Thus, it can say that Topic Word (Synonyms) plays a vital role in our lives.
On the flip side, critics may point out that one of the most significant disadvantages of the Topic Word (Synonyms) is that due to Demerits relates to the topic. For instance, a survey conducted in the United States reveals that demerit. Consequently, this example explicit shows that it has various negative impacts on our existence.
As a result, after inspection upon further paragraphs, I profoundly believe that its benefits hold more water instead of drawbacks. Topic Word (Synonyms) has become a crucial part of our life. Therefore, efficient use of Topic Word (Synonyms) method should promote; however, excessive and misuse should condemn.
Are there any known ways (above and beyond statistical analysis, but not necessarily excluding it as being part of the solution) to relate sentences or concepts to one another using Natural Language Processing. Thus far I've only worked with NLTK and Stanford-NLP to aid in my project, but I am open to alternative open source solutions.
As an example take the following George Orwell essay (http://orwell.ru/library/essays/wiw/english/e_wiw). Suppose I gave the application the sentence
"What are George Orwell's opinions on writers."
or perhaps
"George Orwell believes writers enjoy writing to express their creativity, to make a point and for their egos."
Might yield lines from the essay like
"The aesthetic motive is very feeble in a lot of writers, but even a pamphleteer or writer of textbooks will have pet words and phrases which appeal to him for non-utilitarian reasons; or he may feel strongly about typography, width of margins, etc."
or
"Serious writers, I should say, are on the whole more vain and self-centered than journalists, though less interested in money."
I understand that this is not easy and I may not achieve much accuracy, but I was hoping for ideas on what already exists and what I could try to start off, or at least get the best results possible based on what is already known and out there.
The simplest way of doing this might be using some distance functions (such as Cosine similarity) between your query sentence and the sentence pool. It's easy to implement. Create a vocabulary from the text collection and each sentence is represented as a vector. You can use TF-IDF to represent values in the vector, and calculate the cosine similarity between sentences, and get the highest scored sentence with respect to your query sentence.
Or you can build index from your corpus and use for example Lucene and let it do the work for you.
You may also consider using LSA (Latent Semantic Analysis) where you can get the similarity between sentences.
From what I understand from your question (and also your comment) is you are more interested in understanding the meaning of individual sentence and then equate with each other in proximity. Statistical approach, in my opinion, is more for "getting a feel" of the sentence than understanding it. In my opinion I would suggest deep parsing approach.
Deep parse the sentence, understand what roles the words play in the sentence, understand what the subject-verb-object model (left to right parsing and such techniques) and then have a vocabulary that helps you categorise the nouns and verbs.
e.g.
"Serious writers, I should say, are on the whole more vain and self-centered than journalists, though less interested in money."
Parsing this sentence, lets you understand the subject of the sentence is "serious writers" (serious being an adjective, writers basically). In the verb form it states "are" (current state) and "interested". Each verb then points to some more vocabulary including adjectives. If you arrange this vocabulary in correct manner (and keep building it) I think you should get somewhere with your problem.
I have an algorithm (that I can't change) that outputs a list of phrases. These phrases are intended to be "topics". However, some of them are meaningless on their own. Take this list:
is the fear
freesat
are more likely to
first sight
an hour of
sue apple
depression and
itunes
How can I filter out those phrases that don't make sense on their own, to leave a list like the following?
freesat
first sight
sue apple
itunes
This will be applied to sets of phrases in many languages, but English is the priority.
It's got to be grammatically acceptable in that it can't rely on other words in the original sentence that it was extracted from; e.g. it can't end in 'and'.
Although this is still an underspecified question, it sounds like you want some kind of grammar checker. I suggest you try applying a part-of-speech tagger to each phrase, compile a list of patterns of POS tags that are acceptable (e.g. anything that ends in a preposition would be unacceptable) and use that to filter your input.
At a high level, it seems that phrases which were only nouns or adjective-noun combos would give much better results.
Examples:
"Blue Shirt"
"Happy People"
"Book"
First of all, this problem can be as complex as you want it to be. For third-party reading/solutions, I came across:
http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits
http://research.microsoft.com/en-us/groups/nlp/
http://sharpnlp.codeplex.com/ (note the part of speech tagger)
If you need 100% accuracy, then I wouldn't write such a tool myself.
However, if the problem domain is limited...
I would start by throwing out conjunctions, prepositions, contractions, state-of-being verbs, etc. This is a fairly short list in English (and looks very similar to the stopwords which #HappyTimeGopher suggested).
After that, you could create a dictionary (as an indexed structure, of course) of all acceptable nouns and adjectives and compare each word in the raw phrases to that. Anything which didn't occur in the dictionary and occur in the correct sequence could be thrown out or ranked lower.
This could be useful if you were given 100 input values and wanted to select the best 5. Finding the values in the dictionary would mean that it's likely the word/phrase was good.
I've auto-generated such a dictionary before by building a raw index from thousands of documents pertaining to a vertical industry. I then spent a few hours with SQL and Excel stripping out problems easily spotted by a human. The resulting list wasn't perfect but it eliminated most of the blatantly dumb/pointless terminology.
As you may have guessed, none of this is foolproof, although checking adjective-to-noun sequence would help somewhat. Consider the case of "Greatest Hits" versus "Car Hits [Wall]".
Proper nouns (e.g. person names) don't work well with the dictionary approach, since it's probably not feasible to build a dictionary of all variations of given/surnames.
To summarize:
use a list of stopwords
generate a dictionary of words, classifying them with a part of speech(s)
run raw phrases through dictionary and stopwords
(optional) rank on how confident you are on a match
if needed, accept phrases which didn't violate known patterns (this would handle many proper nouns)
If you've access to the text these phrases were generated from, it may be easier to just create your own topic tags.
Failing that, I'd probably just remove anything that contained a stop word. See this list, for example:
http://www.ranks.nl/resources/stopwords.html
I wouldn't break out POS tagging or anything stronger for this.
It seems you could create a list that filters out three things:
Prepositions: https://en.wikipedia.org/wiki/List_of_English_prepositions
Conjunctions: https://en.wikipedia.org/wiki/Conjunction_(grammar)
Verb forms of to-be: http://www.englishplus.com/grammar/00000040.htm
If you filter on these things you'd get pretty far. Are you more concerned with false negatives or positives? If false negatives aren't a huge problem, this is how I would approach it.