Is there any open source/free software available that gives you semantically related keywords for a given word? For example, for the word dog it should give keywords like: animal, mammal, ...
Or for the word France it should give keywords like: country, Europe, ...
Basically, a set of keywords related to the given word.
Or, if there isn't, does anybody have an idea of how this could be implemented and how complex it would be?
Best regards
Wordnet might be what you need. Wordnet groups English words into sets of synonyms, provides general definitions, and records the various semantic relations between these groups.
There are tons of projects out there using Wordnet; here is a list:
http://wordnet.princeton.edu/wordnet/related-projects/
Look at this one; you might find it particularly useful (http://kylescholz.com).
You can see the live demo here:
http://kylescholz.com/projects/wordnet/?text=dog
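If you end up using Wordnet programmatically, NLTK ships an interface to it. A minimal sketch, assuming the WordNet data has been downloaded via nltk.download("wordnet"):

```python
from nltk.corpus import wordnet as wn

# Look at the first two noun senses of "dog" and the words related to each sense
for synset in wn.synsets("dog", pos=wn.NOUN)[:2]:
    print(synset.name(), "-", synset.definition())
    print("  synonyms:", [lemma.name() for lemma in synset.lemmas()])
    print("  direct hypernyms:", [h.name() for h in synset.hypernyms()])
```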
I hope this helps.
Yes. A company named Saplo in Sweden specializes in this. I believe you can use their API for this, and if you ask nicely you might be able to use it for free (if it's not for commercial purposes, of course).
Saplo
Yes. What you are looking for is something similar to the vector space model for searching, and it is an efficient way of doing this. There are some open source libraries available for latent semantic indexing/searching (a special case of the vector space model). Apache Lucene is one of the most popular ones. Or try something from Google Code.
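If you want to see the idea in code, here is a rough sketch of latent semantic indexing with scikit-learn (TF-IDF vectors reduced with truncated SVD and compared by cosine similarity); the mini corpus and the number of components are just placeholders:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the dog is a loyal animal and a popular pet",
    "a dog or a cat can be a good pet",
    "france is a country in western europe",
    "paris is the capital of france",
]

tfidf = TfidfVectorizer().fit_transform(docs)          # term-document matrix
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)  # latent "topics"

# Similarity of the first document to all documents; the other dog/pet document
# should come out closer than the France documents.
print(cosine_similarity(lsa[:1], lsa))
```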
If you are looking for online resources, there are several to consider (at least in 2017; the OP is dated 2010).
Semantic Link (http://www.semantic-link.com): The creator of Semantic Link offers an interface to the results of a computation of a metric called "mutual information" on pairs of words over all of English Wikipedia. Only words occurring more than 1000 times in Wikipedia are available.
"Dog" gets you, for example: purebred, breeds, canine, pet, puppies.
It seems, however, you are really looking for an online tool that gives hyponyms and hypernyms. From the Wikipedia page for "Hyponymy and hypernymy":
In linguistics, a hyponym (from Greek hupó, "under" and ónoma, "name") is a word or phrase whose semantic field is included within that of another word, its hyperonym or hypernym (from Greek hupér, "over" and ónoma, "name"). In simpler terms, a hyponym shares a type-of relationship with its hypernym. For example, pigeon, crow, eagle and seagull are all hyponyms of bird (their hyperonym); which, in turn, is a hyponym of animal.
WordNet(https://wordnet.princeton.edu) has this information and has an online search tool. With this tool, if you enter a word, you'll get one or more entries with an "S" beside them. If you click the "S", you can browse the "Synset (semantic) relations" of the word with that meaning or usage and this includes direct hyper- and hyponyms. It's incredibly rich!
For example: "dog" (as in "domestic dog") --> "canine" --> "carnivore" --> "placental mammal" --> "vertebrate" --> "chordate" --> etc. or "dog" --> "domestic animal" --> "animal" --> "organism" --> "living thing" -->
There is also Wordnik, which lists hypernyms and reverse-dictionary words (words with the given word in their definition). Hypernyms for "France" include "european country/nation", and the reverse dictionary includes regions and cities in France, names of certain rulers, etc. "Dog" gets the hypernym "domesticated animal" (and others).
What is the etymology for JJ tag denoting POS for adjectives? I am unable to find any references online. There are several resources listing all the tags, but none describing the reason.
It may be impossible to get an official answer. JJ has been used since the Brown corpus, and appears without comment in publications going back to at least 1981 (just after publication of the 1979 Form C "revised and amplified" edition).
Per this record of the corpus, the main publication by the authors accompanying Form C is the manual, available here. It contains the list, with plenty of explanations of how words are classified and none for how the tags were made.
After reviewing Role of the Brown Corpus in the History of Corpus Linguistics (Olga Kholkovskaia, 2017), I agree that the authors generally focused on the massive compilation and tagging method over commentary. The 1967 classic "Computational analysis of present-day American English" is mostly frequency tables, with no instance of "adjective" or JJ in it. Thus, I found no publications where lead authors Francis and Kucera discuss their choice of tags, and both passed away in the 2000s.
This limits us to speculation. The authors had 82 tags that needed to be short, memorable (the tagging process was partly manual), and allow various modifiers to be appended without creating confusion. Vowels are fairly useless for this, with every part of speech in the table containing at least one. Verb (VB) and noun (NN) go by first-and-last letters, while others may use initialisms (coordinating conjunction CC, foreign word FW), syllable initialisms (modal MD, predeterminer PDT), first letters (possessive POS), arbitrary associations (interjections UH).
Adjective's JJ is odd: it is neither an initialism nor a first-and-last pairing, and it does not make intuitive sense the way UH, possessive P$, or plural S do - but it is hardly the strangest tag choice, even in the reduced Penn Treebank table. Perhaps someone wanted to match NN's style and doubled the first relatively uncommon letter in "adjective". Any more detailed answer may only be possible by finding unpublished notes or still-living colleagues.
I'm new to natural language processing, so I apologize if my question is unclear. I have read a book or two on the subject and done general research of various libraries to figure out how I should be doing this, but I'm not yet confident that I know what to do.
I'm playing with an idea for an application, and part of it is trying to find product mentions in unstructured text (e.g. tweets, Facebook posts, emails, websites, etc.) in real time. I won't go into what the products are, but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe I could use a string similarity check to catch these.
Product names that are also ordinary English words or other named things would get caught, like mustang the horse versus Mustang the car.
Needing to keep a list of alternative names for products (e.g. "coke" for "coca-cola", etc.)
I don't really know where to start with this, but any help would be appreciated. I've already looked at NLTK and scikit-learn and didn't really glean how to do this from there. If you know of examples or papers that explain this, links would be helpful. I'm not tied to any particular language at this point: Java preferably, but Python and Scala are acceptable.
The answer that you chose does not really answer your question.
The best approach you can take is to use a Named Entity Recognizer (NER) and a POS tagger (grab NNP/NNPS, i.e. proper nouns). The database there might be missing some new brands like Lyft (Uber's rival), but without developing your own proprietary database, the Stanford tagger will solve half of your immediate needs.
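As a rough illustration of the POS-tagging route, here is a minimal sketch using NLTK's default tagger to pull proper-noun candidates out of one of your example sentences; the Stanford tools mentioned above would slot in similarly, and the candidates would still need to be checked against your product list:

```python
import nltk  # assumes the punkt and averaged_perceptron_tagger data have been downloaded

def proper_noun_candidates(text):
    """Return tokens tagged as proper nouns (NNP/NNPS) - rough product-mention candidates."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word for word, tag in tagged if tag in ("NNP", "NNPS")]

print(proper_noun_candidates("Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri."))
# e.g. ['Coke', 'Pepsi', 'Fri.'] - still has to be filtered against a known-product list
```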
If you have time, I would build a dictionary that has every brand name and simply extract those from the tweet strings.
http://www.namedevelopment.com/brand-names.html
If you know how to crawl, it's not a hard problem to solve.
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they nevertheless are still independent tasks.
Misspellings
In order to deal with potential misspellings of words, you need to associate these possible misspellings to their canonical (i.e. correct) form.
Phonetic similarity: A major reason for "misspellings" is opacity in the relationship between a word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results which you would have to address because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
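To make the phonetic-indexing and form-similarity ideas concrete, here is a toy sketch; the tiny Soundex-style code and the word list are illustrative assumptions only, and a real system would prefer a stronger phonetic algorithm such as Metaphone (which, unlike plain Soundex, maps innovashun and innovation to the same code):

```python
from difflib import SequenceMatcher

def soundex(word):
    """Very small Soundex-style code: first letter plus digits for the following consonants."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out, prev = word[0], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

def best_match(token, vocabulary):
    """Prefer words with the same phonetic code; fall back to plain string similarity."""
    phonetic = [w for w in vocabulary if soundex(w) == soundex(token)]
    candidates = phonetic or vocabulary
    return max(candidates, key=lambda w: SequenceMatcher(None, token, w).ratio())

print(best_match("definately", ["definitely", "define", "defiantly"]))  # -> definitely
```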
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
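A minimal sketch of the distributional idea, using nothing more than co-occurrence counts and cosine similarity; the toy sentences and the window size are made up for illustration, and a real system would use a large corpus or pretrained embeddings:

```python
from collections import Counter
from math import sqrt

sentences = [
    "i drive my mustang to work every day".split(),
    "the ford mustang is a fast car".split(),
    "the wild mustang is a free roaming horse".split(),
    "my car needs new tires and an oil change".split(),
    "the horse galloped across the open field".split(),
]

def context_vector(target, window=2):
    """Count the words appearing within `window` positions of the target word."""
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            if word == target:
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i:
                        counts[sent[j]] += 1
    return counts

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

mustang = context_vector("mustang")
# On this toy corpus "mustang" shares more context with "car" than with "horse"
print(cosine(mustang, context_vector("car")), cosine(mustang, context_vector("horse")))
```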
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. For simple meanings referring to generic entities like this, they can often be considered to be variant spellings in the same way that "common misspellings" are and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. I'm in the throne room just now! to The throne room of the Buckingham Palace is beautiful.
Conclusion
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.
I am trying named entity recognition for the first time. I'm looking for features that will pick out English names. I am using the methods outlined in the Coursera NLP course (week three) and the NLTK book. In other words: I am defining features, identifying features of words, and then running those words/features through a classifier that I train on labeled data.
What features are used to pick out English names?
I can imagine that you'd look for two capitalized words in a row, or a capitalized word, then an initial, then another capitalized word (e.g. John Smith or James P. Smith).
But what other features are used for NER?
Some common features (a small feature-extractor sketch follows the list):
Word lists for common names (John, Adam, etc.)
Casing
Contains symbols or numeric characters (names generally don't)
Person prefixes (Mr., Mrs., etc.)
Person postfixes (Jr., Sr., etc.)
Single-letter abbreviations (i.e., (J.) Smith)
Analysis of surrounding words (you may find some words have a high probability of appearing near names)
Named entities previously recognized (often it is easy to identify NEs in some parts of the corpus based on context, but very hard in other parts; if previously identified, this is an excellent hint towards NER)
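Here is a rough sketch of an NLTK-style feature extractor combining several of the features above; the name lists and the exact feature set are illustrative assumptions rather than a recommended final design:

```python
COMMON_FIRST_NAMES = {"john", "adam", "james", "mary"}  # toy word list
PREFIXES = {"mr.", "mrs.", "ms.", "dr."}
SUFFIXES = {"jr.", "sr.", "ii", "iii"}

def name_features(tokens, i):
    """Feature dict for token i, in the shape NLTK's classifiers expect."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<START>"
    next_word = tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>"
    return {
        "is_capitalized": word[0].isupper(),
        "in_name_list": word.lower() in COMMON_FIRST_NAMES,
        "has_digit_or_symbol": any(not c.isalpha() for c in word),
        "is_initial": len(word) == 2 and word[0].isupper() and word[1] == ".",
        "prev_is_prefix": prev_word in PREFIXES,
        "next_is_suffix": next_word in SUFFIXES,
        "prev_word": prev_word,
        "next_word": next_word,
    }

tokens = "Yesterday Mr. John Smith Jr. visited Boston".split()
print(name_features(tokens, 2))  # features for "John"
```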
Depending on what language you are working with, there may be more language-specific features as well. Frankly, you can turn up a wealth of information with a simple Google query; I'm really not sure why you haven't looked there. Some starting points, however:
Google
A survey of named entity recognition and classification
Named entity recognition without gazetteers
I did something similar back in school using machine learning. I suppose that you will use a supervised algorithm and that you will classify every single word independently, not words in combination. In that case I would choose some features for the word itself like the ones you mentioned (whether the word begins with a capital letter, whether the word is an abbreviation), but I would add some more features, such as whether the previous or the next words also start with a capital letter, or whether they are abbreviations. This way you can add some context and overcome the problems related to your basic independence assumption.
If you want, have a look here. In the machine learning section you can find some more information and examples (the problem is slightly different, but the method should be similar).
Whatever features you choose, it is important that you use some measure to evaluate their relevance and possibly reduce them to the useful ones to avoid over-fitting. One of the measures you can use is the gain ratio, but there are many more. Here you can find some basic information about feature extraction.
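Gain ratio itself isn't shipped with scikit-learn, but mutual information is a closely related relevance measure; here is a rough sketch of scoring features with it (the tiny feature matrix is invented purely for illustration):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Toy data: rows are tokens, columns are binary features
# [is_capitalized, in_name_list, has_digit]; label 1 = part of a name.
X = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 0], [0, 0, 0], [1, 0, 1]])
y = np.array([1, 1, 0, 1, 0, 0])

scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
for name, score in zip(["is_capitalized", "in_name_list", "has_digit"], scores):
    print(name, round(score, 3))  # higher = more informative about the label
```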
Hope it helps!
Are there any resources on determining the countability of nouns? Either some way to work it out, or a dictionary that records whether a noun is likely to be countable or not?
I'm not interested in whether the noun can be countable, but rather whether it is likely to be countable. For instance, rice can become rices, which means it can be countable, but in most cases it won't be.
This is a tough one. Many English words can be both (beer, time, glass, language, etc etc) depending on the context/meaning.
Figuring out (un)countability from the word alone or from a regular dictionary is impossible or impractical.
You can try to figure it out from a large text corpus by seeing how the word is used:
if there's a plural form or not
if there's an indefinite article before it or none
if it's used with many/few, much/little, a piece of(?), etc
But many words can function as both nouns and adjectives, and that complicates matters. For example, in an air pump, air functions as an adjective, and the article an attaches to pump, not to air.
Likewise, many words can function as both nouns and verbs and have identical forms. For example, in she pressures him, pressures isn't a plural of pressure.
Also, some uncountable nouns can have an indefinite article before them when they are made more specific, e.g. knowledge vs a good practical knowledge.
You can gather statistics from an analyzed corpus and based on it judge whether or not a word is more likely to be countable or uncountable.
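A rough sketch of that corpus-statistics idea using NLTK's tagged Brown corpus; comparing singular (NN) and plural (NNS) tag counts is just one simple heuristic, and the naive pluralization is an assumption for illustration:

```python
from collections import Counter
from nltk.corpus import brown  # assumes the Brown corpus data has been downloaded

def countability_stats(lemma):
    """Compare singular vs. plural noun-tag counts for a word and its naive plural form."""
    forms = {lemma, lemma + "s"}  # naive pluralization, good enough for a sketch
    counts = Counter()
    for word, tag in brown.tagged_words():
        if word.lower() in forms and tag.startswith("NN"):
            counts["plural" if tag.startswith("NNS") else "singular"] += 1
    return counts

for noun in ("rice", "dog"):
    stats = countability_stats(noun)
    total = sum(stats.values()) or 1
    print(noun, dict(stats), "plural share:", round(stats["plural"] / total, 3))
```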
There are several existing English lexica that contain information about count/mass/etc. distinctions, none of which quite agree with each other because they focus on slightly different distinctions and it's a complicated task. Two are ComLex and CUVPlus (which I can't find a download link for at the moment, although you can find it mentioned in many places).
Check out the work by Timothy Baldwin and Francis Bond from 2003 on learning noun countability from corpora. If you have many occurrences of an unfamiliar noun in a corpus, you can do fairly well at figuring out whether this noun can possibly be a count noun, can possibly be a mass noun, etc.; however, individual instances can still be quite difficult to classify. If you have the sentence "the wug was white" and, according to your lexicon, "wug" can be either count or mass, there's not enough information in the immediate context to help you classify it.
I'm not sure if there is an 'official' dictionary saying if a noun is likely to be countable or not, but I can come up with two ways you could go about this:
Either assume that a noun is likely to be uncountable if somebody put it in a 'list of mass nouns' or 'list of uncountable nouns' (you find quite a lot if you google for those phrases, for example this).
Or do a little corpus study and see how often the word is used in which way: searching for "rice" in the Corpus of Contemporary American English gives 22,265 hits, while the word "rices" is only found 69 times.
It depends on the context and on whether the noun can take a plural on its own. Different senses of the same word may differ, e.g.:
expectation: the feeling vs. what is being expected
salt: table salt vs. a type of chemical compound
Our API, GlobalNLP, returns the countability of nouns (among other things) in a particular context in this method: https://nlp.linguasys.com/docs/services/53fccbb15cfea30d9c48f8d6/operations/542a6da01c78d80a3cd6692a
We have a client who is looking for a means to import and categorize a large amount of textual data. This data has to be categorized, and it's been suggested that the easiest way to do this would be to look at the description field and try to match the words held there to see if a category can be derived for that particular record.
It was thought the best way to do this would be to match the words to keywords held against each category and, if that was unsuccessful, to use some kind of synonym lookup to see if this could be used instead. So for example, if a particular record had the word "automobile" in it, then a synonym lookup could match that word to the word "car", which would be held against the category "vehicle".
Does anyone know of a web service or other means of looking up a dictionary to find synonyms for a particular word? The project manager has suggested buying a Google Enterprise Search license for this, but from what I can make out, that doesn't offer what these guys are looking for.
Any other suggestions for getting the client what they are looking for would be gratefully accepted.
Thanks! I'll look into Wordnet.
Do you know of any other types of textual classification software products out there? I see there's some discussion of using Bayesian algorithms for this, but I can't see any real-world examples of it.
The first thing that comes to mind is Wordnet. Wordnet is a human-generated database of words and related words, including synonyms. The Wikipedia Wordnet entry lists several interfaces to Wordnet. I believe some of them are web services.
You can also roll your own. Manning and Schutze's chapter 5 (free PDF) shows ways to do this.
Having said that, are you solving the right problem? How do you build the category list?
Is it a hierarchy? a tag cloud? See Clay Shirky's Ontology is Overrated for a critique of hierarchical categories. I believe that synonyms are less important if you base your classification on sets of words (Naive Bayes, for example) rather than on single words.
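For instance, here is a minimal Naive Bayes sketch with scikit-learn; the categories and training snippets are invented placeholders, and you would substitute the client's labelled descriptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set: description -> category
train_texts = [
    "four door sedan car with leather seats",
    "pickup truck with towing package",
    "laptop computer with 16gb of memory",
    "desktop computer tower and monitor",
]
train_labels = ["vehicle", "vehicle", "computing", "computing"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["used car in good condition"]))  # -> ['vehicle']
```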
You should look at using WordNet. You can visit their website http://wordnet.princeton.edu/ to get more information, but there are libraries available for integrating against them in lots of languages.
Go to their online tool to see the use of it in action here: http://wordnetweb.princeton.edu/perl/webwn. If you look up a word, then click on "S" next to each definition, you'll get a list of semantically related words to that definition.
I also think you should check out software that will allow you to perform "document clustering." Here is an example: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview. That should help you bootstrap the category creation process.
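For a quick feel of the clustering idea in code, here is a minimal scikit-learn sketch (rather than CLUTO itself); the documents and the choice of two clusters are placeholders:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "fast car with a powerful engine",
    "dealership selling a used car",
    "recipe for baking a chocolate cake",
    "baking bread and a cake at home",
]

X = TfidfVectorizer().fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The two car documents and the two baking documents should end up in separate clusters,
# e.g. labels like [0, 0, 1, 1]
print(kmeans.labels_)
```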
I think this will help get you a long way toward what you want!
For text classification you can take a look at Apache Mahout.