How can you detect / find out the meaning (the extension) of an acronym using NLP / Information Extraction (IE) methods?
We want to detect in free text if a word or it's acronym is used and map it to the same entity / token.
Most papers available online are about medical acronyms and they do not provide a library for acomplish this task.
Any ideas?
Reading your question and the comments I understand that you want to create a mapping from an acronym to its extension.
Assuming you have a collection of textual documents where both the acronym and its expansion occur you can apply an algorithm to extract (acronym,extension) pairs.
A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text by A.S Schwartz and M.A. Hearst, does exactly this by looking at patterns. The Java implementation is available here.
I applied this algorithm to the English Wikipedia, you can see the results here. I also applied it to a collection of Portuguese new articles, results are here.
Wordnet contains acronym for tons of words which you can use in variety of programming languages: http://wordnet.princeton.edu/wordnet/
Or get from Freebase. See this: What is one way to find related names using the web?
Related
I am looking into extracting the meaning of expressions used in everyday speaking. For an instance, it is apparent to a human that the sentence The meal we had at restaurant A tasted like food at my granny's. means that the food was tasty.
How can I extract this meaning using a tool or a technique?
The method I've found so far is to first extract phrases using Stanford CoreNLP POS tagging, and use a Word Sense Induction tool to derive the meaning of the phrase. However, as WSI tools are used to get the meaning of words when they have multiple meanings, I am not sure if it would be the best tool to use.
What would be the best method to extract the meanings? Or is there any tool that can both identify phrases and extract their meanings?
Any help is much appreciated. Thanks in advance.
The problem you pose is a difficult one. You should use tools from Sentiment Analysis to get a gist of the sentence emotional message. There are more sophisticated approaches which attempt at extracting what quality is assigned to what object in the sentence (this you can get from POS-tagged sentences + some hand-crafted Information Extraction rules).
However, you may want to also explore paraphrasing the more formal language to the common one and look for those phrases. For that you would need to a good (exhaustive) dictionary of common expressions to start with (there are sometimes slang dictionaries available - but I am not aware of any for English right now). You could then map the colloquial ones to some more formal ones which are likely to be caught by some embedding space (frequently used in Sentiment Analysis).
I'm recently interested in NLP, and would like to build up search engine for product recommendation. (Actually I'm always wondering about how search engine for Google/Amazon is built up)
Take Amazon product as example, where I could access all "word" information about one product:
Product_Name Description ReviewText
"XXX brand" "Pain relief" "This is super effective"
By applying nltk and gensim packages I could easily compare similarity of different products and make recommendations.
But here's another question I feel very vague about:
How to build a search engine for such products?
For example, if I feel pain and would like to search for medicine online, I'd like to type-in "pain relief" or "pain", whose searching results should include "XXX brand".
So this sounds more like keyword extraction/tagging question? How should this be done in NLP? I know corpus should contain all but single words, so it's like:
["XXX brand" : ("pain", 1),("relief", 1)]
So if I typed in either "pain" or "relief" I could get "XXX brand"; but what about I searched "pain relief"?
I could come up with idea that directly call python in my javascript for calculate similarities of input words "pain relief" on browser-based server and make recommendation; but that's kind of do-able?
I still prefer to build up very big lists of keywords at backends, stored in datasets/database and directly visualized in web page of search engine.
Thanks!
Even though this does not provide a full how-to answer, there are two things that might be helpful.
First, it's important to note that Google does not only treat singular words but also ngrams.
More or less every NLP problem and therefore also information retrieval from text needs to tackle ngrams. This is because phrases carry way more expressiveness and information than singular tokens.
That's also why so called NGramAnalyzers are popular in search engines, be it Solr or elastic. Since both are based on Lucene, you should take a look here.
Relying on either framework, you can use a synonym analyser that adds for each word the synonyms you provide.
For example, you could add relief = remedy (and vice versa if you wish) to your synonym mapping. Then, both engines would retrieve relevant documents regardless if you search for "pain relief" or "pain remedy". However, you should probably also read this post about the issues you might encounter, especially when aiming for phrase synonyms.
Is there a huge CSV/XML or whatever file somewhere that contains a list of english verbs and their variations (e.g sell -> sold, sale, selling, seller, sellee)?
I imagine this will be useful for NLP systems, but there doesn't seem to be a listing anywhere, or it could be my terrible googling skills. Does anybody have a clue otherwise?
Consider Catvar:
A Categorial-Variation Database (or Catvar) is a database of clusters of uninflected words (lexemes) and their categorial (i.e. part-of-speech) variants. For example, the words hunger(V), hunger(N), hungry(AJ) and hungriness(N) are different English variants of some underlying concept describing the state of being hungry. Another example is the developing cluster:(develop(V), developer(N), developed(AJ), developing(N), developing(AJ), development(N)).
I am not sure what you are looking for but I think WordNet -- a lexical database for the English language -- would be a good place to start. Read more at http://wordnet.princeton.edu/
The link I referred to you says that
WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
Considering getting a dump of wiktionary and extracting this information out of it.
http://en.wiktionary.org/wiki/sell mentions many of the forms of the word (sells, selling, sold).
If your aim is simply to normalize words to some base canonical form, considering using a lemmatizer or stemmer. Trying playing with morpha which is a really good english lemmatizer.
We have a client who is looking for a means to import and categorize a large amount of textual data. This data has to be categorized and it's been suggested that the easiest way to to do this would be to look at the description field and try to match the words held there to see if a category can be derived for that particular record.
It was thought the best way to do this would be matching the words to key words held against each category and if that was unsuccessful then to use some kind of synonym look up to see if this could be used instead. So for example, if a particular record had the word "automobile" in it then a synonym look up could match that word to the word "car" which would be held against the category "vehicle".
Does anyone know of a web service or other means of looking up a dictionary to find synonyms for a particular word? The project manager has suggested buying a Google Enterprise Search license for this but from what I can make out that doesn't offer what these guys are looking for.
Any suggestions of other getting the client what they are looking for would be gratefully accepted.
Thanks! I'll look into Wordnet.
Do you know of any other types of textual classification software products out there. I see there's some discussion of using Bayasian algorithms for this but I can't see any real world examples of it.
The first thing that comes to mind is Wordnet. Wordnet is a human-generated database of words and related words, including synonyms. The Wikipedia Wordnet entry lists several interfaces to Wordnet. I believe some of them are web services.
You can also roll your own. Manning and Schutze's chapter 5 (free PDF) shows ways to do this.
Having said that, are you solving the right problem? How do you build the category list?
Is it a hierarchy? a tag cloud? See Clay Shirky's Ontology is Overrated for a critique of hierarchical categories. I believe that synonyms are less important if you base your classification on sets of words (Naive Bayes, for example) rather than on single words.
You should look at using WordNet. You can visit their website http://wordnet.princeton.edu/ to get more information, but there are libraries available for integrating against them in lots of languages.
Go to their online tool to see the use of it in action here: http://wordnetweb.princeton.edu/perl/webwn. If you look up a word, then click on "S" next to each definition, you'll get a list of semantically related words to that definition.
I also think you should check out software that will allow you to perform "document clustering." Here is an example: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview. That should help you bootstrap the category creation process.
I think this will help get you a long way toward what you want!
For text classification you can take a look at Apache Mahout.
I need the most exhaustive English word list I can find for several types of language processing operations, but I could not find anything on the internet that has good enough quality.
There are 1,000,000 words in the English language including foreign and/or technical words.
Can you please suggest such a source (or close to 500k words) that can be downloaded from the internet that is maybe a bit categorized? What input do you use for your language processing applications?
Kevin's wordlists is the best I know just for lists of words.
WordNet is better if you want to know about things being nouns, verbs etc, synonyms, etc.
`The "million word" hoax rolls along', I see ;-)
How to make your word lists longer: given a noun, add any of the following to it: non-, pseudo-, semi-, -arific, -geek, ...; mutatis mutandis for verbs etc.
I did research for Purdue on controlled / natural english and language domain knowledge processing.
I would take a look at the attempto project: http://attempto.ifi.uzh.ch/site/description/ which is a project to help build a controlled natural english.
You can download their entire word lexicon at: http://attempto.ifi.uzh.ch/site/downloads/files/clex-6.0-080806.zip it has ~ 100,000 natural English words.
You can also supply your own lexicon for domain specific words, this is what we did in our research. They offer webservices to parse and format natural english text.
Who told you there was 1 million words? According to Wikipedia, the Oxford English Dictionary only has 600,000. And the OED tries to include all technical and slang terms that are used.
Try directly Wikipedia's extracts : http://dbpedia.org
There aren't too many base words(171k according to this- oxford. Which is what I remember being told in my CS program in college.
But if include all forms of the words- then it rises considerably.
That said, why not make one yourself? Get a Wikipedia dump and parse it and create a set of all tokens you encounter.
Expect misspellings though- like all things crowd-sources there will be errors.