I am currently using the WordNet search to get the meaning of words. However, I have a really long list of words and would therefore like to automate the lookup.
For example, given the individual word "goat", I want to get the meaning of it provided by WordNet.
I see questions about getting the root word, hyponyms, etc., but I could not find a proper solution on how to retrieve the meaning given a word.
Please let me know the possible options of doing it!
Here is how to get the definition:
from nltk.corpus import wordnet
syns = wordnet.synsets("goat")
print(syns[0].definition())
Output
any of numerous agile ruminants related to sheep but having a beard and straight horns
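To automate this over a long list of words, you can simply loop over them; a minimal sketch (the word list here is just a placeholder for your own):

from nltk.corpus import wordnet

words = ["goat", "sheep", "car"]  # replace with your own long list

for word in words:
    syns = wordnet.synsets(word)
    if syns:
        # Take the first (most common) sense; loop over syns to get all senses.
        print(word, "-", syns[0].definition())
    else:
        print(word, "- no entry found in WordNet")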
Related
I have a list of let's say "forbidden sentences" (1000 of them, each with around 40 words). I want to create a tool that will find and mark them in a given document.
The problem is that in such a document a forbidden sentence can be expressed differently than it is on the list, keeping the same meaning but changed by using synonyms, a few words more or less, different word order, punctuation, grammar, etc. The fact that this is all in Polish is not making things easier, with each noun, pronoun, and adjective having 14 cases in total, plus modifiers and gender that change the words further. I was also thinking about making it so that the found sentences are ranked by the probability of them being forbidden, with some displaying less resemblance.
I studied IT for two years, but I don't have much knowledge of NLP. Do you think this is possible for an amateur to do? Could you give me some advice on where to start and what tools are best to put it all together? No need to be fancy, just practical. I was hoping to find some ready-to-use code because I imagine this is something that has been made before. Any ideas where to find such resources or what keywords to use while searching? I'd really appreciate some help because I'm very new to this and need to start with the basics.
Thanks in advance,
Kamila
Probably the easiest first try will be to use Polish spaCy, an extension of the popular production-ready NLP library spaCy that adds support for the Polish language.
http://spacypl.sigmoidal.io/#home
You can try to do it like this:
Split the document into sentences.
Clean these sentences with spaCy (remove stopwords and punctuation, and lemmatize - this will help you with the many different forms of the same word).
Clean the "forbidden sentences" as well.
Prepare a vector representation of each sentence - you can use spaCy methods for this.
Calculate the similarity between sentences - e.g. cosine similarity.
Set a threshold: if a document sentence is at least that similar to any of the "forbidden sentences", treat it as forbidden (see the sketch below).
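A minimal sketch of those steps, assuming a Polish spaCy model with word vectors (e.g. pl_core_news_md) is installed; forbidden_sentences, document.txt, and the 0.85 threshold are placeholders for your own data:

import spacy

# Assumes a Polish model with vectors, e.g.: python -m spacy download pl_core_news_md
nlp = spacy.load("pl_core_news_md")

def clean(tokens):
    # Lemmatize and drop stopwords and punctuation, returning a new Doc.
    return nlp(" ".join(
        tok.lemma_ for tok in tokens if not (tok.is_stop or tok.is_punct)
    ))

forbidden_sentences = ["..."]  # your list of 1000 sentences (placeholder)
forbidden = [clean(nlp(s)) for s in forbidden_sentences]

document = nlp(open("document.txt", encoding="utf-8").read())
for sent in document.sents:
    cleaned = clean(sent)
    # Doc.similarity() is the cosine similarity of the averaged word vectors.
    scores = [(cleaned.similarity(f), f.text) for f in forbidden]
    best_score, best_match = max(scores)
    if best_score > 0.85:  # threshold to tune on real examples
        print(round(best_score, 2), sent.text.strip())

Sentences above the threshold can then be marked in the document, and the score itself gives you the resemblance ranking you mentioned.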
If anything is not clear let me know.
Good luck!
I wonder why words like "therefore", "however" or "etc." are not included, for instance.
Can you suggest a strategy to make this list automatically more general?
One obvious solution is to include every word that occurs in all documents. However, maybe in some documents "therefore" does not occur.
Just to be clear, I am not talking about augmenting the list with words specific to certain data sets. For instance, in some data sets it may be interesting to filter out some proper names. I am not talking about that. I am talking about the inclusion of general words that can appear in any English text.
The problem with tinkering with a stop word list is that there is no good way to gather all texts about a certain topic and then automatically discard everything that occurs too frequently. It may lead to inadvertently removing just the topic that you were looking for, because in a limited corpus it occurs relatively frequently. Also, any list of stop words may already contain just the phrase you are looking for. As an example, automatically creating a list of 1980s music groups would almost certainly discard the group The The.
The NLTK documentation refers to where their stopword list came from as:
Stopwords Corpus, Porter et al.
However, that reference is not very well written. It seems to state this was part of the 1980s Porter stemmer (PDF: http://stp.lingfil.uu.se/~marie/undervisning/textanalys16/porter.pdf; thanks go to alexis for the link), but that paper does not actually mention stop words. Another source states that:
The Porter et al refers to the original Porter stemmer paper I believe - Porter, M.F. (1980): An algorithm for suffix stripping. Program 14 (3): 130—37. - although the et al is confusing to me. I remember being told the stopwords for English that the stemmer used came from a different source, likely this one - "Information retrieval" by C. J. Van Rijsbergen (Butterworths, London, 1979).
https://groups.google.com/forum/m/#!topic/nltk-users/c8GHEA8mq8A
The full text of Van Rijsbergen can be found online (PDF: http://openlib.org/home/krichel/courses/lis618/readings/rijsbergen79_infor_retriev.pdf); it mentions several approaches to preprocessing text and so may well be worth a full read. From a quick glance-through it seems the preferred algorithm to generate a stop word list goes all the way back to research such as
LUHN, H.P., 'A statistical approach to mechanised encoding and searching of library information', IBM Journal of Research and Development, 1, 309-317 (1957).
dating back to the very early stages of automated text processing.
The title of your question asks about the criteria that were used to compile the stopwords list. A look at stopwords.readme() will point you to the Snowball source code, and based on what I read there I believe the list was basically hand-compiled, and its primary goal was the exclusion of irregular word forms in order to provide better input to the stemmer. So if some uninteresting words were excluded, it was not a big problem for the system.
As for how you could build a better list, that's a pretty big question. You could try computing a TF-IDF score for each word in your corpus. Words that never get a high tf-idf score (for any document) are uninteresting, and can go in the stopword list.
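A minimal sketch of that idea using scikit-learn; the corpus and the threshold are invented placeholders you would replace with your own data:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: one string per document.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "therefore the dog sat on the mat",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # shape: (n_docs, n_terms)

# For each term, take its highest tf-idf score across all documents.
max_scores = np.asarray(tfidf.max(axis=0).todense()).ravel()
terms = np.array(vectorizer.get_feature_names_out())

# Terms that never score highly in any document are stopword candidates.
threshold = 0.3  # tune on your data
print(sorted(terms[max_scores < threshold]))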
I am working on an NLTK project, intended in principle to be like a standard thesaurus but (quasi-)continuous. To take one example, there are dozens of entries connected with books, including both religious classics and ledgers.
I tried fiddling with some terms, but I seemed to get just a smaller slice of the pie by doing that. (A "ledger" result contained "daybook", but the result was a much smaller collection than one would find by reading a book.) The discussion of "synsets" in the documentation seems to suggest that you can find terms close to an existing term, but the synsets are like islands, or so they seem to me.
What (if any) means are there to say something like "I want all words with a match score above threshold XYZ" or "I want to match the n closest related terms"? The documentation makes it look like this is possible, with a really nice way of calculating a proximity score between two words, but I don't see how to adjust the threshold or, alternatively, how to request the n closest matches.
What are my best bets here?
If you want to be able to compute distance between arbitrary pairs of words, WordNet is the wrong tool for the job: It is a network of particular terms, so either there is a path between two nodes or there is not. Look around for corpus-based measures instead.
A quick google gave this thread (not on SO) that could serve as a starting point.
In the nltk, I would start by taking a look at nltk.text.ContextIndex, which seems to be behind the nltk demo function nltk.Text.similar(). It won't calculate distances between pairs of words, but at least you'll have a rich network of contexts you can start from.
>>> contexts = nltk.text.ContextIndex(nltk.corpus.brown.words()[:100000])
>>> contexts.similar_words("fact")
['jury', 'announcement', 'Washington', 'addition', '1961', 'impression',
'news', 'belief', 'commissioners', 'Laos', 'return', '1959', '1960', '1956',
'result', 'University', 'opinion', 'work', 'course', 'hope']
I'll leave it to you to remove punctuation, stopwords etc. I haven't looked at the algorithms behind this, but you can always implement your own favorite algorithm if this doesn't do the job for you.
How would you go about parsing a string of free-form text to detect things like locations and names based on a dictionary of locations and names? In my particular application there will be tens of thousands, if not more, entries in my dictionaries, so I'm pretty sure just running through them all is out of the question.
Also, is there any way to add "fuzzy" matching so that you can also detect substrings that are within x edits of a dictionary word?
If I'm not mistaken this falls within the field of natural language processing, and more specifically named entity recognition (NER); however, my attempts to find information about the algorithms and processes behind NER have come up empty. I'd prefer to use Python for this as I'm most familiar with that, although I'm open to looking at other solutions.
You might try downloading the Stanford Named Entity Recognizer:
http://nlp.stanford.edu/software/CRF-NER.shtml
If you don't want to use someone else's code and you want to do it yourself, I'd suggest taking a look at the algorithm in their associated paper, because the Conditional Random Field model that they use for this has become a fairly common approach to NER.
I'm not sure exactly how to answer the second part of your question on looking for substrings without more details. You could modify the Stanford program, or you could use a part-of-speech tagger to mark proper nouns in the text. That wouldn't distinguish locations from names, but it would make it very simple to find words that are x words away from each proper noun.
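As a rough illustration of the part-of-speech idea combined with fuzzy dictionary lookup: the gazetteer, cutoff, and example sentence below are made up, and difflib's similarity ratio is used as a stand-in for a strict edit-distance bound:

import difflib
import nltk

# One-time downloads (resource names can vary by NLTK version):
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

# Invented gazetteer of known locations and names.
gazetteer = ["London", "Paris", "Springfield", "Alice", "Robert"]

def fuzzy_entities(text, cutoff=0.8):
    tokens = nltk.word_tokenize(text)
    for word, tag in nltk.pos_tag(tokens):
        if tag in ("NNP", "NNPS"):  # proper nouns only
            # Approximate lookup: closest gazetteer entries by similarity ratio.
            matches = difflib.get_close_matches(word, gazetteer, n=1, cutoff=cutoff)
            if matches:
                yield word, matches[0]

print(list(fuzzy_entities("I flew from Lundon to Paris with Alice.")))
# e.g. [('Lundon', 'London'), ('Paris', 'Paris'), ('Alice', 'Alice')]

For tens of thousands of dictionary entries you would want something faster than a linear scan per word (for example a BK-tree or a trigram index), but the overall structure stays the same.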
We have a client who is looking for a means to import and categorize a large amount of textual data. This data has to be categorized, and it's been suggested that the easiest way to do this would be to look at the description field and try to match the words held there to see if a category can be derived for that particular record.
It was thought the best way to do this would be to match the words to keywords held against each category and, if that was unsuccessful, to use some kind of synonym lookup to see if this could be used instead. So, for example, if a particular record had the word "automobile" in it, then a synonym lookup could match that word to the word "car", which would be held against the category "vehicle".
Does anyone know of a web service or other means of looking up a dictionary to find synonyms for a particular word? The project manager has suggested buying a Google Enterprise Search license for this, but from what I can make out, that doesn't offer what these guys are looking for.
Any suggestions of other ways of getting the client what they are looking for would be gratefully accepted.
Thanks! I'll look into Wordnet.
Do you know of any other types of textual classification software products out there? I see there's some discussion of using Bayesian algorithms for this, but I can't see any real-world examples of it.
The first thing that comes to mind is Wordnet. Wordnet is a human-generated database of words and related words, including synonyms. The Wikipedia Wordnet entry lists several interfaces to Wordnet. I believe some of them are web services.
You can also roll your own. Manning and Schutze's chapter 5 (free PDF) shows ways to do this.
Having said that, are you solving the right problem? How do you build the category list?
Is it a hierarchy? a tag cloud? See Clay Shirky's Ontology is Overrated for a critique of hierarchical categories. I believe that synonyms are less important if you base your classification on sets of words (Naive Bayes, for example) rather than on single words.
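To make the bag-of-words Naive Bayes suggestion concrete, here is a small sketch using scikit-learn; the training examples and categories are invented purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training examples: description text -> category.
train_texts = [
    "four door sedan with low mileage",
    "pickup truck with towing package",
    "fresh apples and oranges delivered weekly",
    "organic bananas and seedless grapes",
]
train_labels = ["vehicle", "vehicle", "produce", "produce"]

# Bag-of-words counts fed into a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["low mileage truck, one owner"]))  # e.g. ['vehicle']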
You should look at using WordNet. You can visit their website http://wordnet.princeton.edu/ to get more information, but there are libraries available in lots of languages for integrating with it.
Go to their online tool to see it in action here: http://wordnetweb.princeton.edu/perl/webwn. If you look up a word and then click on "S" next to each definition, you'll get a list of semantically related words for that definition.
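Through NLTK's interface to WordNet you can pull the lemma names of each synset programmatically, which roughly corresponds to those "S:" entries; "automobile" below is just an example word:

from nltk.corpus import wordnet

# Collect lemma names from every synset of "automobile".
synonyms = set()
for syn in wordnet.synsets("automobile"):
    for lemma in syn.lemmas():
        synonyms.add(lemma.name().replace("_", " "))

print(sorted(synonyms))
# e.g. ['auto', 'automobile', 'car', 'machine', 'motorcar']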
I also think you should check out software that will allow you to perform "document clustering." Here is an example: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview. That should help you bootstrap the category creation process.
I think this will help get you a long way toward what you want!
For text classification you can take a look at Apache Mahout.