[Caveat] This is not directly a programming question, but it comes up so often in language processing that I'm sure it's of some use to the community.
Does anyone have a good list of uninteresting (English) words that has been vetted by more than a casual look? This would include all prepositions, conjunctions, and so on: words that may have semantic meaning but appear frequently in every sentence, regardless of the subject. I've built my own lists from time to time for personal projects, but they've been ad hoc; I keep adding words I'd forgotten as they come up.
These words are usually called stop words. The Wikipedia article contains much more information about them, including where to find some lists.
I think you mean stop words.
There are a few links to lists of stop words on Wikipedia, including this one.
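If you happen to be working in Python, NLTK ships such a list; here is a minimal sketch, assuming NLTK is installed and its stopwords corpus has been downloaded:

import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')  # one-time download of the corpus
english_stops = set(stopwords.words('english'))
print(len(english_stops))              # size varies by NLTK version
print('therefore' in english_stops)    # several "obvious" candidates are not in the list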
I wonder why words like "therefore", "however", or "etc." are not included, for instance.
Can you suggest a strategy to make this list automatically more general?
One obvious approach is to include every word that occurs in all documents; however, "therefore" may simply not occur in some documents.
Just to be clear, I am not talking about augmenting the list with words from specific data sets. For instance, in some data sets it may be interesting to filter out certain proper names. I am not talking about that. I am talking about including general words that can appear in any English text.
The problem with tinkering with a stop word list is that there is no good way to gather all texts about a certain topic and then automatically discard everything that occurs too frequently. Doing so may inadvertently remove just the topic you were looking for, because in a limited corpus it occurs relatively frequently. Also, any list of stop words may already contain just the phrase you are looking for. As an example, automatically creating a list of 1980s music groups would almost certainly discard the group The The.
The NLTK documentation refers to where their stopword list came from as:
Stopwords Corpus, Porter et al.
However, that reference is not very well written. It seems to say this was part of the 1980s Porter stemmer (PDF: http://stp.lingfil.uu.se/~marie/undervisning/textanalys16/porter.pdf; thanks go to alexis for the link), but that paper does not actually mention stop words. Another source states:
The Porter et al refers to the original Porter stemmer paper I believe - Porter, M.F. (1980): An algorithm for suffix stripping. Program 14 (3): 130-137. - although the et al is confusing to me. I remember being told the stopwords for English that the stemmer used came from a different source, likely this one - "Information retrieval" by C. J. Van Rijsbergen (Butterworths, London, 1979).
https://groups.google.com/forum/m/#!topic/nltk-users/c8GHEA8mq8A
The full text of Van Rijsbergen can be found online (PDF: http://openlib.org/home/krichel/courses/lis618/readings/rijsbergen79_infor_retriev.pdf); it mentions several approaches to preprocessing text and so may well be worth a full read. From a quick glance-through it seems the preferred algorithm to generate a stop word list goes all the way back to research such as
LUHN, H.P., 'A statistical approach to mechanised encoding and searching of library information', IBM Journal of Research and Development, 1, 309-317 (1957).
dating back to the very early stages of automated text processing.
The title of your question asks about the criteria that were used to compile the stopwords list. A look at stopwords.readme() will point you to the Snowball source code, and based on what I read there I believe the list was basically hand-compiled, and its primary goal was the exclusion of irregular word forms in order to provide better input to the stemmer. So if some uninteresting words were excluded, it was not a big problem for the system.
As for how you could build a better list, that's a pretty big question. You could try computing a TF-IDF score for each word in your corpus. Words that never get a high tf-idf score (for any document) are uninteresting, and can go in the stopword list.
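A rough sketch of that TF-IDF idea, assuming scikit-learn is available; the three-document corpus and the threshold are made up purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply on monday",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)             # rows = documents, columns = words
best_scores = tfidf.max(axis=0).toarray().ravel()  # highest score each word ever reaches

threshold = 0.3                                    # arbitrary cut-off for this sketch
candidate_stopwords = [word for word, score
                       in zip(vectorizer.get_feature_names_out(), best_scores)
                       if score < threshold]
print(candidate_stopwords)                         # words that are never distinctive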
I need to implement some sort of stemmer/lemmatizer. I have some words in different forms (a few thousand). It's not a morphological dictionary, just a small part of one. Is it a good idea to learn a stemmer automatically from the file I have? Are there any open-source implementations that can be used?
Nuve is an NLP library for Turkic languages. Once the language rules and data are prepared, it can analyze and generate words for any Turkic language, if not for any agglutinative language. You can fork it and prepare new orthography and morphology files for Azeri.
https://github.com/hrzafer/nuve
Since I'm the author, I'd be glad to help you with the process.
Azerbaijani is an agglutinative language, similar to Turkish, which means words frequently carry a chain of suffixes (e.g. one suffix for plural and one for accusative). It also has vowel harmony, which means each suffix has several variants and you choose the correct one based on the vowels in the root.
What I would do:
Identify a list of suffixes. I would try both unsupervised methods (maybe try Linguistica?) and googling for a list of suffixes (these will often give only a basic suffix form, which changes depending on vowel harmony). Iteratively you should arrive at a reasonable list. If in doubt whether something is a suffix or not, I would throw it in.
Use the list to strip suffixes from words.
The resulting stemmer will be noisy, but depending on what you need it for, it might not matter.
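A very naive sketch of such a suffix-stripping stemmer in Python; the suffix list below is a tiny hand-picked illustration, not a real description of Azerbaijani morphology:

# Strip suffixes greedily, longest first, while enough of a root remains.
SUFFIXES = sorted(["lar", "ler", "da", "de"], key=len, reverse=True)  # toy list

def strip_suffixes(word, min_root=3):
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_root:
                word = word[:-len(suffix)]
                changed = True
                break
    return word

print(strip_suffixes("kitablarda"))   # -> "kitab" with this toy suffix list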
You should look at Linguistica, which has been developed by John Goldsmith and his team at the University of Chicago for exactly this purpose.
Are you talking about English? Then please see
English lemmatizer databases? Considering the significant number of exceptions, a machine-learning approach without a large dictionary does not seem promising.
I am working on a project on web intelligence in which I have to build a system that accepts a user query and extracts meaningful keywords. Say, for example, the user enters the query "How to do socket programming in Java"; then I have to ignore "how", "to", "do", "in" and take "socket", "programming", "java" for further processing and clustering. For instance, "socket" and "programming" are two different meaningful keywords, but used together they produce a different meaning. I am looking for some algorithm like TF-IDF to approach this problem. Any help will be appreciated.
Well, what you are looking for is a text analytics solution.
I have only used R for this purpose, but one way to look at it is this: you need a list of words that you consider not to be meaningful keywords, often called "stop words". You can find lists of stop words online for almost any popular language. After that, you might want to take a couple hundred inputs and calculate the frequency of every keyword (having already removed stop words and punctuation, and with all text lower-cased), then try to identify other keywords that you think are irrelevant and add them to your list of words to remove.
After this there are a ton of options you can explore; an example would be stemming, which means reducing each word to its core term so that "pages" and "page" are counted as the same keyword. (As you go deeper you will find plenty of material online to fine-tune your approach.)
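For what it's worth, here is a minimal Python sketch of the same workflow (the description above assumes R); it relies on NLTK's stopwords corpus, and the example queries are just for illustration:

import string
from collections import Counter
from nltk.corpus import stopwords   # requires nltk.download('stopwords')

stops = set(stopwords.words('english'))

def keywords(query):
    # lower-case, strip punctuation, drop stop words
    cleaned = query.lower().translate(str.maketrans('', '', string.punctuation))
    return [w for w in cleaned.split() if w not in stops]

queries = ["How to do socket programming in Java?",
           "Java socket tutorial for beginners"]
frequencies = Counter(w for q in queries for w in keywords(q))

print(keywords("How to do socket programming in Java"))  # e.g. ['socket', 'programming', 'java']
print(frequencies.most_common(5))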
Hope this helps.
I have an algorithm (that I can't change) that outputs a list of phrases. These phrases are intended to be "topics". However, some of them are meaningless on their own. Take this list:
is the fear
freesat
are more likely to
first sight
an hour of
sue apple
depression and
itunes
How can I filter out those phrases that don't make sense on their own, to leave a list like the following?
freesat
first sight
sue apple
itunes
This will be applied to sets of phrases in many languages, but English is the priority.
It's got to be grammatically acceptable in that it can't rely on other words in the original sentence that it was extracted from; e.g. it can't end in 'and'.
Although this is still an underspecified question, it sounds like you want some kind of grammar checker. I suggest you try applying a part-of-speech tagger to each phrase, compile a list of patterns of POS tags that are acceptable (e.g. anything that ends in a preposition would be unacceptable) and use that to filter your input.
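A possible sketch of that with NLTK's off-the-shelf tagger; the sets of "unacceptable" tags below are only an illustrative guess at the kinds of patterns you might compile, not a complete grammar check:

import nltk  # requires the punkt and averaged_perceptron_tagger models to be downloaded

BAD_FIRST_TAGS = {'VBZ', 'VBP', 'VBD'}       # finite verbs such as "is", "are"
BAD_FINAL_TAGS = {'IN', 'CC', 'DT', 'TO'}    # preposition, conjunction, determiner, "to"

def looks_complete(phrase):
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(phrase))]
    return tags[0] not in BAD_FIRST_TAGS and tags[-1] not in BAD_FINAL_TAGS

phrases = ["is the fear", "freesat", "are more likely to", "first sight",
           "an hour of", "sue apple", "depression and", "itunes"]
print([p for p in phrases if looks_complete(p)])
# should keep roughly: freesat, first sight, sue apple, itunes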
At a high level, it seems that phrases which were only nouns or adjective-noun combos would give much better results.
Examples:
"Blue Shirt"
"Happy People"
"Book"
First of all, this problem can be as complex as you want it to be. For third-party reading/solutions, I came across:
http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits
http://research.microsoft.com/en-us/groups/nlp/
http://sharpnlp.codeplex.com/ (note the part of speech tagger)
If you need 100% accuracy, then I wouldn't write such a tool myself.
However, if the problem domain is limited...
I would start by throwing out conjunctions, prepositions, contractions, state-of-being verbs, etc. This is a fairly short list in English (and looks very similar to the stopwords that @HappyTimeGopher suggested).
After that, you could create a dictionary (as an indexed structure, of course) of all acceptable nouns and adjectives and compare each word in the raw phrases against it. Anything that didn't occur in the dictionary in an acceptable sequence could be thrown out or ranked lower.
This could be useful if you were given 100 input values and wanted to select the best 5. Finding the values in the dictionary would mean that it's likely the word/phrase was good.
I've auto-generated such a dictionary before by building a raw index from thousands of documents pertaining to a vertical industry. I then spent a few hours with SQL and Excel stripping out problems easily spotted by a human. The resulting list wasn't perfect but it eliminated most of the blatantly dumb/pointless terminology.
As you may have guessed, none of this is foolproof, although checking adjective-to-noun sequence would help somewhat. Consider the case of "Greatest Hits" versus "Car Hits [Wall]".
Proper nouns (e.g. person names) don't work well with the dictionary approach, since it's probably not feasible to build a dictionary of all variations of given/surnames.
To summarize:
use a list of stopwords
generate a dictionary of words, classifying each with its part(s) of speech
run raw phrases through dictionary and stopwords
(optional) rank on how confident you are on a match
if needed, accept phrases which didn't violate known patterns (this would handle many proper nouns)
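A toy sketch of that summarized pipeline; the dictionary is a tiny hand-made stand-in for the industry-specific word list described above, and the scores are arbitrary:

from nltk.corpus import stopwords   # requires nltk.download('stopwords')

STOPS = set(stopwords.words('english'))
DICTIONARY = {"blue": "ADJ", "shirt": "NOUN", "happy": "ADJ",
              "people": "NOUN", "book": "NOUN"}   # hypothetical noun/adjective dictionary

def score_phrase(phrase):
    words = phrase.lower().split()
    if any(w in STOPS for w in words):
        return 0                                   # contains a stop word: reject
    tags = [DICTIONARY.get(w) for w in words]
    if None in tags:
        return 1                                   # unknown word: keep, but rank low
    if tags == ["NOUN"] or tags == ["ADJ", "NOUN"]:
        return 3                                   # noun or adjective-noun: best
    return 2

for p in ["Blue Shirt", "Happy People", "Book", "an hour of", "sue apple"]:
    print(p, score_phrase(p))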
If you've access to the text these phrases were generated from, it may be easier to just create your own topic tags.
Failing that, I'd probably just remove anything that contained a stop word. See this list, for example:
http://www.ranks.nl/resources/stopwords.html
I wouldn't break out POS tagging or anything stronger for this.
It seems you could create a list that filters out three things:
Prepositions: https://en.wikipedia.org/wiki/List_of_English_prepositions
Conjunctions: https://en.wikipedia.org/wiki/Conjunction_(grammar)
Verb forms of to-be: http://www.englishplus.com/grammar/00000040.htm
If you filter on these things you'd get pretty far. Are you more concerned with false negatives or positives? If false negatives aren't a huge problem, this is how I would approach it.
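A short sketch of that three-list filter; the word sets below are tiny hand-typed samples of the linked lists, not the full lists:

TO_BE = {"am", "is", "are", "was", "were", "be", "been", "being"}
PREPOSITIONS = {"of", "in", "to", "for", "with", "on", "at", "by", "from"}  # sample only
CONJUNCTIONS = {"and", "or", "but", "nor", "so", "yet"}                     # sample only
FILTER_WORDS = TO_BE | PREPOSITIONS | CONJUNCTIONS

phrases = ["is the fear", "freesat", "are more likely to", "first sight",
           "an hour of", "sue apple", "depression and", "itunes"]
kept = [p for p in phrases if not any(w in FILTER_WORDS for w in p.lower().split())]
print(kept)   # -> ['freesat', 'first sight', 'sue apple', 'itunes']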
As part of teaching myself Python I've written a script which allows a user to play hangman. At the moment, the hangman word to be guessed is simply entered manually at the start of the script's code.
I want instead for the script to choose randomly from a large list of English words. That part I know how to do - my problem is finding the list of words to work from in the first place.
Does anyone know of a source on the net for, say, 1000 common English words, where they can be downloaded as a block of text or something similar that I can work with?
(My initial thought was grabbing a chunk of a novel from Project Gutenberg [this project is only for my own amusement and won't be available anywhere else, so copyright etc. doesn't matter hugely to me], but anything like that is likely to contain too many names or non-standard words that wouldn't be suitable for hangman. Basically, I need text that only contains words legal for use in Scrabble.)
It's a slightly odd question for here, I suppose, but actually I thought the answer might be of use not just to me but to anyone else working on a word game or similar project that needs a large seed list of words.
Many thanks for any links or suggestions :)
Would this be useful?
Have you tried /usr/share/dict/words?
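If you are on a Unix-like system, here is a minimal sketch of using it for hangman; the filters are just one guess at weeding out names, possessives, and very short entries:

import random

with open('/usr/share/dict/words') as f:
    words = [line.strip() for line in f]

# keep lower-case, purely alphabetic words of a reasonable length
candidates = [w for w in words if w.isalpha() and w.islower() and len(w) >= 5]

secret = random.choice(candidates)
print(secret)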
Create text list manually
Grab text from Project Gutenberg, Wikipedia, or some other source. Go through the text and count how many times each word is found. The words that occur most frequently will be pronouns, conjunctions, etc. Just throw them out.
Proper nouns will likely be among the least frequently found words, unless of course your text is a story, in which case the character names will likely be found quite often. Probably the best way to handle proper nouns is to use many sources and count how many sources each word is found in. Essentially, words that are common across a lot of different sources are unlikely to be proper nouns. Words that are specific to one text source, you can throw out. This idea is related to TF-IDF.
Once you have calculated these word frequencies, it's also easy to just look over the words, and tweak your list as necessary.
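A rough sketch of that counting approach in Python; the file names are placeholders for whatever Gutenberg or Wikipedia text you download:

import re
from collections import Counter

def word_counts(path):
    with open(path, encoding='utf-8') as f:
        return Counter(re.findall(r"[a-z]+", f.read().lower()))

sources = ["book1.txt", "book2.txt", "article.txt"]   # hypothetical downloaded texts
counts = [word_counts(p) for p in sources]

# keep words that appear in every source (unlikely to be proper nouns) ...
common = set.intersection(*(set(c) for c in counts))
# ... and drop the most frequent words overall (pronouns, conjunctions, etc.)
total = sum(counts, Counter())
too_frequent = {w for w, _ in total.most_common(200)}  # arbitrary cut-off

candidates = sorted(common - too_frequent)
print(candidates[:20])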
Use WordNet
Another idea is to download words from WordNet. WordNet gives the part of speech for a lot of words. You could just stick to nouns and verbs for your purpose.
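A sketch using NLTK's WordNet interface (assuming the wordnet corpus has been downloaded), collecting single-word nouns and verbs as hangman candidates:

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

words = set()
for pos in (wn.NOUN, wn.VERB):
    for synset in wn.all_synsets(pos):
        for lemma in synset.lemma_names():
            # skip multi-word entries (underscores) and capitalized proper nouns
            if lemma.isalpha() and lemma.islower():
                words.add(lemma)

print(len(words))          # many thousands of candidate words
print(sorted(words)[:10])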