I need to implement some sort of stemmer/lemmatizer. I have some words in different forms (a few thousand). It's not a morphological dictionary, just a small part of one. Is it a good idea to learn a stemmer automatically from the file I have? Are there any open-source implementations that can be used?
Nuve is an NLP library for Turkic languages. Once the language rules and data are prepared, it can analyze and generate words for any Turkic language, if not for any agglutinative language. You can fork it and prepare new orthography and morphology files for Azeri.
https://github.com/hrzafer/nuve
Since I'm the author, I'd be glad to help you with the process.
Azerbaijani is an agglutinative language, similar to Turkish, which means words frequently carry a chain of suffixes (e.g. one suffix for plural and one for accusative). It also has vowel harmony, which means each suffix has several variants and you choose the correct one based on the vowels in the root.
What I would do:
Identify a list of suffixes. I would try both unsupervised methods (maybe Linguistica) and googling for a list of suffixes (these will often contain only a basic suffix that changes depending on vowel harmony). Iteratively you should arrive at a reasonable list. If in doubt whether something is a suffix or not, I would throw it in.
Use the list to strip suffixes from words (see the sketch after this list).
The resulting stemmer will be noisy, but depending on what you need it for, it might not matter.
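For concreteness, here is a minimal sketch of that suffix-stripping step in Python. The suffix list is purely illustrative (a few plural/case variants one might write down for Azerbaijani), not a vetted resource; a real list built as described above would be much longer.

# Illustrative suffix-stripping stemmer; the suffix list below is a made-up
# sample of plural/locative/ablative/accusative variants, not a real resource.
SUFFIXES = sorted(
    ["lar", "lər", "dan", "dən", "da", "də", "ı", "i", "u", "ü"],
    key=len, reverse=True,  # try longer suffixes first
)

def strip_suffixes(word, min_stem_len=3):
    """Repeatedly strip known suffixes until none match or the stem gets too short."""
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                word = word[:-len(suffix)]
                changed = True
                break
    return word

print(strip_suffixes("kitablardan"))  # should reduce to something like "kitab"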
You should look at Linguistica, which has been developed by John Goldsmith and his team at the University of Chicago for this purpose.
Are you talking about English? Then please see
English lemmatizer databases?. Considering the significant number of exceptions, a machine-learning approach without a large dictionary does not seem promising.
Related
I am looking into extracting the meaning of expressions used in everyday speech. For instance, it is apparent to a human that the sentence The meal we had at restaurant A tasted like food at my granny's. means that the food was tasty.
How can I extract this meaning using a tool or a technique?
The method I've found so far is to first extract phrases using Stanford CoreNLP POS tagging, and then use a Word Sense Induction tool to derive the meaning of each phrase. However, since WSI tools are meant to resolve words that have multiple senses, I am not sure this is the best tool to use.
What would be the best method to extract the meanings? Or is there any tool that can both identify phrases and extract their meanings?
Any help is much appreciated. Thanks in advance.
The problem you pose is a difficult one. You should use tools from Sentiment Analysis to get a gist of the sentence's emotional message. There are more sophisticated approaches that attempt to extract which quality is assigned to which object in the sentence (this you can get from POS-tagged sentences plus some hand-crafted Information Extraction rules).
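As a concrete starting point for the Sentiment Analysis suggestion, here is a sketch using NLTK's VADER analyzer (my choice of tool, not something the question mentions); it only gives an overall polarity score, not the quality-to-object assignments described above.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()
sentence = "The meal we had at restaurant A tasted like food at my granny's."
print(sia.polarity_scores(sentence))
# A lexicon-based tool will likely score this close to neutral, which shows
# why figurative comparisons like this one are hard to handle.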
However, you may also want to explore paraphrasing the more formal language into the common one and looking for those phrases. For that you would need a good (exhaustive) dictionary of common expressions to start with (there are sometimes slang dictionaries available, but I am not aware of any for English right now). You could then map the colloquial expressions to more formal ones, which are likely to be caught by some embedding space (frequently used in Sentiment Analysis).
Natural Language Processing (NLP), especially for English, has evolved to a stage where stemming would become an archaic technology if "perfect" lemmatizers existed. That is because stemmers change the surface form of a word/token into meaningless stems.
Then again, the definition of a "perfect" lemmatizer is questionable, because different NLP tasks require different levels of lemmatization, e.g. converting words between verb/noun/adjective forms.
Stemmers
[in]: having
[out]: hav
Lemmatizers
[in]: having
[out]: have
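For concreteness, here is the contrast above reproduced with NLTK (one toolkit among many; the question itself does not prescribe one). Exact stems vary by algorithm: Porter is gentler than Lancaster, so the bare "hav" shown above corresponds to a more aggressive stemmer.

import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

word = "having"
print(PorterStemmer().stem(word))                    # 'have' (Porter restores the final e)
print(LancasterStemmer().stem(word))                 # a bare stem such as 'hav'
print(WordNetLemmatizer().lemmatize(word, pos="v"))  # 'have' (dictionary-backed lemma)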
So the question is: are English stemmers of any use at all today, given that we have a plethora of lemmatization tools for English?
If not, then how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify and adverbify preprocesses?
How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?
Q1: "[..] are English stemmers of any use at all today, given that we have a plethora of lemmatization tools for English?"
Yes. Stemmers are much simpler, smaller, and usually faster than lemmatizers, and for many applications, their results are good enough. Using a lemmatizer for that is a waste of resources. Consider, for example, dimensionality reduction in Information Retrieval. You replace all drive/driving with driv in both the searched documents and the query. You do not care if it is drive or driv or x17a$ as long as it clusters inflectionally related words together.
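A quick way to see this clustering effect (a sketch with NLTK's Porter stemmer, my choice of tool): all that matters is that the inflected forms collapse to the same index key, whatever that key happens to look like.

from nltk.stem import PorterStemmer

stem = PorterStemmer().stem
forms = ["drive", "drives", "driving"]
keys = {stem(w) for w in forms}
print(keys)            # a single key; its exact spelling is irrelevant for retrieval
assert len(keys) == 1  # the inflectionally related forms cluster together
# Note: an irregular form like "drove" would not collapse with these,
# which is where a lemmatizer earns its extra cost.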
Q2: "[..] how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify, and adverbify preprocesses?"
What is your definition of a lemma? Does it include derivation (drive - driver) or only inflection (drive - drives - drove)? Does it take semantics into account?
If you want to include derivation (which most people would say includes verbing nouns, etc.), then keep in mind that derivation is far more irregular than inflection. There are many idiosyncrasies, gaps, etc. Do you really want change (as in change trains) and change (as in coins) to have the same lemma? If not, where do you draw the boundary? What about nerve - unnerve, earth - unearth - earthling, ...? It really depends on the application.
If you take semantics into account (bank would be labeled as bank-money or bank-river depending on context), how deep do you go (do you distinguish bank-institution from bank-building)? Some applications may not care about this at all, some might want to distinguish basic senses, and some might want it fine-grained.
Q3: "How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?"
What do you mean by "similar morphological structures as English"? English has very little inflectional morphology. There are good lemmatizers for languages of other morphological types (truly inflectional, agglutinative, template, ...).
With the possible exception of agglutinative languages, I would argue that a lookup table (say, a compressed trie) is the best solution, possibly with some backup rules for unknown words such as proper names. The lookup is followed by some kind of disambiguation, ranging from trivial (take the first entry, or take the first one consistent with the word's POS tag) to much more sophisticated. The more sophisticated disambiguators are usually supervised stochastic algorithms (e.g. TreeTagger or Faster), although combinations of machine learning and manually created rules have been used too (see e.g. this).
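A toy version of this lookup-table approach, just to make the idea concrete (a plain dict stands in for the compressed trie, and the entries are invented for illustration):

# Full-form table mapping a surface form to (lemma, POS) candidates.
FULLFORM_TABLE = {
    "drove":  [("drive", "VERB"), ("drove", "NOUN")],   # "drove" = a herd, as a noun
    "drives": [("drive", "VERB"), ("drive", "NOUN")],
    "better": [("good", "ADJ"), ("well", "ADV"), ("better", "VERB")],
}

def lemmatize(form, pos_tag=None):
    entries = FULLFORM_TABLE.get(form.lower())
    if not entries:
        return form                      # backup rule for unknown words (e.g. proper names)
    if pos_tag:
        for lemma, pos in entries:
            if pos == pos_tag:
                return lemma             # first entry consistent with the POS tag
    return entries[0][0]                 # trivial disambiguation: take the first one

print(lemmatize("drove", "VERB"))   # drive
print(lemmatize("better", "ADJ"))   # good
print(lemmatize("Hana"))            # Hana (unknown, returned as-is)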
Obviously, for most languages, you do not want to create the lookup table by hand, but instead generate it from a description of the morphology of that language. For inflectional languages, you can go the engineering way of Hajic for Czech or Mikheev for Russian, or, if you are daring, use two-level morphology. Or you can do something in between, such as Hana (myself). (Note that these are all full morphological analyzers that include lemmatization as one of their features.) Or you can learn the lemmatizer in an unsupervised manner a la Yarowsky and Wicentowski, possibly with manual post-processing, correcting the most frequent words.
There are way too many options and it really all depends on what you want to do with the results.
One classical application of either stemming or lemmatization is the improvement of search engine results: By applying stemming (or lemmatization) to the query as well as (prior to indexing) to all tokens indexed, users searching for, say, "having" are able to find results containing "has".
(Arguably, verbs are somewhat uncommon in most search queries, but the same principle applies to nouns, especially in languages with a rich noun morphology.)
For the purpose of search result improvement, it is not actually important whether the stem (or lemma) is meaningful ("have") or not ("hav"). It only needs to be able to represent the word in question and all its inflectional forms. In fact, some systems use numbers or other kinds of ID strings instead of either stem or lemma (or base form, or whatever it may be called).
Hence, this is an example of an application where stemmers (by your definition) are as good as lemmatizers.
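A bare-bones illustration of that index-time/query-time symmetry (using NLTK's WordNet lemmatizer as the normalizer; any stemmer, lemmatizer, or ID mapping would do, as argued above):

from collections import defaultdict
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def normalize(token):
    return wnl.lemmatize(token.lower(), pos="v")

docs = {1: "She has a car", 2: "We were having dinner"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[normalize(token)].add(doc_id)

print(index[normalize("having")])
# Should list both documents, since "has" and "having" map to the same lemma.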
However, I am not quite convinced that your (implied) definitions of "stemmer" and "lemmatizer" are generally accepted. I am not sure there is any generally accepted definition of these terms, but the way I define them is as follows:
Stemmer: A function that reduces inflectional forms to stems or base forms, using rules and lists of known suffixes.
Lemmatizer: A function that performs the same reduction, but using a comprehensive full-form dictionary to be able to deal with irregular forms.
Based on these definitions, a lemmatizer is essentially a higher-quality (and more expensive) version of a stemmer.
The answer is highly dependent on the task or specific field of study within Natural Language Processing (NLP) that we are talking about.
It is worth pointing out that it has been shown that in some specific tasks, like Sentiment Analysis (a popular sub-field of NLP), using a stemmer or lemmatizer as a feature in the development of a system (training a machine learning model) does not have a noticeable effect on the accuracy of the model, no matter how good the tool is. Even though it improves performance a little, there are more important features, like dependency parsing, that have considerably more potential to be worked on in such systems.
It is also important to mention that the characteristics of the language we are working on should be taken into consideration.
Stemming just removes or truncates the last few characters of a word, often leading to incorrect meanings and spellings. Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma. Sometimes the same word can have multiple different lemmas, so we should identify the part-of-speech (POS) tag for the word in that specific context. Here are some examples to illustrate the differences and use cases:
If you lemmatize the word 'Caring', it would return 'Care'. If you stem, it would return 'Car' and this is erroneous.
If you lemmatize the word 'Stripes' in verb context, it would return 'Strip'. If you lemmatize it in noun context, it would return 'Stripe'. If you just stem it, it would just return 'Strip'.
You would get the same results whether you lemmatize or stem words such as walking, running, swimming, ... to walk, run, swim, etc.
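The 'Stripes' example above can be reproduced with NLTK's WordNet lemmatizer (one possible tool; the answer does not name one): the POS tag you pass in decides which lemma comes back.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
wnl = WordNetLemmatizer()

print(wnl.lemmatize("stripes", pos="v"))  # strip  (verb reading)
print(wnl.lemmatize("stripes", pos="n"))  # stripe (noun reading)
print(wnl.lemmatize("caring", pos="v"))   # care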
Lemmatization is computationally expensive since it involves look-up tables and whatnot. If you have a large dataset and performance is an issue, go with stemming. Remember that you can also add your own rules to stemming. If accuracy is paramount and the dataset isn't humongous, go with lemmatization.
Is there a huge CSV/XML or whatever file somewhere that contains a list of English verbs and their variations (e.g. sell -> sold, sale, selling, seller, sellee)?
I imagine this would be useful for NLP systems, but there doesn't seem to be a listing anywhere, or it could be my terrible googling skills. Does anybody have a clue otherwise?
Consider Catvar:
A Categorial-Variation Database (or Catvar) is a database of clusters of uninflected words (lexemes) and their categorial (i.e. part-of-speech) variants. For example, the words hunger(V), hunger(N), hungry(AJ) and hungriness(N) are different English variants of some underlying concept describing the state of being hungry. Another example is the developing cluster: (develop(V), developer(N), developed(AJ), developing(N), developing(AJ), development(N)).
I am not sure what you are looking for but I think WordNet -- a lexical database for the English language -- would be a good place to start. Read more at http://wordnet.princeton.edu/
The link I referred you to says that
WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
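As a quick taste of what WordNet gives you for a word like sell, its derivationally related forms are available through NLTK's interface (a sketch under that assumption; note it covers derivation like seller and sale, not inflections like sold, which live in lemmatizer exception lists instead).

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

related = set()
for lemma in wn.lemmas("sell", pos=wn.VERB):
    for form in lemma.derivationally_related_forms():
        related.add(form.name())
print(sorted(related))   # typically includes forms such as 'seller', 'sale', 'selling'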
Consider getting a dump of Wiktionary and extracting this information out of it.
http://en.wiktionary.org/wiki/sell mentions many of the forms of the word (sells, selling, sold).
If your aim is simply to normalize words to some canonical base form, consider using a lemmatizer or stemmer. Try playing with morpha, which is a really good English lemmatizer.
[Caveat] This is not directly a programming question, but it is something that comes up so often in language processing that I'm sure it's of some use to the community.
Does anyone have a good list of uninteresting (English) words that has been tested by more than a casual look? This would include all prepositions, conjunctions, etc.: words that may have semantic meaning but appear frequently in every sentence regardless of the subject. I've built my own lists from time to time for personal projects, but they've been ad hoc; I continuously add words that I'd forgotten as they come up.
These words are usually called stop words. The Wikipedia article contains much more information about them, including where to find some lists.
I think you mean stop words.
There are a few links to lists of stop words on Wikipedia, including this one.
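If you just want a ready-made list, several toolkits ship one; here is NLTK's as an example (spaCy, scikit-learn, Lucene/Solr and others have their own):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

stops = set(stopwords.words("english"))
print(len(stops))         # on the order of a couple of hundred words
tokens = "the cat sat on the mat".split()
print([t for t in tokens if t not in stops])   # ['cat', 'sat', 'mat']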
I need the most exhaustive English word list I can find for several types of language processing operations, but I could not find anything of good enough quality on the internet.
There are 1,000,000 words in the English language including foreign and/or technical words.
Can you please suggest such a source (or one close to 500k words) that can be downloaded from the internet and is maybe a bit categorized? What input do you use for your language processing applications?
Kevin's word lists are the best I know of just for lists of words.
WordNet is better if you want to know about things being nouns, verbs etc, synonyms, etc.
The "million word" hoax rolls along, I see ;-)
How to make your word lists longer: given a noun, add any of the following to it: non-, pseudo-, semi-, -arific, -geek, ...; mutatis mutandis for verbs etc.
I did research at Purdue on controlled/natural English and language domain knowledge processing.
I would take a look at the Attempto project: http://attempto.ifi.uzh.ch/site/description/ which is a project to help build a controlled natural English.
You can download their entire word lexicon at http://attempto.ifi.uzh.ch/site/downloads/files/clex-6.0-080806.zip; it has ~100,000 natural English words.
You can also supply your own lexicon for domain-specific words, which is what we did in our research. They offer web services to parse and format natural English text.
Who told you there were 1,000,000 words? According to Wikipedia, the Oxford English Dictionary has only 600,000, and the OED tries to include all technical and slang terms that are used.
Try Wikipedia's extracts directly: http://dbpedia.org
There aren't too many base words (171k according to this, from Oxford), which is what I remember being told in my CS program in college. But if you include all forms of the words, then the number rises considerably.
That said, why not make one yourself? Get a Wikipedia dump, parse it, and create a set of all tokens you encounter.
Expect misspellings, though; like all things crowd-sourced, there will be errors.
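A minimal sketch of that "make one yourself" route (the file path is a placeholder for whatever plain-text dump you extract, and you should expect the noise mentioned above):

import re

word_re = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")

vocab = set()
with open("wikipedia_plaintext.txt", encoding="utf-8") as f:   # placeholder path
    for line in f:
        vocab.update(w.lower() for w in word_re.findall(line))

print(len(vocab))   # crude token vocabulary, misspellings and all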