Differences between lexical features and orthographic features in NLP? - nlp

Features are used for model training and testing. What are the differences between lexical features and orthographic features in Natural Language Processing? Examples preferred.

I am not aware of such a standard distinction; most of the time, when people talk about lexical features they mean using the word itself, in contrast to using only other features, e.g. its part of speech.
Here is an example of a paper that means "whole-word orthography" when it says lexical features.
One could venture that orthographic means something more abstract than the sequence of characters itself, for example whether the sequence is capitalized / titlecased / camelcased / etc. But we already have the useful and clearly understood term shape features for that.
As such, I would recommend distinguishing features like this:
lexical features:
whole word, prefix/suffix (various lengths possible), stemmed word, lemmatized word
shape features:
uppercase, titlecase, camelcase, lowercase
grammatical and syntactic features:
POS, part of a noun-phrase, head of a verb phrase, complement of a prepositional phrase, etc...
This is not an exhaustive list of possible features and feature categories, but it might help you categorize linguistic features in a clearer and more widely accepted way.
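To make the three categories concrete, here is a minimal sketch in plain Python of how they might be extracted for a single token (the feature names and the extract_features helper are purely illustrative, and the POS tag is assumed to come from an external tagger):

    import re

    def extract_features(token, pos_tag):
        # Illustrative lexical, shape, and grammatical/syntactic features for one token.
        return {
            # lexical features: the word itself and substring variants
            "word": token.lower(),
            "prefix3": token[:3].lower(),
            "suffix3": token[-3:].lower(),
            # shape features: abstractions over the character sequence
            "is_upper": token.isupper(),
            "is_title": token.istitle(),
            "is_camel": bool(re.match(r"^[a-z]+(?:[A-Z][a-z0-9]+)+$", token)),
            # grammatical/syntactic features: supplied by a tagger or parser
            "pos": pos_tag,
        }

    print(extract_features("Zouk", "NNP"))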

Related

Chomsky hierarchy - examples with real languages

I'm trying to understand the four levels of the Chomsky hierarchy by using some real languages as models. He thought that all natural languages can be generated through a Context-free Grammar, but Shieber contradicted this theory by proving that languages such as Swiss German can only be generated through a Context-sensitive grammar. Since Chomsky is from the US, I guess that the American language is an example of Context-free grammar. My questions are:
Are there languages which can be generated by regular grammars (type 3)?
Since recursively enumerable grammars can generate all languages, why not use those? Are they too complicated and less linear?
What is the characteristic of Swiss German that makes it impossible to generate through Context-free grammars?
I don't think this is an appropriate question for StackOverflow, which is a site for programming questions. But I'll try to address it as best I can.
I don't believe Chomsky was ever under the impression that natural languages could be described with a Type 2 grammar. It is not impossible for noun-verb agreement (singular/plural) to be represented in a Type 2 grammar, because the number of cases is finite, but the grammar is awkward. But there are more complicated features of natural language, generally involving specific rules about how word order can be rearranged, which cannot be captured in a simple grammar. It was Chomsky's hope that a second level of analysis -- "transformational grammars" -- could usefully capture these rearrangement rules without making the grammar computationally intractable. That would require finding some systematization which fits between Type 1 and Type 2, because Type 1 grammars are not computationally tractable.
Since we do, in fact, correctly parse our own languages, it stands to reason that there is some computational algorithm. But that line of reasoning might not actually be correct, because there is a limit to the complexity of a sentence which we can parse. Any finite language is regular (Type 3); only languages which have an unlimited number of potential sentences require more sophisticated grammars. So a large collection of finite patterns could suffice to understand natural language. These patterns might be a lot more sophisticated than regular expressions, but as long as each pattern only applies to a sentence of limited length, the pattern could be expressed mathematically as a regular expression. (The most obvious one is to just list all possible sentences as alternatives, which is a regular expression if the number of possible sentences is finite. But in many cases, that might be simplified into something more useful.)
As I understand it, modern attempts to deal with natural language using so-called "deep learning" are essentially based on pattern recognition through neural networks, although I haven't studied the field deeply and I'm sure that there are many complications I'm skipping over in that simple description.
Noam Chomsky is an American, but "American" is not a language (and if it were, it might well be Spanish, which is spoken by the majority of the residents of the Americas). As far as I know, his first language is English, but he is not by any means unilingual, although I don't know how much Swiss German he speaks. Certainly, there have been criticisms over the years that his theories have an Indo-European bias. Certainly, I don't claim competence in Swiss German, despite having lived several years in Switzerland, but I did read Shieber's paper and some of the follow-ups and discussed them with colleagues who were native Swiss German speakers. (Opinions were divided.)
The basic issue has to do with morphological agreement in lists. As I mentioned earlier, many languages (all Indo-European languages, as far as I know) insist that the form of the verb agrees with the form of the subject, so that a singular subject requires a singular verb and a plural subject requires a plural verb. [Note 1]
In many languages, agreement is also required between adjectives and nouns, and this is not just agreement in number but also agreement in grammatical gender (if applicable). Also, many languages require agreement between the specific verb and the article or adjective of the object of the verb. [Note 2]
Simple agreement can be handled by a context-free (Type 2) grammar, but there is a huge restriction. To put it simply, a context-free grammar can only deal with parenthetic constructions. This can work even if there is more than one type of parenthesis, so a context-free grammar can insist that an [ be matched with a ] and not a ). But the grammar must have this "inside-out" form: the matching symbols must be in the reverse order to the symbols being matched.
One consequence of this is that there is a context-free grammar for palindromes -- sentences which read the same in both directions, which effectively means that they consist of a phrase followed by its reverse. But there is no context-free grammar for duplications: a language consisting of repeated phrases. In the palindrome, the matching words are in the reverse order to the matched words; in the duplicate, they are in the same order. Hence the difference.
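This parenthetic-matching limitation is easy to see in a toy grammar. Here is a minimal sketch using NLTK (an assumption on my part; any CFG toolkit would do) of a context-free grammar for non-empty palindromes over {a, b}; the copy language, where the matching symbols appear in the same order, has no such grammar.

    import nltk

    # Non-empty palindromes over {a, b}: matched symbols appear in reverse
    # ("inside-out") order, which is exactly what a CFG can express.
    palindromes = nltk.CFG.fromstring("""
        S -> 'a' S 'a' | 'b' S 'b' | 'a' 'a' | 'b' 'b' | 'a' | 'b'
    """)

    parser = nltk.ChartParser(palindromes)
    for tree in parser.parse(list("abbba")):
        print(tree)   # at least one parse exists: the string is a palindrome

    # By contrast, the copy language {ww}, the formal analogue of cross-serial
    # agreement, cannot be generated by any context-free grammar.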
Agreement in natural languages mostly follows this pattern, and some of the exceptions can be dealt with by positing simple rules for reordering finite numbers of phrases -- Chomsky's transformational grammar. But Swiss German features at least one case where agreement is not parenthetic, but rather in the same order. [Note 3] This involves the feature of German in which many sentences are in the order Subject-Object-Verb, which can be extended to Subject Object Object Object... Verb Verb Verb... when the verbs have indirect objects. Shieber showed some examples in which object-verb agreement is ordered, even when there are intervening phrases.
In the general case, such "cross-serial agreement" cannot be expressed in a context-free grammar. But there is a huge underlying assumption: that the length of the agreeing series be effectively unlimited. If, on the other hand, there are a finite number of patterns actually in common use, the "deep learning" model referred to above would certainly be able to handle it.
(I want to say that I'm not endorsing deep learning here. In fact, the way "artificial intelligence" is "trained" involves the use of trainers whose cultural biases may well not be sufficiently understood. This could easily lead to the same unfortunate consequences alluded to in my first footnote.)
Notes
This is not the case in many native American languages, as Whorf pointed out. In those languages, using a singular verb with a plural noun implies that the action was taken collectively, while using a plural verb would imply that the action was taken separately. Roughly transcribed to English, "The dogs run" would be about a bunch of dogs independently running in different directions, whereas "The dogs runs" would be about a single pack of dogs all running together. Some European "teachers" who imposed their own linguistic prejudices on native languages failed to correctly understand this distinction, and concluded that the native Americans must be too primitive to even speak their own language "correctly"; to "correct" this "deficiency", they attempted to eliminate the distinction from the language, in some cases with success.
These rules, not present in English, are one of the reasons some English speakers are tortured by learning German. I speak from personal experience.
Ordered agreement, as opposed to parenthetic agreement, is known as cross-serial dependency.

Extracting <subject, predicate, object> triplet from unstructured text

I need to extract simple triplets from unstructured text. Usually it is of the form noun-verb-noun, so I have tried POS tagging and then extracting nouns and verbs from the neighbourhood.
However, it leads to a lot of cases and gives low accuracy.
Will Syntactic/semantic parsing help in this scenario?
Will ontology based information extraction be more useful?
I expect that syntactic parsing would be the best fit for your scenario. Some trivial template-matching method with POS tags might work, where you find verbs preceded and followed by a single noun, and take the former to be the subject and the latter the object. However, it sounds like you've already tried something like that -- unless your neighborhood extraction ignores word order (which would be a bit silly: you'd be guessing which noun was the subject and which was the object, and that's assuming exactly two nouns in each sentence).
Since you're looking for {s, v, o} triplets, chances are you won't need semantic or ontological information. That would be useful if you wanted more information, e.g. agent-patient relations or deeper knowledge extraction.
{s,v,o} is shallow syntactic information, and given that syntactic parsing is considerably more robust and accessible than semantic parsing, that might be your best bet. Syntactic parsing will be sensitive to simple word re-orderings, e.g. "The hamburger was eaten by John." => {John, eat, hamburger}; you'd also be able to specifically handle intransitive and ditransitive verbs, which might be issues for a more naive approach.
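As a starting point, here is a minimal sketch of such a dependency-based extractor using spaCy (the en_core_web_sm model and the extract_svo helper are assumptions for illustration; passives, conjunctions, and ditransitives would need extra handling):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_svo(sentence):
        # Collect (subject, verb, object) triplets from a dependency parse.
        doc = nlp(sentence)
        triplets = []
        for token in doc:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
                for s in subjects:
                    for o in objects:
                        triplets.append((s.lemma_, token.lemma_, o.lemma_))
        return triplets

    print(extract_svo("John ate the hamburger."))   # [('John', 'eat', 'hamburger')]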

Why do I need a tokenizer for each language? [closed]

When processing text, why would one need a tokenizer specialized for the language?
Wouldn't tokenizing by whitespace be enough? What are the cases where it is not a good idea to use simple whitespace tokenization?
Tokenization is the identification of linguistically meaningful units (LMU) from the surface text.
Chinese: 如果您在新加坡只能前往一间夜间娱乐场所,Zouk必然是您的不二之选。
English: If you only have time for one club in Singapore, then it simply has to be Zouk.
Indonesian: Jika Anda hanya memiliki waktu untuk satu klub di Singapura, pergilah ke Zouk.
Japanese: シンガポールで一つしかクラブに行く時間がなかったとしたら、このズークに行くべきです。
Korean: 싱가포르에서 클럽 한 군데밖에 갈시간이 없다면, Zouk를 선택하세요.
Vietnamese: Nếu bạn chỉ có thời gian ghé thăm một câu lạc bộ ở Singapore thì hãy đến Zouk.
Text Source: http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf
The tokenized version of the parallel text above should separate each LMU with a delimiter such as a space. For English this is simple, because each LMU is already delimited/separated by whitespace, and the same goes for most languages written in the Latin script, such as Indonesian, where the whitespace delimiter can easily identify an LMU. However, in other languages this might not be the case.
However, sometimes an LMU is a combination of two "words" separated by spaces. E.g. in the Vietnamese sentence above, you have to read thời_gian (it means time in English) as one token and not 2 tokens. Separating the two words into 2 tokens yields no LMU (e.g. http://vdict.com/th%E1%BB%9Di,2,0,0.html) or wrong LMU(s) (e.g. http://vdict.com/gian,2,0,0.html). Hence a proper Vietnamese tokenizer would output thời_gian as one token rather than thời and gian.
For some other languages, their orthographies might have no spaces to delimit "words" or "tokens", e.g. Chinese, Japanese and sometimes Korean. In that case, tokenization is necessary for the computer to identify LMUs. Often there are morphemes/inflections attached to an LMU, so sometimes a morphological analyzer is more useful than a tokenizer in Natural Language Processing.
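To see the difference on the Chinese sentence above, here is a minimal sketch contrasting naive whitespace splitting with a dedicated word segmenter (the third-party jieba package is an assumption; any Chinese segmenter would make the same point):

    import jieba  # third-party Chinese word segmenter (pip install jieba)

    zh = "如果您在新加坡只能前往一间夜间娱乐场所,Zouk必然是您的不二之选。"
    en = "If you only have time for one club in Singapore, then it simply has to be Zouk."

    print(en.split())           # whitespace already delimits most English LMUs
    print(zh.split())           # the whole Chinese sentence comes back as one "token"
    print(list(jieba.cut(zh)))  # word-like units identified by the segmenter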
Some languages, like Chinese, don't use whitespace to separate words at all.
Other languages will use punctuation differently - an apostrophe might or might not be a part of a word, for instance.
Case-folding rules vary from language to language.
Stopwords and stemming are different between languages (though I guess I'm straying from tokenizer to analyzer here).
Edit by Bjerva: Additionally, many languages concatenate compound nouns. Whether or not these should be tokenised into several tokens cannot easily be determined using only whitespace.
The question also implies "What is a word?" and can be quite task-specific (even disregarding multilinguality as one parameter). Here's my attempt at a subsuming answer:
(Missing) Spaces between words
Many languages do not put spaces in between words at all, and so the basic word division algorithm of breaking on whitespace is of no use at all. Such languages include major East-Asian languages/scripts, such as Chinese, Japanese, and Thai. Ancient Greek was also written by Ancient Greeks without word spaces. Spaces were introduced (together with accent marks, etc.) by those who came afterwards. In such languages, word segmentation is a much more major and challenging task. (MANNI:1999, p. 129)
Compounds
German compound nouns are written as a single word, e.g. "Kartellaufsichtsbehördenangestellter" (an employee at the "Anti-Trust agency"), and compounds de facto are single words -- phonologically (cf. (MANNI:1999, p. 120)). Their information density, however, is high, and one may wish to divide such a compound, or at least to be aware of the internal structure of the word, and this becomes a limited word segmentation task. (Ibidem)
There is also the special case of agglutinating languages; prepositions, possessive pronouns, ... 'attached' to the 'main' word; e.g. Finnish, Hungarian, Turkish in European domains.
Variant styles and codings
Variant coding of information of a certain semantic type, e.g. local syntax for phone numbers, dates, ...:
[...] Even if one is not dealing with multilingual text, any application dealing with text from different countries or written according to different stylistic conventions has to be prepared to deal with typographical differences. In particular, some items such as phone numbers are clearly of one semantic sort, but can appear in many formats. (MANNI:1999, p. 130)
Misc.
One major task is the disambiguation of periods (or punctuation in general) and other non-alpha(-numeric) symbols: if e.g. a period is part of the word, keep it that way, so we can distinguish Wash., an abbreviation for the state of Washington, from the capitalized form of the verb wash (MANNI:1999, p. 129). Besides cases like this, handling contractions and hyphenation also cannot be treated as a standard case across languages (even disregarding the missing whitespace separator).
If one wants to handle multilingual contractions/clitics:
English: They're my father's cousins.
French: Montrez-le à l'agent!
German: Ich hab's ins Haus gebracht. (in's is still a valid variant)
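For the English example, a quick sketch of how a standard tokenizer splits off such clitics, compared with plain whitespace splitting (assuming NLTK and its Punkt resources are installed):

    from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')

    s = "They're my father's cousins."
    print(s.split())
    # ["They're", 'my', "father's", 'cousins.']
    print(word_tokenize(s))
    # ['They', "'re", 'my', 'father', "'s", 'cousins', '.']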
Since tokenization and sentence segmentation go hand in hand, they share the same (cross-language) problems. For whoever wants a general direction:
Kiss, Tibor and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4), p. 485-525.
Palmer, D. and M. Hearst. 1997. Adaptive Multilingual Sentence Boundary Disambiguation. Computational Linguistics, 23(2), p. 241-267.
Reynar, J. and A. Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. Proceedings of the Fifth Conference on Applied Natural Language Processing, p. 16-19.
References
(MANNI:1999) Manning, C. D. and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press.

What Is the Difference Between POS Tagging and Shallow Parsing?

I'm currently taking a Natural Language Processing course at my University and am still confused by some basic concepts. I got the definition of POS Tagging from the Foundations of Statistical Natural Language Processing book:
Tagging is the task of labeling (or tagging) each word in a sentence with its appropriate part of speech. We decide whether each word is a noun, verb, adjective, or whatever.
But I can't find a definition of Shallow Parsing in the book, although it describes shallow parsing as one of the uses of POS Tagging. So I began to search the web and found no direct explanation of shallow parsing, except on Wikipedia:
Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which identifies the constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor their role in the main sentence.
I frankly don't see the difference, but it may be because of my English or just me not understanding a simple basic concept. Can anyone please explain the difference between shallow parsing and POS Tagging? Is shallow parsing often also called Shallow Semantic Parsing?
Thanks in advance.
POS tagging would give a POS tag to each and every word in the input sentence.
Parsing the sentence (using the Stanford PCFG for example) would convert the sentence into a tree whose leaves will hold POS tags (which correspond to words in the sentence), but the rest of the tree would tell you how exactly these words join together to make the overall sentence. For example, an adjective and a noun might combine to form a 'Noun Phrase', which might combine with another adjective to form another Noun Phrase (e.g. quick brown fox) (the exact way the pieces combine depends on the parser in question).
You can see what parser output looks like at http://nlp.stanford.edu:8080/parser/index.jsp
A shallow parser or 'chunker' comes somewhere in between these two. A plain POS tagger is really fast but does not give you enough information, and a full-blown parser is slow and gives you too much. A POS tagger can be thought of as a parser which only returns the bottom-most tier of the parse tree to you. A chunker might be thought of as a parser that returns some other tier of the parse tree to you instead. Sometimes you just need to know that a bunch of words together form a Noun Phrase but don't care about the sub-structure of the tree within those words (i.e. which words are adjectives, determiners, nouns, etc. and how they combine). In such cases you can use a chunker to get exactly the information you need instead of wasting time generating the full parse tree for the sentence.
POS tagging is the process of deciding the type of every token in a text, e.g. NOUN, VERB, DETERMINER, etc. A token can be a word or punctuation.
Meanwhile, shallow parsing or chunking is the process of dividing a text into syntactically related groups.
POS tagging output
My/PRP$ dog/NN likes/VBZ his/PRP$ food/NN ./.
Chunking output
[NP My dog] [VP likes] [NP his food]
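A minimal sketch of how output like this could be produced with NLTK's regex-based chunker (the NP pattern below is just one illustrative definition, and the tagger/tokenizer resources are assumed to be downloaded):

    import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' resources are available

    tokens = nltk.word_tokenize("My dog likes his food.")
    tagged = nltk.pos_tag(tokens)   # [('My', 'PRP$'), ('dog', 'NN'), ('likes', 'VBZ'), ...]

    # Chunk grammar: an NP is an optional determiner/possessive, any number of
    # adjectives, then a noun.
    chunker = nltk.RegexpParser(r"NP: {<DT|PRP\$>?<JJ>*<NN.*>}")
    print(chunker.parse(tagged))
    # (S (NP My/PRP$ dog/NN) likes/VBZ (NP his/PRP$ food/NN) ./.)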
The Constraint Grammar framework is illustrative. In its simplest, crudest form, it takes as input POS-tagged text, and adds what you could call Part of Clause tags. For an adjective, for example, it could add #NN> to indicate that it is part of an NP whose head word is to the right.
In POS tagging, we tag words using a "tagset" like {noun, verb, adj, adv, ...},
while a shallow parser tries to identify sub-components such as named entities and phrases in the sentence, like
"I'm currently (taking a Natural (Language Processing course) at (my University)) and (still confused with some basic concept.)"
D. Jurafsky and J. H. Martin say in their book that a shallow parse (partial parse) is a parse that doesn't extract all the possible information from the sentence, but only the information that is valuable for the specific case.
Chunking is just one of the approaches to shallow parsing. As mentioned, it extracts only information about basic non-recursive phrases (e.g. verb phrases or noun phrases).
Other approaches, for example, produce flattened parse trees. These trees may contain information about part-of-speech tags, but defer decisions that may require semantic or contextual factors, such as PP attachments, coordination ambiguities, and nominal compound analyses.
So, a shallow parse is a parse that produces a partial parse tree, and chunking is an example of such parsing.

What is the difference between lemmatization vs stemming?

When do I use each ?
Also...is the NLTK lemmatization dependent upon Parts of Speech?
Wouldn't it be more accurate if it was?
Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
From the NLTK docs:
Lemmatization and stemming are special cases of normalization. They identify a canonical representative for a set of related word forms.
Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.
For instance:
The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
The word "walk" is the base form for the word "walking", and hence this is matched in both stemming and lemmatisation.
The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context, e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatisation can in principle select the appropriate lemma depending on the context.
Source: https://en.wikipedia.org/wiki/Lemmatisation
Stemming just removes or stems the last few characters of a word, often leading to incorrect meanings and spelling. Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma. Sometimes, the same word can have multiple different Lemmas. We should identify the Part of Speech (POS) tag for the word in that specific context. Here are the examples to illustrate all the differences and use cases:
If you lemmatize the word 'Caring', it would return 'Care'. If you stem, it would return 'Car' and this is erroneous.
If you lemmatize the word 'Stripes' in verb context, it would return 'Strip'. If you lemmatize it in noun context, it would return 'Stripe'. If you just stem it, it would just return 'Strip'.
You would get the same results whether you lemmatize or stem words such as walking, running, swimming... to walk, run, swim, etc.
Lemmatization is computationally expensive since it involves look-up tables and whatnot. If you have a large dataset and performance is an issue, go with stemming. Remember you can also add your own rules to stemming. If accuracy is paramount and the dataset isn't humongous, go with lemmatization.
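A quick check of the claim about regular forms like walking/running/swimming, as a sketch with NLTK (assuming the WordNet data has been downloaded):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    words = ["walking", "running", "swimming"]
    stem = PorterStemmer().stem
    lemmatize = WordNetLemmatizer().lemmatize

    print([stem(w) for w in words])            # ['walk', 'run', 'swim']
    print([lemmatize(w, "v") for w in words])  # ['walk', 'run', 'swim']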
There are two aspects to show their differences:
A stemmer will return the stem of a word, which need not be identical to the morphological root of the word. It is usually sufficient that related words map to the same stem, even if the stem is not in itself a valid root, while in lemmatisation the result is the dictionary form of a word, which must be a valid word.
In lemmatisation, the part of speech of a word should be first determined and the normalisation rules will be different for different part of speech, while the stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.
Reference http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization
The purpose of both stemming and lemmatization is to reduce morphological variation. This is in contrast to the more general "term conflation" procedures, which may also address lexico-semantic, syntactic, or orthographic variations.
The real difference between stemming and lemmatization is threefold:
Stemming reduces word-forms to (pseudo)stems, whereas lemmatization reduces the word-forms to linguistically valid lemmas. This difference is apparent in languages with more complex morphology, but may be irrelevant for many IR applications;
Lemmatization deals only with inflectional variance, whereas stemming may also deal with derivational variance;
In terms of implementation, lemmatization is usually more sophisticated (especially for morphologically complex languages) and usually requires some sort of lexicon. Satisfactory stemming, on the other hand, can be achieved with rather simple rule-based approaches.
Lemmatization may also be backed up by a part-of-speech tagger in order to disambiguate homonyms.
As MYYN pointed out, stemming is the process of removing inflectional and sometimes derivational affixes to a base form that all of the original words are probably related to. Lemmatization is concerned with obtaining the single word that allows you to group together a bunch of inflected forms. This is harder than stemming because it requires taking the context into account (and thus the meaning of the word), while stemming ignores context.
As for when you would use one or the other, it's a matter of how much your application depends on getting the meaning of a word in context correct. If you're doing machine translation, you probably want lemmatization to avoid mistranslating a word. If you're doing information retrieval over a billion documents with 99% of your queries ranging from 1-3 words, you can settle for stemming.
As for NLTK, the WordNetLemmatizer does use the part of speech, though you have to provide it (otherwise it defaults to nouns). Passing it "dove" and "v" yields "dive" while "dove" and "n" yields "dove".
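A minimal sketch of that behavior with NLTK, alongside the context-free Porter stemmer for contrast (assuming the WordNet corpus has been downloaded via nltk.download('wordnet')):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("meeting"))               # 'meet'    (context-free suffix stripping)
    print(lemmatizer.lemmatize("meeting", "n"))  # 'meeting' (noun reading)
    print(lemmatizer.lemmatize("meeting", "v"))  # 'meet'    (verb reading)
    print(lemmatizer.lemmatize("dove", "v"))     # 'dive'
    print(lemmatizer.lemmatize("dove", "n"))     # 'dove'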
An example-driven explanation of the differences between lemmatization and stemming:
Lemmatization handles matching “car” to “cars” along with matching “car” to “automobile”.
Stemming handles matching “car” to “cars”.
Lemmatization implies a broader scope of fuzzy word matching that is still handled by the same subsystems. It implies certain techniques for low level processing within the engine, and may also reflect an engineering preference for terminology.
[...] Taking FAST as an example, their lemmatization engine handles not only basic word variations like singular vs. plural, but also thesaurus operators like having “hot” match “warm”.
This is not to say that other engines don’t handle synonyms, of course they do, but the low level implementation may be in a different subsystem than those that handle base stemming.
http://www.ideaeng.com/stemming-lemmatization-0601
Stemming is the process of removing the last few characters of a given word, to obtain a shorter form, even if that form doesn't have any meaning.
Examples,
"beautiful" -> "beauti"
"corpora" -> "corpora"
Stemming can be done very quickly.
Lemmatization, on the other hand, is the process of converting the given word into its base form according to the dictionary meaning of the word.
Examples,
"beautiful" -> "beauty"
"corpora" -> "corpus"
Lemmatization takes more time than stemming.
I think Stemming is a rough hack people use to get all the different forms of the same word down to a base form which need not be a legit word on its own
Something like the Porter Stemmer can use simple regexes to eliminate common word suffixes.
Lemmatization brings a word down to its actual base form which, in the case of irregular verbs, might look nothing like the input word
Something like Morpha which uses FSTs to bring nouns and verbs to their base form
Huang et al. describe stemming and lemmatization as follows. The selection depends upon the problem and computational resource availability.
Stemming identifies the common root form of a word by removing or replacing word suffixes (e.g. “flooding” is stemmed as “flood”), while lemmatization identifies the inflected forms of a word and returns its base form (e.g. “better” is lemmatized as “good”).
Huang, X., Li, Z., Wang, C., & Ning, H. (2020). Identifying disaster related social media for rapid response: a visual-textual fused CNN architecture. International Journal of Digital Earth, 13(9), 1017–1039. https://doi.org/10.1080/17538947.2019.1633425
Stemming and lemmatization both generate the base form of inflected words; the only difference is that a stem may not be an actual word, whereas a lemma is an actual word in the language.
Stemming follows an algorithm with steps to perform on the words, which makes it faster. In lemmatization, on the other hand, a vocabulary is also consulted to supply the lemma, which makes it slower than stemming; you furthermore may have to specify a part of speech to get the proper lemma.
The above points show that if speed is the priority, stemming should be used, since lemmatizers scan a vocabulary, which consumes time and processing. Whether to use stemmers or lemmatizers depends on the problem you're working on.
for more info visit the link:
https://towardsdatascience.com/stemming-vs-lemmatization-2daddabcb221
Stemming
is the process of producing morphological variants of a root/base word. Stemming programs are commonly referred to as stemming algorithms or stemmers.
Often when searching text for a certain keyword, it helps if the search returns variations of the word.
For instance, searching for “boat” might also return “boats” and “boating”. Here, “boat” would be the stem for [boat, boater, boating, boats].
Lemmatization
looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words. The lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’.
I referred to this link:
https://towardsdatascience.com/stemming-vs-lemmatization-2daddabcb221
In short, the difference between these algorithms is that only lemmatization includes the meaning of the word in the evaluation. In stemming, only a certain number of letters are cut off from the end of the word to obtain a word stem. The meaning of the word does not play a role in it.
In short:
Lemmatization: uses context to transform words to their dictionary (base) form, also known as the lemma.
Stemming: uses the stem of the word, most of the time removing derivational affixes.
source
