NLP: retrieve vocabulary from text

I have some texts in different languages, potentially with some typos or other mistakes, and I want to retrieve their vocabulary. I'm not experienced with NLP in general, so I may use some terms improperly.
By vocabulary I mean a collection of words of a single language in which every word is unique and inflections for gender, number, or tense are not considered (e.g. think, thinks and thought are all considered think).
This is the overall problem, so let's reduce it to retrieving the vocabulary of one language, English for example, and without mistakes.
I think there are (at least) three different approaches and maybe the solution consists of a combination of them:
search in a database of words stored in relation to each other. So, I could search for thought (considering the verb) and read the associated information that thought is an inflection of think
compute the "base form" (a word without inflections) of a word by processing the inflected form. Maybe this can be done with stemming?
use a service via some API. Yes, I accept this approach too, but I'd prefer to do it locally
For a first approximation, it's not necessary that the algorithm distinguishes between nouns and verbs. For instance, if the word thought appeared in the text both as a noun and as a verb, it could be considered already present in the vocabulary at the second match.
We have reduced the problem to retrieving the vocabulary of an English text without mistakes, and without considering the part-of-speech tags of the words.
Any ideas about how to do that? Or just some tips?
Of course, if you have suggestions about this problem also with the other constraints (mistakes and multiple languages, not only Indo-European ones), they would be much appreciated.

You need lemmatization - it's similar to your 2nd item, but not exactly the same (see the difference).
Try the nltk lemmatizer for Python, or Stanford NLP/Clear NLP for Java. Actually, nltk uses WordNet, so it is really a combination of the 1st and 2nd approaches.
To cope with mistakes, use spelling correction before lemmatization. Take a look at related questions or Google for appropriate libs.
About the part-of-speech tag - unfortunately, the nltk lemmatizer doesn't consider the POS tag (or context in general) by itself, so you should provide it with a tag, which can be obtained with nltk POS tagging. Again, this is already discussed here (and in related/linked questions). I'm not sure about Stanford NLP here - I would guess it considers context, but as noted NLTK does not. As I can see from this code snippet, Stanford doesn't use POS tags, while Clear NLP does.
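Here is a minimal sketch of that combination (the tag-mapping helper and the example sentence are just illustrative, and the NLTK data packages need to be downloaded first):

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# One-time setup (resource names can vary slightly across NLTK versions):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the POS constant the WordNet lemmatizer expects."""
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # default

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("She thought about it and now thinks differently.")
for word, tag in nltk.pos_tag(tokens):
    # e.g. "thought" tagged as a past-tense verb lemmatizes to "think"
    print(word, tag, lemmatizer.lemmatize(word.lower(), penn_to_wordnet(tag)))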
About other languages - Google for lemmatization models, since the algorithm for most languages (at least from the same family) is almost the same; the differences are in the training data. Take a look here for an example for German; as far as I can see, it is a wrapper for several lemmatizers.
However, you can always use a stemmer at the cost of precision, and stemmers are more readily available for different languages.
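As a quick illustration of how readily available stemmers are across languages, here is a small sketch using NLTK's Snowball stemmers (the example words are arbitrary and the exact stems depend on each language's rules):

from nltk.stem import SnowballStemmer

print(SnowballStemmer.languages)  # languages with a ready-made Snowball stemmer

# The exact stems depend on each language's rules; this just shows the API.
print(SnowballStemmer('english').stem('thinking'))
print(SnowballStemmer('german').stem('Gedanken'))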

What criterion was used to build the list of english stop words in nltk (python)?

I wonder why words like "therefore" or "however" or "etc" are not included for instance.
Can you suggest a strategy to make this list automatically more general?
One obvious solution is to include every word that occurs in all documents. However, maybe in some documents "therefore" does not occur.
Just to be clear, I am not talking about augmenting the list by including words of specific data sets. For instance, in some data sets, it may be interesting to filter out some proper names. I am not talking about that. I am talking about the inclusion of general words that can appear in any English text.
The problem with tinkering with a stop word list is that there is no good way to gather all texts about a certain topic and then automatically discard everything that occurs too frequently. It may lead to inadvertently removing just the topic that you were looking for, because in a limited corpus it occurs relatively frequently. Also, any list of stop words may already contain just the phrase you are looking for. As an example, automatically creating a list of 1980s music groups would almost certainly discard the group The The.
The NLTK documentation refers to where their stopword list came from as:
Stopwords Corpus, Porter et al.
However, that reference is not very well written. It seems to state this was part of the 1980s Porter stemmer (PDF: http://stp.lingfil.uu.se/~marie/undervisning/textanalys16/porter.pdf; thanks go to alexis for the link), but that paper does not actually mention stop words. Another source states that:
The Porter et al refers to the original Porter stemmer paper I believe - Porter, M.F. (1980): An algorithm for suffix stripping. Program 14 (3): 130—37. - although the et al is confusing to me. I remember being told the stopwords for English that the stemmer used came from a different source, likely this one - "Information retrieval" by C. J. Van Rijsbergen (Butterworths, London, 1979).
https://groups.google.com/forum/m/#!topic/nltk-users/c8GHEA8mq8A
The full text of Van Rijsbergen can be found online (PDF: http://openlib.org/home/krichel/courses/lis618/readings/rijsbergen79_infor_retriev.pdf); it mentions several approaches to preprocessing text and so may well be worth a full read. From a quick glance-through it seems the preferred algorithm to generate a stop word list goes all the way back to research such as
LUHN, H.P., 'A statistical approach to mechanised encoding and searching of library information', IBM Journal of Research and Development, 1, 309-317 (1957).
dating back to the very early stages of automated text processing.
The title of your question asks about the criteria that were used to compile the stopwords list. A look at stopwords.readme() will point you to the Snowball source code, and based on what I read there I believe the list was basically hand-compiled, and its primary goal was the exclusion of irregular word forms in order to provide better input to the stemmer. So if some uninteresting words were excluded, it was not a big problem for the system.
As for how you could build a better list, that's a pretty big question. You could try computing a TF-IDF score for each word in your corpus. Words that never get a high tf-idf score (for any document) are uninteresting, and can go in the stopword list.
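A minimal sketch of that idea with scikit-learn (assuming a recent version); the toy corpus and the threshold are placeholders to be tuned on real data:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "therefore the mat was left alone",
]  # replace with your corpus

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)               # shape: (n_docs, n_terms)
best_per_term = tfidf.max(axis=0).toarray().ravel()  # the best score each term ever gets
terms = vectorizer.get_feature_names_out()

threshold = 0.3  # arbitrary; tune on your data
stopword_candidates = sorted(t for t, s in zip(terms, best_per_term) if s < threshold)
print(stopword_candidates)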

Techniques other than RegEx to discover 'intent' in sentences

I'm embarking on a project for a non-profit organization to help process and classify thousands of reports submitted annually by their field workers/contractors the world over. I'm relatively new to NLP, and as such wanted to seek the group's guidance on the approach to solving our problem.
I'll outline the current process and our challenges, and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers has increased, the volume of reports being generated has grown, and our editors are now becoming the bottleneck.
Solution: We would like to automate the 1st step of our process, i.e., checking the document for compliance with the organizational best-practice template.
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords. i.e., to check for compliance we use the following logic:
1 To check "states purpose" = we do a regex to match 'purpose', 'intent'
2 To check "identifies audience" = we do a regex to match with 'identifies', 'is for'
3 To check "highlights relevance" = we do a regex to match with 'able to', 'allows', 'enables'
The current approach of RegEx seems very primitive and limited, so I wanted to ask the community if there is a better way to solve this problem using something like NLTK or CoreNLP.
Thanks in advance.
Interesting problem - I believe it's a genuine research problem! In natural language processing, there are a few techniques that learn and extract templates from text and can then use them as gold annotation to identify whether a document follows the template structure. Researchers have used this kind of system for automatic question answering (extract templates from questions and then answer them). But in your case it's more difficult, as you need to learn the structure from a report. From an NLP perspective your problem is hard to address (no simple NLP task matches your problem definition), but you may not need any fancy (complex) model to solve it.
You can start with simple document matching and computing a similarity score. If you have a large collection of positive examples (well-formatted and specified reports), you can construct a dictionary based on tf-idf weights. Then you can check for the presence of the dictionary tokens. You can also frame this as a binary classification problem. There are good machine learning classifiers, such as SVM and logistic regression, which work well for text data. You can use Python and scikit-learn to build programs quickly, and they are pretty easy to use. For text pre-processing, you can use NLTK.
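As a rough sketch of the binary-classification route with scikit-learn (the two example reports and their labels are made up; in practice you would use the curators' past judgments as training data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data: reports the curators have already judged.
reports = [
    "This document introduces techniques intended to help low-income farmers price their products.",
    "Random notes from the field with no stated purpose, audience or relevance.",
]
labels = [1, 0]  # 1 = follows the best-practice template, 0 = does not

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(reports, labels)

new_report = "This study is intended to help farmers get better prices for their produce."
print(model.predict([new_report]))  # predicted compliance label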
Since the reports will be generated by field workers and there are a few questions that will be answered by the reports (you mentioned 3 specific components), I guess simple keyword matching techniques will be a good start for your research. You can gradually move in different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
If you don't give yourself hints about the format of the document, this is an open problem.
What you can do, though, is ask the people writing reports to conform to some format for the document, such as having 3 parts, each with a pre-defined title, like so:
1. Purpose
Explains the purpose of the document in several paragraphs.
2. Topic / Problem
This addresses the foobar problem, also known as lorem ipsum filler text.
3. Take away
What can the reader do after reading it?
You parse this document from .doc format, for instance, and extract the three parts. Then you can run spell checking, grammar checking and text complexity algorithms. And finally you can extract, for instance, Named Entities (cf. Named Entity Recognition) and low TF-IDF words.
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise, CoreNLP and OpenNLP, the libraries that I'm most familiar with, have some tools that can help you with part of the task. For example, if your regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally", etc., and you wouldn't necessarily know if the word is a verb, a noun, an adjective or an adverb. The POS taggers and parsers in these libraries can tell you the part of speech of the word, and maybe you only care about the verbs that start with "inten", or, more strictly, the verbs in the third person singular.
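For instance, here is a small sketch with NLTK's tagger (rather than CoreNLP/OpenNLP, just to keep it short); the sentence is the example report from the question and the prefix check is the same "inten" idea:

import nltk

# Requires the NLTK tokenizer and tagger data packages to be downloaded.
sentence = ("This study is intended to help low-income farmers identify a set of "
            "best practices for pricing agricultural products.")

tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # Penn Treebank tags, e.g. VBN, NN, JJ

# Keep only verb forms that start with the prefix "inten".
hits = [(word, tag) for word, tag in tagged
        if tag.startswith('VB') and word.lower().startswith('inten')]
print(hits)  # something like [('intended', 'VBN')]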
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger, for example, you would also know that "she" is a personal pronoun and "took" is a past-tense verb.
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.

Extracting related text given a sentence, keywords or topic

Are there any known ways (above and beyond statistical analysis, but not necessarily excluding it as part of the solution) to relate sentences or concepts to one another using Natural Language Processing? Thus far I've only worked with NLTK and Stanford NLP to aid in my project, but I am open to alternative open source solutions.
As an example take the following George Orwell essay (http://orwell.ru/library/essays/wiw/english/e_wiw). Suppose I gave the application the sentence
"What are George Orwell's opinions on writers."
or perhaps
"George Orwell believes writers enjoy writing to express their creativity, to make a point and for their egos."
This might yield lines from the essay like
"The aesthetic motive is very feeble in a lot of writers, but even a pamphleteer or writer of textbooks will have pet words and phrases which appeal to him for non-utilitarian reasons; or he may feel strongly about typography, width of margins, etc."
or
"Serious writers, I should say, are on the whole more vain and self-centered than journalists, though less interested in money."
I understand that this is not easy and I may not achieve much accuracy, but I was hoping for ideas on what already exists and what I could try to start off, or at least get the best results possible based on what is already known and out there.
The simplest way of doing this might be using some distance function (such as cosine similarity) between your query sentence and the sentence pool. It's easy to implement: create a vocabulary from the text collection and represent each sentence as a vector. You can use TF-IDF to represent the values in the vector, calculate the cosine similarity between sentences, and get the highest-scoring sentence with respect to your query sentence.
Or you can build an index from your corpus and use, for example, Lucene and let it do the work for you.
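A minimal sketch of this with scikit-learn (the sentence pool and query are placeholders; in practice the pool would be the essay split into sentences):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Serious writers are on the whole more vain and self-centered than journalists.",
    "The aesthetic motive is very feeble in a lot of writers.",
    "He may feel strongly about typography and the width of margins.",
]
query = "George Orwell believes writers enjoy writing to express their creativity."

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(sentences + [query])  # last row is the query

scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
best = scores.argmax()
print(scores[best], sentences[best])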
You may also consider using LSA (Latent Semantic Analysis) where you can get the similarity between sentences.
From what I understand from your question (and also your comment), you are more interested in understanding the meaning of individual sentences and then relating them to each other. A statistical approach, in my opinion, is more for "getting a feel" of the sentence than for understanding it. I would suggest a deep parsing approach.
Deep parse the sentence, understand what roles the words play in the sentence, understand the subject-verb-object structure (left-to-right parsing and similar techniques), and then have a vocabulary that helps you categorise the nouns and verbs.
e.g.
"Serious writers, I should say, are on the whole more vain and self-centered than journalists, though less interested in money."
Parsing this sentence lets you understand that the subject of the sentence is "serious writers" (serious being an adjective, writers basically). In the verb form it states "are" (current state) and "interested". Each verb then points to some more vocabulary, including adjectives. If you arrange this vocabulary in the correct manner (and keep building it), I think you should get somewhere with your problem.

Stemmers vs Lemmatizers

Natural Language Processing (NLP), especially for English, has evolved to the stage where stemming would become an archaic technology if "perfect" lemmatizers existed. That's because stemmers change the surface form of a word/token into some meaningless stems.
Then again, the definition of a "perfect" lemmatizer is questionable because different NLP tasks require different levels of lemmatization, e.g. converting words between verb/noun/adjective forms.
Stemmers
[in]: having
[out]: hav
Lemmatizers
[in]: having
[out]: have
So the question is: are English stemmers of any use at all today, since we have a plethora of lemmatization tools for English?
If not, then how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify and adverbify preprocesses?
How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?
Q1: "[..] are English stemmers any useful at all today? Since we have a plethora of lemmatization tools for English"
Yes. Stemmers are much simpler, smaller, and usually faster than lemmatizers, and for many applications, their results are good enough. Using a lemmatizer for that is a waste of resources. Consider, for example, dimensionality reduction in Information Retrieval. You replace all drive/driving with driv in both the searched documents and the query. You do not care if it is drive or driv or x17a$ as long as it clusters inflectionally related words together.
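A tiny sketch of that kind of index-time and query-time normalization with NLTK's Porter stemmer (the documents are made up; the point is only that inflectionally related forms collapse to the same key, whatever that key looks like):

from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
documents = {
    1: "she drives to work",
    2: "long drives are tiring",
    3: "he was driving all night",
}

# Tiny inverted index keyed by stems instead of surface forms.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.lower().split():
        index[stemmer.stem(token)].add(doc_id)

# The query term is stemmed the same way, so whatever string the stem happens
# to be, "driving" finds the documents containing "drives" and "driving".
print(index[stemmer.stem("driving")])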
Q2: "[..]how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify, and adverbify preprocesses?
What is your definition of a lemma, does it include derivation (drive - driver) or only inflection (drive - drives - drove)? Does it take into account semantics?
If you want to include derivation (which most people would say includes verbing nouns, etc.), then keep in mind that derivation is far more irregular than inflection. There are many idiosyncrasies, gaps, etc. Do you really want change (as in change trains) and change (as in coins) to have the same lemma? If not, where do you draw the boundary? How about nerve - unnerve, earth - unearth - earthling, ... It really depends on the application.
If you take into account semantics (bank would be labeled as bank-money or bank-river depending on context), how deep do you go (do you distinguish bank-institution from bank-building)? Some apps may not care about this at all, some might want to distinguish basic semantics, and some might want it fine-grained.
Q3: "How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?"
What do you mean by "similar morphological structures as English"? English has very little inflectional morphology. There are good lemmatizers for languages of other morphological types (truly inflectional, agglutinative, template, ...).
With the possible exception of agglutinative languages, I would argue that a lookup table (say a compressed trie) is the best solution (possibly with some backup rules for unknown words such as proper names). The lookup is followed by some kind of disambiguation, ranging from trivial (take the first one, or take the first one consistent with the word's POS tag) to much more sophisticated. The more sophisticated disambiguations are usually supervised stochastic algorithms (e.g. TreeTagger or Faster), although a combination of machine learning and manually created rules has been done too (see e.g. this).
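To make the lookup-plus-disambiguation idea concrete, here is a toy sketch; the table entries are invented, and a real table would be generated rather than hand-written:

# Toy full-form lookup table: surface form -> list of (lemma, POS) candidates.
LOOKUP = {
    "thought": [("think", "VERB"), ("thought", "NOUN")],
    "drove":   [("drive", "VERB"), ("drove", "NOUN")],   # "a drove of cattle"
    "banks":   [("bank", "NOUN"), ("bank", "VERB")],
}

def lemmatize(form, pos=None):
    """Look the form up; disambiguate by POS if given, else take the first candidate."""
    candidates = LOOKUP.get(form.lower())
    if not candidates:
        return form  # backup rule for unknown words: leave them unchanged
    if pos is not None:
        for lemma, candidate_pos in candidates:
            if candidate_pos == pos:
                return lemma
    return candidates[0][0]

print(lemmatize("thought", "VERB"))  # think
print(lemmatize("thought", "NOUN"))  # thought
print(lemmatize("Orwell"))           # Orwell (unknown word, returned as-is)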
Obviously, for most languages, you do not want to create the lookup table by hand, but instead generate it from a description of the morphology of that language. For inflectional languages, you can go the engineering way of Hajic for Czech or Mikheev for Russian, or, if you are daring, you can use two-level morphology. Or you can do something in between, such as Hana (myself) (note that these are all full morphological analyzers that include lemmatization as one of their features). Or you can learn the lemmatizer in an unsupervised manner a la Yarowsky and Wicentowski, possibly with manual post-processing, correcting the most frequent words.
There are way too many options and it really all depends on what you want to do with the results.
One classical application of either stemming or lemmatization is the improvement of search engine results: By applying stemming (or lemmatization) to the query as well as (prior to indexing) to all tokens indexed, users searching for, say, "having" are able to find results containing "has".
(Arguably, verbs are somewhat uncommon in most search queries, but the same principle applies to nouns, especially in languages with a rich noun morphology.)
For the purpose of search result improvement, it is not actually important whether the stem (or lemma) is meaningful ("have") or not ("hav"). It only needs to be able to represent the word in question and all its inflectional forms. In fact, some systems use numbers or other kinds of ID strings instead of either stem or lemma (or base form or whatever it may be called).
Hence, this is an example of an application where stemmers (by your definition) are as good as lemmatizers.
However, I am not quite convinced that your (implied) definitions of "stemmer" and "lemmatizer" are generally accepted. I am not sure if there is any generally accepted definition of these terms, but the way I define them is as follows:
Stemmer: A function that reduces inflectional forms to stems or base forms, using rules and lists of known suffixes.
Lemmatizer: A function that performs the same reduction, but using a comprehensive full-form dictionary to be able to deal with irregular forms.
Based on these definitions, a lemmatizer is essentially a higher-quality (and more expensive) version of a stemmer.
The answer is highly dependent on the task or specific field of study within the Natural Language Processing (NLP) that we are talking about.
It is worth pointing out that it has been shown that in some specific tasks, like sentiment analysis (a favorite sub-field of NLP), using a stemmer or lemmatizer as a feature in the development of a system (training a machine learning model) does not have a noticeable effect on the model's accuracy, no matter how great the tool is. Even though it makes the performance a little bit better, there are more important features, like dependency parsing, that have considerable potential to be worked on in such systems.
It is important to mention that the characteristics of the language we are working on should also be taken into consideration.
Stemming just removes or stems the last few characters of a word, often leading to incorrect meanings and spelling. Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma. Sometimes, the same word can have multiple different Lemmas. We should identify the Part of Speech (POS) tag for the word in that specific context. Here are the examples to illustrate all the differences and use cases:
If you lemmatize the word 'Caring', it would return 'Care'. If you stem, it would return 'Car' and this is erroneous.
If you lemmatize the word 'Stripes' in verb context, it would return 'Strip'. If you lemmatize it in noun context, it would return 'Stripe'. If you just stem it, it would just return 'Strip'.
You would get the same results whether you lemmatize or stem words such as walking, running, swimming... to walk, run, swim, etc.
Lemmatization is computationally expensive since it involves look-up tables and whatnot. If you have a large dataset and performance is an issue, go with stemming. Remember you can also add your own rules to stemming. If accuracy is paramount and the dataset isn't humongous, go with lemmatization.
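A small side-by-side sketch with NLTK, if you want to see the trade-off on your own words (requires the WordNet data; the word/POS list is arbitrary):

from nltk.stem import PorterStemmer, WordNetLemmatizer

# Requires the WordNet data package, e.g. nltk.download('wordnet').
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word, pos in [("caring", "v"), ("stripes", "v"), ("stripes", "n"), ("running", "v")]:
    print(word, pos,
          "stem:", stemmer.stem(word),
          "lemma:", lemmatizer.lemmatize(word, pos))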

What are the most challenging issues in Sentiment Analysis(opinion mining)?

Opinion mining/sentiment analysis is a somewhat recent subtask of Natural Language Processing. Some compare it to text classification; some take a deeper stance towards it. What do you think are the most challenging issues in sentiment analysis (opinion mining)? Can you name a few?
The key challenges for sentiment analysis are:
1) Named Entity Recognition - What is the person actually talking about, e.g. is 300 Spartans a group of Greeks or a movie?
2) Anaphora Resolution - the problem of resolving what a pronoun, or a noun phrase refers to. "We watched the movie and went to dinner; it was awful." What does "It" refer to?
3) Parsing - What is the subject and object of the sentence, which one does the verb and/or adjective actually refer to?
4) Sarcasm - If you don't know the author you have no idea whether 'bad' means bad or good.
5) Twitter - abbreviations, lack of capitals, poor spelling, poor punctuation, poor grammar, ...
I agree with Hightechrider that those are areas where sentiment analysis accuracy can see improvement. I would also add that sentiment analysis tends to be done on closed-domain text for the most part. Attempts to do it on open-domain text usually wind up having very bad accuracy/F1 measure/what have you, or else are pseudo-open-domain because they only look at certain grammatical constructions. So I would say topic-sensitive sentiment analysis that can identify context and make decisions based on that is an exciting area for research (and industry products).
I'd also expand his 5th point from Twitter to other social media sites (e.g. Facebook, Youtube), where short, ungrammatical utterances are commonplace.
I think the answer is language complexity, mistakes in grammar, and spelling. There is a vast number of ways people express their opinions; e.g., sarcasm could be wrongly interpreted as extremely positive sentiment.
The question may be too generic, because there are several types of sentiment analysis (document level, sentence level, comparative sentiment analysis, etc.) and each type has some specific problems.
Generally speaking, I agree with the answer by Ian Mercer, and I would add 3 other issues:
How to detect more in-depth sentiment/emotion. Positive and negative is a very simple analysis; one of the challenges is how to extract emotions, like how much hate there is inside the opinion, how much happiness, how much sadness, etc.
How to detect the object that the opinion is positive for and the object that the opinion is negative for. For example, if you say "She won him!", this means a positive sentiment for her and a negative sentiment for him, at the same time.
How to analyze very subjective sentences or paragraphs. Sometimes it is very hard even for humans to agree on the sentiment of these highly subjective texts. Imagine how hard it is for a computer...
Although this is a somewhat old question, let me add a note related to Arabic sentiment analysis specifically. The Arabic language has morphological complexities and dialectal varieties which require advanced preprocessing and lexicon-building processes that surpass what is needed for English.
Please refer to:
https://www.researchgate.net/publication/280042139_Survey_on_Arabic_Sentiment_Analysis_in_Twitter
https://link.springer.com/chapter/10.1007/978-3-642-35326-0_14
