How would you use Markov chains to find characteristics of English text? - statistics

I am working on a project that involves identifying characteristics of text in specific groups. For example, let's say there are two groups of texts: one containing emails sent to employees and one containing emails sent to bosses. The idea is to explore there are syntactic or word choice differences when writing an email to someone below and above you. My initial method was to use k-means clustering to identify syntax sequences and sequences of words that were (1) unique to each group and (2) identified further sub-groups (which was successful).
However, someone has suggested that I use Markov chains to analyze individuals' email-writing habits and identify individuals' email characteristics of each group (and they say I should arrive at the same sequences). I have never used Markov chains before, and I'm quite lost as I'm trying to figure this out. How would Markov chains lead me to syntactic or word sequences? Is there a tutorial out there for this kind of thing? Any help and guidance in getting me started here is greatly appreciated.

I think you should definitely have a look at Latent Dirichlet Allocation.
This method allows capturing distributions of topics in a corpus. The main reference for this paper is Thomas L. Griffiths and Mark Steyvers (2004). If you want more hands-on materials, I'd suggest you look at this website (https://github.com/sekhansen) which provides nice tutorials on this subject.

Related

How to extract "buyer" and "seller" from a contract

I'm learning about natural language processing and I'm wondering if someone could point me in the right direction.
Say I have a bunch of contracts, and they have something like:
Joe's Farm, hereafter known as the seller,
and Bob's supermarket, hereafter known as the buyer, blah blah..
I'd like to be able to identify which party is the buyer and seller in this sentence. From what I have read, it should be theoretically possible to:
1. Give the AI a lot of sample sentences and tell it "this is the buyer/seller".
2. After training, it should be able to analyze a new sentence.
I have tried some entity extraction (tokenizing the sentence and identifying the party names) but I don't know how to tell it "this party is the buyer".
One workaround is to identify segments of the sentence and search if that has the word "buyer" in it... which probably works in most cases, but I want to try to do this in an "AI" way.
Can anyone point me to the right direction on what to research?
Looks like a coreference resolution problem to me.
Stanford CoreNLP may be a good starting point. It comes with a deterministic, a statistical, and a neural system, as well as pre-trained models.
How you solve that problem will depend on several factors, most importantly:
Do all contracts have the same format?
After specifying the seller and the buyer as you showed in your examples, do their real names appear again the text? Or just referred to as seller and buyer?
With that in mind, let's assume that the contracts have various formats and that the proper/real names of the seller and the buyer appear only once in the text somewhere in the introduction of the contract. This second assumption simplifies the problem (and is more likely to be the case in real-world contracts).
I would tackle the problem in three steps:
Teach the program how to identify the introduction of the contract (i.e. the paragraph in which the contract says who is who; kinda like your example sentences.)
Split the introduction into two parts: the part/sentence(s) where the seller is defined and the part/sentence(s) where the buyer is defined.
Finally, look into the seller part to find who the seller is and the buyer part to find who the buyer is.
To solve the 1st step, a small training dataset would be necessary. If not available, you could manually identify introductions of several contracts and use them as your training dataset. From here, Naive Bayes would probably be the simplest way to identify if a section of the contract is an introduction or not (you can randomly divide the contract into multiple chunks). Naive Bayes relies only on the frequency of tokens (not the ordering). You can read more here.
To solve the 2nd step, I would pretty much repeat what's done in the first step: use a dataset to "classify" sections of the introduction as the seller part and the buyer part. Though, this step will probably require more accuracy than the first one. So, I suggest doing a n-gram language model. That looks at the frequency of tokens, but also the ordering and succession. You can read more here. For a n-gram, you want something in between: not too short (1-gram = Naive Bayes) and not too long (> ~ 6-gram) to avoid much overlaps between the seller and buyer sentences.
For the 3rd and last step, I cannot think of a straightforward way, but I would remove the stop words (i.e. frequent English words) first. Then, I would try to find rare words in close proximity to the target terms (buyer and seller). Since we are assuming that the real names only appear once in the contract, that could be a rule that helps you identify them.
There are probably many other things you can try depending on the size/availability of training datasets, but this should give you a start.

Embeddings vs text cleaning (NLP)

I am a graduate student focusing on ML and NLP. I have a lot of data (8 million lines) and the text is usually badly written and contains so many spelling mistakes.
So i must go through some text cleaning and vectorizing. To do so, i considered two approaches:
First one:
cleaning text by replacing bad words using hunspell package which is a spell checker and morphological analyzer
+
tokenization
+
convert sentences to vectors using tf-idf
The problem here is that sometimes, Hunspell fails to provide the correct word and changes the misspelled word with another word that don't have the same meaning. Furthermore, hunspell does not reconize acronyms or abbreviation (which are very important in my case) and tends to replace them.
Second approache:
tokenization
+
using some embeddings methode (like word2vec) to convert words into vectors without cleaning text
I need to know if there is some (theoretical or empirical) way to compare this two approaches :)
Please do not hesitate to respond If you have any ideas to share, I'd love to discuss them with you.
Thank you in advance
I post this here just to summarise the comments in a longer form and give you a bit more commentary. No sure it will answer your question. If anything, it should show you why you should reconsider it.
Points about your question
Before I talk about your question, let me point a few things about your approaches. Word embeddings are essentially mathematical representations of meaning based on word distribution. They are the epitome of the phrase "You shall know a word by the company it keeps". In this sense, you will need very regular misspellings in order to get something useful out of a vector space approach. Something that could work out, for example, is US vs. UK spelling or shorthands like w8 vs. full forms like wait.
Another point I want to make clear (or perhaps you should do that) is that you are not looking to build a machine learning model here. You could consider the word embeddings that you could generate, a sort of a machine learning model but it's not. It's just a way of representing words with numbers.
You already have the answer to your question
You yourself have pointed out that using hunspell introduces new mistakes. It will be no doubt also the case with your other approach. If this is just a preprocessing step, I suggest you leave it at that. It is not something you need to prove. If for some reason you do want to dig into the problem, you could evaluate the effects of your methods through an external task as #lenz suggested.
How does external evaluation work?
When a task is too difficult to evaluate directly we use another task which is dependent on its output to draw conclusions about its success. In your case, it seems that you should pick a task that depends on individual words like document classification. Let's say that you have some sort of labels associated with your documents, say topics or types of news. Predicting these labels could be a legitimate way of evaluating the efficiency of your approaches. It is also a chance for you to see if they do more harm than good by comparing to the baseline of "dirty" data. Remember that it's about relative differences and the actual performance of the task is of no importance.

Techniques other than RegEx to discover 'intent' in sentences

I'm embarking on a project for a non-profit organization to help process and classify 1000's of reports annually from their field workers / contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on the approach to solve our problem.
I'll highlight the current process, and our challenges and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers increased the volume of reports being generated has grown and our editors are now becoming the bottle-neck.
Solution: We would like to automate the 1st step of our process i.e., checking the document for compliance to the organizational best practice template
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords. i.e., to check for compliance we use the following logic:
1 To check "states purpose" = we do a regex to match 'purpose', 'intent'
2 To check "identifies audience" = we do a regex to match with 'identifies', 'is for'
3 To check "highlights relevance" = we do a regex to match with 'able to', 'allows', 'enables'
The current approach of RegEx seems very primitive and limited so I wanted to ask the community if there is a better way to solving this problem using something like NLTK, CoreNLP.
Thanks in advance.
Interesting problem, i believe its a thorough research problem! In natural language processing, there are few techniques that learn and extract template from text and then can use them as gold annotation to identify whether a document follows the template structure. Researchers used this kind of system for automatic question answering (extract templates from question and then answer them). But in your case its more difficult as you need to learn the structure from a report. In the light of Natural Language Processing, this is more hard to address your problem (no simple NLP task matches with your problem definition) and you may not need any fancy model (complex) to resolve your problem.
You can start by simple document matching and computing a similarity score. If you have large collection of positive examples (well formatted and specified reports), you can construct a dictionary based on tf-idf weights. Then you can check the presence of the dictionary tokens. You can also think of this problem as a binary classification problem. There are good machine learning classifiers such as svm, logistic regression which works good for text data. You can use python and scikit-learn to build programs quickly and they are pretty easy to use. For text pre-processing, you can use NLTK.
Since the reports will be generated by field workers and there are few questions that will be answered by the reports (you mentioned about 3 specific components), i guess simple keyword matching techniques will be a good start for your research. You can gradually move to different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
If you don't provide yourself hints about the format of the document this is an open problem.
What you can do thought, is ask people writing report to conform to some format for the document like having 3 parts each of which have a pre-defined title like so
1. Purpose
Explains the purpose of the document in several paragraph.
2. Topic / Problem
This address the foobar problem also known as lorem ipsum feeling text.
3. Take away
What can the reader do after reading it?
You parse this document from .doc format for instance and extract the three parts. Then you can go through spell checking, grammar and text complexity algorithm. And finally you can extract for instance Named Entities (cf. Named Entity Recognition) and low TF-IDF words.
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise; CoreNLP and OpenNLP, the libraries that I'm most familiar with, have some tools that can help you with part of the task. For example; if your Regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally" etc., and you wouldn't necessarily know if the word is a verb, a noun, an adjective or an adverb. POS taggers and the parsers in these libraries would be able to tell you the type (POS) of the word and maybe you only care about the verbs that start with "inten", or more strictly, the verbs spoken by the 3rd person singular.
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger for example; you would also know that "she" is a personal pronoun and "took" is a past tense verb.
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.

NLP: retrieve vocabulary from text

I have some texts in different languages and, potentially, with some typo or other mistake, and I want to retrieve their own vocabulary. I'm not experienced with NLP in general, so maybe I use some word improperly.
With vocabulary I mean a collection of words of a single language in which every word is unique and the inflections for gender, number, or tense are not considered (e.g. think, thinks and thought are are all consider think).
This is the master problem, so let's reduce it to the vocabulary retrieving of one language, English for example, and without mistakes.
I think there are (at least) three different approaches and maybe the solution consists of a combination of them:
search in a database of words stored in relation with each others. So, I could search for thought (considering the verb) and read the associated information that thought is an inflection of think
compute the "base form" (a word without inflections) of a word by processing the inflected form. Maybe it can be done with stemming?
use a service by any API. Yes, I accept also this approach, but I'd prefer to do it locally
For a first approximation, it's not necessary that the algorithm distinguishes between nouns and verbs. For instance, if in the text there were the word thought like both noun and verb, it could be considered already present in the vocabulary at the second match.
We have reduced the problem to retrieve a vocabulary of an English text without mistakes, and without consider the tag of the words.
Any ideas about how to do that? Or just some tips?
Of course, if you have suggestions about this problem also with the others constraints (mistakes and multi-language, not only Indo-European languages), they would be much appreciated.
You need lemmatization - it's similar to your 2nd item, but not exactly (difference).
Try nltk lemmatizer for Python or Standford NLP/Clear NLP for Java. Actually nltk uses WordNet, so it is really combination of 1st and 2nd approaches.
In order to cope with mistakes use spelling correction before lemmatization. Take a look at related questions or Google for appropriate libs.
About part of speech tag - unfortunately, nltk doesn't consider POS tag (and context in general), so you should provide it with the tag that can be found by nltk pos tagging. Again, it is already discussed here (and related/linked questions). I'm not sure about Stanford NLP here - I guess it should consider context, but I was sure that NLTK does so. As I can see from this code snippet, Stanford doesn't use POS tags, while Clear NLP does.
About other languages - google for lemmatization models, since algorithm for most languages (at least from the same family) is almost the same, differences are in training data. Take a look here for example of German; it is a wrapper for several lemmatizers, as I can see.
However, you always can use stemmer at cost of precision, and stemmer is more easily available for different languages.
Topic Word has become an integral part of the rising debate in the present world. Some people perceive that Topic Word (Synonyms) beneficial, while opponents reject this notion by saying that it leads to numerous problems. From my point of view, Topic Word (Synonyms) has more positive impacts than negative around the globe. This essay will further elaborate on both positive and negative effects of this trend and thus will lead to a plausible conclusion.
On the one hand, there is a myriad of arguments in favour of my belief. The topic has a plethora of merits. The most prominent one is that the Topic Word (Synonyms). According to the research conducted by Western Sydney University, more than 70 percentages of the users were in favour of the benefits provided by the Topic Word (Synonyms). Secondly, Advantage of Essay topic. Thus, it can say that Topic Word (Synonyms) plays a vital role in our lives.
On the flip side, critics may point out that one of the most significant disadvantages of the Topic Word (Synonyms) is that due to Demerits relates to the topic. For instance, a survey conducted in the United States reveals that demerit. Consequently, this example explicit shows that it has various negative impacts on our existence.
As a result, after inspection upon further paragraphs, I profoundly believe that its benefits hold more water instead of drawbacks. Topic Word (Synonyms) has become a crucial part of our life. Therefore, efficient use of Topic Word (Synonyms) method should promote; however, excessive and misuse should condemn.

Associating free text statements with pre-defined attributes

I have a list of several dozen product attributes that people are concerned with, like
Financing
Manufacturing quality
Durability
Sales experience
and several million free-text statements from customers about the product, e.g.
"The financing was easy but the housing is flimsy."
I would like to score each free text statement in terms of how strongly it relates to each of the attributes, and whether that is a positive or negative association.
In the given example, there would be a strong positive association to Financing and a strong negative association to Manufacturing quality.
It feels like this type of problem is probably the realm of Natural Language Programming (NLP). However, I spent several hours reading up on things like OpenNLP and NLTK and find there's so much domain specific terminology that I cannot figure out where to focus to solve this specific problem.
So my three-part question:
Is NLP the correct route to solve this class of problem?
What aspect of NLP should I focus on learning for this specific problem?
Are there alternatives I have not considered?
A resource you might find handy is SentiWordNet. (http://sentiwordnet.isti.cnr.it/) Which is like a dictionary that has a sentiment grade for words. It will tell you to what degree it thinks a word is positive, negative, or objective.
You can then combine that with some nltk code that looks through your sentences for the words you want to associate the sentiment with. So you would write a script to get some level of meaningful chunks of text that surround the words you were looking at, maybe sentence or clause level. Then you can have another thing that runs through the surrounding words and grab all the sentiment scores from the SentiWordNet.
I have some old code that did this and can place on github if you'd like, but you'd still need to make your own request for SentiWordNet.
I guess your problem is more on association rather than just classification. Now moving forward with this assumption:
Is NLP the correct route to solve this class of problem?
Yes.
What aspect of NLP should I focus on learning for this specific problem?
Part of speech tagging
Sentiment analysis
Maximum entrophy
Are there alternatives I have not considered?
In depth study of automata theory with respect to NLP will help you a lot, it helped me a lot in grasping the implementations like OpenNLP.
Yes, this is a NLP problem by the name of Sentiment analysis. Sentiment analysis is an active research area with different approaches and a task where a lot of other NLP-methods have to work together, so it is certainly not the easiest field to get started with in NLP.
A more or less recent survey of the academic research in the field can be found in Pang & Lee (2008).

Resources