Natural Language Generation - how to test if it sounds natural - text

I just have a set of sentences, which I have generated based on painting analysis. However I need to test how natural they sound. Is there any api or application which does this?
I am using the Standford Parser to give me a breakdown, but this doesn't exactly do the job I want!
Also can one test how similar sentences are? As I randomly generating parts of sentences and want to check the variety of the sentences produced.

A lot of NLP stuff works using things called 'Language Models'.
A language model is something that can take in some text and return a probability. This probability should typically be indicative of how "likely" the given text is.
You typically build a language model by taking a large chunk of text (which we call the "training corpus") and computing some statistics out of it (which represent your "model"), and then using those statistics to take in new, previously unseen sentences and returning probabilities for them.
You should probably google for "language models", "unigram models", "n-gram models" and click on some of the results to find some article or presentation which helps you understand the previous sentence. (Its hard for me to recommend an appropriate tutorial for you because I don't know what your existing background is)
Anyway, one way to think about language models is that they are systems that take in new text and tell you how similar the new text is to the training corpus the language model was made out of. So if you build 2 language models, one out of all the plays written by Shakespeare and another out of a large number of legal documents, then the second one should be giving you a much higher probability to sentences for some new legal document that just got released (as compared to the first model) while the first model should give you a much higher probability for some other old english play (written by some other author) because that play is probably more similar to Shakespeare (in terms of the kind of words used, sentence lengths, grammar, etc) than it is to modern legal language.
All the things you see the stanford parser give you back for a sentence you give it are generated using language models. One way to think about how those features are built is to pretend that the computer tried every possible combination of tags and every possible parse tree for the sentence you gave it, and used some clever language model to identify which is most probable sequence of tags and most probable parse tree out there, and returned those back to you.
Getting back to your problem, you need to build a language model out of what you consider natural sounding text and then use that language model to evaluate the sentences you want to measure the naturalness of. To do this, you will have to identify a good training corpus and decide on what type of language model you want to build.
If you can't think of anything better, a collection of wikipedia articles might serve to be a good training corpus representing what natural sounding english looks like.
As for model type, an "n-gram model" would probably be good enough for your task. More complicated models like "Hidden Markov Models" and "PCFG's" (the stuff that is powering the stanford page you linked to) would definitely make things even better, but n-grams are definitely the most simple thing you could start with.

Related

Determining Grammatical Validity of Text Input

I am looking for some way to determine if textual input takes the form of a valid sentence; I would like to provide a warning to the user if not. Examples of input I would like to warn the user about:
"dog hat can ah!"
"slkj ds dsak"
It seems like this is a difficult problem, since grammars are usually derived from textbanks, and the words in the provided sentence input might not appear in the grammar. It also seems like parsers maybe make assumptions that the textual input is comprised of valid English words to begin with. (just my brief takeaway from playing around with Stanford NLP's GUI tool). My questions are as follows:
Is there some tool available to scan through text input and determine if it is made up of valid English words, or at least offer a probability on that? If not, I can write this, just wondering if it already exists. I figure this would be step 1 before determining grammatical correctness.
My understanding is that determining whether a sentence is grammatically correct is done simply by attempting to parse the sentence and see if it is possible. Is that accurate? Are there probabilistic parsers that offer a degree of confidence when ambiguity is encountered? (e.g., a proper noun not recognized)
I hesitate to ask this last question, since I saw it was asked on SO over a decade ago, but any updates as to whether there is a basic, readily available grammar for NLTK? I know English isn't simple, but I am truly just looking to parse relatively simple, single sentence input.
Thanks!
A starting point are classification models trained on the Corpus of Linguistic Acceptability (CoLA) task. There are several recent blog articles on how to fine tune the BERT models from HuggingFace (python) for this task. Here is one such blog article. You can also find already fine-tuned models for CoLA for various BERT flavors in the HuggingFace model zoo.

NLP - linguistic consistency analysis

I hope you can help me :).
I am working for a translation company.
As you know, every translation consists in splitting the original text into small segments and then re-joining them into the final product.
In other words, the segments are considered as "translation units".
Often, especially for large documents, the translators make some linguistic consistency errors, I try to explain it with an example.
In Spanish, you can use "tu" or "usted", depending on the context, and this determines the formality-informality tone of the sentence.
So, if you consider these two sentences of a document:
Lara, te has lavado las manos? (TU)
Lara usted se lavò las manos? (USTED)
They are BOTH correct, but if you consider the whole document, there is a linguistic inconsistency.
I am studying NLP basic in my spare time, and I am figuring out how to create a tool to perform a linguistic consistency analysis on a set of sentences.
I am looking in particular at Standford CoreNLP (I prefer Java to Python).
I guess that I need some linguistic tools to perform verb analysis first of all. And naturally, the tool would be able to work with different languages (EN, IT, ES, FR, PT).
Anyone can help me to figure out how to start this?
Any help would be appreciated,
thanks in advance!
Im not sure about Stanford CoreNLP, but if you're considering this an option, you could make your own tagger and use modifiers at pos tagging. Then, use this as a translation feature.
In other words, instead of just tagging a word to be a verb, you could tag it "a verb in the infinitive second person".
There are already good pre-tagged corpora out there for spanish that can help you do exactly that. For example, if you look at Universal Dependencies Ankora Corpus, you can find that there are annotations referring to the Person of a verb.
With a little tweaking, you could make a compose PoS that takes in "Verb-1st-Person" or something like that and train a Tagger.
I've made an article about how to do it in Python, but I bet that you can do it in Java using Weka. You can read the article here.
After this, I guess that the next step is that you ensure to match the person of one "translation unit" to the other, or make something in a pipeline fashion.

Techniques other than RegEx to discover 'intent' in sentences

I'm embarking on a project for a non-profit organization to help process and classify 1000's of reports annually from their field workers / contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on the approach to solve our problem.
I'll highlight the current process, and our challenges and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers increased the volume of reports being generated has grown and our editors are now becoming the bottle-neck.
Solution: We would like to automate the 1st step of our process i.e., checking the document for compliance to the organizational best practice template
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords. i.e., to check for compliance we use the following logic:
1 To check "states purpose" = we do a regex to match 'purpose', 'intent'
2 To check "identifies audience" = we do a regex to match with 'identifies', 'is for'
3 To check "highlights relevance" = we do a regex to match with 'able to', 'allows', 'enables'
The current approach of RegEx seems very primitive and limited so I wanted to ask the community if there is a better way to solving this problem using something like NLTK, CoreNLP.
Thanks in advance.
Interesting problem, i believe its a thorough research problem! In natural language processing, there are few techniques that learn and extract template from text and then can use them as gold annotation to identify whether a document follows the template structure. Researchers used this kind of system for automatic question answering (extract templates from question and then answer them). But in your case its more difficult as you need to learn the structure from a report. In the light of Natural Language Processing, this is more hard to address your problem (no simple NLP task matches with your problem definition) and you may not need any fancy model (complex) to resolve your problem.
You can start by simple document matching and computing a similarity score. If you have large collection of positive examples (well formatted and specified reports), you can construct a dictionary based on tf-idf weights. Then you can check the presence of the dictionary tokens. You can also think of this problem as a binary classification problem. There are good machine learning classifiers such as svm, logistic regression which works good for text data. You can use python and scikit-learn to build programs quickly and they are pretty easy to use. For text pre-processing, you can use NLTK.
Since the reports will be generated by field workers and there are few questions that will be answered by the reports (you mentioned about 3 specific components), i guess simple keyword matching techniques will be a good start for your research. You can gradually move to different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
If you don't provide yourself hints about the format of the document this is an open problem.
What you can do thought, is ask people writing report to conform to some format for the document like having 3 parts each of which have a pre-defined title like so
1. Purpose
Explains the purpose of the document in several paragraph.
2. Topic / Problem
This address the foobar problem also known as lorem ipsum feeling text.
3. Take away
What can the reader do after reading it?
You parse this document from .doc format for instance and extract the three parts. Then you can go through spell checking, grammar and text complexity algorithm. And finally you can extract for instance Named Entities (cf. Named Entity Recognition) and low TF-IDF words.
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise; CoreNLP and OpenNLP, the libraries that I'm most familiar with, have some tools that can help you with part of the task. For example; if your Regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally" etc., and you wouldn't necessarily know if the word is a verb, a noun, an adjective or an adverb. POS taggers and the parsers in these libraries would be able to tell you the type (POS) of the word and maybe you only care about the verbs that start with "inten", or more strictly, the verbs spoken by the 3rd person singular.
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger for example; you would also know that "she" is a personal pronoun and "took" is a past tense verb.
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.

How to compare complexities of corpora?

I would like to compare how complex (varied or predictable) are my three corpora. They are from different topics, so some vocabulary is different, some is the same. Looking at one of the data sets it's clear that the syntax is more difficult than in the other two, sentences are longer, etc. I built word N-Gram language models using the SRILM toolkit (I'm new to language modelling) with the idea that I can then compare these models. One measure mentioned in relation to language models is perplexity. I'm confused about the following question: Can I just use perplexities of the three LMs directly as a measure of how varied are the corpora? The vocabulary and the sizes of the corpora are different, so now I think that this won't be a good comparison. I also built LMs from POS-Tags but the quality of the POS-Tagging result is not good because the language is from fora, has spelling mistakes, ungrammatical sentences and so on. What measures could be used to compare complexity of corpora from different domains? I'd appreciate your advise.
[I'm new to Stackexchange. I posted this on Crossvalidated, but I think maybe here is more appropriate forum.]
"I also built LMs from POS-Tags but the quality of the POS-Tagging result is not good because the language is from fora, has spelling mistakes, ungrammatical sentences and so on."
Aside from it being noisy, like you pointed out, you should think carefully about whether particular linguistic features are useful in your analysis. Does one corpus having proportionally more nouns move you toward what you want to learn about the corpora? Maybe in something like authorship attribution, but I can't really think of anywhere else that's effective.
If data sparsity is an issue, LSI can help by collapsing related terms together. This could also help with the spelling issues, collapsing poorly spelt words with their correct counterparts if they appear in similar contexts.
"The vocabulary and the sizes of the corpora are different, so now I think that this won't be a good comparison."
It's not the end of the world. Having more data is always better, but you can work with what you have.
If you haven't chosen a language model yet, there's a few decisions you have to make:
Are you going to smooth the data? How?
Are you going to use an advanced technique to better exploit the data, such as Latent Semantic Indexing (LSI)?
You mention that you have a language model; I'm assuming your language model is a probability distribution such that P(N-gram|topic). If this is correct, you've already normalized the data, so the two probability distributions should be readily comparable. Having more data would get you a more reliable result, but if your corpora are "big enough" to sample each topic reliably, you can move right into comparison.
As for comparison, try the KL-Divergence. KL-Divergence is "a measure of the information lost when Q is used to approximate P." Less loss means that the corpora are more similar. If you want a symmetric comparison, a "cheap" way to do it is to add D(P||Q) + D(Q||P). Note, though:
The KL divergence is only defined if Q(i)=0 ⇒ P(i)=0, for all i (absolute continuity).
So you'll have to smooth, in some manner.

Document Analysis and Tagging

Let's say I have a bunch of essays (thousands) that I want to tag, categorize, etc. Ideally, I'd like to train something by manually categorizing/tagging a few hundred, and then let the thing loose.
What resources (books, blogs, languages) would you recommend for undertaking such a task? Part of me thinks this would be a good fit for a Bayesian Classifier or even Latent Semantic Analysis, but I'm not really familiar with either other than what I've found from a few ruby gems.
Can something like this be solved by a bayesian classifier? Should I be looking more at semantic analysis/natural language processing? Or, should I just be looking for keyword density and mapping from there?
Any suggestions are appreciated (I don't mind picking up a few books, if that's what's needed)!
Wow, that's a pretty huge topic you are venturing into :)
There is definitely a lot of books and articles you can read about it but I will try to provide a short introduction. I am not a big expert but I worked on some of this stuff.
First you need to decide whether you are want to classify essays into predefined topics/categories (classification problem) or you want the algorithm to decide on different groups on its own (clustering problem). From your description it appears you are interested in classification.
Now, when doing classification, you first need to create enough training data. You need to have a number of essays that are separated into different groups. For example 5 physics essays, 5 chemistry essays, 5 programming essays and so on. Generally you want as much training data as possible but how much is enough depends on specific algorithms. You also need verification data, which is basically similar to training data but completely separate. This data will be used to judge quality (or performance in math-speak) of your algorithm.
Finally, the algorithms themselves. The two I am familiar with are Bayes-based and TF-IDF based. For Bayes, I am currently developing something similar for myself in ruby, and I've documented my experiences in my blog. If you are interested, just read this - http://arubyguy.com/2011/03/03/bayes-classification-update/ and if you have any follow up questions I will try to answer.
The TF-IDF is a short for TermFrequence - InverseDocumentFrequency. Basically the idea is for any given document to find a number of documents in training set that are most similar to it, and then figure out it's category based on that. For example if document D is similar to T1 which is physics and T2 which is physics and T3 which is chemistry, you guess that D is most likely about physics and a little chemistry.
The way it's done is you apply the most importance to rare words and no importance to common words. For instance 'nuclei' is rare physics word, but 'work' is very common non-interesting word. (That's why it's called inverse term frequency). If you can work with Java, there is a very very good Lucene library which provides most of this stuff out of the box. Look for API for 'similar documents' and look into how it is implemented. Or just google for 'TF-IDF' if you want to implement your own
I've done something similar in the past (though it was for short news articles) using some vector-cluster algorithm. I don't remember it right now, it was what Google used in its infancy.
Using their paper I was able to have a prototype running in PHP in one or two days, then I ported it to Java for speed purposes.
http://en.wikipedia.org/wiki/Vector_space_model
http://www.la2600.org/talks/files/20040102/Vector_Space_Search_Engine_Theory.pdf

Resources