Name of task - splitting up complex sentences? - nlp

I've got a question about the name of an NLP task: splitting up a complex sentence into simple ones.
For example, if I have this sentence:
"Input t on the username input box and password input box."
I'd like to split this sentence into simpler sentences:
"Input t on the username input box"
"Input t on the password input box"
What would this problem be called? I've tried clause extraction, but I don't want clauses; I want fully formed sentences. I've also looked at 'sentence simplification', but that goes beyond what I'm trying to do, with its lexical simplification and so on.
Thanks

I don't think there is a single name that everyone uses, but, for example, this paper https://arxiv.org/pdf/1805.01035 calls it split-and-rephrase, and several other papers use that term too.

Related

How to identify similar words using word2vec

Input: I have a set of words (N) and an input sentence.
Problem statement:
The sentence is dynamic; the user can give any sentence related to one business domain. We have to map the input sentence's tokens to the set of words based on closeness.
For example, different words can be used to ask questions with the same meaning, and it is hard to maintain all the synonyms, so if we have a mechanism to find similar words, we can map them easily.
1) A meeting scheduled by john
2) A meeting organized by john
The user can frame a sentence in different ways, like in the example above.
"scheduled" and "organized" are very close.
The set N contains the word "scheduled". If a user gives a sentence like (2), I have to map "organized" to "scheduled".
Take a look at "Word Mover's Distance", a way to calculate differences between texts that's essentially based on "bags of word vectors". It can be expensive to calculate, especially on longer texts, but it generally identifies "similar" ranges of text better than a simple baseline like averaging all the word vectors.
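As a rough sketch of how that might look with gensim (the pretrained-vectors file name is just an example, and wmdistance needs gensim's optional WMD dependency, pyemd or POT depending on the version):

from gensim.models import KeyedVectors

# Load pretrained word2vec vectors; use whichever vectors you have locally.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

sent1 = "a meeting scheduled by john".split()
sent2 = "a meeting organized by john".split()

# Lower distance means more similar; WMD solves a small optimal-transport
# problem over the word vectors of the two token lists.
print(wv.wmdistance(sent1, sent2))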
Beyond that, some of the deeper neural-network methods of vectorizing text, such as BERT and ELMo, may do an even more effective job of placing such "similar intent, different words" texts into close positions in a shared coordinate space.
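If you want to try that route, one convenient option (my own suggestion, not something named in the answer, and the model name is just an example) is a sentence-embedding model from the sentence-transformers package:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # example model choice
emb = model.encode(["A meeting scheduled by john",
                    "A meeting organized by john"])

# Cosine similarity close to 1.0 means the two sentences land near each other
# in the shared embedding space.
cos = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(cos)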

Correct or complete a word based on context

I am working on text normalization. I have descriptions of variables/attributes, which I need to convert to correct English.
An example is shown below:
"This is the sta of the customer's order"
The word 'sta' above needs to be converted to 'status' based on the error and the context.
I tried out a character-level encoder-decoder architecture but did not get good results. I need some direction on how to approach this problem.
input :"This is the sta of the customer's order"
output: "This is the status of the customer's order"
This is called spell checking. There are several ways to do it; one common way is to use minimum edit distance. An edit is one of these actions: adding a character, removing a character, replacing a character with another, or transposing two adjacent characters. You can apply edits to a mistaken word to generate new candidate words, and use a dictionary to check whether each candidate really exists in the English language. There may be more than one candidate to choose from for each incorrect word, and there are also ways to rank the candidates.
Reading this paper may be a good start :
A Survey of Spelling Error Detection and Correction Techniques
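As a minimal sketch of the candidate-generation step described above (the tiny dictionary here is made up; a real system would use a full word list plus candidate ranking):

import string

DICTIONARY = {"status", "state", "star", "stay", "customer", "order"}  # stand-in for a real word list

def edits1(word):
    """All strings one edit away from `word` (delete, transpose, replace, insert)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word):
    """Dictionary words reachable within one edit (falls back to the word itself)."""
    return (edits1(word) & DICTIONARY) or {word}

print(candidates("sta"))  # e.g. {'star', 'stay'}; 'status' needs more than one edit,
                          # which is where context and candidate ranking come in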

What does generate() do when using NLTK in Python?

I've been working with NLTK for the past three days to get familiar with it, and reading the "Natural Language Processing with Python" book to understand what's going on. I'm curious if someone could clarify the following for me:
Note that the first time you run this command, it is slow because it
gathers statistics about word sequences. Each time you run it, you
will get different output text. Now try generating random text in the
style of an inaugural address or an Internet chat room. Although the
text is random, it re-uses common words and phrases from the source
text and gives us a sense of its style and content. (What is lacking
in this randomly generated text?)
This part of the text, from chapter 1, simply says that it "gathers statistics" and that you will get "different output text".
What specifically does generate() do, and how does it work?
This example of generate() uses text3, which is the Book of Genesis:
In the beginning , between me and thee and in the garden thou mayest
come in unto Noah into the ark , and Mibsam , And said , Is there yet
any portion or inheritance for us , and make thee as Ephraim and as
the sand of the dukes that came with her ; and they were come . Also
he sent forth the dove out of thee , with tabret , and wept upon them
greatly ; and she conceived , and called their names , by their names
after the end of the womb ? And he
Here, the generate() function seems to simply output phrases created by cutting the text off at punctuation and randomly reassembling it, but the result is still somewhat readable.
type(text3) will tell you that text3 is of type nltk.text.Text.
To cite the documentation of Text.generate():
Print random text, generated using a trigram language model.
That means that NLTK has created an N-Gram model for the Genesis text, counting each occurrence of sequences of three words so that it can predict the most likely successor of any given two words in this text. N-Gram models are explained in more detail in chapter 5 of the NLTK book.
See also the answers to this question.
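For reference, here is a minimal way to try it yourself (this assumes the genesis corpus has been downloaded; note that generate() was removed and later reimplemented in newer NLTK versions, so its output may differ from the trigram-based text shown in the book):

import nltk
from nltk.corpus import genesis
from nltk.text import Text

nltk.download("genesis")                        # fetch the corpus behind text3
text3 = Text(genesis.words("english-kjv.txt"))  # the same text the book calls text3

print(type(text3))   # <class 'nltk.text.Text'>
text3.generate()     # prints pseudo-random text based on the source text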

Finding how relevant a text is, given a whitelist and blacklist of words/phrases

This is a case of me wanting to search for something online but not knowing what it's called.
I have a collection of job descriptions in text files, some only a sentence or two long, most a paragraph or two. I want to write a script that, given a set of rules, will notify me when it finds a job description I would want.
For example, let's say I am looking for a job in PHP programming, but not a full-time position and not a designing position. So my "rule book" could be:
want: PHP
want: web programming
want: telecommuting
do not want: designing
do not want: full-time position
What is a method I could use to sort these files into a "pass" (descriptions that match what I'm looking for) and a "fail" (descriptions are not relevant)? Some ideas I was considering:
Count the occurrences of the phrases in the text file that are also in my "rule book", and reject those that contain words that I do not want. This doesn't always work, though, because what if a description says "web designing not required"? Then my algorithm would say "That contains the word designing so it is not relevant" when it really was relevant!
When searching the text for phrases that I do and do not want, count phrases within a certain Levenshtein distance as the same phrase. For example, designing and design should be treated the same way, as well as misspellings of words, such as programing.
I have a large collection of descriptions that I have looked through manually. Is there a way I could "teach" the program "these are examples of good descriptions, these are examples of bad ones"?
Does anyone know what this "filtering process" is called, and/or have any advice or methods on how I can accomplish this?
You basically have a text classification or document classification problem. This is a specific case of binary classification, which is itself a specific case of supervised learning. It's a well-studied problem, and there are many tools to do it. Basically, you give a set of good documents and bad documents to a learning or training process, which finds words that correlate strongly with the positive and negative documents, and it outputs a function capable of classifying unseen documents as positive or not. Naive Bayes is the simplest learning algorithm for this kind of task, and it will do a decent job. There are fancier algorithms, like Logistic Regression and Support Vector Machines, which will probably do somewhat better, but they are more complicated.
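A minimal sketch of that supervised approach with scikit-learn (the tiny training set below is invented purely for illustration; you would use your manually sorted descriptions instead):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "PHP web programming, telecommuting welcome",
    "remote PHP developer for small web projects",
    "full-time graphic designing position, on-site",
    "senior UI designer, full-time role",
]
train_labels = ["pass", "pass", "fail", "fail"]  # labelled by hand, as in the question

# Bag-of-words features feeding a Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

print(clf.predict(["telecommuting PHP web programming job"]))  # likely ['pass']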
To determine which word variants are actually equivalent to each other, you want to do some kind of stemming. The Porter stemmer is a common choice here.
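For example, with NLTK's Porter stemmer (a quick sketch):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["design", "designing", "designed", "programs", "programming"]:
    print(word, "->", stemmer.stem(word))
# The "design" variants all reduce to "design", and "programs"/"programming"
# both reduce to "program", so they can be matched as the same term.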

Dividing a string of characters into words and sentences (English only)

I'm looking for a solution to the following task. I take a few random pages from a random book in English, remove all non-letter characters, and convert all characters to lower case. As a result I have something like:
wheniwasakidiwantedtobeapilot...
Now what I'm looking for is something that could reverse that process with reasonably good accuracy. I need to find the word boundaries and sentence separators. Any ideas on how to approach this problem? Are there existing solutions I can build on without reinventing the wheel?
This is harder than normal tokenization since the basic tokenization task assumes spaces. Basically, all that normal tokenization has to figure out is, for example, whether punctuation should be part of a word (like in "Mr.") or separate (like at the end of a sentence). If this is what you want, you can just download the Stanford CoreNLP package, which performs this task very well with a rule-based system.
For your task, you need to figure out where to put in the spaces. This tutorial on Bayesian inference has a chapter on word segmentation in Chinese (Chinese writing doesn't use spaces). The same techniques could be applied to space-free English.
The basic idea is that you have a language model (an N-Gram would be fine) and you want to choose a splitting that maximizes the probability of the data according to the language model. So, for example, placing a space between "when" and "iwasakidiwantedtobeapilot" would give you a higher probability according to the language model than placing a split between "whe" and "niwasakidiwantedtobeapilot" because "when" is a better word than "whe". You could do this many times, adding and removing spaces, until you figured out what gave you the most English-looking sentence.
Doing this will give you a long list of tokens. Then when you want to split those tokens into sentences you can actually use the same technique except instead of using a word-based language model to help you add spaces between words, you'll use a sentence-based language model to split that list of tokens into separate sentences. Same idea, just on a different level.
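Here is a toy sketch of the word-splitting idea with a unigram model and dynamic programming (the word probabilities are invented; a real system would estimate them from a large corpus, and an N-Gram model would do better):

import math
from functools import lru_cache

# Toy unigram "language model"; the relative frequencies are made up for illustration.
WORD_PROBS = {"when": 0.02, "i": 0.05, "was": 0.03, "a": 0.06, "kid": 0.005,
              "wanted": 0.004, "to": 0.05, "be": 0.03, "pilot": 0.0005}

def score(word):
    if word in WORD_PROBS:
        return math.log(WORD_PROBS[word])
    # Unknown spans get a probability that shrinks quickly with length,
    # which pushes the search toward splits made of known words.
    return math.log(1e-4) * len(word)

@lru_cache(maxsize=None)
def segment(text):
    """Best (log-probability, word tuple) segmentation of `text` under the model."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), 20) + 1):   # cap word length for speed
        head, tail = text[:i], text[i:]
        tail_score, tail_words = segment(tail)
        candidates.append((score(head) + tail_score, (head,) + tail_words))
    return max(candidates)

print(segment("wheniwasakidiwantedtobeapilot")[1])
# -> ('when', 'i', 'was', 'a', 'kid', 'i', 'wanted', 'to', 'be', 'a', 'pilot')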
The tasks you describe are called "word tokenization" and "sentence segmentation". There is a lot of literature about them in NLP. They have very simple, straightforward solutions, as well as advanced probabilistic approaches based on language models. Choosing one depends on your exact goal.

Resources