How to use NLP to detect sentences in a long text?

I am using automatic speech recognition to extract text from an audio file. However, the output is just a long sequence of words with no punctuation whatsoever. What I'd like to do is use some NLP technique to estimate beginnings and endings of sentences, or, in other words, predict positions of punctuation markers. I found that CoreNLP can do sentence splitting, but apparently only if punctuation is already present.

You may find relevant info in the answers to this other question: Sentence annotation in text without punctuation.
In particular, one of the answers claims the deepsegment package works well on unpunctuated text.

In spoken language you often find that people don't use sentences, but that the clauses simply run into each other. The degree to which this happens depends on the formality and setting -- a speech will conform more to written sentence structures than a conversation in a pub among friends.
One approach you could try is to identify words that typically begin or end sentences in written text and see whether they help you segment your data. Alternatively, look for verbs and then try to find boundaries between them; these might be clause boundaries rather than sentence boundaries, but as I said, in spoken language there often are no sentences.
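The first idea can be sketched roughly as follows. This is a toy illustration, not a tested method: the function names, the 0.5 threshold, and the tiny reference text are all assumptions made for the example, and real data would need a far larger punctuated corpus.

```python
import re
from collections import Counter

def learn_starters(punctuated_text):
    """Estimate, per word, how often it is the first word of a sentence."""
    starts, totals = Counter(), Counter()
    for sentence in re.split(r"[.!?]+\s*", punctuated_text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        if not words:
            continue
        starts[words[0]] += 1
        totals.update(words)
    return {w: starts[w] / totals[w] for w in totals}

def segment(words, start_prob, threshold=0.5):
    """Open a new segment before any word that usually starts a sentence."""
    segments, current = [], []
    for word in words:
        if current and start_prob.get(word, 0.0) >= threshold:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments

# Hypothetical punctuated reference text, used only to learn typical starters.
reference = ("However, it rained. So we stayed home. "
             "However, we were happy. So it was fine.")
probs = learn_starters(reference)
print(segment("however it rained so we stayed home".split(), probs))
```

On conversational speech the segments found this way are likely to be clauses or fragments rather than well-formed sentences.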

Related

How to extract meaning of colloquial phrases and expressions in English

I am looking into extracting the meaning of expressions used in everyday speech. For instance, it is apparent to a human that the sentence "The meal we had at restaurant A tasted like food at my granny's." means that the food was tasty.
How can I extract this meaning using a tool or a technique?
The method I've found so far is to first extract phrases using Stanford CoreNLP POS tagging, and use a Word Sense Induction tool to derive the meaning of the phrase. However, as WSI tools are used to get the meaning of words when they have multiple meanings, I am not sure if it would be the best tool to use.
What would be the best method to extract the meanings? Or is there any tool that can both identify phrases and extract their meanings?
Any help is much appreciated. Thanks in advance.
The problem you pose is a difficult one. You should use tools from sentiment analysis to get the gist of the sentence's emotional message. There are more sophisticated approaches which attempt to extract what quality is assigned to what object in the sentence (you can get this from POS-tagged sentences plus some hand-crafted information extraction rules).
However, you may also want to explore paraphrasing the more formal language into the common one and looking for those phrases. For that you would need a good (exhaustive) dictionary of common expressions to start with (slang dictionaries are sometimes available, but I am not aware of any for English right now). You could then map the colloquial expressions to more formal ones, which are more likely to be captured by an embedding space (frequently used in sentiment analysis).

Independent clause boundary disambiguation, and independent clause segmentation – any tools to do this?

I remember skimming the sentence segmentation section from the NLTK site a long time ago.
I use a crude text replacement of “period” “space” with “period” “manual line break” to achieve sentence segmentation, such as with a Microsoft Word replacement (. -> .^p) or a Chrome extension:
https://github.com/AhmadHassanAwan/Sentence-Segmentation
https://chrome.google.com/webstore/detail/sentence-segmenter/jfbhkblbhhigbgdnijncccdndhbflcha
This is instead of an NLP method like the Punkt tokenizer of NLTK.
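For reference, the crude replacement amounts to a single string substitution. A minimal Python sketch (the function name and sample text are illustrative and not taken from the linked extension) also shows where it goes wrong, since abbreviations such as "Dr." are split as well:

```python
def crude_segment(text):
    """Replace every 'period + space' with 'period + newline'."""
    return text.replace(". ", ".\n")

sample = "Dr. Smith arrived. He sat down."
lines = crude_segment(sample).split("\n")
print(lines)  # ['Dr.', 'Smith arrived.', 'He sat down.']
```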
I segment to help me more easily locate and reread sentences, which can sometimes help with reading comprehension.
What about independent clause boundary disambiguation, and independent clause segmentation? Are there any tools that attempt to do this?
Below is some example text. If an independent clause can be identified within a sentence, there’s a split. Starting from the end of a sentence, it moves left, and greedily splits:
E.g.
Sentence boundary disambiguation
(SBD), also known as sentence
breaking, is the problem in natural
language processing of deciding where
sentences begin and end.
Often, natural language processing
tools
require their input to be divided into
sentences for a number of reasons.
However, sentence boundary
identification is challenging because punctuation
marks are often ambiguous.
For example, a period may
denote an abbreviation, decimal point,
an ellipsis, or an email address - not
the end of a sentence.
About 47% of the periods in the Wall
Street Journal corpus
denote abbreviations.[1]
As well, question marks and
exclamation marks may
appear in embedded quotations,
emoticons, computer code, and slang.
Another approach is to automatically
learn a set of rules from a set of
documents where the sentence
breaks are pre-marked.
Languages like Japanese and Chinese
have unambiguous sentence-ending
markers.
The standard 'vanilla' approach to
locate the end of a sentence:
(a) If
it's a period,
it ends a sentence.
(b) If the preceding
token is on my hand-compiled list of
abbreviations, then
it doesn't end a sentence.
(c) If the next
token is capitalized, then
it ends a sentence.
This
strategy gets about 95% of sentences
correct.[2]
Solutions have been based on a maximum
entropy model.[3]
The SATZ architecture uses a neural
network to
disambiguate sentence boundaries and
achieves 98.5% accuracy.
(I’m not sure if I split it properly.)
If there are no means to segment independent clauses, are there any search terms that I can use to further explore this topic?
Thanks.
To the best of my knowledge, there is no readily available tool to solve this exact problem. Usually, NLP systems do not get into the problem of identifying different types of sentences and clauses as defined by English grammar. There is one paper published in EMNLP which provides an algorithm which uses the SBAR tag in parse trees to identify independent and dependent clauses in a sentence.
You should find section 3 of this paper useful. It discusses English syntax in some detail, but I don't think the entire paper is relevant to your question.
Note that they used the Berkeley parser (demo available here), but you can obviously use any other constituency parsing tool (e.g. the Stanford parser, demo available here).
Chthonic Project gives some good information here:
Clause Extraction using Stanford parser
Part of the answer:
It is probably better if you primarily use the constituency-based
parse tree, and not the dependencies.
The clauses are indicated by the SBAR tag, which is a clause
introduced by a (possibly empty) subordinating conjunction.
All you need to do is the following:
1. Identify the non-root clausal nodes in the parse tree.
2. Remove (but retain separately) the subtrees rooted at these clausal nodes from the main tree.
3. In the main tree (after removal of subtrees in step 2), remove any hanging prepositions, subordinating conjunctions and adverbs.
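The first two steps can be sketched in plain Python on a bracketed parse string. This is only an illustration under stated assumptions: the parse below is hand-written rather than real parser output, the helper names are made up for the example, only SBAR is treated as clausal, nested clauses inside a detached subtree are not split further, and the third step (removing hanging words) is not implemented.

```python
import re

def parse_tree(bracketed):
    """Parse a Penn Treebank style string into nested lists:
    [label, child, ...], where leaves are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        node = [tokens[pos]]          # constituent label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return node

    return read()

def leaves(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in leaves(child)]

def split_clauses(node, clausal=("SBAR",)):
    """Detach non-root clausal subtrees; return (main_tree, detached)."""
    kept, detached = [node[0]], []
    for child in node[1:]:
        if isinstance(child, str):
            kept.append(child)
        elif child[0] in clausal:
            detached.append(child)
        else:
            sub, more = split_clauses(child, clausal)
            kept.append(sub)
            detached.extend(more)
    return kept, detached

# Hand-written parse of "He left because it rained" (an assumption).
parse = ("(S (NP (PRP He)) (VP (VBD left) "
         "(SBAR (IN because) (S (NP (PRP it)) (VP (VBD rained))))))")
main, clauses = split_clauses(parse_tree(parse))
print(leaves(main))
print([leaves(c) for c in clauses])
```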
For a list of all clausal tags (and, in fact, all Penn Treebank tags),
see this list:
http://www.surdeanu.info/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html
For an online parse-tree visualization, you may want to use the
online Berkeley parser demo.
It helps a lot in forming a better intuition.
Here's the image generated for your example sentence:
I don't know of any tools that do clause segmentation, but in rhetorical structure theory there is a concept called the "elementary discourse unit", which works in a similar way to a clause. These units are sometimes, however, slightly smaller than clauses.
See section 2.0 of this manual for more information about this concept:
https://www.isi.edu/~marcu/discourse/tagging-ref-manual.pdf
Some software available online can segment sentences into their elementary discourse units, for instance:
http://alt.qcri.org/tools/discourse-parser/
and
https://github.com/jiyfeng/DPLP
Via user YourWelcomeOrMine from the subreddit /r/LanguageTechnology/:
“I would check out Stanford's CoreNLP. I believe you can customize how
a sentence is broken up.”
Via user Breakthrough from Superuser:
I've found different classifiers using
the NPS Chat Corpus training set to be
very effective for a similar
application.

How to automatically detect sentence fragments in a text file

I am working on a project and need a tool or an API to detect sentence fragments in a large text. There are many solutions, such as OpenNLP, for detecting sentences in a given file. However, I wasn't able to find any explicit solution to the problem of finding words, phrases, or even character combinations which do not belong to any grammatically correct sentence.
Any help will be greatly appreciated.
Thanks,
Lorderon
You could use n-grams as a workaround:
Suppose you have a large collection of text with real sentences for reference. You could extract all sequences of 1, 2, 3, 4, 5, or more words from it, and then check whether the fragments from your text exist as n-grams in that collection.
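A minimal sketch of that check, assuming whitespace tokenization and a toy reference collection (the function names and sample sentences are made up for illustration; a real reference set would be orders of magnitude larger):

```python
def ngrams(words, n):
    """All contiguous word sequences of length n."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_index(reference_sentences, max_n=3):
    """Collect every 1..max_n word n-gram seen in the reference sentences."""
    seen = set()
    for sentence in reference_sentences:
        words = sentence.lower().split()
        for n in range(1, max_n + 1):
            seen.update(ngrams(words, n))
    return seen

def unseen_ngrams(text, seen, n=3):
    """Return the n-grams of `text` that never occur in the reference."""
    words = text.lower().split()
    return [g for g in ngrams(words, n) if g not in seen]

reference = ["the cat sat on the mat", "the dog sat on the rug"]
index = build_index(reference)
print(unseen_ngrams("the cat sat on the rug", index))  # all trigrams known
print(unseen_ngrams("mat the on sat", index))          # suspicious trigrams
```

Spans covered only by unseen n-grams are candidates for fragments. Rare but valid sentences will also be flagged, so the result is a hint rather than a verdict.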
You can download n-grams directly from Google: http://googleresearch.blogspot.de/2006/08/all-our-n-gram-are-belong-to-you.html but that might require a lot of bandwidth.
You could also count the n-grams yourself. In that case you can take the parsed Wikipedia data sets from my website:
http://glm.rene-pickhardt.de/data/ and the source code from https://github.com/renepickhardt/generalized-language-modeling-toolkit to create the n-grams yourself (or use any other n-gram toolkit such as SRILM, Kylm, OpenGrm, ...).

Dividing a string of characters into words and sentences (English only)

I'm looking for a solution to the following task. I take a few random pages from a random book in English, remove all non-letter characters, and convert all characters to lower case. As a result I have something like:
wheniwasakidiwantedtobeapilot...
Now what I'm looking for is something that could reverse that process with reasonably good accuracy. I need to find words and sentence separators. Any ideas on how to approach this problem? Are there existing solutions I can build on without reinventing the wheel?
This is harder than normal tokenization since the basic tokenization task assumes spaces. Basically all that normal tokenization has to figure out is, for example, whether punctuation should be part of a word (like in "Mr.") or separate (like at the end of a sentence). If this is what you want, you can just download the Stanford CoreNLP package which performs this task very well with a rule-based system.
For your task, you need to figure out where to put in the spaces. This tutorial on Bayesian inference has a chapter on word segmentation in Chinese (Chinese writing doesn't use spaces). The same techniques could be applied to space-free English.
The basic idea is that you have a language model (an n-gram model would be fine) and you want to choose a splitting that maximizes the probability of the data according to the language model. So, for example, placing a space between "when" and "iwasakidiwantedtobeapilot" would give you a higher probability according to the language model than placing a split between "whe" and "niwasakidiwantedtobeapilot", because "when" is a better word than "whe". You could do this many times, adding and removing spaces, until you figured out what gave you the most English-looking sentence.
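As a rough sketch of this idea, the search over splittings can be done with dynamic programming instead of repeatedly adding and removing spaces. The version below assumes a unigram model; the probabilities are invented for illustration, and the per-character penalty for unknown strings is an arbitrary choice.

```python
import math

def segment_words(text, word_prob, max_len=20, unknown=1e-10):
    """Most probable split of `text` into words under a unigram model."""
    # best[i] holds (log-probability, words) for the best split of text[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            # Unknown strings get a penalty that grows with their length.
            p = word_prob.get(word, unknown ** len(word))
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[-1][1]

# Made-up unigram probabilities; a real model would be estimated from a corpus.
word_prob = {
    "when": 0.02, "i": 0.05, "was": 0.03, "a": 0.05, "kid": 0.005,
    "wanted": 0.005, "to": 0.04, "be": 0.02, "pilot": 0.001,
    "he": 0.03, "as": 0.01, "want": 0.005,
}
print(" ".join(segment_words("wheniwasakidiwantedtobeapilot", word_prob)))
# when i was a kid i wanted to be a pilot
```

A bigram or higher-order model would resolve more ambiguities, at the cost of a larger search state.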
Doing this will give you a long list of tokens. Then when you want to split those tokens into sentences you can actually use the same technique except instead of using a word-based language model to help you add spaces between words, you'll use a sentence-based language model to split that list of tokens into separate sentences. Same idea, just on a different level.
The tasks you describe are called "word tokenization" and "sentence segmentation". There is a lot of literature about them in NLP. They have very simple, straightforward solutions as well as advanced probabilistic approaches based on language models. Choosing one depends on your exact goal.

What methods are used for recognizing the language a text is written in?

If I have a given text (either long or short), which methods do you usually use to detect which language it is written in?
It is clear that:
You need a training corpus to train the models you use (e.g. neural networks, if used)
The easiest things that come to my mind are:
Check characters used in the text (e.g. hiragana are only used in Japanese, Umlauts probably only in European languages, ç in French, Turkish, …)
Extend the check to sequences of two or three letters to find combinations specific to a language
Look up a dictionary to check which words occur in which language (probably only without stemming, as stemming depends on the language)
But I guess there are better ways to go. I am not searching for existing projects (those questions have already been answered), but for methods like Hidden-Markov-Models, Neural Networks, … whatever may be used for this task.
In the product I'm working on, we use a dictionary-based approach.
First, relative probabilities for all words in the training corpus are calculated and stored as a model.
Then the input text is processed word by word to see whether a particular model gives the best match (much better than the other models).
In some cases all models provide quite a bad match.
A few interesting points:
As we are working with social media, both normalized and non-normalized matches are attempted (in this context normalization is the removal of diacritics from symbols). Non-normalized matches have a higher weight.
This method works rather badly on very short phrases (1-2 words), in particular when these words exist in several languages, as is the case for a few European languages.
Also, for better detection we are considering adding a per-character model as you have described (certain languages have certain unique characters).
Btw, we use the ICU library to split words. It works rather well for European and Eastern languages (currently we support Chinese).
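A rough sketch of such a dictionary-based detector, with several assumptions that are not from the product described above: whitespace tokenization instead of ICU, a weight of 0.5 for normalized matches, a "best must beat the runner-up by a factor of 2" rule, and tiny toy corpora.

```python
import unicodedata
from collections import Counter

def strip_diacritics(word):
    """Normalization in the sense used above: drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def train(corpus):
    """Relative frequency of every word in a training corpus."""
    counts = Counter(corpus.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score(text, model, normalized_weight=0.5):
    """Sum word probabilities; normalized matches count less than exact ones."""
    normalized_model = {}
    for w, p in model.items():
        key = strip_diacritics(w)
        normalized_model[key] = normalized_model.get(key, 0.0) + p
    total = 0.0
    for word in text.lower().split():
        if word in model:
            total += model[word]
        else:
            fallback = normalized_model.get(strip_diacritics(word), 0.0)
            total += normalized_weight * fallback
    return total

def detect(text, models, margin=2.0):
    """Return the best language, or None if no model clearly wins."""
    ranked = sorted(((score(text, m), lang) for lang, m in models.items()),
                    reverse=True)
    best_score, best_lang = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    if best_score == 0.0 or best_score < margin * runner_up:
        return None
    return best_lang

models = {
    "en": train("the cat is on the table and the dog is in the garden"),
    "de": train("die katze ist auf dem tisch und der hund ist im garten über dem haus"),
}
print(detect("the dog is on the table", models))  # en
print(detect("uber dem tisch", models))           # de, via a normalized match
print(detect("xyz", models))                      # None: all models match badly
```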
Check the Cavnar and Trenkle algorithm.
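In outline, that algorithm ranks the most frequent character n-grams of each language and compares a document's ranking against each language profile with an "out-of-place" distance. The sketch below is a simplified illustration rather than a faithful reimplementation: the padding scheme, the penalty for missing n-grams, and the one-sentence training texts are assumptions.

```python
from collections import Counter

def profile(text, max_n=5, size=300):
    """Rank the most frequent character n-grams (lengths 1..max_n) of a text."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"          # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(size)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; n-grams missing from the language profile
    are charged a fixed maximum penalty."""
    return sum(abs(rank - lang_profile[gram]) if gram in lang_profile
               else max_penalty
               for gram, rank in doc_profile.items())

def classify(text, lang_profiles):
    """Pick the language whose profile is closest to the document's."""
    doc = profile(text)
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# Toy training texts; real profiles need far more data per language.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the warm house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im warmen haus",
}
profiles = {lang: profile(text) for lang, text in samples.items()}
print(classify("der hund schläft", profiles))
```

With training texts this small the result on short inputs is unreliable; the method becomes useful once each language profile is built from a sizeable corpus.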

Resources