Dividing a string of characters into words and sentences (English only)

I'm looking for a solution to the following task. I take a few random pages from a random book in English, remove all non-letter characters, and convert all characters to lower case. As a result I have something like:
wheniwasakidiwantedtobeapilot...
Now what I'm looking for is something that could reverse that process with reasonably good accuracy. I need to find word and sentence separators. Any ideas how to approach this problem? Are there existing solutions I can build on without reinventing the wheel?

This is harder than normal tokenization, since the basic tokenization task assumes spaces. Basically, all that normal tokenization has to figure out is, for example, whether punctuation should be part of a word (as in "Mr.") or separate (as at the end of a sentence). If this is what you want, you can just download the Stanford CoreNLP package, which performs this task very well with a rule-based system.
For your task, you need to figure out where to put in the spaces. This tutorial on Bayesian inference has a chapter on word segmentation in Chinese (Chinese writing doesn't use spaces). The same techniques could be applied to space-free English.
The basic idea is that you have a language model (an N-gram model would be fine) and you want to choose a splitting that maximizes the probability of the data according to the language model. So, for example, placing a space between "when" and "iwasakidiwantedtobeapilot" would give you a higher probability according to the language model than placing a split between "whe" and "niwasakidiwantedtobeapilot", because "when" is a better word than "whe". You could do this many times, adding and removing spaces, until you figure out what gives you the most English-looking sentence.
Doing this will give you a long list of tokens. Then when you want to split those tokens into sentences you can actually use the same technique except instead of using a word-based language model to help you add spaces between words, you'll use a sentence-based language model to split that list of tokens into separate sentences. Same idea, just on a different level.
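A minimal sketch of the word-splitting idea, using a unigram language model and dynamic programming instead of repeated local edits. The word counts here are invented for illustration; in practice you would estimate them from a large corpus:

```python
import math

# Hypothetical unigram counts; in practice, estimate these from a large corpus.
WORD_COUNTS = {"when": 500, "i": 900, "was": 600, "a": 1000, "kid": 50,
               "wanted": 80, "to": 950, "be": 700, "pilot": 5}
TOTAL = sum(WORD_COUNTS.values())

def word_logprob(word):
    # Smoothed unigram log-probability; unseen strings are penalized,
    # and longer unseen strings are penalized more.
    count = WORD_COUNTS.get(word, 0)
    if count == 0:
        return math.log(1.0 / TOTAL) - 5 * len(word)
    return math.log(count / TOTAL)

def segment(text, max_word_len=20):
    # best[i] = (score, split point) for the best segmentation of text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_word_len), i):
            score = best[j][0] + word_logprob(text[j:i])
            if score > best[i][0]:
                best[i] = (score, j)
    # Recover the words by following the back-pointers.
    words, i = [], len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))

print(segment("wheniwasakidiwantedtobeapilot"))
```

With a real N-gram model the scoring function would condition on the previous word(s) instead of treating each word independently, but the dynamic program stays the same.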

The tasks you describe are called "word tokenization" and "sentence segmentation". There is a lot of literature about them in NLP. They have very simple, straightforward solutions as well as advanced probabilistic approaches based on language models. Which one to choose depends on your exact goal.

Related

How to automate finding sentences similar to the ones from a given list?

I have a list of, let's say, "forbidden sentences" (1,000 of them, each around 40 words long). I want to create a tool that will find and mark them in a given document.
The problem is that in such a document a forbidden sentence can be expressed differently than it is on the list, keeping the same meaning but changed by using synonyms, a few words more or less, different word order, punctuation, grammar, etc. The fact that this is all in Polish is not making things easier, with each noun, pronoun, and adjective having 14 forms in total, plus modifiers and gender that change the words further. I was also thinking of ranking the found sentences by the probability of their being forbidden, with some displaying less resemblance.
I studied IT for two years, but I don't have much knowledge of NLP. Do you think this can be done by an amateur? Could you give me some advice on where to start and which tools are best to put it all together? No need to be fancy, just practical. I was hoping to find some ready-to-use code, because I imagine something like this has been made before. Any ideas where to find such resources, or which keywords to use while searching? I'd really appreciate some help, because I'm very new to this and need to start with the basics.
Thanks in advance,
Kamila
Probably the easiest first try would be to use Polish spaCy, which is an extension of the popular production-ready NLP library spaCy to support the Polish language.
http://spacypl.sigmoidal.io/#home
You can try to do it like this:
Split document into sentences.
Clean these sentences with spaCy (deleting stopwords and punctuation, and doing lemmatization - this will help you with the many different forms of the same word)
Clean "forbidden sentences" as well
Prepare vector representation of each sentence - you can use spaCy methods
Calculate similarity between sentences - cosine similarity
You can set a threshold above which a sentence of the document that is similar to any of the "forbidden sentences" is treated as forbidden
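The steps above could be sketched roughly as follows. The bag-of-words vectorizer and the threshold value are toy placeholders - with spaCy you would use `doc.vector` (computed over lemmatized, stopword-filtered tokens) instead:

```python
import math
from collections import Counter

def vectorize(sentence):
    # Toy bag-of-words vector; with spaCy you would use doc.vector instead.
    return Counter(sentence.lower().split())

def cosine(v1, v2):
    # Cosine similarity between two sparse count vectors.
    dot = sum(v1[w] * v2[w] for w in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def find_forbidden(document_sentences, forbidden_sentences, threshold=0.8):
    # Flag every document sentence whose best match against the
    # forbidden list exceeds the threshold.
    forbidden_vecs = [vectorize(s) for s in forbidden_sentences]
    flagged = []
    for sent in document_sentences:
        v = vectorize(sent)
        score = max((cosine(v, fv) for fv in forbidden_vecs), default=0.0)
        if score >= threshold:
            flagged.append((sent, score))
    return flagged
```

Real word vectors would also catch synonyms and paraphrases that exact word counts miss, which is the main reason to use spaCy's embeddings here.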
If anything is not clear let me know.
Good luck!

How to extract meaning of colloquial phrases and expressions in English

I am looking into extracting the meaning of expressions used in everyday speaking. For instance, it is apparent to a human that the sentence The meal we had at restaurant A tasted like food at my granny's. means that the food was tasty.
How can I extract this meaning using a tool or a technique?
The method I've found so far is to first extract phrases using Stanford CoreNLP POS tagging, and then use a Word Sense Induction tool to derive the meaning of the phrase. However, as WSI tools are meant to get the meaning of words that have multiple senses, I am not sure it is the best tool to use.
What would be the best method to extract the meanings? Or is there any tool that can both identify phrases and extract their meanings?
Any help is much appreciated. Thanks in advance.
The problem you pose is a difficult one. You could use tools from sentiment analysis to get a gist of the sentence's emotional message. There are more sophisticated approaches which attempt to extract which quality is assigned to which object in the sentence (this you can get from POS-tagged sentences plus some hand-crafted information extraction rules).
However, you may also want to explore paraphrasing between the more formal language and the colloquial one, and looking for those phrases. For that you would need a good (exhaustive) dictionary of common expressions to start with (slang dictionaries are sometimes available, but I am not aware of one for English right now). You could then map the colloquial expressions to more formal ones, which are likely to be caught by some embedding space (frequently used in sentiment analysis).
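To illustrate the sentiment-analysis route, here is a deliberately tiny lexicon-based scorer. The word lists are invented stand-ins; real systems (e.g. VADER in NLTK) use far richer lexicons plus negation and intensity handling:

```python
# Toy sentiment lexicons; real lexicons contain thousands of scored entries.
POSITIVE = {"tasty", "good", "great", "delicious", "love"}
NEGATIVE = {"bad", "awful", "bland", "hate", "terrible"}

def sentiment(sentence):
    # Score a sentence by counting positive vs. negative lexicon hits.
    words = [w.strip(".,!?'\"") for w in sentence.lower().split()]
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Note that this scorer returns "neutral" for "The meal tasted like food at my granny's" because no lexicon word appears - which is exactly why colloquial and figurative expressions are hard, and why the paraphrase-mapping idea above is worth exploring.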

Embeddings vs text cleaning (NLP)

I am a graduate student focusing on ML and NLP. I have a lot of data (8 million lines), and the text is usually badly written and contains many spelling mistakes.
So I must do some text cleaning and vectorizing. To do so, I considered two approaches:
First one:
cleaning the text by replacing bad words using the hunspell package, which is a spell checker and morphological analyzer
+
tokenization
+
converting sentences to vectors using TF-IDF
The problem here is that sometimes hunspell fails to provide the correct word and replaces the misspelled word with another word that doesn't have the same meaning. Furthermore, hunspell does not recognize acronyms or abbreviations (which are very important in my case) and tends to replace them.
Second approach:
tokenization
+
using some embedding method (like word2vec) to convert words into vectors without cleaning the text
I need to know if there is some (theoretical or empirical) way to compare these two approaches :)
Please do not hesitate to respond if you have any ideas to share; I'd love to discuss them with you.
Thank you in advance
I post this here just to summarise the comments in longer form and give you a bit more commentary. Not sure it will answer your question; if anything, it should show you why you should reconsider it.
Points about your question
Before I talk about your question, let me point out a few things about your approaches. Word embeddings are essentially mathematical representations of meaning based on word distribution. They are the epitome of the phrase "You shall know a word by the company it keeps". In this sense, you will need very regular misspellings in order to get something useful out of a vector-space approach. Something that could work, for example, is US vs. UK spelling, or shorthands like w8 vs. full forms like wait.
Another point I want to make clear (or perhaps you should) is that you are not looking to build a machine learning model here. You could consider the word embeddings that you generate a sort of machine learning model, but they are not. They are just a way of representing words with numbers.
You already have the answer to your question
You yourself have pointed out that using hunspell introduces new mistakes. This will no doubt also be the case with your other approach. If this is just a preprocessing step, I suggest you leave it at that; it is not something you need to prove. If for some reason you do want to dig into the problem, you could evaluate the effects of your methods through an external task, as #lenz suggested.
How does external evaluation work?
When a task is too difficult to evaluate directly, we use another task that depends on its output to draw conclusions about its success. In your case, it seems you should pick a task that depends on individual words, like document classification. Say you have some labels associated with your documents, such as topics or types of news. Predicting these labels could be a legitimate way of evaluating the efficiency of your approaches. It also gives you a chance to see whether they do more harm than good, by comparing against the baseline of "dirty" data. Remember that it's about relative differences; the absolute performance on the task is of no importance.

Independent clause boundary disambiguation, and independent clause segmentation – any tools to do this?

I remember skimming the sentence segmentation section from the NLTK site a long time ago.
I use a crude text replacement of “period” “space” with “period” “manual line break” to achieve sentence segmentation, such as with a Microsoft Word replacement (. -> .^p) or a Chrome extension:
https://github.com/AhmadHassanAwan/Sentence-Segmentation
https://chrome.google.com/webstore/detail/sentence-segmenter/jfbhkblbhhigbgdnijncccdndhbflcha
This is instead of an NLP method like the Punkt tokenizer of NLTK.
I segment to help me more easily locate and reread sentences, which can sometimes help with reading comprehension.
What about independent clause boundary disambiguation, and independent clause segmentation? Are there any tools that attempt to do this?
Below is some example text. If an independent clause can be identified within a sentence, there’s a split. Starting from the end of a sentence, it moves left, and greedily splits:
E.g.
Sentence boundary disambiguation
(SBD), also known as sentence
breaking, is the problem in natural
language processing of deciding where
sentences begin and end.
Often, natural language processing
tools
require their input to be divided into
sentences for a number of reasons.
However, sentence boundary
identification is challenging because punctuation
marks are often ambiguous.
For example, a period may
denote an abbreviation, decimal point,
an ellipsis, or an email address - not
the end of a sentence.
About 47% of the periods in the Wall
Street Journal corpus
denote abbreviations.[1]
As well, question marks and
exclamation marks may
appear in embedded quotations,
emoticons, computer code, and slang.
Another approach is to automatically
learn a set of rules from a set of
documents where the sentence
breaks are pre-marked.
Languages like Japanese and Chinese
have unambiguous sentence-ending
markers.
The standard 'vanilla' approach to
locate the end of a sentence:
(a) If
it's a period,
it ends a sentence.
(b) If the preceding
token is on my hand-compiled list of
abbreviations, then
it doesn't end a sentence.
(c) If the next
token is capitalized, then
it ends a sentence.
This
strategy gets about 95% of sentences
correct.[2]
Solutions have been based on a maximum
entropy model.[3]
The SATZ architecture uses a neural
network to
disambiguate sentence boundaries and
achieves 98.5% accuracy.
(I’m not sure if I split it properly.)
If there are no means to segment independent clauses, are there any search terms that I can use to further explore this topic?
Thanks.
To the best of my knowledge, there is no readily available tool to solve this exact problem. Usually, NLP systems do not get into the problem of identifying different types of sentences and clauses as defined by English grammar. There is one paper published in EMNLP which provides an algorithm that uses the SBAR tag in parse trees to identify independent and dependent clauses in a sentence.
You should find section 3 of this paper useful. It talks about English language syntax in some details, but I don't think the entire paper is relevant to your question.
Note that they used the Berkeley parser (demo available here), but you can obviously use any other constituency parsing tool (e.g. the Stanford parser, demo available here).
Chthonic Project gives some good information here:
Clause Extraction using Stanford parser
Part of the answer:
It is probably better if you primarily use the constituency-based
parse tree, and not the dependencies.
The clauses are indicated by the SBAR tag, which marks a clause
introduced by a (possibly empty) subordinating conjunction.
All you need to do is the following:
Identify the non-root clausal nodes in the parse tree
Remove (but retain separately) the subtrees rooted at these clausal nodes from the main tree.
In the main tree (after removal of subtrees in step 2), remove any hanging prepositions, subordinating conjunctions and adverbs.
For a list of all clausal tags (and, in fact, all Penn Treebank tags),
see this list:
http://www.surdeanu.info/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html
For an online parse-tree visualization, you may want to use the
online Berkeley parser demo.
It helps a lot in forming a better intuition.
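The quoted steps could be sketched like this, identifying and collecting the SBAR subtrees (without the pruning of steps 2-3). The bracketed parse string is hard-coded here for illustration; a real pipeline would obtain it from the Berkeley or Stanford parser:

```python
def tokenize_tree(s):
    # Split a bracketed Penn Treebank-style string into tokens.
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    # Recursively build (label, children) tuples from bracketed notation.
    token = tokens.pop(0)
    assert token == "("
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        if tokens[0] == "(":
            children.append(parse(tokens))
        else:
            children.append(tokens.pop(0))  # leaf word
    tokens.pop(0)  # drop the closing ")"
    return (label, children)

def extract_sbar(node, found):
    # Collect the words under every SBAR subtree (the subordinate clauses).
    label, children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(extract_sbar(child, found))
    if label == "SBAR":
        found.append(" ".join(words))
    return words

tree = parse(tokenize_tree(
    "(S (NP (PRP I)) (VP (VBD left) (SBAR (IN because) "
    "(S (NP (PRP it)) (VP (VBD rained))))))"))
clauses = []
extract_sbar(tree, clauses)
print(clauses)
```

Removing each SBAR subtree from the main tree (step 2) and then stripping hanging conjunctions and adverbs (step 3) would leave the independent clause behind.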
I don't know of any tools that do clause segmentation, but in Rhetorical Structure Theory there is a concept called an "elementary discourse unit" (EDU), which works in a similar way to a clause. EDUs are sometimes, however, slightly smaller than clauses.
You may see the section 2.0 of this manual for more information about this concept:
https://www.isi.edu/~marcu/discourse/tagging-ref-manual.pdf
There is some software available online that can segment sentences into their elementary discourse units, for instance:
http://alt.qcri.org/tools/discourse-parser/
and
https://github.com/jiyfeng/DPLP
Via user YourWelcomeOrMine from the subreddit /r/LanguageTechnology/:
“I would check out Stanford's CoreNLP. I believe you can customize how
a sentence is broken up.”
Via user Breakthrough from Superuser:
I've found different classifiers using
the NPS Chat Corpus training set to be
very effective for a similar
application.

What methods are used for recognizing language a text is written in?

If I have a given text (long or short), which methods do you usually use to detect which language it is written in?
It is clear that:
You need a training corpus to train the models you use (e.g. neural networks, if used)
The easiest things coming to my mind are:
Check characters used in the text (e.g. hiragana are only used in Japanese, Umlauts probably only in European languages, ç in French, Turkish, …)
Extend the check to two- or three-letter combinations to find character sequences specific to a language
Look up a dictionary to check which words occur in which language (probably only without stemming, as stemming depends on the language)
But I guess there are better ways to go. I am not searching for existing projects (those questions have already been answered), but for methods - Hidden Markov Models, neural networks, whatever may be used for this task.
In a product I'm working on we use a dictionary-based approach.
First, relative probabilities for all words in the training corpus are calculated, and this is stored as a model.
Then the input text is processed word by word to see if a particular model gives the best match (a much better match than the other models).
In some cases all models provide quite a bad match.
A few interesting points:
As we are working with social media, both normalized and non-normalized matches are attempted (in this context, normalization is the removal of diacritics from symbols). Non-normalized matches have a higher weight.
This method works rather badly on very short phrases (1-2 words), in particular when these words exist in several languages, which is the case for a few European languages.
Also, for better detection, we are considering adding a per-character model as you have described (certain languages have certain unique characters).
By the way, we use the ICU library to split words. It works rather well for European and Eastern languages (currently we support Chinese).
Check out the Cavnar and Trenkle algorithm.
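A minimal sketch of the Cavnar and Trenkle idea: rank character n-grams by frequency per language and compare rank orders with an "out-of-place" measure. The training snippets below are toy stand-ins for real per-language corpora:

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    # Rank character n-grams (1..n_max) by frequency, most frequent first,
    # as in Cavnar & Trenkle's "N-Gram-Based Text Categorization".
    text = " " + text.lower() + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    ranked = [g for g, _ in counts.most_common(top_k)]
    return {g: rank for rank, g in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    # Sum of rank differences; n-grams missing from the language profile
    # get a maximum penalty.
    max_penalty = len(lang_profile)
    return sum(abs(rank - lang_profile.get(g, max_penalty))
               for g, rank in doc_profile.items())

def detect(text, lang_profiles):
    # The language whose profile is "closest" in rank order wins.
    doc = ngram_profile(text)
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

Because the profiles are built from character n-grams rather than whole words, this degrades more gracefully on short or noisy input than a pure dictionary lookup, which is why it combines well with the dictionary-based approach described above.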