How to automatically detect sentence fragments in a text file - nlp

I am working on a project and need a tool or an API in order to detect sentence fragments in large text. There are many solutions such as OpenNLP for detecting sentences in given file. However, I wasn't able to find any explicit solution to the problem of finding words, phrases or event character combinations which are not belong to any grammatically correct sentences.
Any help will be greatly appreciated.
Thanks,
Lorderon

you could use n-grams as a work around:
Suppose you have a large collection of text with real sentences for reference. You could extract all sequences of 1,2,3,4,5, or more words and then in your text double check if the fragments from your text exist as n-grams.
you can download n-grams directly from google: http://googleresearch.blogspot.de/2006/08/all-our-n-gram-are-belong-to-you.html but you might need a lot of traffic.
You could also count the n-grams yourself in this case you can take the parsed data sets of the wikipedia from my website:
http://glm.rene-pickhardt.de/data/ and the source code from https://github.com/renepickhardt/generalized-language-modeling-toolkit in order to create the ngrams yourself (or any other ngram toolkit like srilm, kylm, opengrm,...)

Related

How to use NLP to detect sentences in a long text?

I am using automatic speech recognition to extract text from an audio file. However, the output is just a long sequence of words with no punctuation whatsoever. What I'd like to do is use some NLP technique to estimate beginnings and endings of sentences, or, in other words, predict positions of punctuation markers. I found that CoreNLP can do sentence splitting, but apparently only if punctuation is already present.
You may find relevant info in the answers to this other question: Sentence annotation in text without punctuation.
In particular, one of the answers claims the deepsegment package works well on unpunctuated text.
In spoken language you often find that people don't use sentences, but that the clauses simply run into each other. The degree to which this happens depends on the formality and setting -- a speech will conform more to written sentence structures than a conversation in a pub among friends.
One approach you could try is to identify words that typically begin/end sentences in written text, and see if that can help you segmenting your data. Or look for verbs, and then try to find boundaries between them; this might be clause boundaries rather than sentence boundaries, but as I said, in spoken language there often are no sentences.

How to automate finding sentences similar to the ones from a given list?

I have a list of let's say "forbidden sentences" (1000 of them, each with around 40 words). I want to create a tool that will find and mark them in a given document.
The problem is that in such document this forbidden sentence can be expressed differently than it is on this list keeping the same meaning but changed by using synonyms, a few words more or less, different word order, punctuation, grammar etc. The fact that this is all in Polish is not making things easier with each noun, pronoun, and adjective having 14 cases in total plus modifiers and gender that changes the words further. I was also thinking about making it so that the found sentences are ranked by the probability of them being forbidden with some displaying less resemblance.
I studied IT for two years but I don't have much knowledge in NLP. Do you think this is possible to be done by an amateur? Could you give me some advice on where to start, what tools to use best to put it all together? No need to be fancy, just practical. I was hoping to find some ready to use code cause i imagine this is sth that was made before. Any ideas where to find such resources or what keywords to use while searching? I'd really appreciate some help cause I'm very new to this and need to start with the basics.
Thanks in advance,
Kamila
Probably the easiest first try will be to use polish SpaCy, which is an extension of popular production-ready NLP library to support polish language.
http://spacypl.sigmoidal.io/#home
You can try to do it like this:
Split document into sentences.
Clean these sentences with spacy (deleting stopwords, punctuation, doing lemmatization - it will help you with many differnet versions of the same word)
Clean "forbidden sentences" as well
Prepare vector representation of each sentence - you can use spaCy methods
Calculate similarity between sentences - cosine similarity
You can set threshold, from which if sentences of document is similar to any of "forbidden sentences" it will be treated as forbidden
If anything is not clear let me know.
Good luck!

Building your own text corpus

It may sounds stupid, but do you know how to build text corpus? I have searched everywhere and there is already existing corpus, but I wonder how did they build it? For example, if I want to build corpus with positive and negative tweets, then I have to just make two files? But what about inner of those files? Dont get it((((
in this example he stores pos and neg tweets in RedisDB.
But what about inner of those files?
This depends mostly on what library you're using. XML (with a variety of tags) is common, as is one sentence per line. The tricky part is getting the data in the first place.
For example, if I want to build corpus with positive and negative tweets
Does this mean that you want to know how to mark the tweets as positive and negative? If so, what you're looking for is called text classification or semantic analysis.
If you want to find a bunch of tweets, I'd check one of these pages (just from a quick search of my own).
Clickonf5: http://clickonf5.org/5438/download-tweets-pdf-xml-format-local-machine-server/
Quora: http://quora.com/What-is-the-best-tool-to-download-and-archive-Twitter-data-of-certain-hashtags-and-mentions-for-academic-research
Google Groups: http://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/kfislDfxunI
For general learning about how to create a corpus, I would check out the Handbook of Natural Language Processing Wiki by Richard Xiao.

Finding words from a dictionary in a string of text

How would you go about parsing a string of free form text to detect things like locations and names based on a dictionary of location and names? In my particular application there will be tens of thousands if not more entries in my dictionaries so I'm pretty sure just running through them all is out of the question. Also, is there any way to add "fuzzy" matching so that you can also detect substrings that are within x edits of a dictionary word? If I'm not mistaken this falls within the field of natural language processing and more specifically named entity recognition (NER); however, my attempt to find information about the algorithms and processes behind NER have come up empty. I'd prefer to use Python for this as I'm most familiar with that although I'm open to looking at other solutions.
You might try downloading the Stanford Named Entity Recognizer:
http://nlp.stanford.edu/software/CRF-NER.shtml
If you don't want to use someone else's code and you want to do it yourself, I'd suggest taking a look at the algorithm in their associated paper, because the Conditional Random Field model that they use for this has become a fairly common approach to NER.
I'm not sure exactly how to answer the second part of your question on looking for substrings without more details. You could modify the Stanford program, or you could use a part-of-speech tagger to mark proper nouns in the text. That wouldn't distinguish locations from names, but it would make it very simple to find words that are x words away from each proper noun.

How can I analyze pieces of text for positive or negative words?

I'm looking for some sort of module (preferably for python) that would allow me to give that module a string about 200 characters long. The module should then return how many positive or negative words that string had. (e.g. love, like, enjoy vs. hate, dislike, bad)
I'd really like to avoid having to reinvent the wheel in natural language processing, so if there is anything you guys know of that would allow me to do what I described above, it'd be a huge time-saver if you could share.
Thanks for the help!
I think you're looking for sentiment analysis. Here's a Twitter sentiment app.
Here's a question about sentiment analysis using Python.
Before you analyse pieces of text you need to preprocess given text by striping punctuation, repair language, split spaces,lower the whole text and store the words in an iterable data structure.
For some basic sentiment analysis, following techniques can be used:
Bag of words
In bag of words technique we basically go through a bag(file) of words and check if the iterable made by us contains these. If it does then we assign some value to each word's presence in order to weigh the total sentiment of the text.
This link should help you understand more about this
https://en.wikipedia.org/wiki/Bag-of-words_model
Keyword Extraction and Tagging
Keywords and important information can be extracted from the input text by tagging the elements and then removing unwanted data.
For example:
My name is John.
Here John, name are the information and "is" isn't really needed.
Similarly verbs and other unimportant things can be removed in order to retain only the main information.
Chunking and Chinking helps.
This link must be of help.
http://nltk.org/book/ch07.html
You can tokenize your text and get the sentiment using existing sentiment analysis tools. The most comprehensive sentiment analysis tool that I know is SentiBench. This is basically a survey study of all sentiment analysis tools. As well as the code and examples on how to use the code.

Resources