OpenNLP SentenceDetector doesn't recognize whole sentence - nlp

I'm working on a research project and I need an NLP program to detect sentences in many different circumstances. I was advised to use OpenNLP, and I was convinced after reading its wiki pages. So I use OpenNLP to detect sentences, as well as any words or phrases which do not belong to a sentence (also called sentence fragments).
OpenNLP accepts .txt files as input if you want to redirect the input. If you want to use a .doc file as input, you have to convert it to a .txt file first. My problem starts right here.
I have many files in different formats. I would like to detect sentences in each file if it contains any text, so I started converting each potentially text-containing file to a .txt file. The conversion process is not perfect: for example, if a sentence is too long (say, longer than one line), the conversion tool emits the two lines of the sentence as if they were separate sentences. As a result, OpenNLP produces each line as a different sentence, because of the end-of-line character at the end of the first line.
My question is: is there any way to parameterize or configure OpenNLP to recognize the whole sentence (the first and second lines together)?

I suggest you use Apache Tika for the conversion of those different files.
Apache Tika has an AutoDetectParser which detects the file type and extracts the text in it (even the metadata, if you want), and you can save that into a .txt file.
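For example, a minimal sketch of the extraction step (the input file name is a placeholder; any format Tika supports will do):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaExtract {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
            Metadata metadata = new Metadata();
            try (InputStream stream = new FileInputStream("input.doc")) {
                parser.parse(stream, handler, metadata, new ParseContext());
            }
            String text = handler.toString(); // plain text, ready to write out as .txt
            System.out.println(text);
        }
    }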

Try your paragraph, with the newlines replaced by spaces, in the CoreNLP demo: nlp.stanford.edu:8080/corenlp/process
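The same trick works with OpenNLP itself: join the hard-wrapped lines before running the sentence detector. A minimal sketch, assuming the pretrained en-sent.bin model is available on disk:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class DetectSentences {
        public static void main(String[] args) throws Exception {
            String text = "This is a sentence that was\nbroken across two lines.";
            // Collapse line breaks (and surrounding spaces) into single spaces;
            // note this also joins paragraph breaks, which may or may not be wanted
            String joined = text.replaceAll("\\s*\\n\\s*", " ");

            try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
                SentenceModel model = new SentenceModel(modelIn);
                SentenceDetectorME detector = new SentenceDetectorME(model);
                for (String sentence : detector.sentDetect(joined)) {
                    System.out.println(sentence);
                }
            }
        }
    }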

Related

How to use NLP to detect sentences in a long text?

I am using automatic speech recognition to extract text from an audio file. However, the output is just a long sequence of words with no punctuation whatsoever. What I'd like to do is use some NLP technique to estimate beginnings and endings of sentences, or, in other words, predict positions of punctuation markers. I found that CoreNLP can do sentence splitting, but apparently only if punctuation is already present.
You may find relevant info in the answers to this other question: Sentence annotation in text without punctuation.
In particular, one of the answers claims the deepsegment package works well on unpunctuated text.
In spoken language you often find that people don't use sentences, but that the clauses simply run into each other. The degree to which this happens depends on the formality and setting -- a speech will conform more to written sentence structures than a conversation in a pub among friends.
One approach you could try is to identify words that typically begin/end sentences in written text, and see if that can help you segment your data. Or look for verbs, and then try to find boundaries between them; these might be clause boundaries rather than sentence boundaries, but as I said, in spoken language there often are no sentences.
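As a very rough sketch of that cue-word idea (the starter list below is purely illustrative, not a curated resource):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class NaiveSegmenter {
        // Illustrative words that often introduce a new clause/sentence in transcripts
        private static final Set<String> STARTERS =
                new HashSet<>(Arrays.asList("so", "well", "then", "now", "okay", "but"));

        public static void main(String[] args) {
            String transcript = "we went to the shop then we saw john so we stopped to chat";
            StringBuilder out = new StringBuilder();
            for (String token : transcript.split("\\s+")) {
                if (STARTERS.contains(token) && out.length() > 0) {
                    out.append(". "); // guess a boundary before a cue word
                } else if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(token);
            }
            System.out.println(out + ".");
        }
    }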

Microsoft translator engine customization: parallel txt files

I am trying to perform some NMT engine customization for Japanese but I am having some difficulties uploading parallel txt files. I've gathered 10k parallel sentences and I've put them into two txt files:
As the guide suggested, I've also been careful to remove sentences containing \n and \r characters, but upon uploading I get the following:
What's wrong?
We display sentence counts because the model training engine operates at the sentence level. The expected format of the parallel .txt file set is one sentence per line. During the upload process we run a sentence breaker, which identifies end-of-sentence markers and splits accordingly; this is why the count of sentences does not always match the count of lines. Sentences, not lines of the input file, are the units we operate on.
This is also why we suggest removing newline characters within sentences. The newline is considered an end of sentence marker, so having newlines within a sentence creates a false sentence break.
In response to your second concern, we do run a sentence aligning process on most data that is submitted. If there is an inconsistent number of sentences in the uploaded parallel files we can usually get most of the sentence pairs, as long as the sentences are fairly close.
After some "debugging" I've noticed that the number shown in the portal is the number of sentences (instead of lines, my bad!). I find it kind of confusing (and not really useful in my opinion). What would be the usefulness of displaying this information?
In addition, I've noticed that there is no warning if you upload one file containing fewer lines than the other (which would make the parallel files not parallel anymore; the whole point of parallel files is to have X lines in the source file and X lines in the target file). It would be helpful if at least a warning were shown to prevent mistakes (with parallel files, len(f1) != len(f2) is a great indicator that something is off).
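In the meantime, that sanity check is easy to run yourself before uploading. A minimal sketch (file names are placeholders):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class CheckParallel {
        public static void main(String[] args) throws IOException {
            List<String> source = Files.readAllLines(Paths.get("source.ja.txt"));
            List<String> target = Files.readAllLines(Paths.get("target.en.txt"));

            // Line i of the source must pair with line i of the target
            if (source.size() != target.size()) {
                System.err.printf("Line count mismatch: %d vs %d%n",
                        source.size(), target.size());
            }
            // Blank lines break the one-sentence-per-line assumption
            for (int i = 0; i < Math.min(source.size(), target.size()); i++) {
                if (source.get(i).trim().isEmpty() || target.get(i).trim().isEmpty()) {
                    System.err.println("Blank line at line " + (i + 1));
                }
            }
        }
    }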

Mallet topic modeling: remove most common words

I'm new to Mallet and to topic modeling in the field of art history. I'm working with Mallet 2.0.8 on the command line (I don't know Java yet). I'd like to remove the most common and least common words (those appearing fewer than 10 times in the whole corpus, as D. Mimno recommends) before training the model, because the results aren't clean (even with the stoplist), which is not surprising.
I've found that the prune command could be useful, with options like --prune-document-freq. Is that right? Or is there another way? Could someone explain the whole procedure in detail (for example: at which stage do I create/input the Vectors2Vectors file, and what then)? It would be much appreciated!
I'm sorry for this question, I'm a beginner with Mallet and text mining! But it's quite exciting!
Thanks a lot for your help!
There are two places you can use Mallet to curate the vocabulary. The first is in data import, for example with the import-file command. The --remove-stopwords option removes a fixed set of English stopwords. This is here for backwards compatibility reasons, and is probably not a bad idea for some English-language prose, but you can generally do better by creating a custom list. I would recommend using instead the --stoplist-file option along with the name of a file. All words in this file, separated by spaces and/or newlines, will be removed. (Using both options will remove the union of the two lists, probably not what you want.) Another useful option is --replacement-files, which allows you to specify multi-word strings to treat as single words. For example, this file:
black hole
white dwarf
will convert "black hole" into "black_hole". Here newlines are treated differently from spaces. You can also specify multi-word stopwords with --deletion-files.
Once you have a Mallet file, you can modify that file with the prune command. --prune-count N will remove words that occur fewer than N times in the whole corpus. --prune-document-freq N will remove words that occur in fewer than N documents; this version can be more robust against words that occur many times in a single document. You can also prune by proportion: --min-idf removes infrequent words, --max-idf removes frequent words. A word with IDF 10.0 occurs less than once in 20000 documents; a word with IDF below 2.0 occurs in more than 13% of the collection.
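Putting the whole procedure together on the command line, it might look like the following (file names and thresholds are placeholders to adjust; check bin/mallet <command> --help for the exact options in your version):

    bin/mallet import-file --input corpus.txt --output corpus.mallet \
        --keep-sequence --stoplist-file stoplist.txt

    # drop words occurring fewer than 10 times, plus very frequent words (IDF < 2.0)
    bin/mallet prune --input corpus.mallet --output corpus.pruned.mallet \
        --prune-count 10 --max-idf 2.0

    bin/mallet train-topics --input corpus.pruned.mallet --num-topics 50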

How to automatically detect sentence fragments in a text file

I am working on a project and need a tool or an API to detect sentence fragments in a large text. There are many solutions, such as OpenNLP, for detecting sentences in a given file. However, I wasn't able to find any explicit solution to the problem of finding words, phrases, or even character combinations which do not belong to any grammatically correct sentence.
Any help will be greatly appreciated.
Thanks,
Lorderon
You could use n-grams as a workaround:
Suppose you have a large collection of text with real sentences for reference. You could extract all sequences of 1, 2, 3, 4, 5, or more words from it and then check whether the fragments from your text exist as n-grams (see the sketch after the links below).
You can download n-grams directly from Google: http://googleresearch.blogspot.de/2006/08/all-our-n-gram-are-belong-to-you.html but you might need a lot of bandwidth.
You could also count the n-grams yourself. In this case you can take the parsed data sets of Wikipedia from my website:
http://glm.rene-pickhardt.de/data/ and the source code from https://github.com/renepickhardt/generalized-language-modeling-toolkit in order to create the n-grams yourself (or use any other n-gram toolkit like SRILM, KyLM, OpenGrm, ...)
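As a toy illustration of the lookup step, you can collect the n-grams of a reference corpus into a set and test your fragments against it (the reference and fragment strings are placeholders):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class NgramCheck {
        // Collect all n-grams of the given order from whitespace-tokenized text
        static Set<String> ngrams(String text, int n) {
            String[] tokens = text.toLowerCase().split("\\s+");
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + n <= tokens.length; i++) {
                grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
            }
            return grams;
        }

        public static void main(String[] args) {
            String reference = "the cat sat on the mat . the dog barked at the cat .";
            Set<String> seen = ngrams(reference, 3);

            // A fragment whose trigrams never occur in the reference is suspicious
            String fragment = "cat mat dog the";
            boolean attested = ngrams(fragment, 3).stream().anyMatch(seen::contains);
            System.out.println(attested ? "looks like running text" : "possible fragment");
        }
    }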

Finding words from a dictionary in a string of text

How would you go about parsing a string of free-form text to detect things like locations and names, based on a dictionary of locations and names? In my particular application there will be tens of thousands of entries, if not more, in my dictionaries, so I'm pretty sure just running through them all is out of the question. Also, is there any way to add "fuzzy" matching so that you can also detect substrings that are within x edits of a dictionary word? If I'm not mistaken, this falls within the field of natural language processing and, more specifically, named entity recognition (NER); however, my attempts to find information about the algorithms and processes behind NER have come up empty. I'd prefer to use Python for this as I'm most familiar with it, although I'm open to looking at other solutions.
You might try downloading the Stanford Named Entity Recognizer:
http://nlp.stanford.edu/software/CRF-NER.shtml
If you don't want to use someone else's code and you want to do it yourself, I'd suggest taking a look at the algorithm in their associated paper, because the Conditional Random Field model that they use for this has become a fairly common approach to NER.
I'm not sure exactly how to answer the second part of your question on looking for substrings without more details. You could modify the Stanford program, or you could use a part-of-speech tagger to mark proper nouns in the text. That wouldn't distinguish locations from names, but it would make it very simple to find words that are x words away from each proper noun.
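To sketch the dictionary-lookup side: exact matches are cheap with a hash set no matter how large the dictionary is, and a bounded edit-distance pass covers the fuzzy part. The fuzzy loop below is naive; at scale you would index the dictionary in a BK-tree or a Levenshtein automaton rather than scanning every entry. A rough sketch:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class DictionaryMatcher {
        // Classic dynamic-programming Levenshtein distance
        static int editDistance(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                            prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        public static void main(String[] args) {
            Set<String> gazetteer = new HashSet<>(Arrays.asList("london", "paris", "berlin"));
            String text = "We flew from Lonndon to Paris last week";

            for (String token : text.toLowerCase().split("\\W+")) {
                if (gazetteer.contains(token)) {
                    System.out.println("exact match: " + token);
                } else {
                    for (String entry : gazetteer) { // naive scan; index this at scale
                        if (editDistance(token, entry) <= 1) {
                            System.out.println("fuzzy match: " + token + " ~ " + entry);
                        }
                    }
                }
            }
        }
    }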
