In an NLP text summarization example, I've come across a weird situation. The example uses the spaCy library to process the text. I'll explain the situation with the two cases below.
In the first case (see the first pic), spaCy doesn't split the text into sentences at the period character, as you can see in the red-outlined part, "won by the Whites.".
In the second case (see the second pic), after I've moved the sentence ending with "Whites." up, spaCy does split at the period character, as you can see in the red-outlined part, "won by the Whites.,". Note that this time there is a comma after the sentence ending with "Whites.", which means this sentence has been split off from the next one, unlike in the first case.
I've observed the same behavior when moving the sentence to other positions as well.
Nothing comes to mind except that this might be a bug. (I copied the text into a text editor and pasted it back into the notebook to make sure there is no special character next to the period.)
What do you think?
I'm sharing the notebook here so that you can play with it:
https://colab.research.google.com/drive/1MXRIrak0y680U84g0a0glpjX-clkkdtG?usp=sharing
I think the issue might be that what you're seeing in the second one is a list. But feel free to correct me if I'm wrong.
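Since spaCy's default sentence boundaries come from the dependency parser (unless you add a rule-based sentencizer), moving text around can change where the splits fall. To inspect this directly, a minimal sketch like the one below (assuming the paragraph is pasted into a variable named text and the en_core_web_sm model is installed) prints every sentence spaCy detects, so the two orderings can be compared outside the notebook:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes this small English model is installed

    text = "...paste the paragraph here..."  # placeholder; try both orderings of the sentences
    doc = nlp(text)

    # Print each detected sentence so the boundary after "Whites." is easy to spot.
    for i, sent in enumerate(doc.sents, start=1):
        print(i, repr(sent.text))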
I am searching for a regex that can find a certain word in a sentence. However, the sentences in my case often contain (floating-point) numbers, and the common solutions for finding a word don't account for such numbers (at least not the ones I found).
I have created a regex that does the job in my test case, but in other cases it runs into catastrophic backtracking. I want to prevent the catastrophic backtracking, but that exceeds my regex knowledge.
Here is one of the versions of my regex: (\d+([.,]?\d+)*|[\w\s\(\[\{])*?(?<=[.?!;\s({[])\b[sS]cope\b[\s\S]*?(?<!\d)[.?!;](?!\d)
I know that there are other solutions out there, one being sentence tokenisation with nltk; that's the solution I ended up using. I'm asking this question out of interest and am only interested in regex solutions.
All example texts are from some random PDFs from the internet, hence the strange-looking sentences.
Here is one example where it works perfectly: https://regex101.com/r/12Xo8k/1
Here is one example where it doesn't work and where the searched word is not contained: https://regex101.com/r/RvBYhL/1
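For reference, the nltk route mentioned above looks roughly like this (a sketch, not a regex-only answer; it assumes the input text sits in a variable named text and that the word of interest is "scope"):

    import re
    from nltk.tokenize import sent_tokenize  # needs the punkt data: nltk.download('punkt')

    text = "...text extracted from the PDF..."  # placeholder

    # Split into sentences first, then search each short sentence for the word.
    # Because the word search runs on small strings, there is little room for
    # catastrophic backtracking here.
    hits = [s for s in sent_tokenize(text) if re.search(r"\bscope\b", s, re.IGNORECASE)]
    for s in hits:
        print(s)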
Let's assume that I have a dataset of car accidents. Each accident has a textual description produced using a set of cameras and other sensors.
Suppose now that I have only the data from a single camera (e.g. the frontal one) and I want to remove all the sentences of the description that are not related to it. I think a basic and easy solution could be a boolean retrieval system that uses a set of specific keywords to remove unwanted sentences, but I don't know whether it is a good idea or whether it would work; could someone suggest an idea? What kind of statistics might be useful for studying this problem? Thanks
Regex could be one solution.
I created a regex matching the word "front", case insensitive, which searches for "front" and then gets the whole sentences with one or more matches.
The results may need to be trimmed of leading whitespace. (That could probably be handled as well with some fine-tuning.)
You can swap the word out via a variable taking values from a list, if you need "front", "rear", "side", "right", "left" or others.
Regex Example https://regex101.com/r/ZHU0kr/5
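A rough Python sketch of that idea follows; the keyword list, the variable names, and the assumption that sentences end in '.', '!' or '?' are illustrative choices, not part of the original regex:

    import re

    description = "...textual description of the accident..."  # placeholder

    keywords = ["front", "frontal"]  # hypothetical list; swap in "rear", "side", "left", "right", ...

    # Case-insensitive pattern: a stretch of text with no sentence-ending punctuation,
    # containing one of the keywords, up to the next '.', '!' or '?'.
    pattern = re.compile(
        r"[^.!?]*\b(?:" + "|".join(map(re.escape, keywords)) + r")\b[^.!?]*[.!?]",
        re.IGNORECASE,
    )

    # .strip() removes the leading whitespace mentioned above.
    kept = [m.group(0).strip() for m in pattern.finditer(description)]
    print(kept)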
I'm new to Mallet and topic modeling, in the field of art history. I'm working with Mallet 2.0.8 and the command line (I don't know Java yet). I'd like to remove the most common and least common words (those occurring about 10 times or fewer in the whole corpus, as D. Mimno recommends) before training the model, because the results aren't clean (even with the stoplist), which is not surprising.
I've found that the prune command could be useful, with options like prune-document-freq. Is that right? Or is there another way? Could someone explain the whole procedure in detail (for example: when to create/input the Vectors2Vectors file, at which stage, and what comes next)? It would be much appreciated!
I'm sorry for this question; I'm a beginner with Mallet and text mining! But it's quite exciting!
Thanks a lot for your help!
There are two places you can use Mallet to curate the vocabulary. The first is in data import, for example the import-file command. The --remove-stopwords option removes a fixed set of English stopwords. This is here for backwards-compatibility reasons, and is probably not a bad idea for some English-language prose, but you can generally do better by creating a custom list. I would recommend using instead the --stoplist-file option along with the name of a file. All words in this file, separated by spaces and/or newlines, will be removed. (Using both options will remove the union of the two lists, probably not what you want.) Another useful option is --replacement-files, which allows you to specify multi-word strings to treat as single words. For example, this file:
black hole
white dwarf
will convert "black hole" into "black_hole". Here newlines are treated differently from spaces. You can also specify multi-word stopwords with --deletion-files.
Once you have a Mallet file, you can modify that file with the prune command. --prune-count N will remove words that occur fewer than N times in any document. --prune-document-freq N will remove words that occur at least once in N documents. This version can be more robust against words that occur a lot in one document. You can also prune by proportion: --min-idf removes infrequent words, --max-idf removes frequent words. A word with IDF 10.0 occurs less than once in 20000 documents, a word with IDF below 2.0 occurs in more than 13% of the collection.
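Putting the two stages together, a possible command sequence might look like the sketch below; the file names, the stoplist, and the threshold values are placeholders, and the exact options you need may differ:

    # 1. Import the raw corpus, keeping word order (needed for topic modeling)
    #    and applying a custom stoplist.
    bin/mallet import-file --input corpus.tsv --output corpus.mallet \
        --keep-sequence --stoplist-file my_stoplist.txt

    # 2. Prune the vocabulary of the imported .mallet file using the options described above.
    bin/mallet prune --input corpus.mallet --output corpus.pruned.mallet \
        --prune-count 10 --prune-document-freq 5

    # 3. Train the topic model on the pruned file.
    bin/mallet train-topics --input corpus.pruned.mallet --num-topics 50 \
        --output-topic-keys topic-keys.txt --output-state topic-state.gz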
Assume I have a very long text and I'd like to extract a certain amount of context around a specific word. For example, in the following text I'd like to extract 8 words around the word warrior.
........
........
... died. He was a very brave warrior, fighting for freedom against the odds ...
........
........
In this case the result would be
He was a very brave warrior, fighting for freedom
Notice how I dropped the word died, as I'd prefer to start from the beginning of a full sentence, and how I extracted more than just 8 words, because fighting for freedom is much more meaningful than just fighting for.
Are there any algorithms, or research conducted in this field, that I could follow? How should I go about approaching this problem?
You can use RegEx to get the whole sentence that contains the word you are looking for.
Then use an information extraction algorithm to find the most suitable 8 words.
I found Python implementations of both.
For the regexp, look here.
And for the extraction algorithm, look here.
Hope this will help you.
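A small sketch of the first step (the splitting on '.', '!' and '?' is a simplifying assumption; the example text is the one from the question):

    import re

    text = "... died. He was a very brave warrior, fighting for freedom against the odds ..."

    # Split the text into rough sentences, then keep the ones mentioning the word.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hits = [s for s in sentences if re.search(r"\bwarrior\b", s)]
    print(hits)  # the sentence containing "warrior"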
Let's divide your problem into parts and keep it independent of any programming language:
If you want the word fight instead of fighting, you should preprocess your data. Please take a look at lemmatization and stemming techniques, which will give you the root words.
Another text preprocessing step would be to eliminate the stop words from your text. Words such as the, will, if, but, etc. will be removed.
Now, to extract n words, you can define a window size that will extract n words from your text. So all you have to do is write a function that takes the text and the word around which you want to extract the context. Iterate this over your entire text.
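For illustration, a minimal sketch of such a window function (the whitespace tokenization and the function name are assumptions made here; in practice you would run it on the preprocessed text):

    import re

    def window_around(text, target, n=8):
        # Naive sketch: split on whitespace, find the target word, and return
        # roughly n words centred on it.
        words = text.split()
        for i, w in enumerate(words):
            if re.fullmatch(re.escape(target) + r"\W*", w, re.IGNORECASE):
                start = max(0, i - n // 2)
                return " ".join(words[start:i + n // 2 + 1])
        return None

    print(window_around("He was a very brave warrior, fighting for freedom against the odds", "warrior"))
    # -> "was a very brave warrior, fighting for freedom against"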
Hope this helps.
I'm looking for a solution to the following task. I take a few random pages from a random book in English, remove all non-letter characters, and convert all characters to lower case. As a result I have something like:
wheniwasakidiwantedtobeapilot...
Now what I'm looking for is something that could reverse that process with reasonably good accuracy. I need to recover the words and the sentence separators. Any ideas on how to approach this problem? Are there existing solutions I can build on without reinventing the wheel?
This is harder than normal tokenization since the basic tokenization task assumes spaces. Basically all that normal tokenization has to figure out is, for example, whether punctuation should be part of a word (like in "Mr.") or separate (like at the end of a sentence). If this is what you want, you can just download the Stanford CoreNLP package which performs this task very well with a rule-based system.
For your task, you need to figure out where to put in the spaces. This tutorial on Bayesian inference has a chapter on word segmentation in Chinese (Chinese writing doesn't use spaces). The same techniques could be applied to space-free English.
The basic idea is that you have a language model (an N-gram model would be fine) and you want to choose a splitting that maximizes the probability of the data according to the language model. So, for example, placing a space between "when" and "iwasakidiwantedtobeapilot" would give you a higher probability according to the language model than placing a split between "whe" and "niwasakidiwantedtobeapilot", because "when" is a better word than "whe". You could do this many times, adding and removing spaces, until you figured out what gave you the most English-looking sentence.
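To make that concrete, here is a toy sketch of the maximization with a unigram model and dynamic programming; the tiny frequency table is invented for illustration, and a real model would be estimated from a large corpus:

    import math

    # Toy unigram "language model": word -> relative frequency (made-up numbers).
    freq = {"when": 0.01, "i": 0.02, "was": 0.01, "a": 0.03, "kid": 0.001,
            "wanted": 0.001, "to": 0.02, "be": 0.01, "pilot": 0.0001}

    def word_logprob(w):
        # Unknown strings get a penalty that grows with their length.
        return math.log(freq[w]) if w in freq else math.log(1e-12) * len(w)

    def segment(text):
        # best[i] = (best log-probability of text[:i], split point that achieves it)
        best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
        for i in range(1, len(text) + 1):
            for j in range(max(0, i - 20), i):  # cap candidate word length at 20 chars
                score = best[j][0] + word_logprob(text[j:i])
                if score > best[i][0]:
                    best[i] = (score, j)
        # Walk back through the stored split points to recover the words.
        words, i = [], len(text)
        while i > 0:
            j = best[i][1]
            words.append(text[j:i])
            i = j
        return list(reversed(words))

    print(segment("wheniwasakidiwantedtobeapilot"))
    # -> ['when', 'i', 'was', 'a', 'kid', 'i', 'wanted', 'to', 'be', 'a', 'pilot']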
Doing this will give you a long list of tokens. Then when you want to split those tokens into sentences you can actually use the same technique except instead of using a word-based language model to help you add spaces between words, you'll use a sentence-based language model to split that list of tokens into separate sentences. Same idea, just on a different level.
The tasks you describe are called "word tokenization" and "sentence segmentation". There is a lot of literature about them in NLP. They have very simple, straightforward solutions, as well as advanced probabilistic approaches based on language models. Choosing one depends on your exact goal.