Extracting context around a word in a sentence - nlp

Assume I have a very long text and I'd like to extract a certain length of context around a specific word. For example, in the following text I'd like to extract 8 words around the word warrior.
........
........
... died. He was a very brave warrior, fighting for freedom against the odds ...
........
........
In this case the result would be
He was a very brave warrior, fighting for freedom
Notice how I dropped the word died, as I'd prefer to start from the beginning of a full sentence, and how I extracted more than just 8 words, because fighting for freedom is much more meaningful than just fighting for.
Are there any algorithms, or research conducted in this field, that I could follow? How should I go about approaching this problem?

You can use a regex to get the whole sentence that contains the word you are looking for.
Then use an information extraction algorithm to find the most relevant 8 words.
I found Python implementations of both.
For the regexp, look here.
And for the extraction algorithm, look here.
Hope this helps.
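As a rough illustration of the first step, here is a minimal sketch in Python (the sentence-splitting regex is deliberately naive, and the example text is taken from the question):

```python
import re

def sentence_containing(text, word):
    """Return the first sentence that contains the given word, or None."""
    # Naive split on ., ! or ? followed by whitespace; real text
    # (abbreviations, quotes, ellipses) needs a proper sentence tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    for sentence in sentences:
        if re.search(rf"\b{re.escape(word)}\b", sentence):
            return sentence
    return None

text = "He died. He was a very brave warrior, fighting for freedom against the odds."
print(sentence_containing(text, "warrior"))
# -> He was a very brave warrior, fighting for freedom against the odds.
```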

Let's divide your problem into parts and keep it independent of any programming language:
If you want the word fight instead of fighting, you should preprocess your data. Take a look at lemmatization and stemming techniques, which will give you the root words.
Another text preprocessing step would be to eliminate the stop words from your text. Words such as the, will, if, but etc. will be removed.
Now, to extract n words, you can define a window size of n words around your target word. So all you have to do is write a function that takes the text and the word around which you want to extract the context, and iterate it over your entire text.
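For the window step, a minimal sketch (the window size and the crude punctuation handling are just for illustration; it ignores the sentence-boundary preference from the question):

```python
def context_window(text, target, n=8):
    """Return roughly n words centred on the first occurrence of target."""
    words = text.split()
    for i, word in enumerate(words):
        if word.strip(".,!?;:").lower() == target.lower():
            half = n // 2
            return " ".join(words[max(0, i - half): i + half + 1])
    return ""

text = "He died. He was a very brave warrior, fighting for freedom against the odds."
print(context_window(text, "warrior"))
# -> was a very brave warrior, fighting for freedom against
```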
Hope this helps.

Related

How to automate finding sentences similar to the ones from a given list?

I have a list of let's say "forbidden sentences" (1000 of them, each with around 40 words). I want to create a tool that will find and mark them in a given document.
The problem is that in such a document a forbidden sentence can be expressed differently than it is on the list, keeping the same meaning but changed by using synonyms, a few words more or less, different word order, punctuation, grammar etc. The fact that this is all in Polish is not making things easier, with each noun, pronoun, and adjective having 14 cases in total, plus modifiers and gender that change the words further. I was also thinking about making it so that the found sentences are ranked by the probability of them being forbidden, with some displaying less resemblance.
I studied IT for two years but I don't have much knowledge of NLP. Do you think this can be done by an amateur? Could you give me some advice on where to start and what tools are best to put it all together? No need to be fancy, just practical. I was hoping to find some ready-to-use code because I imagine this is something that has been done before. Any ideas where to find such resources or what keywords to use while searching? I'd really appreciate some help because I'm very new to this and need to start with the basics.
Thanks in advance,
Kamila
Probably the easiest first try would be to use Polish spaCy, an extension of the popular production-ready NLP library that supports the Polish language.
http://spacypl.sigmoidal.io/#home
You can try to do it like this:
Split the document into sentences.
Clean these sentences with spaCy (deleting stop words and punctuation, doing lemmatization - this will help you with the many different versions of the same word).
Clean the "forbidden sentences" as well.
Prepare a vector representation of each sentence - you can use spaCy methods for that.
Calculate the similarity between sentences - e.g. cosine similarity.
Set a threshold: if a sentence of the document is more similar than that to any of the "forbidden sentences", treat it as forbidden (see the sketch below).
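Putting these steps together, a minimal sketch, assuming a Polish spaCy model with word vectors is installed (e.g. pl_core_news_md from current spaCy releases) and using a placeholder list of forbidden sentences:

```python
import spacy

# Assumes a Polish model with word vectors, installable with:
#   python -m spacy download pl_core_news_md
nlp = spacy.load("pl_core_news_md")

def clean(doc):
    """Lemmatize and drop stop words / punctuation before comparing."""
    lemmas = [t.lemma_.lower() for t in doc if not (t.is_stop or t.is_punct)]
    return nlp(" ".join(lemmas))

# Placeholder list; in practice load your 1000 forbidden sentences here.
forbidden_sentences = ["Przykładowe zabronione zdanie."]
forbidden = [clean(nlp(s)) for s in forbidden_sentences]

def mark_forbidden(document_text, threshold=0.85):
    """Return (sentence, best similarity) pairs scoring above the threshold."""
    flagged = []
    for sent in nlp(document_text).sents:
        cleaned = clean(sent.as_doc())
        score = max(cleaned.similarity(f) for f in forbidden)
        if score >= threshold:
            flagged.append((sent.text, score))
    return flagged
```

The threshold is something you would tune by hand on a few known examples, and the similarity score itself gives you the ranking by resemblance you mentioned.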
If anything is not clear let me know.
Good luck!

A good word splitter

I have a set of short strings (average length < 12).
The strings are mostly sequences of English words (names, dictionary words etc).
However, there is no delimiter between the words. I want to split each string into individual words. I tried Google but didn't find anything.
Is there any standard way to do that? Also, where can I get a dictionary that includes names of people along with other English words?
Please note: The strings might not adhere to grammatical rules of English.
Examples of Strings are given below:
dontdisturb
ilovejane
iamagoodperson
It is a known problem for Twitter content/hashtags, though there is no standard, universally accepted way to solve it. (I would also suggest changing the topic to "hashtag splitter" if that is your problem; more people would then be able to find it.)
The algorithm I would suggest is the one typically used for the segmentation of Chinese (which, as you can imagine, has a very similar issue). Here is the idea:
1. Try finding all substrings that can be found in a dictionary and give them the highest score.
2. Then add sequences accepted by some English heuristic, with a lower score.
3. Finally, throw in individual letters or syllables found in the remainder, with the lowest score.
4. Use the Viterbi algorithm (or here) to find the best non-overlapping coverage of the string with the highest score (a simplified sketch follows below).
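A rough sketch of that scoring-plus-dynamic-programming idea, with a tiny illustrative dictionary and made-up scores (a real implementation would use a full word list and tuned weights):

```python
# Tiny illustrative dictionary; use a real word list in practice.
WORDS = {"dont", "disturb", "i", "love", "jane", "am", "a", "good", "person"}

def score(piece):
    """Higher is better: dictionary words first, single letters as a fallback."""
    if piece in WORDS:
        return 10 * len(piece)      # prefer long dictionary words
    if len(piece) == 1:
        return 1                    # lowest score for leftover letters
    return -1000                    # heavily penalize unknown multi-letter chunks

def segment(s):
    """Best-scoring split of s, built left to right over all split points."""
    best = [(0, [])]                # best[i] = (score, words) for s[:i]
    for i in range(1, len(s) + 1):
        candidates = [
            (best[j][0] + score(s[j:i]), best[j][1] + [s[j:i]])
            for j in range(max(0, i - 20), i)
        ]
        best.append(max(candidates))
    return best[len(s)][1]

print(segment("ilovejane"))      # -> ['i', 'love', 'jane']
print(segment("dontdisturb"))    # -> ['dont', 'disturb']
```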

How to automatically detect sentence fragments in a text file

I am working on a project and need a tool or an API to detect sentence fragments in large texts. There are many solutions, such as OpenNLP, for detecting sentences in a given file. However, I wasn't able to find any explicit solution to the problem of finding words, phrases or even character combinations that do not belong to any grammatically correct sentence.
Any help will be greatly appreciated.
Thanks,
Lorderon
You could use n-grams as a workaround:
Suppose you have a large collection of text with real sentences for reference. You could extract all sequences of 1, 2, 3, 4, 5 or more words and then check whether the fragments from your text exist as n-grams in that reference.
You can download n-grams directly from Google: http://googleresearch.blogspot.de/2006/08/all-our-n-gram-are-belong-to-you.html but you might need a lot of traffic.
You could also count the n-grams yourself. In that case you can take the parsed data sets of Wikipedia from my website:
http://glm.rene-pickhardt.de/data/ and the source code from https://github.com/renepickhardt/generalized-language-modeling-toolkit in order to create the n-grams yourself (or use any other n-gram toolkit like SRILM, KyLM, OpenGrm, ...).
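A minimal sketch of that check, assuming a plain-text reference corpus in a hypothetical file reference_corpus.txt and judging candidates by their trigrams:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-word sequences from a token list."""
    return zip(*(tokens[i:] for i in range(n)))

# Hypothetical reference corpus of real, well-formed sentences.
reference_tokens = open("reference_corpus.txt", encoding="utf-8").read().lower().split()
known_trigrams = Counter(ngrams(reference_tokens, 3))

def looks_like_fragment(candidate):
    """Flag a phrase none of whose trigrams ever occur in the reference text."""
    tokens = candidate.lower().split()
    if len(tokens) < 3:
        return True   # too short to judge reliably; treated as a fragment here
    return all(known_trigrams[gram] == 0 for gram in ngrams(tokens, 3))
```

With Counter, looking up an unseen trigram simply returns 0, so no smoothing is needed for this yes/no check.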

Dividing string of characters to words and sentences (English only)

I'm looking for a solution to the following task. I take a few random pages from a random book in English, remove all non-letter characters and convert all characters to lower case. As a result I have something like:
wheniwasakidiwantedtobeapilot...
Now what I'm looking for is something that could reverse that process with quite good accuracy. I need to find the words and the sentence separators. Any ideas how to approach this problem? Are there existing solutions I can build on without reinventing the wheel?
This is harder than normal tokenization, since the basic tokenization task assumes spaces. Basically, all that normal tokenization has to figure out is, for example, whether punctuation should be part of a word (as in "Mr.") or separate (as at the end of a sentence). If that is what you want, you can just download the Stanford CoreNLP package, which performs this task very well with a rule-based system.
For your task, you need to figure out where to put in the spaces. This tutorial on Bayesian inference has a chapter on word segmentation in Chinese (Chinese writing doesn't use spaces). The same techniques could be applied to space-free English.
The basic idea is that you have a language model (an n-gram model would be fine) and you want to choose the splitting that maximizes the probability of the data according to the language model. So, for example, placing a space between "when" and "iwasakidiwantedtobeapilot" would give you a higher probability according to the language model than placing a split between "whe" and "niwasakidiwantedtobeapilot", because "when" is a better word than "whe". You could do this many times, adding and removing spaces, until you figure out what gives you the most English-looking sentence.
Doing this will give you a long list of tokens. Then when you want to split those tokens into sentences you can actually use the same technique except instead of using a word-based language model to help you add spaces between words, you'll use a sentence-based language model to split that list of tokens into separate sentences. Same idea, just on a different level.
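A minimal sketch of that idea with a unigram language model and a search over split points, assuming a hypothetical frequency file word_counts.txt with lines like "when 12345" (a real system would use a larger n-gram model and smarter smoothing):

```python
import math
from functools import lru_cache

# Hypothetical unigram counts, one "word count" pair per line.
counts = {}
for line in open("word_counts.txt", encoding="utf-8"):
    word, count = line.split()
    counts[word] = int(count)
total = sum(counts.values())

def log_prob(word):
    """Unigram log probability; unknown words get a length-dependent penalty."""
    if word in counts:
        return math.log(counts[word] / total)
    return math.log(1.0 / total) - len(word)

@lru_cache(maxsize=None)
def split(text, max_word_len=20):
    """Return (log probability, words) for the best segmentation of text."""
    if not text:
        return 0.0, []
    best = None
    for i in range(1, min(len(text), max_word_len) + 1):
        head, tail = text[:i], text[i:]
        tail_lp, tail_words = split(tail)
        candidate = (log_prob(head) + tail_lp, [head] + tail_words)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best

print(split("wheniwasakidiwantedtobeapilot")[1])
```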
The tasks you describe are called "word tokenization" and "sentence segmentation". There is a lot of literature about them in NLP. They have very simple, straightforward solutions as well as advanced probabilistic approaches based on language models. Which one to choose depends on your exact goal.

Large free block of english non-pronoun text

As part of teaching myself Python I've written a script which allows a user to play hangman. At the moment, the hangman word to be guessed is simply entered manually at the start of the script's code.
I want the script to choose randomly from a large list of English words instead. This I know how to do - my problem is finding that list of words to work from in the first place.
Does anyone know of a source on the net for, say, 1000 common English words that can be downloaded as a block of text or something similar that I can work with?
(My initial thought was grabbing a chunk of a novel from Project Gutenberg [this project is only for my own amusement and won't be available anywhere else, so copyright etc. doesn't matter hugely to me, btw], but anything like that is likely to contain too many names or non-standard words that wouldn't be suitable for hangman. I need text that only has words legal for use in Scrabble, basically.)
It's a slightly odd question for here I suppose, but actually I thought the answer might be of use not just to me but to anyone else working on a word game or similar project that needs a large seed list of words to work from.
Many thanks for any links or suggestions :)
Would this be useful?
Have you tried /usr/share/dict/words?
Create a text list manually
Grab text from Project Gutenberg, Wikipedia or some other source. Go through the text and count how many times each word is found. The words that are found most frequently will be pronouns, conjunctions, etc. Just throw them out.
Proper nouns will likely be the least frequently found words, unless of course your text is a story, in which case the character names will likely be found quite often. Probably the best way to handle proper nouns is to use many sources and count how many sources each word is found in. Essentially, words that are common among a lot of different sources will likely not be proper nouns; words that are specific to one text source you can throw out. This idea is related to tf-idf.
Once you have calculated these word frequencies, it's also easy to just look over the words and tweak your list as necessary.
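A minimal sketch of that counting-and-filtering idea, with hypothetical plain-text files (book1.txt, book2.txt, ...) standing in for the downloaded sources; the cut-offs are arbitrary:

```python
import re
from collections import Counter

# Hypothetical plain-text sources, e.g. books saved from Project Gutenberg.
sources = ["book1.txt", "book2.txt", "book3.txt"]

total_freq = Counter()   # overall counts, to drop the most common function words
doc_freq = Counter()     # in how many sources each word appears (tf-idf-like idea)

for path in sources:
    words = re.findall(r"[a-z]+", open(path, encoding="utf-8").read().lower())
    total_freq.update(words)
    doc_freq.update(set(words))

common = {w for w, _ in total_freq.most_common(200)}   # the, and, of, he, ...
candidates = [
    w for w in total_freq
    if w not in common                      # drop pronouns, conjunctions, etc.
    and doc_freq[w] == len(sources)         # appears everywhere -> not a proper noun
    and len(w) >= 4                         # reasonable hangman length
]
```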
Use WordNet
Another idea is to download words from WordNet. WordNet tells you the part of speech for a lot of words, so you could just stick to nouns and verbs for your purpose.
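A minimal sketch of the WordNet route using NLTK (assumes nltk is installed and the WordNet corpus has been downloaded with nltk.download("wordnet")); the length limits are arbitrary:

```python
import random
from nltk.corpus import wordnet

# Collect single-word, letters-only noun lemmas of a hangman-friendly length.
nouns = {
    lemma.name().lower()
    for synset in wordnet.all_synsets(pos="n")
    for lemma in synset.lemmas()
    if lemma.name().isalpha() and 4 <= len(lemma.name()) <= 12
}

word_to_guess = random.choice(sorted(nouns))
print(word_to_guess)
```

Note that WordNet still contains some obscure entries and lowercase proper nouns, so you may want to intersect the result with a frequency-based list like the one above.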
