Say I have the code txt = "Hello my name is bob. I really like pies.", how would I extract each sentence individually and add them to a list? I created this messy script which roughly gives me the number of sentences in a string...
sentences = 0
capitals = [
'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S',
'T','U','V','W','X','Y','Z'
]
finish_markers = [
'.','?','!'
]
newTxt = txt.split()
for x in newTxt[1:-1]:
    for caps in capitals:
        if caps in x:
            for fin in finish_markers:
                if fin in newTxt[newTxt.index(x) - 1]:
                    sentences += 1
for caps in capitals:
    if caps in newTxt[0]:
        sentences += 1
print("Sentence count...")
print(sentences)
It uses the txt variable mentioned above. However, I would now like to extract each sentence and put them into a list, so the final product would look something like this...
['Hello my name is bob.','I really like pies.']
I would prefer not to use any non-standard packages because I want this script to work independently of everything and offline. Thank you for any help!
Use nltk.tokenize
import nltk
sentences = nltk.sent_tokenize(txt)
This will give you a list of sentences.
You could work with a regex for all the ending characters (".", "?", "!") and then split the text into separate strings, as in the sketch below.
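For example, a minimal sketch of that idea using only the standard library re module (assuming sentence-ending punctuation is always followed by whitespace):

import re

txt = "Hello my name is bob. I really like pies."

# Split on whitespace that follows '.', '?' or '!'; the lookbehind keeps the
# punctuation attached to the preceding sentence.
sentences = re.split(r'(?<=[.?!])\s+', txt.strip())
print(sentences)
# ['Hello my name is bob.', 'I really like pies.']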
You are trying to split a string into sentences; that is a bit hard to do with regular expressions or string-handling functions. For your use case, I'd recommend an NLP library like NLTK. Then, take a look at this: Tokenize a paragraph into sentences and then into words in NLTK.
I am quite new to Python.
I want to only get a certain format from a bigger list. For example:
What's in the list:
/ABC/EF213
/ABC/EF
/ABC/12AC4
/ABC/212
However, the only ones I want listed are the ones with the format /###/#####, while the rest get discarded.
You could use a list comprehension or a for loop to check each element of the list and see whether it matches a pattern. One way of doing this would be to check if each item matches a regex pattern.
As an example:
import re
original_list = ["Item I don't want", "/ABC/EF213", "/ABC/EF", "/ABC/12AC4", "/ABC/212", "123/456", "another useless item", "/ABC/EF"]
filtered_list = [item for item in original_list if re.fullmatch(r"/\w+/\w+", item) is not None]
print(filtered_list)
outputs
['/ABC/EF213', '/ABC/EF', '/ABC/12AC4', '/ABC/212', '/ABC/EF']
If you need help making regex patterns, there are many great websites, such as regexr, which can help you.
Every string can be indexed like a list without any conversion. If the only format you want to check is /###/#####, then you can simply write if statements like these:
for text in your_list:
    if len(text) == 10 and text[0] == "/" and text[4] == "/" (and so on):
        print(text)
Of course, this would require a lot of if statements and would take a pretty long time, so I would recommend doing a faster and simpler check. We could perform this by, for example, splitting the texts, which would look something like this:
for text in your_list:
    checkstring = text.split("/")
Now you have your text split into parts, and you can simply check the lengths of these new parts with the len() function, as in the sketch below.
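For example, a rough sketch of that length check (assuming /###/##### means exactly three characters, a slash, then five characters, and that your_list holds the strings from the question):

wanted = []
for text in your_list:
    parts = text.split("/")            # e.g. "/ABC/EF213" -> ["", "ABC", "EF213"]
    # keep only strings of the form /XXX/XXXXX
    if len(parts) == 3 and parts[0] == "" and len(parts[1]) == 3 and len(parts[2]) == 5:
        wanted.append(text)
print(wanted)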
I have a few patients' medical record text files which I got from the internet, and I want to identify/find the files which are bad quality (misspelled words, special characters between the words, erroneous words) and the files with good quality (clean text). I want to build an error detection model using text mining/NLP.
1) Can someone please help me with the approach and solution for feature extraction and model selection?
2) Is there any medical corpus for medical records to identify the misspelled/erroneous words?
If your goal is to simply correct these misspelled words to improve performance on whatever downstream task you want to do, then I can suggest a simple approach which has worked sufficiently well for me.
First tokenize your text (I recommend scispacy for medical text)
Identify possible "bad quality" words simply by counting each unique word across your corpus, e.g. all words that occur <= 3 times
Add words that occur > 3 times in your corpus (we assume these are all correctly spelled) to a regular English dictionary. If your corpus is large, this is perfectly adequate for capturing medical terms. Otherwise use a medical dictionary e.g. UMLS, or https://github.com/glutanimate/wordlist-medicalterms-en to add the medical words not in a regular dictionary
Use pyspellchecker to identify the misspellings by using the Levenshtein Distance algorithm and comparing against our dictionary.
Replace the typos with what pyspellchecker thinks they should be.
A basic example:
import spacy
import scispacy
from collections import Counter
from spellchecker import SpellChecker
nlp = spacy.load('en_core_sci_md') # sciSpaCy
word_freq = Counter()
for doc in corpus:
    tokens = nlp.tokenizer(doc)
    tokenised_text = ""
    for token in tokens:
        tokenised_text = tokenised_text + token.text + " "
    word_freq.update(tokenised_text.split())
infreq_words = [word for word in word_freq.keys() if word_freq[word] <= 3 and word[0].isdigit() == False]
freq_words = [word for word in word_freq.keys() if word_freq[word] > 3]
add_to_dictionary = " ".join(freq_words)
f=open("medical_dict.txt", "w+")
f.write(add_to_dictionary)
f.close()
spell = SpellChecker()
spell.distance = 1 # set the distance parameter to just 1 edit away - much quicker
spell.word_frequency.load_text_file('medical_dict.txt')
misspelled = spell.unknown(infreq_words)
misspell_dict = {}
for i, word in enumerate(misspelled):
    if (word != spell.correction(word)):
        misspell_dict[word] = spell.correction(word)
print(list(misspell_dict.items())[:10])
I would also recommend using regular expressions to fix any other "bad quality" words which can be systematically corrected.
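For instance, a small sketch of that kind of rule-based cleanup (the patterns here are illustrative examples, not part of the original approach):

import re

def basic_cleanup(text):
    # collapse repeated punctuation such as "!!!" or ",,," into a single mark
    text = re.sub(r'([!?.,])\1+', r'\1', text)
    # remove stray special characters wedged inside words, e.g. "pa#tient" -> "patient"
    text = re.sub(r'(?<=\w)[#@*^~](?=\w)', '', text)
    # normalise runs of whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

print(basic_cleanup("The pa#tient  was admitted!!!"))  # "The patient was admitted!"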
You can use BioBERT to do contextual spell checking.
Link: https://github.com/dmis-lab/biobert
I am using spaCy to process sentences of a doc. Given one sentence, I'd like to get the previous and following sentence.
I can easily iterate over the sentences of the doc as follows:
nlp_content = nlp(content)
sentences = nlp_content.sents
for idx, sent in enumerate(sentences):
But I can't get the sentence #idx-1 or #idx+1 from the sentence #idx.
Is there any function or property that could be useful there?
Thanks!
Nick
There isn't a built-in sentence index. You would need to iterate over the sentences once to create your own list of sentence spans to access them this way.
sentence_spans = tuple(doc.sents) # alternately: list(doc.sents)
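For example, a minimal sketch of accessing the neighbouring sentences by index (using nlp_content, the Doc from the question):

sentence_spans = list(nlp_content.sents)
for idx, sent in enumerate(sentence_spans):
    prev_sent = sentence_spans[idx - 1] if idx > 0 else None                        # previous sentence, or None
    next_sent = sentence_spans[idx + 1] if idx + 1 < len(sentence_spans) else None  # next sentence, or None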
I'm trying to clean up a bunch of tweets so that they can be used for k-means clustering. I've written the following code that should strip each tweet of its unwanted characters.
from nltk.corpus import stopwords
import nltk
import json
with open("/Users/titus/Desktop/trumptweets.json",'r', encoding='utf8') as f:
data = json.loads(f.readline())
tweets = []
for sentence in data:
tokens = nltk.wordpunct_tokenize(sentence['text'])
type(tokens)
text = nltk.Text(tokens)
type(text)
words = [w.lower() for w in text if w.isalpha() and w not in
stopwords.words('english') and w is not 'the']
s = " "
useful_sentence = s.join(words)
tweets.append(useful_sentence)
print(tweets)
I'm trying to remove words like "I" and "the", but for some reason I can't figure out how. If I look at the tweets after they've gone through the loop, the word "the" still occurs.
Question: How is it possible that there are still occurrences of "the" and "I" in the tweets? How should I fix this?
Beware of the processing order.
Here are two test strings for you:
THIS THE REMAINS.
this the is removed
Because "THE" is not "the". You lowercase after filtering, but you should first lowercase then filter.
The bad news for you: k-means works horribly badly on noisy short text like tweets, because it is sensitive to noise and TF-IDF vectors need fairly long texts to be reliable. So carefully verify your results; they are probably not as good as they may seem in the first flush of enthusiasm.
Have you tried lowercasing w in the check?
words = [w.lower() for w in text if w.isalpha() and w.lower() not in
         stopwords.words('english') and w.lower() != 'the']
is (and is not) is the (reference) identity check. It tests whether two variable names point to the same object in memory. Typically this is only used to compare with None, or for some other special cases.
In your case, use the != operator or the negation of == to compare with the string "the".
See also: Is there a difference between `==` and `is` in Python?
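A quick sketch of the difference (the exact identity result can vary by interpreter, so treat it as illustrative):

w = "".join(["th", "e"])   # an equal string built at runtime, usually a distinct object
print(w == "the")          # True: compares values
print(w is "the")          # usually False: compares object identity (newer Python versions warn about this)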
I am new to natural language processing and I want to use it to write a news aggregator (in Node.js, in my case). Rather than just use a prepackaged framework, I want to learn the nuts and bolts, and I am starting with the NLP portion. I found one tutorial that has been the most helpful so far:
http://www.p-value.info/2012/12/howto-build-news-aggregator-in-100-loc.html
In it, the author gets the RSS feeds and loops through them looking for the elements (or fields) title and description. I know Python and understand the code. But what I don't understand is what NLP is doing here with title and description under the hood (besides scraping and tokenizing, which is apparent... and those tasks don't need NLP).
import feedparser
import nltk
corpus = []
titles=[]
ct = -1
for feed in feeds:
    d = feedparser.parse(feed)
    for e in d['entries']:
        words = nltk.wordpunct_tokenize(nltk.clean_html(e['description']))
        words.extend(nltk.wordpunct_tokenize(e['title']))
        lowerwords = [x.lower() for x in words if len(x) > 1]
        ct += 1
        print ct, "TITLE", e['title']
        corpus.append(lowerwords)
        titles.append(e['title'])
(Reading your question more carefully, maybe this was all already obvious to you, but it doesn't look like anything deeper or more interesting is going on.)
wordpunct_tokenize is set up here (last line) as
wordpunct_tokenize = WordPunctTokenizer().tokenize
WordPunctTokenizer is implemented by this code:
class WordPunctTokenizer(RegexpTokenizer):
    def __init__(self):
        RegexpTokenizer.__init__(self, r'\w+|[^\w\s]+')
The heart of this is just the regular expression r'\w+|[^\w\s]+', which defines what strings are considered to be tokens by this tokenizer. There are two options, separated by the |:
\w+, that is, one or more "word" characters (alphabetic or numeric)
[^\w\s]+, one or more characters that are neither "word" characters nor whitespace; this matches any run of punctuation
Here is a reference for Python regular expressions.
I have not dug into RegexpTokenizer, but I assume it is set up such that the tokenize function returns an iterator that searches a string for the first match of the regular expression, then the next, and so on.
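For example, applying that same pattern directly with the standard re module (an illustrative sketch, not NLTK's actual implementation) shows the kind of tokens it produces:

import re

print(re.findall(r'\w+|[^\w\s]+', "Hello, world! It's 2012."))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2012', '.']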