I am new to natural language processing and I want to use it to write a news aggregator (in Node.js, in my case). Rather than just use a prepackaged framework, I want to learn the nuts and bolts, and I am starting with the NLP portion. I found this tutorial, which has been the most helpful so far:
http://www.p-value.info/2012/12/howto-build-news-aggregator-in-100-loc.html
In it, the author gets the RSS feeds and loops through them looking for the elements (or fields) title and description. I know Python and understand the code. But what I don't understand is what NLP is doing here with title and description under the hood (besides scraping and tokenizing, which is apparent... and those tasks don't need NLP).
import feedparser
import nltk

corpus = []
titles = []
ct = -1
for feed in feeds:  # feeds: a list of RSS feed URLs, defined earlier in the tutorial
    d = feedparser.parse(feed)
    for e in d['entries']:
        words = nltk.wordpunct_tokenize(nltk.clean_html(e['description']))
        words.extend(nltk.wordpunct_tokenize(e['title']))
        lowerwords = [x.lower() for x in words if len(x) > 1]
        ct += 1
        print ct, "TITLE", e['title']
        corpus.append(lowerwords)
        titles.append(e['title'])
(Reading your question more carefully, maybe this was all already obvious to you, but it doesn't look like anything deeper or more interesting is going on.)
wordpunct_tokenize is set up here (last line) as
wordpunct_tokenize = WordPunctTokenizer().tokenize
WordPunctTokenizer is implemented by this code:
class WordPunctTokenizer(RegexpTokenizer):
def __init__(self):
RegexpTokenizer.__init__(self, r'\w+|[^\w\s]+')
The heart of this is just the regular expression r'\w+|[^\w\s]+', which defines what strings are considered to be tokens by this tokenizer. There are two options, separated by the |:
\w+, that is, one or more "word" characters (alphanumeric, plus underscore)
[^\w\s]+, one or more characters that are neither "word" characters nor whitespace, so this matches any run of punctuation
Here is a reference for Python regular expressions.
I have not dug into RegexpTokenizer, but I assume it is set up such that the tokenize function finds the first match of the regular expression in the string, then the next, and so on.
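If it helps to see what that pattern does, here is a quick check with plain re (illustrative only; it is not the NLTK implementation itself):

import re

# Runs of word characters OR runs of non-word, non-space characters (punctuation).
print(re.findall(r'\w+|[^\w\s]+', "Hello, world! It's 2012."))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2012', '.']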
I'm new to regular expressions and got stuck with the code below.
import re
s = "5. Consider the task in Figure 8.11, which are balanced in fig 99.2"
output = re.findall((r'[A-Z][a-z]*'), s)[0]
output2 = re.findall(r'\b[^A-Z\s\d]+\b', s)
mixing = " ".join(str(x) for x in output2)
finalmix = output+" " + mixing
print(finalmix)
Here I'm trying to print "Consider the task in Figure 8.11, which are balanced in fig 99.2" from the given string s as a sentence in the output, so I joined the two outputs with a join at the end to get it as a sentence. But it's quite confusing now, because "Figure 8.11" and "fig 99.2" will not be printed: I haven't written a regex for them, since I can't determine what regex to use and how to combine it with the rest at the end.
It's probably because I'm using the wrong approach to print the given sentence from the string s. I'd be glad if anyone could help me fix the code or guide me toward an alternate approach, as this code looks absurd.
This is the output I get:
Consider the task in . which are balanced in .
To capture all bulleted items, I would use:
import re
s = "5. Consider the task in Figure 8.11, which are balanced in fig 99.2"
items = re.findall(r'\d+\.(?!\d)(.*?)(?=\d+\.(?!\d)|$)', s, flags=re.DOTALL)
print(items)
This prints:
['Consider the task in Figure 8.11, which are balanced in fig 99.2']
Here is an explanation of the regex pattern:
\d+\. match a bullet number followed by a literal dot
(?!\d) which is NOT immediately followed by another digit (so "8.11" and "99.2" are not treated as bullets)
(.*?) match and capture all content, across newlines, until hitting
(?=\d+\.(?!\d)|$) another number bullet OR the end of the input
@TimBiegeleisen's answer works, but it is somewhat verbose, because using re.findall requires repeating the bullet-point pattern both at the start and as a lookahead at the end.
For the purpose of finding strings between repeating patterns (bullet points in this case), it may be simpler to use re.split instead. Slice the resulting list to discard the first item, since we don't need what comes before the first bullet point:
re.split(r'\d+\.(?!\d)\s*', s)[1:]
This returns:
['Consider the task in Figure 8.11, which are balanced in fig 99.2']
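For instance, on a made-up string with two bullet points (not from the question, just to show the splitting):

import re

s = "5. Consider the task in Figure 8.11, which are balanced in fig 99.2 6. Repeat for fig 3.4"
items = [item.strip() for item in re.split(r'\d+\.(?!\d)\s*', s)[1:]]
print(items)
# ['Consider the task in Figure 8.11, which are balanced in fig 99.2', 'Repeat for fig 3.4']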
Say I have the code txt = "Hello my name is bob. I really like pies.". How would I extract each sentence individually and add them to a list? I created this messy script, which gives me a rough count of the sentences in a string...
sentences = 0
capitals = [
    'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S',
    'T','U','V','W','X','Y','Z'
]
finish_markers = [
    '.','?','!'
]

newTxt = txt.split()

for x in newTxt[1:-1]:
    for caps in capitals:
        if caps in x:
            for fin in finish_markers:
                if fin in newTxt[newTxt.index(x) - 1]:
                    sentences += 1

for caps in capitals:
    if caps in newTxt[0]:
        sentences += 1

print("Sentence count...")
print(sentences)
It is using the txt variable mentioned above. However, I would now like to extract each sentence and put them into a list, so the final product would look something like this...
['Hello my name is bob.','I really like pies.']
I would prefer not to use any non-standard packages, because I want this script to work independently of everything and offline. Thank you for any help!
Use nltk.tokenize
import nltk
sentences = nltk.sent_tokenize(txt)
This will give you a list of sentences.
You could work with a regex that matches all the sentence-ending characters (".", "?", "!") and then split the text into separate strings, as sketched below.
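A minimal sketch of that idea using only the standard library (the exact split pattern is an assumption about what counts as a sentence boundary):

import re

txt = "Hello my name is bob. I really like pies."
# Split on whitespace that follows ., ? or ! so the punctuation stays with its sentence.
sentences = re.split(r'(?<=[.?!])\s+', txt.strip())
print(sentences)  # ['Hello my name is bob.', 'I really like pies.']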
You are trying to split a string into sentences, which is a bit hard to do with regular expressions or plain string handling. For your use case, I'd recommend an NLP library like NLTK. Then take a look at this: Tokenize a paragraph into sentence and then into words in NLTK.
How do I use regular expressions to isolate the words with ei or ie in them?
import re
value = ("How can one receive one who over achieves while believing that he/she cannot be deceived.")
list = re.findall("[ei,ie]\w+", value)
print(list)
It should print ['receive', 'achieves', 'believing', 'deceived'], but I get ['eceive', 'er', 'ieves', 'ile', 'elieving', 'eceived'] instead.
The set syntax [] is for individual characters, so use (?:...) instead, with the alternatives ei and ie separated by |. This works like a group, but it doesn't capture a match group the way () would. You also want \w* on either side so that the whole word is matched.
import re
value = ("How can one receive one who over achieves while believing that he/she cannot be deceived.")
words = re.findall(r"\w*(?:ei|ie)\w*", value)
print(words)
['receive', 'achieves', 'believing', 'deceived']
(I'm assuming you meant "achieves", not "achieve" since that's the word that actually appears here.)
I'm trying to clean up a bunch of tweets so that they can be used for k-means clustering. I've written the following code that should strip each tweet of its unwanted characters.
from nltk.corpus import stopwords
import nltk
import json

with open("/Users/titus/Desktop/trumptweets.json", 'r', encoding='utf8') as f:
    data = json.loads(f.readline())

tweets = []
for sentence in data:
    tokens = nltk.wordpunct_tokenize(sentence['text'])
    type(tokens)
    text = nltk.Text(tokens)
    type(text)
    words = [w.lower() for w in text if w.isalpha() and w not in
             stopwords.words('english') and w is not 'the']
    s = " "
    useful_sentence = s.join(words)
    tweets.append(useful_sentence)

print(tweets)
I'm trying to remove words like "I" and "the", but for some reason I can't figure out how. If I look at the tweets after they've gone through the loop, the word "the" still occurs.
Question: How is it possible that there are still occurrences of "the" and "I" in the tweets? How should I fix this?
Beware of the processing order.
Here are two test strings for you:
THIS THE REMAINS.
this the is removed
Because "THE" is not "the". You lowercase after filtering, but you should first lowercase then filter.
The bad news for you: k-means works very poorly on noisy, short texts like tweets, because it is sensitive to noise and TF-IDF vectors need long texts to be reliable. So carefully verify your results; they are probably not as good as they may seem in the first flush of enthusiasm.
Have you tried lowercasing w in the check?
words = [w.lower() for w in text if w.isalpha() and w.lower() not in
         stopwords.words('english') and w.lower() != 'the']
is (and is not) is the identity (reference) check. It checks whether two names point to the same object in memory. Typically it is only used to compare with None, or in some other special cases.
In your case, use the != operator or the negation of == to compare with the string "the".
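A tiny illustration of the difference (the result of is can vary between runs and interpreters, since string interning is an implementation detail):

a = "the"
b = "".join(["th", "e"])  # same value, but built at runtime
print(a == b)  # True - the values are equal
print(a is b)  # usually False - they are typically two distinct objects in memory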
See also: Is there a difference between `==` and `is` in Python?
In spaCy 2.x, I use the Matcher to find specific tokens in my text corpus. Each rule has an ID ('class-1_0', for example). During parsing, I use the on_match callback to handle each match. Is there a way to retrieve the rule used to find the match directly in the callback?
Here is my sample code.
txt = ("Aujourd'hui, je vais me faire une tartine au beurre "
"de cacahuète, c'est un pilier de ma nourriture "
"quotidienne.")
nlp = spacy.load('fr')
def on_match(matcher, doc, id, matches):
span = doc[matches[id][1]:matches[id][2]]
print(span)
# find a way to get the corresponding rule without fuzz
matcher = Matcher(nlp.vocab)
matcher.add('class-1_0', on_match, [{'LEMMA': 'pilier'}])
matcher.add('class-1_1', on_match, [{'LEMMA': 'beurre'}, {'LEMMA': 'de'}, {'LEMMA': 'cacahuète'}])
doc = nlp(txt)
matches = matcher(doc)
In this case, matches contains:
[(12071893341338447867, 9, 12), (4566231695725171773, 16, 17)]
12071893341338447867 is a unique ID based on class-1_0. I cannot find the original rule name, even if I do some introspection in matcher._patterns.
It would be great if someone could help me.
Thank you very much.
Yes – you can simply look up the ID in the StringStore of your vocabulary, available via nlp.vocab.strings or doc.vocab.strings. Going via the Doc is pretty convenient here, because you can do so within your on_match callback:
def on_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]   # the hash of the rule name is in the match tuple
    string_id = doc.vocab.strings[match_id]
For efficiency, spaCy encodes all strings to integers and keeps a reference to the mapping in the StringStore lookup table. In spaCy v2.0, the integers are hash values, so they'll always match across models and vocabularies. For more details on this, see this section in the docs.
Of course, if your classes and IDs are kinda cryptic anyways, the other answer suggesting integer IDs will work fine, too. Just keep in mind that those integer IDs you choose will likely also be mapped to some random string in the StringStore (like a word, or a part-of-speech tag or something). This usually doesn't matter if you're not looking them up and resolving them to strings somewhere – but if you do, the output may be confusing. For example, if your matcher rule ID is 99 and you're calling doc.vocab.strings[99], this will return 'VERB'.
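To illustrate the lookup outside the callback (this assumes the rule 'class-1_0' from the question has already been added to the matcher, so its name is present in the store):

rule_hash = nlp.vocab.strings['class-1_0']  # the hash that shows up in `matches`
print(nlp.vocab.strings[rule_hash])         # 'class-1_0'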
While writing my question, as often happens, I found the solution.
It's dead simple: instead of using a unicode rule ID like class-1_0, simply use an integer. The identifier will then be preserved throughout the process.
matcher.add(1, on_match, [{'LEMMA': 'pilier'}])
The match then comes back as:
[(1, 16, 17),]