Python - Wiki pages into sentences - python-3.x

I am trying to read a wiki page, collect and enumerate all sentences.
# read the wiki page
import re
import wikipedia

eliz = wikipedia.page("Elizabeth II")
fullText2 = eliz.content
# split after '.' or '?' unless it looks like an abbreviation (e.g. "e.g." or "Mr.")
m = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)', fullText2)
docs = []
for i in m:
    print(i)
    docs.append(i)
But it doesn't seem to split the sentences properly: for example, this whole passage is printed as a single item:
"Elizabeth received private tuition in constitutional history from
Henry Marten, Vice-Provost of Eton College, and learned French from a
succession of native-speaking governesses. A Girl Guides company, the
1st Buckingham Palace Company, was formed specifically so she could
socialise with girls her own age. Later, she was enrolled as a Sea
Ranger.In 1939, Elizabeth's parents toured Canada and the United
States. As in 1927, when her parents had toured Australia and New
Zealand, Elizabeth remained in Britain, since her father thought her
too young to undertake public tours. Elizabeth "looked tearful" as her
parents departed. They corresponded regularly, and she and her parents
made the first royal transatlantic telephone call on 18 May."
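One option, not part of the original post, is to hand the splitting to a trained sentence tokenizer such as NLTK's punkt instead of a hand-written regex; it copes with abbreviations and decimal numbers, though the missing space in "Sea Ranger.In 1939" still needs a small pre-processing step. A minimal sketch, assuming the nltk and wikipedia packages are installed:
import re
import nltk
import wikipedia

nltk.download('punkt')  # one-time download of the punkt tokenizer model

eliz = wikipedia.page("Elizabeth II")
full_text = eliz.content
# add the missing space when a sentence runs straight into the next one,
# e.g. "Sea Ranger.In 1939" -> "Sea Ranger. In 1939"
full_text = re.sub(r'(?<=[a-z])\.(?=[A-Z])', '. ', full_text)

for i, sentence in enumerate(nltk.sent_tokenize(full_text)):
    print(i, sentence)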

Related

Replace specific text with a redacted version using Python

I am looking to do the opposite of what has been done here:
import re
text = '1234-5678-9101-1213 1415-1617-1819-hello'
re.sub(r"(\d{4}-){3}(?=\d{4})", "XXXX-XXXX-XXXX-", text)
output = 'XXXX-XXXX-XXXX-1213 1415-1617-1819-hello'
Partial replacement with re.sub()
My overall goal is to replace all XXXX within a text using a neural network. XXXX can represent names, places, numbers, dates, etc. that are in a .csv file.
The end result would look like:
XXXX went to XXXX XXXXXX
Sponge Bob went to Disney World.
In short, I am unmasking text and replacing it with a generated dataset using fuzzy.
You can do it using named-entity recognition (NER). It's fairly simple and there are off-the-shelf tools out there to do it, such as spaCy.
NER is an NLP task where a neural network (or other method) is trained to detect certain entities, such as names, places, dates and organizations.
Example:
Sponge Bob went to South beach, he payed a ticket of $200!
I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.
Running NER over these returns the detected entities: the names, places, organizations and amounts of money it finds in each sentence. Just be aware that this is not 100% accurate!
Here is a little snippet for you to try out:
import spacy

phrases = ['Sponge Bob went to South beach, he payed a ticket of $200!',
           'I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.']

nlp = spacy.load('en_core_web_sm')  # requires: python -m spacy download en_core_web_sm

for phrase in phrases:
    doc = nlp(phrase)
    replaced = ""
    for token in doc:
        if token.ent_type_:  # this token is part of a named entity
            replaced += "XXXX "
        else:
            replaced += token.text + " "
    print(replaced)
Read more here: https://spacy.io/usage/linguistic-features#named-entities
You could, instead of replacing with XXXX, replace based on the entity type, like:
if token.ent_type_ == "PERSON":
    replaced += "<PERSON> "
Then:
import re, random

personames = ["Jack", "Mike", "Bob", "Dylan"]
# note: random.choice is evaluated once, so every <PERSON> placeholder gets the same name
phrase = re.sub("<PERSON>", random.choice(personames), phrase)

Grouping Similar words in python

I'm trying to extract keywords/entity names from a text using spacy.
I'm able to extract all the entity names but I'm getting a lot of duplicates.
For example,
def keywords(text):
    tags = bla_bla(text)
    return tags

article = "Donald Trump. Trump. Trump. Donald. Donald J Trump."
tags = keywords(article)
The output I'm getting is:
['Donald Trump', 'Trump', 'Trump', 'Donald', 'Donald J Trump']
How do I cluster all these tags under a master tag 'Donald J Trump'?
1) Easy way: keep only the longest entities that contain the shorter ones (see the sketch after this list)
2) More time-consuming: build dictionaries of entities
3) ML: vectorize entities with bag of words and cluster them; the longest entity in a cluster becomes the "main" one
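A minimal sketch of option 1 (my own illustration, not from the answer): map every extracted tag to the longest tag whose words contain all of its words.
def group_tags(tags):
    # longest tags first, so the first containing candidate is the longest one
    unique = sorted(set(tags), key=len, reverse=True)
    master = {}
    for tag in unique:
        for candidate in unique:
            if set(tag.split()) <= set(candidate.split()):
                master[tag] = candidate
                break
    return master

print(group_tags(['Donald Trump', 'Trump', 'Trump', 'Donald', 'Donald J Trump']))
# every variant maps to 'Donald J Trump'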
With a carefully built dictionary.
There is no unsupervised/clustering approach that would do this reliably.
Consider the following sentence:
President Trump met with his son, Donald Trump Jr.
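To illustrate the dictionary idea (my own example, not from the answer), an explicit alias map lets you keep "Donald Trump Jr." separate from "Donald J Trump", which string overlap alone cannot do:
# hand-maintained alias dictionary: surface form -> canonical tag
ALIASES = {
    "Donald Trump": "Donald J Trump",
    "Trump": "Donald J Trump",
    "Donald": "Donald J Trump",
    "Donald Trump Jr.": "Donald Trump Jr.",
}

def canonical(tag):
    # fall back to the tag itself if it is not in the dictionary
    return ALIASES.get(tag, tag)

print([canonical(t) for t in ['Donald Trump', 'Trump', 'Donald Trump Jr.']])
# ['Donald J Trump', 'Donald J Trump', 'Donald Trump Jr.']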

In Python, how do I split a string after periods without affecting decimal numbers?

Suppose I have a string:
string1 = 'Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains.Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard.'
I have a large amount of articles to work with in which the periods don't always have a space after them, but some do. How do I split the text into sentences without splitting up decimal numbers? TIA.
One way to do this is to protect the dots you don't want your text split at by replacing them with a placeholder first, then re-replacing the placeholder back again after the split:
import re
# replace dots that have numbers around them with "[PROTECTED_DOT]"
string1_protected = re.sub(r"(\d)\.(\d)", r"\1[PROTECTED_DOT]\2", string1)
# now split (and remove empty lines)
lines_protected = [line + "." for line in string1_protected.split(".") if line]
# now re-replace all "[PROTECTED_DOT]"s
lines = [line.replace("[PROTECTED_DOT]", ".") for line in lines_protected]
The result:
In [1]: lines
Out[1]: ['Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains.',
"Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard."]
This can be accomplished with regular expressions using re.split(), assuming no declarative sentence ends in a number and is immediately followed, with no space, by a sentence beginning with a number (e.g., "This is my sentence ending in 1.2 is the start of my next sentence.", where the first sentence ends in "1." and the next begins with "2").
That being said, split() alone will not be able to perform the desired action. It is also worth noting that, since apostrophes are more common than quotation marks, delimiting your string with double quotation marks will likely be better. As it stands now, the very end of your sentence, "s Pernod Ricard.", falls outside the string literal and is therefore invalid syntax.
string1 = "Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains.Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard."
sentences = re.split('[^0-9]["."][^0-9]', string1)

BeautifulSoup get_text returns NoneType object

I'm trying BeautifulSoup for web scraping and I need to extract headlines from this webpage, specifically from the 'more' headlines section. This is the code I've tried using so far.
import requests
from bs4 import BeautifulSoup
from csv import writer

response = requests.get('https://www.cnbc.com/finance/?page=1')
soup = BeautifulSoup(response.text, 'html.parser')
posts = soup.find_all(id='pipeline')
for post in posts:
    data = post.find_all('li')
    for entry in data:
        title = entry.find(class_='headline')
        print(title)
Running this code gives me ALL the headlines in the page in the following output format:
<div class="headline">
<a class=" " data-nodeid="105372063" href="/2018/08/02/after-apple-rallies-to-1-trillion-even-the-uber-bullish-crowd-on-wal.html">
{{{*HEADLINE TEXT HERE*}}}
</a> </div>
However, if I use the get_text() method while fetching title in the above code, I only get the first two headlines.
title = entry.find(class_='headline').get_text()
Followed by this error:
Traceback (most recent call last):
File "C:\Users\Tanay Roman\Documents\python projects\scrapper.py", line 16, in <module>
title = entry.find(class_='headline').get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'
Why does adding the get_text() method only return partial results. And how do I solve it?
You are misunderstanding the error message. It is not that the .get_text() call returns a NoneType object, it is that objects of type NoneType do not have that method.
There is only ever exactly one object of type NoneType, the value None. Here it was returned by entry.find(class_='headline') because it could not find an element in entry matching the search criteria. In other words, there is, for that entry element, no child element with the class headline.
There are two such <li> elements, one with the id nativedvriver3 and the other with nativedvriver9, and you'd get that error for both. You need to first check if there is a matching element:
for entry in data:
    headline = entry.find(class_='headline')
    if headline is not None:
        title = headline.get_text()
You'd have a much easier time if you used a CSS selector:
headlines = soup.select('#pipeline li .headline')
for headline in headlines:
    headline_text = headline.get_text(strip=True)
    print(headline_text)
This produces:
>>> headlines = soup.select('#pipeline li .headline')
>>> for headline in headlines:
...     headline_text = headline.get_text(strip=True)
...     print(headline_text)
...
Hedge funds fight back against tech in the war for talent
Goldman Sachs sees more price pain ahead for bitcoin
Dish Network shares rise 15% after subscriber losses are less than expected
Bitcoin whale makes ‘enormous’ losing bet, so now other traders have to foot the bill
The 'Netflix of fitness' looks to become a publicly traded stock as soon as next year
Amazon slammed for ‘insult’ tax bill in the UK despite record profits
Nasdaq could plunge 15 percent or more as ‘rolling bear market’ grips stocks: Morgan Stanley
Take-Two shares surge 9% after gamemaker beats expectations due to 'Grand Theft Auto Online'
UK bank RBS announces first dividend in 10 years
Michael Cohen reportedly secured a $10 million deal with Trump donor to advance a nuclear project
After-hours buzz: GPRO, AIG & more
Bitcoin is still too 'unstable' to become mainstream money, UBS says
Apple just hit a trillion but its stock performance has been dwarfed by the other tech giants
The first company to ever reach $1 trillion in market value was in China and got crushed
Apple at a trillion-dollar valuation isn’t crazy like the dot-com bubble
After Apple rallies to $1 trillion, even the uber bullish crowd on Wall Street believes it may need to cool off

Relationship extraction between person and city/state

I'm trying to take a sentence and extract the relationship between Person(PER) and Place(GPE).
Sentence: "John is from Ohio, Michael is from Florida and Rebecca is from Nashville which is in Tennessee."
For the final person, she has both a city and a state that could get extracted as her place. So far, I've tried using nltk to do this, but have only been able to extract her city and not her state.
What I've tried:
import re
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.sem.relextract import extract_rels, rtuple
sentence = "John is from Ohio, Michael is from Florida and Rebecca is from Nashville which is in Tennessee."
chunked = ne_chunk(pos_tag(word_tokenize(sentence)))
ISFROM = re.compile(r'.*\bfrom\b.*')
rels = extract_rels('PER', 'GPE', chunked, corpus='ace', pattern=ISFROM)
for rel in rels:
    print(rtuple(rel))
My output is:
[PER: 'John/NNP'] 'is/VBZ from/IN' [GPE: 'Ohio/NNP']
[PER: 'Michael/NNP'] 'is/VBZ from/IN' [GPE: 'Florida/NNP']
[PER: 'Rebecca/NNP'] 'is/VBZ from/IN' [GPE: 'Nashville/NNP']
The problem is Rebecca. How can I extract that both Nashville and Tennessee are part of her location? Or even just Tennessee alone?
It seems to me that you first have to extract the intra-location relationship (Nashville is in Tennessee), then transitively assign all locations to Rebecca (if Rebecca is in Nashville and Nashville is in Tennessee, then Rebecca is in both Nashville and Tennessee).
That would be one more relationship type plus some logic for the above inference (things get complicated pretty quickly, but that is hard to avoid).
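A minimal sketch of that idea (my own extension of the question's code, assuming ne_chunk also tags Tennessee as GPE; it relies on the subjtext/objtext keys of the relation dicts returned by nltk.sem.relextract):
import re
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.sem.relextract import extract_rels

sentence = "John is from Ohio, Michael is from Florida and Rebecca is from Nashville which is in Tennessee."
chunked = ne_chunk(pos_tag(word_tokenize(sentence)))

ISFROM = re.compile(r'.*\bfrom\b.*')
ISIN = re.compile(r'.*\bin\b.*')

# person -> place pairs ("X is from Y") and place -> place pairs ("Y ... in Z")
person_place = [(rel['subjtext'], rel['objtext'])
                for rel in extract_rels('PER', 'GPE', chunked, corpus='ace', pattern=ISFROM)]
place_place = [(rel['subjtext'], rel['objtext'])
               for rel in extract_rels('GPE', 'GPE', chunked, corpus='ace', pattern=ISIN)]

# transitive step: if PER is from GPE1 and GPE1 is in GPE2, also pair PER with GPE2
inferred = [(person, outer) for person, inner in person_place
            for small, outer in place_place if small == inner]
print(person_place + inferred)  # should now include Rebecca paired with Tennessee as well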
