Python3: writing article in own words - python-3.x

I am trying to extract a summary from a news article. Here is what I have tried so far:
>>> from newspaper import Article
>>> url = 'http://abcnews.go.com/International/wireStory/north-korea-ready-deploy-mass-produce-missile-47552675'
>>> article = Article(url)
>>> article.download()
>>> article.parse()
>>> article.nlp()
>>> article.keywords
['ready', 'north', 'test', 'missiles', 'deploy', 'tested', 'korea', 'missile', 'launch', 'nuclear', 'capable', 'media', 'massproduce']
>>> article.summary
'North Korea says it\'s ready to deploy and start mass-producing a new medium-range missile capable of reaching Japan and major U.S. military bases there following a test launch it claims confirmed the missile\'s combat readiness and is an "answer" to U.S. President Donald Trump\'s policies.\nPyongyang\'s often-stated goal is to perfect a nuclear warhead that it can put on a missile capable of hitting Washington or other U.S. cities.\nAt the request of diplomats from the U.S., Japan and South Korea, a United Nations\' Security Council consultation on the missile test will take place Tuesday.\nNorth Korea a week earlier had successfully tested a new midrange missile — the Hwasong 12 — that it said could carry a heavy nuclear warhead.\nExperts said that rocket flew higher and for a longer time than any other missile previously tested by North Korea and represents another big advance toward a viable ICBM.'
The summary generated above is taken verbatim from the news article itself, whereas I want to achieve human-like summarization (in my own words, paraphrased, or otherwise rewritten, but still relevant).
Could you advise me on what I need to do so that my code does exactly what I want?

There is sumy, which offers several ways to summarize English texts. Most (if not all) of those algorithms extract sentences from the input document; you can then postprocess the extracted sentences to split and/or merge them and substitute synonyms.
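For instance, a minimal extractive run with sumy's LexRank summarizer looks roughly like this (the 5-sentence count is arbitrary, and url is the article URL from the question):
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# parse the article straight from its URL
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LexRankSummarizer()

# print a 5-sentence extractive summary
for sentence in summarizer(parser.document, 5):
    print(sentence)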
Beyond that, this topic is still less a matter of engineering than of research. Try AI StackExchange.

Related

Beautiful Soup not finding element by ID

I am trying to scrape incoming headlines from forexfactory.com.
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.forexfactory.com/news").text
soup = BeautifulSoup(source, 'lxml')
headlines = soup.find(id="ui-outer")
print(headlines)
This returns None when it should return the div containing all the headlines. Any idea what could be wrong? I have tried searching by div, by ul id, by li id, and some other ways. It always returns None.
Thank You.
For dynamically created elements, give Selenium a try:
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
URL = 'https://www.forexfactory.com/news'
driver.get(URL)
Wait a few seconds so that dynamically loaded content is also present:
driver.implicitly_wait(5)  # wait up to 5 seconds for elements to appear
Get your element:
uiOuter = driver.find_element_by_id('ui-outer')
Example for all links (story titles):
aHref = driver.find_elements_by_css_selector('div.flexposts__story-title a')
[x.text for x in aHref]
Output
['EU\'s Barnier says "fundamental divergences" persist in UK trade talks',
'With end of crisis programs, Fed faces tricky post-pandemic transition',
'Markets Look Past Near-Term Challenges',
'Time is short, Divergences remain, but we continue to work hard for a deal',
'EURUSD consolidates around 1.1900; indicators are flat',
'Rush for emerging market company bonds as investors look beyond COVID-19',
'Europe’s Virus Lockdowns Push Economy Into Another Contraction',
'Interactive Brokers enhances Client Portal',
'BoE’s Haldane: Risk That Anxiety Leads To Gloom Loop',
'Sharpest fall in UK private sector output since May. Manufacturing growth offset by renewed...',
'Remote Working Shift Offers Silver Lining for Finance Gender Gap',
'EU Flash PMI signals steep downturn in November amid COVID-19 lockdowns',
'German PMI drops to five-month low in November due to tightening of COVID-19 restrictions, but...',
'Sharp decline in French business activity amid fresh COVID-19 lockdown',
'Rishi Sunak says Spending Review will not spell austerity',
'Remote Working Shift Offers Silver Lining for Finance Gender Gap',
'Japan’s Labor Thanksgiving Day goes viral',
'Ranking Asset Classes by Historical Returns (1985-2020)',
'Time is short, Divergences remain, but we continue to work hard for a deal',
'EURUSD consolidates around 1.1900; indicators are flat',
'US Dollar stuck near support, NZ$ strikes two-year high',
'US Dollar stuck near support, NZ$ strikes two-year high',
'Georgia confirms results in latest setback for Trump bid to overturn Biden win',
'Canada to roll over terms of EU trade deal with UK',
'Time is short, Divergences remain, but we continue to work hard for a deal',
'German PMI drops to five-month low in November due to tightening of COVID-19 restrictions, but...',
"COVID-19: 'No return to austerity', says chancellor as he hints at public sector pay freeze",
'EURUSD consolidates around 1.1900; indicators are flat',
'New Zealand Dollar May Rise as RBNZ Holds Fire on Negative Rates',
'Interactive Brokers enhances Client Portal']
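As a refinement, instead of a blanket implicitly_wait you can wait explicitly until the element exists. A short sketch with Selenium's WebDriverWait (the 10-second timeout is an arbitrary choice):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block until #ui-outer is present in the DOM, up to 10 seconds
wait = WebDriverWait(driver, 10)
uiOuter = wait.until(EC.presence_of_element_located((By.ID, 'ui-outer')))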

How do semantic text comparison APIs work

I am currently doing a project where we are trying to gauge explanatory answers submitted by users against a correct answer. I have come across APIs like dandelion and paralleldots, both of which are capable of checking how close 2 texts are to each other semantically.
These APIs are giving me favorable responses for questions like:
What is the distinction between debtor and creditor?
Answer1: A debtor is a person or enterprise that owes money to another
party. A creditor is a person, bank, or other enterprise that has
lent money or extended credit to another party.
Answer2: A debtor has a debt or legal obligation to pay an amount to
another person or entity, from whom goods were purchased or services
were obtained. A creditor may be a bank, supplier
Dandelion gave me a score of 81% and paralleldots gave me 4.8/5 for the same answer. This is quite expected.
However, before I prepare a demo and plan to eventually use them in production, I am interested in understanding to some extent how these APIs are generating these scores.
Is it a tf-idf-based vector product over stemmed POS tags?
PS: Not an expert in NLP
This question is very broad: semantic sentence similarity is an open issue in NLP and there are a variety of ways of performing this task, all of them being far from perfect at the current stage. As an example, just consider that:
Trump is the president of the United States
and
Trump has never been the president of the United States
have a semantic similarity of 5 according to paralleldots. Whether that is acceptable depends on your definition of similarity; the point is that, depending on what you need the similarity for, such a score may not be suitable if you have specific requirements.
Anyway, as for the implementation, there's no single "standard" way of performing this, and there's a plethora of features that can be used: tf-idf (or equivalent), the syntactic structure of the sentence (e.g. constituency or dependency parse trees), mentions of entities extracted from the text, etc., or, following the latest trends, a deep neural network which doesn't need any explicit features.
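To make that limitation concrete, here is a minimal bag-of-words baseline with scikit-learn (purely an illustration, not what either API actually does): a tf-idf cosine similarity also scores the two contradictory sentences above as very similar, because negation barely changes the word vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Trump is the president of the United States",
    "Trump has never been the president of the United States",
]
vectors = TfidfVectorizer().fit_transform(sentences)
# prints a high score despite the opposite meanings
print(cosine_similarity(vectors[0], vectors[1])[0, 0])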

Identifying the context of word in sentence

I created a classifier to classify nouns, adjectives, and named entities in a given sentence. I used a large Wikipedia dataset for the classification.
For example:
Where Abraham Lincoln was born?
The classifier will give this sort of result (word - class):
Where - question
Abraham Lincoln - Person, Movie, Book (because the classifier finds Abraham Lincoln in all three categories)
born - time
When Titanic was released?
when - question
Titanic - Song, Movie, Vehicle, Game (Titanic is classified in all these categories)
Is there any way to identify the exact context for a word?
Please note:
Word sense disambiguation would not help here, because there might not be a nearby word in the sentence that can help.
The Lesk algorithm with WordNet synsets does not help either. For example, for the word "bank" the Lesk algorithm behaves like this:
======== TESTING simple_lesk ===========
TESTING simple_lesk() ...
Context: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities
TESTING simple_lesk() with POS ...
Context: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)
Here, for the word "bank", it suggested a financial institution and sloping land. In my case I am already getting such predictions, e.g. for Titanic it can be a movie or a game.
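(For reference, output like the above can be produced with pywsd roughly as follows; this assumes pywsd and the NLTK WordNet data are installed.)
from pywsd.lesk import simple_lesk

# disambiguate "bank" against its sentence context
sense = simple_lesk("I went to the bank to deposit my money", "bank")
print(sense, "-", sense.definition())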
I want to know: is there any other approach, apart from the Lesk algorithm, baseline algorithms, and traditional word sense disambiguation, that can help me identify which class is correct for a particular keyword such as Titanic?
Thanks for using the pywsd examples. With regard to WSD, there are many other variants, and I'm coding them myself in my free time. So if you want to see it improve, do join me in coding the open-source tool =)
Meanwhile, you may find the following technologies more relevant to your task:
Knowledge base population (http://www.nist.gov/tac/2014/KBP/), where tokens/segments of text are assigned an entity and the task is to link them, or to solve a simplified question-and-answer task.
Knowledge representation (http://groups.csail.mit.edu/medg/ftp/psz/k-rep.html)
Knowledge extraction (https://en.wikipedia.org/wiki/Knowledge_extraction)
The above technologies usually include several sub-tasks such as:
Wikification (http://nlp.cs.rpi.edu/kbp/2014/elreading.html)
Entity linking
Slot filling (http://surdeanu.info/kbp2014/def.php)
Essentially you're asking for a tool that amounts to an AI-complete system for language/text processing, so I don't really think such a tool exists as of yet. Maybe the closest is IBM Watson.
If you're looking for a field to look into, the fields are those above; but if you're looking for tools, wikification tools are most probably closest to what you might need. (http://nlp.cs.rpi.edu/paper/WikificationProposal.pdf)
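As a rough illustration of the entity-linking direction, candidate senses for an ambiguous mention can be pulled from Wikipedia's public search API; ranking those candidates against the sentence context is the hard part that the systems above address:
import requests

# fetch candidate Wikipedia entities for an ambiguous mention
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "opensearch", "search": "Titanic",
            "limit": 5, "format": "json"},
)
for title in resp.json()[1]:  # [1] holds the list of matching titles
    print(title)  # candidate senses: the film, the ship, albums, ...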

Determining customary distance unit from ISO 3166 country code

ISO 3166 defines country codes such as GB, US, FR or RU.
I would like a reasonably definitive association from these country codes to the customary unit of measure for distances between places in those countries.
Specifically on iOS and OS X, the country code can be retrieved from NSLocale:
[[NSLocale currentLocale] objectForKey: NSLocaleCountryCode];
NSLocale also provides a way to see if a country uses metric or non metric units:
const bool useMetric = [[[NSLocale currentLocale] objectForKey: NSLocaleUsesMetricSystem] boolValue];
However, this is not sufficient. For example, in Great Britain (GB) the metric system is widely used, but distances between places continue to be officially measured in miles rather than kilometres.
I also faced this problem :-)
Countries which use the metric system but still use miles:
GB is the only exception, still using miles instead of kilometres.
Note: Canada has also started using kilometres for road transport, although it still uses miles for train and horse transport.
Countries which do not use the metric system:
Liberia, Myanmar, and the United States of America.
Note: Myanmar (formerly Burma) is planning to move to the metric system. Currently, Myanmar uses its own system, different from both imperial and metric.
In my app, I check whether the country uses imperial or metric units:
if metric, then assign km for all countries except Britain
if imperial, then assign miles for all countries except Burma
if Burma, then assign the Burmese unit
if Britain, then assign miles
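A minimal sketch of that logic in Python (the unit strings and the Burmese fallback are illustrative assumptions; on iOS the same branches would be written in Objective-C or Swift):
def distance_unit(country_code, uses_metric):
    """Pick the customary road-distance unit for an ISO 3166 country code."""
    if country_code == "GB":  # metric overall, but road distances are in miles
        return "mi"
    if country_code == "MM":  # Myanmar/Burma uses its own customary units
        return "burmese"
    return "km" if uses_metric else "mi"

print(distance_unit("US", False))  # mi
print(distance_unit("GB", True))   # mi
print(distance_unit("FR", True))   # km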
A chart showing countries using miles per hour for road speeds is available. It cites Wikipedia's articles on miles per hour as its source, which has the following to say:
These include roads in the United Kingdom,[1] the United States,[2] and UK and US territories; American Samoa,[3] the Bahamas,[4] Belize,[5] British Virgin Islands,[6] the Cayman Islands,[7] Dominica,[8] the Falkland Islands,[9] Grenada,[10] Guam,[11] Burma,[12] The N. Mariana Islands,[13] Samoa,[14] St. Lucia,[15] St. Vincent & The Grenadines,[16] St. Helena,[17] St. Kitts & Nevis,[18] Turks & Caicos Islands,[19] the U.S. Virgin Islands,[20][21] Antigua & Barbuda (although km are used for distance),[22] and Puerto Rico (same as former).[22]
I don't see a way to download this as data keyed from ISO3166 country code, but it's not a huge task to compile one.
I'll leave this answer unaccepted in case a better suggestion is available.
Officially, road distances in the UK are in kilometres, but road signs are in miles. Confusing? Yes! When a road engineer gets a plan of a road, everything is in kilometres, and government statistics are in kilometres, but road signs and car odometers are in miles. See https://en.wikipedia.org/wiki/Driver_location_sign for more info.

How do you extract various meanings of a certain word

Given "violence" as input would it be possible to come up with how violence construed by a person (e.g. physical violence, a book, an album, a musical group ..) as mentioned below in Ref #1.
Assuming if the user meant an Album, what would be the best way to look for violence as an album from a set of tweets.
Is there a way to infer this via any of the NLP API(s) say OpenNLP.
Ref #1
violence/N1 - intentional harmful physical action.
violence/N2 - the property of being wild or turbulent.
Violence/N6 - a book from Neil L. Whitehead; nonfiction
Violence/N7 - an album by The Last Resort
Violence/N8 - Violence is the third album by the Washington-based Alternative metal music group Nothingface.
Violence/N9 - a musical group which produced the albums Eternal Nightmare and Nothing to Gain
Violence/N10 - a song by Aesthetic Perfection, Angel Witch, Arsenic, Beth Torbert, Brigada Flores Magon, etc on the albums A Natural Disaster, Adult Themes for Voice, I Bificus, Retribution, S.D.E., etc
Violence/N11 - an album by Bombardier, Dark Quarterer and Invisible Limits
Violence/N12 - a song by CharlElie Couture, EsprieM, Fraebbblarnir, Ian Hunter, Implant, etc on the albums All the Young Dudes, Broke, No Regrets, Power of Limits, Repercussions, etc
Violence/N18 - Violence: The Roleplaying Game of Egregious and Repulsive Bloodshed is a short, 32-page roleplaying game written by Greg Costikyan under the pseudonym "Designer X" and published by Hogshead Publishing as part of its New Style line of games.
Violence/N42 - Violence (1947) is an American drama film noir directed by Jack Bernhard.
Pure automatic inference is a little too hard in general for this problem.
Instead we might use:
Resources like WordNet, or a semantic dictionary.
For languages other than English you can look at the EuroWordNet (non-free) dataset.
To get more meanings (e.g. the album sense), process a well-managed resource like Wikipedia. Wikipedia has a lot of meta information that is very useful for this kind of processing.
The reliability of the process is achieved by combining as many data sources as possible and processing them correctly, with specialized programs.
As a last resort you may try hand processing/annotation. Long and costly, but useful in an enterprise context where you need only a small part of a language.
No free lunch here.
If you're working on English NLP in Python, you can try the NLTK WordNet API like this:
from nltk.corpus import wordnet as wn

query = 'violence'
for ss in wn.synsets(query):
    print(query, str(ss.offset()).zfill(8) + '-' + ss.pos(), ss.definition())
If you're working on other human languages, maybe you can take a look at the open wordnets available from http://casta-net.jp/~kuribayashi/multi/
NOTE: the reason for str(ss.offset()).zfill(8)+'-'+ss.pos() is that it serves as the unique ID for each sense of a specific word, and this ID is consistent across the open wordnets for every language. The first 8 digits give the offset and the character after the dash is the part of speech of the sense.
Check this out: Twitter Filtering Demo from Idilia. It does exactly what you want by first analyzing a piece of text to discover the meaning of its words and then filtering the texts that contain the sense that you are looking for. It's available as an API.
Disclaimer: I work for Idilia.
You can extract all contexts "violence" occurs in (context can be a whole document, or a window of say 50 words), then convert them into features (using say bag of words), then cluster these features. As clustering is unsupervised, you won't have names for the clusters, but you can label them with some typical context.
Then you need to see which cluster the "violence" in the query belongs to, either based on other words in the query which act as context, or by asking explicitly (Do you mean violence as in "...." or as in "...."?).
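A minimal sketch of that pipeline with scikit-learn (the toy contexts and the cluster count are illustrative assumptions):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# toy contexts in which "violence" occurs
contexts = [
    "riots erupted into violence on the streets last night",
    "police reported violence after the football match",
    "Violence is the third album by the metal band Nothingface",
    "the band recorded Violence with a new producer",
]
X = CountVectorizer(stop_words="english").fit_transform(contexts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # contexts sharing a label fall into the same sense cluster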
This will be very difficult, because proper-noun uses of the word 'Violence' will be incredibly infrequent as a proportion of all words, and their frequency distribution is likely highly skewed in some way. We run into these problems almost any time we want to do some form of named-entity disambiguation.
No tool I'm aware of will do this for you, so you will be building your own classifier. Using Wikipedia as a training resource as Mr K suggested is probably your best bet.
