How do you use the Unified Verb Index in Python?

I know that NLTK contains the VerbNet corpus; however, the Unified Verb Index combines information from it and three other useful sources. Is there any way to use this combined resource in Python?

Through NLTK you can certainly access FrameNet, VerbNet and PropBank. I haven't done any work with the OntoNotes Sense Groupings.
Take a look at the code below for an idea of how to get information out of these three resources. Each of them returns a list, so you can grab list elements individually and examine them in however much detail you need.
from nltk.corpus import verbnet as vn
from nltk.corpus import framenet as fn
from nltk.corpus import propbank as pb

word = 'take'

# VerbNet: class ids for this lemma
vn_results = vn.classids(lemma=word)
if not vn_results:
    print(word + ' not in verbnet.')
else:
    print('verbnet:')
    print(vn_results)

# FrameNet: frames whose lexical units match the lemma
fn_results = fn.frames_by_lemma(word)
if not fn_results:
    print(word + ' not in framenet.')
else:
    print('framenet:')
    print(fn_results)

# PropBank: rolesets for the base form (raises ValueError if the word is missing)
pb_results = []
try:
    pb_results = pb.rolesets(word)
except ValueError:
    print(word + ' not in propbank.')
if pb_results:
    print('propbank:')
    print(pb_results)
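Each lookup returns a plain list, so you can inspect individual entries further. Here is a minimal sketch of how you might drill in (the attribute names below, such as .name, .ID and attrib['id'], are what recent NLTK versions expose, so treat them as assumptions):
# Sketch: inspect individual entries from the lookups above
if vn_results:
    # pretty-print the first VerbNet class (members, thematic roles, frames)
    print(vn.pprint(vn_results[0]))
if fn_results:
    # FrameNet frames carry a name and a numeric ID
    print(fn_results[0].name, fn_results[0].ID)
if pb_results:
    # PropBank rolesets are XML elements with an 'id' attribute
    print([rs.attrib['id'] for rs in pb_results])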

Related

soup: extract all paragraphs with a specific class excluding those that are in tables

I have a messy old MCQ Word document that I converted to HTML so I can extract the MCQs cleanly and use them to create a Microsoft Form.
The question sets that I want to extract the MCQs from can be obtained here.
What I want is to convert this file to look something like this (here).
I wrote the following code to extract the paragraphs I need, but it also extracts the paragraphs from the tables, which is not useful for building a list of questions and a list of potential answers for each question. My code so far is as follows:
from bs4 import BeautifulSoup
import os
from nltk.tokenize import RegexpTokenizer

# Read the converted .htm file in the CWD
file = [x for x in os.listdir() if '.htm' in x][0]
# Create a soup to parse the HTML
soup = BeautifulSoup(open(file), "html.parser")
# Find all paragraph elements that contain the required information
results = soup.find_all("p", class_="MsoNormal")
# Tokenizer used to check the number of words in a paragraph
tokenizer = RegexpTokenizer(r'\w+')
# Extract questions (paragraphs with more than one word)
Extract_questions = [x.text for x in results if len(tokenizer.tokenize(x.text)) > 1]
Could you please help me create the required docx file? I really do not know where to start.
This is by no means complete code, but it can give you a start:
import pandas as pd
from itertools import groupby
from bs4 import BeautifulSoup
from textwrap import wrap

with open("page.html", "r") as f_in:
    soup = BeautifulSoup(f_in.read(), "html.parser")

results = soup.select("body > div > .MsoNormal, body > div > .MsoNormalTable")

# Split the flat list of elements into groups, one group per question
groups = [group := []]
for r in results:
    if r.text.startswith("Question "):
        groups.append(group := [r])
    else:
        group.append(r)

for g in groups:
    for p in g:
        if p["class"] == ["MsoNormalTable"]:
            # Answer tables: parse with pandas and print as tab-separated text
            df = pd.read_html(str(p))[0].fillna("")
            print()
            print(df.to_csv(index=False, header=None, sep="\t"))
        else:
            t = p.get_text(strip=True).replace("\n", " ").strip()
            if (
                t
                and "Question " not in t
                and "L1EC" not in t
                and "Lesson " not in t
            ):
                print("\n".join(wrap(t, 70)))
    print("-" * 80)
Prints:
--------------------------------------------------------------------------------
The price of ABC Financial News is increased from $2.00 to $2.50; this
leads to an increase in the sales of a competing financial
magazine, XYZ Finance, which now sells 120,000 copies a week, up from
100,000 copies a week. The cross-price elasticity of demand is closest
to:
0.8
1.22
1.25
--------------------------------------------------------------------------------
The following table lists the market shares of three major firms in an
industry. The industry's three-firm Herfindahl-Hirschman Index
is closest to:
Firms Market Share
X 20%
Y 30%
Z 10%
0.14
0.33
0.6
--------------------------------------------------------------------------------
Over a period of 1 year, a country’s real GDP increases from $168
billion to $179 billion, and the GDP deflator increases from 115 to
122.
The increase in the country’s nominal GDP over the year is closest to:
6.55%
13.03%
4.34%
--------------------------------------------------------------------------------
Consider the following statements:
Statement 1: A government is said to have a trade deficit if its
expenditure exceeds net taxes.
Statement 2: An economy must finance a trade deficit by borrowing from
the rest of the world.
Which of the following is most likely?
Only Statement 1 is incorrect.
Only Statement 2 is incorrect.
Both statements are correct.
--------------------------------------------------------------------------------
I want to thank @Andrej Kesely, he really helped me get through this.
I was finally able to do it like so:
from bs4 import BeautifulSoup
import os
# from nltk.tokenize import RegexpTokenizer
from textwrap import wrap
import pandas as pd
from itertools import groupby
# Creating the docx
from docx import Document
from docx.shared import Inches

# Read the converted .htm file in the CWD
file = [x for x in os.listdir() if '.htm' in x][1]
# Create a soup to parse the HTML
soup = BeautifulSoup(open(file), "html.parser")
# Find all paragraph and table elements that contain the required information
results = soup.find_all("p", class_="MsoNormal")
results = soup.select("body > div > .MsoNormal, body > div > .MsoNormalTable")

# Split the flat list of elements into groups, one group per question
groups = [group := []]
for r in results:
    if r.text.startswith("Question "):
        groups.append(group := [r])
    else:
        group.append(r)

# Extract the question text and the solutions of each question
Questions = [Question := []]
Solutions = [Solution := []]
not_welcomed_phrases = ["Question ", "L1EC", "Lesson ", "L1R", "L100"]
for g in groups:
    # Text of the current question
    q = []
    # Number of tables in this group
    Numberoftables = len([p1 for p1 in g if "<table" in str(p1)])
    i_table = 0
    for p in g:
        # Plain paragraphs belong to the question text, not the MCQ answer table
        if p["class"] != ["MsoNormalTable"]:
            t = p.text.replace("\n", " ")
            if (t and not any(word in t for word in not_welcomed_phrases)):
                q.append(t)
        else:
            # Check whether the group has one table or two
            if Numberoftables == 1:
                # t1=p.text
                t1 = [x.text for x in p.select("td > p") if len(x.text) > 1]
                # print(t1)
                if t1:
                    Solutions.append(t1)
            else:
                if i_table == 0:
                    # Get the tables
                    Tables = [p1 for p1 in g if "<table" in str(p1)]
                    # Extract the first table into the question
                    Table1 = Tables[0]
                    df = pd.read_html(str(Table1))[0].fillna("")
                    q.append(df.to_csv(index=False, header=None, sep="\t"))
                    i_table = 1
                else:
                    # Extract the second table into the solutions
                    Table1 = Tables[1]
                    t1 = [x.text for x in Table1.select("td > p") if len(x.text) > 1]
                    if t1:
                        Solutions.append(t1)
    Questions.append(Question := ["\n".join(q)])

# Write the questions and answers into a new Word document
document = Document()
for i in range(2, len(Questions)):
    document.add_paragraph(Questions[i], style='List Number')
    document.add_paragraph('\t a. ' + Solutions[i-1][0])
    document.add_paragraph('\t b. ' + Solutions[i-1][1])
    document.add_paragraph('\t c. ' + Solutions[i-1][2])
document.save('NewStyleMCQ.docx')
Now one can simply use Microsoft Forms to turn this file into a form for students to use.

How to calculate TF-IDF values of noun documents excluding spaCy stop words?

I have a data frame, df, with text, cleaned_text, and nouns as column names. text and cleaned_text contain string documents, and nouns is a list of nouns extracted from the cleaned_text column. df.shape = (1927, 3).
I am trying to calculate TF-IDF values for all documents within df only for nouns, excluding spaCy stopwords.
What have I tried?
import spacy
from spacy.lang.en import English

nlp = spacy.load('en_core_web_sm')

# Subclassing is the recommended way to modify stop word lists from spaCy version 3.0 onwards
excluded_stop_words = {'down'}
included_stop_words = {'dear', 'regards'}

class CustomEnglishDefaults(English.Defaults):
    stop_words = English.Defaults.stop_words.copy()
    stop_words -= excluded_stop_words
    stop_words |= included_stop_words

class CustomEnglish(English):
    Defaults = CustomEnglishDefaults

nlp = CustomEnglish()

# Function to extract nouns from the cleaned_text column, excluding spaCy stopwords
def nouns(text):
    doc = nlp(text)
    return [t for t in doc if t.pos_ in ['NOUN'] and not t.is_stop and not t.is_punct]

# Calculate TF-IDF values for nouns, excluding spaCy stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

documents = df.cleaned_text
tfidf = TfidfVectorizer(stop_words=CustomEnglish)
X = tfidf.fit_transform(documents)
What am I expecting?
I am expecting an output as a list of tuples ranked in descending order;
nouns = [('noun_1', tf-idf_1), ('noun_2', tf-idf_2), ...]. All nouns in nouns should match those of df.nouns (this is to check whether I am on the right track).
What is my issue?
I am confused about how to apply TfidfVectorizer so that it calculates TF-IDF values only for the nouns extracted from cleaned_text. I am also not sure whether scikit-learn's TfidfVectorizer can calculate TF-IDF the way I am expecting.
Not sure if you're still looking for a solution. Here is an option that you might want to go ahead with.
First of all, by default TF-IDF takes into account the entire set of words, not just nouns. Hence, you would need to implement a custom TF-IDF function to apply results only to nouns. The following is a good reference on how TF-IDF works internally: https://www.askpython.com/python/examples/tf-idf-model-from-scratch
Instead of running the tf_idf function (as applied in the above URL) for all words of a sentence/document, you can just run it on the list of nouns you've extracted, i.e., change the code from:
def tf_idf(sentence):
    tf_idf_vec = np.zeros((len(word_set),))
    for word in sentence:
        tf = termfreq(sentence, word)
        idf = inverse_doc_freq(word)
        value = tf * idf
        tf_idf_vec[index_dict[word]] = value
    return tf_idf_vec
to:
def tf_idf(sentence, nouns):
    values = []
    for word in nouns:
        tf = termfreq(sentence, word)
        idf = inverse_doc_freq(word)
        value = tf * idf
        values.append(value)
    return values
You now have a "values" list corresponding to the list of "nouns" for each sentence. Hope this makes sense.
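The tf_idf variants above rely on a few helpers from the linked tutorial (termfreq, inverse_doc_freq, word_set, index_dict). Here is a minimal, self-contained sketch of what they might look like, assuming the corpus is the list of pre-tokenized noun lists from df.nouns (the exact definitions in the tutorial may differ):
import numpy as np

# Assumed corpus: one list of nouns per document, e.g. documents = df.nouns.tolist()
documents = [['price', 'magazine', 'demand'], ['market', 'share', 'index']]

# Vocabulary and word-to-index mapping used by the vectorized tf_idf version
word_set = sorted({w for doc in documents for w in doc})
index_dict = {w: i for i, w in enumerate(word_set)}

def termfreq(sentence, word):
    # Term frequency: occurrences of the word divided by the sentence length
    return sentence.count(word) / len(sentence)

def inverse_doc_freq(word):
    # Smoothed inverse document frequency over the whole corpus
    doc_count = sum(1 for doc in documents if word in doc)
    return np.log(len(documents) / (doc_count + 1))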

How to count specific terms in tokenized sentences within a pandas df

I'm new to Python and nltk, so I would really appreciate your input on the following problem.
Goal:
I want to search and count the occurrence of specific terminology in tokenized sentences which are stored in a pandas DataFrame. The terms I'm searching for are stored in a list of strings. The output should be saved in a new column.
Since the words I'm searching for are grammatically inflected (e.g. cats instead of cat) I need a solution which not only displays exact matches. I guess stemming the data and searching for specific stems would be a proper approach but let's assume this is not an option here, as we would still have semantic overlaps.
What I tried so far:
In order to further handle the data, I preprocessed it with the following steps:
Put everything in lower case
Remove punctuation
Tokenization
Remove stop words
I tried searching for single terms with str.count('cat') but this doesn't do the trick and the data is marked as missing with NaN. Additionally, I don't know how to iterate over the search word list in an efficient way while using pandas.
My code so far:
import numpy as np
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Function to remove punctuation
def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

# Target data where strings should be searched and counted
data = {'txt_body': ['Ab likes dogs.', 'Bc likes cats.',
                     'De likes cats and dogs.', 'Fg likes cats, dogs and cows.',
                     'Hi has two grey cats, a brown cat and two dogs.']}
df = pd.DataFrame(data=data)

# Search words stored in a list of strings
search_words = ['dog', 'cat', 'cow']

# Store stopwords from nltk.corpus
stop_words = set(stopwords.words('english'))

# Data preprocessing
df['txt_body'] = df['txt_body'].apply(lambda x: x.lower())
df['txt_body'] = df['txt_body'].apply(remove_punctuation)
df['txt_body'] = df['txt_body'].fillna("").map(word_tokenize)
df['txt_body'] = df['txt_body'].apply(lambda x: [word for word in x if word not in stop_words])

# Here is the problem space
df['search_count'] = df['txt_body'].str.count('cat')
print(df.head())
Expected output:
txt_body search_count
0 [ab, likes, dogs] 1
1 [bc, likes, cats] 1
2 [de, likes, cats, dogs] 2
3 [fg, likes, cats, dogs, cows] 3
4 [hi, two, grey, cats, brown, cat, two, dogs] 3
A very simple solution would be this:
def count_occurence(l, s):
    counter = 0
    for item in l:
        if s in item:
            counter += 1
    return counter

df['search_count'] = df.apply(lambda row: count_occurence(row.txt_body, 'cat'), 1)
You could then further decide how to define the count_occurence function. And, to search for all of the search_words, something like this will do the job, although it is probably not the most efficient:
def count_search_words(l, search_words):
    counter = 0
    for s in search_words:
        counter += count_occurence(l, s)
    return counter

df['search_count'] = df.apply(lambda row: count_search_words(row.txt_body, search_words), 1)
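If you prefer to keep it inside pandas, a roughly equivalent one-liner (my sketch, not part of the original answer) counts each token once if it contains any of the search words:
# Sketch: count tokens that contain at least one of the search words
df['search_count'] = df['txt_body'].apply(
    lambda tokens: sum(any(s in tok for s in search_words) for tok in tokens)
)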

Get all possible POS tags for a single word

I'm currently trying to get all possible POS tags of a single word using Python.
Traditional POS taggers give you back only one tag if you enter a single word.
Is there a way to get all possibilities?
Is it possible to search a corpus (e.g. Brown) for a specific word and not just for a category?
Kind regards & thanks for your help
You can get the POS tags using this approach, specifically for the Brown corpus:
import nltk
from nltk.corpus import brown
from collections import Counter, defaultdict

# x is a dict with the word as key and its observed pos tags as values
x = defaultdict(list)

# loop over the first 100 tagged words and their pos tags
for word, pos in brown.tagged_words()[:100]:
    if pos not in x[word]:  # append each tag only once
        x[word].append(pos)  # adding the key-value pair to x

# print the pos tags seen for the word 'further'
print(x['further'])
# ['RBR']
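If you also want to know how often each tag is used for a word, a small extension (my sketch, not part of the original answer) counts tags over the full Brown corpus; the imports are repeated so it is self-contained:
from collections import Counter
from nltk.corpus import brown

# Count how often each POS tag is assigned to 'further' across the whole Brown corpus
# (iterating over all tagged words can take a few seconds)
tag_counts = Counter(pos for word, pos in brown.tagged_words() if word.lower() == 'further')
print(tag_counts.most_common())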

How to get synsets using sentiwordnet and calculate their sentiment score

import nltk
from nltk.corpus import sentiwordnet as swn

print(swn.senti_synsets('slow'))
For this code, in Python 3.4.3, I am getting the output:
<filter object at 0x0806DE70>
But it should be like:
[SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'), \
SentiSynset('slow.v.03'), SentiSynset('slow.a.01'),SentiSynset('slow.a.02'), \
SentiSynset('slow.a.04'), SentiSynset('slowly.r.01'),SentiSynset('behind.r.03')]
I am really sorry if my question is vague or silly, but I am new to Python and NLTK and I am not getting this one. Also, how can I get the sentiment scores of these synsets using SentiWordNet?
You are using Python 3. In Python 3, the filter function returns a filter object instead of a list.
The senti_synsets method is defined in NLTK like this:
def senti_synsets(self, string, pos=None):
    from nltk.corpus import wordnet as wn

    sentis = []
    synset_list = wn.synsets(string, pos)
    for synset in synset_list:
        sentis.append(self.senti_synset(synset.name()))
    sentis = filter(lambda x: x, sentis)
    return sentis
Since you are using Python 3, the senti_synsets method therefore returns a filter object.
You can convert that filter object into a list:
synsets = list(swn.senti_synsets('slow'))
synsets
output
[SentiSynset('decelerate.v.01'),
SentiSynset('slow.v.02'),
SentiSynset('slow.v.03'),
SentiSynset('slow.a.01'),
SentiSynset('slow.a.02'),
SentiSynset('dense.s.04'),
SentiSynset('slow.a.04'),
SentiSynset('boring.s.01'),
SentiSynset('dull.s.08'),
SentiSynset('slowly.r.01'),
SentiSynset('behind.r.03')]
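To get the sentiment scores of these synsets, you can sum pos_score() and neg_score() over them, as in the snippet below: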
from nltk.corpus import sentiwordnet as swn

good = swn.senti_synsets('good', 'n')
posscore = 0
negscore = 0
for synst in good:
    posscore = posscore + synst.pos_score()
    negscore = negscore + synst.neg_score()
print(posscore)
print(negscore)
It is better to take an average of the scores.
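For example, here is a minimal sketch of that averaging (my addition, not from the original answer); it converts the filter object to a list so it can be both counted and iterated:
good = list(swn.senti_synsets('good', 'n'))
if good:
    avg_pos = sum(s.pos_score() for s in good) / len(good)
    avg_neg = sum(s.neg_score() for s in good) / len(good)
    print(avg_pos, avg_neg)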
