how can I get no more than three n-grams - python-3.x

I am finding the n-grams using the following function.
from nltk.util import ngrams
booksAfterRemovingStopWords = ['Zombies and Calculus by Colin Adams', 'Zone to Win: Organizing to Compete in an Age of Disruption', 'Zig Zag: The Surprising Path to Greater Creativity']
booksWithNGrams = list()
for line_no, line in enumerate(booksAfterRemovingStopWords):
tokens = line.split(" ")
output = list(ngrams(tokens, 3))
temp = list()
for x in output: # Adding n-grams
temp.append(' '.join(x))
booksWithNGrams.append(temp)
print(booksWithNGrams)
The output looks like:
[['Zombies and Calculus', 'and Calculus by', 'Calculus by Colin', 'by Colin Adams'], ['Zone to Win:', 'to Win: Organizing', 'Win: Organizing to', 'Organizing to Compete', 'to Compete in', 'Compete in an', 'in an Age', 'an Age of', 'Age of Disruption'], ['Zig Zag: The', 'Zag: The Surprising', 'The Surprising Path', 'Surprising Path to', 'Path to Greater', 'to Greater Creativity']]
However, I want no more that three n-grams. I mean I want the output to be like:
[['Zombies and Calculus', 'and Calculus by', 'Calculus by Colin'], ['Zone to Win:', 'to Win: Organizing', 'Win: Organizing to'], ['Zig Zag: The', 'Zag: The Surprising', 'The Surprising Path']]
How can I achieve that?

This is how you'd do it:
Logic:
Just count till three in the loop and break on i>2 (counts i=0,1,2 and breaks).
booksAfterRemovingStopWords = ['Zombies and Calculus by Colin Adams', 'Zone to Win: Organizing to Compete in an Age of Disruption', 'Zig Zag: The Surprising Path to Greater Creativity']
booksWithNGrams = list()
for line_no, line in enumerate(booksAfterRemovingStopWords):
tokens = line.split(" ")
output = list(ngrams(tokens, 3))
temp = list()
for i,x in enumerate(output):# Adding n-grams
if i>2:
break
temp.append(' '.join(x))
booksWithNGrams.append(temp)

Related

How does one extract the verb phrase in Spacy?

For example:
Ultimate Swirly Ice Cream Scoopers are usually overrated when one considers all of the scoopers one could buy.
Here I'd like to pluck:
Subject: "Ultimate Swirly Ice Cream Scoopers"
Adverbial Clause: "When one considers all of the scoopers one could buy"
Verb Phrase: "are usually overrated"
I have the following functions for subject, object, and adverbial clause:
def get_subj(decomp):
for token in decomp:
if ("subj" in token.dep_):
subtree = list(token.subtree)
start = subtree[0].i
end = subtree[-1].i + 1
return str(decomp[start:end])
def get_obj(decomp):
for token in decomp:
if ("dobj" in token.dep_ or "pobr" in token.dep_):
subtree = list(token.subtree)
start = subtree[0].i
end = subtree[-1].i + 1
return str(decomp[start:end])
def get_advcl(decomp):
for token in decomp:
# print(f"pos: {token.pos_}; lemma: {token.lemma_}; dep: {token.dep_}")
if ("advcl" in token.dep_):
subtree = list(token.subtree)
start = subtree[0].i
end = subtree[-1].i + 1
return str(decomp[start:end])
phrase = "Ultimate Swirly Ice Cream Scoopers are usually overrated when one considers all of the scoopers one could buy."
nlp = spacy.load("en_core_web_sm")
decomp = nlp(phrase)
subj = get_subj(decomp)
obj = get_obj(decomp)
advcl = get_advcl(decomp)
print("subj: ", subj)
print("obj: ", obj)
print("advcl: ", advcl)
Output:
subj: Ultimate Swirly Ice Cream Scoopers
obj: all of the scoopers
advcl: when one considers all of the scoopers one could buy
However, the actual depenency type .dep_ for the final word of the VP, "are usually overrated", is "ROOT".
So, the subtree technique fails, as the subtree of ROOT returns the entire sentence.
You are wanting to construct something more like a “verb group” where you keep with the root verb only certain close dependents like aux, cop, and advmod but not ones like nsubj, obj, or advcl.

list index out of range but it seems impossible since it's only after 3 questions

kanji = ['上','下','大','工','八','入','山','口','九','一','人','力','川','七','十','三','二','女',]
reading = ['じょう','か','たい','こう','はち','にゅう','さん','こう','く','いち','にん','りょく','かわ','しち','じゅう','さん','に','じょ']
definition = ['above','below','big','construction','eight','enter','mountain','mouth','nine','one','person','power','river','seven','ten','three','two','woman']
score = number_of_questions = kanji_item = 0
def question_format(prompt_type,lang,solution_selection):
global reading,definition,score,num_of_questions,kanji_item
question_prompt = 'What is the '+str(prompt_type)+' for "'+str(kanji[kanji_item])+'"? (Keyboard:'+str(lang)+')\n'
solution_selection = [reading,definition]
usr = input(question_prompt)
if usr in solution_selection[kanji_item] and kanji[kanji_item]:
score += 1
num_of_questions += 1
else:
pass
kanji_item += 1
while number_of_questions != 18:
question_format('READING','Japanese',[0])
print('You got ',score,'/',number_of_questions)
while number_of_questions != 36:
question_format('DEFINITION','English',[1])
print('You got ',score,'/',number_of_questions)
I can't get past 大. but I can't see where it's messing up. I've tried to change pretty much everything. "kanji_item" is supposed to give a common index number so that the answers can match up. It gets through the first two problems with no hassle, but for some reason refuses to accept my third problem.
Problems:
- wrong name using number_of_questions vs. num_of_questions
- wrong way to check truthyness if usr in solution_selection[kanji_item] and kanji[kanji_item]: - the last part is always True as it is a non empty string
- lots of globals wich is not considered very good style
It would be easier to zip your three list together so you get tuples of (kanji, reading, description) and feed 2 of those into your function depending on what you want to test. You do this 2 times, once for reading, once for description.
You can even randomize your list of tuples to get different "orders" in which questions are asked:
kanji = ['上', '下', '大', '工', '八', '入', '山', '口', '九', '一' , '人',
'力', '川', '七', '十', '三', '二', '女',]
reading = ['じょう', 'か', 'たい', 'こう', 'はち', 'にゅう', 'さん', 'こう', 'く',
'いち', 'にん', 'りょく', 'かわ', 'しち', 'じゅう', 'さん', 'に', 'じょ']
definition = ['above', 'below', 'big', 'construction', 'eight', 'enter', 'mountain',
'mouth', 'nine', 'one', 'person', 'power', 'river', 'seven', 'ten', 'three',
'two', 'woman']
import random
data = list(zip(kanji, reading, definition))
random.shuffle(data)
def question_format(prompt_type, lang, kanji, solution):
"""Creates a question about *kanji* - the correct answer is *solution*
Returns 1 if correct else 0."""
question_prompt = f'What is the {prompt_type} for {kanji}? (Keyboard: {lang})'
usr = input(question_prompt)
if usr == solution:
return 1
else:
return 0
questions_asked = 0
correct = 0
for (kanji, reading, _) in data:
correct += question_format('READING','Japanese', kanji, reading)
questions_asked += 1
print('You got ',correct,'/',questions_asked)
for (kanji, _, definition) in data:
correct += question_format('DEFINITION','ENGLISH', kanji, definition)
questions_asked += 1
print('You got ',correct,'/',questions_asked)
After zipping our list and shuffling them data looks like
[('山', 'さん', 'mountain'), ('女', 'じょ', 'woman'), ('力', 'りょく', 'power'),
('上', 'じょう', 'above'), ('九', 'く', 'nine'), ('川', 'かわ', 'river'),
('入', 'にゅう', 'enter'), ('三', 'さん', 'three'), ('口', 'こう', 'mouth'),
('二', 'に', 'two'), ('人', 'にん', 'person'), ('七', 'しち', 'seven'),
('一', 'いち', 'one'), ('工', 'こう', 'construction'), ('下', 'か', 'below'),
('八', 'はち', 'eight'), ('十', 'じゅう', 'ten'), ('大', 'たい', 'big')]

python3 RNG variables

been coding for 2 days and ran into a snag.
I wanted to make a fun little practice thingy to help some people practice their english.
Basically they type in the words they wanna practice and then the randomizer throws out sentences for them to read.
Issue is I want to get the grammer right, I have done two different "Pronoun" catagories and 2 different "verb" catagories.
How do I link them together but still retain the random element of never knowing what combo you would get but making it still follow the grammer rules.
Code is below, any help would be awesome =D
# coding: utf-8
# In[73]:
noun1 = ("apple", "peach", "plum", "cat", "dog", "mouse")
verb1 = ("drink", "eat", "swim", "kill", "kick", "hit", "die")
# verb + S
verbS = ("drinks", "eats", "swims", "kills", "kicks", "hits", "dies")
adj1 = ("blue", "black", "red","big", "small", "tall", "short")
direction1 = ("up", "in", "out", "behind", "infront of", "over")
#pronouns Capital
Pronoun1 = ("I", "You", "They", "We")
PronounS = ("He", "She","Tt")
#pronouns non Capital
pronoun1 = ("I", "you", "they", "we")
pronounS = ("he", "she","it")
# In[74]:
import random
# In[75]:
def sentence1():
print(random.choice(Pronoun1),end=" ")
print(random.choice(verb1),end=" ")
print("the",end=" ")
print(random.choice(adj1),end=" ")
print(random.choice(noun1),end=" ")
return"."
# In[82]:
print(sentence1())

Using specific elements from a list in different loops for a multiple choice test python 3.x

Basically i'm trying to create a multiple choice test that uses information stored inside of lists to change the questions/ answers by location.
so far I have this
import random
DATASETS = [["You first enter the car", "You start the car","You reverse","You turn",
"Coming to a yellow light","You get cut off","You run over a person","You have to stop short",
"in a high speed chase","in a stolen car","A light is broken","The car next to you breaks down",
"You get a text message","You get a call","Your out of gas","Late for work","Driving angry",
"Someone flips you the bird","Your speedometer stops working","Drinking"],
["Put on seat belt","Check your mirrors","Look over your shoulder","Use your turn signal",
"Slow to a safe stop","Relax and dont get upset","Call 911", "Thank your brakes for working",
"Pull over and give up","Ask to get out","Get it fixed","Offer help","Ignore it","Ignore it",
"Get gas... duh","Drive the speed limit","Don't do it","Smile and wave","Get it fixed","Don't do it"],
[''] * 20,
['B','D','A','A','C','A','B','A','C','D','B','C','D','A','D','C','C','B','D','A'],
[''] * 20]
def main():
questions(0)
answers(1)
def questions(pos):
for words in range(len(DATASETS[0])):
DATASETS[2][words] = input("\n" + str(words + 1) + ".)What is the proper procedure when %s" %DATASETS[0][words] +
'\nA.)'+random.choice(DATASETS[1]) + '\nB.)%s' %DATASETS[1][words] + '\nC.)'
+random.choice(DATASETS[1]) + '\nD.)'+random.choice(DATASETS[1])+
"\nChoose your answer carefully: ")
def answers(pos):
for words in range(len(DATASETS[0])):
DATASETS[4] = list(x is y for x, y in zip(DATASETS[2], DATASETS[3]))
print(DATASETS)
I apologize if the code is crude to some... i'm in my first year of classes and this is my first bout of programming.
list 3 is my key for the right answer's, I want my code in questions() to change the position of the correct answer so that it correlates to the key provided....
I've tried for loops, if statements and while loops but just cant get it to do what I envision. Any help is greatly appreciated
tmp = "\n" + str(words + 1) + ".)What is the proper procedure when %s" %DATASETS[0][words] + '\nA.)'
if DATASETS[3][words] == 'A': #if the answer key is A
tmp = tmp + DATASETS[1][words] #append the first choice as correct choice
else:
tmp = tmp + random.choice(DATASETS[1]) #if not, randomise the choice
Do similar if-else for 'B', 'C', and 'D'
Once your question is formulated, then you can use it:
DATASETS[2][words] = input(tmp)
This is a bit long but I am not sure if any shorter way exists.

python substring (slicing) compare not always working

`list1 = ["Arizona","Atlanta","Baltimore","Buffalo","Carolina","Chicago",
"Cincinnati","Cleveland","Dallas","Denver","Detroit","Green Bay","Houston",
"Indianapolis","Jacksonville","Kansas City","L.A. Chargers","L.A. Rams",
"Miami","Minnesota","New England","New Orleans","NY Giants","NY Jets",
"Oakland","Philadelphia","Pittsburgh","San Francisco","Seattle",
"Tampa Bay","Tennessee","Washington"]
a = "New Orleans at Oakland"
k = a.find("at")
print (k)
for n in range(0,31):
# b = list1[n]
# print(b[0:k-1]+" "+a[0:k-1])
idx = a.find(list1[n], 0, k-1)
if idx > 0:
print(n)
break
print ("awa team at index" + str(n+1))
for n in range(0,31):
idx = a.find(list1[n], k+2, len(a))
if idx > 0:
print(n)
break
print ("hom team at index" + str(n+1))`
I just started python 2 days ago and I cannot get this to work completely. The program finds the team in the second for loop correctly, but doesn't find the team in the first for loop. I put in the statements that are commented out to see if the strings were somehow truncated, but they are correct. Can anyone tell me what is wrong here?
There's no need to brute force the search. Python has methods that accomplish what you need.
list1 = ["Arizona", "Atlanta", "Baltimore", "Buffalo", "Carolina", "Chicago",
"Cincinnati", "Cleveland", "Dallas", "Denver", "Detroit", "Green Bay", "Houston",
"Indianapolis", "Jacksonville", "Kansas City", "L.A. Chargers", "L.A. Rams",
"Miami", "Minnesota", "New England", "New Orleans", "NY Giants", "NY Jets",
"Oakland", "Philadelphia", "Pittsburgh", "San Francisco", "Seattle",
"Tampa Bay", "Tennessee", "Washington"]
a = "New Orleans at Oakland"
# Create a list of the teams involved in the game
teams = a.split(" at ")
# Iterate through the teams involved in the game
for team in teams:
# The index() method returns the lowest index in list that obj appears
index = list1.index(team)
# If the the team was found then index is valid
if index:
print(index)
print(list1[index])
if you just want to have the index, you can use the .index() you do not have to "loop"
Example code:
list1 = ["Arizona","Atlanta","Baltimore","Buffalo","Carolina","Chicago",
"Cincinnati","Cleveland","Dallas","Denver","Detroit","Green Bay","Houston",
"Indianapolis","Jacksonville","Kansas City","L.A. Chargers","L.A. Rams",
"Miami","Minnesota","New England","New Orleans","NY Giants","NY Jets",
"Oakland","Philadelphia","Pittsburgh","San Francisco","Seattle",
"Tampa Bay","Tennessee","Washington"]
a = "New Orleans at Oakland"
a = a.split(' at ')
idx_home_team = list1.index(a[0])
idx_away_team = list1.index(a[1])
print(idx_home_team, idx_away_team)

Resources