Find the maximum element in stream of integers - amazon

There is a king. On his son's birthday he decides to find the most beautiful girl in the kingdom for his son to marry, so he summons all the girls in the kingdom. The girls line up in a queue and are presented to the king one at a time. The king can either keep a girl or send her away. Once a girl is sent away, she cannot be called back to be presented again. Devise a strategy so that the king selects the most beautiful girl he can: not necessarily the most beautiful overall, but the best he can achieve under these rules.
The problem reduces to a simpler statement: given a stream of incoming integers, how can you select the maximum element? At any instant you have only a single integer, and no information about future values is available.

See: Secretary problem

To add to what Mitchnull said: you cannot solve this problem deterministically. Solutions to this problem are probabilistic. To see why, simply consider that the most beautiful girl may be the last one in the queue.
All solutions I know of for this problem depend on N, the number of girls, which is impractical in many scenarios.
The problem is not easily solved during an interview; it relies on some nice mathematical tricks that are not easy to find.
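As an illustration of a strategy that depends on N, here is a minimal sketch of the classic stopping rule for the secretary problem: look at roughly the first N/e values without choosing any, then take the first later value that beats everything seen so far. The function name and the sample stream are made up for this example. The rule finds the true maximum with a probability that approaches 1/e (about 37%) as N grows; it does not guarantee it.

```python
import math

def pick_with_stopping_rule(stream, n):
    """Observe the first ~n/e values, then take the first value that beats them all."""
    cutoff = int(n / math.e)   # length of the observation-only phase
    best_seen = -math.inf      # best value seen during the observation phase
    value = None
    for index, value in enumerate(stream):
        if index < cutoff:
            best_seen = max(best_seen, value)
        elif value > best_seen:
            return value       # first candidate better than everything observed
    return value               # nothing beat the observed best: stuck with the last one

# Hypothetical stream of 8 values; the first 2 are only observed.
choice = pick_with_stopping_rule([3, 1, 4, 1, 5, 9, 2, 6], 8)
```

Note that the sketch settles for 4 here even though 9 arrives later, which is exactly the "not necessarily the most beautiful" caveat in the question.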

Related

Exclude text of more than one tag using Beautifulsoup

I want to scrape text from the website below using BeautifulSoup, but not all of the text. I want to exclude the following:
1- text contained in a link
2- text contained in, or describing, the images
3- the final sentences that contain words such as 'Disclosure'
I tried the following, but it did not work properly, so any help would be really appreciated:
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.cnbc.com/2020/07/16/perseverance-is-key-for-gen-z-to-succeedand-create-change-in-the-world.html')
soup = BeautifulSoup(r.text,'lxml')
txt = ''
for row in soup.find_all('div', {"class": "group"}):
    if row.a:
        continue
    txt += ''.join(row.text)
print(txt)
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.cnbc.com/2020/07/16/perseverance-is-key-for-gen-z-to-succeedand-create-change-in-the-world.html')
soup = BeautifulSoup(r.text,'lxml')
divs = soup.find_all("div", class_="group")
data = []
for div in divs:
    p = div.find_all("p")
    for i in p:
        # Remove links and images from the paragraph before reading its text.
        if i.find_all("a"):
            for j in i.find_all("a"): j.extract()
        if i.find_all("img"):
            for j in i.find_all("img"): j.extract()
        # Skip paragraphs that mention a disclosure.
        if "disclosure" not in i.text.lower(): data.append(i.get_text(strip=True))
print("\n".join(data))
Output:
CNBC's "College Voices 2020" is a series written by CNBC summer interns from universities across the country about coming of age, launching new careers and job hunting during a global pandemic. They're finding their voices during a time of great social change and hope for a better future. What money issues are they facing? How are they navigating their student loans? How are they getting work experience, networking and applying for jobs when so many opportunities have been canceled or postponed? How important is diversity and a company's values to Gen Z job seekers?
In life, challenges arise, but they are meant to be conquered through perseverance, and never giving up. This is something I was taught at a young age. As a kid, I ran track and field, which presented obstacles — both literally and psychologically.
As much as I wanted to believe in never giving up, the notion of doing so always lingered in the back of my mind during harsh practices and races that did not go my way. I did not like losing (even today!), but I especially did not like knowing that I could put everything I had down on the line, and still come up short. It was through this that I realized hard work and dedication do not always bring wins, but they do bring a spirit of perseverance, and from this perseverance — hope.
I did not prepare for the world we are in today. No one did. Coronavirus has devastated the economy — the unemployment rate has skyrocketed and the job market is the worst since the Great Depression. Graduating, finding that first job and launching your adult life are difficult enough but add in all of this and it can be overwhelming. Yet, just like how I used to jump over hurdles as a runner back in high school, all of these issues are just obstacles that we need to jump over in order to press forward with our lives and our careers.
It all comes down to one thing: perseverance.
The coronavirus has changed all of our lives: The way we consume, go outside, and work have all changed because of this pandemic. We need perseverance more now than ever!
One obstacle in particular that I believe my generation will have to overcome is social justice. Through the death of George Floyd, protests have occurred all over the country. What many are hoping, including myself, is this wave of protests help bring not only light, but true change in terms of how African Americans are treated in the system and everywhere – including the workplace. This includes being equally paid, better represented in positions of power and more financially supported on the local level. As an African American, I know the continuous struggle of not only having to be your best in every place you step foot in, but knowing a mistake could be the spark that confirms a bias or stereotype that is the product of generations worth of racism and abuse. I hope that as bad as this coronavirus has been, and the protests that have come from generations worth of frustration and anger … that perseverance and goodness come out through it.
More From Invest in You:
Another hurdle I believe my generation has to overcome is student-loan debt. This has always been an issue but now, many classes are being pushed online, but the cost of these classes has not gone down. The price tag that comes along with a four-year+ degree adds up to decades worth of paying back debt. (Thefor student loans of between $20,000 and $40,000, according to the Department of Education.) And sometimes you don't even wind up working in the field you have your degree in! I know personally that my first priority in finding a job is to be able to pay off my loans as quickly as possible. The ever daunting thought of carrying thousands of dollars worth of debt does not rest easy in my mind. But I know that the education I received has given me the opportunity to garner wealth for generations to come. My hope is that I can persevere through this battlefield called life with my newfound skills and knowledge.
And, of course, for all of us in Generation Z, there is the challenge of finding our first job out of college. What would be a daunting task already, has been heightened by the coronavirus. For me personally, I find that my job search has turned from wanting to find a Fortune 500 company job, to wanting a more local business. My thought behind this is finding ways to help businesses that want bright and hungry young people who might otherwise not get that talent because of the enticing income that comes from larger corporations. I hope to put my skills to work to help a company move toward a brighter tomorrow!
in the U.S. right now, according to the Labor Department — and the number is bound to go higher if restrictions on businesses stay where they are or increase. The struggle for all of these people to find good paying jobs will be reminiscent of the Great Depression fight for jobs — long lines and countless job applications to simply find a job to help pay bills. There are already so many people out of work and an entire generation graduating college and trying to join the work force — it becomes simple math to understand that not every one of us will get a job. This makes me concerned and frustrated but this obstacle will not stop me. The perseverance I learned growing up will help me to not give up when faced with challenges, no matter how big they are. I owe it to my family and friends to strive for success, because it is only then that I can be a better individual.
I also want my generation to do better for the environment. Like most people, I take the danger of climate change and global warming very seriously. What we do today will impact the way we are all able to survive in the future. As a result, the way that we work, operate and consume must change as well. Like many Gen Z young adults, I want the impact I make in my everyday life to help maintain our Earth, especially for the future generations that come after us.
Hard work and dedication don't always bring wins, but they do bring a spirit of perseverance. Perseverance brings hope, and hope brings forth action. As an African American, I think the future looks bright, but only because of the groundwork that my generation is putting forth to not let underrepresented voices go unheard. I truly believe that as hard as this next decade will be from a financial and job-status perspective, there will be positive change.
I believe a lot of people in Gen Z and generations to come won't just go for jobs in typical fields because of the money, they will look for jobs that can make a true impact — both professionally and personally. Having representation in fields that are overly male or overly white do not help people. In fact, we as a society end up working backwards when that happens. Rather, we need to extend our arms out to bring everyone in and be honest about the history of our country -- and where we wish to go from here. These hurdles will be just like the hurdles I once used to jump over as a track athlete. And just like it took hard work, dedication and perseverance in order for me to do my best to get over those hurdles, we will all have to do the same to get over these hurdles.
It's time to tap into the perseverance within all of us to get through these challenging times and make the world a better place for generations to come.
You can search for all <p> tags directly under tags with class group and extract the text non-recursively:
import requests
from bs4 import BeautifulSoup
url = 'https://www.cnbc.com/2020/07/16/perseverance-is-key-for-gen-z-to-succeedand-create-change-in-the-world.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# uncomment if you want to print the first header:
#print(soup.select_one('.group em').text)
for p in soup.select('.group > p')[:-3]:
    for t in p.find_all(text=True, recursive=False):
        print(t.strip())
Prints:
In life, challenges arise, but they are meant to be conquered through perseverance, and never giving up. This is something I was taught at a young age. As a kid, I ran track and field, which presented obstacles — both literally and psychologically.
As much as I wanted to believe in never giving up, the notion of doing so always lingered in the back of my mind during harsh practices and races that did not go my way. I did not like losing (even today!), but I especially did not like knowing that I could put everything I had down on the line, and still come up short. It was through this that I realized hard work and dedication do not always bring wins, but they do bring a spirit of perseverance, and from this perseverance — hope.
... and so on.
An alternative approach is to search for all the things you don't want, remove them using .extract(), and then loop through what is left. The benefit is that it may be easier to read and extend. You might also want to add validation to assert that certain phrases exist.
Here is an example:
from bs4 import BeautifulSoup #py -m pip install beautifulsoup4 --user
import requests
def provide_soup(url):
    r = requests.get(url)
    r.raise_for_status()
    return BeautifulSoup(r.text,'lxml')
def remove_noise(soup):
    # Drop every paragraph that starts with one of the known boilerplate phrases.
    noise_starting_phraze = ('CNBC\'s "College Voices 2020"', 'More From Invest in You:', 'SIGN UP:', 'CHECK OUT:', 'Disclosure:')
    paragraph = soup.find_all('p')
    for p in paragraph:
        if p.text.strip().startswith(noise_starting_phraze):
            p.extract()
def remove_key_point(soup):
    key_point = soup.find('div', {"class": "RenderKeyPoints-wrapper"})
    if key_point is not None:  # find() returns None when the box is absent
        key_point.extract()
def provide_content_as_text(soup):
    return ''.join([row.text for row in soup.find_all('div', {"class": "group"})])
soup = provide_soup('https://www.cnbc.com/2020/07/16/perseverance-is-key-for-gen-z-to-succeedand-create-change-in-the-world.html')
remove_key_point(soup)
remove_noise(soup)
results = provide_content_as_text(soup)
print(results)

How to compare different groups with different sample size?

I am plotting student data from different schools to see the difference between the numbers of male and female students in some majors. I am using Python. I have already plotted the data for some schools and, as I expected, the male numbers are clearly higher. Then I realized that each school has a different total number of students. Does my work make sense when the sample sizes are different? If not, could I have some suggestions for changes to make?
Consider an example: you have two classes, where the first has 2 men and the second has 20 men, along with their marks. Both men in the first class scored 90/100, while the 20 marks in the second class range from 40 to 80. Would it be correct to say "Well, the first class did much better on the test than the second"? Of course not.
To solve this problem, just take min(sample sizes). If that looks too small, throw that program out, because you do not have enough data to say anything. Also show the total sample size in a proxy legend, in a text annotation, or in the title. Either way, it will indicate how reliable your results are.
This question is about statistics rather than programming, but I will try to answer.
An important question I could not answer from your post: what are you doing this for? If your question is "Are there more men than women in the population (in this case, population = all students in the major program)?", then the individual schools are not important to you, and you can work with the samples as a single sample (but don't forget to pool them).
But you may instead ask: "Is there any difference between the schools in the samples?" In this case, pooling is not correct. For this purpose I highly recommend a barh plot with stacked=True for each school. For normalization, just use percentages; then the difference in sample sizes won't be a problem.
Please, if you ask a question, include some code. Three rows of data and one plot from a sample would be very helpful.
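To make the percentage normalization concrete, here is a minimal sketch in plain Python. The school names and counts are invented for illustration, and the function name is hypothetical. It converts raw counts into shares and keeps the sample size next to them so that it can be shown in a label or title.

```python
def gender_shares(counts):
    """Convert raw (male, female) counts per school into percentages."""
    shares = {}
    for school, (male, female) in counts.items():
        total = male + female
        shares[school] = {
            "male_pct": 100 * male / total,
            "female_pct": 100 * female / total,
            "n": total,  # keep the sample size for the legend or title
        }
    return shares

# Hypothetical data: two schools with very different totals.
counts = {"School A": (30, 10), "School B": (300, 200)}
shares = gender_shares(counts)
for school, row in shares.items():
    print(f"{school} (n={row['n']}): {row['male_pct']:.0f}% male, {row['female_pct']:.0f}% female")
```

If the data lives in a pandas DataFrame, the resulting percentage columns could then be drawn with DataFrame.plot.barh(stacked=True), so that every school's bar spans the same 0 to 100 range.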

having some trouble writing and removing punctuation python

So I'm trying to clean up a file for my assignment by converting it to lowercase and removing punctuation.
content:
I'm a fan of soft serve ice cream and Guptill's Coney Express has delicious ice cream with many flavors. I've tried Kurver Kreme in Colonie, Tastee Freeze in Delmar and Country Drive Inn in Clifton Park, but I think that this place has the best soft serve ice cream. The portions are generous and the taste is very rich. For example, the brownie sundae is decadently delicious but likely too much for one person. They also have cupcake sundaes which I am looking to try soon!
The nurses here were very attentive and wonderful. I was able to have the same surgical nurse that I had for another procedure a few years ago. The anesthesiologist listened to my concerns and acting skillfully and compassionately. I felt well taken care of there.
Pretty great! Okay, so this place is obviously not Vegan since they have a bunch of cheese and egg offerings, BUT I see that they do offer plenty of vegan alternatives.\n\nI was sort of skeptical being here because the prices were pretty hefty, I felt. I guess it looked like a fast-food/diner joint, but it charged a little more. \n\nAnyway, their homemade hot sauce is AMAZING. I got the eggs benedict for dinner and J got an omelet. Both were really good. I do love their homefries.. but the next time I come here, I want onion rings or fries. Those onion rings looked amazing.\n\nLastly, the food came relatively quickly.\n\nNot a fan of the service. They tried to seat us at this edge facing the stoves, without asking, so I asked for a booth. Then at the booth, the server didn't refill waters very well but didn't feel bad emphasizing over and over whether or not we wanted their $5-7 desserts. Honestly, a slice of pie for $6.50? Veggie Galaxy, you are t r i p p i n !\n\nBut great food! (especially breaky!)
The Tale of the 4-Starred, Up Close and Personal Bar.\n\nThe Back Abbey is a bar that is located in the charming West Village part of Claremont. The area resembles the old shopping strip-meets-new shopping strip of San Luis Obispo. What they did -- or so I've been told -- is that they've managed to preserve the old strip, leaving it looking the way it presumably did in the 70s. Rhino Records is located in the older part. The Back Abbey in the new part. Both distinct areas of their aesthetic and consumer pull. \n\nThe Back Abbey specializes in Belgian draft beer. They carry some German and English beer, as well.\n\nWow, the selection of beer. Let me tell you! A lof of them I have never heard of. I tasted the Augustijn (Belgian) and the Hofbrau Original (German). Both were good, but I prefer the Hofbrau.They come in different sized glasses, with the Hofbrau coming in a long glass. The Augustijn was $9, and the Hofbrau $6. The Hofbrau came in the larger glass, so the Augustijn must be considered some premium brew for it to be in a smaller glass and come knocking down doors at a mighty $9. \n\nI liked the overall aesthetics of the place. There is an up-close-and-personal feel to the bar. Along with the standard bar and barstools set-up, there's a long table that goes from one end of the bar to the other. People sit across from each other and, since the tables are not wide, you're not far at all from the person sitting in front of you. Like I wrote earlier, up close and personal is what this bar has going for it.\n\nThey have a food menu, as well. Above their draft selection of beer, they have a chalkboard placed high up with a list of what they serve. For those veggies among us, they even make a veggie burger.\n\nLastly, the patrons all seem to know each other. Maybe it's the small town mentality. But it's not exactly a backwards, hick town. It's very much modern in certain ways, and people just love to talk. 
Couple that with brew, and there's a lot of talk going on.\n\nI had a discussion with someone at the bar, who bartends at a bar not too far from The Back Abbey, and he told me that the Augustijn is the oldest beer recipe known to man. (Ah, ok. Old is vintage. Vintage is expensive. Thus, Augustijn is expensive. And two + two = four. I got this down, partna'.) Again, can we give it up for \
As a vegan, I try to support places that are not wholly vegetarian, but that made the effort to put a vegetarian section on their menu, to show them that it's worth keeping. Thai Singha House does indeed have a section of their menu that reads vegetarian, but, unfortunately, their definition of vegetarian is different from mine.\n\nAll of the vegetarian curries, I was fortunately informed upon asking, do indeed contain fish sauce, which is an ingredient in the curry itself, and cannot be omitted.\n\nMy server suggested the pad thai as a fish-free vegetarian alternative, which I ordered, holding the egg. When it arrived, there appeared to be tofu in there and no egg, which was a good sign. But after a few blind, trusting bites I quickly realized that there was also chicken or pork mixed in as well. My server graciously took it back and honored my request for a house salad, of which I wasn't charged.\n\nI don't blame Thai Singha House for not understanding the definition of vegetarian (sadly, I know a few \
On a recent visit to SLO, I got taken out to the Natural Cafe, which bears a distinct resemblance to some of the places I remember from growing up in Berkeley. Very natural woodsy, with just enough sprouts on the sandwich to add a little crunch. It's casual, with counter service only.\n\nMy turkey burger was not too dry, but the guacamole and special sauce helped it along in the moistness department. I didn't walk away from the meal feeling all gross and over full like you would with a normal burger and fries. Yeah for whole grains and veggies!
Great place! \n\nI have to say the menu and the outdoor seating keep us coming back. The food is good -- had breakfast both times but some friends had lunch items. Definitely a great selection. We've been at off-peak times so no waiting and better service. \n\nAll in all, it's no DZ Akins but it's definitely worth trying!
Went to Rebecca's Cafe today during my lunch break today. I have to say, my sandwich was delicious! I had a Chicken Caprese panini (grilled chicken, tomato, mozzarella cheese, spinach, and balsamic vinaigrette). The staff was friendly and the service was quick, however the price is a little high for my taste ($7 or so for the sandwich only). Overall, I would definitely go back to try some different items (or even get the same sandwich), but it won't be too often.
The only thing I've ever eaten at Time-Out is their ice cream cookie sandwich thing, and that's really all I need. It's about a half-gallon of ice cream sandwiched between two large-ish chocolate chip cookies. I don't know if they make it, or if it comes off a truck, or what... but it's frozen so solid you usually have to wait a few minutes before chowing down so you don't break a tooth.\n\nIf nothing else, they are one of the few old standards on Franklin Street that have not succumbed to the high-end shopping takeover, so they are worth being excited about.\n\nDo watch out for drunken Tarheels, though. This place has more disgusting drunkenness issues late at night than any bar I've been in.
Pros: Very clean and pretty little place. Really sweet-natured and attentive servers. Clean bathrooms (Haha it says a lot, ok?) Reasonable prices \n\nCons: Pho broth started ok but after a while starts to taste bitter and weird and left my mouth really dry. Meat is kinda chewy.\n\nThis place could have been SO AWESOME. A pho place so close to campus is one of those \
This is my favorite place of all time. I've driven from LA to SLO just to eat here. The barbecue sauce is AMAZING! Everything here is AMAZING! I can never decide what to get when I go here. Here's what I rotate between; tri tip sandwich, chicken wings and the ABC burger. Their fries are delicious also! Wings are fried and spicy. ABC burger is ten times more delicious then In N Out. I don't even know what to say about the trip tip sandwich. Words can not describe it. If you are a meat eater, you'll love it. It's bread, bbq sauce and like 2 inches of steak. Just amazing! My mouth is watering trying to describe the food.
After enduring years of crappy, undercooked, bad pizza in and around Ann Arbor (a la pizza house, cottage inn, and the like), Silvio's brings real pizza to town. Real pizza should be thin, have a crispy, bubbly, sourdough crust and be full of flavor, not grease, and this is the only place to get it.\n\nDon't be fooled by the \
The average rating for this place is right on point: a-ok. We came here with a relatively large group on a Friday night a little after ten o'clock. The first floor of the place was packed, but we found plenty of space upstairs in the corner, which is a pretty weird room considering how removed it is from everything. Or waiter was good about taking everyone's drink orders and being patient about it, though at one point he gave the right drinks to the wrong people. He was also cool about serving us while we played a stupid game with post its on our foreheads. We didn't order any food, but I was impressed that their kitchen is open so late and the menu is pretty big. \n\nWe were having a fun time until the band came on down stairs. The blaring sound was inescapable, even though we were kind of removed from where everything was happening. The music was so loud it was almost impossible to hear what anyone was saying right next to you. In case you're planning to pay with Discover, they DO accept it; not so cooperative on the check splitting though.
For the grand finale of my LA trip, we head over to Firestone Grill immediately after Splash Cafe.\n\nI like this place for many reasons. Obviously the tri-tip sandwich was just off da hook delicious even after clam chowder & calamari literally 5 mins ago. I also think this is one of the largest, most laid back sports bar/grills I've been to. The ladies at the register and the bartenders were very friendly in answering my questions. There is a large dining area inside and the humongous patio outside, but I don't doubt that people may end up spilling onto the sidewalk during game days as it was already quite packed for a Sunday evening. Despite the noise factor, I felt there was good energy from everyone and I was pretty comfortable. They have plenty of wide screen LCD TVs hung along the walls for your viewing pleasure (and the picture is clear :P). I don't think they have servers so you have to walk up to the counter or bar for service. Not a big deal to me IMO. \n\nSome might say that it was premature to give this a 5 star rating, but I have faith that the next time I visit (and I will go back), I will not be disappointed.
I've used this service for years. The best parts are that it tends to be on time and that there are bike racks on the front of the buses. Unfortunately the drivers must navigate roads filled with undergraduates on bicycles who think they are immortal. I've seen some close calls, but the drivers get serious kudos for being able to drive in that environment all day.\n\nThe drivers tend to be helpful and friendly, though some of them don't speak excellent English. There are a few routes I get a bit car-sick on, but that is due to the twisty roads and frequent stopping-accelerating. \n\nThe routes change, especially if there is construction going on. Make sure you check to see what is running during the summer and holidays. I've waited for shuttles that weren't going to come for days before.
Cannot believe how highly rated this place is, wow. Takes forever for them to get orders out even when its super slow. Our pizza took something like 65 minutes from the time we ordered it. Now i will wait that long if the pizza is going to be amazing, like at pizza popilis down in greek town. But this pizza was average at best. I think that the gluten free offerings are great but I really expected alot more. Great renovations btw the place looks amazing.
We went here a few weeks ago on the premiere weekend of Horrible Bosses. Loved that movie! \nMy wife wanted to come to a real movie theater that was more of a classic than one of these \
One of the best nights of my life. Hands down.\n\nIt was just amazing... The amount of people that gathered all for the same thing, The music pumping through your body everywhere you walked, The day turned to night, and the fireworks that illuminated the sky as you danced without a care to the blissful beats of Deadmau5 and others. It was a night that I will never forget.\n\nYes, it was not perfect, but what event of that size ever has been. I mean you don't show up to the largest rave in america and not expect there to be a shit load of people there. The numbers have been increasing every year, so why would this one be any different? If not, it would only be even bigger, since the biggest rising dance artist of the year, and the biggest name in trance are headlining...\n\nAnd as for all that crap about fence jumping and people getting hurt, and that one girl even died!.. OK, my condolences go out to that girls family, having a death in the family is hard, i cant even imagine it being your child.... BUT you cannot blame her taking drugs on an event, any more than video-games for kids shooting other kids. Its their own ignorance and apathy. As for all those people who got hurt. More than likely their own fault. Some 250 out of 180,000 people messed up, and now theres this huge dim view on raves. Cause there have been less riots and fights due to hockey, or football, or basketball, etc. right? (note: sarcasm above)\n\nMy final word/opinion: Definitely the best massive I've ever been to. Just as good, if not better than last year! Can't wait for next year!
I was in a desperate need of a good falafel and once again Yelp didn't let me down.\n\nOnce in the small deliciously smelling restaurant, I decided to go with the lamb gyro. I read many good things and decided to try it out, and was praying it wouldn't turn my entire mouth yellow(as stated by a few yelpers). \n\nI am happy to report that the gyro was deliciously seasoned and had huge pieces of lamb wrapped up inside. I couldn't tell you if my mouth turned yellow because I made it a point to not
file_input = open('yelp.dat', 'r')
convert_lowercase = [line.lower() for line in file_input]
with open('yelp.dat', 'w') as out:
    out.writelines(sorted(convert_lowercase))
import string
remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))
with open('yelp.dat') as file_input:
    f = file_input.read().translate(remove)
I ran it, but when I checked the file there was still punctuation. Where did I go wrong?
A couple of issues I noticed. In your code, you open your file with file_input = open("file", "r"). This opens the file in read mode, but you don't receive the contents of this file until you do file_input.read(). Additionally, when using open() without a context manager, you should close the file with file_input.close(). Also, the result of translate() is only assigned to f and never written back to the file, which is why the punctuation is still there. Some code that would work:
import string
file_input = open("yelp.dat", "r")
file_content = file_input.read() # Read the contents of the file
convert_lowercase = file_content.lower() # No need to convert each line individually.
file_input.close() # Close the file
for punctuation in string.punctuation: # Go through the punctuation and replace it with an empty string (remove it) in your content.
    convert_lowercase = convert_lowercase.replace(punctuation, "") # str.replace returns a new string, so reassign the result.
with open("yelp.dat", "w") as file:
    file.write(convert_lowercase) # This overwrites all current content that is inside the file.
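As a variation on the loop above, here is a sketch that removes all punctuation in a single pass with str.maketrans and str.translate. The helper name clean_text is made up for this example; it works on a string, so the file reading and writing around it stays the same as in the code above.

```python
import string

def clean_text(text):
    """Lowercase the text and delete every ASCII punctuation character."""
    # The third argument of str.maketrans lists characters to delete.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table)

sample = "Pretty great! I'm a fan."
cleaned = clean_text(sample)
print(cleaned)
```

Unlike the translation table in the question, this one leaves spaces and newlines alone, so words and lines stay separated.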

Making brown clusters

I was playing around with P. Liang's Brown clustering code. To give it a try, I induced clusters on the text of "Pride and Prejudice". The clusters that I got were not very good. Some examples:
"further agreeable attempting pleasing reference"
"exempt identification deductible Service returns"
"impertinence amazement amusing"
"addresses astonished openly insincere conceit impertinent"
Do I need to perform some preprocessing (like removing stopwords or lemmatizing) before inducing the clusters?
In my experience you get much better clusters if you use larger values of k. You can then use the path prefixes to trim down the number of clusters.
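A minimal sketch of the prefix idea: Brown clustering assigns each word a bit-string path in a binary tree, so truncating every path to the same length merges neighbouring clusters into coarser ones. The bit strings and words below are invented for illustration, and the function name is hypothetical; it takes (path, word) pairs that you have already read from the clustering output.

```python
from collections import defaultdict

def merge_by_prefix(path_word_pairs, prefix_len):
    """Group words whose cluster paths share the same leading bits."""
    merged = defaultdict(list)
    for path, word in path_word_pairs:
        merged[path[:prefix_len]].append(word)
    return dict(merged)

# Hypothetical paths: four fine-grained clusters collapse into two coarse ones.
pairs = [("0010", "sister"), ("0011", "brother"),
         ("0110", "amusing"), ("0111", "pleasing")]
coarse = merge_by_prefix(pairs, 2)
print(coarse)
```

A shorter prefix gives fewer, broader clusters, so prefix_len can be tuned after a single run with a large k.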
ybisk has a good suggestion: when I tried to replicate your experiment, I had better results with larger clusters. Some clusters are difficult to interpret, but there were a few with clear patterns, like this one for relations:
own dear sister father mother friend sister, uncle sisters aunt
sister's daughter manners mother, former father, brother aunt,
daughters mother's dear, friends spirits cousin daughter, husband
Catherine, brother, sister. own, father's feelings, friend, ladyship
eldest thoughts friend. Catherine's sisters, side, marriage, opinion,
friends, acquaintance, daughters, dearest wife, daughter. vanity
cousin,
Lemmatizing and removing punctuation/capitalization would probably improve the clusters (I'm noticing a lot of duplicate words with trailing commas/periods in my results). I'm not sure removing stopwords would help, since they could contain useful contextual information (e.g. day names will appear near words like "on" more often).
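Here is a small sketch of the punctuation and capitalization cleanup, applied to the text before it is passed to the clustering tool. The function name is made up, and lemmatization is deliberately left out because it needs an external library. Each token is lowercased and has punctuation stripped from its ends, so "sister," and "sister." become the same word.

```python
import string

def normalize_line(line):
    """Lowercase each token and strip leading/trailing punctuation."""
    tokens = (token.strip(string.punctuation).lower() for token in line.split())
    # Drop tokens that consisted only of punctuation.
    return " ".join(token for token in tokens if token)

normalized = normalize_line("Catherine, brother, sister. own, father's feelings,")
print(normalized)
```

Only the ends of each token are stripped, so inner apostrophes such as the one in "father's" are kept.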

How do I get the context of a sentence?

There is a questionnaire that we use to evaluate the student knowledge level (we do this manually, as in a test paper). It consists of the following parts:
Multiple choice
Comprehension questions (e.g., Is a spider an insect?)
Now I have been given the task of building an expert system that will automate this. We have the correct answers for every question, but my problem is the "comprehension questions": I need to compare the context of a student's answer to the context of the correct answer.
I have already done some initial searching, but it seems like a really big task. What I have found so far is that I can do this through NLP, which is really new to me. Also, if I'm not mistaken, it seems that I have to build a dictionary of all the words that could appear in an answer.
Am I on the right track? If not, please suggest what I should do (what should I study?) or give me some links to the materials I need. Also, should I make my own dictionary? The words I will be using are in the Filipino language.
Update: Comprehension question
The comprehension section of the questionnaire contains one paragraph explaining a certain scenario. The questions are fairly simple. Here is an example:
Bonnie's uncle told her to pick apples from the tree. Picking up a stick, she poked the fruits so they would fall. In the middle of doing this, a strong gust of wind blew. Due to her fear of the fruits falling on top of her head, she stopped what she was doing. After this, though, she noticed that the wind had caused apples to fall from the tree. These fallen apples were what she brought home to her uncle.
The questions are:
What did Bonnie's uncle tell her to do?
What caused Bonnie to stop picking apples from the tree?
Is Bonnie a good fruit picker? Please explain your answer.
The possible answers that the answer key states are:
For number 1:
1.1 Bonnie's uncle told her to pick apples from the tree
1.2 Get apples
For number 2:
2.1 A strong gust of wind blew
2.2 She might get hit in the head by the fruits
For number 3:
3.1 No, because the apples she got were already on the ground
3.2 No, because the wind was what caused the fruits to fall
3.3 Yes, because it is difficult to pick fruits when it's windy.
3.4 Yes, because at least she tried
Those are the answers that were given to me. The system should compare the context of the student's answer to the context of the correct answer so that it can grade the student's answer.
One simplistic way of doing this that I can think of (off the top of my head) is to use a string similarity metric such as cosine or Jaccard similarity to measure how many keywords a test answer shares with the known correct answer.
Extracting these keywords automatically could be done with part of speech tagging using NLP. For example, you could extract all nouns (and possibly verbs). Then, representing each answer as a vector of keywords, you could compare the test vector with the known correct vector.
For example, in the second question, the vector for the two possible answers could be
gust, wind, blew
hit, head, fruits
An answer like "she picked up a stick" with the keywords: picked, stick would have a very low score as compared to something like "afraid of fruit falling on her head" with keywords: fruit, falling, head.
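The comparison above can be sketched with Jaccard similarity over keyword sets. This is only an illustration: the keywords are typed in by hand rather than extracted with a part-of-speech tagger, and the function names are hypothetical. A student's answer is scored against every accepted answer and the best match is kept.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two keyword sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def best_score(student_keywords, answer_key):
    """Score a student's keywords against each accepted answer; keep the highest."""
    return max(jaccard(student_keywords, key) for key in answer_key)

# Keyword sets for question 2, taken from the example above.
answer_key = [{"gust", "wind", "blew"}, {"hit", "head", "fruits"}]
wrong = best_score({"picked", "stick"}, answer_key)
close = best_score({"fruit", "falling", "head"}, answer_key)
print(wrong, close)
```

In this run "fruit" and "fruits" count as different keywords, so only "head" matches. Stemming or lemmatizing the keywords first would raise the score for answers like this one.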
Notes:
This can detect only wildly wrong answers. Wrong answers containing the right keywords would not be detected by this technique. :)
I'm not sure about non-English sentences. If that is the case, you might want to take every word in the answer as a keyword (after removing stopwords). This question might help as well.

Resources