Ratios of co-occurrence probabilities can encode meaning components - nlp

I'm studying NLP these days through CS224N, Stanford's NLP course, and I have a question about co-occurrence probabilities.
I can understand that the first and second rows each show a co-occurrence probability, but the last row is hard to understand. I can't figure out what it represents.
The lecture said that a meaning component is something like female to male, or king to queen, so I thought it would be gender in that example.
In the last row I thought the component was a condition, but it is hard to find a correlation between Large and Small.

Related

Which is the best Data Mining model to extrapolate known values to missing values in a table? (General question)

I am working on a little data mining project (I am still a Data Science student, not a professional). Maybe you can help me to choose a proper model for my task.
So, let's say we have a table with three columns and around 4000 rows:
YEAR  COLOR   NAME
1900  Green   David
1901  Yellow  Sarah
1902  Green   ???
1902  Red     Sarah
…     …       …
2020  Purple  John
Any value for any field can be repeated in the dataset (also Year values).
In the first two columns we don't have missing values, but we only have around 20% of the Name values in the third column. The Name value depends somewhat on the first two columns (not a causal relation).
My goal is to extrapolate the available Name values to the whole table and get a range of occurrences for each Name value (for example, in a boxplot).
I have imagined a process like the following, although I am not sure whether it makes sense statistically (any objections and suggestions are appreciated):
For every unknown NAME value, the algorithm randomly chooses one of the already known NAME values. The odds of a particular NAME value being chosen depend on the variables YEAR and COLOR. For instance, if 'David' values tend to be correlated with low Year values AND with 'Green' or 'Purple' values for Color, the algorithm gives 'David' a higher probability of being chosen when the input values for Year and Color are "1900, Purple".
When the above process ends, the number of occurrences of each name is counted.
The above process is applied 30 times and the results for each name are displayed in a boxplot.
However, I don't know which is the best model to implement an idea similar to this. I have drawn the process in a simple paint drawing:
Possible output for the task
What do you think would be a good approach to this task? I appreciate any help.
I think you have the process down; converting the data may be the first hurdle.
I would look at using OrdinalEncoder (from sklearn.preprocessing import OrdinalEncoder) to convert the data from categorical to numeric.
You could then use a random number generator to produce a number within the range defined by the encoding, which would randomly select a name.
Loop through this 30 times with a for loop to achieve the result.
It also looks like you will need to provide the ranking values for year and colour before building out your code. From there you would just provide bands within your for loop (for example, if year > 1985, etc.) to specify the names.
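The process described in the question (weighted random choice of a name, counted and repeated 30 times) could be sketched roughly as follows, using only the standard library. The weighting scheme here is an assumption made for illustration, not something given in the question: `name_weights`, the 0.2 penalty for a non-matching colour, and `year_scale` are all hypothetical choices, and the five rows are just the ones shown in the table.

```python
import random
from collections import Counter

# Sample rows from the question; None marks a missing NAME.
rows = [
    (1900, "Green", "David"),
    (1901, "Yellow", "Sarah"),
    (1902, "Green", None),
    (1902, "Red", "Sarah"),
    (2020, "Purple", "John"),
]

def name_weights(year, color, known, year_scale=20.0):
    """Hypothetical weighting: known rows with the same colour and a
    nearby year contribute more weight to their name."""
    weights = Counter()
    for k_year, k_color, k_name in known:
        color_factor = 1.0 if k_color == color else 0.2  # assumed penalty
        year_factor = 1.0 / (1.0 + abs(year - k_year) / year_scale)
        weights[k_name] += color_factor * year_factor
    return weights

def simulate(rows, runs=30, seed=0):
    """Fill every missing name by weighted random choice, count the
    occurrences of each name, and repeat `runs` times."""
    rng = random.Random(seed)
    known = [r for r in rows if r[2] is not None]
    results = []
    for _ in range(runs):
        counts = Counter(name for _, _, name in known)
        for year, color, name in rows:
            if name is None:
                w = name_weights(year, color, known)
                names = list(w)
                pick = rng.choices(names, weights=[w[n] for n in names])[0]
                counts[pick] += 1
        results.append(counts)
    return results

results = simulate(rows)
# One list of 30 counts per name, ready to feed into a boxplot.
per_name = {n: [c[n] for c in results] for n in ("David", "Sarah", "John")}
```

Each entry of `per_name` holds the 30 simulated counts for one name, which is the shape of data a boxplot needs.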

BLEU score value higher than 1

I've been looking at how BLEU score works. What I understood from the online videos + the original research paper is that BLEU score value should be within the range 0-1.
Then, when I started to look at some research papers, I found that the reported BLEU value is (almost) always higher than 1!
For instance, have a look here:
https://www.aclweb.org/anthology/W19-8624.pdf
https://arxiv.org/pdf/2005.01107v1.pdf
Am I missing something?
Another small point: what do the headers in the table below mean? Was the BLEU score calculated using unigrams, then unigrams and bigrams (averaged), and so on, or was each n-gram size calculated independently?
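As a rough illustration of the 0-1 range mentioned above, here is a minimal single-reference, sentence-level BLEU sketch (clipped n-gram precisions, geometric mean, brevity penalty). It is a simplification written for this note, not the scoring script used in the linked papers, and the example sentences are made up. Many papers report this value multiplied by 100, which is one plausible reason for seeing numbers above 1. The `max_n` argument follows the common cumulative convention (BLEU-2 averages unigram and bigram precision); whether the table in question uses that convention is an assumption.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at
    most as many times as it appears in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def bleu(candidate, reference, max_n=4):
    """Cumulative BLEU up to max_n with uniform weights, one reference."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
scores = {n: bleu(candidate, reference, max_n=n) for n in (1, 2, 3, 4)}
```

Every value in `scores` lies between 0 and 1; scaling by 100 gives the percentage-style figures often seen in result tables.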

Python based multi-label Classification

I have a data set like the one shown below, which in a real scenario will have between 10,000 and 1,000,000 rows.
There would be more columns but the core problem revolves round these two fields.
Known Labels
I have known categories -'Apple', 'Blueberry','Orange','Lettuce'
Dataset
import pandas as pd
# Sample data: the first four rows carry known categories, the rest do not.
df = pd.DataFrame({
    'ROWID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Category': ['Apple', 'Blueberry', 'Orange', 'Lettuce', 'Fruit',
                 'Salad', 'xyz', 'Fruit', 'Leaf', 'Avocado'],
    'Details': ['Eat one a day ,doctors keep away', 'Like it in a muffin',
                'Tastes yummy', 'Like it with salmon', 'Glass of a juice',
                'Ceser dressing on lettuce', 'Nothing in my basket',
                'Like it in a muffin', 'I like it it with salami',
                'Comes from Mexico'],
})
Problem:
I have to create one or more metrics using groupby on Category.
When the Category column has an unknown value, I need to read the text from 'Details' and predict the best-suited label for the category.
For example:
Salad -> Lettuce
Fruit (Row #5) -> Orange
Fruit (Row #8) -> Blueberry
Leaf (Row #9) -> Lettuce
It is understood that some of the rows cannot be categorized.
Help Needed:
I am new to data science algorithms and am looking for some guidance on identifying the right model to solve the problem.
Use Naive Bayes on the Details column. Before that, do a simple filter on the Category column to separate out the rows that already have known category values.
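A minimal sketch of that suggestion, assuming the rows with known categories serve as training data and the remaining rows are the ones to predict. It is a hand-rolled multinomial Naive Bayes with add-one smoothing in pure Python, written only for illustration; the helper names `tokenize`, `train`, and `predict` are hypothetical, and a library implementation such as scikit-learn's MultinomialNB would be the more usual choice on real data. Returning None when no word is recognised is an added assumption, reflecting the note that some rows cannot be categorized.

```python
import math
import re
from collections import Counter, defaultdict

KNOWN = {"Apple", "Blueberry", "Orange", "Lettuce"}

rows = [
    ("Apple", "Eat one a day ,doctors keep away"),
    ("Blueberry", "Like it in a muffin"),
    ("Orange", "Tastes yummy"),
    ("Lettuce", "Like it with salmon"),
    ("Fruit", "Glass of a juice"),
    ("Salad", "Ceser dressing on lettuce"),
    ("xyz", "Nothing in my basket"),
    ("Fruit", "Like it in a muffin"),
    ("Leaf", "I like it it with salami"),
    ("Avocado", "Comes from Mexico"),
]

def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

def train(labelled):
    """Collect per-label word counts, per-label document counts, vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in labelled:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def predict(text, model):
    """Return the label with the highest log posterior, or None when the
    text shares no word with the training vocabulary."""
    word_counts, doc_counts, vocab = model
    tokens = [t for t in tokenize(text) if t in vocab]
    if not tokens:
        return None
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for t in tokens:
            # Add-one (Laplace) smoothing over the vocabulary.
            score += math.log((word_counts[label][t] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Filter step: known categories train the model, the rest get predicted.
model = train([(c, d) for c, d in rows if c in KNOWN])
predictions = [(c, d, predict(d, model)) for c, d in rows if c not in KNOWN]
```

With only one training sentence per label, the predictions on this toy data are weak; the sketch is meant to show the filter-then-classify flow rather than to give reliable labels.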

Using the Rank Function In Excel

Sorry if this has been answered. I feel it may have been, but I am struggling to find an answer that gets me to the point of success.
I have a basic spreadsheet for time trial results. The spreadsheet is for both men and women. Points are awarded for the quickest times across all competitors at 30-second intervals (Column N), which I have managed.
My question is: on top of this, the top 7 men by rank are awarded additional bonus points, and the top 3 women (only 3 because there are normally fewer women than men attending the events) are also awarded additional bonus points.
I have set up a column to specify M or F (Column C) when a competitor is added, and I am also using RANK
=IF(G7=0,0,RANK(G7,$G$6:$G$36,1)-COUNTIF($G$6:$G$36,0))
on the times - Column K
But I am really struggling with how to use a formula to extract the top 7 men and top 3 women and award the points, i.e. there will be a 1st to 7th place man but also a 1st to 3rd place woman. So in essence, is there any way I can extract the two sets of rankings based on the F and M identifiers in the appropriate column?
At the moment I can only get a basic ranking. Using an IF(AND) statement I can return results to apply the bonus points if the conditions are matched, but this doesn't help with identifying the rankings according to male (1st-7th) or female (1st-3rd).
You can also see in my screenshot that, although I haven't added the formula for assigning the female points, because of the conditions being met I don't have bonus points awarded for 5th place, since I set sex to F. I was hoping someone could also help me with that.
Sorry for waffling, but I have been toiling with this for 3 days now and I am just going in circles.
I'd really appreciate any reply.
Just use COUNTIFS:
=IF(G6=0,0,COUNTIFS(C:C,C6,G:G,"<" & G6,G:G,"<>0")+1)
This ranks each entry only against entries with the same value in column C, thus giving two 1st places, one male and one female.
To rank within the Club (column B) as well, just add another condition:
=IF(G6=0,0,COUNTIFS(C:C,C6,G:G,"<" & G6,G:G,"<>0",B:B,B6)+1)
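For readers who find the COUNTIFS logic easier to follow outside a spreadsheet, here is a small Python sketch of the same idea: a competitor's rank is one plus the number of same-group, non-zero times that are strictly faster. The sample times and the `BONUS_PLACES` mapping (top 7 men, top 3 women, taken from the question) are illustrative; the actual bonus point values are not given in the question, so only eligibility is computed.

```python
# (sex, time) pairs; a time of 0 means "no result", as in the spreadsheet.
results = [("M", 25.5), ("F", 27.0), ("M", 24.0), ("F", 0), ("M", 26.1), ("F", 26.2)]

# Number of bonus places per group, as described in the question.
BONUS_PLACES = {"M": 7, "F": 3}

def rank_within_group(rows):
    """Mirror of =IF(G6=0,0,COUNTIFS(C:C,C6,G:G,"<"&G6,G:G,"<>0")+1)."""
    ranks = []
    for group, time in rows:
        if time == 0:
            ranks.append(0)
            continue
        faster = sum(1 for g, t in rows if g == group and t != 0 and t < time)
        ranks.append(faster + 1)
    return ranks

ranks = rank_within_group(results)
# A competitor qualifies for a bonus when ranked inside their group's limit.
eligible = [rank != 0 and rank <= BONUS_PLACES[group]
            for (group, _), rank in zip(results, ranks)]
```

As with the formula, tied times receive the same rank, and entries with a time of 0 are left unranked.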

Simulation in Excel using probability

I am trying to create a spreadsheet that can find the most likely probability that a student scored a specific grade on a test.
Only one student can score a grade and only one grade can have a student.
I have limited information about each student.
There are 5 students (1,2,3,4,5)
and the grades possible are only (100,90,80,70,60)
In the spreadsheet a 1 denotes that the student DIDN'T score that grade.
Does anyone know how to build a simulation that finds the most likely probability of which student scored which grade?
Link:
https://docs.google.com/spreadsheets/d/1a8uUIRzUKsY3DolTM1A0ISqMd-42WCUCiDsxmUT5TKI/edit?usp=sharing
Based on your response in comments, each student has an equal likelihood of getting each grade. No simulation is necessary.
If you want to simulate it anyway, don't use Excel*. Create a vector of students, and pair it with a shuffled vector of the grades. Lather, rinse, repeat as many times as you want to see that the student-to-grade matching is uniformly distributed.
* - To get an idea of how bad Excel can be for random variate generation, enable the Analysis ToolPak, go to "Data -> Data Analysis" on the ribbon, and select "Random Number Generation". Fill in the dialog to request 10 variables, 2000 random numbers, and a "Normal" distribution, leave the mean and std dev at 0 and 1, and enter a "Random Seed" value of 123. You will find that the resulting table contains 3 instances of the value "-9.35764". Values that extreme should occur about once per twenty thousand years if you generate a billion a second. Getting three of them is so extreme that it should happen once per 10^30 times the current estimated age of the universe. Conclude that a) it's your lucky day, or b) Excel sucks at random numbers, and despite being informed about this as far back as 1998, Microsoft hasn't bothered to fix it.
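The shuffle-and-pair simulation suggested in the answer could look something like this in Python. It assumes, as the answer does, that every student is equally likely to get every grade; the trial count and seed are arbitrary choices for the sketch.

```python
import random
from collections import Counter

students = [1, 2, 3, 4, 5]
grades = [100, 90, 80, 70, 60]

def simulate(trials=10000, seed=42):
    """Pair the students with a freshly shuffled copy of the grades on
    each trial and return the observed frequency of every pairing."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        shuffled = grades[:]
        rng.shuffle(shuffled)
        for student, grade in zip(students, shuffled):
            counts[(student, grade)] += 1
    return {pair: n / trials for pair, n in counts.items()}

probabilities = simulate()
```

Each of the 25 student-grade pairs should come out close to 0.2, which matches the answer's point that no simulation is really needed.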
