How to characterize a distribution of values? - statistics

I'll try to explain with an example.
In a school there are n classes. Each class has k students, with k ranging from 1 to 700; both n and k are known.
I need a way to characterize, for each class, the distribution of the students' names. For example, in class A there are 10 students: 3 are named "John", 3 "Mark" and 3 "Anne". In another class there are 100 students and everyone is named "Anton".
I need a measure that is indicative of the name distribution in each class. For example (the exact scale is not important), it could be 1 if everyone in a class has the same name and 0 if no two students in the class share a name.
In other words, I need a way to sort classes by the distribution of names.

Sounds like you want a "contingency table". It's arbitrary which of your variables you want to have as rows vs. columns, but the table entries are either counts or proportions of how many occurrences fall in the intersection of the categories.
With the example you gave:
                    Class
                 A        B
              _________________
      Anne    |    3   |    0   |   3
Names Anton   |    0   |  100   | 100
      John    |    3   |    0   |   3
      Mark    |    3   |    0   |   3
      Unknown |    1   |    0   |   1
              |--------|--------|----
                  10      100   | 110
Values at the right and along the bottom are called the "marginal totals", or if proportions, "marginal distributions". The bottom right corner is the grand total of your data, obtained by summing the row or column margins. (They better come out the same!) For proportions, the sum must be 1.
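As a rough illustration, the table and its margins could be built in Python along these lines. The `contingency` function and the `classes` dictionary are names made up for this sketch, and the tenth student in class A is labelled "Unknown" as in the table above.

```python
from collections import Counter

# Example data from the question; the tenth student in class A has no stated name.
classes = {
    "A": ["John"] * 3 + ["Mark"] * 3 + ["Anne"] * 3 + ["Unknown"],
    "B": ["Anton"] * 100,
}

def contingency(classes):
    """Return (table, row_totals, col_totals) for a dict of class -> list of names."""
    counts = {c: Counter(names) for c, names in classes.items()}
    all_names = sorted({n for names in classes.values() for n in names})
    # table[name][class] is the number of students with that name in that class
    table = {n: {c: counts[c][n] for c in classes} for n in all_names}
    row_totals = {n: sum(table[n].values()) for n in all_names}
    col_totals = {c: len(names) for c, names in classes.items()}
    return table, row_totals, col_totals

table, row_totals, col_totals = contingency(classes)
grand_total = sum(row_totals.values())
# The row margins and the column margins must add up to the same grand total.
assert grand_total == sum(col_totals.values())
```

Dividing every entry by `grand_total` would give the proportions version of the same table.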

Related

How to reference another cell in a chart based on an aggregating total in another cell

So, the title might be confusing, so I'll outline like this:
I am making a weight-loss chart. One of the clients gets to open a bag of Legos as a reward for every 2 lb that he loses, as long as he does it based on a goal progression. For instance, if he weighs 260 and loses 2 lb, he gets his reward. However, if he gains a pound, he now has to lose 3 lb to get his reward.
Currently, I have charts that look like this:
| Column O | Column P |
| -------- | -------- |
| Current Weight | Amount Lost |
| 263 | 8 |

| Column L | Column M |
| -------- | -------- |
| Next Lego Bag | 261 |
| Lbs until next bag | 2 |
After he hits 261, I want that cell that says 261 in Col M to say "259". So if he weighs in again, I want it to look like this automatically.
| Column O | Column P |
| -------- | -------- |
| Current Weight | Amount Lost |
| 260.5 | 10.5 |

| Column L | Column M |
| -------- | -------- |
| Next Lego Bag | 259 |
| Lbs until next bag | 1.5 |
What is the best way to automatically make that cell in Column M change when he hits the 2 lb goal? I have a table that lists all the goal weights he needs to hit for each reward. It looks like this:
| Column Z | Column AA | Column AB | (formatting is being weird)
| -------- | -------- | -------- |
| Bag | Target Weight | Amount Lost |
| Bag 5 | 261 | 8 |
| Bag 6 | 259 | 10 |
| Bag 7 | 257 | 12 |
| Bag 8 | 255 | 14 |
| Bag 9 | 253 | 16 |
etc
I've tried a few things, but I'm coming up blank: the amount he loses won't always be a whole number, so matching it to the target weight has been tough.
In really, really simple terms, I need it to basically say this:
If current weight > goal 1, then A1 = goal 1. If current weight < Goal 1, then A1 = Goal 2, and all the way to Goal 21. However, A1 can't change to the next goal until current weight is less than that goal.
Thanks all
I have tried IF statements and Floor statements to get an ongoing changing thing, but it's not working.
In M2: =IF(MOD(O2+1,2)=0,2,MOD(O2+1,2))
In M1:
=O2-M2
Or using O365 in M1:
=LET(m,MOD(O2+1,2),
lbs,IF(m=0,2,m),
VSTACK(O2-lbs,lbs))
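For readers who want to check the arithmetic outside Excel, here is a minimal Python sketch of the same MOD logic. The function name `next_bag` and its parameters are made up for illustration, and like the formulas above it assumes the targets are the odd weights spaced 2 lb apart (261, 259, 257, ...).

```python
def next_bag(current_weight, step=2, offset=1):
    """Mirror of the M2/M1 formulas: return (next target, lbs until next bag).

    Assumes targets sit on odd weights 2 lb apart, as in the goal table.
    """
    remainder = (current_weight + offset) % step
    # Exactly on a target: the next bag is a full step away.
    lbs_until_next = step if remainder == 0 else remainder
    return current_weight - lbs_until_next, lbs_until_next

print(next_bag(263))    # weigh-in from the first chart
print(next_bag(260.5))  # weigh-in from the second chart
```

Note that this depends only on the current weight, so it does not remember which bags have already been awarded.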

EXCEL: How to automatically create groups based on sum being less than X and not greater than Y

I have a table in Excel with some information; the main column is Weight (in KG).
I need Excel to group Rows into groups, where each group's sum of Weight (in KG) is less than 24000 kg and greater than 23500 kg.
To do so manually is very time consuming, since there are thousands of rows with different Weight values.
table example:
ID | Weight (KG)
1 | 11360
2 | 22570
3 | 10440
4 | 20850
5 | 9980
6 | 9950
7 | 19930
8 | 9930
9 | 9616
10 | 9580
... and so on
The closest I got to solving the problem is adding 3 new columns: Total, Starts Group and Group Number.
Total function: =IF(SUM(B3+C2)>24000,B3,SUM(B3+C2)) - calculates current sum of Weight values in the current group
Starts group function: =IF(SUM(B3+C2)>24000,B3,SUM(B3+C2)) - checks if current row makes a new group
Group number function: =IF(D3,E2+1,E2) - all rows that contain same number are in the same group
The problem with this is that it only keeps each group under 24000 kg; it doesn't also ensure that each group is greater than 23500 kg.
It doesn't have to be in Excel, any app/script would work too, it just has to get the job done.
Desired output:
ID | Weight (KG) | Group ID
1 | 11360 | 1
2 | 2570 | 2
3 | 10440 | 1
4 | 20850 | 2
5 | 180 | 2
6 | 1950 | 1
So I want to get groups similar to these:
Group number 1 - Total 23750 kg
Group number 2 - Total 23600 kg
Url to my example table with functions I added:
https://1drv.ms/x/s!Au0UogL2uddbgTFJJ4TzSKLhPFPE?e=r02sPX
You may want to try this for total:
=IF(SUM(B3+C2)>24000;B3;IF(SUM(B3+C2)<=23500;SUM(B3+C2);B3))
edit:
I just saw you pasted the proposal into your sample file. You may need to replace the ; with , due to regional format settings.
The limitation remains:
first priority is <24k and second priority is >=23.5k
If the next row’s value makes the “jump” above 24k you may end up remaining below 23.5k and switching to the next group
edit2:
You may want to look up some optimization models and algorithms for your combination problem before trying to implement it in Excel.
Or try with simple rules, e.g. categorizing your rows such as weight over 20k, 16k, 12k,8k, 4k, 2k, 1k, 500, etc. and try to group/combine them accordingly
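Since the question allows a script instead of Excel, here is a minimal sketch of one simple rule-based approach: a greedy heuristic that starts each group with the heaviest remaining row and keeps adding the heaviest rows that still fit under the upper limit. This is only an illustration, not an optimal solver; the function name `greedy_groups` is made up, and some groups may still end up below the lower limit (they are flagged rather than fixed).

```python
def greedy_groups(weights, lower=23500, upper=24000):
    """weights: dict of row ID -> weight in kg.

    Returns a list of (ids, total, in_window) tuples. Greedy heuristic only:
    it does not guarantee that every group lands inside the window.
    """
    remaining = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    groups = []
    while remaining:
        ids, total, leftover = [], 0, []
        for row_id, w in remaining:
            # Always place the first row so an oversized row cannot stall the loop.
            if not ids or total + w < upper:
                ids.append(row_id)
                total += w
            else:
                leftover.append((row_id, w))
        groups.append((ids, total, lower < total < upper))
        remaining = leftover
    return groups

# Rows taken from the "desired output" example in the question.
rows = {1: 11360, 2: 2570, 3: 10440, 4: 20850, 5: 180, 6: 1950}
for ids, total, ok in greedy_groups(rows):
    print(ids, total, "in window" if ok else "outside window")
```

A single pass per group is enough here because rows are visited from heaviest to lightest, so a row that did not fit earlier cannot fit later in the same group.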

Excel: Sum cells if they share an identical unknown string

I have 154,901 rows of data that look like this:
Text String | 340
Where "Text String" represents a variable string that has no other pattern or order to it and cannot be predicted in any mathematical way, and 340 represents a random integer. How can I find the sum of all of the values sharing an identical string, and organize this data based on total per unique string?
For example, say I have the dataset
Alpha | 3
Alpha | 6
Beta | 4
Gamma | 1
Gamma | 3
Gamma | 8
Omega | 10
I'm looking for some way to present the data as:
Alpha | 9
Beta | 4
Gamma | 12
Omega | 10
The point of this being that I have a dataset so large that I cannot enumerate this manually, and I have a finite yet unknown amount of strings that I cannot reliably predict what they are.
Consider using a pivot table, and then aggregate the numbers by string. This is probably the least ugly option. – Tim Biegeleisen
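If the data can be exported from Excel, the same aggregation a pivot table performs can be sketched in a few lines of Python. The function name `sum_by_label` is made up for this example, and the rows are the sample data from the question.

```python
from collections import defaultdict

def sum_by_label(rows):
    """Sum the values of all rows that share an identical string."""
    totals = defaultdict(int)
    for label, value in rows:
        totals[label] += value
    return dict(totals)

rows = [
    ("Alpha", 3), ("Alpha", 6), ("Beta", 4),
    ("Gamma", 1), ("Gamma", 3), ("Gamma", 8), ("Omega", 10),
]
totals = sum_by_label(rows)
# Print the unique strings with their totals, largest total first.
for label, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(label, "|", total)
```

The strings do not need to be known in advance; each new one simply becomes a new key.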

Display Value in cell Y based on greater than, less than of cell X

Here's the scenario. I have a large spreadsheet of candidates for NHS at my school who are each given a score by several teachers, community members, etc. I average their scores, and based on that number they are given a score/value from a rubric. I am looking for a formula that will read the value of cell X (their average score) and display a specific value in cell Y (their rubric score). The criteria are as follows:
value<2.0, display 0
value>2.0 value<3.0, display 1
value>3.0 value<3.5, display 2
value>3.5 value<3.75, display 3
value>3.75, display 4
I tried looking this up and the closest I found was a formula that I modified to look like this:
=IF(I10="AVERAGE_CHARACTER",IF(I10<2,0,IF(AND(I10>2,I11<3),1,IF(AND(I10>3,I11<3.5),2,IF(AND(I10>3.5,I11<3,75),3,IF(I11>3.75,4,0))))))
All it says is FALSE in the cell. Not sure if I'm using the wrong formula or have a typo in the formula. Thoughts? If there is an alternate or easier method, I'm open for suggestions.
Thanks!
source: http://www.excelforum.com/excel-formulas-and-functions/575953-greater-than-x-but-less-than-y.html
It's easy if you keep the thresholds and the rubric in separate arrays:
=LOOKUP(A1,{0,2,3,3.5,3.75},{0,1,2,3,4})
You might use something like: (value to be changed in A1)
=VLOOKUP(A1,{0,0;2,1;3,2;3.5,3;3.75,4},2)
or having a table like this: (value to be changed in C1)
| A | B |
1 | 0 | 0 |
2 | 2 | 1 |
3 | 3 | 2 |
4 | 3.5 | 3 |
5 | 3.75 | 4 |
=VLOOKUP(C1,A1:B5,2)
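The approximate-match lookup these formulas rely on can be sketched in Python with the standard `bisect` module. The names `THRESHOLDS`, `SCORES` and `rubric_score` are made up for illustration. As with LOOKUP, a value that sits exactly on a threshold (for example 2.0) gets the higher score, a case the original criteria leave undefined.

```python
from bisect import bisect_right

THRESHOLDS = [0, 2, 3, 3.5, 3.75]  # lower bound of each rubric band
SCORES = [0, 1, 2, 3, 4]           # rubric score for each band

def rubric_score(average):
    """Return the score of the last threshold that is <= average."""
    if average < THRESHOLDS[0]:
        # LOOKUP would return #N/A here; raise instead of wrapping around.
        raise ValueError("average is below the lowest threshold")
    return SCORES[bisect_right(THRESHOLDS, average) - 1]

for value in (1.9, 2.0, 3.2, 3.6, 3.8):
    print(value, "->", rubric_score(value))
```

Keeping the thresholds and scores in two parallel lists matches the two arrays passed to LOOKUP.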

What are co-occurrence matrixes and how are they used in NLP?

The pypi docs for a google ngram downloader say that "sometimes you need an aggregate data over the dataset. For example to build a co-occurrence matrix."
The Wikipedia article for co-occurrence matrix has to do with image processing, and googling the term seems to bring up some sort of SEO trick.
So what are co-occurrence matrixes (in computational linguistics/NLP)? How are they used in NLP?
What is a co-occurrence matrix?
Generally speaking, a co-occurrence matrix will have specific entities in rows (ER) and columns (EC). The purpose of this matrix is to present the number of times each ER appears in the same context as each EC.
As a consequence, in order to use a co-occurrence matrix, you have to define your entities and the context in which they co-occur.
In NLP, the most classic approach is to define each entity (i.e., rows and columns) as a word present in a text, and the context as a sentence.
Consider the following text:
Roses are red. Sky is blue.
With the classic approach described before, we'll have the following matrix:
      | Roses | are | red | Sky | is | blue
Roses |   1   |  1  |  1  |  0  | 0  |  0
are   |   1   |  1  |  1  |  0  | 0  |  0
red   |   1   |  1  |  1  |  0  | 0  |  0
Sky   |   0   |  0  |  0  |  1  | 1  |  1
is    |   0   |  0  |  0  |  1  | 1  |  1
blue  |   0   |  0  |  0  |  1  | 1  |  1
Here, each cell indicates whether the two items co-occur or not. You may replace it with the number of times they co-occur, or with a more sophisticated weighting. You may also change the entities themselves, for example by putting nouns in columns and adjectives in rows instead of every word.
What are they used for in NLP?
The most evident use of these matrices is their ability to provide links between notions. Let's suppose you're working on product reviews. Let's also suppose for simplicity that each review is only composed of short sentences. You'll have something like this:
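A minimal Python sketch of this sentence-level approach follows. It assumes naive splitting on periods and whitespace, and the function name `cooccurrence` is made up for illustration. Each cell counts the sentences in which both words appear, so a word also co-occurs with itself, as in the matrix above.

```python
from itertools import product

def cooccurrence(text):
    """Return (vocab, counts) where counts[row][col] is the number of
    sentences containing both words. Tokenization is deliberately naive."""
    sentences = [s.split() for s in text.split(".") if s.strip()]
    vocab = []
    for words in sentences:
        for w in words:
            if w not in vocab:
                vocab.append(w)  # keep first-seen order for readable output
    counts = {row: {col: 0 for col in vocab} for row in vocab}
    for words in sentences:
        # Every pair of distinct word types in a sentence co-occurs once.
        for row, col in product(set(words), repeat=2):
            counts[row][col] += 1
    return vocab, counts

vocab, counts = cooccurrence("Roses are red. Sky is blue.")
for row in vocab:
    print(row, [counts[row][col] for col in vocab])
```

Swapping the sentence split for a sliding window of neighbouring words would give a window-based variant of the same matrix.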
ProductX is amazing.
I hate productY.
Representing these reviews as one co-occurrence matrix will enable you to associate products with appreciations.
The co-occurrence matrix indicates how many times the row word (e.g. 'digital') is surrounded (in a sentence, or in the ±4 word window - depends on the application) by the column word (e.g. 'pie').
The entry '5' in the following table, for example, means that we had 5 sentences in our text where 'digital' was surrounded by 'pie'.
These sentences could have been:
I love a digital pie.
What's digital is often a pie.
May I have some digital pie?
Digital world necessitates pie-eating.
There's something digital about this pie.
Note that the co-occurrence matrix is symmetric whenever the rows and columns list the same words and the context (sentence or window) is symmetric: the entry with the row word 'pie' and the column word 'digital' will be 5 as well, since these words co-occur in the very same sentences.

Resources