I have 154,901 rows of data that look like this:
Text String | 340
Where "Text String" represents a variable string that has no other pattern or order to it and cannot be predicted in any mathematical way, and 340 represents a random integer. How can I find the sum of all of the values sharing an identical string, and organize this data based on total per unique string?
For example, say I have the dataset
Alpha | 3
Alpha | 6
Beta | 4
Gamma | 1
Gamma | 3
Gamma | 8
Omega | 10
I'm looking for some way to present the data as:
Alpha | 9
Beta | 4
Gamma | 12
Omega | 10
The point is that the dataset is too large to enumerate manually, and it contains a finite but unknown number of distinct strings that I cannot predict in advance.
Consider using a pivot table, and then aggregate the numbers by string. This is probably the least ugly option. – Tim Biegeleisen
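If a script is an option instead of a pivot table, the same aggregation can be sketched in Python. This is only a minimal illustration, assuming the rows have already been read into (string, integer) pairs; the function name `sum_by_label` is made up for this example.

```python
from collections import defaultdict

def sum_by_label(rows):
    """Sum the integer values for each distinct string label."""
    totals = defaultdict(int)
    for label, value in rows:
        totals[label] += value
    return dict(totals)

# The sample dataset from the question.
rows = [
    ("Alpha", 3), ("Alpha", 6),
    ("Beta", 4),
    ("Gamma", 1), ("Gamma", 3), ("Gamma", 8),
    ("Omega", 10),
]

totals = sum_by_label(rows)

# Print one line per unique string, sorted by label.
for label in sorted(totals):
    print(f"{label} | {totals[label]}")
```

No list of the strings is needed up front: each new string simply becomes a new key the first time it is seen.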
I have a table in Excel with some information; the main column is Weight (in KG).
I need Excel to group rows into groups, where each group's sum of Weight (in KG) is less than 24000 kg and greater than 23500 kg.
Doing this manually is very time consuming, since there are thousands of rows with different Weight values.
table example:
ID | Weight (KG)
1 | 11360
2 | 22570
3 | 10440
4 | 20850
5 | 9980
6 | 9950
7 | 19930
8 | 9930
9 | 9616
10 | 9580
... and so on
The closest I got to solving the problem is adding 3 new columns: Total, Starts Group and Group Number.
Total function: =IF(SUM(B3+C2)>24000,B3,SUM(B3+C2)) - calculates current sum of Weight values in the current group
Starts group function: =IF(SUM(B3+C2)>24000,B3,SUM(B3+C2)) - checks if current row makes a new group
Group number function: =IF(D3,E2+1,E2) - all rows that contain same number are in the same group
The problem with this is that it only creates groups that total less than 24000 kg; it doesn't also make sure each group totals more than 23500 kg.
It doesn't have to be in Excel, any app/script would work too, it just has to get the job done.
Desired output:
ID | Weight (KG) | Group ID
1 | 11360 | 1
2 | 2570 | 2
3 | 10440 | 1
4 | 20850 | 2
5 | 180 | 2
6 | 1950 | 1
So I want to get groups similar to these:
Group number 1 - Total 23750 kg
Group number 2 - Total 23600 kg
Url to my example table with functions I added:
https://1drv.ms/x/s!Au0UogL2uddbgTFJJ4TzSKLhPFPE?e=r02sPX
You may want to try this for total:
=IF(SUM(B3+C2)>24000;B3;IF(SUM(B3+C2)<=23500;SUM(B3+C2);B3))
edit:
I just saw you pasted the proposal into your sample file. You may need to replace the ; with , due to regional format settings.
The limitation remains:
first priority is <24k and second priority is >=23.5k
If the next row’s value makes the “jump” above 24k you may end up remaining below 23.5k and switching to the next group
edit2:
You may want to look up some optimization models and algorithms for your combination problem before trying to implement it in Excel.
Or try simple rules, e.g. categorizing your rows by weight (over 20k, 16k, 12k, 8k, 4k, 2k, 1k, 500, etc.) and then grouping/combining rows from those categories accordingly.
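Since any app or script is acceptable, here is a rough Python sketch of one simple rule: take the heaviest remaining row first, then keep adding the largest rows that still fit under the upper limit. This is a greedy heuristic, not an optimization model, so it may leave rows ungrouped even when a valid grouping exists. The function name `greedy_groups` and the dictionary input format are assumptions made for this example.

```python
def greedy_groups(weights, low=23500, high=24000):
    """Greedy sketch: returns (groups, leftovers).

    weights maps row ID -> weight. Each group is (list of IDs, total),
    with low < total < high. IDs that could not be placed are leftovers.
    """
    # Heaviest rows first, so large items get placed before small fillers.
    remaining = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    groups, leftovers = [], []
    while remaining:
        group, total, rest = [], 0, []
        for row_id, w in remaining:
            if total + w < high:
                group.append(row_id)
                total += w
            else:
                rest.append((row_id, w))
        if not group:
            # Every remaining row alone is already at or above the limit.
            leftovers.extend(row_id for row_id, _ in rest)
            break
        if total > low:
            groups.append((group, total))
        else:
            leftovers.extend(group)
        remaining = rest
    return groups, leftovers

# The six rows from the "Desired output" table in the question.
weights = {1: 11360, 2: 2570, 3: 10440, 4: 20850, 5: 180, 6: 1950}
groups, leftovers = greedy_groups(weights)
for ids, total in groups:
    print(ids, total)
```

On this small sample the sketch reproduces the two groups from the question (totals 23600 and 23750), but on thousands of rows the leftovers list is worth checking.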
I'll try to explain it with an example.
In a school there are n classes. In each class there are k students, with k from 1 to 700; both n and k are known.
I need a way to characterize, for each class, the distribution of the students' names. For example, in class A there are 10 students: 3 are named "John", 3 "Mark" and 3 "Anne". In another class there are 100 students and everyone is named "Anton".
I need a measure that is indicative of the distribution of names in each class. For example (the exact scale is not important), it could be 1 if everyone in a class has the same name and 0 if no two students in the class share a name.
In other words, I need a way to sort classes by the distribution of names.
Sounds like you want a "contingency table". It's arbitrary which of your variables you want to have as rows vs. columns, but the table entries are either counts or proportions of how many occurrences fall in the intersection of the categories.
With the example you gave:
Names (rows) by Class (columns):
Name    | A   | B   | Total
Anne    | 3   | 0   | 3
Anton   | 0   | 100 | 100
John    | 3   | 0   | 3
Mark    | 3   | 0   | 3
Unknown | 1   | 0   | 1
Total   | 10  | 100 | 110
Values at the right and along the bottom are called the "marginal totals", or if proportions, "marginal distributions". The bottom right corner is the grand total of your data, obtained by summing the row or column margins. (They better come out the same!) For proportions, the sum must be 1.
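As a rough illustration, the counts and marginal totals for this example could be computed in Python as follows. This is a sketch under the assumption that the data is available as (name, class) pairs; the function name `contingency` is made up here.

```python
from collections import Counter

def contingency(records):
    """Count (name, class) pairs and compute both sets of marginal totals."""
    cells = Counter(records)                          # table entries
    row_totals = Counter(name for name, _ in records) # per-name margin
    col_totals = Counter(cls for _, cls in records)   # per-class margin
    return cells, row_totals, col_totals

# Class A: 3 John, 3 Mark, 3 Anne, 1 unknown; class B: 100 Anton.
records = (
    [("John", "A")] * 3 + [("Mark", "A")] * 3 + [("Anne", "A")] * 3
    + [("Unknown", "A")] + [("Anton", "B")] * 100
)

cells, row_totals, col_totals = contingency(records)
grand_total = sum(row_totals.values())

# The row margins and the column margins must sum to the same grand total.
assert grand_total == sum(col_totals.values())
print(cells[("Anton", "B")], col_totals["A"], grand_total)
```

Dividing each count by the grand total (or by its class total) turns the same table into proportions.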
Here's the scenario. I have a large spreadsheet of candidates for NHS at my school that are given a score by several teachers, community members, etc. I average out their score and then based on that number they are given a score/value from a rubric. I am looking for a formula that will read the value of cell X (their average score) and display a specific value in cell Y(their rubric score). The following is the criteria:
value<2.0, display 0
value>2.0 value<3.0, display 1
value>3.0 value<3.5, display 2
value>3.5 value<3.75, display 3
value>3.75, display 4
I tried looking this up and the closest I found was a formula that I modified to look like this:
=IF(I10="AVERAGE_CHARACTER",IF(I10<2,0,IF(AND(I10>2,I11<3),1,IF(AND(I10>3,I11<3.5),2,IF(AND(I10>3.5,I11<3,75),3,IF(I11>3.75,4,0))))))
All it says is FALSE in the cell. Not sure if I'm using the wrong formula or have a typo in the formula. Thoughts? If there is an alternate or easier method, I'm open for suggestions.
Thanks!
source: http://www.excelforum.com/excel-formulas-and-functions/575953-greater-than-x-but-less-than-y.html
It's easy if you keep the thresholds and the rubric in separate arrays:
=LOOKUP(A1,{0,2,3,3.5,3.75},{0,1,2,3,4})
You might use something like: (value to be changed in A1)
=VLOOKUP(A1,{0,0;2,1;3,2;3.5,3;3.75,4},2)
or having a table like this: (value to be changed in C1)
| A | B |
1 | 0 | 0 |
2 | 2 | 1 |
3 | 3 | 2 |
4 | 3.5 | 3 |
5 | 3.75 | 4 |
=VLOOKUP(C1,A1:B5,2)
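For readers who want the same threshold lookup outside Excel, here is a minimal Python sketch of the idea: find the largest threshold that is less than or equal to the value and return the rubric score in the same position. The names `THRESHOLDS`, `RUBRIC` and `rubric_score` are illustrative, and raising an error below the first threshold is a choice made for this sketch.

```python
from bisect import bisect_right

# Same two arrays as in the LOOKUP formula above.
THRESHOLDS = [0, 2, 3, 3.5, 3.75]
RUBRIC = [0, 1, 2, 3, 4]

def rubric_score(average):
    """Return the rubric score for the largest threshold <= average."""
    idx = bisect_right(THRESHOLDS, average) - 1
    if idx < 0:
        # The value is below the smallest threshold, so nothing matches.
        raise ValueError(f"{average} is below the smallest threshold")
    return RUBRIC[idx]

for value in (1.9, 2.0, 3.6, 3.75):
    print(value, rubric_score(value))
```

Note that a value sitting exactly on a threshold (2.0, 3.0, ...) gets the higher score, which is also how an approximate-match lookup treats it.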
The PyPI docs for a Google ngram downloader say that "sometimes you need an aggregate data over the dataset. For example to build a co-occurrence matrix."
The Wikipedia article on co-occurrence matrices has to do with image processing, and googling the term seems to bring up some sort of SEO trick.
So what are co-occurrence matrices (in computational linguistics/NLP)? How are they used in NLP?
What is a co-occurrence matrix?
Generally speaking, a co-occurrence matrix will have specific entities in rows (ER) and columns (EC). The purpose of this matrix is to present the number of times each ER appears in the same context as each EC.
As a consequence, in order to use a co-occurrence matrix, you have to define your entities and the context in which they co-occur.
In NLP, the most classic approach is to define each entity (i.e., rows and columns) as a word present in a text, and the context as a sentence.
Consider the following text:
Roses are red. Sky is blue.
With the classic approach described before, we'll have the following matrix:
      | Roses | are | red | Sky | is | blue
Roses | 1     | 1   | 1   | 0   | 0  | 0
are   | 1     | 1   | 1   | 0   | 0  | 0
red   | 1     | 1   | 1   | 0   | 0  | 0
Sky   | 0     | 0   | 0   | 1   | 1  | 1
is    | 0     | 0   | 0   | 1   | 1  | 1
blue  | 0     | 0   | 0   | 1   | 1  | 1
Here, each cell indicates whether the two items co-occur or not. You may replace it with the number of times they co-occur, or with a more sophisticated measure. You may also change the entities themselves, for example by putting nouns in columns and adjectives in rows instead of every word.
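A minimal Python sketch of this sentence-level, binary version might look as follows. It assumes sentences end with a period and words are separated by whitespace, which is far cruder than a real tokenizer; the function name `cooccurrence` is made up for the example.

```python
from itertools import product

def cooccurrence(text):
    """Binary sentence-level co-occurrence: 1 if two words share a sentence."""
    # Naive splitting: sentences on ".", words on whitespace.
    sentences = [s.split() for s in text.split(".") if s.strip()]

    # Vocabulary in order of first appearance.
    vocab = []
    for words in sentences:
        for word in words:
            if word not in vocab:
                vocab.append(word)

    matrix = {a: {b: 0 for b in vocab} for a in vocab}
    for words in sentences:
        # Every pair of words in the same sentence co-occurs.
        for a, b in product(words, repeat=2):
            matrix[a][b] = 1
    return vocab, matrix

vocab, matrix = cooccurrence("Roses are red. Sky is blue.")
for row in vocab:
    print(row, [matrix[row][col] for col in vocab])
```

Replacing `matrix[a][b] = 1` with `matrix[a][b] += 1` would give counts instead of a yes/no indicator.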
What are they used for in NLP?
The most evident use of these matrices is their ability to provide links between notions. Let's suppose you're working on product reviews. Let's also suppose for simplicity that each review is only composed of short sentences. You'll have something like this:
ProductX is amazing.
I hate productY.
Representing these reviews as one co-occurrence matrix will enable you to associate products with appreciations.
The co-occurrence matrix indicates how many times the row word (e.g. 'digital') is surrounded (in a sentence, or in the ±4 word window - depends on the application) by the column word (e.g. 'pie').
An entry of '5' in such a matrix, for example, means that we had 5 sentences in our text where 'digital' was surrounded by 'pie'.
These sentences could have been:
I love a digital pie.
What's digital is often a pie.
May I have some digital pie?
Digital world necessitates pie-eating.
There's something digital about this pie.
Note that the co-occurrence matrix is always symmetric - the entry with the row word 'pie' and the column word 'digital' will be 5 as well (as these words co-occur in the very same sentences!).
I want to display (list) the value of a string variable DE15_WHY in Stata only when it is not missing (e.g. some subjects did not provide comments). I thought this would be easy:
list DE15_WHY if DE15_WHY != ""
This displays DE15_WHY for all subjects even if they do not have anything in DE15_WHY...
Is the string formatted wrongly? For example, does Stata think that all subjects have a valid observation for DE15_WHY? How do I fix this? I checked, and it is formatted as a string variable.
Stata also allows me to tabulate DE15_WHY, similar to R. This is a great option but does not display the entire contents of the string variable in the table. How do I get Stata to display the entire string?
@Metrics' answer has several good details, but I will add more here.
With string variables, Stata has only one definition of missing, namely that a string is empty, and contains precisely no characters.
One or more spaces, despite usually conveying nothing to people, do not qualify as missing so far as Stata is concerned.
The term "blank" is perhaps unclear here and thus better avoided.
If spaces somehow get into your string variables a condition such as
if trim(mystring) == ""
selects values that are empty or that have spaces and correspondingly a condition such as
if trim(mystring) != ""
selects values with other content. To replace spaces with empty strings, we thus go
replace mystring = "" if trim(mystring) == ""
In general, if you have rather long strings, Stata necessarily has a problem of where to display them. One tip is that list will show more than tabulate. If you want a tabulate and list hybrid, check out groups from SSC, using ssc inst groups.
Although the period . is the default or system missing value for numeric variables (or numeric scalars or matrix elements) in Stata, it does not attach any special meaning to the string ".".
sysuse auto
list rep78 in 1/10 if rep78 != . // non-missing values only
tab rep78 // by default, only non-missing values are reported
tab rep78, missing // include missing values as well
If variable is a string with missing indicated by .
list yourvariable if yourvariable !="."
If variable is a string with missing indicated by blank
list yourvariable if yourvariable !=""
Example:
my my1
ab 1
cd 2
3
ef 4
list my if my !=""
+----+
| my |
|----|
1. | ab |
2. | cd |
4. | ef |
+----+
By default, tab leaves out missing values: the empty string for a string variable, and . for a numeric variable.
tab my
my | Freq. Percent Cum.
------------+-----------------------------------
ab | 1 33.33 33.33
cd | 1 33.33 66.67
ef | 1 33.33 100.00
------------+-----------------------------------
Total | 3 100.00
tab my,missing
my | Freq. Percent Cum.
------------+-----------------------------------
| 1 25.00 25.00
ab | 1 25.00 50.00
cd | 1 25.00 75.00
ef | 1 25.00 100.00
------------+-----------------------------------
Total | 4 100.00