What are co-occurence matrixes and how are they used in NLP? - nlp

The pypi docs for a google ngram downloader say that "sometimes you need an aggregate data over the dataset. For example to build a co-occurrence matrix."
The wikipedia for co-occurence matrix has to do with image processing and googling the term seems to bring up some sort of SEO trick.
So what are co-occurrence matrixes (in computational linguistics/NLP)? How are they used in NLP?

What is a co-occurrence matrix ?
Generally speaking, a co-occurrence matrix will have specific entities in rows (ER) and columns (EC). The purpose of this matrix is to present the number of times each ER appears in the same context as each EC.
As a consequence, in order to use a co-occurrence matrix, you have to define your entites and the context in which they co-occur.
In NLP, the most classic approach is to define each entity (ie, lines and columns) as a word present in a text, and the context as a sentence.
Consider the following text :
Roses are red. Sky is blue.
With the classic approach described before, we'll have the following matrix :
| Roses | are | red | Sky | is | blue
Roses | 1 | 1 | 1 | 0 | 0 | 0
are | 1 | 1 | 1 | 0 | 0 | 0
red | 1 | 1 | 1 | 0 | 0 | 0
Sky | 0 | 0 | 0 | 1 | 1 | 1
is | 0 | 0 | 0 | 1 | 1 | 1
Blue | 0 | 0 | 0 | 1 | 1 | 1
Here, each cell indicates wether the two items co-occur or not. You may replace it with the number of times it appears, or with a more sophisticated approach. You may also change the entities themselves, by putting nouns in columns and adjective in lines instead of every word.
What are they used for in NLP ?
The most evident use of these matrix is their ability to provide links between notions. Let's suppose you're working on products reviews. Let's also suppose for simplicity that each review is only composed of short sentences. You'll have something like that :
ProductX is amazing.
I hate productY.
Representing these reviews as one co-occurrence matrix will enable you associate products with appreciations.

The co-occurrence matrix indicates how many times the row word (e.g. 'digital') is surrounded (in a sentence, or in the ±4 word window - depends on the application) by the column word (e.g. 'pie').
The entry '5' in the following table, for example, means that we had 5 sentences in our text where 'digital' was surrounded by 'pie'.
These sentences could have been:
I love a digital pie.
What's digital is often a pie.
May I have some digital pie?
Digital world necessitates pie-eating.
There's something digital about this pie.
Note that the co-occurrence matrix is always symmetric - the entry with the row word 'pie' and the column word 'digital' will be 5 as well (as these words co-occur in the very same sentences!).

Related

How to reference another cell in a chart based on an aggregating total in another cell

So, the title might be confusing, so I'll outline like this:
I am making a weightloss chart. One of the clients gets to open a bag of legos as a reward for every 2lbs that he loses, as long as he does it based on a goal progression. For instance, if he weights 260, and loses 2lb, he gets his reward. However, if he gains a lb, now he has to lose 3lb to get his reward.
Currently, I have charts that look like this:
Column O
Column P
Current Weight
Amount Lost
263
8
Column L
Column M
Next Lego Bag
261
Lbs until next bag
2
After he hits 261, I want that cell that says 261 in Col M to say "259". So if he weighs in again, I want it to look like this automatically.
Column O
Column O
Current Weight
Amount Lost
260.5
10.5
Column L
Column M
Next Lego Bag
259
Lbs until next bag
1.5
What is the best way to automatically make that cell in Column M change when he hits the 2lb goal? I have a table that basically states all the goal weighs he needs to hit for each reward. It looks like this:
| Column Z | Column AA | Column AB | (formatting is being weird)
| -------- | -------- | -------- |
| Bag | Target Weight | Amount Lost |
| Bag 5 | 261 | 8 |
| Bag 6 | 259 | 10 |
| Bag 7 | 257 | 12 |
| Bag 8 | 255 | 14 |
| Bag 9 | 253 | 16 |
etc
I've tried a few things, but I'm coming up blank, because it won't always be in whole numbers the amount he loses, so matching it to the target weight has been tough.
In really, really simple terms, I need it to basically say this:
If current weight > goal 1, then A1 = goal 1. If current weight < Goal 1, then A1 = Goal 2, and all the way to Goal 21. However, A1 can't change to the next goal until current weight is less than that goal.
Thanks all
I have tried IF statements and Floor statements to get an ongoing changing thing, but it's not working.
In M2: =IF(MOD(O2+1,2)=0,2,MOD(O2+1,2))
In M1:
=O2-M2
Or using O365 in M1:
=LET(m,MOD(O2+1,2),
lbs,IF(m=0,2,m),
VSTACK(O2-lbs,lbs))

Excel: Sum cells if they share an identical unknown string

I have 154,901 rows of data that look like this:
Text String | 340
Where "Text String" represents a variable string that has no other pattern or order to it and cannot be predicted in any mathematical way, and 340 represents a random integer. How can I find the sum of all of the values sharing an identical string, and organize this data based on total per unique string?
For example, say I have the dataset
Alpha | 3
Alpha | 6
Beta | 4
Gamma | 1
Gamma | 3
Gamma | 8
Omega | 10
I'm looking for some way to present the data as:
Alpha | 9
Beta | 4
Gamma | 12
Omega | 10
The point of this being that I have a dataset so large that I cannot enumerate this manually, and I have a finite yet unknown amount of strings that I cannot reliably predict what they are.
Consider using a pivot table, and then aggregate the numbers by string. This is probably the least ugly option. – Tim Biegeleisen

Spark: count events based on two columns

I have a table with events which are grouped by a uid. All rows have the columns uid, visit_num and event_num.
visit_num is an arbitrary counter that occasionally increases. event_num is the counter of interactions within the visit.
I want to merge these two counters into a single interaction counter that keeps increasing by 1 for each event and continues to increase when then next visit has started.
As I only look at the relative distance between events, it's fine if I don't start the counter at 1.
|uid |visit_num|event_num|interaction_num|
| 1 | 1 | 1 | 1 |
| 1 | 1 | 2 | 2 |
| 1 | 2 | 1 | 3 |
| 1 | 2 | 2 | 4 |
| 2 | 1 | 1 | 500 |
| 2 | 2 | 1 | 501 |
| 2 | 2 | 2 | 502 |
I can achieve this by repartitioning the data and using the monotonically_increasing_id like this:
df.repartition("uid")\
.sort("visit_num", "event_num")\
.withColumn("iid", fn.monotonically_increasing_id())
However the documentation states:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As the id seems to be monotonically increasing by partition this seems fine. However:
I am close to reaching the 1 billion partition/uid threshold.
I don't want to rely on the current implementation not changing.
Is there a way I can start each uid with 1 as the first interaction num?
Edit
After testing this some more, I notice that some of the users don't seem to have consecutive iid values using the approach described above.
Edit 2: Windowing
Unfortunately there are some (rare) cases where more thanone row has the samevisit_numandevent_num`. I've tried using the windowing function as below, but due to this assigning the same rank to two identical columns, this is not really an option.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))
The best solution is the Windowing function with rank, as suggested by Jacek Laskowski.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))
In my specific case some more data cleaning was required but generally, this should work.

How to characterize a distribution of values?

I try to explain it with an example.
In a school there are n classes. In each classe there are k students, with k from 1 to 700, both n and k are known.
I need a way to characterize, for each class, the distribution of the names of students. For example, in class A there are 10 students, 3 are named "John", 3 "Mark" and 3 "Anne". In another class there are 100 student and everyone is named "Anton".
I need a measure able to be indicative of names distribution in each class. For example, (it's not important), it may be 1 if everyone in a class has the same name and 0 if there aren't 2 identical names in the same class.
In other words a way to sort classes by the distribution of names.
Sounds like you want a "contingency table". It's arbitrary which of your variables you want to have as rows vs. columns, but the table entries are either counts or proportions of how many occurrences fall in the intersection of the categories.
With the example you gave:
Class
A B
_________________
Anne | 3 | 0 | 3
Names Anton | 0 | 100 | 100
John | 3 | 0 | 3
Mark | 3 | 0 | 3
Unknown | 1 | 0 | 1
|--------|--------|----
10 100 | 110
Values at the right and along the bottom are called the "marginal totals", or if proportions, "marginal distributions". The bottom right corner is the grand total of your data, obtained by summing the row or column margins. (They better come out the same!) For proportions, the sum must be 1.

How to dynamically create a cumulative overall total based on a non-cumulative categorical column in excel

Slightly wordy title but here goes
I have a grid in excel which includes 3 columns (media spend, marginal revenue returns & media channel invested in) and I want to create the column below called desired cumulative spend
The reason the grid is structured in this way it does is that it represents an optimised spend laydown ordered by how much of each media channel's budget should be invested in until the marginal returns diminish such that it should be substituted for another media channel.
It is possible that this substitution can then be reversed back to the original channel if the new channel has a sharply diminishing curve, such that all marginal benefit associated to the new channel diminishes and the total spend level still means it is mathematically sensible to switch back to the original curve (maybe it has a lower base level but reduces less sharply). It is also possible that at the point in which the marginal benefit associated to the new channel diminishes, the best next step is to invest in a third channel.
The desired new spend column has two elements to it
it is a simple accumulation of spend from row to row when the
media channel is constant from row to row
it is a slightly more tricky accumulation of spend when the media
channel changes - then it needs to be able to reference back to the
last spend level associated to the channel which has been
substituted in. For row 4, the logic I am struggling with would need
to the running total from row 3 plus the new spend level associated
to row 4 minus the spend level the last time this channel was used
(row 2)
|spend | mar return | media | desired cumulative spend |
|------ |----------- |-------| ----------------------------------------- |
1 | £580 | 128 | chan1 | 580 |
2 | £620 | 121 | chan1 | 580+(620-580) |
3 | £900 | 115.8 | chan2 | 580+(620-580)+900 |
4 | £660 | 115.1 | chan1 | 580+(620-580)+900+(660-620) |
5 | £920 | 114 | chan2 | 580+(620-580)+900+(660-620)+(920-900) |
6 | £940 | 112 | chan2 | 580+(620-580)+900+(660-620)+(920-900)+(940-920) |
If my comment is the correct sugestion, then something like this should do it (£580 is at A2, so the first output is D2):
D2 =A2
D3 =D2+A3-IF(COUNTIF($C$2:C2,C3),INDEX(A:A,MAX(IF($C$2:C2=C3,ROW($A$2:A2)))))
D3 contains an array formula and must be confirmed with ctrl+shift+enter.
Now you can simply copy down from D3.

Resources