Split a sentence into words with the most weight - nlp

I'm working on a game where I need to find the biggest weight for a specific sentence.
Suppose I have the sentence "the quick brown fox" and assume only single words with their defined weight: "the" -> 10, "quick" -> 5, "brown" -> 3, "fox" -> 8
In this case the problem is trivial, as the solution consists of adding each word's weight.
Now assume we also add double words, so besides the above words, we also have "the quick" -> 5, "quick brown" -> 10, "brown fox" -> 1
I'd like to know which combination of single and double words provides the biggest weight; in this case it would be "the", "quick brown", "fox".
My question is: besides the obvious brute-force approach, is there any other possible way to obtain a solution? Needless to say, I'm looking for some optimal way to achieve this for larger sentences.
Thank you.

You can look at Integer Linear Program libraries like lp_solve. In this case, you will want to maximize the scores, and your objective function will contain the weights. Then you can subject it to constraints, like you cannot have "quick brown" and "brown" at the same time.
For word alignment, this was used in this paper. Your problem is much simpler than that, but you can browse through the paper to get an idea of how ILP was used. There are probably other algorithms besides ILP that can solve this optimally, but ILP can solve it both optimally and efficiently for small problems.
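To make the formulation concrete, here is a rough sketch of such an ILP using the Python PuLP library instead of lp_solve (PuLP is just one convenient solver interface; the variable names and the small example data are illustrative, not part of the original question). One binary variable per candidate phrase, and the constraint that every word position is covered exactly once is what rules out picking "quick brown" together with "brown":

import pulp

sentence = "the quick brown fox".split()
weights = {
    ("the",): 10, ("quick",): 5, ("brown",): 3, ("fox",): 8,
    ("the", "quick"): 5, ("quick", "brown"): 10, ("brown", "fox"): 1,
}

# Candidate phrases as (start, end) spans over the sentence, with their weights
spans = {}
for start in range(len(sentence)):
    for end in range(start + 1, len(sentence) + 1):
        phrase = tuple(sentence[start:end])
        if phrase in weights:
            spans[(start, end)] = weights[phrase]

prob = pulp.LpProblem("max_weight_split", pulp.LpMaximize)
x = {s: pulp.LpVariable("x_%d_%d" % s, cat="Binary") for s in spans}
prob += pulp.lpSum(spans[s] * x[s] for s in spans)   # objective: total weight
for pos in range(len(sentence)):                     # each word covered exactly once
    prob += pulp.lpSum(x[s] for s in spans if s[0] <= pos < s[1]) == 1
prob.solve()

chosen = [sentence[s[0]:s[1]] for s in spans if x[s].value() > 0.5]
print(chosen)   # expected: [['the'], ['quick', 'brown'], ['fox']]

On the example data this should select "the", "quick brown", "fox" for a total weight of 28.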

"the" -> 10, "quick" -> 5, "brown" -> 3, "fox" -> 8
For the individual words above, take an array of scores:
[10, 5, 3, 8] for words 0, 1, 2, 3
Traverse the list and check whether the sum of two adjacent single-word scores is less than the score of the combined (double-word) phrase.
For example:
10 + 5 > 5, so "the" + "quick" beats "the quick"
5 + 3 < 10, so "quick brown" beats "quick" + "brown". Mark this.
and so on.
While marking the combined choices, mark them along continuous ranges.
For example, if the scores are
single-word scores = [1,2,5,3,1,4,6,2,6,8] and combined (pair) scores = [4,6,9,7,8,2,9,1,2]
the marked ranges (inclusive of both ends)
are [0,1], [2,5], [6,7]
Pseudo code is given below
traverse from 0 to word length - 1:
    if number not in a marked range:
        add word[number] to the overall sum
    else:
        if length of range = 1:
            add combined_word_score[lower_end_number]
        else if length of range = 2:
            add combined_word_score[lower_end_number + next number]
        else if length of range > 2 and is an odd number:
            add max(alternate_score starting at lower_end_number,
                    word[lower_end] + word[higher_end] + alternate_score starting at next_number)
        else if length of range > 2 and is an even number:
            add max(alternate_score starting at lower_end_number + word[higher_end],
                    word[lower_end] + alternate_score starting at next_number)

This feels like a dynamic programming question.
I can imagine the k words of the sentence placed beside each other with a light bulb between each pair of adjacent words (i.e. k-1 light bulbs in total). If a light bulb is switched on, that means the words adjoining it are part of a single phrase, and if it's off, they are not. So any configuration of these light bulbs indicates a possible combination of weights. Of course, many configurations are not even possible, because we don't have scores for the phrases they would require. So k-1 light bulbs mean there are at most 2^(k-1) possible answers for us to go through.
Rather than brute forcing it, we can recognize that there are parts of each computation that we can reuse for other computations, so for (The)(quick)(brown fox ... lazy dog) and (The quick)(brown fox ... lazy dog), we can compute the maximum score for (brown fox ... lazy dog) only once, memoize it and re-use it without doing any extra work the next time we see it.
Before we even start, we should get rid of the light bulbs that can have only one possible value (suppose we did not have the phrase 'brown fox', or any bigger phrase containing it; then the light bulb between 'brown' and 'fox' would always have to be off). Each removed bulb halves the solution space.
If w1, w2, w3, ... are the words, then the bulbs are w1w2, w2w3, w3w4, and so on. So
Optimal(w1w2 w2w3 w3w4 ...) = max(Optimal(w2w3 w3w4 ...) given w1w2 is on, Optimal(w2w3 w3w4 ...) given w1w2 is off)
(Caveat: if we reach a configuration with no possible solution, we just return MIN_INT and things should work out.)
We can solve the problem like this, but we could probably save even more time if we were clever about the order in which we approach the bulbs. Maybe attacking the center bulbs first would help; I am not sure about this part.
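For what it's worth, here is a short memoized sketch of this idea in Python, phrased over word positions rather than light bulbs: best(i) is the best achievable score for the words from position i onward, trying every phrase starting at i that has a defined weight (the weights dictionary is just the example from the question):

from functools import lru_cache

weights = {
    ("the",): 10, ("quick",): 5, ("brown",): 3, ("fox",): 8,
    ("the", "quick"): 5, ("quick", "brown"): 10, ("brown", "fox"): 1,
}
sentence = "the quick brown fox".split()
NO_SOLUTION = float("-inf")   # plays the role of MIN_INT in the caveat above

@lru_cache(maxsize=None)
def best(i):
    # best score achievable for sentence[i:], or NO_SOLUTION if it cannot be covered
    if i == len(sentence):
        return 0
    candidates = []
    for j in range(i + 1, len(sentence) + 1):
        phrase = tuple(sentence[i:j])
        if phrase in weights:
            rest = best(j)
            if rest != NO_SOLUTION:
                candidates.append(weights[phrase] + rest)
    return max(candidates, default=NO_SOLUTION)

print(best(0))   # 28: "the" + "quick brown" + "fox"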

How to combine word embedding vectors into one vector?

I understand the idea and methods behind word embeddings (skip-gram, CBOW), and I know that Google provides a word2vec API that produces a vector for a given word.
But my problem is this: we have a clause that includes a subject, object, verb, etc., where each word has already been embedded by the Google API. Now, how can we combine these vectors to create a single vector that is equal to the clause?
Example:
Clause: V = "dog bites man"
After word embedding by Google, we have V1, V2, V3, which map to dog, bites, man respectively, and we assume that:
V = V1 + V2 + V3
How can we obtain V?
I would appreciate it if you explained this with an example using real vectors.
A vector is basically just a list of numbers. You add vectors by adding the numbers in the same position in each list together. Here's an example:
a = [1, 2, 3]
b = [4, 5, 6]
c = [x + y for x, y in zip(a, b)]  # element-wise vector addition (plain + would concatenate Python lists)
c is [(1+4), (2+5), (3+6)], or [5, 7, 9]
As indicated in this question, a simple way to do this in Python is:
list(map(sum, zip(a, b)))
Vector addition is part of linear algebra. If you don't understand operations on vectors and matrices the math around word vectors will be very hard to understand, so you may want to look into learning more about linear algebra in general.
Adding word vectors together is usually a reasonable way to approximate a sentence vector. However, your example of Dog bites man and Man bites dog shows the weakness of adding vectors: the result doesn't change based on word order, so the results for those two sentences would be the same, even though their meanings are very different.
For methods of getting sentence vectors that are affected by word order, look into doc2vec or the just-released InferSent.
Two solutions:
Use vector addition of the constituent words of a phrase - this typically works well because addition is a good estimation of semantic composition.
Use paragraph vectors, which can encode an arbitrary-length sequence of words as a single vector.
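As a tiny illustration of option 1, assuming you already have a word-to-vector mapping (the 4-dimensional toy vectors below are made up purely for the example):

import numpy as np

embeddings = {
    "dog":   np.array([0.2, 0.1, 0.9, 0.3]),
    "bites": np.array([0.7, 0.4, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.2, 0.5]),
}
words = "dog bites man".split()

clause_sum = sum(embeddings[w] for w in words)   # V = V1 + V2 + V3
clause_avg = clause_sum / len(words)             # averaging is a common variant
print(clause_sum, clause_avg)

Averaging instead of summing keeps the result on the same scale regardless of clause length, which often helps when comparing clauses of different lengths.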
In this paper: https://arxiv.org/pdf/2004.07464.pdf
they combine an image embedding and a text embedding by concatenating them:
X = TE + IE
Here X is the fusion embedding, with TE and IE as the text and image embeddings respectively, and "+" denoting concatenation.
If your TE and IE each have a dimension of, say, 2048, your X will be of length 2*2048. You can use that directly if possible, or, if you want to reduce the dimension, you can use t-SNE/PCA or https://arxiv.org/abs/1708.03629 (implemented here: https://github.com/vyraun/Half-Size)
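For illustration, a minimal numpy sketch of that concatenation-style fusion, with made-up dimensions (128 instead of 2048, for brevity) and an optional PCA step for dimensionality reduction:

import numpy as np
from sklearn.decomposition import PCA

te = np.random.rand(10, 128)   # stand-in for 10 text embeddings
ie = np.random.rand(10, 128)   # stand-in for the 10 matching image embeddings

fused = np.concatenate([te, ie], axis=1)              # X = [TE ; IE], shape (10, 256)
reduced = PCA(n_components=8).fit_transform(fused)    # optional dimensionality reduction
print(fused.shape, reduced.shape)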

Association Rules using Python with data in sentence form

I would like to calculate association rules from a text field from a dataset such as the one below using Python:
ID fav_breakfast
1 I like to eat eggs and bacon for breakfast.
2 Bacon, bacon, bacon!
3 I love pancakes, but only if they have extra syrup!
4 Waffles and bacon. Eggs too!
5 Eggs, potatoes, and pancakes. No meat for me!
Please note that Orange 2.7 is not an option as I am using the current version of Python (3.6), so Orange 3 is fair game; however, I can not seem to figure out how this module works with data in this format.
The first step, in my mind, would be to convert the above into a sparse matrix, such as the (truncated) one shown below:
Next, we would want to remove stop words (i.e. I, to, and, for, etc.), normalize upper/lower case, and deal with numbers, punctuation, and word variants such as potato, potatoes, potatos, etc. (with lemmatization).
Once this sparse matrix is in place, the next step would be to calculate association rules amongst all of the words/strings in the sparse matrix. I have done this in R using the arules package; however, I haven't been able to identify an "arules equivalent" for Python.
The final solution that I envision would include a list of left-hand and right-hand side arguments along with the support, confidence, and lift of the rules in descending order with the highest lift rules at the top and lowest lift rules at the bottom (again, easy enough to obtain in R with arules).
In addition, I would like the ability to fix the right-hand side to "bacon", again listing the support, confidence, and lift of those rules in descending order, with the highest-lift rules involving "bacon" at the top and the lowest-lift rules at the bottom.
Using Orange3-Associate will likely be the route to go here; however, I cannot find any good examples on the web. Thanks for your help in advance!
Is this what you had in mind? Orange should be able to pass outputs from one add-on and use them as inputs in another.
[EDIT]
I was able to reconstruct the case in code, but it is far less sexy:
import numpy as np
from orangecontrib.text import Corpus, preprocess, vectorization
from orangecontrib.associate.fpgrowth import *

# Load a sample corpus, preprocess it, and build a bag-of-words representation
data = Corpus.from_file("deerwester")
p = preprocess.Preprocessor()
preproc_corpus = p(data)
v = vectorization.bagofwords.BoWPreprocessTransform(p, "Count", preproc_corpus)

# Mine association rules from a boolean document-term matrix (random here, for illustration)
N = 30
X = np.random.random((N, 50)) > .9
itemsets = dict(frequent_itemsets(X, .1))
rules = association_rules(itemsets, .6)
list(rules_stats(rules, itemsets, N))
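As an untested sketch, you could apply the same frequent_itemsets / association_rules / rules_stats calls to the breakfast data from the question by building the boolean document-term matrix by hand (the small vocabulary and the support/confidence thresholds below are only illustrative; the itemsets come back as column indices into the matrix, which you then map back to words):

import numpy as np
from orangecontrib.associate.fpgrowth import frequent_itemsets, association_rules, rules_stats

vocab = ["eggs", "bacon", "pancakes", "waffles", "potatoes", "syrup"]
docs = [
    {"eggs", "bacon"},                 # 1
    {"bacon"},                         # 2
    {"pancakes", "syrup"},             # 3
    {"waffles", "bacon", "eggs"},      # 4
    {"eggs", "potatoes", "pancakes"},  # 5
]
X = np.array([[word in doc for word in vocab] for doc in docs], dtype=bool)

N = len(docs)
itemsets = dict(frequent_itemsets(X, 2 / N))   # itemsets appearing in at least 2 documents
rules = association_rules(itemsets, 0.5)       # minimum confidence 0.5
for stats in rules_stats(rules, itemsets, N):
    print(stats)   # antecedent/consequent column indices plus support, confidence, lift, ...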

How to measure the "probability" that a string is some sort of code or nonsense

Let's assume that we have the following strings:
q8GDNG8h029751
DNS
stackoverflow.com
28743.8.4.919
q7Q5w5dP012855
Martin_Luther
0000000100000000-0000000160000000
1344444967\.962
ExTreme_penguin
Obviously some of these can be classified by our brains as strings containing information, strings that have some "meaning" for humans. On the other hand, there are strings like "q7Q5w5dP012855" that are clearly codes that could mean something only to a computer.
My question is: can we calculate some probability that a string actually tells us something?
I have some ideas, such as frequency analysis or counting capital letters, but it would be convenient to have something more 'scientific'.
If you know the language that the strings are in, you could use digram or trigram letter frequencies for the words in that language. These are quite small lookup tables ([26 x 26] or [26 x 26 x 26]); each entry is a floating-point number giving the probability of that letter sequence occurring in the language. Many of these would be zero for a meaningless string. You could add them up, or simply count the number of zero-probability sequences.
Of course this needs setting up for each language.
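A minimal sketch of the digram idea in Python, assuming you have a sample of ordinary text in the target language to estimate letter-pair frequencies from (the tiny training text and the 1e-8 floor for unseen pairs are placeholders; in practice you would train on a large corpus):

from collections import Counter
import math

def train_bigrams(text):
    # estimate letter-pair probabilities from a sample of "normal" text
    letters = "".join(c.lower() for c in text if c.isalpha())
    pairs = Counter(letters[i:i+2] for i in range(len(letters) - 1))
    total = sum(pairs.values())
    return {pair: count / total for pair, count in pairs.items()}

def score(s, bigram_probs):
    letters = "".join(c.lower() for c in s if c.isalpha())
    if len(letters) < 2:
        return float("-inf")
    # average log-probability per letter pair; unseen pairs get a small floor value
    logs = [math.log(bigram_probs.get(letters[i:i+2], 1e-8)) for i in range(len(letters) - 1)]
    return sum(logs) / len(logs)

probs = train_bigrams("the quick brown fox jumps over the lazy dog " * 100)
for candidate in ["stackoverflow", "Martin_Luther", "q8GDNG8h029751"]:
    print(candidate, round(score(candidate, probs), 2))

Meaningful strings tend to get a higher (less negative) average log-probability than random-looking codes, so a threshold on that score gives you the kind of "probability of meaning" you are after.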

Find sentences with similar relative meaning from a list of sentences against an example one

I want to be able to find sentences with the same meaning. I have a query sentence and a long list of millions of other sentences. Sentences consist of words, or a special type of word called a symbol, which simply stands for some object being talked about.
For example, my query sentence is:
Example: add (x) to (y) giving (z)
There may be a list of sentences already existing in my database, such as:
1. the sum of (x) and (y) is (z)
2. (x) plus (y) equals (z)
3. (x) multiplied by (y) does not equal (z)
4. (z) is the sum of (x) and (y)
The example should match sentences 1, 2, and 4 in my database, but not 3. Also, there should be some weight for how well each sentence matches.
It's not just math sentences; it's any sentence that can be compared to any other sentence based on the meaning of its words. I need some way to compare a sentence against many other sentences and find the ones with the closest relative meaning, i.e. a mapping between sentences based upon their meaning.
Thanks! (the tag is language-design as I couldn't create any new tag)
First off: what you're trying to solve is a very hard problem. Depending on what's in your dataset, it may be AI-complete.
You'll need your program to know or learn that add, plus and sum refer to the same concept, while multiplies is a different concept. You may be able to do this by measuring distance between the words' synsets in WordNet/FrameNet, though your distance calculation will have to be quite refined if you don't want to find multiplies. Otherwise, you may want to manually establish some word-concept mappings (such as {'add' : 'addition', 'plus' : 'addition', 'sum' : 'addition', 'times' : 'multiplication'}).
If you want full sentence semantics, you will in addition have to parse the sentences and derive the meaning from the parse trees/dependency graphs. The Stanford parser is a popular choice for parsing.
You can also find inspiration for this problem in Question Answering research. There, a common approach is to parse sentences, then store fragments of the parse tree in an index and search for them with common search-engine techniques (e.g. tf-idf, as implemented in Lucene). That will also give you a score for each sentence.
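As a rough baseline for that search-engine style of scoring, here is a sketch using scikit-learn's TfidfVectorizer and cosine similarity (sentence list taken from the question). Note that plain tf-idf only rewards shared terms, so the original query containing "add" scores near zero against sentences that say "sum" or "plus"; that gap is exactly what the synonym/concept handling described above is for.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

database = [
    "the sum of (x) and (y) is (z)",
    "(x) plus (y) equals (z)",
    "(x) multiplied by (y) does not equal (z)",
    "(z) is the sum of (x) and (y)",
]
queries = [
    "add (x) to (y) giving (z)",       # shares no content words: all scores near zero
    "the sum of (x) and (y) is (z)",   # sanity check: should rank sentences 1 and 4 highest
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(database)

for query in queries:
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    print(query)
    for sentence, s in sorted(zip(database, scores), key=lambda pair: -pair[1]):
        print("   %.3f  %s" % (s, sentence))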
You will need to stem the words in your sentences down to a common root (or map them to a common synonym), then compare those stems and use the ratio of stem matches in a sentence (say 5 out of 10 words) against some threshold to decide whether the sentence is a match, for example all sentences with a word match of over 80% (or whatever percentage you deem accurate). At least that is one way to do it.
Write a function which creates some kind of hash, or "expression", from a sentence, which can easily be compared with other sentences' hashes.
Roughly:
1. "the sum of (x) and (y) is (z)" => x + y = z
4. "(z) is the sum of (x) and (y)" => z = x + y
Some tips for the transformation: omit "the" words, convert double-word terms to a single word ("sum of" => "sumof"), find the operator word and replace "and" with it.
Not that easy ^^
You should use a stopword filter first, to get non-information-bearing words out of it. Here are some good ones
Then you want to handle synonyms. That's actually a really complex topic, because you need some kind of word sense disambiguation to do it, and most state-of-the-art methods are only a little better than the simplest solution: take the most common sense of a word. You can do that with WordNet. You can get the synsets for a word, which contain all of its synonyms. Then you can generalize the word (to its hypernym), take the most common sense, and replace the search term with it.
Just to say it: handling synonyms is pretty hard in NLP. If you just want to handle different word forms like add and adding, for example, you could use a stemmer, but no stemmer will get you from add to sum (WSD is the only way there).
Then you have different word orderings in your sentences, which shouldn't be ignored either if you want exact answers (x+y=z is different from x+z=y). So you need word dependencies as well, so you can see which words depend on each other. The Stanford Parser is well suited to that task if you are working with English.
Perhaps you should just extract the nouns and verbs from a sentence, do all the preprocessing on them, and query for the dependencies in your search index.
A dependency would look like
x (sum, y)
y (sum, x)
sum (x, y)
which you could use for your search.
So you need to tokenize, generalize, get dependencies, and filter unimportant words to get your result. And if you want to do it in German, you will need a word decompounder as well.
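A small sketch of the WordNet step using NLTK (it needs a one-time nltk.download('wordnet')). Taking the maximum path similarity over all verb senses is the crude "most common sense" shortcut described above, not real word sense disambiguation, and the exact scores depend on your WordNet version:

from nltk.corpus import wordnet as wn

# List the candidate verb senses (synsets) for a word, with their synonyms
for synset in wn.synsets("add", pos=wn.VERB):
    print(synset.name(), synset.lemma_names())

# Crude relatedness: the best path similarity between any pair of verb senses
def relatedness(word_a, word_b):
    scores = [a.path_similarity(b) or 0.0
              for a in wn.synsets(word_a, pos=wn.VERB)
              for b in wn.synsets(word_b, pos=wn.VERB)]
    return max(scores, default=0.0)

print(relatedness("add", "sum"))       # should come out higher...
print(relatedness("add", "multiply"))  # ...than this, but verify on your own data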

Sorting a list of colors in one dimension?

I would like to sort a one-dimensional list of colors so that colors that a typical human would perceive as "like" each other are near each other.
Obviously this is a difficult or perhaps impossible problem to get "perfectly", since colors are typically described with three dimensions, but that doesn't mean that there aren't some sorting methods that look obviously more natural than others.
For example, sorting by RGB doesn't work very well, as it will sort in the following order, for example:
(1) R=254 G=0 B=0
(2) R=254 G=255 B=0
(3) R=255 G=0 B=0
(4) R=255 G=255 B=0
That is, it will alternate those colors red, yellow, red, yellow, with the two "reds" being essentially imperceptibly different from each other, and the two yellows also being imperceptibly different from each other.
But sorting by HLS works much better, generally speaking, and I think HSL even better than that; with either, the reds will be next to each other, and the yellows will be next to each other.
But HLS/HSL has some problems, too; things that people would perceive as "black" could be split far apart from each other, as could things that people would perceive as "white".
Again, I understand that I pretty much have to accept that there will be some splits like this; I'm just wondering if anyone has found a better way than HLS/HSL. And I'm aware that "better" is somewhat arbitrary; I mean "more natural to a typical human".
For example, a vague thought I've had, but have not yet tried, is perhaps "L is the most important thing if it is very high or very low", but otherwise it is the least important. Has anyone tried this? Has it worked well? What specifically did you decide "very low" and "very high" meant? And so on. Or has anyone found anything else that would improve upon HSL?
I should also note that I am aware that I can define a space-filling curve through the cube of colors, and order them one-dimensionally as they would be encountered while travelling along that curve. That would eliminate perceived discontinuities. However, it's not really what I want; I want decent overall large-scale groupings more than I want perfect small-scale groupings.
Thanks in advance for any help.
If you want to sort a list of colors in one dimension, you first have to decide by what metric you are going to sort them. What makes the most sense to me is perceived brightness (related question).
I have come across 4 algorithms to sort colors by brightness and compared them. Here is the result.
I generated colors in a cycle where only about every 400th color was used. Each color is represented by 2x2 pixels; colors are sorted from darkest to lightest (left to right, top to bottom).
1st picture - Luminance (relative)
0.2126 * R + 0.7152 * G + 0.0722 * B
2nd picture - http://www.w3.org/TR/AERT#color-contrast
0.299 * R + 0.587 * G + 0.114 * B
3rd picture - HSP Color Model
sqrt(0.299 * R^2 + 0.587 * G^2 + 0.114 * B^2)
4th picture - WCAG 2.0 SC 1.4.3 relative luminance and contrast ratio formula
A pattern can sometimes be spotted in the 1st and 2nd pictures, depending on the number of colors in a row. I never spotted any pattern in the pictures from the 3rd or 4th algorithm.
If I had to choose, I would go with algorithm number 3, since it's much easier to implement and about 33% faster than the 4th.
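If it helps, a small Python sketch of sorting by formula 3 (the HSP brightness), with (R, G, B) tuples in the 0-255 range; swapping in one of the other formulas is just a change to the key function:

import math

def hsp_brightness(color):
    # HSP "perceived brightness" (formula 3 above)
    r, g, b = color
    return math.sqrt(0.299 * r**2 + 0.587 * g**2 + 0.114 * b**2)

colors = [(254, 0, 0), (254, 255, 0), (255, 0, 0), (255, 255, 0)]
print(sorted(colors, key=hsp_brightness))   # darkest to lightest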
You cannot do this without reducing the 3 color dimensions to a single measurement. There are many (infinite) ways of reducing this information, but it is not mathematically possible to do this in a way that ensures that two data points near each other on the reduced continuum will also be near each other in all three of their component color values. As a result, any formula of this type will potentially end up grouping dissimilar colors.
As you mentioned in your question, one way to sort of do this would be to fit a complex curve through the three-dimensional color space occupied by the data points you're trying to sort, and then reduce each data point to its nearest location on the curve and then to that point's distance along the curve. This would work, but in each case it would be a solution custom-tailored to a particular set of data points (rather than a generally applicable solution). It would also be relatively expensive (maybe), and simply wouldn't work on a data set that was not nicely distributed in a curved-line sort of way.
A simpler alternative (that would not work perfectly) would be to choose two "endpoint" colors, preferably on opposite sides of the color wheel. So, for example, you could choose Red as one endpoint color and Blue as the other. You would then convert each color data point to a value on a scale from 0 to 1, where a color that is highly Reddish would get a score near 0 and a color that is highly Bluish would get a score near 1. A score of .5 would indicate a color that either has no Red or Blue in it (a.k.a. Green) or else has equal amounts of Red and Blue (a.k.a. Purple). This approach isn't perfect, but it's the best you can do with this problem.
There are several standard techniques for reducing multiple dimensions to a single dimension with some notion of "proximity".
I think you should in particular check out the z-order transform.
You can implement a quick version of this by interleaving the bits of your three colour components, and sorting the colours based on this transformed value.
The following Java code should help you get started:
public static int zValue(int r, int g, int b) {
    // interleave the bits of r, g and b so nearby colours get nearby z-values
    return split(r) + (split(g)<<1) + (split(b)<<2);
}

public static int split(int a) {
    // split out the lowest 10 bits to lowest 30 bits
    a=(a|(a<<12))&00014000377;
    a=(a|(a<<8)) &00014170017;
    a=(a|(a<<4)) &00303030303;
    a=(a|(a<<2)) &01111111111;
    return a;
}
There are two approaches you could take. The simple approach is to distil each colour into a single value, and the list of values can then be sorted. The complex approach would depend on all of the colours you have to sort; perhaps it would be an iterative solution that repeatedly shuffles the colours around trying to minimise the "energy" of the entire sequence.
My guess is that you want something simple and fast that looks "nice enough" (rather than trying to figure out the "optimum" aesthetic colour sort), so the simple approach is enough for you.
I'd say HSL is the way to go. Something like
sortValue = L * 5 + S * 2 + H
assuming that H, S and L are each in the range [0, 1].
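A quick sketch of that weighted sort using Python's built-in colorsys (note that colorsys.rgb_to_hls returns the components in H, L, S order, all in [0, 1]); the weights are the ones suggested above:

import colorsys

def hsl_sort_value(color):
    r, g, b = (c / 255.0 for c in color)
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    return l * 5 + s * 2 + h

colors = [(254, 0, 0), (254, 255, 0), (255, 0, 0), (255, 255, 0)]
print(sorted(colors, key=hsl_sort_value))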
Here's an idea I came up with after a couple of minutes' thought. It might be crap, or it might not even work at all, but I'll spit it out anyway.
Define a distance function on the space of colours, d(x, y) (where the inputs x and y are colours and the output is perhaps a floating-point number). The distance function you choose may not be terribly important. It might be the sum of the squares of the differences in R, G and B components, say, or it might be a polynomial in the differences in H, L and S components (with the components differently weighted according to how important you feel they are).
Then you calculate the "distance" of each colour in your list from each other, which effectively gives you a graph. Next you calculate the minimum spanning tree of your graph. Then you identify the longest path (with no backtracking) that exists in your MST. The endpoints of this path will be the endpoints of the final list. Next you try to "flatten" the tree into a line by bringing points in the "branches" off your path into the path itself.
Hmm. This might not work all that well if your MST ends up in the shape of a near-loop in colour space. But maybe any approach would have that problem.
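For the curious, a hedged sketch of the first two steps of this idea (pairwise distances and the minimum spanning tree) using SciPy; extracting the longest path and flattening the branches, which is the interesting part, is left out:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

colors = np.array([(254, 0, 0), (254, 255, 0), (255, 0, 0), (255, 255, 0)], dtype=float)
distances = squareform(pdist(colors))     # Euclidean distances in RGB space
mst = minimum_spanning_tree(distances)    # sparse matrix holding the MST's edges
print(mst.toarray())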
