Entropy and Information Gain - statistics

Simple question I hope.
If I have a set of data like this:
Classification attribute-1 attribute-2
Correct dog dog
Correct dog dog
Wrong dog cat
Correct cat cat
Wrong cat dog
Wrong cat dog
Then what is the information gain of attribute-2 relative to attribute-1?
I've computed the entropy of the whole data set: -(3/6)*log2(3/6) - (3/6)*log2(3/6) = 1
Then I'm stuck! I think I also need to calculate the entropies of attribute-1 and attribute-2, and then use these three values in an information gain calculation?
Any help would be great,
Thank you :).

First you have to calculate the entropy for each value of an attribute, then combine those into a weighted average and subtract it from the entropy of the whole data set to get the information gain. Here is how it works for attribute-1.
For attribute-1:
attr-1 = dog:
info([2c,1w]) = entropy(2/3, 1/3)
attr-1 = cat:
info([1c,2w]) = entropy(1/3, 2/3)
Weighted entropy for attribute-1:
info([2c,1w],[1c,2w]) = (3/6)*info([2c,1w]) + (3/6)*info([1c,2w])
Gain for attribute-1:
gain("attr-1") = info([3c,3w]) - info([2c,1w],[1c,2w])
And you have to do the same for attribute-2.
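Putting those formulas into code, here is a minimal Python sketch of the same calculation (using Classification as the target class, as above; the function names are my own):
from collections import Counter
from math import log2

# The six rows from the question: (Classification, attribute-1, attribute-2)
rows = [
    ("Correct", "dog", "dog"),
    ("Correct", "dog", "dog"),
    ("Wrong",   "dog", "cat"),
    ("Correct", "cat", "cat"),
    ("Wrong",   "cat", "dog"),
    ("Wrong",   "cat", "dog"),
]

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr_index):
    # entropy of the whole set minus the weighted entropy after splitting on one attribute
    base = entropy([r[0] for r in rows])
    weighted = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[0] for r in rows if r[attr_index] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return base - weighted

print(information_gain(rows, 1))  # attribute-1: 1 - 0.918 = about 0.082 bits
print(information_gain(rows, 2))  # attribute-2: 1 - 1.000 = 0.0 bits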

Related

Creating a "signature" based on data?

Signature: "a distinctive pattern, product, or characteristic by which someone or something can be identified."
I'm creating a neural net that takes songs as inputs and outputs 500 values: similarities with the top artists. Like:
1. 25% like Muse
2. 23% like the Arctic Monkeys
3. 20% like Imagine Dragons, etc
...
500. 0% like Beethoven
I want to create some type of "signature" from this output, with which I could, hopefully, do some interesting things programmatically.
Does anyone have any ideas?
I should also say that I already have plans for this output: I want to use approximate nearest neighbors to recommend (submitted) songs based on it. But I want to do more interesting things as well.
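For what it's worth, one simple way to treat that 500-value output as a signature is to index the vectors and look up the closest songs by cosine distance. The sketch below uses made-up data and scikit-learn's exact NearestNeighbors purely for illustration; a real approximate-nearest-neighbor library would take its place at scale.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical data: 1000 songs, each described by 500 artist-similarity scores.
rng = np.random.default_rng(0)
signatures = rng.random((1000, 500))

# Index the signatures and query with the signature of a newly submitted song.
index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(signatures)
query = rng.random((1, 500))
distances, neighbors = index.kneighbors(query)
print(neighbors[0])  # indices of the 5 most similar songs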

Difference between adequacy and fluency in ngram

"When 1-gram precision is high, the reference tends to satisfy
adequacy.
When longer n-gram precision is high, the reference tends to account
for fluency."
What does this mean?
Adequacy: How much of the source information is preserved?
Fluency: How good is the quality of the generated target-language text?
For a Machine Translation task:
English (S-V-O): A dog chased a cat
The reference translation in Hindi (glossed word-by-word in English) would look like:
Hindi (S-O-V): A dog a cat chased
When 1-gram precision is high, many word-to-word translations are correct, but the order of those words might not be right in the translated sentence. Still, the majority of the source information is preserved, so adequacy is high.
Example with high 1-gram but low n-gram (2-gram) precision: chased dog cat
When n-gram precision is high, the order of those words is preserved to some extent, so you get a more fluent sentence in Hindi.
Example with high n-gram (2-gram) precision: A dog cat chased
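To make the two examples concrete, here is a small Python sketch of a clipped n-gram precision (a simplified, single-reference version of what BLEU computes; the function names and whitespace tokenization are my own):
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    # clipped n-gram precision of a candidate against a single reference
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    overlap = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)

reference = "A dog a cat chased"  # gloss of the Hindi (S-O-V) reference
for candidate in ["chased dog cat", "A dog cat chased"]:
    print(candidate,
          "| 1-gram:", round(ngram_precision(candidate, reference, 1), 2),
          "| 2-gram:", round(ngram_precision(candidate, reference, 2), 2))
# "chased dog cat"  -> 1-gram 1.0, 2-gram 0.0  (adequate but not fluent)
# "A dog cat chased" -> 1-gram 1.0, 2-gram 0.67 (more fluent)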

Unknown Words in N-Gram Modelling

What is the logic behind grouping unknown words under the same token, i.e. <UNK>, and also including words with small probabilities?
Won't some rare words get assigned high probabilities if the <UNK> set grows in size?
This might work if all the <UNK> words belong to the same class in some sense; for example, proper nouns such as John, Tim, and Sam can all share each other's probability, since the bi-grams in "Hello John, Hello Tim, Hello Sam" are equally likely. But if that is not the case, won't this method run into problems?
Mapping rare words to <UNK> simply means that we delete those words and replace them with the token <UNK> in the training data. Thus our model does not know of any rare words. It is a crude form of smoothing, because the model assumes that the token <UNK> will never actually occur in real data, or rather, it ignores these n-grams altogether.
The problem that smoothing is trying to solve is data sparsity, and this technique is probably the simplest way to deal with it. However, we can do better, as @alvas shows in the comments.
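As a concrete illustration of that preprocessing step, here is a minimal sketch (my own example, not from the thread) that maps every word seen fewer than min_count times to <UNK> before any n-gram counts are collected:
from collections import Counter

def replace_rare_words(sentences, min_count=2):
    # map every word seen fewer than min_count times to the <UNK> token
    counts = Counter(w for sent in sentences for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    return [[w if w in vocab else "<UNK>" for w in sent] for sent in sentences]

corpus = [["hello", "john"], ["hello", "tim"], ["hello", "sam"], ["hello", "hello"]]
print(replace_rare_words(corpus))
# [['hello', '<UNK>'], ['hello', '<UNK>'], ['hello', '<UNK>'], ['hello', 'hello']]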

Document Vectorization Representation in Python

I was trying my hand at sentiment analysis in Python 3, using the TF-IDF vectorizer with the bag-of-words model to vectorize the documents.
So, to anyone who is familiar with that, it is quite evident that the resulting matrix representation is sparse.
Here is a snippet of my code. Firstly, the documents.
tweets = [('Once you get inside you will be impressed with the place.',1),('I got home to see the driest damn wings ever!',0),('An extensive menu provides lots of options for breakfast.',1),('The flair bartenders are absolutely amazing!',1),('My first visit to Hiro was a delight!',1),('Poor service, the waiter made me feel like I was stupid every time he came to the table.',0),('Loved this place.',1),('This restaurant has great food',1),
('Honeslty it did not taste THAT fresh :(',0),('Would not go back.',0),
('I was shocked because no signs indicate cash only.',0),
('Waitress was a little slow in service.',0),
('did not like at all',0),('The food, amazing.',1),
('The burger is good beef, cooked just right.',1),
('They have horrible attitudes towards customers, and talk down to each one when customers do not enjoy their food.',0),
('The cocktails are all handmade and delicious.',1),('This restaurant has terrible food',0),
('Both of the egg rolls were fantastic.',1),('The WORST EXPERIENCE EVER.',0),
('My friend loved the salmon tartar.',1),('Which are small and not worth the price.',0),
('This is the place where I first had pho and it was amazing!!',1),
('Horrible - do not waste your time and money.',0),('Seriously flavorful delights, folks.',1),
('I loved the bacon wrapped dates.',1),('I dressed up to be treated so rudely!',0),
('We literally sat there for 20 minutes with no one asking to take our order.',0),
('you can watch them preparing the delicious food! :)',1),('In the summer, you can dine in a charming outdoor patio - so very delightful.',1)]
X_train, y_train = zip(*tweets)
And the following code to vectorize the documents.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvec = TfidfVectorizer(lowercase=True)
vectorized = tfidfvec.fit_transform(X_train)
print(vectorized)
When I print vectorized, it does not output a normal matrix. Instead, this:
If I'm not wrong, this must be a sparse matrix representation. However, I am not able to comprehend its format and what each term means.
Also, there are 30 documents, so that explains the 0-29 in the first column. If that's the trend, then I'm guessing the second column is the index of the words and the last value is its tf-idf? It just struck me while I was typing my question, but kindly correct me if I'm wrong.
Could anyone with experience in this help me understand it better?
Yes, technically the first two numbers in each tuple are the row and column position, and the third column is the tf-idf value at that position. So it is basically showing only the positions and values of the nonzero entries.
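If it helps, here is one way to verify that interpretation with the code above (assuming a recent scikit-learn that provides get_feature_names_out; older versions use get_feature_names instead):
# Inspect the sparse matrix explicitly via COO format, which exposes row/col/data arrays.
words = tfidfvec.get_feature_names_out()
coo = vectorized.tocoo()
for row, col, value in zip(coo.row, coo.col, coo.data):
    print(f"document {row}, word '{words[col]}', tf-idf {value:.3f}")

# Or convert to an ordinary dense array (fine for 30 short documents):
print(vectorized.toarray().shape)  # (30, vocabulary_size)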

Issues in using lda for vowpal wabbit

I am trying to use the Vowpal Wabbit LDA model, but I am getting very bad results. I think there is something wrong with my process. My vocabulary size is 100000.
I run the code like this
vw --data train.txt --lda 50 --lda_alpha 0.1 --lda_rho 0.1 --lda_D 262726 -b 20 -pions.dat --readable_model wordtopics.dat
Now I was expecting the wordtopics.dat file to contain the topic proportions for those 100000 words, but this wordtopics.dat file turns out to be huge, containing about 1048587 lines.
I think that is because of -b 20, which gives a 2^20 = 1048576-entry feature table, and the lines at the end look like a uniform probability distribution.
However, when I look at the topics obtained, they do not make sense at all, so I think something is wrong. What could be going wrong?
Not answering your question directly, but the Applied Data Science group at Columbia University has made a helper for working with VW's LDA, especially for viewing the results.
Also, try using the --passes option so VW can make multiple passes over the training data and produce a better result.
