So, usually for unbiased coins, the probability of getting 2 heads out of 3 flips is - 3C2 * 1/2 * 1/2 * 1/2 = 3/8, since we know, the formula for probability is likely events divided by all possible events; we can say that there are 8 possible events here.
Now flip an unbiased coin with the probability of getting heads 80% of the time,
so the probability of getting 2 heads out of 3 flips is -
3C2 * 0.8 * 0.8 * 0.2 = 3/7.8125, so is the sample space 7.8125 here ?
It is still 8. 8 possible results. It's all about classical definition of probability.
In first example (p=50%) each possible result (for example {head, head, not_head}) has the same probability, that's why we can calculate
**total_prob = count_success/count_total = 3*1.000/8 = 0.375**
In the second (p=80%) we don't have this assumption anymore, so cannot use classical definition of probability (count_success/count_total), so we have to calculate
**total_prob = sum_success/count_total = 3*1.024/8 = 0.384**
In general, You can imagine, that in 1st example each result has weight=1.000, and in 2nd example results have different weights (for example {head, head, not_head} has weight=1.024 and {not_head, not_head, not_head} has weight=0.064)
Related
Say I want to test if a coin is fair.
An experiment is performed to determine whether a coin flip is fair (50% chance
of landing heads or tails) or unfairly biased, either toward heads (> 50% chance of
landing heads) or toward tails (< 50% chance of landing heads). Since we consider both biased alternatives, a two-tailed test is performed.
H0 = Coin is fair
H1 = Coin is unfair
Here is the experiment result: 10 Heads and 10 Tails
Then I calculate the probability of events assume H0 the coin is fair
The probability of exactly, or more than, 10 Heads out of 20 tosses is p = .588
By symmetry, the probability of exactly, or more than, 10 Tails out of 20 tosses is the same, .588
Thus, the p-value for the coin turning up the same face 10 times out of 20 total flips is .588 + .588 = 1.176 > 1
But p-value cannot be larger than 1, may I know what is wrong here?
Ref:
PROBABILITY VALUE (p-Value)
Binomial Test Calculator
Case of 10 Heads + 10 Tails is accounted for in both probabilities.
You can see that P(10T) = 0.176, and 0.588 + 0.588 = 1.176 = 1 + P(10T + 10H)
The general relationship for summing event probabilities is
P(A∪B) = P(A) + P(B) - P(A∩B)
Expressed in words, you can only use a simple sum of probabilities for disjoint events.
Your events are A = {#Heads ≥ 10} and B = {#Tails ≥ 10}. {#Tails ≥ 10} => {#Heads ≤ 10}, which makes it clear that the two events are not disjoint because they both include the outcome {#Heads = 10}. Your claim that the probability exceeds 1 fails because you've neglected that intersection term, which has P{#Heads = 10} = 0.176.
I want my sigmoid to never print a solid 1 or 0, but to actually print the exact value
i tried using
torch.set_printoptions(precision=20)
but it didn't work. here's a sample output of the sigmoid function :
before sigmoid : tensor([[21.2955703735]])
after sigmoid : tensor([[1.]])
but i don't want it to print 1, i want it to print the exact number, how can i force this?
The difference between 1 and the exact value of sigmoid(21.2955703735) is on the order of 5e-10, which is significantly less than machine epsilon for float32 (which is about 1.19e-7). Therefore 1.0 is the best approximation that can be achieved with the default precision. You can cast your tensor to a float64 (AKA double precision) tensor to get a more precise estimate.
torch.set_printoptions(precision=20)
x = torch.tensor([21.2955703735])
result = torch.sigmoid(x.to(dtype=torch.float64))
print(result)
which results in
tensor([0.99999999943577644324], dtype=torch.float64)
Keep in mind that even with 64-bit floating point computation this is only accurate to about 6 digits past the last 9 (and will be even less precise for larger sigmoid inputs). A better way to represent numbers very close to one is to directly compute the difference between 1 and the value. In this case 1 - sigmoid(x) which is equivalent to 1 / (1 + exp(x)) or sigmoid(-x). For example,
x = torch.tensor([21.2955703735])
delta = torch.sigmoid(-x.to(dtype=torch.float64))
print(f'sigmoid({x.item()}) = 1 - {delta.item()}')
results in
sigmoid(21.295570373535156) = 1 - 5.642236648842976e-10
and is a more accurate representation of your desired result (though still not exact).
I am implementing the Skipgram model, both in Pytorch and Tensorflow2. I am having doubts about the implementation of subsampling of frequent words. Verbatim from the paper, the probability of subsampling word wi is computed as
where t is a custom threshold (usually, a small value such as 0.0001) and f is the frequency of the word in the document. Although the authors implemented it in a different, but almost equivalent way, let's stick with this definition.
When computing the P(wi), we can end up with negative values. For example, assume we have 100 words, and one of them appears extremely more often than others (as it is the case for my dataset).
import numpy as np
import seaborn as sns
np.random.seed(12345)
# generate counts in [1, 20]
counts = np.random.randint(low=1, high=20, size=99)
# add an extremely bigger count
counts = np.insert(counts, 0, 100000)
# compute frequencies
f = counts/counts.sum()
# define threshold as in paper
t = 0.0001
# compute probabilities as in paper
probs = 1 - np.sqrt(t/f)
sns.distplot(probs);
Q: What is the correct way to implement subsampling using this "probability"?
As an additional info, I have seen that in keras the function keras.preprocessing.sequence.make_sampling_table takes a different approach:
def make_sampling_table(size, sampling_factor=1e-5):
"""Generates a word rank-based probabilistic sampling table.
Used for generating the `sampling_table` argument for `skipgrams`.
`sampling_table[i]` is the probability of sampling
the i-th most common word in a dataset
(more common words should be sampled less frequently, for balance).
The sampling probabilities are generated according
to the sampling distribution used in word2vec:
```
p(word) = (min(1, sqrt(word_frequency / sampling_factor) /
(word_frequency / sampling_factor)))
```
We assume that the word frequencies follow Zipf's law (s=1) to derive
a numerical approximation of frequency(rank):
`frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))`
where `gamma` is the Euler-Mascheroni constant.
# Arguments
size: Int, number of possible words to sample.
sampling_factor: The sampling factor in the word2vec formula.
# Returns
A 1D Numpy array of length `size` where the ith entry
is the probability that a word of rank i should be sampled.
"""
gamma = 0.577
rank = np.arange(size)
rank[0] = 1
inv_fq = rank * (np.log(rank) + gamma) + 0.5 - 1. / (12. * rank)
f = sampling_factor * inv_fq
return np.minimum(1., f / np.sqrt(f))
I tend to trust deployed code more than paper write-ups, especially in a case like word2vec, where the original authors' word2vec.c code released by the paper's authors has been widely used & served as the template for other implementations. If we look at its subsampling mechanism...
if (sample > 0) {
real ran = (sqrt(vocab[word].cn / (sample * train_words)) + 1) * (sample * train_words) / vocab[word].cn;
next_random = next_random * (unsigned long long)25214903917 + 11;
if (ran < (next_random & 0xFFFF) / (real)65536) continue;
}
...we see that those words with tiny counts (.cn) that could give negative values in the original formula instead here give values greater-than 1.0, and thus can never be less than the long-random-masked-and-scaled to never be more than 1.0 ((next_random & 0xFFFF) / (real)65536). So, it seems the authors' intent was for all negative-values of the original formula to mean "never discard".
As per the keras make_sampling_table() comment & implementation, they're not consulting the actual word-frequencies at all. Instead, they're assuming a Zipf-like distribution based on word-rank order to synthesize a simulated word-frequency.
If their assumptions were to hold – the related words are from a natural-language corpus with a Zipf-like frequency-distribution – then I'd expect their sampling probabilities to be close to down-sampling probabilities that would have been calculated from true frequency information. And that's probably "close enough" for most purposes.
I'm not sure why they chose this approximation. Perhaps other aspects of their usual processes have not maintained true frequencies through to this step, and they're expecting to always be working with natural-language texts, where the assumed frequencies will be generally true.
(As luck would have it, and because people often want to impute frequencies to public sets of word-vectors which have dropped the true counts but are still sorted from most- to least-frequent, just a few days ago I wrote an answer about simulating a fake-but-plausible distribution using Zipf's law – similar to what this keras code is doing.)
But, if you're working with data that doesn't match their assumptions (as with your synthetic or described datasets), their sampling-probabilities will be quite different than what you would calculate yourself, with any form of the original formula that uses true word frequencies.
In particular, imagine a distribution with one token a million times, then a hundred tokens all appearing just 10 times each. Those hundred tokens' order in the "rank" list is arbitrary – truly, they're all tied in frequency. But the simulation-based approach, by fitting a Zipfian distribution on that ordering, will in fact be sampling each of them very differently. The one 10-occurrence word lucky enough to be in the 2nd rank position will be far more downsampled, as if it were far more frequent. And the 1st-rank "tall head" value, by having its true frequency *under-*approximated, will be less down-sampled than otherwise. Neither of those effects seem beneficial, or in the spirit of the frequent-word-downsampling option - which should only "thin out" very-frequent words, and in all cases leave words of the same frequency as each other in the original corpus roughly equivalently present to each other in the down-sampled corpus.
So for your case, I would go with the original formula (probability-of-discarding-that-requires-special-handling-of-negative-values), or the word2vec.c practical/inverted implementation (probability-of-keeping-that-saturates-at-1.0), rather than the keras-style approximation.
(As a totally-separate note that nonetheless may be relevant for your dataset/purposes, if you're using negative-sampling: there's another parameter controlling the relative sampling of negative examples, often fixed at 0.75 in early implementations, that one paper has suggested can usefully vary for non-natural-language token distributions & recommendation-related end-uses. This parameter is named ns_exponent in the Python gensim implementation, but simply a fixed power value internal to a sampling-table pre-calculation in the original word2vec.c code.)
I am trying to create a module with a mathematical class for Taylor series, to have it easily accessible for other projects. Hence I wish to optimize it as far as I can.
For those who are not too familiar with Taylor series, it will be a necessity to be able to differentiate a function in a point many times. Given that the normal definition of the mathematical derivative of a function will require immense precision for higher order derivatives, I've decided to use Cauchy's integral formula instead. With a little bit of work, I've managed to rearrange the formula a little bit, as you can see on this picture: Rearranged formula. This provided me with much more accurate results on higher order derivatives than the traditional definition of the derivative. Here is the function i am currently using to differentiate a function in a point:
def myDerivative(f, x, dTheta, degree):
riemannSum = 0
theta = 0
while theta < 2*np.pi:
functionArgument = np.complex128(x + np.exp(1j*theta))
secondFactor = np.complex128(np.exp(-1j * degree * theta))
riemannSum += f(functionArgument) * secondFactor * dTheta
theta += dTheta
return factorial(degree)/(2*np.pi) * riemannSum.real
I've tested this function in my main function with a carefully thought out mathematical function which I know the derivatives of, namely f(x) = sin(x).
def main():
print(myDerivative(f, 0, 2*np.pi/(4*4096), 16))
pass
These derivatives seems to freak out at around the derivative of degree 16. I've also tried to play around with dTheta, but with no luck. I would like to have higher orders as well, but I fear I've run into some kind of machine precission.
My question is in it's simplest form: What can I do to improve this function in order to get higher order of my derivatives?
I seem to have come up with a solution to the problem. I did this by rearranging Cauchy's integral formula in a different way, by exploiting that the initial contour integral can be an arbitrarily large circle around the point of differentiation. Be aware that it is very important that the function is analytic in the complex plane for this to be valid.
New formula
Also this gives a new function for differentiation:
def myDerivative(f, x, dTheta, degree, contourRadius):
riemannSum = 0
theta = 0
while theta < 2*np.pi:
functionArgument = np.complex128(x + contourRadius*np.exp(1j*theta))
secondFactor = (1/contourRadius)**degree*np.complex128(np.exp(-1j * degree * theta))
riemannSum += f(functionArgument) * secondFactor * dTheta
theta += dTheta
return factorial(degree) * riemannSum.real / (2*np.pi)
This gives me a very accurate differentiation of high orders. For instance I am able to differentiate f(x)=e^x 50 times without a problem.
Well, since you are working with a discrete approximation of the derivative (via dTheta), sooner or later you must run into trouble. I'm surprised you were able to get at least 15 accurate derivatives -- good work! But to get derivatives of all orders, either you have to put a limit on what you're willing to accept and say it's good enough, or else compute the derivatives symbolically. Take a look at Sympy for that. Sympy probably has some functions for computing Taylor series too.
I am doing a community website that requires me to calculate the similarity between any two users. Each user is described with the following attributes:
age, skin type (oily, dry), hair type (long, short, medium), lifestyle (active outdoor lover, TV junky) and others.
Can anyone tell me how to go about this problem or point me to some resources?
Another way of computing (in R) all the pairwise dissimilarities (distances) between observations in the data set. The original variables may be of mixed types. The handling of nominal, ordinal, and (a)symmetric binary data is achieved by using the general dissimilarity coefficient of Gower (Gower, J. C. (1971) A general coefficient of similarity and some of its properties, Biometrics 27, 857–874). For more check out this on page 47. If x contains any columns of these data-types, Gower's coefficient will be used as the metric.
For example
x1 <- factor(c(10, 12, 25, 14, 29))
x2 <- factor(c("oily", "dry", "dry", "dry", "oily"))
x3 <- factor(c("medium", "short", "medium", "medium", "long"))
x4 <- factor(c("active outdoor lover", "TV junky", "TV junky", "active outdoor lover", "TV junky"))
x <- cbind(x1,x2,x3,x4)
library(cluster)
daisy(x, metric = "euclidean")
you'll get :
Dissimilarities :
1 2 3 4
2 2.000000
3 3.316625 2.236068
4 2.236068 1.732051 1.414214
5 4.242641 3.741657 1.732051 2.645751
If you are interested on a method for dimensionality reduction for categorical data (also a way to arrange variables into homogeneous clusters) check this
Give each attribute an appropriate weight, and add the differences between values.
enum SkinType
Dry, Medium, Oily
enum HairLength
Bald, Short, Medium, Long
UserDifference(user1, user2)
total := 0
total += abs(user1.Age - user2.Age) * 0.1
total += abs((int)user1.Skin - (int)user2.Skin) * 0.5
total += abs((int)user1.Hair - (int)user2.Hair) * 0.8
# etc...
return total
If you really need similarity instead of difference, use 1 / UserDifference(a, b)
You probably should take a look for
Data Mining and Data Warehousing (Essential)
Machine Learning (Extra)
Artificial Neural Networks (Especially SOM)
Pattern Recognition (Related)
These topics will let you your program recognize similarities and clusters in your users collection and try to adapt to them...
You can then know different hidden common groups of related users... (i.e users with green hair usually do not like watching TV..)
As an advice, try to use ready implemented tools for this feature instead of implementing it yourself...
Take a look at Open Directory Data Mining Projects
Three steps to achieve a simple subjective metric for difference between two datapoints that might work fine in your case:
Capture all your variables in a representative numeric variable, for example: skin type (oily=-1, dry=1), hair type (long=2, short=0, medium=1),lifestyle (active outdoor lover=1, TV junky=-1), age is a number.
Scale all numeric ranges so that they fit the relative importance you give them for indicating difference. For example: An age difference of 10 years is about as different as the difference between long and medium hair, and the difference between oily and dry skin. So 10 on the age scale is as different as 1 on the hair scale is as different as 2 on the skin scale, so scale the difference in age by 0.1, that in hair by 1 and and that in skin by 0.5
Use an appropriate distance metric to combine the differences between two people on the various scales in one overal difference. The smaller this number, the more similar they are. I'd suggest simple quadratic difference as a first attempt at your distance function.
Then the difference between two people could be calculated with (I assume Person.age, .skin, .hair, etc. have already gone through step 1 and are numeric):
double Difference(Person p1, Person p2) {
double agescale=0.1;
double skinscale=0.5;
double hairscale=1;
double lifestylescale=1;
double agediff = (p1.age-p2.age)*agescale;
double skindiff = (p1.skin-p2.skin)*skinscale;
double hairdiff = (p1.hair-p2.hair)*hairscale;
double lifestylediff = (p1.lifestyle-p2.lifestyle)*lifestylescale;
double diff = sqrt(agediff^2 + skindiff^2 + hairdiff^2 + lifestylediff^2);
return diff;
}
Note that diff in this example is not on a nice scale like (0..1). It's value can range from 0 (no difference) to something large (high difference). Also, this method is almost completely unscientific, it is just designed to quickly give you a working difference metric.
Look at algorithms for computing srting difference. Its very similar to what you need. Store your attributes as a bit string and compute the distance between the strings
You should read these two topics.
Most popular clustering algorithm k - means
And similarity matrix are essential in clustering