I am currently working on a project where I need to find the stability of multiple binary sequences of the same length.
samples:
[1,1,1,1,1,1] and [0,0,0,0,0,0] are stable
[1,0,0,1,1,0] is comparatively less stable
[1,0,1,0,1,0] is least stable
How can I express this mathematically as a score, so that the sequences can be compared against each other and ranked accordingly?
Based on your sample evaluation, you can probably create a reasonable score by counting how often the bit value changes from one element to the next, normalized by the length.
E.g. something like 1/(n-1) * sum(abs(c[i] - c[i+1])) as a measure of the instability, ranging from 0 (stable) to 1 (least stable, all bits alternate).
If you want the value 1 to be the most stable, use 1 - 1/(n-1) * .... You may also want to define a value for lengths 0 and 1 according to your preference.
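The score above can be sketched in a few lines of Python (the function name and the length-0/1 convention are my own choices):

```python
def instability(bits):
    """Fraction of adjacent pairs whose values differ: 0 = stable, 1 = fully alternating."""
    n = len(bits)
    if n < 2:
        return 0.0  # convention for lengths 0 and 1; adjust to taste
    return sum(abs(a - b) for a, b in zip(bits, bits[1:])) / (n - 1)

print(instability([1, 1, 1, 1, 1, 1]))  # 0.0
print(instability([1, 0, 0, 1, 1, 0]))  # 0.6
print(instability([1, 0, 1, 0, 1, 0]))  # 1.0
```

Sorting sequences by this value gives the ranking from the samples: the all-equal sequences come first, the alternating one last.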
According to https://spark.apache.org/docs/2.3.0/ml-features.html#tf-idf:
"HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3."
...
"Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly to the vector indices."
I tried to understand why using a power of two as the feature dimension maps words evenly, and tried to find some helpful documentation on the internet, but both attempts were unsuccessful.
Does somebody know, or have useful sources on, why using a power of two maps words evenly to vector indices?
The output of a hash function is b-bit, i.e., there are 2^b possible values to which a feature can be hashed. Additionally, we assume that the 2^b possible values appear uniformly at random.
If d is the feature dimension, an index for a feature f is determined as hash(f) MOD d. Again, hash(f) takes on 2^b possible values. It is easy to see that d has to be a power of two (i.e., a divisor of 2^b) itself in order for uniformity to be maintained.
For a counter-example, consider a 2-bit hash function and a 3-dimensional feature space. As per our assumptions, the hash function outputs 0, 1, 2, or 3 with probability 1/4 each. However, taking mod 3 results in 0 with probability 1/2, and in 1 or 2 with probability 1/4 each. Therefore, uniformity is not maintained. On the other hand, if the feature space were 2-dimensional, the result would be 0 or 1 with probability 1/2 each.
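The counter-example is easy to check by brute force (a quick Python sketch of the 2-bit case):

```python
from collections import Counter

# A uniform 2-bit hash takes each of the values 0..3 equally often.
hash_values = range(4)

# mod 3 (not a power of two): index 0 occurs twice as often as 1 or 2
print(Counter(h % 3 for h in hash_values))  # Counter({0: 2, 1: 1, 2: 1})

# mod 2 (a power of two): each index occurs equally often
print(Counter(h % 2 for h in hash_values))  # Counter({0: 2, 1: 2})
```

The same counting argument applies for any b and any d that does not divide 2^b.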
I have many measurements of the same person's age. Let's say:
[23 25 32 23 25]
I would like to output a single value and a reliability score for that value. The single value can be the average.
I don't know how to calculate the reliability. The value should be between 0 and 1, where 1 means all ages are equal, and a very unreliable measurement should be near 0.
The variance should probably be used here, but it's not clear to me how to normalize it between 0 and 1 in a meaningful way (1/(x+1) is not very meaningful :)).
Assume some probability distribution (or determine which probability distribution fits your data most accurately). A good choice is a normal distribution, which for discrete data requires a continuity correction. See the example here: http://www.milefoot.com/math/stat/pdfc-normaldisc.htm
In your example, the reliability score for the average age of 26 (25.6 rounded to the nearest integer) is simply the probability that X falls in the range (25.5, 26.5).
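A minimal sketch of this idea in Python, assuming a normal fit and using the standard library's statistics.NormalDist (variable names are mine):

```python
from statistics import NormalDist, mean, stdev

ages = [23, 25, 32, 23, 25]
dist = NormalDist(mean(ages), stdev(ages))  # fit a normal distribution to the data

# Reliability of reporting 26 (25.6 rounded): P(25.5 < X < 26.5),
# i.e. the continuity-corrected probability of the rounded value.
reliability = dist.cdf(26.5) - dist.cdf(25.5)
print(round(reliability, 3))  # 0.106
```

If all measurements were equal, the fitted spread collapses and the score approaches 1; widely scattered ages push it toward 0, matching the requested 0..1 scale.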
The easiest way to assess reliability (or internal consistency) is Cronbach's alpha. I guess most statistics software has this method built in.
https://en.wikipedia.org/wiki/Cronbach%27s_alpha
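Cronbach's alpha expects several measurements (items) per subject rather than a single list. A minimal sketch of its textbook definition in plain Python; the two repeated measurement series below are hypothetical:

```python
from statistics import variance

def cronbach_alpha(items):
    """items: equal-length score lists, one per measurement/item.
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    return k / (k - 1) * (1 - sum(variance(i) for i in items) / variance(totals))

# Two hypothetical repeated age measurements over the same three subjects
print(cronbach_alpha([[23, 25, 32], [24, 25, 31]]))  # ≈ 0.986
```

Perfectly consistent measurements give an alpha of 1; uncorrelated ones drive it toward 0, which matches the desired 0..1 scale.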
I have two sets containing citation counts for some publications, one of which is a subset of the other. That is, the subset contains some of the exact citation counts appearing in the other set. E.g.:
Set1 Set2 (Subset)
50 50
24 24
12 -
5 5
4 4
43 43
2 -
2 -
1 -
1 -
So I want to decide whether the numbers in the subset are good enough to represent Set1. On this matter:
I intended to apply Student's t-test, but I was not sure how to apply it. The sets are dependent, so I could not use an unpaired t-test, which requires the two samples to come from independent populations. On the other hand, a paired t-test also does not look suitable, since it requires equal sample sizes.
If there is an outlier, should I remove it? That does not seem logical to me, since it is not really an outlier: a publication that is cited a great deal still belongs to the same sample. How should I deal with such cases? If I do not remove it, it makes the variance very large, which affects the statistical tests. Would it be a good idea to use the median instead of the mean, since citation distributions generally tend to be highly skewed?
How could I remedy this issue?
I've been asked to guess the user's intention when part of the expected data is missing. For example, if I'm looking to get "very well" or "not very well" but I only get "not", then I should flag it as "not very well".
The Levenshtein distance between "not" and "very well" is 9, and the distance between "not" and "not very well" is 10. I think I'm actually trying to drive a screw with a wrench, but we have already agreed in our team to use Levenshtein for this case.
Given the problem above, is there any way I can make some sense out of it by changing the insertion, replacement and deletion costs?
P.S. I'm not looking for a hack for this particular example. I want something that generally works as expected and outputs a better result in these cases also.
With replacements costing 2 (one deletion plus one insertion), the Levenshtein distance for not and very well is actually 12. The alignment is:
------not
very well
So there are 6 insertions with a total cost of 6 (cost 1 for each insertion), and 3 replacements with a total cost of 6 (cost 2 for each replacement). The total cost is 12.
The Levenshtein distance for not and not very well is 10. The alignment is:
not----------
not very well
This includes only 10 insertions, so not very well has the lower cost and you can choose it as the best match.
The cost and alignment can be computed with the htql package for Python:
import htql
a=htql.Align()
a.align('not', 'very well')
# (12.0, ['------not', 'very well'])
a.align('not', 'not very well')
# (10.0, ['not----------', 'not very well'])
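If you prefer not to depend on htql, the same costs can be reproduced with a plain dynamic-programming sketch (the default costs here are chosen to match the alignments above; with sub=1 you get the questioner's original 9 and 10):

```python
def weighted_levenshtein(s, t, ins=1, dele=1, sub=2):
    """Edit distance with configurable insertion/deletion/substitution costs."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele          # delete everything from s
    for j in range(1, n + 1):
        d[0][j] = j * ins           # insert everything from t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,      # deletion
                          d[i][j - 1] + ins,       # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[m][n]

print(weighted_levenshtein('not', 'very well'))      # 12
print(weighted_levenshtein('not', 'not very well'))  # 10
```

Raising the substitution cost relative to insertions is exactly what makes "not very well" (pure insertions) win over "very well" (which forces substitutions).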
I need a random number generator that picks numbers over a specified range with a programmable mean.
For example, I need to pick numbers between 2 and 14 and I need the average of the random numbers to be 5.
I use random number generators a lot. Usually I just need a uniform distribution.
I don't even know what to call this type of distribution.
Thank you for any assistance or insight you can provide.
You might be able to use a binomial distribution, if you're happy with the shape of that distribution. Set n=12 and p=0.25. This will give you a value between 0 and 12 with a mean of 3. Just add 2 to each result to get the range and mean you are looking for.
Edit: As for implementation, you can probably find a library for your chosen language that supports non-uniform distributions (I've written one myself for Java).
A binomial distribution can be approximated fairly easily using a uniform RNG. Simply perform n trials and record the number of successes. So if you have n=10 and p=0.5, it's just like flipping a coin 10 times in a row and counting the number of heads. For p=0.25 just generate uniformly-distributed values between 0 and 3 and only count zeros as successes.
If you want a more efficient implementation, there is a clever algorithm hidden away in the exercises of volume 2 of Knuth's The Art of Computer Programming.
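The trial-counting approach can be sketched like this (the helper name and the sample count are my own; this is the simple method, not Knuth's faster one):

```python
import random

def binomial_sample(n, p, rng=random):
    """Count successes in n Bernoulli trials driven by a uniform RNG."""
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(42)
# Shifted binomial: range 2..14, mean 2 + 12 * 0.25 = 5
samples = [2 + binomial_sample(12, 0.25, rng) for _ in range(100_000)]
print(min(samples), max(samples))
print(sum(samples) / len(samples))  # close to 5
```

Each call costs n uniform draws, which is fine for small n; for large n the algorithm Knuth describes avoids the per-trial loop.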
You haven't said what distribution you are after. Regarding your specific example, a function which produced a uniform distribution between 2 and 8 would satisfy your requirements, strictly as you have written them :)
If you want a non-uniform distribution of the random number, then you might have to implement some sort of mapping, e.g.:
// returns a number between 0..5 with a custom distribution
int MyCustomDistribution()
{
    int r = rand() % 100; // random number between 0..99
    if (r < 10) return 1;
    if (r < 30) return 2;
    if (r < 42) return 3;
    ...
}
Based on the Wikipedia sub-article about non-uniform generators, it would seem you want to apply the output of a uniform pseudorandom number generator to an area distribution that meets the desired mean.
You can create a non-uniform PRNG from a uniform one. This makes sense, as you can imagine taking a uniform PRNG that returns 0,1,2 and create a new, non-uniform PRNG by returning 0 for values 0,1 and 1 for the value 2.
There is more to it if you want specific characteristics on the distribution of your new, non-uniform PRNG. This is covered on the Wikipedia page on PRNGs, and the Ziggurat algorithm is specifically mentioned.
With those clues you should be able to search up some code.
My first idea would be:
generate numbers in the range 0..1
scale to the range -9..9 ((x - 0.5) * 18)
shift the range by 5 to get -4..14 (add 5)
truncate the range to 2..14 (discard numbers < 2)
That should give you numbers in the range you want.
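A quick sketch of these steps as a rejection loop (names are mine). One caveat worth noting: discarding values below 2 leaves a uniform distribution on 2..14, so this hits the requested range but its mean is 8 rather than 5.

```python
import random

def truncated_sample(rng=random):
    """Uniform 0..1, scaled to -9..9, shifted by 5, values below 2 rejected."""
    while True:
        x = (rng.random() - 0.5) * 18 + 5  # uniform on -4..14
        if x >= 2:
            return x

rng = random.Random(0)
xs = [truncated_sample(rng) for _ in range(10_000)]
print(min(xs), max(xs))  # within 2..14
```

About a third of the draws (the -4..2 part of the range) are rejected, so on average each sample costs roughly 1.5 uniform draws.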
You need a distributed / weighted random number generator. Here's a reference to get you started.
Assign all numbers equal probabilities
while currentAverage is not equal to intendedAverage (within some margin)
    pickedNumber = pick one of the possible numbers (at random, uniform probability; if you pick intendedAverage, pick again)
    if (pickedNumber > intendedAverage and currentAverage < intendedAverage) or (pickedNumber < intendedAverage and currentAverage > intendedAverage)
        increase pickedNumber's probability by delta at the expense of all others, conserving sum = 100%
    else
        decrease pickedNumber's probability by delta to the benefit of all others, conserving sum = 100%
    end if
    delta = 0.98 * delta (the rate of decrease of delta should probably be experimented with)
end while
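A rough Python sketch of this pseudocode. The margin, delta and decay rate are guesses (the pseudocode leaves them open), and an iteration cap is added because convergence is not guaranteed:

```python
import random

def adaptive_picks(numbers, intended_avg, margin=0.05, delta=0.02,
                   max_picks=50_000, rng=random):
    """Pick numbers while adapting their weights until the running average
    lands within `margin` of `intended_avg` (or the pick budget runs out)."""
    weights = [1.0 / len(numbers)] * len(numbers)
    picks, total = [], 0.0
    while len(picks) < max_picks:
        i = rng.choices(range(len(numbers)), weights=weights)[0]
        n = numbers[i]
        if n == intended_avg:
            continue  # per the pseudocode: pick again
        picks.append(n)
        total += n
        avg = total / len(picks)
        if abs(avg - intended_avg) <= margin:
            break
        if (n > intended_avg) == (avg < intended_avg):
            weights[i] += delta                      # favor picks that pull the average toward the target
        else:
            weights[i] = max(weights[i] - delta, 1e-9)
        s = sum(weights)
        weights = [w / s for w in weights]           # conserve sum = 100%
        delta *= 0.98
    return picks

picks = adaptive_picks(list(range(2, 15)), 5, rng=random.Random(1))
print(sum(picks) / len(picks))
```

Note that this shapes the *running average of the whole stream*, not the distribution of individual picks, so early picks are closer to uniform than later ones.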