Why can Hamming codes only detect 2 errors? - hamming-distance

I understand that the simplified reason is that the minimum Hamming distance is 3. But I do not understand why this is the case. Why can't we have a bigger Hamming distance, even for very large codes?
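For concreteness, here is a small sketch that builds all 16 codewords of the (7,4) Hamming code and checks that the minimum pairwise distance is 3; the textbook generator matrix below is an assumption, since the question does not name a particular construction.

import itertools

# Generator matrix of the (7,4) Hamming code: 4 data bits, 3 parity bits.
G = [
    [1, 0, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1],
]

def encode(msg):
    # Multiply the 4-bit message by G over GF(2).
    return tuple(sum(m * g for m, g in zip(msg, col)) % 2 for col in zip(*G))

codewords = [encode(m) for m in itertools.product([0, 1], repeat=4)]
min_dist = min(sum(a != b for a, b in zip(c, d))
               for c, d in itertools.combinations(codewords, 2))
print(min_dist)  # 3 -> any 2 bit flips are detectable, but not always 3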

Related

Can a 2 sample statistical comparison have too large of a population size to be accurate?

I'm trying to do a simple comparison of two samples to determine if their means are different. Regardless of whether their standard deviations are equal/unequal, the formulas for a t-test or z-test are similar.
(I can't post images on a new account, so the formulas are linked and written out here.)
t-value w/ unequal variances: t = (X1 - X2) / sqrt(S1^2/N1 + S2^2/N2)
https://www.biologyforlife.com/uploads/2/2/3/9/22392738/949234_orig.jpg
t-value w/ equal/pooled variances: t = (X1 - X2) / (Sp * sqrt(1/N1 + 1/N2)), with Sp^2 = ((N1-1)*S1^2 + (N2-1)*S2^2) / (N1 + N2 - 2)
https://vitalflux.com/wp-content/uploads/2022/01/pooled-t-statistics-300x126.jpg
The issue here is that the sample sizes sit (inverted, under a square root) in the denominator, which makes large samples produce seemingly massive t-values.
For instance, I have 2 samples with
sizes: N1 = 168,000 and N2 = 705,000
means: X1 = 89 and X2 = 49
std devs: S1 = 96 and S2 = 66.
At first glance, these standard deviations are larger than the means and suggest nonhomogeneous samples with a lot of internal variation. When comparing the two samples, however, the denominator of the t-test comes out to approximately 0.25, meaning that a 1-unit difference in means counts as 4 standard errors. My t-value here therefore comes out to around 160(!!)
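As a sanity check, here is the unequal-variance (Welch) arithmetic with the numbers above in a few lines of Python:

import math

n1, n2 = 168_000, 705_000   # sample sizes
x1, x2 = 89, 49             # means
s1, s2 = 96, 66             # standard deviations

se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # denominator of the unequal-variance t
t = (x1 - x2) / se
print(se, t)                             # roughly 0.25 and 160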
All this to say, I'm just plugging in numbers since I didn't do many of these problems in advanced stats and haven't seen this formula since Stats110.
It makes some sense that two massive populations need their variance biased downward before comparing, but this doesn't seem like the best test for data at this scale.
What other tests are out there that I could try? What is the logic behind this seemingly over-biased variance?

Geometric Series - partial sum (processing efficiency)

So here is my situation: I have to solve a math problem on the server side and could expect tens of thousands of requests per second, so I'm trying to find the most efficient way to solve it.
The client will submit some number, let's call it A, and I need to determine the base of the exponent in a geometric series (see below) such that the result is as close to A as possible without exceeding it.
The problem is that in the real world, each term of the geometric series is rounded, so the standard math doesn't apply directly:
round(x^1) + round(x^2) + round(x^3) + ... + round(x^n).
I can use the partial sum of geometric series equation to find some rough upper and lower limits using:
((x)^(n+1)-1)/((x)-1)
So say x = 2 is a lower limit and x = 2.03 is an upper limit... and the value I'm solving for is x = 2.02392372838123.
So far the only solution I have found is a recursive function that tests decimal places one at a time until I find the number, but the load on the server is too high at the volume of requests I expect. (I am using node.js.)
Does anyone have any thoughts or suggestions on a more efficient way to solve this? Again, the only reason I can't solve this with math alone (to the best of my skill) is the real-world rounding of the numbers in the sum.
Thanks.
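For what it's worth, here is a minimal sketch of one approach (in Python, purely as pseudocode for the node.js server): evaluate the rounded sum directly and bisect between the rough lower and upper limits already obtained from the partial-sum formula. It relies on the rounded sum being non-decreasing in x, which follows from each round(x^k) being non-decreasing.

def rounded_geo_sum(x, n):
    # round(x^1) + round(x^2) + ... + round(x^n), as in the question
    return sum(round(x ** k) for k in range(1, n + 1))

def solve_base(a, n, lo, hi, tol=1e-12):
    # Bisect between the rough limits (e.g. lo=2, hi=2.03 from the partial-sum
    # bound) for the largest x whose rounded sum does not exceed a.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if rounded_geo_sum(mid, n) <= a:
            lo = mid
        else:
            hi = mid
    return lo

Each step halves the interval, so reaching around 12 decimal places takes roughly 40 evaluations of the sum, rather than trying up to ten candidates per decimal digit.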

Which Multivariate Statistical Test / Algorithm for Testing Statistical Significance

I'm looking for a mathematical algorithm to prove significance in multivariate testing.
E.g., take a website test with 3 headlines, 2 images, and 2 buttons. This results in 3 x 2 x 2 = 12 variations:
h1-i1-b1, h2-i1-b1, h3-i1-b1,
h1-i2-b1, h2-i2-b1, h3-i2-b1,
h1-i1-b2, h2-i1-b2, h3-i1-b2,
h1-i2-b2, h2-i2-b2, h3-i2-b2.
The hypothesis is that one variation is better than others.
I'd like to know with what significance one of the variations is the winner, and how long I have to wait until I can be sure I have a statistical winner, or at least have an indicator of how sure I can be that one variation is the winner.
So basically I'd like to get a probability for each variation telling me whether it is the winner or not. As the test runs longer, some variations should drop in probability and the winner's should increase.
Which algorithm would you use? What's the formula?
Are there any libs for this?
You can use a chi-square test. Your null hypothesis is that all outcomes are equally likely; when you plug in the measured counts for each of the 12 outcomes, you get out a number telling you the probability of getting a set of 12 counts as extreme (i.e. as far away from equally distributed) as this. If the probability is sufficiently small (typically < 5% or < 1%), you conclude that the null hypothesis was wrong.
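For example, a minimal sketch with scipy (the conversion counts below are made up purely for illustration):

from scipy.stats import chisquare

# Hypothetical observed counts for the 12 variations (h1-i1-b1, h2-i1-b1, ...)
counts = [120, 95, 101, 110, 98, 102, 130, 99, 97, 105, 94, 100]

# Null hypothesis: all 12 variations are equally likely.
# chisquare uses equal expected frequencies by default.
stat, p_value = chisquare(counts)
print(stat, p_value)  # reject the null if p_value < 0.05 (or 0.01)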

Obtaining the Standard Error of Weighted Data in SPSS

I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data is weighted, because each of the people who was surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS seems to think that the young man's database entries each represent 28000 measurements when they actually just represent one, and this makes SPSS think we have much more data than we actually do. As a result SPSS is giving very very low standard error estimates and very very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale the weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in SPSS Statistics is treated as a frequency weight.
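To see the effect outside SPSS, here is a small illustration (with made-up data) of why treating expansion weights as frequency weights collapses the standard error, and why rescaling them to average 1, as described in the question, restores a plausible value:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=200)            # hypothetical survey responses
w = rng.uniform(5_000, 30_000, size=200)    # population-expansion weights

def freq_weighted_se(x, w):
    # Standard error of the mean if each weight is taken as a frequency,
    # i.e. the "sample size" becomes sum(w).
    n = w.sum()
    mean = np.average(x, weights=w)
    var = np.sum(w * (x - mean) ** 2) / (n - 1)
    return np.sqrt(var / n)

print(freq_weighted_se(x, w))             # absurdly small: n looks like ~3.5 million
print(freq_weighted_se(x, w / w.mean()))  # plausible: n is back to 200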

Levenshtein cost settings

I've been asked to guess the user's intention when part of the expected data is missing. For example, if I'm looking to get very well or not very well but I get only not instead, then I should flag it as not very well.
The Levenshtein distance between not and very well is 9, and the distance between not and not very well is 10. I think I'm actually trying to drive a screw with a wrench, but we have already agreed as a team to use Levenshtein for this case.
Given the problem above, is there any way I can make some sense of it by changing the insertion, replacement and deletion costs?
P.S. I'm not looking for a hack for this particular example. I want something that generally works as expected and outputs a better result in these cases also.
With a replacement cost of 2 (one deletion plus one insertion), the distance for not and very well comes out to 12. The alignment is:
------not
very well
So there are 6 insertions with a total cost of 6 (cost 1 for each insertion), and 3 replacements with a total cost of 6 (cost 2 for each replacement). The total cost is 12.
The Levenshtein distance for not and not very well is 10. The alignment is:
not----------
not very well
This includes only 10 insertions. So you can choose not very well as the best match.
The cost and alignment can be computed with the htql package for Python:
import htql
a=htql.Align()
a.align('not', 'very well')
# (12.0, ['------not', 'very well'])
a.align('not', 'not very well')
# (10.0, ['not----------', 'not very well'])
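If you'd rather not depend on htql, a plain dynamic-programming version with tunable costs reproduces the same figures (this is a sketch, not the htql implementation):

def weighted_levenshtein(s, t, ins=1, dele=1, sub=2):
    # Edit distance with configurable insertion/deletion/substitution costs;
    # sub=2 treats a replacement as a deletion plus an insertion.
    prev = [j * ins for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * dele]
        for j in range(1, len(t) + 1):
            cur.append(min(
                prev[j] + dele,                                      # delete s[i-1]
                cur[j - 1] + ins,                                    # insert t[j-1]
                prev[j - 1] + (0 if s[i - 1] == t[j - 1] else sub),  # match or substitute
            ))
        prev = cur
    return prev[-1]

print(weighted_levenshtein('not', 'very well'))      # 12
print(weighted_levenshtein('not', 'not very well'))  # 10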
