How to compare the performance of 2 statistical samplers? - statistics

Say I have an orchard of 10 trees with a total of 1000 apples. Each tree in the orchard may have a different number of apples, or no apples at all. I have two bucket samplers: the first one is small, taking X% of the apples from a single tree at a time (i.e. per sample), and the second sampler is large, taking X% of the apples from two trees at a time. The X% of both samplers is identical, the sampling is without replacement (meaning that apples taken in a specific sample remain out), and both samplers must make exactly 50 samples. Also, I'm assuming that the two samplers do not operate on the same orchard, but that each of them has its own orchard, identical to the other one.
What I need is to compare the total amount of apples that each sampler yields after 50 samples. Naturally, if X=100%, both samplers will yield all the apples in the orchard and their performance will be the same; but how do I calculate the difference in their performance as a function of X% ?

I don't have a solution for you, but I suspect there is some lack of information or I misunderstand something. I will describe the problems below using X=100 as an example.
N = 10 is the total population size of trees.
BS1 collects X% of the apples from 50 trees (one tree per sample over 50 samples).
BS2 collects X% of the apples from 100 trees (two trees per sample over 50 samples).
Let i denote the i-th tree for i = 1, ..., 10, and let y_i denote the number of apples on tree i. Let's assume each tree has a fixed, unknown number of apples. Of course, you say there are 1000 apples in total, meaning that once all apples from 9 of the trees are sampled we know the number of apples on the 10th tree - but I will ignore this and assume y_i is completely unknown.
If X=100% and we sample once
BS1 picks a random(?) tree to collect from with some probability - let's say 1/10 for each tree. At the same time, BS2 does the same for two trees. Assume BS2 and BS1 cannot pick the same tree, and BS1 always picks first.
After the first sample, BS1 has picked y_i apples and BS2 has picked y_j + y_l apples, for i, j, l in 1, ..., 10 with i, j, l all distinct.
After the third sample only one tree will be left with apples.
Since BS1 picks first and we sample without replacement, BS1 will pick 3+1 = 4 trees (the extra one because it picks first in the last round) and BS2 will pick 6 trees, so BS2 always ends up with more than half of the trees. The number of apples picked then depends on how the apples are distributed over the trees.
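Since an exact answer depends on assumptions the question leaves open (how trees are chosen, whether a tree can be revisited), a quick Monte Carlo sketch may be more useful than a formula. The R code below is only an illustration under my own assumptions: each sample picks one (or two) trees uniformly at random, removes X% of the apples currently on them, each sampler works on its own copy of the orchard, and the orchard layout is generated at random.

# Hypothetical simulation, not the asker's exact setup: each sample picks
# trees uniformly at random and removes X% of the apples currently on them.
simulate_yields <- function(X, n_samples = 50, n_trees = 10, total = 1000) {
  apples <- as.numeric(rmultinom(1, total, rep(1 / n_trees, n_trees)))  # random orchard
  orchard1 <- apples; orchard2 <- apples   # each sampler gets its own identical orchard
  yield1 <- 0; yield2 <- 0
  for (s in seq_len(n_samples)) {
    i <- sample(n_trees, 1)                # BS1: one tree per sample
    yield1 <- yield1 + X * orchard1[i]
    orchard1[i] <- (1 - X) * orchard1[i]
    j <- sample(n_trees, 2)                # BS2: two trees per sample
    yield2 <- yield2 + X * sum(orchard2[j])
    orchard2[j] <- (1 - X) * orchard2[j]
  }
  c(BS1 = yield1, BS2 = yield2)
}
set.seed(1)
rowMeans(replicate(1000, simulate_yields(X = 0.10)))  # average yields for X = 10%

Sweeping X from 0 to 1 then traces out the difference in expected yield between the two samplers under these assumptions.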

Related

Estimate standard error from sample means

Random samples of 143 girls and 127 boys were selected from a large population. A measurement was taken of the haemoglobin level (measured in g/dl) of each child, with the following results:
girls: n = 143, mean = 11.35, sd = 1.41
boys: n = 127, mean = 11.01, sd = 1.32
Estimate the standard error of the difference between the sample means.
In essence, we'd pool the standard errors by adding the variances of the two sample means. This amounts to answering the question: what is the variation of the sampling distribution of the difference, considering both samples?
SE = sqrt( sd₁²/n₁ + sd₂²/n₂ )
SE = sqrt( 1.41²/143 + 1.32²/127 ) ≈ 0.1662
Notice that the standard deviation squared is simply the variance of each sample. As you can see, in our case the value is quite small, which indicates that the difference between sample means does not need to be very large for it to be larger than we would expect by chance.
We'd calculate the difference between means as 0.34 (or -0.34, depending on the nature of the question) and divide this difference by the standard error to get a t-value. In our case, 2.046 (or -2.046) indicates that the observed difference is 2.046 times larger than the average difference we would expect given the variation that we measured AND the size of our samples.
However, we need to verify whether this observation is statistically significant by determining the t-critical value. This t-critical value can easily be found in a t-value table: one needs to know alpha (typically 0.05 unless otherwise stated) and the original alternative hypothesis (if it was something along the lines of "there is a difference between genders", we would apply a two-tailed test; if it was something along the lines of "gender X has a haemoglobin level larger/smaller than gender Y", we would use a one-tailed test).
If the t-value > t-critical, we would claim that the difference between means is statistically significant, thereby having sufficient evidence to reject the null hypothesis. Alternatively, if the t-value < t-critical, we would not have statistically significant evidence against the null hypothesis, and thus we would fail to reject it.
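For completeness, here is the same arithmetic in R; the Welch degrees-of-freedom approximation at the end is my own addition, not part of the original answer.

n1 <- 143; m1 <- 11.35; s1 <- 1.41      # girls
n2 <- 127; m2 <- 11.01; s2 <- 1.32      # boys
se <- sqrt(s1^2 / n1 + s2^2 / n2)       # standard error of the difference, ~0.1662
tval <- (m1 - m2) / se                  # ~2.046
# two-tailed p-value using the Welch-Satterthwaite degrees of freedom
df <- (s1^2/n1 + s2^2/n2)^2 / ((s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1))
2 * pt(-abs(tval), df)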

Differences between Wallace Tree and Dadda Multipliers

Could anyone explain the difference in the partial-product reduction method or mechanism between Wallace and Dadda multipliers?
I have been reading A_comparison_of_Dadda_and_Wallace_multiplier_delays.pdf
Both are very similar. Instead of the traditional row-based algorithm, they both aim to implement a multiplication A*B by (1) ANDing A with each bit b_i of B, (2) counting/reducing the bits in every column until there are only two rows, and (3) performing the final addition with a fast adder.
I worked on a Dadda multiplier, but this was many, many years ago, and I am not sure I remember all the details. To the best of my knowledge, the main differences are in the counting process.
Wallace introduced the "Wallace tree" structure (which is still useful in some designs). Given n bits, it counts the number of bits at 1 in this set. An (n,m) Wallace tree (where m = ceil(log_2 n)) gives the number of bits at 1 among the n inputs and outputs the result on m bits. This is essentially a combinational counter. For instance, below is the schematic of a (7,3) Wallace tree made with full adders (which are (3,2) Wallace trees).
As you can see, this tree generates results of logical weight 2^0, 2^1 and 2^2, if input bits are of weight 2^0.
This allows a fast reduction in the height of the columns, but can be somewhat inefficient in terms of gate count.
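Since the schematic does not reproduce here, the following small behavioural sketch in R shows one possible way to build a (7,3) counter out of four (3,2) full adders; it illustrates the idea and is not necessarily the exact wiring of the original figure.

# Behavioural sketch of a (7,3) counter built from four (3,2) full adders.
counter_7_3 <- function(bits) {                 # bits: 7 values in {0, 1}
  stopifnot(length(bits) == 7)
  fa <- function(a, b, c) c((a + b + c) %% 2, (a + b + c) %/% 2)  # sum bit, carry bit
  r1 <- fa(bits[1], bits[2], bits[3])           # first level: two full adders
  r2 <- fa(bits[4], bits[5], bits[6])
  r3 <- fa(r1[1], r2[1], bits[7])               # combine the weight-2^0 bits
  r4 <- fa(r1[2], r2[2], r3[2])                 # combine the weight-2^1 carries
  c(w1 = r3[1], w2 = r4[1], w4 = r4[2])         # count = w1 + 2*w2 + 4*w4
}
counter_7_3(c(1, 1, 0, 1, 1, 1, 0))             # 5 ones -> w1 = 1, w2 = 0, w4 = 1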
Luigi Dadda does not use such an aggressive reduction strategy and tries to keep the column heights more balanced. Only full adders (or half adders) are used, and every counting/reduction step only generates bits of weight 2^0 and 2^1. The reduction process is less efficient (as can be seen by the larger number of rows in your figure), but the gate count is lower. Dadda's strategy was also supposed to be slightly less time-efficient, but according to the linked paper, which I did not know, that is not true.
The main interest of Wallace/Dadda multipliers is that they can achieve a multiplication with ~log n time complexity, which is much better than the traditional O(n) array multiplier with carry-save adders. But, despite this theoretical advantage, they are not really used any longer. Present architectures are more concerned with throughput than latency and prefer to use simpler array structures that can be efficiently pipelined. Implementing a Wallace/Dadda structure is a real nightmare beyond a few bits, and adding pipelining to them is very complex due to their irregular structure.
Note that other multiplier designs also achieve log n time complexity with a more regular and implementable divide-and-conquer strategy, for instance the Luk-Vuillemin multiplier.

Flipping a three-sided coin

I have two related questions on population statistics. I'm not a statistician, but would appreciate pointers to learn more.
I have a process that results from flipping a three-sided coin (results: A, B, C), and I compute the statistic t = (A-C)/(A+B+C). In my problem, I have a set that randomly divides itself into sets X and Y, maybe uniformly, maybe not. I compute t for X and Y. I want to know whether the difference I observe in those two t values is likely due to chance or not.
Now if this were a simple binomial distribution (i.e., I'm just counting who ends up in X or Y), I'd know what to do: I compute n = |X|+|Y|, σ = sqrt(np(1-p)) (and I assume p = 0.5), and then I compare to the normal distribution. So, for example, if I observed |X| = 45 and |Y| = 55, I'd say σ = 5, and so I expect a deviation at most this large from the mean μ = 50 by chance 68.27% of the time. Alternately, I expect a greater deviation from the mean 31.73% of the time.
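For reference, that back-of-the-envelope calculation in R (normal approximation, ignoring the continuity correction):

n <- 100; p <- 0.5
sigma <- sqrt(n * p * (1 - p))           # 5
2 * pnorm(45, mean = n * p, sd = sigma)  # ~0.3173: chance of a deviation this large or larger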
There's an intermediate problem, which also interests me and which I think may help me understand the main problem, where I measure some property of the members of A and B. Let's say 25% in A measure positive and 66% in B measure positive. (A and B aren't the same cardinality -- the selection process isn't uniform.) I would like to know whether I should expect this difference by chance.
As a first draft, I computed t as though it were measuring coin flips, but I'm pretty sure that's not actually right.
Any pointers on what the correct way to model this is?
First problem
For the three-sided coin problem, have a look at the multinomial distribution. It's the distribution to use for a "binomial"-type problem with more than two outcomes.
Here is the example from Wikipedia (https://en.wikipedia.org/wiki/Multinomial_distribution):
Suppose that in a three-way election for a large country, candidate A received 20% of the votes, candidate B received 30% of the votes, and candidate C received 50% of the votes. If six voters are selected randomly, what is the probability that there will be exactly one supporter for candidate A, two supporters for candidate B and three supporters for candidate C in the sample?
Note: Since we’re assuming that the voting population is large, it is reasonable and permissible to think of the probabilities as unchanging once a voter is selected for the sample. Technically speaking this is sampling without replacement, so the correct distribution is the multivariate hypergeometric distribution, but the distributions converge as the population grows large.
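In R that example is a one-liner (dmultinom is the multinomial probability mass function):

# P(1 supporter of A, 2 of B, 3 of C) with vote shares 0.2, 0.3, 0.5
dmultinom(c(1, 2, 3), prob = c(0.2, 0.3, 0.5))   # 0.135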
Second problem
The second problem seems to be a problem for cross-tabs. Use the "chi-squared test for association" to test whether there is a significant association between your variables, and use the "standardized residuals" of your cross-tab to identify which of the associations is more likely to occur and which is less likely.
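A minimal sketch of that test in R, with made-up counts roughly matching the 25% / 66% figures from the question (the group sizes are assumptions):

# Hypothetical 2x2 cross-tab: positive/negative measurements in groups A and B
tab <- matrix(c(25, 75,    # A: 25 positive out of 100 (illustrative counts)
                66, 34),   # B: 66 positive out of 100
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              result = c("positive", "negative")))
test <- chisq.test(tab)    # chi-squared test for association
test
test$stdres                # standardized residuals: which cells deviate most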

How to interpret an unusual decision tree output (multi-classes) using rpart

I am trying to plot a decision tree using the rpart package and am really confused by its output. At the 3rd node, how can the agriculture and mining classes be produced from urban?
I think it should be agriculture and urban instead of agriculture and mining.
Here is my code
df <- read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/Landsat_Data.csv")
library(rpart)
library(rpart.plot)
set.seed(123)
dt <- rpart(Land_cover ~ ., data = df)   # fit the classification tree
rpart.plot(dt, cex = 0.35)               # plot it
Please help me understand it. Thank you.
The nodes display the relative frequencies of all response categories along with the majority vote, i.e., the most frequent category. In case there are ties, the first of those most frequent categories is displayed as the majority vote (which is a somewhat arbitrary selection, of course).
Therefore, in the root node all categories occur with an equal frequency of 20%, and "Agriculture" is displayed as the majority vote because it is lexicographically the first category.
Similarly, in node 3 (for Band1 >= 0.03599656), "Urban" and "Water" are tied for the most frequent category (200 observations = 24.969% each), and thus "Urban" is listed as the majority vote.
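If you want to check this directly from the fitted object rather than from the plot, the per-node class counts are stored in the frame of the rpart object (using the dt from your code):

dt$frame[, c("var", "n")]   # split variable and number of observations per node
head(dt$frame$yval2)        # per-node class counts and proportions (classification trees)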

Which Multivariate Statistical Test / Algorithm for Testing Statistical Significance

I'm looking for a mathematical algorithm to establish significance in multivariate testing.
E.g. let's take a website test with 3 headlines, 2 images, and 2 buttons. This results in 3 x 2 x 2 = 12 variations:
h1-i1-b1, h2-i1-b1, h3-i1-b1,
h1-i2-b1, h2-i2-b1, h3-i2-b1,
h1-i1-b2, h2-i1-b2, h3-i1-b2,
h1-i2-b2, h2-i2-b2, h3-i2-b2.
The hypothesis is that one variation is better than others.
I'd like to know with what significance one of the variations is the winner, and how long I have to wait until I can be sure that I statistically have a winner, or at least have an indicator of how sure I can be that one variation is the winner.
So basically I'd like to get a probability for each variation telling me whether it is the winner or not. As the test runs longer, some variations should drop in probability and the winner's probability should increase.
Which algorithm would you use? What's the formula?
Are there any libs for this?
You can use a chi-square test. Your null hypothesis is that all outcomes are equally likely; when you plug in the measured counts for each of the 12 outcomes, you get out a number telling you the probability of getting a set of 12 counts as extreme (i.e. as far away from equally distributed) as this. If the probability is sufficiently small (typically < 5% or < 1%), you conclude that the null hypothesis was wrong.
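As a minimal sketch in R, following that recipe with made-up conversion counts for the 12 variations (the numbers are purely illustrative):

# Goodness-of-fit chi-squared test: are conversions spread equally over the 12 variations?
conversions <- c(18, 25, 22, 30, 19, 24, 21, 27, 23, 20, 26, 45)   # hypothetical counts
chisq.test(conversions)     # default null: all 12 categories equally likely

Note that this compares the 12 conversion counts against a uniform split; if the variations receive different amounts of traffic, a 2 x 12 table of conversions vs. non-conversions passed to chisq.test would be the more appropriate layout.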
