I'm trying to compare 100,000 records on a local database (L) with 100,000 records on a remote database (R).
Basically I want to know if an element in L exists in R. To determine that, I have to make a request against the R for each L, which takes a long time (I know, there should be a better way, there isn't, that's the API I've got).
So I would like to test a small sample of L against R, and then infer with some level of confidence how many are present in the whole R. How many do I have to test to have a 99% confidence level?
If you test N records from your local database and all are in the remote database, you can estimate the probability of a local record not being in the remote database as being between 0 and 3/N. This is called the "rule of three" in statistics. I explain it here.
The only way to know that all records are in both databases is to test all of them. But if you test 100 records, for example, you can estimate that the proportion of records not in both databases is below 3%.
I would also suggest experimental design for estimating a proportion p.
Suppose that we are interested in estimating the proportion p of the elements in L that also exist in R and we would like to compute a 99% C.I. with a tolerance level (lvl) that is plus or minus 3%. A “conservative” estimate of the size of the random sample would be given by :
n = (Za/2)^2 / (4*lvl^2)
In R
CI<-.99
lvl<-.03
qnorm(1-(1-CI)/2,0,1)^2/(4*lvl^2)
[1] 1843.027
Check here for details
Is this a trick question? Its 99% right? After checking each one individually you'll know with 100% certainty whether or not its in the remote database, so if you want to check the whole database to 99% accuracy - you have to check 99% of the records (99,000).
Related
I'm writing a program that lets users run simulates on a subset of data, and as part of this process, the program allows a user to specify what sample size they want based on confidence level and confidence interval. Assuming a p value of .5 to maximum sample size, and given that I know the population size, I can calculate the sample size. For example, if I have:
Population = 54213
Confidence Level = .95
Confidence Interval = 8
I get Sample Size 150. I use the formula outlined here:
https://www.surveysystem.com/sample-size-formula.htm
What I have been asked to do is reverse the process, so that confidence interval is calculated using a given sample size and confidence level (and I know the population). I'm having a horrible time trying to reverse this equation and was wondering if there is a formula. More importantly, does this seem like an intelligent thing to do? Because this seems like a weird request to me.
I should mention (just to be clear) that the CI is estimated for the mean, not the population. In that case, if we assume the population is normally distributed and that we know the population standard deviation SD, then the CI is estimated as
From this formula you would also get your formula, where you are estimating n.
If the population SD is not known then you need to replace the z-value with a t-value.
I am stuck in a statistics assignment, and would really appreciate some qualified help.
We have been given a data set and are then asked to find the 10% with the lowest rate of profit, in order to decide what Profit rate is the maximum in order to be considered for a program.
the data has:
Mean = 3,61
St. dev. = 8,38
I am thinking that i need to find the 10th percentile, and if i run the percentile function in excel it returns -4,71.
However I tried to run the numbers by hand using the z-score.
where z = -1,28
z=(x-μ)/σ
Solving for x
x= μ + z σ
x=3,61+(-1,28*8,38)=-7,116
My question is which of the two methods is the right one? if any at all.
I am thoroughly confused at this point, hope someone has the time to help.
Thank you
This is the assignment btw:
"The Danish government introduces a program for economic growth and will
help the 10 percent of the rms with the lowest rate of prot. What rate
of prot is the maximum in order to be considered for the program given
the mean and standard deviation found above and assuming that the data
is normally distributed?"
The excel formula is giving the actual, empirical 10th percentile value of your sample
If the data you have includes all possible instances of whatever you’re trying to measure, then go ahead and use that.
If you’re sampling from a population and your sample size is small, use a t distribution or increase your sample size. If your sample size is healthy and your data are normally distributed, use z scores.
Short story is the different outcomes suggest the data you’ve supplied are not normally distributed.
I'm trying to come up with a good design for a nearest neighbor search application. This would be somewhat similar to this question:
Saving and incrementally updating nearest-neighbor model in R
In my case this would be in Python but the main point being the part that when new data comes, the model / index must be updated. I'm currently playing around with scikit-learn neighbors module but I'm not convinced it's a good fit.
The goal of the application:
User comes in with a query and then the n (probably will be fixed to 5) nearest neighbors in the existing data set will be shown. For this step such a search structure from sklearn would help but that would have to be regenerated when adding new records.Also this is a first ste that happens 1 per query and hence could be somewhat "slow" as in 2-3 seconds compared to "instantly".
Then the user can click on one of the records and see that records nearest neighbors and so forth. This means we are now within the exiting dataset and the NNs could be precomputed and stored in redis (for now 200k records but can be expanded to 10th or 100th of millions). This should be very fast to browse around.
But here I would face the same problem of how to update the precomputed data without having to do a full recomputation of the distance matrix especially since there will be very few new records (like 100 per week).
Does such a tool, method or algorithm exist for updatable NN searching?
EDIT April, 3rd:
As is indicated in many places KDTree or BallTree isn't really suited for high-dimensional data. I've realized that for a Proof-of-concept with a small data set of 200k records and 512 dimensions, brute force isn't much slower at all, roughly 550ms vs 750ms.
However for large data set in millions+, the question remains unsolved. I've looked at datasketch LSH Forest but it seems in my case this simply is not accurate enough or I'm using it wrong. Will ask a separate question regarding this.
You should look into FAISS and its IVFPQ method
What you can do there is create multiple indexes for every update and merge them with the old one
You could try out Milvus that supports adding and near real-time search of vectors.
Here are the benchmarks of Milvus.
nmslib supports adding new vectors. It's used by OpenSearch as part their Similarity Search Engine, and it's very fast.
One caveat:
While the HNSW algorithm allows incremental addition of points, it forbids deletion and modification of indexed points.
You can also look into solutions like Milvus or Vearch.
I'm trying to work out how to solve what seems like a simple problem, but I can't convince myself of the correct method.
I have time-series data that represents the pdf of a Power output (P), varying over time, also the cdf and quantile functions - f(P,t), F(P,t) and q(p,t). I need to find the pdf, cdf and quantile function for the Energy in a given time interval [t1,t2] from this data - say e(), E(), and qe().
Clearly energy is the integral of the power over [t1,t2], but how do I best calculate e, E and qe ?
My best guess is that since q(p,t) is a power, I should generate qe by integrating q over the time interval, and then calculate the other distributions from that.
Is it as simple as that, or do I need to get to grips with stochastic calculus ?
Additional details for clarification
The data we're getting is a time-series of 'black-box' forecasts for f(P), F(P),q(P) for each time t, where P is the instantaneous power and there will be around 100 forecasts for the interval I'd like to get the e(P) for. By 'Black-box' I mean that there will be a function I can call to evaluate f,F,q for P, but I don't know the underlying distribution.
The black-box functions are almost certainly interpolating output data from the model that produces the power forecasts, but we don't have access to that. I would guess that it won't be anything straightforward, since it comes from a chain of non-linear transformations. It's actually wind farm production forecasts: the wind speeds may be normally distributed, but multiple terrain and turbine transformations will change that.
Further clarification
(I've edited the original text to remove confusing variable names in the energy distribution functions.)
The forecasts will be provided as follows:
The interval [t1,t2] that we need e, E and qe for is sub-divided into 100 (say) sub-intervals k=1...100. For each k we are given a distinct f(P), call them f_k(P). We need to calculate the energy distributions for the interval from this set of f_k(P).
Thanks for the clarification. From what I can tell, you don't have enough information to solve this problem properly. Specifically, you need to have some estimate of the dependence of power from one time step to the next. The longer the time step, the less the dependence; if the steps are long enough, power might be approximately independent from one step to the next, which would be good news because that would simplify the analysis quite a bit. So, how long are the time steps? An hour? A minute? A day?
If the time steps are long enough to be independent, the distribution of energy is the distribution of 100 variables, which will be very nearly normally distributed by the central limit theorem. It's easy to work out the mean and variance of the total energy in this case.
Otherwise, the distribution will be some more complicated result. My guess is that the variance as estimated by the independent-steps approach will be too big -- the actual variance would be somewhat less, I believe.
From what you say, you don't have any information about temporal dependence. Maybe you can find or derive from some other source or sources an estimate the autocorrelation function -- I wouldn't be surprised if that question has already been studied for wind power. I also wouldn't be surprised if a general version of this problem has already been studied -- perhaps you can search for something like "distribution of a sum of autocorrelated variables." You might get some interest in that question on stats.stackexchange.com.
I hava 2000 points with 5000 dimensions , and I want to get the nearest neighbour.
Now I have some problems , could anybody give a answer.
People say , it works good with high dimensions. What's the time complexity ?
#param max_nn_chks search is cut off after examining this many tree entries
After I read the algorithm, I wonder if I would get the wrong answer when I set the max_nn_chks too low. If yes, then just tell me how to set this parameter, else give a reason, thanks.
Is the kdtree the best Data Structures for my data to get nearest neighbour?
The time complexity is basically the same as in restricted KD-Tree search plus some little time to maintain the priority queue. The restricted KD-Tree search algorithm needs to traverse the tree in its full depth (log2 of the point count) times the limit (maximum number of leaf nodes/points allowed to be visited).
Yes, you will get a wrong answer if the limit is too low. You can only measure fraction of true NN found versus number of leaf nodes searched. From this, you can determine your optimal value.
Usually a randomized kd-tree forest and hierarchical k-means tree perform best. FLANN provides a method to determine which algorithm to use (k-means vs randomized kd-tree forest) and sets the optimal parameters for you.
The structure of data also has a big impact. If you know there are clusters of points being close together, for example, you can group them in a single node of a tree (represent them by their centroid, for example) and speed up the search.
Another techniques such as visual words, PCA or random projections can be employed on the data. It's a quite active field of research.