I have two versions of the same question:
1. Given a list of numbers (with possible duplicates), how do I find a k-subset (with possible duplicates) that maximises the variance? Is there a more efficient way than the obvious "check all k-subsets"?
2. Given a set of numbers, how do I select from that set a list of k numbers that maximises the variance?
It might be better to ask this on a maths forum somewhere; just a suggestion, as you will get better answers there. The coding will be easy once you understand the algorithm, which is what you seem to be asking about here.
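For what it's worth, a commonly cited observation for this problem (worth proving before relying on it) is that some optimal k-subset consists of the i smallest values together with the k - i largest, for some i. If that holds, checking the k + 1 possible splits is enough, which beats enumerating all C(n, k) subsets. A minimal Python sketch under that assumption:

```python
def max_variance_subset(nums, k):
    # Assumes (unproven here) that some optimal k-subset consists of the
    # i smallest plus the (k - i) largest elements; requires k <= len(nums).
    xs = sorted(nums)
    n = len(xs)
    best_var, best = -1.0, None
    for i in range(k + 1):
        chosen = xs[:i] + xs[n - (k - i):]  # i smallest, k - i largest
        mean = sum(chosen) / k
        var = sum((x - mean) ** 2 for x in chosen) / k  # population variance
        if var > best_var:
            best_var, best = var, chosen
    return best, best_var

print(max_variance_subset([1, 5, 2, 9, 9, 3], 3))  # ([1, 9, 9], ~14.22)
```

This runs in O(n log n + k^2) instead of exponential time. The sketch covers version 1; version 2 (repetition allowed from a set) is easier: by Popoviciu's inequality the variance of values in [m, M] is at most ((M - m) / 2)^2, so taking half the picks as the minimum and half as the maximum is optimal for even k.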
I have training in pure math but not in statistics, computer science, or information theory, so I am a bit lost here and would really appreciate any guidance.
I am looking for some helpful ways to frame a general search approach that would minimize the time complexity of the search.
For example, let's say I was playing a modified version of 20 questions with a friend. The friend has thought of a human, presently alive in the US, and I can ask up to 20 questions to uncover the truth. I want to ask as few questions as possible on average to win the game. We will play this game repeatedly, and I want to develop a strategy that would minimize my average win time (as measured by the number of questions asked).
Sample Space: 329.5 million humans currently alive in the US
Rule: Ask any question. The question can have a yes-or-no answer or even a descriptive answer. So, for instance, it is allowed to ask for the person's first name.
Intuitively, it seems to me that immediately (as a first question) asking something like "Is it Barack Obama?" is a terrible question, because it splits the sample space (or search space) into two sets: one with 1 person, namely the former US President, and the second containing the rest of the US population.
Asking their sex (or old-school gender) may be a better question, as it will split the yes and no answers into sets of roughly equal size.
Instead of asking a binary question, asking an n-ary question is likely better, because it will split the sample space into n sub-spaces of varying sizes, and if the sizes are similar then that's fantastic. For instance, the question could be: what is the first letter of their last name? There are 26 possible answers, although we know that people in the US are much more likely to have a last name beginning with "J" than with "X".
Of course, I can conceivably ask a 329.5-million-ary question whereby I'll have the answer in one shot.
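As an aside that may help frame the questions below: Shannon entropy makes this intuition quantitative. Identifying one person out of 329.5 million, all equally likely, takes about log2(329.5e6) ≈ 28.3 bits in total, and a question with n roughly equally likely answers reveals about log2(n) bits. A small Python sketch (the split sizes below are made-up illustrations, not census figures):

```python
import math

POPULATION = 329_500_000  # sample-space size from the question

# Total uncertainty, assuming every person is equally likely.
total_bits = math.log2(POPULATION)  # ~28.3 bits

def questions_needed(n):
    # With n equally likely answers, each question yields log2(n) bits.
    return total_bits / math.log2(n)

print(questions_needed(2))   # ~28.3 binary questions
print(questions_needed(26))  # ~6.0 questions with 26-way answers

def information_gain(split_sizes):
    # Entropy (in bits) of the answer to a question that splits the
    # sample space into parts of the given sizes.
    total = sum(split_sizes)
    ps = (s / total for s in split_sizes)
    return -sum(p * math.log2(p) for p in ps if p > 0)

# "Is it Barack Obama?" reveals almost nothing on average:
print(information_gain([1, POPULATION - 1]))         # ~1e-7 bits
# A near 50/50 split reveals close to one full bit:
print(information_gain([167_000_000, 162_500_000]))  # ~0.9999 bits
```

The per-question gain is maximized exactly when the split is balanced, which is one route to proving the intuition asked about below.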
My questions for you guys are as follows:
If we fix "n" (so we ask only binary, ternary, or generally fixed-n-ary questions), it seems to me that the efficient approach would be to ask questions that divide the sample space into "n" roughly equal parts, if I am minimizing time complexity. How can I prove this? What is the right approach or mathematical framework for proving it, assuming I am only minimizing time complexity, i.e., the average number of questions I need to ask to get to the solution?
If we don't fix "n", then what would be a general way to frame this mathematically? Now I have two variables to optimize over, "n" and the relative sizes of the subsets into which the answer to an n-ary question splits the sample space, in order to minimize the time complexity. How can I frame this problem mathematically?
Is my intuition even correct? Or are there faster ways to approach this?
What I am describing sounds an awful lot like a classification decision tree in machine learning. Is minimizing entropy the right way to frame my question?
Who would know or think about this type of stuff? Information theorists? Computer scientists? Statisticians? Probability theorists? Machine learning folks? Someone else?
What's the right forum on the internet to get help on this question? Reddit? Some specific stackexchange? Anything else?
Thx
This is a general statistics question. Let's say I have a set consisting of several time series, and I would like to find pairs that are correlated. Is the only way to do that to compare each pair one by one? That does not seem very convenient, since the set may be very large.
It is the easiest way and it is broadly implemented in software packages. It helps to identify which other variables are linked.
For time series it is also very common to do this with lagged and forward-shifted variables. This helps to identify links with the past and the future of the same variable and/or of others.
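If the series live in a pandas DataFrame, you can at least avoid writing the pairwise loop yourself: DataFrame.corr() computes the whole correlation matrix at once (still quadratic in the number of series underneath, but vectorized and convenient). A sketch on made-up data, with an illustrative 0.8 threshold:

```python
import numpy as np
import pandas as pd

# Made-up example: 50 series of length 200, one per column.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 50)),
                  columns=[f"s{i}" for i in range(50)])

# Full pairwise correlation matrix in one call.
corr = df.corr()

# Keep only the upper triangle (skip the diagonal and duplicates)
# and list the strongly correlated pairs.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
print(pairs[pairs.abs() > 0.8])  # likely empty for pure noise

# Lagged variant: correlate every series with s0 shifted one step back.
lagged = df.apply(lambda col: col.corr(df["s0"].shift(1)))
print(lagged.head())
```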
I have an object with many fields. Each field has a different range of values. I want to use Hypothesis to generate different instances of this object.
Is there a limit to the number of combinations of field values Hypothesis can handle? And what does the search tree Hypothesis creates look like? I don't need all the combinations, but I want to make sure that I get a fair number of combinations, testing many different values for each field. I want to make sure Hypothesis is not doing a DFS until it hits the max number of examples to generate.
TLDR: don't worry, this is a common use-case and even a naive strategy works very well.
The actual search process used by Hypothesis is complicated (as in, "lead author's PhD topic"), but it's definitely not a depth-first search! Briefly, it's a uniform distribution layered on a pseudo-random number generator, with a coverage-guided fuzzer biasing that towards less-explored code paths, with strategy-specific heuristics on top of that.
In general, I trust this process to pick good examples far more than I trust my own judgement, or that of anyone without years of experience in QA or testing research!
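For illustration, here is a minimal sketch of this use-case with a hypothetical object and made-up field ranges; st.builds draws each field independently for every example, so a run exercises many different value combinations rather than sweeping one field at a time:

```python
from dataclasses import dataclass

from hypothesis import given, strategies as st

# Hypothetical object -- substitute your own fields and ranges.
@dataclass
class Config:
    retries: int
    timeout: float
    name: str

# One strategy per field; st.builds combines them into whole objects.
configs = st.builds(
    Config,
    retries=st.integers(min_value=0, max_value=10),
    timeout=st.floats(min_value=0.1, max_value=60.0,
                      allow_nan=False, allow_infinity=False),
    name=st.text(min_size=1, max_size=20),
)

@given(configs)
def test_config_properties(cfg):
    # Replace with a real property of your system under test.
    assert 0 <= cfg.retries <= 10
    assert 0.1 <= cfg.timeout <= 60.0
```

If you want broader coverage per test, you can raise the number of generated examples with the max_examples setting.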
Is there a way to get all optimal solutions when you are solving some problem with Excel Solver (Simplex LP method)?
If not, what is the best way/add-in for Excel to solve it, and how would I convert existing VBA code to use this new way?
Actually, I have found a way to do this with Excel Solver. It is not optimal in terms of time consumption, but that is not an issue for me.
If you can assign a unique id to each possible solution in some way (which is true in my case), then for each solution you find, you can check whether there is another solution with the same value but a different id, as follows:
1. Find the first optimal solution and save its id and result; I will call these origID and origRes.
2. Check if there is some solution with id < origID and res = origRes.
3. If yes, take the new id as the initial id and repeat step 2 until no solution satisfies the criteria.
4. After that, do the same thing with the condition id > origID and res = origRes.
5. Once you are sure you have found all solutions with the optimal result origRes, you can go on to find solutions that are not as good as origRes. I did this by adding the condition that a new solution must be <= (origRes - 0.01), because I know that all solutions have 2 decimal places.
6. Go to step 2 again.
I know this is not the best way, but I usually do not need more than 100 solutions, and currently I can get them in 2 minutes, which is acceptable for me.
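For what it's worth, the loop above is easy to express in code. Here is a sketch in Python rather than VBA, where solve() is a hypothetical stand-in for one Solver run with the extra constraints applied (not a real Excel API):

```python
def enumerate_optima(solve):
    """Enumerate same-objective solutions via the unique-id trick above.

    `solve` is a hypothetical helper: solve(id_below=None, id_above=None,
    fix_objective=None) re-runs the model with those extra constraints
    and returns (solution_id, objective), or None if infeasible.
    """
    first = solve()
    if first is None:
        return []
    orig_id, orig_res = first
    found = [first]

    # Steps 2-3: walk downward through ids at the same objective value.
    current = orig_id
    while (nxt := solve(id_below=current, fix_objective=orig_res)):
        found.append(nxt)
        current = nxt[0]

    # Step 4: the same sweep, walking upward through ids.
    current = orig_id
    while (nxt := solve(id_above=current, fix_objective=orig_res)):
        found.append(nxt)
        current = nxt[0]

    return found
```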
Although this looks easy, it is actually not such an easy question. Even the definition of "all possible optimal solutions" is not clear: there may be infinitely many of them. Asking for "all basic feasible solutions" (i.e. corner points) sounds better. To my knowledge there are no solvers providing this, and I also do not know of a really simple technique to enumerate all optimal bases.
One interesting approach is to use a MIP formulation to enumerate all optimal bases:
Sangbum Lee, Chan Phalakornkule, Michael M. Domach, Ignacio E. Grossmann, "Recursive MILP model for finding all the alternate optima in LP models for metabolic networks," Computers and Chemical Engineering 24 (2000) 711-716. (link)
I'm not sure I've titled this post correctly, but I'm wondering if there's a name for this type of algorithm:
What I'm trying to accomplish is to create a minimal set of instructions to go from one string to a permutation of it, so for example:
STACKOVERFLOW -> STAKCOVERFLOW
would require a minimum of one operation, which is to
shift K before C.
Are there any good online examples of:
1. finding the minimum instruction set (I believe this is also often called the edit distance), and
2. listing the instruction set?
Thanks!
There is something known as the Levenshtein distance that tells you how many changes are needed to go from one string to another, and there are many C# implementations of it, as well as implementations in many other languages.
Here's the wiki:
http://en.wikipedia.org/wiki/Levenshtein_distance
Edit: As TheHorse has indicated, the Levenshtein distance doesn't understand shift (transposition) changes, but there is an improved algorithm:
Damerau-Levenshtein distance
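To make the difference concrete, here is a short Python sketch of the restricted Damerau-Levenshtein variant (often called optimal string alignment), which counts an adjacent swap such as K<->C as a single operation. Recovering the actual instruction list is then a standard backtrack through the same table:

```python
def osa_distance(a: str, b: str) -> int:
    # Dynamic programming over prefixes: d[i][j] is the distance
    # between a[:i] and b[:j].
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete everything
    for j in range(n + 1):
        d[0][j] = j                      # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            # Adjacent transposition ("shift K before C").
            if i > 1 and j > 1 and a[i-1] == b[j-2] and a[i-2] == b[j-1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(osa_distance("STACKOVERFLOW", "STAKCOVERFLOW"))  # 1
```

Note that this restricted variant is not quite the full Damerau-Levenshtein metric (it never edits a substring more than once), but it handles the single-swap example in the question.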