probability logic statistics - statistics

I am not sure whether this is the right place to ask this question, as it is more of a logic question, but there is no harm in asking.
Suppose I have a huge list of data (customers), and every record has a data_id.
I want to split the data in a fixed ratio, let's say 10:90.
I could state a condition on the ID, for example:
the sum of digits is even: go to bin 1
the sum of digits is odd: go to bin 2
or
the sum of the last three digits is x: go to bin 1
the sum of the last three digits is not x: go to bin 2
But this might result in uneven bins. Sometimes a bin might collect more data than needed (which is fine), but sometimes it might not collect enough.
Is there a way (probabilistically speaking) to make sure the sample size is always greater than x%?
Thanks

You want to partition your data by a feature that is uniformly distributed. Hash functions are designed to have this property, so if you compute a hash of your customer ID and then partition by the first n bits to get 2^n bins, each bin should have approximately the same number of items. (You can then select, say, 90% of your bins to get roughly 90% of the data.) Hope this helps.
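A minimal sketch of that idea in Python, in case it helps. The choice of SHA-256, the function names `bucket_of` and `assign_bin`, and the default of 10 bits (1024 buckets) are all assumptions made for illustration, not something the answer prescribes.

```python
import hashlib


def bucket_of(customer_id, n_bits=10):
    """Map an ID to one of 2**n_bits buckets using the leading bits of its hash."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
    return value >> (64 - n_bits)              # keep only the top n_bits


def assign_bin(customer_id, small_share=0.10, n_bits=10):
    """Return 1 for roughly `small_share` of all IDs, 2 for the rest."""
    n_buckets = 1 << n_bits
    cutoff = round(small_share * n_buckets)    # number of buckets sent to bin 1
    return 1 if bucket_of(customer_id, n_bits) < cutoff else 2
```

The same ID always lands in the same bin, so the split is reproducible. The share is only approximate: with 1024 buckets, 102 of them go to bin 1 (about 9.96%), and the actual count fluctuates a little around that.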

Related

Dividing vector into groups by another vector

I have the following table (for example, the real table is 100 rows):
It has a group name column, n students in group, and score.
I would like to build a fourth column, cluster, which divides the groups into 10 deciles of approximately equal size while preserving the order by score. So if I have a total of 80 students in all groups together, then I'll have 10 clusters with about 8 students each, more or less. The top cluster will consist of the groups with the highest scores.
I hope it makes any sense.
My problem is more of an algorithmic one. I would prefer a solution in Excel/VBA rather than R, just because I need a more dynamic solution.
I tried to do it manually by sorting the groups by score and then summing the number of students until I get a number close to a decile of the total number of students, but maybe there is an algorithm that is more precise and less frustrating than that.
Thanks
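One possible way to automate the manual procedure described above, sketched in Python rather than Excel/VBA. The function name `assign_clusters`, the tuple layout, and the rule of placing each group by the midpoint of its cumulative student range are assumptions for illustration; other tie-breaking rules are equally reasonable.

```python
def assign_clusters(groups, n_clusters=10):
    """groups: list of (name, n_students, score) tuples.

    Returns {name: cluster}, where cluster 1 holds the highest scores.
    """
    total = sum(n for _, n, _ in groups)
    ordered = sorted(groups, key=lambda g: g[2], reverse=True)
    clusters = {}
    seen = 0  # students in groups that rank above the current one
    for name, n, _score in ordered:
        midpoint = seen + n / 2  # centre of this group's cumulative range
        cluster = int(midpoint * n_clusters / total) + 1
        clusters[name] = min(cluster, n_clusters)
        seen += n
    return clusters
```

Because whole groups are never split, cluster sizes are only approximately equal, and some clusters can end up empty when a single group is much larger than a decile.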

How can I divide a sum into 50 columns in Excel?

Let me explain with a smaller number of columns.
Let's say we have a sum of 24 and we want to distribute it randomly across 10 separate columns. We should get a result like the one I wrote below.
Is there a formula in Excel like this?
Thanks in advance.
Ok. Here is what I would do...
For each required split, select a random number between zero and the remainder of the distribution quantity, scaled by the fraction of splits calculated so far (the split's index divided by the total number of splits). This prevents the first few splits from being very high and the rest from being zero.
I would also add a check for the very last split to make sure that it equals whatever is left of the original distribution quantity.
Here is an image for illustration and the formula that I have used:
=IF($A2=MAX($A:$A),$F$1-SUM($B$1:$B1),RANDBETWEEN(0,($F$1-SUM($B$1:$B1))*($A2/MAX($A:$A))))
Hopefully, this isn't too complex to understand. If you need further explanation, please let me know.
You can simply change the distribution qty in the yellow box, and if you want more splits, all you need to do is drag down columns A & B to the required number.
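For readers who want to try the same scheme outside Excel, here is a rough Python sketch of the formula above. The function name `random_split` is made up for illustration, and truncating the upper bound with `int()` is an assumption about how the fractional bound should be handled.

```python
import random


def random_split(total, n_splits, rng=random):
    """Split `total` into `n_splits` non-negative integers that sum to `total`."""
    parts = []
    for i in range(1, n_splits + 1):
        remainder = total - sum(parts)
        if i == n_splits:
            # Last split: take whatever is left so the parts sum to total.
            parts.append(remainder)
        else:
            # Upper bound grows with i, so early splits cannot take everything.
            upper = int(remainder * i / n_splits)
            parts.append(rng.randint(0, upper))
    return parts
```

Calling `random_split(24, 10)` returns ten values that add up to 24. As in the spreadsheet version, the last value simply absorbs the remainder, so it tends to be larger than the others.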

How do I distribute a value over multiple cells evenly but under a maximum limit?

Example data with desired outcome that I need to calculate
I have 12 items of a certain current value. I have a 'soft' cap of $1,000,000 for these values. Some of the items fall above, and some below this cap level.
I have an amount of money (for this example $900,000) that I want to distribute amongst only the items that fall below the cap (in this example 6 items), with the aim of bringing the value of these items up to but not over the cap value.
If I distribute the $900,000 evenly over these 6 items (each receiving $150,000), you can see that items 2 and 9 would then be over the $1,000,000 cap. So items 2 and 9 should only receive $100,000 each to raise their value to the cap, and the remaining 4 items would then receive an equal share of the remaining pool of money ($700,000 / 4 = $175,000).
So I need a formula that checks every item to see whether it needs a distribution (i.e. it is below the cap) and then divides out the money pool as illustrated above in the desired distribution column.
Note: The pool of money to be distributed can change. Also the number of items below the cap can change. The cap value itself can change.
I am hoping to avoid VBA or Solver because the spreadsheet could be used on other people's computers.
Hopefully this makes sense. Thanks.
EDIT:
So far I have been able to get close by adding a helper column and using the following formula:
=IF(SUM($F$6:F14)=$D$23,0,E15*MIN(D15,($D$23-SUM($F$6:F14))/SUM(E15:$E$18)))
Working example when values are sorted.
This seems to work when the values are sorted in descending order, as shown in the example image above. But it seems to break when the values are in a more random order, which is likely to happen (as in the original post).
Just to give you an idea of how Solver can be set up for a capital budget model, here is one; it also shows the Solver dialog and its settings:
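The distribution rule described in the question (cap each item's share at its headroom, then re-split what is left equally among the remaining items) does not depend on the sort order and can be sketched without Solver. The Python below is only an illustration of that logic; the function name `distribute` is an assumption, and it is not a replacement for the spreadsheet formula.

```python
def distribute(values, pool, cap):
    """Return the amount each item receives; items at or above `cap` get 0."""
    shares = [0.0] * len(values)
    open_items = [i for i, v in enumerate(values) if v < cap]
    remaining = float(pool)
    while open_items and remaining > 1e-9:
        equal = remaining / len(open_items)
        # Items whose headroom is no larger than an equal share get topped up.
        capped = [i for i in open_items if cap - values[i] <= equal]
        if not capped:
            # Everyone can absorb an equal share: hand it out and stop.
            for i in open_items:
                shares[i] = equal
            break
        for i in capped:
            shares[i] = cap - values[i]
            remaining -= shares[i]
        open_items = [i for i in open_items if i not in capped]
    # If the pool exceeds the total headroom, the excess stays undistributed.
    return shares
```

With hypothetical item values of 900,000, 500,000, 1,200,000, 900,000 and 600,000, a cap of 1,000,000 and a pool of 550,000, the two items at 900,000 receive 100,000 each, the item above the cap receives nothing, and the other two receive 175,000 each.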

Find the 10 smallest numbers in a large dataset?

I am coding in Python, but right now I am at the stage of designing pseudocode for an algorithm that takes a data set of n values, with n very large, and picks out the 10 smallest values (or, more generally, the m smallest values for some fixed m << n). I would like the most efficient algorithm for these requirements.
My ideas:
1) Heapsort the data, then pick the smallest 10 values. O(n log(n))
2) Alternatively, use a loop that identifies a 'champion' (the current minimum) and run it 10 times. Once the first 'champion' is determined, remove it from the data set and repeat the loop. O(mn), which is effectively O(n) given that m is small.
Which suggestion would be best, or is there a better one?
One approach among many possible:
Grab the first 10 values and sort them. Now compare the largest of them with the 11th through nth values, one at a time. Whenever the new value is smaller, replace the largest of your 10 values with it and re-sort the 10 values.
The list of 10 values and the work of sorting it will all stay in cache, so this is fast even with rough code. The whole data set is read through only once, so that part is fast as well.
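A short Python sketch of this approach, assuming the data can be read as any iterable. The function name `m_smallest` is made up for illustration; `bisect.insort` keeps the small list sorted instead of re-sorting it from scratch.

```python
import bisect
import itertools


def m_smallest(values, m=10):
    """Return the m smallest items of `values` in ascending order, in one pass."""
    it = iter(values)
    best = sorted(itertools.islice(it, m))  # the first m values, sorted
    for v in it:
        if best and v < best[-1]:
            best.pop()                      # drop the current largest
            bisect.insort(best, v)          # insert v, keeping the list sorted
    return best
```

Python's standard library also provides `heapq.nsmallest(m, values)`, which serves the same purpose and may be the simpler choice in practice.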

Minimum cost to group same characters in a string

I got stuck in a problem. The overall problem statement is big. I have solved the other pieces of it.
Got stuck in one piece.
We are given a string containing some dashes ('-') and some copies of a character, let's say 'A'. We are also given a cost C to shift a character to an adjacent position. We need to find the minimum cost such that all 'A' characters are grouped together.
Example1: A-A--A---A and cost = 10
Minimum cost to group all 'A's would be: 80
Example2: AAAA------A and cost = 10
Minimum cost to group all 'A's would be: 60
Hint: for the cost to be minimum possible, one of the median As (2nd or 3rd of 4 in your first example, 3rd of 5 in your second example) can be left in place. Using this, you can compute the cost in O(n), where n is either the length of the string or the number of As, whichever is your input format.
I don't think this problem needs dynamic-programming.
You only need to move all A's towards the median A, because this gives the least total distance moved over all A's.
Just make sure not to move the median A. If the A at the median is moved one step to the right, each of the A's to its left has to move one more step and each of the A's to its right has to move one step less. At best these changes cancel out, but you have still added one unneeded step for the median A itself.
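The median idea can be sketched in Python as follows. The function name `min_group_cost` and its parameters are assumptions for illustration. Subtracting each A's rank from its position turns "adjacent target slots" into "one common target", so the cost becomes a sum of distances to the median of the shifted positions.

```python
def min_group_cost(s, cost, ch="A"):
    """Minimum total cost to make all `ch` characters in `s` contiguous."""
    positions = [i for i, c in enumerate(s) if c == ch]
    if not positions:
        return 0
    # The k-th A ends up at (start + k), so compare (position - k) to one target.
    shifted = [p - k for k, p in enumerate(positions)]
    median = shifted[len(shifted) // 2]  # shifted is already non-decreasing
    moves = sum(abs(q - median) for q in shifted)
    return moves * cost
```

For the two examples above, `min_group_cost("A-A--A---A", 10)` gives 80 and `min_group_cost("AAAA------A", 10)` gives 60, matching the stated answers.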
