Efficient methods for determining the largest set of complete data in a large dataset


I have a large-ish dataset (say 10 million rows by 1,500 columns). Each row represents an individual and each column represents a question. I would like to find the largest block of non-missing data, i.e. n rows and k columns that are all complete, subject to some criterion (n > N). Currently, I am doing something that feels a bit arbitrary: I rank the columns by completeness and take the column (C1) with the largest number of completions (non-missing rows) as my starting point. I filter out rows with missing data for C1, re-rank the remaining columns by completeness, choose the top column (C2), and continue down this path until I reach a set size I am comfortable with (stop when n < N).
I would be very interested if there are methods in place for doing this and/or any thoughts on efficient ways to do this!
Thank you
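For reference, the greedy procedure described in the question can be sketched in plain Python as below. This is only an illustration of that heuristic, not a method guaranteed to find the largest complete block. The function name `greedy_complete_block`, the use of `None` for a missing answer, and the rule of stopping just before the row count would drop below N are all assumptions made for the sketch.

```python
def greedy_complete_block(rows, min_rows):
    """rows: list of equal-length lists, with None marking a missing answer.

    Returns (kept_row_indices, chosen_column_indices).
    """
    n_cols = len(rows[0]) if rows else 0
    # For each column, the set of row indices that have a value.
    present = [
        {i for i, row in enumerate(rows) if row[c] is not None}
        for c in range(n_cols)
    ]
    kept = set(range(len(rows)))
    chosen = []
    remaining = set(range(n_cols))
    while remaining:
        # Re-rank: pick the column that keeps the most of the current rows
        # (ties go to the lowest column index).
        best = max(remaining, key=lambda c: (len(kept & present[c]), -c))
        survivors = kept & present[best]
        if len(survivors) < min_rows:
            break  # adding any further column would leave fewer than N rows
        kept = survivors
        chosen.append(best)
        remaining.remove(best)
    return sorted(kept), chosen


sample = [
    [1, 2, None],
    [1, None, 3],
    [1, 2, 3],
    [None, 2, 3],
    [1, 2, None],
]
print(greedy_complete_block(sample, 3))  # ([0, 2, 4], [0, 1])
```

At 10 million rows the per-column sets would be better replaced by boolean arrays or bitmaps, but the re-ranking logic stays the same.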

Related

Find row which has one cell similar and other cell different than in another row

Let's say I have this:
A B
1 10 20
2 12 30
3 25 15
4 40 30
How do I find the rows which have the same value in column B and a different value in column A when compared to all the rows above or below?
I want to find this cell:
A2:B2

Update: NO revision necessary
Following feedback I have tested this equation (below) with 20k rows (link below) - happy to report back results as expected/all still in order. No changes necessary/warranted. This function works just fine/as expected. Beaut!
Explanation:
When testing large samples of integer data drawn from a common range (so that values have a material probability of recurring), the chance that any value in field A (column B in the sheet) is unique falls as the sample grows, simply because more rows mean more repeats.
As a consequence, you may see the result #CALC!, which simply means that no value in field A was unique (or that some were, but only in rows where column C was also unique; this is unlikely, since many cells in each column contain identical data). Put -100 in field A (assuming all other values are positive), and the formula below should return -100 together with the corresponding column C value (assuming that value is not itself unique, as mentioned).
NOW - back to the solution already! :)
ORIGINAL SOLN:
This will give you back every such combination (besides {12,30} there is also {40,30}):
=FILTER(B2:B5&"-"&C2:C5,(COUNTIFS(B2:B5,"="&$B$2:$B$5)=1)*(COUNTIFS(C2:C5,"="&$C$2:$C$5)>1))
OneDrive excel-linked spreadsheet for your convenience here, taking careful note of restrictions per 1st comment to this proposed soln.
Notes
Assumes you have Office 365 version of Excel
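For readers without FILTER, the same test (value occurs exactly once in the first column, more than once in the second) can be sketched in Python. The function name `unique_a_shared_b` is made up for this illustration; the data is the four-row example from the question.

```python
from collections import Counter


def unique_a_shared_b(rows):
    """rows: list of (a, b) pairs.

    Returns the pairs whose a occurs once overall and whose b occurs more than once.
    """
    a_counts = Counter(a for a, _ in rows)
    b_counts = Counter(b for _, b in rows)
    return [(a, b) for a, b in rows if a_counts[a] == 1 and b_counts[b] > 1]


data = [(10, 20), (12, 30), (25, 15), (40, 30)]
print(unique_a_shared_b(data))  # [(12, 30), (40, 30)]
```

An empty list here corresponds to the case where the Excel formula has nothing to return.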

generate all possibilities from two fixed rows of entries?

I spent hours trying to look for a solution and I feel like I got close but figured asking would be the best way.
Let's say I have a table with 2 columns: column A is an item, and column B is the price of the item. This table has 12 entries. What I would like to do is generate additional tables of 6 entries that do not exceed a certain total price. See below for an example. The number I want these tables not to exceed is 50,000.
For example, the first entry could be an apple with a value of 9,000. The apple is in column A, and the value in column B.
Can someone help with a way to generate all combinations of 6 items from column a, that do not exceed a combined price of 50,000 in column b?

With 12 items you have 2^12 - 1, or 4095, possible non-empty combinations of products. These map onto the 12 bits of a 12-bit binary number. Of these, C(12,6) = 924 contain exactly 6 items. It is not difficult to write a macro that calculates the total cost of each combination and then filters the result to display those less than or equal to 50,000.
EDIT#1:
Please see:
Best possible combination sum of predefined numbers that smaller or equal NN
Listing all possible combination without repetition,VBA
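Outside VBA, the enumeration is short enough to sketch directly. The Python below lists every combination of a fixed size whose total stays within a budget. The function name `affordable_sets` is hypothetical, and apart from the apple at 9,000 the items and prices are invented for the illustration.

```python
from itertools import combinations
from math import comb


def affordable_sets(prices, size, budget):
    """prices: dict of item -> price. Returns all size-item tuples within budget."""
    names = sorted(prices)
    return [
        combo
        for combo in combinations(names, size)
        if sum(prices[name] for name in combo) <= budget
    ]


# Hypothetical prices; only the apple at 9,000 comes from the question.
prices = {"apple": 9000, "bread": 20000, "cheese": 30000, "dates": 5000}
print(affordable_sets(prices, size=2, budget=30000))
# [('apple', 'bread'), ('apple', 'dates'), ('bread', 'dates')]
print(comb(12, 6))  # 924 candidate sets for 6 items out of 12
```

With 12 items and size 6 there are only 924 candidates, so brute force is entirely practical.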

How to create four equal buckets of decimal values

I have an excel table:
JobA .03445
JobB .01366
JobC .93271
JobD .6335
Plus 65,000 more.
What I need to do, is to create four equal buckets based on the values. where the sum of all Jobs in each bucket come as close to the other three buckets as possible.
Is there a way to do this in Excel?
Thanks

You can try this approach based on the incremental percentage. So you sum each incremental job until your sum reaches 25% of total values (that is BucketA), jobs from 25-50% will be "BucketB", 50-75% "BucketC", and rest will go into "BucketD". Sum of values in each bucket should be pretty close since you have 65k of values.
enter this formula
=IF(SUM($B$2:B2)/SUM($B$2:$B$100000)<0.25,"BucketA",IF(SUM($B$2:B2)/SUM($B$2:$B$100000)<0.5,"BucketB",IF(SUM($B$2:B2)/SUM($B$2:$B$100000)<0.75,"BucketC","BucketD")))
in cell C2 (the first data row, since the formula starts at B2) and drag it to the bottom.
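The same cumulative-share rule can be sketched in Python, which may help when checking the formula against a sample. The function name `cumulative_buckets` is made up for the sketch; buckets are numbered 0 to 3 instead of "BucketA" to "BucketD", and, as in the formula, the running sum includes the current row.

```python
def cumulative_buckets(values, n_buckets=4):
    """Label each value by the share of the grand total reached so far."""
    total = sum(values)
    labels = []
    running = 0
    for value in values:
        running += value
        # A running share below 1/n_buckets maps to bucket 0, and so on;
        # the last row (share == 1.0) is clamped into the final bucket.
        index = min(int(running / total * n_buckets), n_buckets - 1)
        labels.append(index)
    return labels


print(cumulative_buckets([1, 1, 1, 1, 1, 1, 1, 1]))  # [0, 1, 1, 2, 2, 3, 3, 3]
```

The small example shows one side effect of the strict "<" comparison: the row that reaches exactly 25% already falls into the second bucket.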

There are lots of studies into algorithms that solve these types of problems. Your problem has exactly the same form as the equal piles example in this article:
https://simple.wikipedia.org/wiki/P_versus_NP#Example
Considering the volume you're working with and the fairly narrow range of values, you could get a fairly good approximate solution by simply doing this:
Sort all items in descending order by value
In an adjacent column, put 1, 2, 3 and 4 against the first 4 values.
Use autofill to repeat that pattern against all values
You should now have 4 groups of fairly equal value
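The steps above can be sketched in Python as `round_robin_buckets`. For comparison, the sketch also includes `greedy_buckets`, a different and well-known heuristic (not the method from this answer) that always gives the next-largest value to the bucket with the smallest running total. Both function names are made up for the illustration.

```python
import heapq


def round_robin_buckets(values, n_buckets=4):
    """Sort descending, then deal values out 1, 2, 3, 4, 1, 2, ..."""
    buckets = [[] for _ in range(n_buckets)]
    for i, value in enumerate(sorted(values, reverse=True)):
        buckets[i % n_buckets].append(value)
    return buckets


def greedy_buckets(values, n_buckets=4):
    """Sort descending, then always add to the currently lightest bucket."""
    heap = [(0, b) for b in range(n_buckets)]  # (running total, bucket index)
    buckets = [[] for _ in range(n_buckets)]
    for value in sorted(values, reverse=True):
        total, b = heapq.heappop(heap)
        buckets[b].append(value)
        heapq.heappush(heap, (total + value, b))
    return buckets


values = [8, 7, 6, 5, 4, 3, 2, 1]
print([sum(b) for b in round_robin_buckets(values)])  # [12, 10, 8, 6]
print([sum(b) for b in greedy_buckets(values)])       # [9, 9, 9, 9]
```

On a tiny list the round-robin deal visibly favours the first bucket, because it always receives the largest value of each group of four. With 65,000 values in a narrow range that bias is much smaller relative to the totals.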

Taking the top values and averaging from data list in Excel

I have a data list in Excel, I am looking to take the top 3 values for each number, and get the average for those 3 values quickly. I often work with lists of up to 50,000 lines which at any one time could convert to over 10,000 different column A numbers.
I understand basic pivot tables to get an average after the top 3 values are collected, but need to find a way to remove all values that are not the top 3,
I trust this may be an extremely simple ask, or complex and thank you in advance for your help.

You can use the =LARGE(array, k) formula. For example, =LARGE(B:B, 1) returns the largest number, =LARGE(B:B, 2) the second largest, and so on.
If the column contains many duplicates and you want to get all occurrences of the top three values, use this formula:
=IF(LARGE(B:B,ROW(A1))>=LARGE(B:B,COUNTIF(B:B,LARGE(B:B,COUNTIF(B:B,MAX(B:B))+1))+COUNTIF(B:B,MAX(B:B))+1),LARGE(B:B,ROW(A1)),"")
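If the list can be exported, the per-number calculation asked for in the question (top 3 values for each column A number, then their average) is compact in Python. This is a sketch under assumptions: the function name `top_k_average` and the sample keys and values are invented, and a group with fewer than 3 values is averaged over what it has.

```python
from collections import defaultdict
import heapq


def top_k_average(pairs, k=3):
    """pairs: iterable of (key, value). Returns key -> mean of its k largest values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    result = {}
    for key, values in groups.items():
        top = heapq.nlargest(k, values)  # at most k values, largest first
        result[key] = sum(top) / len(top)
    return result


rows = [(101, 10), (101, 40), (101, 30), (101, 20), (202, 5), (202, 7)]
print(top_k_average(rows))  # {101: 30.0, 202: 6.0}
```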

Picking top 5 scores from a range

I run a small golf eclectic with excel. One of the things we have is a points system. I would like to get the 5 highest points scored over the season and have them ranked from 1 (being the highest points scored) to 5.
My knowledge of excel "sums" goes only a wee bit further than add and subtract.
Thanks!

If you don't want to change the order that they are presently in you can use the LARGE function. It returns the kth largest value.
Below is a great formula, if you drag it down it will automatically get the second, third and nth largest value from a table of data (in this example the data is between A1 to A10).
=LARGE(A1:A10,ROW(A1)-ROW($A$1)+1)
You can then match the values with names or corresponding data from the tables using the MATCH and INDEX functions. The example below would fetch the name for each value from the second column.
=INDEX($A$1:$B$10,MATCH(cell reference with score or value,$A$1:$A$10,0),2)
Play around with these formulas, they are very convenient for data m

If you have a column containing the scores, you could add a filter (Data->Filter I think) and sort descending.
Though, if you just have rows that are something like [Date][Person][Score] you'll need to go to another sheet and SUM the scores for each person then sort that... Unfortunately my Excel skills aren't up to par to pull a score for each person like that.

Given a list of numbers in A1 to A10, you can work out their 'Rank' relative to each other by using 'RANK'.
e.g.
RANK(A1,A1:A6,0)
RANK(cell, list of cells to check against, order)
For order, 0 = descending.
From there you can work out which one is first pragmatically.

If you have Excel 2007,
Check that your data is continuous, with no blank rows or columns. Click on your scores and then select 'Data - Filter'
Use the dropdown that the filter creates at the top of your scores column and select 'Number filters - Top ten'.
A 'Top ten Autofilter' dialog will be displayed, reduce the show 10 to 5 and then click on OK.
For earlier versions of Excel, add a RANK formula in a new column. Be careful, as the scores need to be sorted, usually into descending order. If there are any ties, they are given the same rank and the following rank numbers are skipped accordingly (e.g. if two scores of 2 are both ranked 5, the next score is ranked 7, not 6).
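The tie behaviour described above can be checked with a small Python sketch. The function name `competition_rank` is made up; it assumes descending order, so each score's rank is the position of its first occurrence in the sorted list.

```python
def competition_rank(scores):
    """Rank scores in descending order; ties share a rank and later ranks are skipped."""
    ordered = sorted(scores, reverse=True)
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        # Only the first (best) position of each distinct score is recorded.
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


print(competition_rank([9, 8, 7, 6, 2, 2, 1]))  # [1, 2, 3, 4, 5, 5, 7]
```

The two scores of 2 both receive rank 5 and the next score receives rank 7.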

If you want to use the LARGE function as described above, make sure you put the same range in the list for each of the LARGE functions. That is, change =LARGE(A1:A10,ROW(A1)-ROW($A$1)+1) to =LARGE(A$1:A$10,ROW(A1)-ROW($A$1)+1), or you will get some strange, incorrect results, because the relative range shifts as the formula is dragged down.
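As a cross-check on the spreadsheet result, the whole task (five highest scores, ranked 1 to 5, with names) can be sketched in a few lines of Python. The function name `top_scores` and the player names and points are invented for the illustration. Ties keep their original order because Python's sort is stable.

```python
def top_scores(scores, k=5):
    """scores: list of (name, points). Returns [(place, name, points), ...] for the top k."""
    ranked = sorted(scores, key=lambda item: item[1], reverse=True)
    return [
        (place, name, points)
        for place, (name, points) in enumerate(ranked[:k], start=1)
    ]


# Hypothetical season totals.
season = [("Ann", 31), ("Bob", 36), ("Cat", 28), ("Dan", 40), ("Eve", 33), ("Fay", 25)]
for row in top_scores(season):
    print(row)
```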
