Most Efficient Way to Loop Through Columns of Identical Data - Without Duplicates - excel
I have coded something to evaluate the result of every possible combination of inputs, in hopes to optimize a solution.
I have three identical columns of inputs, and my loop cycles through them all in search of the best combinations of inputs to yield the highest output. Example:
475,475,475
391,391,391
24,24,24
999,999,999
Duplicates are not allowed. I have been able to error correct for this per iteration, but not iteration v. iteration. As an example the first result I evaluate is 475 391 24.
QUESTION: The order of the inputs has no impact on the result I am evaluating. My dataset is so large that it is time-consuming to evaluate 475 391 24, then later evaluate 391 475 24, and then evaluate 24 391 475 again. Is there a way to design around this? I am unable to manipulate the source data. I have only a modest VBA skillset, but even the basic concept of solving this problem would be helpful. I imagine this is a common problem in many programming languages.
A possible solution would be to use a dictionary in some way: whenever you read the 3 values, sort them, put them in a CSV string in ascending order, then lookup in the dictionary.
It is hard to tell for sure whether this will speed up your code, because it depends on whether your evaluation function is more or less expensive than sorting the values and doing the dictionary lookup.
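To make the dictionary idea concrete, here is a minimal sketch in Python rather than VBA. The function `expensive_score` is a hypothetical stand-in for the real evaluation (any function whose result does not depend on input order), and `make_cached` is a name invented for this example. The sorted triple is used as the dictionary key, so every permutation of the same three values hits the same entry.

```python
call_count = 0  # how many times the (hypothetical) expensive function really runs


def expensive_score(a, b, c):
    """Placeholder for the real evaluation; assumed to be order-independent."""
    global call_count
    call_count += 1
    return a + b + c


def make_cached(evaluate):
    """Wrap `evaluate` so that permutations of the same inputs are computed once."""
    cache = {}

    def cached(a, b, c):
        key = tuple(sorted((a, b, c)))  # canonical form: ascending order
        if key not in cache:
            cache[key] = evaluate(*key)
        return cache[key]

    return cached


score = make_cached(expensive_score)
results = [score(475, 391, 24), score(391, 475, 24), score(24, 391, 475)]
print(results, call_count)
```

In VBA the same pattern could probably be built on a Scripting.Dictionary with a key string such as "24,391,475"; whether it pays off still depends on how costly the evaluation is compared with the sort and the lookup.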
Another possibility is to remove duplicates from your data after applying a transformation to it. Suppose your data is in columns A, B and C. Generate data in D, E and F with formulas as follows:
the first column holds the min: Di --> =MIN(Ai:Ci)
the third column holds the max: Fi --> =MAX(Ai:Ci)
the second column holds the middle value: Ei --> =SUM(Ai:Ci) - Di - Fi
After that, select the range in D, E, F, copy it and paste it as values, use Remove Duplicates, and finally apply your procedure to the remaining data.
If the order of inputs has no impact, you could first sort the lists. Then choose an item from the first column, an item from the second column that is greater than the first item, and an item from the third column that is greater than the second. This ensures that each combination is tried only once, not 6 times (so a ~6X speedup if the initial sort is not too burdensome).
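A short sketch of this "strictly increasing" selection, written in Python because the idea is language-independent. The values are the four from the question; `ordered_triples` is a name made up for this example. The three nested loops with `i < j < k` translate directly to nested For loops in VBA, and `itertools.combinations` is used only to cross-check the result.

```python
from itertools import combinations

values = [475, 391, 24, 999]  # the single distinct column from the question


def ordered_triples(items):
    """Return every 3-value combination exactly once, in ascending order."""
    items = sorted(set(items))  # sort once, drop repeated values
    n = len(items)
    out = []
    for i in range(n - 2):
        for j in range(i + 1, n - 1):      # second pick comes after the first
            for k in range(j + 1, n):      # third pick comes after the second
                out.append((items[i], items[j], items[k]))
    return out


triples = ordered_triples(values)
print(len(triples), triples)
```

With 4 distinct values this visits 4 triples instead of the 24 ordered arrangements a naive triple loop without duplicates would evaluate.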
Related
Excel multilevel array formula with partial string matches to sum resultant cells
I've been trying to sort this out for over a day now without much luck. I have successfully used SUMIFS, INDEX, MATCH, COUNTIF, "--", etc. in array formulas previously and am not a novice, but also not an expert on these. I can't seem to weave these together correctly, and I am likely on an altogether incorrect path. Basically, I am trying to aggregate data from multiple spreadsheets, requiring a mapping of various items (rows) into a canonical form for summing. The image here shows a representative but simplified version of my quest. Each "region" on this example spreadsheet (Final..., Mapping, DataSet1, DataSet2) is actually in a different spreadsheet, and there are several sheets with 50-150 rows in each xlsx. Note that the names in Column B are quite arbitrary (meaning not all P1's have an 'x' pattern, like shown here as x1, x2, etc.). Do not rely on any pattern in the names, except that the x, y, z in the Mapping table are substrings (case insensitive, trailing match) of the names in Column B in the DataSets. In the image, the Final Result Table (summed manually) is what I want to compute via an (array) formula. A single formula would be ideal (I have many spreadsheets from which the monthly data is being pulled, which I can't readily modify, but I can create an interim spreadsheet if required, so I am open to helper columns or helper rows). Here's the process: (1) For each name (B3-B5) in the Final Result Table, I want to sum the name from its components as follows. (2) Look up all the matches in the Mapping Table (so for P1, the formula =IF($C$10:$C$15=$B3, $B$10:$B$15,"") gives {"x1";"";"";"x2";"";"x3"}). (3) I then want to search for each of x1, x2, and x3 in B19:B26 to get rows 21, 22, 24, 25, 26 in DataSet1 and in B31:B35 to get row 32 in DataSet2, to then add up the Jan totals into C3. (Effectively, C3=C21+C22+C24+C25+C26+C32.) Same for P2 and P3, and through Feb, Mar, ...
I am stuck on how to remove blank or 0 or Div0 or such "error rows" from the interim result in 2, and also need to use 2 arrays of different sizes (3 valid rows in example 2 above, ignoring blanks) to search many rows in DataSets. I tried SEARCH("*"&IF($C$10:$C$15=$B3, $B$10:$B$15,""), $B$19:$B$26) but get unexpected results. I have tried to replace text in the interim result {"x1";"";"";"x2";"";"x3"} with TRUE/FALSE, and 1/0, etc. to help with INDEX or MATCH, but am stymied by errors in downstream ("surrounding") formulas. Thanks in advance.
Here is a solution without resorting to nasty (imo) CSE formulas.
= SUMPRODUCT($C$19:$F$26*(COUNTIFS($B$10:$B$15, RIGHT($B$19:$B$26,2),$C$10:$C$15,$B3)>0)*($C$18:$F$18=C$2)) + SUMPRODUCT($C$31:$F$35*(COUNTIFS($B$10:$B$15, RIGHT($B$31:$B$35,2),$C$10:$C$15,$B3)>0)*($C$30:$F$30=C$2))
There is one SUMPRODUCT for each data set. If possible, it would be better to put all your data sets into a single table with a column identifying which data set each row belongs to.
The way it works is to take each value in your data set and multiply it by whether the 2 rightmost characters of its name appear in your mapping table for that P code, multiplied by whether the value is in the correct month. So each term is 0 if either of those conditions is false. The formula then returns the sum.
UPDATE IN RESPONSE TO OP COMMENTS
If the X, Y, Z codes are not always 2 characters but the first part is ALWAYS 8 characters, you can easily amend:
RIGHT($B$19:$B$26,2)
to be:
RIGHT($B$19:$B$26,LEN($B$19:$B$26)-8)
making the formula for the first data set:
=SUMPRODUCT($C$19:$F$26*(COUNTIFS($B$10:$B$15, RIGHT($B$19:$B$26,LEN($B$19:$B$26)-8),$C$10:$C$15,$B3)>0)*($C$18:$F$18=C$2))
You can amend the formulas for the other data sets in the same way and simply add them together.
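For readers who find the SUMPRODUCT hard to follow, the same logic can be sketched in Python. All data below is hypothetical (the original tables are only available as an image); only the pattern of P1 entries in the mapping table, x1, x2 and x3, follows the question. A row contributes to a total when the trailing characters of its name, compared case-insensitively, are mapped to the requested P code.

```python
# Hypothetical mapping table: trailing code -> canonical name
mapping = {"x1": "P1", "y1": "P2", "z1": "P3", "x2": "P1", "y2": "P2", "x3": "P1"}
months = ["Jan", "Feb"]

# Hypothetical data sets: (name, [value per month])
dataset1 = [("AAAAAAAAx1", [10, 1]), ("BBBBBBBBy1", [20, 2]), ("CCCCCCCCX2", [30, 3])]
dataset2 = [("DDDDDDDDx3", [40, 4]), ("EEEEEEEEz1", [50, 5])]


def total(p_code, month, *datasets, suffix_len=2):
    """Sum one month's values over all rows whose name ends in a code mapped to p_code."""
    col = months.index(month)
    wanted = {code.lower() for code, p in mapping.items() if p == p_code}
    return sum(
        values[col]
        for ds in datasets
        for name, values in ds
        if name[-suffix_len:].lower() in wanted  # same role as RIGHT(name, 2) + COUNTIFS
    )


print(total("P1", "Jan", dataset1, dataset2))
print(total("P2", "Feb", dataset1, dataset2))
```

Passing several data sets to one call mirrors adding one SUMPRODUCT per data set in the formula version.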
Nice challenge! Are you willing to drop all your tables (DataSet1, DataSet2...) into one spreadsheet, so that we can refer just one single range for each month? Here's one solution (hopefully a good starting point) - array formula (Ctrl+Shift+Enter): =SUMPRODUCT(IFERROR(IF(TRANSPOSE(IF($B3=$C$10:$C$15,$B$10:$B$15,""))=RIGHT($B$18:$B$36,2),C$18:C$36,0),0))
Excel IF and IFERROR formula nested 140 times throws an error saying it can be used only 64 times
I have 140 unique numbers and am trying to find them in a list, for later use in VBA. The formula works fine until 64 IFs are used; beyond that I am having trouble. The result should return the number found, and I am planning to sort the orders by it in ascending order. The value in A2 looks like PMGAG5216GC, PMG005216GC, PMGVV5140GC, PMG005140GC, PMGVV5148GCW, PMGAG5117GCW, PMG005117GCW, PMGAG5204GCB, PMG005204GCB, PMGAG5238GCB, PMGVV5238GCB, PMG005238GCB, PMGAG5203GCB, etc. These are some sample order numbers that are being updated, and a number such as 5238 is what I have to find in each order so that I can sort the orders in ascending order. In the same way, I have 140 numbers that have to be found to sort the orders accordingly. The 4-digit numbers are in a fixed position in the orders, and each should be one from the 140-number list that I mentioned.
Rule of thumb: if you see yourself nesting anything deeper than 5 or 6 levels, stop and take the time to see if there wouldn't be a more easily maintainable way to do the same thing. Hitting hard limits (e.g. 64 levels of nesting) is rarely a sign that things are done in an optimal fashion.
PMGAG5216GC PMG005216GC PMGVV5140GC PMG005140GC PMGVV5148GCW PMGAG5117GCW PMG005117GCW PMGAG5204GCB PMG005204GCB PMGAG5238GCB PMGVV5238GCB PMG005238GCB PMGAG5203GCB
Assuming the format is consistently the same, you can grab the 4 characters starting at the 6th position, and then verify whether these 4 characters exist in a lookup table that contains the 140 values you're interested in. The MID function can be used to do this. You could leverage the fact that a VLOOKUP on the first column of the lookup table returns the lookup value itself, and a lookup failure returns #N/A, so wrapping it with IFERROR to turn that into an empty string would look like this:
=IFERROR(VLOOKUP(MID(A2,6,4),theLookupTable[TheLookupColumn],1,FALSE),"")
Now, if it looks like some of the values need a prefix, e.g. "00000A-", include that prefix (with the dash, so you don't have to conditionally add it in the formula) in the lookup table (say, in some [Prefix] column) where it's needed, and just concatenate it after the lookup:
=IFERROR(VLOOKUP(MID(A2,6,4),theLookupTable[TheLookupColumn],1,FALSE) & VLOOKUP(MID(A2,6,4),theLookupTable[[TheLookupColumn]:[ThePrefixColumn]],2,FALSE),"")
It is better if you can turn the MID(A2,6,4) part into a helper cell instead of computing it twice. You can also use that MID function on your source data to populate the lookup table. The lookup table might look like this:
TheLookupColumn  ThePrefixColumn
5216  00000A-
5140  00000B-
5148  00000C-
...
3901
...
Sort the table by TheLookupColumn, and the lookups should be pretty fast.
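The extract-then-look-up idea above can be sketched outside Excel as follows. This is only an illustration in Python: the order "PMGXX9999GC" and the empty prefix for 5238 are made-up entries, and `extract_code` is a name invented here. Characters 6 to 9 of the order (what MID(A2,6,4) returns) are checked against a lookup table, and the orders are then sorted by the code that was found.

```python
orders = ["PMGAG5216GC", "PMG005216GC", "PMGVV5140GC", "PMGAG5238GCB", "PMGXX9999GC"]

# Lookup table: code -> prefix (the 5238 entry is a hypothetical row without prefix)
lookup = {"5216": "00000A-", "5140": "00000B-", "5148": "00000C-", "5238": ""}


def extract_code(order):
    """Return the 4 characters starting at position 6 if they are in the lookup table, else ""."""
    code = order[5:9]  # Python slices are 0-based, so this equals MID(order, 6, 4)
    return code if code in lookup else ""


matched = [o for o in orders if extract_code(o)]
sorted_orders = sorted(matched, key=extract_code)  # stable sort keeps ties in input order
print(sorted_orders)
```

One table lookup replaces the whole chain of nested IFs, so the 64-level limit never comes into play however many codes the table holds.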
If you just want to show the first number from your lookup list which is contained in any given order number you can do something like this: It's an array formula so you need to enter it using Ctrl + Shift + Enter Assumes there can be only one match per order number and that none of the items in your lookup list are substrings of another item (though a workaround for that would be to sort your lookup list in descending order of item length)
Frequency() with arrays: adds an element to return arrays
I'm using the following formula as a named formula (via Name Manager). It is then used in a larger SUMPRODUCT(). The goal is to ensure that, within an array calculation, the calculation is made only once for certain groups of rows (e.g. the same data is repeated across many rows for category A, and I only need to know once how many people are in category A). =IF(FREQUENCY(IF(LEN(tdata[reportUUID])>0,MATCH(tdata[reportUUID], tdata[reportUUID],0),0),IF(LEN(tdata[reportUUID])>0,MATCH(tdata[reportUUID], tdata[reportUUID],0),0))>0,TRUE) Let's step through the results one by one with Evaluate Formula in Excel. Sorry for the screenshots, but Excel doesn't allow copying the actual steps with real data. In order of steps: In the last image, there's now a 7th item in my array. I only have 6 rows of data, which is why in the previous steps I only had 6 items in the array, as expected. This is messing up my calculations, because the array returned by this function gets multiplied by other arrays which all have 6 items (or however many data rows I have). What is this 7th item, and how can I either get rid of it or prevent it from returning errors? I did try wrapping some formulas in IFERROR() or IFNA(), but it doesn't feel clean. I feel this might backfire and isn't a robust way to handle this. I'd rather fix it at the source. EDIT: Example of use with other arrays: {=SUMPRODUCT(--IFERROR((tdata[_isVisible]=1)*(f_uniqueUUIDfactor),0))} where f_uniqueUUIDfactor is the formula from the initial post. tdata[_isVisible]=1 is used as a way to filter data on the dashboard (e.g. through dropdowns, the users can set date ranges, and with VBA I hide the rows in the raw data NOT within the range). The point is that SUMPRODUCT() ends up multiplying the raw data rows together as 0s and 1s, so that only those meeting all the criteria get returned. The IFERROR() above is the workaround for the extra array element introduced by FREQUENCY().
It works as is, but if a cleaner way exists I'd prefer that. I would also be keen on understanding why that element gets added.
This is a good example of why it is preferable to use multiple, recursive IF statements when evaluating arrays over multiple criteria, rather than form the product of those arrays. Firstly, though, before coming to the reason for that statement, I should point out a few minor technical inaccuracies/flaws with your construction also. 1) By including a value_if_false clause in your constructions being passed as FREQUENCY's data_array and bins_array parameters, you are risking incorrect results, since zero is a valid numerical to be considered by FREQUENCY, whereas a Boolean FALSE (which would be the equivalent entry in the resulting array had you omitted the value_if_false clause altogether) is disregarded by this function. 2) MATCH with an exact (i.e. 0, or FALSE) match_type parameter is a relatively resource-heavy construction, particularly if the range to be considered is quite large. As such, and since it is not necessary to use this construction for FREQUENCY's bins_array parameter, it is preferable to use the more efficient: ROW(tdata[reportUUID])-MIN(ROW(tdata[reportUUID]))+1 Moreover, note that repetition of the IF(LEN construction is also not necessary within this second parameter. In all, then: IF(FREQUENCY(IF(LEN(tdata[reportUUID])>0,MATCH(tdata[reportUUID],tdata[reportUUID],0)),ROW(tdata[reportUUID])-MIN(ROW(tdata[reportUUID]))+1)>0,TRUE) is considerably more rigorous and more efficient than the version you give. To answer your main question, it is well-documented that FREQUENCY always returns an array having a number of entries one greater than that of the bins_array passed. As mentioned in my comment to your post, the resolution to the problem you are facing largely depends on precisely what further manipulation you are intending for the resulting array. 
However, let's assume for the sake of an explanation that you simply wish to multiply the array resulting from your FREQUENCY construction by some other column within your table, tdata[Column2] say, and then sum the result. The difference between: =SUM(IF(FREQUENCY(IF(LEN(tdata[reportUUID])>0,MATCH(tdata[reportUUID],tdata[reportUUID],0)),ROW(tdata[reportUUID])-MIN(ROW(tdata[reportUUID]))+1)>0,TRUE)*tdata[Column2]) i.e. using multiplication of the two arrays, and: =SUM(IF(FREQUENCY(IF(LEN(tdata[reportUUID])>0,MATCH(tdata[reportUUID],tdata[reportUUID],0)),ROW(tdata[reportUUID])-MIN(ROW(tdata[reportUUID]))+1)>0,tdata[Column2])) i.e. using a straightforward IF clause, is here crucial. In fact, the former will always return an error, whereas the latter, in general, will not. The reason is that the former will resolve to (assuming that your table has e.g. 10 rows' worth of data and assuming some random Boolean results to the FREQUENCY construction): =SUM(IF({TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE},TRUE)*tdata[Column2]) which is, since the value_if_true clause is superfluous here: =SUM({TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE}*tdata[Column2]) whereas the second construction I give will resolve to: =SUM(IF({TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE},tdata[Column2])) The two may look identical, but the fact that the former is using multiplication to resolve the array, whereas the latter is not, is the key difference. Although in both cases the array resulting from the FREQUENCY construction, i.e.: {TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE} comprises 11 entries (i.e. 1 more than the number of entries in the second array being considered), the difference is that, when you then attempt to multiply an 11-element array with a 10-element array (i.e. 
tdata[Column2]), Excel, rather than outright disallowing such an operation, artificially redimensions the smaller of the two arrays such that it matches the dimensions of the larger. In doing so, however, any additional entries are automatically set as #N/A error values. Effectively, then: =SUM({TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE}*tdata[Column2]) is resolved as: =SUM({TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE}*{38;67;49;3;10;11;97;20;3;57;#N/A}) i.e., as mentioned, the second, 10-element array is redimensioned to one of 11 elements in an attempt to form a legitimate operation. And, as also mentioned, that 11th element is #N/A, which means of course that the entire construction will also result in that value. In the non-multiplication version, however, i.e.: =SUM(IF({TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE},tdata[Column2])) although the same redimensioning also takes place, we are saved by our use of an IF clause in place of multiplication, since the above resolves to: =SUM(IF({TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE},{38;67;49;3;10;11;97;20;3;57;#N/A})) and the Boolean FALSE in the 11th position here 'overrides' the error value in the equivalent position from the second array, since the above resolves to: =SUM({38;67;49;FALSE;FALSE;FALSE;FALSE;FALSE;3;57;FALSE}) Regards
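The extra element and the difference between multiplying and using IF can be reproduced with a small model. The sketch below is a simplified Python imitation of FREQUENCY for numeric input, not Excel's actual implementation: `frequency`, the six sample IDs and the six column values are all assumptions made for the example, and NaN stands in for the #N/A that pads the shorter array.

```python
import math


def is_number(x):
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def frequency(data, bins):
    """Simplified FREQUENCY: returns len(bins) + 1 counts; the last slot is the overflow."""
    counts = [0] * (len(bins) + 1)
    for value in data:
        if not is_number(value):
            continue  # text, blanks and Booleans are ignored
        candidates = [b for b in bins if b >= value]
        if candidates:
            counts[bins.index(min(candidates))] += 1  # first occurrence of that bin
        else:
            counts[-1] += 1
    return counts


ids = ["a", "a", "b", "b", "b", "c"]             # 6 hypothetical reportUUID rows
match = [ids.index(x) + 1 for x in ids]          # like MATCH(ids, ids, 0)
bins = list(range(1, len(ids) + 1))              # like ROW(ids)-MIN(ROW(ids))+1
freq = frequency(match, bins)
flags = [c > 0 for c in freq]                    # 7 flags for 6 rows

column2 = [38, 67, 49, 3, 10, 11]                # hypothetical tdata[Column2]
padded = column2 + [float("nan")]                # NaN plays the role of #N/A

product_sum = sum(f * v for f, v in zip(flags, padded))  # error value propagates
if_sum = sum(v for f, v in zip(flags, padded) if f)      # FALSE skips the padded slot
print(freq, product_sum, if_sum)
```

The final flag is always FALSE whenever every data value fits one of the bins, which is why the IF form can safely ignore the padded element while the product cannot.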
Excel array function for checking monthly values
I have an array equation to tell me the number of unique values in a column (D) based on whether the date field in another column (B) is in a particular month. My equation is: =SUM(IF(MONTH($B$2:$B$63)=10,(IF(FREQUENCY(IF(LEN(D2:D63)>0,MATCH(D2:D63,D2:D63,0),""), IF(LEN(D2:D63)>0,MATCH(D2:D63,D2:D63,0),""))>0,1))),0) This works great for October and when I change the 10 value to be another number it works for all months except january. So you can see if I have done a copying error here is the cell relating to January: =SUM(IF(MONTH($B$2:$B$63)=1,(IF(FREQUENCY(IF(LEN(D2:D63)>0,MATCH(D2:D63,D2:D63,0),""), IF(LEN(D2:D63)>0,MATCH(D2:D63,D2:D63,0),""))>0,1))),0) This always returns "N/A" Any ideas why?
There are a few things wrong with your construction. Firstly, the array you are using for the bins_array parameter, which is derived from your MATCH construction combined with an IF statement, is forcing FREQUENCY to return an array containing less than 62 elements. When this array is then compared with the initial IF clause, i.e. IF(MONTH($B$2:$B$63)=1, which does contain 62 elements, you have an issue, and, where possible, the way in which Excel resolves a comparison between two arrays of differing sizes is to artificially increase the smaller of the two so that it is of a dimension equal to that of the larger. Of course, in doing this, it fills in the missing values with #N/As (what else could it do?). Hence your result. In any case, repetition of the MATCH construction is not necessary for the bins_array parameter, and forces unnecessary extra calculation. As such, I am always surprised to see how many sources still recommend this set-up. Finally, any IF clauses should appear within the FREQUENCY construction, not without. Overall: =SUM(IF(FREQUENCY(IF(LEN(D2:D63)>0,IF(MONTH($B$2:$B$63)=1,MATCH(D2:D63,D2:D63,0))),ROW(D2:D63)-MIN(ROW(D2:D63))+1),1)) is what you should be using. Regards
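When debugging an array formula like this, it can help to cross-check the expected answer outside Excel. The following Python sketch uses made-up rows (a date from column B and a value from column D) and a hypothetical helper `distinct_in_month`; it counts the distinct non-blank values whose date falls in a given month, which is what the corrected formula is meant to return.

```python
from datetime import date

# Hypothetical (date, value) rows standing in for columns B and D
rows = [
    (date(2023, 1, 5), "A"),
    (date(2023, 1, 9), "A"),
    (date(2023, 1, 20), "B"),
    (date(2023, 10, 2), "A"),
    (date(2023, 10, 3), ""),   # blank value, must not be counted
    (date(2023, 10, 7), "C"),
]


def distinct_in_month(rows, month):
    """Count distinct non-blank values among rows whose date is in `month`."""
    return len({value for day, value in rows if day.month == month and value != ""})


print(distinct_in_month(rows, 1), distinct_in_month(rows, 10))
```

As in the corrected formula, both conditions (non-blank value, matching month) are applied before the values are deduplicated, not afterwards.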
nested excel functions with conditional logic
Just getting started in Excel and I was working with a database extract where I need to count values only if items in another column are unique. So- below is my starting point: =SUMPRODUCT(COUNTIF(C3:C94735,{"Sharable Content Object Reference Model 1.2","Authored SCORM/AICC content","Authored External Web Content"})) what i'd like to figure out is the syntax to do something like this- =sumproduct (Countif range1 criteria..., where range2 criteria="is unique value") Am I getting this right? The syntax is a bit confusing, and I'm not sure I've chosen the right functions for the task.
I just had to solve this same problem a week ago. This method works even when you can't always sort on the grouping column (J in your case). If you can keep the data sorted, #MikeD 's solution will scale better. Firstly, do you know the FREQUENCY trick for counting unique numbers? FREQUENCY is designed to create histograms. It takes two arrays, 'data' and 'bins'. It sorts 'bins', then creates an output array that's one longer than 'bins'. Then it takes each value in 'data' and determines which bin it belongs in, incrementing the output array accordingly. It returns the array. Here's the important part: If a value appears in 'bins' more than once, any 'data' value meant for that bin goes in the first occurrence. The trick is to use the same array for both 'data' and 'bins'. Think it through, and you'll see that there's one non-zero value in the output for each unique number in the input. Note that it only counts numbers. In short, I use this: =SUM(SIGN(FREQUENCY(<array>,<array>))) to count unique numeric values in <array> From this, we just need to construct arrays containing numbers where appropriate and text elsewhere. In the example below, I'm counting unique days when the color is red and the fruit is citrus: This is my conditional array, returning 1 or true for the rows I'm interested in: ($A$2:$A$10="red")*ISNUMBER(MATCH($B$2:$B$10,{"orange","grapefruit","lemon","lime"},0)) Note that this requires ctrl-shift-enter to be used as an array formula. Since the value I'm grouping by for uniqueness is text (as is yours), I need to convert it to numeric. 
I use: MATCH($C$2:$C$10,$C$2:$C$10,0) Note that this also requires ctrl-shift-enter So, this is the array of numeric values within which I'm looking for uniqueness: IF(($A$2:$A$10="red")*ISNUMBER(MATCH($B$2:$B$10,{"orange","grapefruit","lemon","lime"},0)),MATCH($C$2:$C$10,$C$2:$C$10,0),"") Now I plug that into my uniqueness counter: =SUM(SIGN(FREQUENCY(<array>,<array>))) to get: =SUM(SIGN(FREQUENCY( IF(($A$2:$A$10="red")*ISNUMBER(MATCH($B$2:$B$10,{"orange","grapefruit","lemon","lime"},0)),MATCH($C$2:$C$10,$C$2:$C$10,0),""), IF(($A$2:$A$10="red")*ISNUMBER(MATCH($B$2:$B$10,{"orange","grapefruit","lemon","lime"},0)),MATCH($C$2:$C$10,$C$2:$C$10,0),"") ))) Again, this must be entered as an array formula using ctrl-shift-enter. Replacing SUM with SUMPRODUCT will not cut it. In your example, you'd use something like: =SUM(SIGN(FREQUENCY( IF(ISNUMBER(MATCH($C$3:$C$94735,{"Sharable Content Object Reference Model 1.2","Authored SCORM/AICC content","Authored External Web Content"},0)),MATCH($J$3:$J$94735,$J$3:$J$94735,0),""), IF(ISNUMBER(MATCH($C$3:$C$94735,{"Sharable Content Object Reference Model 1.2","Authored SCORM/AICC content","Authored External Web Content"},0)),MATCH($J$3:$J$94735,$J$3:$J$94735,0),"") ))) I'll note, though, that scaling might be a problem on data sets as large as yours. I tested it on larger data sets, and it was fairly fast on the order of 10k rows, but really slow on the order of 100k rows, such as yours. The internal arrays are plenty fast, but the FREQUENCY function slows down. I'm not sure, but I'd guess it's between O(n log n) and O(n^2) depending on how the sort is implemented. Maybe this doesn't matter - none of this is volatile, so it'll just need to calculate once upon refreshing the data. If the column data is changing, though, this could be painful.
Assuming the source data is sorted by the key value [A], start by determining the occurrence count of the key in column B:
B2: =IF(A2=A1;B1+1;1)
Next determine a group sum:
C2: =SUMIF($A$2:$A$9;A2;$B$2:$B$9)
A key is unique if its group sum is exactly 1:
D2: =(C2=1)
To count records which match a certain criterion AND are unique, include column D in a formula such as =IF(AND(D2;[yourcondition]);1;0) and sum this column.
Another option is to treat a key in a sorted list as unique if it differs from both its predecessor and its successor, so you could find the unique records like this:
E2: =AND(A2<>A1;A2<>A3)
G2: =IF(AND(E2;F2="this");1;0)
E and G can of course be combined into one single formula (not sure though if that helps ...):
G2(2): =IF(AND(AND(A2<>A1;A2<>A3);F2="this");1;0)
Resolving the unnecessarily nested ANDs:
G2(3): =IF(AND(A2<>A1;A2<>A3;F2="this");1;0)
All formulas in row 2 should be copied down to the end of the list.
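The predecessor/successor test is easy to check on a small example. Below is a minimal Python sketch with hypothetical data for columns A and F (eight rows, matching the $A$2:$A$9 range used above); `unique_flags` is a name invented here. It assumes, like the formulas, that the keys are already sorted.

```python
def unique_flags(sorted_keys):
    """Flag each key that differs from both neighbours (i.e. occurs exactly once)."""
    last = len(sorted_keys) - 1
    flags = []
    for i, key in enumerate(sorted_keys):
        same_as_prev = i > 0 and sorted_keys[i - 1] == key
        same_as_next = i < last and sorted_keys[i + 1] == key
        flags.append(not same_as_prev and not same_as_next)
    return flags


keys = ["a", "b", "b", "c", "d", "d", "d", "e"]                   # column A, sorted
other = ["this", "this", "x", "this", "this", "this", "x", "x"]   # column F

flags = unique_flags(keys)                                         # column E
count = sum(1 for f, o in zip(flags, other) if f and o == "this")  # sum of column G
print(flags, count)
```

Each row only looks at its two neighbours, so the work grows linearly with the number of rows, which is why this approach scales well once the data is sorted.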