Performance problem with INDEX function and names introduced inside LET/LAMBDA - excel

Excel's INDEX function shows strange behaviour when the object being indexed is a name introduced in a LET construct or a parameter of a LAMBDA. The behaviour is consistent on Windows and Mac.
Suppose cell B2 contains the formula
=SEQUENCE(50000)
Then the following formulas are all relatively performant:
=MAP(SEQUENCE(2000),LAMBDA(x,INDEX(B2#,x)))
=LAMBDA(MAP(SEQUENCE(2000),LAMBDA(x,INDEX(B2#,x))))()
=LET(mgen,LAMBDA(B2#),LAMBDA(MAP(SEQUENCE(2000),LAMBDA(x,INDEX(mgen(),x)))))()
The following formulas, however, are terribly slow:
=LET(m,B2#,MAP(SEQUENCE(2000),LAMBDA(x,INDEX(m,x))))
=LET(m,B2#,LAMBDA(MAP(SEQUENCE(2000),LAMBDA(x,INDEX(m,x)))))()
=LAMBDA(m,MAP(SEQUENCE(2000),LAMBDA(x,INDEX(m,x))))(B2#)
The performance problem gets worse the longer the array in B2 is. You can crash Excel with the latter three formulas when you replace 50000 by 500000 in B2, while the first three formulas still work perfectly fine.
Note that the length of B2 (the array indexed) should in theory not have any impact on performance, since INDEX is called the exact same number of times in all my examples.
To me, INDEX seems to have performance problems whenever the first argument does not directly refer to a worksheet range.
Yet if that is so -- how can I efficiently (in constant time) get the n-th element of a LET/LAMBDA-named array?
I cannot work around this by writing the indexed array into a cell, since in my case, the indexed array is the result of another lambda.
Edit for clarification: The only purpose of the MAP/SEQUENCE(2000) construct in my examples is to make 2000 separate calls to INDEX, in order for the performance differences to become visible. The construct is completely unrelated to the problem. The performance problems occur whenever I make a lot of INDEX calls with the first argument being a LET/LAMBDA name.
Second edit: It seems, however, that some kind of loopy construct is needed to reproduce the problem. I have not been able to reproduce the problem with 2000 separate LET/INDEX formulas in 2000 cells.

Related

Calculate the minimum value of each column in a matrix in EXCEL

Alright this should be a simple one.
I apologize in case it has been already solved, but I can only find posts related to solving this issue with programming languages and not specifically to EXCEL.
Furthermore, I could find posts that address a sub-problem of my question (e.g. regarding limitation of certain EXCEL functions) and should solve/invalidate my request but maybe, just maybe, there is a workaround.
Problem statement:
I want to calculate the minimum value for each column in an EXCEL matrix. Simply put, I want to input a 2D array (mxn matrix) into a function and output an array with dimension 1xn where each item is the minimum value MIN(nj) of the corresponding column nj.
However, I want to solve this with specific constraints:
Avoid using VBA and other non-function scripting: that I could devise myself;
All in one function: what I want to achieve here is to have one and one function only, not split the problem into multiple passages (such as for example copypasting a MIN() function below each column, that wouldn't do it);
The result should be a transposable array (which is already ok, I assume);
Where I am stranded with my solution so far:
The main issue here is that any function I am trying to use takes the entire matrix as a single array input and would calculate the MIN() of the entire matrix, not each column. My current (not working) function for an exemplary 4x4 matrix in range A1:D4 would be as below (the part in bold is where it is clearly not working):
=MIN(INDEX(A1:D4,SEQUENCE(4,4,1,1)))
which ofc does not work, because INDEX() probably does not "understand" SEQUENCE() as an array of items to take into account. Another, not working, way of solving this is to input a series of ranges (A1:A4;B1:B4;C1:C4;D1:D4) so that INDEX() "understands" the ranges as single columns, but ofc it does not, and I sincerely do not know how to formulate that. I could use INDIRECT() in some way to reference the array of ranges, but I do not know how and could not find a way by searching online.
Fundamental question is: can a function, which works with single arrays, also work with multiple arrays? Basically, I do not know how to communicate an EXCEL array formula, that each batch of data I am inputting is a single array and must be evaluated separately (this is very easily solved with for() cycles, I know).
Many thanks for any suggestion and any workaround; any function and solution works as long as it fits the constraints defined above (maybe a LAMBDA() function? don't know).
This is ofc a simplification of a way more complex problem (I am trying to calculate the annual mean temperature evolution for a specific location by finding the value, for each year from 1950 to 2021, that is associated with the lat/lon coordinates nearest to those of the location inputted, given a netCDF-imported grid of time-arrayed data; the MIN() function is used to select the nearest location, which is then used, via INDEX(), to find temp data). I need to do this in one hit (meaning just pasting the function, which evaluates a matrix of data that is referenced by a fixed range), so that I can just use it modularly for other data sets. I already have a working solution, which is "elegant"* enough, but not as "elegant"* as the one I could develop by solving this issue.
*where "elegant"= it saves me one click every time for 1000+ datasets when applying the function.
If I understand your problem correctly, then this should solve it:
=BYCOL(A1:D4,LAMBDA(d,MIN(d)))
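For anyone who wants to sanity-check the result outside Excel, here is a minimal Python sketch of what the BYCOL formula above computes: one MIN per column, returned as a single row. The 4x4 values and the name `min_by_column` are made up for illustration; they are not from the question.

```python
# Rough Python analogue of =BYCOL(A1:D4,LAMBDA(d,MIN(d))).
# The sample values are hypothetical, chosen only to show the shape.
matrix = [
    [7, 2, 9, 4],
    [3, 8, 1, 6],
    [5, 5, 5, 0],
    [9, 1, 2, 3],
]


def min_by_column(rows):
    """Return one minimum per column, like the 1 x n result of BYCOL."""
    # zip(*rows) turns the row-major matrix into a sequence of columns.
    return [min(column) for column in zip(*rows)]


column_minima = min_by_column(matrix)
print(column_minima)
```

The point is the same as in the formula: the matrix is split into columns first, and MIN only ever sees one column at a time.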

Frequency() with arrays: adds an element to return arrays

I'm using the following formula as a named formula (via Name Manager). It is then used in a larger sumproduct(). The goal is to ensure that with an array calculation, the calculation is only made once for certain groups of rows (e.g. you have the same data repeated across many rows for category A. I only need to know how many people are in category A once).
=IF(FREQUENCY(IF(LEN(tdata[reportUUID])>0,MATCH(tdata[reportUUID],
tdata[reportUUID],0),0),IF(LEN(tdata[reportUUID])>0,MATCH(tdata[reportUUID],
tdata[reportUUID],0),0))>0,TRUE)
Let's step through the results one by one with the evaluate formula in Excel. Sorry for the screenshot, but Excel doesn't allow to copy actual steps with real data....
In order of steps:
In the last image, there's now a 7th item in my array. I only have 6 rows of data, which is why for the previous steps I only had 6 items in the array, as expected.
This is messing up my calculations, because the return array from this function gets multiplied by other arrays which all have 6 items (or whatever the number of data rows I have is).
What is this 7th item, and how can I either get rid of it or prevent it from returning errors?
I did try to wrap some of the formula in iferror() or ifna(), however it doesn't feel clean. I feel this might backfire and isn't a robust way to handle this. I'd rather tackle it at the source....
EDIT: For example of use with other arrays:
{=SUMPRODUCT(--IFERROR((tdata[_isVisible]=1)*(f_uniqueUUIDfactor),0))}
Where f_uniqueUUIDfactor is the formula from the initial post. tdata[_isVisible]=1 is used as a way to filter data on the dashboard (e.g. through dropdown, the users can set ranges for dates, and with VBA I hide the rows in the raw data NOT within the range).
The point is that sumproduct() ends up multiplying the raw data rows together as 0s and 1s, so that only those meeting all the criteria get returned. The IFERROR() above is the workaround for the extra array element introduced by frequency(). It works as is, but if a cleaner way exists I'd prefer that. I would also be keen on understanding why that element gets added.
This is a good example of why it is preferable to use multiple, recursive IF statements when evaluating arrays over multiple criteria, rather than form the product of those arrays.
Firstly, though, before coming to the reason for that statement, I should point out a few minor technical inaccuracies/flaws with your construction also.
1) By including a value_if_false clause in your constructions being passed as FREQUENCY's data_array and bins_array parameters, you are risking incorrect results, since zero is a valid numerical value to be considered by FREQUENCY, whereas a Boolean FALSE (which would be the equivalent entry in the resulting array had you omitted the value_if_false clause altogether) is disregarded by this function.
2) MATCH with an exact (i.e. 0, or FALSE) match_type parameter is a relatively resource-heavy construction, particularly if the range to be considered is quite large. As such, and since it is not necessary to use this construction for FREQUENCY's bins_array parameter, it is preferable to use the more efficient:
ROW(tdata[reportUUID])-MIN(ROW(tdata[reportUUID]))+1
Moreover, note that repetition of the IF(LEN construction is also not necessary within this second parameter.
In all, then:
IF(FREQUENCY(IF(LEN(tdata[reportUUID])>0,MATCH(tdata[reportUUID],tdata[reportUUID],0)),ROW(tdata[reportUUID])-MIN(ROW(tdata[reportUUID]))+1)>0,TRUE)
is considerably more rigorous and more efficient than the version you give.
To answer your main question, it is well-documented that FREQUENCY always returns an array having a number of entries one greater than that of the bins_array passed.
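To make the "one extra entry" concrete, here is a simplified Python model of FREQUENCY. It is only a sketch: the function name `frequency`, the sample UUID letters, and the assumptions that the bins are distinct numbers and that non-numeric data entries are skipped are mine, not taken from Excel's documentation.

```python
def frequency(data, bins):
    """Simplified model of FREQUENCY(data_array, bins_array).

    Returns len(bins) + 1 counts: one per bin, plus a final overflow
    slot for values above the largest bin.
    Assumptions: bins are distinct numbers; booleans, text and None in
    the data are skipped (this is how FALSE drops out).
    """
    def is_number(value):
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    counts = [0] * (len(bins) + 1)
    for value in data:
        if not is_number(value):
            continue
        # A value lands in the smallest bin whose upper bound is >= value.
        candidates = [i for i, upper in enumerate(bins) if upper >= value]
        if candidates:
            counts[min(candidates, key=lambda i: bins[i])] += 1
        else:
            counts[-1] += 1  # overflow slot: the "extra" element
    return counts


# Six hypothetical rows of report UUIDs, three of them distinct.
uuids = ["a", "a", "b", "c", "c", "c"]
match_positions = [uuids.index(u) + 1 for u in uuids]  # like MATCH(..., 0)
row_bins = list(range(1, len(uuids) + 1))              # like ROW(...)-MIN(ROW(...))+1

counts = frequency(match_positions, row_bins)
flags = [count > 0 for count in counts]
print(counts)  # seven counts for six rows
print(flags)
```

With six rows this yields seven counts; the last one is the overflow slot, which is always 0 here because no MATCH position can exceed the row count.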
As mentioned in my comment to your post, the resolution to the problem you are facing largely depends on precisely what further manipulation you are intending for the resulting array.
However, let's assume for the sake of an explanation that you simply wish to multiply the array resulting from your FREQUENCY construction by some other column within your table, tdata[Column2] say, and then sum the result.
The difference between:
=SUM(IF(FREQUENCY(IF(LEN(tdata[reportUUID])>0,MATCH(tdata[reportUUID],tdata[reportUUID],0)),ROW(tdata[reportUUID])-MIN(ROW(tdata[reportUUID]))+1)>0,TRUE)*tdata[Column2])
i.e. using multiplication of the two arrays, and:
=SUM(IF(FREQUENCY(IF(LEN(tdata[reportUUID])>0,MATCH(tdata[reportUUID],tdata[reportUUID],0)),ROW(tdata[reportUUID])-MIN(ROW(tdata[reportUUID]))+1)>0,tdata[Column2]))
i.e. using a straightforward IF clause, is here crucial.
In fact, the former will always return an error, whereas the latter, in general, will not.
The reason is that the former will resolve to (assuming that your table has e.g. 10 rows' worth of data and assuming some random Boolean results to the FREQUENCY construction):
=SUM(IF({TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE},TRUE)*tdata[Column2])
which is, since the value_if_true clause is superfluous here:
=SUM({TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE}*tdata[Column2])
whereas the second construction I give will resolve to:
=SUM(IF({TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE},tdata[Column2]))
The two may look identical, but the fact that the former is using multiplication to resolve the array, whereas the latter is not, is the key difference.
Although in both cases the array resulting from the FREQUENCY construction, i.e.:
{TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE}
comprises 11 entries (i.e. 1 more than the number of entries in the second array being considered), the difference is that, when you then attempt to multiply an 11-element array with a 10-element array (i.e. tdata[Column2]), Excel, rather than outright disallowing such an operation, artificially redimensions the smaller of the two arrays such that it matches the dimensions of the larger.
In doing so, however, any additional entries are automatically set as #N/A error values.
Effectively, then:
=SUM({TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE}*tdata[Column2])
is resolved as:
=SUM({TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE}*{38;67;49;3;10;11;97;20;3;57;#N/A})
i.e., as mentioned, the second, 10-element array is redimensioned to one of 11 elements in an attempt to form a legitimate operation. And, as also mentioned, that 11th element is #N/A, which means of course that the entire construction will also result in that value.
In the non-multiplication version, however, i.e.:
=SUM(IF({TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE},tdata[Column2]))
although the same redimensioning also takes place, we are saved by our use of an IF clause in place of multiplication, since the above resolves to:
=SUM(IF({TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE},{38;67;49;3;10;11;97;20;3;57;#N/A}))
and the Boolean FALSE in the 11th position here 'overrides' the error value in the equivalent position from the second array, since the above resolves to:
=SUM({38;67;49;FALSE;FALSE;FALSE;FALSE;FALSE;3;57;FALSE})
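The difference between the two constructions can be replayed in a small Python sketch. This is a model of the behaviour described above, not of Excel's internals: the `"#N/A"` string stands in for the error value, and the helper names (`pad_to`, `multiply`, `if_array`, `excel_sum`) are hypothetical. The flags are the 11-element Boolean array and the values are the 10 entries of tdata[Column2] used in the explanation.

```python
NA = "#N/A"  # stand-in for Excel's error value


def pad_to(values, length):
    """Pad the shorter array with #N/A up to the longer array's size."""
    return list(values) + [NA] * (length - len(values))


def multiply(flags, values):
    """Element-wise product; an #N/A operand makes that element #N/A."""
    return [NA if v == NA else int(f) * v for f, v in zip(flags, values)]


def if_array(flags, values):
    """Element-wise IF(flag, value): FALSE wherever the flag is FALSE."""
    return [v if f else False for f, v in zip(flags, values)]


def excel_sum(items):
    """SUM over an array: errors propagate, Booleans are ignored."""
    if NA in items:
        return NA
    return sum(x for x in items if not isinstance(x, bool))


flags = [True, True, True, False, False, False, False, False, True, True, False]
column2 = [38, 67, 49, 3, 10, 11, 97, 20, 3, 57]
padded = pad_to(column2, len(flags))

product_total = excel_sum(multiply(flags, padded))  # multiplication version
if_total = excel_sum(if_array(flags, padded))       # IF version
print(product_total, if_total)
```

In the multiplication version the padded 11th element poisons the whole sum, whereas in the IF version the FALSE flag in that position means the #N/A is never selected.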
Regards

Excel array function for checking monthly values

I have an array equation to tell me the number of unique values in a column (D) based on whether the date field in another column (B) is in a particular month.
My equation is:
=SUM(IF(MONTH($B$2:$B$63)=10,(IF(FREQUENCY(IF(LEN(D2:D63)>0,MATCH(D2:D63,D2:D63,0),""), IF(LEN(D2:D63)>0,MATCH(D2:D63,D2:D63,0),""))>0,1))),0)
This works great for October, and when I change the 10 to another number it works for all months except January. So that you can see whether I have made a copying error, here is the formula for January:
=SUM(IF(MONTH($B$2:$B$63)=1,(IF(FREQUENCY(IF(LEN(D2:D63)>0,MATCH(D2:D63,D2:D63,0),""), IF(LEN(D2:D63)>0,MATCH(D2:D63,D2:D63,0),""))>0,1))),0)
This always returns "N/A".
Any ideas why?
There are a few things wrong with your construction.
Firstly, the array you are using for the bins_array parameter, which is derived from your MATCH construction combined with an IF statement, is forcing FREQUENCY to return an array containing fewer than 62 elements.
When this array is then compared with the initial IF clause, i.e. IF(MONTH($B$2:$B$63)=1, which does contain 62 elements, you have an issue, and, where possible, the way in which Excel resolves a comparison between two arrays of differing sizes is to artificially increase the smaller of the two so that it is of a dimension equal to that of the larger.
Of course, in doing this, it fills in the missing values with #N/As (what else could it do?). Hence your result.
In any case, repetition of the MATCH construction is not necessary for the bins_array parameter, and forces unnecessary extra calculation. As such, I am always surprised to see how many sources still recommend this set-up.
Finally, any IF clauses should appear within the FREQUENCY construction, not without.
Overall:
=SUM(IF(FREQUENCY(IF(LEN(D2:D63)>0,IF(MONTH($B$2:$B$63)=1,MATCH(D2:D63,D2:D63,0))),ROW(D2:D63)-MIN(ROW(D2:D63))+1),1))
is what you should be using.
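To check the expected answer independently of Excel, the intent of that formula (count the distinct non-empty entries in column D whose date in column B falls in a given month) can be sketched in Python. The sample dates and values and the name `distinct_in_month` are hypothetical; also note that this sketch compares strings case-sensitively, whereas MATCH does not.

```python
from datetime import date


def distinct_in_month(dates, values, month):
    """Count distinct non-empty values whose date falls in the given month."""
    seen = set()
    for day, value in zip(dates, values):
        # Both conditions sit inside the loop, mirroring the nested IFs
        # placed inside FREQUENCY's data_array.
        if value != "" and day.month == month:
            seen.add(value)
    return len(seen)


# Hypothetical stand-ins for B2:B63 and D2:D63.
dates = [
    date(2015, 1, 5),
    date(2015, 1, 20),
    date(2015, 1, 28),
    date(2015, 10, 2),
    date(2015, 10, 9),
]
values = ["x", "y", "x", "x", ""]

january_count = distinct_in_month(dates, values, 1)
october_count = distinct_in_month(dates, values, 10)
print(january_count, october_count)
```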
Regards

Array evaluation of find_text argument in SEARCH() function

Say I have the following:
Entering the following formulas in cell C1 and then clicking Evaluate Formula->Evaluate produces very different results:
Formula 1: B$1:B$5 evaluates as non-array
{=SEARCH(B$1:B$5,A1)}
Formula 2: B$1:B$5 evaluates as an array
{=IF(SEARCH(B$1:B$5,A1),"")}
Why, exactly, is this? What is the cause of this behavior? If possible, please provide other examples using other Excel functions to illustrate what is happening here.
Parenthetically:
My question came about while experimenting with the accepted answer to this question.
In general, an array of values will only be returned by a worksheet function given that the following two conditions are satisfied:
1) The formula in question is either in itself capable of returning an array of values, or else is contained within a larger set-up of several functions, one or more of those which precede the function in question (and therefore act upon it) having that property. Whether that capability is something which requires coercion (i.e. via array-entry (CSE)) or is an in-built feature of the function is not important in terms of the answer you are seeking.
2) The array generated must be passed to a further function for processing. Excel is more teleological than you think: it has no great belief in returning an array of values as an end in itself.
As for your example, it's not that SEARCH, when array-entered, isn't capable of processing arrays (it is). It's more that there is no further function incited which is to act upon that array. In the IF version, there is precisely that, though again, if you process that one more time you'll find that your current array is reduced to just the first element in that array. Wrap a further function around the IF, e.g. SUM, and you'll be able to go one step further, and so on and so on.
And here is a major difference between evaluating formulas via the Evaluate Formula tool, and repeated "evaluation" via selecting various parts of the function in the formula bar and pressing F9.
The latter will always return an array of values, whether the above two conditions are satisfied or not. However - and not many people realise this - the "evaluation" so obtained can, ultimately, lead to incorrect results, and so should only be used providing one is aware of its limitations.
Take the following example, for instance:
With A1:A10 empty, the formula:
=SUMPRODUCT(0+(A1:A10=""))
correctly returns 10.
Now select just the part A1:A10 in the formula bar and press F9. Excel, being forced to "evaluate" the range, returns:
=SUMPRODUCT(0+({0;0;0;0;0;0;0;0;0;0}=""))
which, on further processing, results (correctly, it would seem) in the quite different result of 0.
Regards

INDIRECT() returns #VALUE! unexpectedly

Background: I'm using Excel functions to parse a lot of data out, essentially creating a flexible pivot table. It sorts a lot of race timing data by car, etc. In this portion of the sheet, I'm searching for the minimum segment times for each car. The rest of the sheet avoids macros and VBA so I'd like to avoid that here.
Issue: My formula works when there are no zeros, but sometimes there are zeros that I need to exclude. My array formula is pretty complicated, but the change I made that broke it is this:
OLD (working):
{=min(if(car_number = indirect("number_vector"), indirect("data_vector")))}
NEW (non-working):
{=min(if(and(car_number = indirect("number_vector"),not(0=indirect("data_vector"))), indirect("data_vector")))}
I am using INDIRECT() with this exact argument several times in the formula. However, in this particular instance (inside the NOT()), it returns #VALUE! instead of {data1;...;datan}. Please see the screencaps below.
Before evaluation:
After evaluation:
I suspect that your AND function might be the problem: AND only returns a single result, not an array of results as required. Try using multiple IFs like this:
=min(if(car_number = indirect("number_vector"),IF(indirect(data_vector)<>0, indirect(data_vector))))
Note that I also used <> rather than using NOT
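The suspicion about AND can be illustrated with a short Python sketch. It only models the point made above (AND collapses whole arrays into one Boolean, while nested IFs test each row separately); it does not reproduce the #VALUE! from the question. The function names and sample timing data are made up for illustration.

```python
def excel_and(*arrays):
    """Model of AND over arrays: every element is tested, one Boolean comes back."""
    return all(flag for array in arrays for flag in array)


def nested_if_min(car_number, numbers, data):
    """Model of MIN(IF(numbers=car_number, IF(data<>0, data))), row by row."""
    kept = [d for n, d in zip(numbers, data) if n == car_number and d != 0]
    # MIN returns 0 when no numbers survive the IFs.
    return min(kept) if kept else 0


# Hypothetical stand-ins for number_vector and data_vector.
number_vector = [12, 7, 12, 12]
data_vector = [31.4, 29.9, 0, 30.2]

collapsed = excel_and(
    [n == 12 for n in number_vector],
    [d != 0 for d in data_vector],
)
best = nested_if_min(12, number_vector, data_vector)
print(collapsed)  # a single Boolean, not one per row
print(best)
```

Because the collapsed result is a single value, the outer IF has nothing to filter row by row, which is why the nested IF form is suggested instead.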
Are data vector and number vector the same size and shape? (both vertical?)
Why are there quotes around one but not the other?
