Normalize text with Excel - excel

I need a function to normalize data.
For example: in column A there are the reference data (written correctly) and in column B the data to correct.
Is there a function that can find the most similar data (maybe based on consecutive characters in common) and substitute the wrong written data with the right one, as the following example?
Thanks
Tried some formulas but did'nt work

If data does not get more complicated than that, try:
Formula in C1:
=LET(x,TOCOL(A:A,3),MAP(TOCOL(B:B,3)&" ",LAMBDA(a,XLOOKUP(TEXTJOIN("*",0,"",TEXTSPLIT(a,TEXTSPLIT(a,CHAR(ROW(65:90)),,1),,1),""),VSTACK(x,SUBSTITUTE(x," ",)),VSTACK(x,x),,2))))
Idea here is:
Variable x - Return an array of values in column A;
Use MAP() to loop all values in column B;
First TEXTSPLIT() the value on all uppercase chars to then split the input based on the remainder;
TEXTJOIN() the returned array with an asterisk to allow for wildcard matching;
Use XLOOKUP() for a wildcard match in an array where 'x' is vstacked with it's space-removed counterpart.
This should now return the 1st match that has the same sequence of characters.
I do agree with the comment-section that for a better match on similarity, you may want to look at other options. One option that I could think of while still using formulae is:
=LET(x,TOCOL(A:A,3),MAP(TOCOL(B:B,3)&" ",LAMBDA(a,LET(z,FILTER(VSTACK(x,x),ISNUMBER(SEARCH(TEXTJOIN("*",0,"",TEXTSPLIT(a,TEXTSPLIT(a,CHAR(ROW(65:90)),,1),,1),""),VSTACK(x,SUBSTITUTE(x," ",))))),#SORTBY(z,LEN(z))))))
Way more verbose, but here you'd FILTER() all values that could be a wildcard match using the same logic as the other formula. Though here, the outcome is filtered by length, making it a little bit more likely to return the value that shared the best similarity.

Related

Excel sumifs with two values throwing an error

I am trying to get a sum of column C if A is abc or cba and B is def:
=SUMIFS(C2:C51;A2:A51;{"abc","cba"};B2:B51;"def")
But the formula is not valid, not sure where is my mistake since this was proposed in a quick google search.
Thank you for your suggestions.
The formula is valid for me, but this might be an issue with your delimiter. Depending on your excel, windows or location settings you might need to use a comma , as a delimiter, instead of a semicolon ;.
As for your formula, for completion I've done the same google search and ended up with this reference. It seems your logic in the formula is correct apart from one crucial step, the SUM( wrapping around your formula. This means if your formula works, it will only take the first hit into account, but with the sum, it will count every entry where your logic is True. Syntax:
=SUM(SUMIFS(C2:C51,A2:A51,{"abc","cba"},B2:B51,"def"))
Or semicolon delimited:
=SUM(SUMIFS(C2:C51;A2:A51;{"abc";"cba"};B2:B51;"def"))
Since the {array} option does not seem to be working for you, I propose a workaround as follows:
=SUMIFS(C1:C15;A1:A15;"abc";B1:B15;"def")+SUMIFS(C1:C15;A1:A15;"cba";B1:B15;"def")
This is a more clunky function, but reaches the same result by splitting up the data in two SUMIFS( functions and adding the results together.
Probably I would use #Plutian answer (actually I upvoted), but in case it might work for you, you can use SUMPRODUCT combined with DOUBLE UNARY to get exactly what you want.
DOUBLE UNARY
SUMPRODUCT
I made a fake dataset like this:
As you can see, only the highlighted values meet your requirements ( if A=abc OR cba AND B=def)
My formula in E10 is:
=SUMPRODUCT(--($A$2:$A$7="abc")+--($A$2:$A$7="cba");--($B$2:$B$7="def");$C$2:$C$7)
This is how it works:
($A$2:$A$7="abc") will return an array of True/False values if condition is met.
That array, because it's inside a double unary operator --( your range ), will convert all True/False values into 1 or 0 values. Let's say it works like if you would have selected a range of cells that contains only 1 or 0. So this will return an array like {1,0,1,0,1,0} in this case
--($A$2:$A$7="cba") will do exactly the same than steps 1 or 2 again, but with your second option. It will return another array of values, in this case, {0,1,0,1,0,1}
--($A$2:$A$7="abc")+--($A$2:$A$7="cba") we are just summing up both arrays, so {1,0,1,0,1,0}+{0,1,0,1,0,1}={1,1,1,1,1,1}
--($B$2:$B$7="def") will do like steps 1 and 2 again with your third condition, and will return another array, now it will be {1,0,1,0,0,1}
The array obtained in step 5 then it's multiplied to array obtained in step 4, so we are doing {1,1,1,1,1,1} * {1,0,1,0,0,1}={1,0,1,0,0,1}
Now, that final array obtained in step 7 then it's multiplied by the values of cells $C$2:$C$7, so in this case is {1,0,1,0,0,1} * {10,1,10,1,1,10} = {10,0,10,0,0,10}
And final step, we sum up all values inside array obtained in last step, so we do 10+0+10+0+0+10=30
I've explained every step to make sure everybody can understand, because SUMPRODUCT it's really an useful function if you know how to hanlde (I'm a noob, but I've seen real heroes here on SO using this function).
The advantage of using SUMPRODUCT instead of SUMIFS is that you can easily add more conditions to apply same range (case --($A$2:$A$7="abc")+--($A$2:$A$7="cba") or single condition to additional ranges (case --($B$2:$B$7="def")).
With normal SUMIFS probably you would have to add 1 extra complete SUMIF for each condition applied in same range.
Hope this helps

How can I force the Excel SMALL function to return all results satisfying an IF criterion?

I have a spreadsheet in which I would like to return the text of all cells in MyTable field "Result" in which the corresponding text in MyTable field "Query" matches a string in $C$1, as a list. I want to do this using formulas. I know that my formula will need to have the following form:
=INDEX(MyTable[Result],SMALL(IF(MyTable[Query]=C1,ROW(MyTable[Result]),""),___))
The issue is that I'm not sure what should go in the ___. I want my formula to return strings from all relevant cells, not just a fixed number of cells, because I don't know now many cells match the criteria. As far as I understand, SMALL throws an error if the number provided as the last argument exceeds the number of non-empty strings returned by the IF function. Is there any way around this?
You k argument can be an array. And the upper bounds can probably be calculated using COUNTIF in your case.
Arrays can be constructed in a variety of ways.
One way (not the most efficient) would be something like:
=row(indirect("1:" & countif(MyTable[Query],C1)))
You can also use the INDEX function to construct the array, and, if speed is an issue, this would be better as it is a non-volatile function.

Is there a function, that would return "" when input ref is empty, and its contents if it is not?

IMO Excel has weird treatment of empty Cells.
I am building a complex array formula. One of the referenced ranges contain cells, that may, or may not be empty, and if not empty, they can contain both numeric values and strings.
What function can I use, to get the value of the cell if the cell is not empty, and "" (or anything other non-numeric, e.g. #N/A) if the cell is empty?
I want to get something like this working:
=MIN(OFFSET(<column vector that contains text, numbers and empty cells>;<row vector of indices>-1;0))
This form of formula returns an #ARG error, as was explained in the answer to Why this array formula doesn't work?.
But when I prefix the OFFSET with N, it transforms any empty cell into 0, so the net result is 0 (unless there are negative numbers in the column vector).
=MIN(L(OFFSET(<column vector>;<row vector of indices>-1;0)))
Is there any formula, that only dereferences the reference returned by OFFSET preserving the "emptyness" of the empty cell? Or maybe there is an alternate way of solving the problem, like
=MIN(IF(OFFSET(<column vector>;<row vector of indices>-1;0)="",L(OFFSET(<column vector>;<row vector of indices>-1;0)),""))
(This example also fails with #ARG, because, as I understand, I need to dereference the array reference for the = test as well).
If it is at all possible, I prefer to keep with Excel 2007 set of built-in functions. And no VBA.
I would accept any solution, that uses constant number of cells irregardless of the size of each input array.
EDIT:
As a side remark I wonder what is wrong with the arrays returned by OFFSET anyway? This simple example works perfectly:
...while the array returned by OFFSET somehow wants to be alone in the formula.
There may be another option but I don't see it at the moment.....
You can filter out zeroes by using an IF like this
=MIN(IF(N(OFFSET(INDIRECT($A$2),$C4:$G4-1,0))<>0,N(OFFSET(INDIRECT($A$2),$C4:$G4-1,0))))
but that won't distinguish between any actual zeroes in your range and those produced when the N function encounters blanks or text
Edit
This version should work
=MIN(IF(COUNTBLANK(OFFSET(INDIRECT($A$2),$C4:$G4-1,0,1))+LEN(T(OFFSET(INDIRECT($A$2),$C4:$G4-1,0,1))),"",N(OFFSET(INDIRECT($A$2),$C4:$G4-1,0,1))))
Perhaps not completely relevant here but the problem is finding functions that can deal with the "array of references" returned by OFFSET with this type of setup - N and T work as shown here and also COUNTBLANK. Other functions that can be used on the OFFSET output are SUBTOTAL and COUNTIF. Note that COUNTBLANK (along with SUBTOTAL and COUNTIF) can work on ranges while T and N will only work with single values - if the latter functions are applied to ranges they simply look at the first value in the range - because of that I was able to use OFFSET without the "height" parameter but you need that with COUNTBLANK (and it's a good habit to get in to so OFFSET should have the final 1 as here
=OFFSET(INDIRECT($A$2),$C4:$G4-1,0,1)
Consider in B1:
=IF(ISBLANK(A1),"nothing there",A1)
So if A1 contains the formula:
=""
Then B1 will also display the null.
Another possibility:
=MIN(CELL("contents",OFFSET(INDIRECT($A$2),N(INDEX($C4:$G4-1,)),0)))
Note: The N(INDEX(...)) part is a trick used to enforce array evaluation.

Why does Excel MATCH() not find a match?

I have a table with some numbers stored as text (UPC codes, so I don't want to lose leading zeros). COUNTIF() recognizes matches just fine, but MATCH() doesn't work. Is there a reason why MATCH() can't handle numbers stored as text, or is this just a limitation I'll have to work around?
Functions like MATCH, VLOOKUP and HLOOKUP need to match data type (number or text) whereas COUNTIF/SUMIF make no distinction. Are you using MATCH to find the position or just to establish whether the value exists in your data?
If you have a numeric lookup value you can convert to text in the formula by using &"", e.g.
=MATCH(A1&"",B:B,0)
....or if it's a text lookup value which needs to match with numbers
=MATCH(A1+0,B:B,0)
If you are looking for the word test for example in cell A2, type the following:
=MATCH(""&"test"&"",A2,0)
If this isn't working then try =Trim and =Clean to purify your column.
If =Trim and =Clean don't help, then just use the left 250 characters...
(As the Match Formula may encounter a timeout/overflow after 250 characters.)
=Left(A2, 250)
If you are using names to refer to the ranges, once you have fixed the datatypes also redefine any names which refer to those ranges.

simplifying Excel formula currently using INDEX, ROW, SUMPRODUCT and IFERROR

Does anyone have any brilliant ideas to simplify this difficult formula? Don't panic when you see it, I will try to explain.
=IFERROR(INDEX(rangeOfDesiredValues,(1/SUMPRODUCT((rangeOfSerials=$D20)(rangeOfApps=cfgAppID)(rangeOfAccessIDs=cfgAccessID)*ROW(rangeOfDesiredValues))^-1)),"")
Currently I am using SUMPRODUCT to do the equivalent of a VLOOKUP with multiple columns as criteria. Usually that only works with number results, but since I need to find text, I'm using SUMPRODUCT in combination with ROW and INDEX.
Unfortunately when no cell is found, my SUMPRODUCT returns 0. This causes the formula to return the incorrect cell rather than blank. For this reason I am running the result through this calculation:
(1 / result)^-1
This way results of 0 become an error, and other results remain unchanged. I feed this into IFERROR, so that errors become blanks.
Does anyone know how to make this neater? I am not able to create new columns in any of my spreadsheets.
It's always best to avoid using multi-condition summing functions like SUMPRODUCT when you want to find a single value (it would obviously give you an incorrect result or error if there's more than one row which matches all three conditions, I assume you expect one match at most here?). ROW function can also be problematic if you insert any rows in the worksheet.....
There are several approaches that can work. For a single formula, using MATCH is the most common - MATCH will only give the correct position or an error so no problems with zero values. That would look like this:
=IFERROR(INDEX(rangeOfDesiredValues,MATCH(1,(rangeOfSerials=$D20)*(rangeOfApps=cfgAppID)*(rangeOfAccessIDs=cfgAccessID),0)),"")
That's an "array formula" that needs to be entered with CTRL+SHIFT+ENTER......or you can make it into a regular formula with an extra INDEX function like this
=IFERROR(INDEX(rangeOfDesiredValues,MATCH(1,INDEX((rangeOfSerials=$D20)*(rangeOfApps=cfgAppID)*(rangeOfAccessIDs=cfgAccessID),0),0)),"")
A third alternative is to use LOOKUP which doesn't need "array entry"
=IFERROR(LOOKUP(2,1/(rangeOfSerials=$D20)/(rangeOfApps=cfgAppID)/(rangeOfAccessIDs=cfgAccessID),rangeOfDesiredValues),"")
That differs slightly from the previous versions in the case of multiple matches - it will give you the last match rather than the first in that scenario (but I assume you have only one match at most, as stated above).
Finally, if you don't mind using helper columns you could simplify the formulas considerably. Just use a "helper" column to concatenate the three criteria columns separated by dashes and then you can use a simple VLOOKUP or INDEX/MATCH, e.g.
=IFERROR(INDEX(rangeOfDesiredValues,MATCH($D20&"-"&cfgAppID&"-"&cfgAccessID,Helper_Column,0)),"")

Resources