Extract Uppercase Words on Excel Function - excel

I have supplier name together with product name in one cell as a string.
Each cell has a word that's all uppercase (sometimes with a digit or a number).
Data | I need to extract
3LAB Anti - Aging Oil 30ml | 3LAB
3LAB Aqua BB SPF40 #1 14g | 3LAB
3LAB SAMPLE Perfect Neck Cream 6ml | 3LAB
3LAB SAMPLE Super h" Serum Super Age-Defying Serum 3ml" | 3LAB
3LAB TTTTT Perfect Mask Lifting Firming Brightening 28ml | 3LAB
3LAB The Cream 50ml | 3LAB
3LAB The Serum 40ml | 3LAB
4711 Acqua Colonia Intense Floral Fields Of Ireland EDC spray 170ml | EDC
4711 Acqua Colonia Intense Pure Brezze Of Himalaya EDC spray 50m" | EDC
I need to extract only that UPPERCASE supplier name to a new cell.
I've tried to create a User Defined Function like this one, but it's not working.
It returns a #NAME? error.
Public Function UpperCaseWords(S As String) As String
    Dim X As Long
    Dim TempText As String
    TempText = " " & S & " "
    For X = 2 To Len(TempText) - 1
        If Mid(TempText, X, 1) Like "[!A-Z ]" Or Mid(TempText, X - 1, 3) Like "[!A-Z][A-Z][!A-Z]" Then
            Mid(TempText, X) = " "
        End If
    Next
    UpperCaseWords = Application.Trim(TempText)
End Function
Any idea how to correct it and make it work?
I've found it here: https://www.mrexcel.com/board/threads/formula-to-extract-upper-case-words-in-a-text-string.684934/page-2#posts
And why in this macro, in line For X = 2 To Len(TempText) - 1 the X is set to 2?

Instead of a custom-made UDF, try to use what Excel offers through built-in functionality, for example FILTERXML():
Formula used in B1:
=FILTERXML("<t><s>"&SUBSTITUTE(A1," ","</s><s>")&"</s></t>","//s[.*0!=0][translate(.,'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789','')='']")
If you are using Microsoft 365, add an @ before the function, e.g. =@FILTERXML(...), or add [1] as a third predicate in the XPath expression to tell the function to return only the first node that satisfies the previous two rules.
Let's break the formula down:
FILTERXML() - We can utilize this function to "split" a string on a particular delimiter, a space in this case.
"<t><s>"&SUBSTITUTE(A1," ","</s><s>")&"</s></t>" - Here is the part where we create a parent-child structure of start/end-tags; a valid XML-string.
//s - In the 2nd parameter of FILTERXML() we start a valid XPath expression. We want all the s-nodes (children of the t-node) that comply with the following two rules:
[.*0!=0] - Select all nodes that when multiplied with zero are not the same as zero. Meaning we don't want pure numeric substrings returned.
[translate(.,'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789','')=''] - When we translate() all the characters mentioned in the 2nd parameter of this function to nothing, the result should also be nothing, meaning the node is only made out of uppercase alpha chars and numbers.
A more in-depth explanation of how FILTERXML() works when you want to extract a particular substring can be found here. Happy coding!
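To cross-check the two predicates outside Excel, here is a minimal Python sketch of the same filtering idea. It is an approximation, not a port of FILTERXML: the function name is made up for illustration, and the sketch simply drops empty tokens (for example, those produced by double spaces).

```python
import re

# Token made only of uppercase ASCII letters and digits (the translate() rule)
UPPER_OR_DIGIT = re.compile(r"[A-Z0-9]+")
# Token made only of digits (the "pure number" rule that must NOT hold)
ALL_DIGITS = re.compile(r"[0-9]+")


def uppercase_tokens(text: str) -> list:
    """Return space-separated tokens of A-Z/0-9 that are not purely numeric."""
    return [
        tok
        for tok in text.split(" ")
        if UPPER_OR_DIGIT.fullmatch(tok) and not ALL_DIGITS.fullmatch(tok)
    ]


print(uppercase_tokens("3LAB SAMPLE Perfect Neck Cream 6ml"))
```

Taking the first element of the returned list corresponds to adding [1] to the XPath expression.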

How about this: split the text into words, then for each word ignore it if it's numeric and return it if it's the same as the uppercase version of itself:
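Public Function UpperCaseWords(S As String) As String
    Dim i As Long
    Dim word As String
    Dim words() As String
    words = Split(S, " ")
    For i = 0 To UBound(words)
        word = words(i)
        ' Skip pure numbers (e.g. "4711") and tokens with no uppercase letter (e.g. "-" or "")
        If Not IsNumeric(word) And word Like "*[A-Z]*" And word = UCase$(word) Then
            UpperCaseWords = word   ' the first all-uppercase word wins
            Exit Function
        End If
    Next
End Function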

Related

Best formula/method to extract a standard set of numbers from a string?

I have the following strings from which I need to extract 6 digit numbers. Since these strings are generated by another software, they occur interchangeably and I cannot control it. Is there any one method that would extract both 6-digit numbers from each of these strings?
Branch '100235 to 100236 Ckt 1' specified in table 'East Contingency' for record with primary key = 21733 was not found in branch or transformer data.
Loadflow branch ID '256574_701027_1' defined in supplemental branch table was not found in branch or transformer input.
Transmission element from bus number 135415 to bus number 157062 circuit ID = 1 defined for corridor 'IESO-NYISO' was not found in input data
I don't know VBA, but I can learn it if it means I can get the 6 digit numbers using a single method.
thanks
I have been using LEFT(), RIGHT() & MID() previously, but it means manually applying the appropriate formula for individual string.
If you have Microsoft 365, you can use this formula:
=LET(arr,TEXTSPLIT(SUBSTITUTE(SUBSTITUTE(A1,"'"," "),"_"," ")," "),
FILTER(arr,ISNUMBER(-arr)*(LEN(arr)=6)))
Thanks to @TomSharpe for this shorter version, which uses an array constant within TEXTSPLIT to add further possible delimiters.
=LET(arr,TEXTSPLIT(A1,{"_"," ",","," ","'"}),FILTER(arr,(LEN(arr)=6)*ISNUMBER(-arr)))
An alternative is:
=LET(ζ,MID(A1,SEQUENCE(,LEN(A1)-5),6),ξ,MID(ζ,SEQUENCE(6),1),FILTER(ζ,MMULT(SEQUENCE(,6,,0),1-ISERR(0+ξ))=6))
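The sliding-window idea behind this formula (take every 6-character window and keep the ones made entirely of digits) can be sketched in Python as follows. The function name and the `width` parameter are illustrative assumptions.

```python
def six_digit_windows(text: str, width: int = 6) -> list:
    """Return every `width`-character window of `text` that is all digits."""
    windows = (text[i:i + width] for i in range(len(text) - width + 1))
    # isascii() guards against non-ASCII characters that str.isdigit() accepts
    return [w for w in windows if w.isascii() and w.isdigit()]


print(six_digit_windows("Loadflow branch ID '256574_701027_1' defined"))
```

In this sketch a run of seven or more digits produces overlapping windows (for example, "1234567" gives "123456" and "234567"), so it fits best when the numbers are exactly six digits long.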
A couple more suggestions (if you need them):
(1) Replacing all non-digit characters with a space then splitting the resulting string:
=LET(numbers,TEXTSPLIT(TRIM(REDUCE("",MID(A1,SEQUENCE(1,LEN(A1)),1),LAMBDA(a,c,IF(is.digit(c),a&c,a&" "))))," "),FILTER(numbers,LEN(numbers)=6))
Here I've defined a function is.digit as
=LAMBDA(c, IF(c = "", FALSE, AND(CODE(c) > 47, CODE(c) < 58)))
(tl;dr I quite like doing it this way because it hides the implementation details of is.digit and creates a rudimentary form of encapsulation)
(2) A UDF - based on the example here and called as
=RegexTest(A1)
Option Explicit
' Requires a reference to "Microsoft VBScript Regular Expressions 5.5" (Tools > References)
Function RegexTest(s As String) As Double()
    Dim regexOne As Object
    Dim theNumbers As Object
    Dim Number As Object
    Dim result() As Double
    Dim i As Long
    Set regexOne = New RegExp
    ' Not sure how you would extract numbers of length 6 only, so extract all numbers...
    regexOne.Pattern = "\d+"
    regexOne.Global = True
    regexOne.IgnoreCase = True
    Set theNumbers = regexOne.Execute(s)
    i = 1
    For Each Number In theNumbers
        '...Check the length of each number here
        If Len(Number.Value) = 6 Then
            ReDim Preserve result(1 To i)
            result(i) = CDbl(Number.Value)
            i = i + 1
        End If
    Next
    RegexTest = result
End Function
Note - if you wanted to preserve leading zeroes you would need to omit the CDbl() and return the numbers as strings. The function returns an error if no 6-digit numbers are found.
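Suggestion (1) above (replace every non-digit with a space, split the result, keep the pieces of length 6) can be checked quickly outside Excel. This is a minimal Python sketch with a made-up function name; it returns strings, so leading zeroes are preserved.

```python
def six_digit_numbers(text: str) -> list:
    """Replace non-digits with spaces, split, and keep 6-character pieces."""
    spaced = "".join(ch if ch in "0123456789" else " " for ch in text)
    return [piece for piece in spaced.split() if len(piece) == 6]


print(six_digit_numbers("Branch '100235 to 100236 Ckt 1' with primary key = 21733"))
```

Unlike a sliding window, this only keeps digit runs that are exactly six characters long, so the 5-digit key 21733 and the single-digit circuit IDs are dropped.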

Extract different numbers from multiple strings

In a .csv spreadsheet, I have multiple strings with incrementing numerical values contained in each, and I need to extract the numbers from each string. For example, here are some of the strings:
DEVICE1.CM1 - 4.1.1.C1.CA_VALUE (A)
DEVICE1.CM2 - 6.7.1.C2.CA_VALUE (A)
DEVICE1.CM1 - 4.1.2.C1.CA_VALUE (A)
DEVICE1.CM1 - 4.1.2.C2.CA_VALUE (A)
DEVICE1.CM1 - 4.1.2.C3.CA_VALUE (A)
DEVICE1.CM1 - 5.1.1.C1.CA_VALUE (A)
DEVICE1.CM1 - 5.1.1.C2.CA_VALUE (A)
DEVICE1.CM1 - 5.10.1.C3.CA_VALUE (A)
DEVICE1.CM1 - 6.13.1.C10.CA_VALUE (A)
And I am looking to extract "4.1.1.C1" from the first string, and "6.7.1.C2" from the second string.
I have over 1000 strings, each with a different incremental value in the form of "#.#.#.C.#" and all of the options I have tried so far involve searching for a specific value to extract, rather than extracting all values of that general form. Is there any reasonable way to accomplish this?
I am not a big fan of regular expressions because they are often hard to read, but this is a typical example where you should use them. Read carefully the Q&A BigBen linked to in the comments.
Function extractCode(s As String) As String
    Static rx As RegExp
    If rx Is Nothing Then Set rx = New RegExp
    ' \d+ after the C so that multi-digit suffixes such as C10 are matched in full
    rx.Pattern = "\d+\.\d+\.\d+\.C\d+"
    If rx.Test(s) Then
        extractCode = rx.Execute(s)(0)
    End If
End Function
(You will need to add a reference to the Microsoft VBScript Regular Expressions library.)
Update: the dots need to be escaped; otherwise each dot is a placeholder for any character and the pattern would also match something like "4x1y2zC3".
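The effect of escaping the dots is easy to verify with a small Python sketch. The names below are illustrative, and the pattern uses `C\d+` on the assumption that multi-digit suffixes such as C10 should be captured in full.

```python
import re

# Escaped dots match literal periods only
CODE_PATTERN = re.compile(r"\d+\.\d+\.\d+\.C\d+")
# Unescaped dots match any character, shown here only for comparison
LOOSE_PATTERN = re.compile(r"\d+.\d+.\d+.C\d+")


def extract_code(text: str) -> str:
    """Return the first '#.#.#.C#' code in text, or '' if there is none."""
    match = CODE_PATTERN.search(text)
    return match.group(0) if match else ""


print(extract_code("DEVICE1.CM1 - 6.13.1.C10.CA_VALUE (A)"))
```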
So here goes:
MID(A1,FIND("-",A1,1)+2,(FIND("_",A1,1)-FIND("-",A1,1))-5)
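The position arithmetic in this formula can be traced with a short Python sketch. It assumes, as the formula does, that the code starts two characters after the first "-" and ends just before ".CA_VALUE"; the function name is made up.

```python
def extract_between(text: str) -> str:
    """Mirror MID/FIND: start 2 characters after '-', stop before '.CA_'."""
    dash = text.find("-")
    underscore = text.find("_")
    if dash == -1 or underscore == -1:
        return ""
    start = dash + 2                  # skip the "- " after the device name
    length = underscore - dash - 5    # drop "- " in front and ".CA" behind
    return text[start:start + length]


print(extract_between("DEVICE1.CM1 - 4.1.1.C1.CA_VALUE (A)"))
```

Because it relies on the first "-" and the first "_", it would break if the device name itself contained either character.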
The fixed structure (items are always preceded by " - " and followed by ".CA_VALUE (A)") makes it possible to isolate the code string via Split as follows:
(1) treat ".CA_VALUE (A)" as the closing delimiter, but replace its occurrence(s) with "- "
(2) execute Split on the resulting string using only the first delimiter (StartDelim "- ")
(3) isolate the second token (index 1, as Split results are zero-based)
Function ExtractCode(ByVal s As String) As String
    Const StartDelim As String = "- "
    Const ClosingDelim As String = ".CA_VALUE (A)"
    ExtractCode = Split(Replace(s, ClosingDelim, StartDelim), StartDelim)(1)
End Function
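The same replace-then-split idea, sketched in Python to make the intermediate tokens visible. The function and parameter names are illustrative, and the guard for missing delimiters is an addition that the VBA function does not have.

```python
def extract_code_by_split(
    text: str,
    start_delim: str = "- ",
    closing_delim: str = ".CA_VALUE (A)",
) -> str:
    """Turn the closing delimiter into the start delimiter, split, take token 1."""
    tokens = text.replace(closing_delim, start_delim).split(start_delim)
    # tokens looks like ["DEVICE1.CM1 ", "4.1.1.C1", ""] for a well-formed input
    return tokens[1] if len(tokens) > 1 else ""


print(extract_code_by_split("DEVICE1.CM1 - 4.1.1.C1.CA_VALUE (A)"))
```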
Another approach with focus on splitting via point delimiters //Edit 2021-11-20
If you want to experiment with a fixed start position of your 4-items code in a split array (based on point delimiters "."),
you might also consider the following approach:
split via point delimiters "."
filter only the 3rd,4th,5th and 6th item via WorksheetFunction.Index (by its columns argument)
join the resulting items again via connecting points "."
a) Using (Excel) version MS 365
Function ExtractCode(ByVal s As String, Optional startPos As Long = 3) As Variant
    Const delim As String = "."
    Dim tmp
    tmp = Split(Replace(s, "- ", delim), delim) ' normalize hyphen to point delimiter
    With Application.WorksheetFunction
        ExtractCode = Join(.Index(tmp, 0, .Sequence(1, 4, startPos)), ".")
    End With
End Function
b) Make it backwards compatible
Just change the function result assignment to
ExtractCode = Join(.Index(tmp, 0, Evaluate("{1,2,3,4}-1+" & startPos)), ".")
which in both cases changes the Index column argument to a 1-based column number Array(3,4,5,6)

Stata flag when word found, not strpos

I have some data with strings, and I want to flag when a word is found. A word would be defined as being at the start of the string, at the end, or separated by a space. strpos will find whenever the string is present, but I am looking for something similar to subinword. Does Stata have a way to use the functionality of subinword without having to replace it, and instead flag the word?
clear
input id str50 strings
1 "the thin th man"
2 "this old then"
3 "th to moon"
4 "moon blank th"
end
gen th_pos = 0
replace th_pos = 1 if strpos(strings, "th") > 0
The code above will flag every observation, as they all contain "th", but my desired output is:
ID strings th_sub
1 "the thin th man" 1
2 "this old then" 0
3 "th to moon" 1
4 "moon blank th" 1
A small trick is that "th" as a word will be preceded and followed by a space, except if it occurs at the beginning or the end of string. The exceptions are no challenge really, as
gen wanted = strpos(" " + strings + " ", " th ") > 0
works around them. Otherwise, there is a rich set of regular expression functions to play with.
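The padding trick is language-independent, so it can be checked against the example data with a minimal Python sketch (the function name is made up):

```python
def has_word(text: str, word: str) -> bool:
    """Pad both strings with spaces so start/end of string behave like spaces."""
    return f" {word} " in f" {text} "


rows = ["the thin th man", "this old then", "th to moon", "moon blank th"]
print([int(has_word(row, "th")) for row in rows])
```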
Incidentally, the code in the question that doesn't do what you want condenses to one line:
gen th_pos = strpos(strings, "th") > 0
A more direct answer is that you don't have to replace anything. You just have to get Stata to tell you what would happen if you did:
gen WANTED = strings != subinword(strings, "th", "", .)
If removing a substring if present changes the string, it must have been present.
Regular expressions can be useful for this type of exercise, with word boundaries allowing you to search for whole words indicated by \b, as in "\bword\b".
gen wanted = ustrregexm(strings, "\bth\b")
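For comparison, here is the word-boundary approach as a Python sketch; the function name is illustrative, and Python's `re` module stands in for Stata's regex engine, so edge cases may differ between the two.

```python
import re


def has_word_regex(text: str, word: str) -> bool:
    """True if `word` occurs in `text` delimited by regex word boundaries."""
    return re.search(rf"\b{re.escape(word)}\b", text) is not None


rows = ["the thin th man", "this old then", "th to moon", "moon blank th"]
print([int(has_word_regex(row, "th")) for row in rows])
```

One design difference worth knowing: a word boundary also sits next to punctuation, so in this sketch "th-man" counts as containing the word "th", whereas a test based purely on surrounding spaces would not flag it.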

Sum numbers in a delimited text and return average

I have the following numbers as text in a cell:
"12-14-14-16-18-10"
And now I need to calculate the average but I do not want to create extra columns since the length of the data varies.
Is there any way to do this using a formula?
In other words: you want to split the string value by the "-" character and calculate the average of its elements? AFAIK the only way to solve this is using a small macro (AKA user-defined function), since LO Calc doesn't provide a split/tokenize function on spreadsheet level.
A quick and dirty solution may look as follows:
Function split_average(a)
    Dim parts As Variant
    Dim sumVal As Double    ' Double avoids Integer overflow and keeps decimals
    Dim i As Long
    parts = Split(a, "-")
    For i = LBound(parts) To UBound(parts)
        sumVal = sumVal + CDbl(parts(i))
    Next i
    split_average = sumVal / (UBound(parts) - LBound(parts) + 1)
End Function
Of course, there's no type checking and so on, so use it at your own risk. To use it, just copy it into the StarBasic Standard module, save, and call it inside your spreadsheet using =split_average(A1). For user-defined functions in general, see the LO Calc docs.
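As a quick check of the expected result for the sample value, the same split-and-average logic looks like this in Python (a sketch with a made-up function name and no validation of the input, just like the macro):

```python
def split_average(text: str, sep: str = "-") -> float:
    """Split `text` on `sep` and return the arithmetic mean of the parts."""
    values = [float(part) for part in text.split(sep)]
    return sum(values) / len(values)


print(split_average("12-14-14-16-18-10"))  # 84 / 6 = 14.0
```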
I know this is an old thread, but let me add my two cents. Use FILTERXML and some XPATH magic =)
=FILTERXML("<t><s>"&REGEX(A1;"-";"</s><s>";"g")&"</s></t>";"sum(//s) div count(//s)")
We could even implement a check on numeric nodes:
=FILTERXML("<t><s>"&REGEX(A1;"-";"</s><s>";"g")&"</s></t>";"sum(//s[.*0=0]) div count(//s[.*0=0])")
Looking at this, I find it unfortunate that Excel won't allow direct manipulation inside the expression itself. Surprised LibreOffice does!

Excel 2007 find the largest number in text string

I have Excel 2007. I am trying to find the largest number in a cell that contains something like the following:
[[ E:\DATA\SQL\SY0\ , 19198 ],[ E:\ , 18872 ],[ E:\DATA\SQL\ST0\ , 26211 ],[ E:\DATA\SQL\ST1\ , 26211 ],[ E:\DATA\SQL\SD0\ , 9861 ],[ E:\DATA\SQL\SD1\ , 11220 ],[ E:\DATA\SQL\SL0\ , 3377 ],[ E:\DATA\SQL\SL1\ , 1707 ],[ E:\DATA\SQL_Support\SS0\ , 14375 ],[ E:\DATA\SQL_Support\SS1\ , 30711 ]]
I am not a coder but I can get by with some basic instructions. If there is a formula that can do this, great! If the best way to do this is some sort of backend code, just let me know. Thank you for your time.
I do have the following formula that almost gets me there:
=SUMPRODUCT(MID(0&A2,LARGE(INDEX(ISNUMBER(--MID(A2,ROW(INDIRECT("1:"&LEN($A$2))),1))*ROW(INDIRECT("1:"&LEN($A$2))),0),ROW(INDIRECT("1:"&LEN($A$2))))+1,1)*10^ROW(INDIRECT("1:"&LEN($A$2)))/10)
With a cell that contains a string like above, it will work. However, with a string that contains something like:
[[ E:\DATA\SQL\SY0\ , 19198.934678 ],[ E:\ , 18872.2567 ]]
I would end up with the value of 19198934678 as the largest value.
You can use this UDF:
Function MaxInString(rng As String) As Double
    Dim splt() As String
    Dim i&
    splt = Split(rng)   ' default delimiter is a space
    For i = LBound(splt) To UBound(splt)
        If IsNumeric(splt(i)) Then
            ' Convert explicitly so the comparison is always numeric
            If CDbl(splt(i)) > MaxInString Then
                MaxInString = CDbl(splt(i))
            End If
        End If
    Next i
End Function
Put this in a module attached to the workbook. NOT in the worksheet or ThisWorkbook code.
Then you can call it like any other formula:
=MaxInString(A1)
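To see why splitting on spaces handles the decimal case, here is a minimal Python sketch of the same logic. The names are illustrative, and "numeric" is narrowed here to plain unsigned decimals, which is an assumption and not the same test as VBA's IsNumeric.

```python
import re

# Plain unsigned decimal such as "19198" or "19198.934678" (assumed definition)
PLAIN_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def max_in_string(text: str) -> float:
    """Split on spaces and return the largest numeric token (0.0 if none)."""
    best = 0.0
    for token in text.split(" "):
        if PLAIN_NUMBER.fullmatch(token):
            best = max(best, float(token))
    return best


print(max_in_string(r"[[ E:\DATA\SQL\SY0\ , 19198.934678 ],[ E:\ , 18872.2567 ]]"))
```

The decimal point stays inside its token, so 19198.934678 is read as one number instead of being collapsed into 19198934678.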
If there is always a space before and after, you can use this formula. The formula is an array formula and must be confirmed by holding down ctrl + shift while hitting enter
With your string in A1:
=MAX(IFERROR(--TRIM(MID(SUBSTITUTE(A1," ",REPT(" ",99)),IF(seq=1,1,(seq-1)*99),99)),0))
seq is a defined name that refers to:
=ROW(INDEX(Sheet1!$1:$65536,1,1):INDEX(Sheet1!$1:$65536,255,1))
If a VBA UDF is preferable, I suggest the following. The regex will match anything that might be a number. The number is expected to be in the format iiii.dddd, where the integer part and the decimal point are both optional.
Option Explicit
Function LargestNumberFromString(S As String) As Double
    Dim RE As Object, MC As Object, M As Object
    Dim D As Double
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .Pattern = "\b[0-9]*\.?[0-9]+\b"
        If .Test(S) = True Then
            Set MC = .Execute(S)    ' collect all matches
            For Each M In MC
                D = IIf(D > CDbl(M), D, CDbl(M))
            Next M
        End If
    End With
    LargestNumberFromString = D     ' largest value found (0 if there were no matches)
End Function
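The pattern can be tried against the sample strings with a short Python sketch (illustrative names; Python's `re` is used here in place of the VBScript engine):

```python
import re

# Optional integer part, optional decimal point, at least one digit,
# all delimited by word boundaries so digits inside names like "SY0" are skipped
NUMBER = re.compile(r"\b[0-9]*\.?[0-9]+\b")


def largest_number(text: str) -> float:
    """Return the largest number matched in `text`, or 0.0 if none is found."""
    return max((float(m) for m in NUMBER.findall(text)), default=0.0)


print(largest_number(r"[[ E:\DATA\SQL\SY0\ , 19198.934678 ],[ E:\ , 18872.2567 ]]"))
```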
This is going to be a rather complex answer for someone with no programming background, so be prepared to spend a lot of time covering and researching this topic if you really wish to achieve a function that finds the largest number in a string in excel.
The solution requires the use of VBA and regular expressions.
VBA is used in Excel when there is a need for more complex functionality that just can't be achieved with the built-in spreadsheet functions.
Regular expressions are a language used to tell programs how to extract useful information from text. In this case, we can extract all the numbers in your text with the following regular expression:
(\d+\.?\d*)/g
Which roughly means: Match one or more digits with an optional period and subsequent optional digits.
The program that will interpret this will do the following: Look for digits, if you see one, then that's a match, grab all contiguous digits and add them to the match. Once you find a character that is not a digit, start looking for new matches. if at any point you find a dot, add it to the match, but just once, and keep on looking for digits. Rinse and repeat until the end of the text.
You can test it here. In this case, the regex matches 19 numbers.
http://www.regextester.com/
Once you have a collection with the 19 matches (See link to regular expressions), all you would need to do is to loop over each of the matches to find out which of the numbers is the highest:
For Each number In matches
    If number > highestNumber Then
        highestNumber = number
    End If
Next
And highestNumber will be the result! In order to have this code run in a simple custom function, you can follow this Microsoft tutorial ( https://support.office.com/en-us/article/Create-Custom-Functions-in-Excel-2007-2f06c10b-3622-40d6-a1b2-b6748ae8231f?ui=en-US&rs=en-US&ad=US&fromAR=1 )
Where c is your string to find max from
Dim qwe() As String
qwe = Split(c, ", ")
maxed = 0
For x = LBound(qwe) To UBound(qwe)
    qwe(x) = Left(qwe(x), InStr(1, qwe(x), " ", vbBinaryCompare))
    On Error Resume Next
    If CLng(qwe(x)) > maxed Then maxed = CLng(qwe(x))
Next x
MsgBox maxed
The error line is there to ignore when qwe(x) cannot be converted to a LONG number.
I must say this is very specific to your string format, for a more comprehensive doodad you'd want to have the split delimiter as a variable and possibly use the "IsNumeric" function to scan the entire string.

Resources