Excel Find alphanumeric string from text - excel

I have a column containing multiple string values, like a sentence.
In that sentence I want to find one or all alphanumeric values of 10 or more characters containing at least one hyphen (-), and put the resulting values in another column.
For example:
the column containing sentence is like:
upgrade 15.07.2010, old No: WI82-01062. User moved to No: WI12-01012 02.04.2012 to a 2 user network.
or
Upgrade from lite 7/6/07, old No: PTX7-89C367EC5052-01211
Ideally I want a column with values like WI82-01062, WI12-01012 for the first example, and PTX7-89C367EC5052-01211 for the second example.
Maybe searching for the - in the string and then finding the first blank space on either side of it would help, but I have no idea how to write that in Excel terms.
Thanks

You could probably use a regex like this (there may be better patterns!):
Function ExtractData(r As Variant) As String
    ' Static: the regex object is created once and reused on later calls
    Static oRE As Object
    Dim sTemp As String
    Dim n As Long
    Dim matches
    If oRE Is Nothing Then
        Set oRE = CreateObject("vbscript.regexp")
        With oRE
            ' Runs of 10+ letters, digits or hyphens (a hyphen is allowed but not required)
            .Pattern = "[A-Za-z0-9\-]{10,}"
            .Global = True
        End With
    End If
    Set matches = oRE.Execute(r)
    If matches.Count > 0 Then
        For n = 1 To matches.Count
            sTemp = sTemp & ", " & matches(n - 1)
        Next n
        ' Drop the leading ", "
        ExtractData = Mid$(sTemp, 3)
    End If
End Function
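As a way to check the pattern outside Excel, here is a minimal Python sketch of the same idea. It is not the answer's code: the function name `extract_codes` is hypothetical, and the pattern is a stricter variant that assumes a token must contain at least one hyphen, as the question asks.

```python
import re

# A token of letters, digits and hyphens, at least 10 characters long.
# (?=[A-Za-z0-9]*-) requires a hyphen somewhere in the token; the
# lookbehind/lookahead stop a match from starting or ending mid-token.
TOKEN = re.compile(
    r"(?<![A-Za-z0-9-])(?=[A-Za-z0-9]*-)[A-Za-z0-9-]{10,}(?![A-Za-z0-9-])"
)


def extract_codes(text: str) -> str:
    """Return all matching tokens joined by ', ' (empty string if none)."""
    return ", ".join(TOKEN.findall(text))


print(extract_codes("Upgrade from lite 7/6/07, old No: PTX7-89C367EC5052-01211"))
```

VBScript regular expressions support lookahead but not lookbehind, so carrying this back into the UDF would need the leading check dropped or replaced.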

Related

Extract one specific key and value from a comma separated list in every cell of a column

I'm trying to extract 'manufacturer=acme' from, for example:
attribute1=red,attribute2=medium,manufacturer=acme,attribute4=spherical
from column 'attributes', for which there are 8000+ rows.
I can't use left(), right(), split() functions because the manufacturer attribute doesn't have a fixed number of attributes/characters to the left or right of it and split() only works for one character, not a string.
Is there a way I can achieve this, target the string manufacturer= and remove all text from the left and right starting from its encapsulating commas?
Quick mock-up for looping through a split string (untested):
Dim stringToArray As Variant: stringToArray = Split(target.Value, ",")
Dim arrayLocation As Long
Dim manufacturerName As String
For arrayLocation = LBound(stringToArray) To UBound(stringToArray)
    ' Look for the list item that holds the manufacturer (case-insensitive)
    If InStr(UCase(stringToArray(arrayLocation)), UCase("manufacturer=")) Then
        ' Keep only the text after "manufacturer="
        manufacturerName = Right(stringToArray(arrayLocation), Len(stringToArray(arrayLocation)) - Len("manufacturer="))
        Exit For
    End If
Next arrayLocation
Debug.Print manufacturerName
I have, maybe, an overkill solution using RegExp.
Following is a UDF you can use in a formula
Public Function ExtractManufacturerRE(ByRef r As Range) As String
    On Error GoTo RETURN_EMPTY_STR
    Dim matches As Object
    With CreateObject("VBScript.RegExp")
        .Pattern = "manufacturer=[^,]+"
        .Global = False
        Set matches = .Execute(r.Value)
        If matches.Count > 0 Then
            ExtractManufacturerRE = matches.Item(0).Value
        End If
    End With
RETURN_EMPTY_STR:
End Function
To be fair, this is sub-optimal, plus it doesn't work on a range but only on a single cell.
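For testing the pattern outside Excel, here is a minimal Python sketch of the same extraction. The function name `extract_attribute` and its `key` parameter are hypothetical, and the sketch assumes the key should only match at the start of a comma-separated item.

```python
import re


def extract_attribute(attributes: str, key: str = "manufacturer") -> str:
    """Return 'key=value' from a comma-separated list, or '' if absent."""
    # (?:^|,) anchors the key at the start of a list item, so a longer key
    # that merely ends in the same text is not picked up by accident.
    match = re.search(r"(?:^|,)(" + re.escape(key) + r"=[^,]*)", attributes)
    return match.group(1) if match else ""


row = "attribute1=red,attribute2=medium,manufacturer=acme,attribute4=spherical"
print(extract_attribute(row))
```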

Get the total number of group names where there are multiple group names in a single cell surrounded by random text

This concerns a list exported from a SharePoint site. I do not have access to the export feature itself, but I can refresh the data to pick up new information when it is added to the SharePoint site.
The Excel list has several columns, one of which contains the names of groups that were created.
There are several thousand rows, and each cell in that column might contain data like the example below.
I want to count only the number of groups in my Excel sheet.
I was initially thinking of exporting the data as text and removing everything in the string BEFORE the group rows (again, all these rows are in the same cell), so that only the group names remain. Then I would count the number of items in each cell in a separate column and sum them.
"Please delete following CORP group:
PRD.12.SYOS.EXOA.XW12LAPP0171.TWSUSERS
PRD.12.SYOS.XW12LAPP0171.Administrators
PRD.12.SYOS.EXOA.XW12LAPP0063.TWSAGENT
PRD.12.SYOS.EXOA.XW12LAPP0063.TWSUSERS
PRD.12.SYOS.VM4P.XW12LAPP0063.ADMINISTRATORS
PRD.12.SYOS.VM4P.XW12LAPP0063.RDP
PRD.12.SYOS.XW12LAPP0063.Administrators"
I am a beginner in excel and in coding. I tried using functions in excel like RIGHT, LEFT, MID, LEN but I am still unable to produce anything close to what I need.
I need the total number of the groups in the sheet.
With a helper column you could extract the part after the last dot with
=MID(A1,LOOKUP(9^9,FIND(".",A1,ROW(1:999)))+1,50)
and then count the number of unique group names that occur with
=SUMPRODUCT((E1:E99<>"")/COUNTIF(E1:E99,E1:E99&""))
Be aware that Administrators and Administrators" are each counted as a unique group. You need to strip unnecessary characters with SUBSTITUTE if you don't want that behaviour.
When your list doesn't contain duplicates you could use a rather simple formula like:
=LEN(A1)-LEN(SUBSTITUTE(A1,CHAR(10),""))-1
It would also be a nice problem to use a regex on, for example:
Function CountGroup(RNG As Range) As Double
    Dim regex As Object
    Set regex = CreateObject("VBScript.RegExp")
    With regex
        .Pattern = "(?=.*\d)[^\n " & Chr(34) & "]+"
        .Global = True
    End With
    Set Matches = regex.Execute(RNG.Value)
    CountGroup = Matches.Count
End Function
Or when your list can contain duplicates:
Function CountGroup(RNG As Range) As Double
    Dim regex As Object, matches As Object, HIT As Object
    Dim ARR1() As String, ARR2() As String, X As Long
    Set regex = CreateObject("VBScript.RegExp")
    With regex
        .Pattern = "(?=.*\d)[^\n " & Chr(34) & "]+"
        .Global = True
    End With
    Set matches = regex.Execute(RNG.Value)
    ReDim ARR1(X)
    For Each HIT In matches
        ' Filter returns the stored names that contain HIT.Value (case-insensitive)
        ARR2 = Filter(SourceArray:=ARR1, Match:=HIT.Value, Include:=True, Compare:=vbTextCompare)
        If UBound(ARR2) = -1 Then
            ' Not seen before: store it
            ReDim Preserve ARR1(X)
            ARR1(X) = HIT.Value
            X = X + 1
        End If
    Next HIT
    ' X is the number of unique names stored (0 when nothing matched)
    CountGroup = X
End Function
Call in sheet through =CountGroup(A1)
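To sanity-check the expected count outside Excel, the same idea can be sketched in Python. This is only an approximation of the VBA above: the function name `count_groups` is hypothetical, and it assumes a group name is any run of non-space, non-quote characters that contains at least one digit.

```python
import re


def count_groups(cell_text: str, unique: bool = True) -> int:
    """Count group names in one cell's text, optionally ignoring repeats."""
    # Split the cell into tokens, dropping whitespace and double quotes.
    tokens = re.findall(r'[^\s"]+', cell_text)
    # Assumption: only tokens containing a digit are group names, which
    # filters out the words of the introductory sentence.
    groups = [t.lower() for t in tokens if re.search(r"\d", t)]
    return len(set(groups)) if unique else len(groups)


cell = (
    '"Please delete following CORP group:\n'
    "PRD.12.SYOS.EXOA.XW12LAPP0171.TWSUSERS\n"
    "PRD.12.SYOS.XW12LAPP0171.Administrators\n"
    'PRD.12.SYOS.XW12LAPP0063.Administrators"'
)
print(count_groups(cell))
```

Names are compared case-insensitively here, mirroring the vbTextCompare setting in the VBA version.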

Remove words that contain each other and leave the longer one

I'm looking for a macro (preferably a function) that would take cell contents, split it into separate words, compare them to one another and remove the shorter words.
Here's an image of what I want the output to look like (I need the words that are crossed out removed):
I tried to write a macro myself, but it doesn't work 100% properly because it's not taking the last words and sometimes removes what shouldn't be removed. Also, I have to do this on around 50k cells, so a macro takes a lot of time to run, that's why I'd prefer it to be a function. I guess I shouldn't use the replace function, but I couldn't make anything else work.
Sub clean_words_containing_eachother()
    Dim sht1 As Worksheet
    Dim LastRow As Long
    Dim Cell As Range
    Dim cell_value As String
    Dim word, word2 As Variant
    Set sht1 = ActiveSheet
    col = InputBox("Which column do you want to clear?")
    LastRow = sht1.Cells(sht1.Rows.Count, col).End(xlUp).Row
    Let to_clean = col & "2:" & col & LastRow
    For i = 2 To LastRow
        For Each Cell In sht1.Range(to_clean)
            cell_value = Cell.Value
            cell_split = Split(cell_value, " ")
            For Each word In cell_split
                For Each word2 In cell_split
                    If word <> word2 Then
                        If InStr(word2, word) > 0 Then
                            If Len(word) < Len(word2) Then
                                word = word & " "
                                Cell = Replace(Cell, word, " ")
                            ElseIf Len(word) > Len(word2) Then
                                word2 = word2 & " "
                                Cell = Replace(Cell, word2, " ")
                            End If
                        End If
                    End If
                Next word2
            Next word
        Next Cell
    Next i
End Sub
Assuming that the retention of the third word in your first example is an error, since books is contained later on in notebooks:
5003886 book books bound case casebound not notebook notebooks office oxford sign signature
and also assuming that you would want to remove duplicate identical words, even if they are not contained subsequently in another word, then we can use a Regular Expression.
The regex will:
Capture each word
look-ahead to see if that word exists later on in the string
if it does, remove it
Since VBA regexes cannot also look-behind, we work-around this limitation by running the regex a second time on the reversed string.
Then remove the extra spaces and we are done.
Option Explicit
Function cleanWords(S As String) As String
    Dim RE As Object, MC As Object, M As Object
    Dim sTemp As String
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .Pattern = "\b(\w+)\b(?=.*\1)"
        .ignorecase = True
        'replace looking forward
        sTemp = .Replace(S, "")
        ' check in reverse
        sTemp = .Replace(StrReverse(sTemp), "")
        'return to normal
        sTemp = StrReverse(sTemp)
        'Remove extraneous spaces
        cleanWords = WorksheetFunction.Trim(sTemp)
    End With
End Function
Limitations
punctuation will not be removed
a "word" is defined as containing only the characters in the class [_A-Za-z0-9] (letters, digits and the underscore).
if any words are hyphenated, or contain other non-word characters, they will be treated as two separate words in the above
if you want such a word treated as a single word, then we might need to change the regex
General steps:
Write cell to array (already working)
for each element (x), go through each element (y) (already working)
if x is in y AND y is longer than x THEN set x to ""
concat array back into string
write string to cell
String/array manipulations are much faster than operations on cells, so this will give you some increase in performance (depending on the amount of words you need to replace for each cell).
The "last word problem" might be that you don't have a space after the last word within your cells, since you only replace word + " " with " ".
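The general steps above can be sketched in Python to check the expected output before writing the VBA. This is a minimal sketch, not the answerer's code: the function name `clean_words` is hypothetical, the comparison is case-sensitive, and exact duplicates are left alone because the steps only remove a word that sits inside a longer one.

```python
def clean_words(text: str) -> str:
    """Drop every word that is contained in a longer word of the same text."""
    words = text.split()  # write the cell text to an array
    kept = []
    for word in words:
        # x is removed when some y is longer than x and contains x
        contained = any(
            len(other) > len(word) and word in other for other in words
        )
        if not contained:
            kept.append(word)
    return " ".join(kept)  # concatenate the array back into a string


print(clean_words("book books notebooks office"))
```

Because the words are compared as whole array elements, the last word needs no trailing space and is handled like any other.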

Regex pattern to remove certain prefixes in a word from Excel

I am trying to clean up a set of strings in Excel to extract certain words after removing some prefixes and extra characters. Initially I was trying this with FIND, LEFT, MID, etc. Then I came across this helpful post and am now trying my hand at regex.
https://superuser.com/questions/794536/excel-formulas-for-stripping-out-prefix-suffix-around-number
I have used the UDF given there, called Remove, which takes a regex argument. However, I am still not able to remove all the items I want to remove.
In the attached Excel file you can see what I have tried and the answer I am looking for.
Here are the prefixes I want to remove:
The numbers at the beginning surrounded by brackets - ideally I want these in a separate column.
Any word before a hyphen; there are a number of them, e.g. 'l-', 'al-'
and then the prefixes below.
 bi
 bil
 fa
 wa
 wal
How do I write a single regex which would remove all the above prefixes?
Here is the UDF I am using:
Function Remove(objCell As Range, strPattern As String)
    Dim RegEx As Object
    Set RegEx = CreateObject("VBScript.RegExp")
    RegEx.Global = True
    RegEx.Pattern = strPattern
    Remove = RegEx.Replace(objCell.Value, "")
End Function
Here is the link to the XLSM file which contains the data I have:
https://www.dropbox.com/s/et9ee727ompj5fl/Regex%20Trials.xlsm?dl=0
and here is a screenshot to show you what I am looking for:
Not 100% perfect for words, but it should get you started.
Breakdown of the RegEx (\d+\:)+\d+
(\d+\:) finds any pattern that matches the format x: (one or more digits followed by a colon).
The plus after the bracket tells it that this is a repeating pattern.
Lastly, the \d+ matches the final number in the string, so the regex will find a pattern such as x:x:x.
The next RegEx (?!l-|al-|a-|wa-|fa-|bi-)[a-z].* is a lot more complex.
First of all, let's look at the [a-z]. This tells it to match any character between a and z. We then want to capture the rest of the word, so .* captures everything from that first character to the end of the string (this includes non a-z characters). However, we don't want it to capture the part of the string before the hyphen (in most cases), so we use ?!, which is called a negative lookahead. It stops the match from starting at any position where one of the alternatives inside the brackets comes next. | simply means or, so a match cannot begin on any of the listed prefixes.
Go to http://regexr.com/ if you want to have a play around; it is a handy site to learn and test RegEx.
Public Sub test()
    Dim rng As Range
    Dim matches
    Dim c
    With Sheet1
        Set rng = .Range(.Cells(2, 1), .Cells(.Cells(.Rows.Count, 1).End(xlUp).Row, 1))
    End With
    For Each c In rng
        With c
            .Offset(0, 6) = ExecuteRegEx(.Value2, "(\d+\:)+\d+")
            .Offset(0, 7) = ExecuteRegEx(.Value2, "(?!l-|al-|a-|wa-|fa-|bi-)[a-z].*")
        End With
    Next c
End Sub
Public Function ExecuteRegEx(str As String, pattern As String) As String
    Dim RegEx As Object
    Dim matches
    Set RegEx = CreateObject("VBScript.RegExp")
    With RegEx
        .Global = True
        .ignorecase = False
        .pattern = pattern
        If .test(str) Then
            Set matches = .Execute(str)
            ExecuteRegEx = matches(0)
        Else
            ExecuteRegEx = vbNullString
        End If
    End With
End Function
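To see how the two patterns behave, here is a minimal Python sketch that applies them to made-up inputs. The workbook data is not shown in the post, so the sample strings and the function name `split_entry` are hypothetical; the patterns themselves are the ones used above.

```python
import re

NUMBER_REF = re.compile(r"(\d+\:)+\d+")
WORD = re.compile(r"(?!l-|al-|a-|wa-|fa-|bi-)[a-z].*")


def split_entry(text: str) -> tuple:
    """Return (number reference, remaining word) using the first match of each."""
    ref = NUMBER_REF.search(text)
    word = WORD.search(text)
    return (ref.group(0) if ref else "", word.group(0) if word else "")


# Hypothetical inputs in the "(x:x:x) prefix-word" shape described above
print(split_entry("(2:255:3) al-kitab"))
print(split_entry("(1:1:2) bi-ismi"))
```

The second call shows one reason the pattern is "not 100% perfect": the lookahead blocks a match starting at `bi-`, but nothing blocks a start at `i-`, so the hyphen survives in the result. The `al-`, `wa-` and `fa-` cases work only because `l-` and `a-` are also in the list.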
I wouldn't use a regex for this: you can split the cell value and test the result against a defined array of prefixes:
Note: the array values are ordered so that prefixes which are substrings of other prefixes come later in the list
Public Function RemovePrefix(RngSrc As Range) As String
    If RngSrc.Count > 1 Then Exit Function
    On Error GoTo ExitFunction
    Dim Prefixs() As String: Prefixs = Split("wal,wa',wa,bil,bi,fa", ",")
    Dim Arr() As String, i As Long, Temp As String
    Arr = Split(RngSrc, "-")
    If UBound(Arr) > 0 Then
        RemovePrefix = Arr(UBound(Arr))
        Exit Function
    End If
    Arr = Split(RngSrc, " ")
    For i = 0 To UBound(Prefixs)
        Temp = Arr(UBound(Arr))
        If InStr(Temp, Prefixs(i)) = 1 Then
            RemovePrefix = Right(Temp, Len(Temp) - Len(Prefixs(i)))
            Exit Function
        End If
    Next i
    RemovePrefix = Temp
ExitFunction:
    If Err Then RemovePrefix = "Error"
End Function

How to extract text within a string of text

I have a simple problem that I'm hoping to resolve without using VBA but if that's the only way it can be solved, so be it.
I have a file with multiple rows (all one column). Each row has data that looks something like this:
1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom
2 5.09E-09 >gi|168010496|ref|xp_00| 2-pyruvate
etc...
What I want is some way to extract the string of numbers that begin with "gi|" and end with a "|". For some rows this might mean as many as 5 gi numbers, for others it'll just be one.
What I would hope the output would look like would be something like:
297848936,297338191
168010496
etc...
Here is a very flexible VBA answer using the regex object. The function extracts every sub-group match it finds (the parts inside parentheses), separated by whatever string you want (the default is ", "). You can find info on regular expressions here: http://www.regular-expressions.info/
You would call it like this, assuming that first string is in A1:
=RegexExtract(A1,"gi[|](\d+)[|]")
Since this looks for all occurrences of "gi|" followed by a series of digits and then another "|", for the first line in your question this would give you this result:
297848936, 297338191
Just run this down the column and you're all done!
Function RegexExtract(ByVal text As String, _
                      ByVal extract_what As String, _
                      Optional separator As String = ", ") As String
    Dim allMatches As Object
    Dim RE As Object
    Set RE = CreateObject("vbscript.regexp")
    Dim i As Long, j As Long
    Dim result As String
    RE.pattern = extract_what
    RE.Global = True
    Set allMatches = RE.Execute(text)
    For i = 0 To allMatches.count - 1
        For j = 0 To allMatches.Item(i).submatches.count - 1
            result = result & (separator & allMatches.Item(i).submatches.Item(j))
        Next
    Next
    If Len(result) <> 0 Then
        result = Right$(result, Len(result) - Len(separator))
    End If
    RegexExtract = result
End Function
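The same pattern can be tried outside Excel with a short Python sketch that mirrors the function above. The name `regex_extract` is hypothetical, and this is an approximation of the VBA, not a port that has been checked against Excel.

```python
import re


def regex_extract(text: str, pattern: str, separator: str = ", ") -> str:
    """Join every captured sub-group of every match with the separator."""
    parts = []
    for match in re.finditer(pattern, text):
        # A group that took no part in the match becomes an empty string
        parts.extend("" if g is None else g for g in match.groups())
    return separator.join(parts)


line = "1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom"
print(regex_extract(line, r"gi[|](\d+)[|]"))
```

Passing "," as the separator gives the comma-only format shown in the question.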
Here it is (assuming data is in column A)
=VALUE(LEFT(RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2),
FIND("|",RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2)) -1 ))
Not the nicest formula, but it will work to extract the first number.
I just noticed that you have two values per row, with the output separated by commas. You will need to check whether there is a second match, third match, etc. to make it work for multiple numbers per cell.
For your exact sample (assuming a maximum of 2 values per cell), the following formula will work:
=IF(ISNUMBER(FIND("gi|",$A1,FIND("gi|", $A1)+1)),CONCATENATE(LEFT(RIGHT($A1,LEN($A1)
- FIND("gi|",$A1) - 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ),
", ",LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1)
- 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1) - 2))
-1 )),LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2),
FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ))
How's that for ugly? A VBA solution may be better for you, but I'll leave this here for you.
To go up to 5 numbers, study the pattern and nest it manually in the formula. It will get long!
I'd probably split the data first on the | delimiter using the convert text to columns wizard.
In Excel 2007 that is on the Data tab, Data Tools group and then choose Text to Columns. Specify Other: and | as the delimiter.
From the sample data you posted it looks like after you do this the numbers will all be in the same columns so you could then just delete the columns you don't want.
As the others presented solutions without VBA, I'll present one that does use it. It is your call whether to use it or not.
I just saw that @Issun presented a solution with regex, very nice! Either way, I will present a 'modest' solution to the question, using only 'plain' VBA.
Option Explicit
Option Base 0
Sub findGi()
    Dim oCell As Excel.Range
    Set oCell = Sheets(1).Range("A1")
    'Loops through every row until empty cell
    While Not oCell.Value = ""
        oCell.Offset(0, 1).Value2 = GetGi(oCell.Value)
        Set oCell = oCell.Offset(1, 0)
    Wend
End Sub
Private Function GetGi(ByVal sValue As String) As String
    Dim sResult As String
    Dim vArray As Variant
    Dim vItem As Variant
    Dim iCount As Integer
    vArray = Split(sValue, "|")
    iCount = 0
    'Loops through the array...
    For Each vItem In vArray
        'Searches for the 'Gi' factor...
        If vItem Like "*gi" And UBound(vArray) > iCount + 1 Then
            'Concatenates the results...
            sResult = sResult & vArray(iCount + 1) & ","
        End If
        iCount = iCount + 1
    Next vItem
    'And removes trail comma
    If Len(sResult) > 0 Then
        sResult = Left(sResult, Len(sResult) - 1)
    End If
    GetGi = sResult
End Function
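Open your Excel file in Google Sheets and use a regular expression with REGEXEXTRACT. Note that it returns only the first match, so it will not list several gi numbers from one cell.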
Sample Usage
=REGEXEXTRACT("My favorite number is 241, but my friend's is 17", "\d+")
Tip: REGEXEXTRACT will return 241 in this example because it returns the first matching case.
In your case
=REGEXEXTRACT(A1,"gi[|](\d+)[|]")

Resources