extract multiple expressions - excel

I have a cell that contains usernames assigned to projects like this
,FC757_random_name,AP372_another_one,FC782_again_different,FC082_samesamebutdifferent,
I need to only extract the alphanumeric values the expressions start with, so everything in between , and _.
I made it work for one expression with the following, but I need all of them.
= MID(A1;FIND(",";A1)+1;FIND("_";A1)-FIND(",";A1)-1)
I also tinkered with Text to Columns, but couldn't make it work for multiple lines at once.
Ideally this would work only with formulas, but I guess (/fear) I'll need VBA or Macros, which I have never worked with before.
All help will be appreciated!

Here is a regex based User Defined Function.
Option Explicit
Function extractMultipleExpressions(str As String, _
Optional delim As String = ", ")
Dim n As Long, nums() As Variant
Static rgx As Object, cmat As Object
'with rgx as static, it only has to be created once; beneficial when filling a long column with this UDF
If rgx Is Nothing Then
Set rgx = CreateObject("VBScript.RegExp")
End If
extractMultipleExpressions = vbNullString
With rgx
.Global = True
.MultiLine = False
.Pattern = "[A-Z]{2}[0-9]{3}"
If .Test(str) Then
Set cmat = .Execute(str)
'resize the nums array to accept the matches
ReDim nums(cmat.Count - 1)
'populate the nums array with the matches
For n = LBound(nums) To UBound(nums)
nums(n) = cmat.Item(n)
Next n
'convert the nums array to a delimited string
extractMultipleExpressions = Join(nums, delim)
End If
End With
End Function

I believe you are looking for something like this. Press Alt + F11, choose Insert > Module, and paste the following code:
Public Function GetUsers(UserNameProject As String) As String
    Dim userArray() As String
    Dim users As String
    Dim intPos As Long
    Dim i As Long
    'split the cell value into an array on the commas
    userArray = Split(UserNameProject, ",")
    'process each non-blank element and extract the user code,
    'i.e. everything before the first underscore
    For i = LBound(userArray) To UBound(userArray)
        If Len(Trim(userArray(i))) > 0 Then
            intPos = InStr(1, userArray(i), "_")
            'skip elements without an underscore instead of raising an error
            If intPos > 1 Then
                'add a separator only between users, not before the first one
                If Len(users) > 0 Then users = users & ","
                users = users & Left(userArray(i), intPos - 1)
            End If
        End If
    Next
    GetUsers = users
End Function
If your string is in A1, use it by putting =GetUsers(A1) in the appropriate cell. I think this should get you started!

To clean the data of extra commas, use this formula in cell B1 (the semicolons are argument separators; the "," literals are the commas in the data):
=TRIM(SUBSTITUTE(SUBSTITUTE(TRIM(SUBSTITUTE(SUBSTITUTE(A1;" ";"||");",";" "));" ";",");"||";" "))
Then use this formula in cell C1 and copy it across and down to extract just the part you want from each section:
=IFERROR(INDEX(TRIM(LEFT(SUBSTITUTE(TRIM(MID(SUBSTITUTE($B1;",";REPT(" ";LEN($B1)));LEN($B1)*(ROW($A$1:INDEX($A:$A;LEN($B1)-LEN(SUBSTITUTE($B1;",";""))+1))-1)+1;LEN($B1)));"_";REPT(" ";LEN($B1)));LEN($B1)));COLUMN(A1));"")

Related

How to check for the same delimiter multiple times?

I have an Excel sheet that contains strings and numbers. All the strings I am searching for have an underscore ("_"), which is my delimiter. However, some strings have the delimiter more than once.
For example:
text_in_00
text_in_01
text_out_00
text_out_01
Other strings with just one delimiter work beautifully. But here, with two delimiters, "in" and "out" are not being differentiated, due to the delimiter only being found once. How do I find EACH delimiter in a given string?
My goal with this code is to differentiate between ranges and copy and paste these different ranges into their own individual worksheets. Also, I cannot hard-code any cells or strings, as the string names are subject to change, as well as the size of the ranges.
My code:
'Dim arr As Variant
Dim i As Long
Dim filterRange As Range
Dim delimiterItem As String 'was variant
Dim a As Range
delimiterItem = "_"
Set filterRange = FindAll(Worksheets(newSheetName).UsedRange)
For i = filterRange.Rows.Count To 2 Step -1
'arr = Split(Cells(i, 1), delimiterItem)
'For j = LBound(arr) To UBound(arr)
If Split(filterRange.Cells(i, 1).Text, delimiterItem)(0) <> Split(filterRange.Cells(i - 1, 1).Text, delimiterItem)(0) Then
Range(filterRange.Cells(i, 1).EntireRow, filterRange.Cells(i, 1).EntireRow).Insert
End If
'Next j
Next i
Note: FindAll is another function in my code that finds the values I need to be looking at. Some strings don't contain any underscores ("_"); those are values I don't need. This function just filters out what I don't need and works great. I am focusing on the portion of code below the line: Set filterRange = FindAll(Worksheets(newSheetName).UsedRange)
Note: The commented out code was something I was trying, but gave the same result.
TLDR; How do I check for each instance of the delimiter? Thank you in advance for the help.
Use the following function to get a count of how many times Char appears in your string, and then use a Select Case construct to do whatever you need based on the count.
Public Function CountChars(ByVal Source As String, ByVal Char As String) As Long
CountChars = Len(Source) - Len(Replace(Source, Char, vbNullString))
End Function
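As a minimal sketch of how that count could drive a Select Case, consider the function below. The name DescribeDelimiters and the labels it returns are hypothetical and only for illustration; the count is computed inline with the same Len/Replace expression as CountChars so the snippet runs on its own, and it assumes a single-character delimiter.

```vba
Option Explicit

' Hypothetical helper (not from the original answer): branches on how many
' times a single-character delimiter occurs in Source.
Public Function DescribeDelimiters(ByVal Source As String, ByVal Delim As String) As String
    ' Same counting expression as CountChars above; valid for a one-character Delim.
    Select Case Len(Source) - Len(Replace(Source, Delim, vbNullString))
        Case 0
            DescribeDelimiters = "none"      ' value would be filtered out
        Case 1
            DescribeDelimiters = "one"       ' e.g. "text_00"
        Case Else
            DescribeDelimiters = "several"   ' e.g. "text_in_00"
    End Select
End Function
```

Each Case branch is where the handling for that number of delimiters would go.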
Make a function that returns the Nth index of a substring inside another:
Public Function NthIndexOf(ByVal needle As String, ByVal haystack As String, ByVal n As Long) As Long
    Dim currentN As Long
    Dim currentIndex As Long
    Do
        'search from just after the previous hit; InStr returns 0 when nothing more is found
        currentIndex = InStr(currentIndex + 1, haystack, needle, vbTextCompare)
        currentN = currentN + 1
    'stop at the nth hit or when there are no more hits, so a missing nth occurrence yields 0
    Loop Until currentN = n Or currentIndex = 0
    NthIndexOf = currentIndex
End Function
Now you can get the NthIndexOf("_", "text_in_00", 2) and get 8. If you tried to get the 3rd index of "_", the output would be 0.
If you want the substring between each "delimiter", then you need to Split and then iterate the array. It's unclear what you intend to do with each substring though, but you should have all the tools you need to do whatever it is that you're doing now.
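A minimal sketch of the Split-and-iterate approach follows. It assumes one possible reading of the goal, namely that rows should be grouped by everything before the last delimiter (so "text_in_00" and "text_in_01" share the key "text_in"). The function name GroupKey is hypothetical.

```vba
Option Explicit

' Hypothetical helper: splits Source on Delim, walks the pieces, and re-joins
' all but the last one, e.g. "text_in_00" -> "text_in".
' Returns "" when Source contains no delimiter.
Public Function GroupKey(ByVal Source As String, ByVal Delim As String) As String
    Dim parts() As String
    Dim result As String
    Dim i As Long

    parts = Split(Source, Delim)
    ' Visit every piece except the last one.
    For i = LBound(parts) To UBound(parts) - 1
        If i > LBound(parts) Then result = result & Delim
        result = result & parts(i)
    Next i
    GroupKey = result
End Function
```

Comparing GroupKey of two adjacent cells, instead of only element (0) of the Split result, would distinguish "text_in_..." from "text_out_...".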
delimiterItem = "_"
Set filterRange = FindAll(Worksheets(newSheetName).UsedRange)
For i = filterRange.Rows.Count To 2 Step -1
    'compare everything before the last delimiter, e.g. "text_in" vs "text_out"
    If Left$(filterRange.Cells(i, 1).Text, InStrRev(filterRange.Cells(i, 1).Text, delimiterItem) - 1) <> Left$(filterRange.Cells(i - 1, 1).Text, InStrRev(filterRange.Cells(i - 1, 1).Text, delimiterItem) - 1) Then
        Range(filterRange.Cells(i, 1).EntireRow, filterRange.Cells(i, 1).EntireRow).Insert
    End If
Next i

How to Find Average of Numbers within a Single Cell

I've been working on this problem for a while now. I am able to break the data out using multiple cells and eventually get the average, but I am unable to construct a single formula to work out the average.
The data will be dynamic, from 2 to more than 8.
Idea is that I can paste the data into a cell and the Average will compute in another cell.
As mentioned, I was able to break it out using many separate cells and also vba, but was thinking of making the spreadsheet clear with a single cell to do this work.
Using this as an example data set:
ABC 106.375/DF 106.99/G 106.5/JK 99.5/
Output: Average = 104.84125
Just trying to retrieve the 3rd number in the data has put my formula into a huge mess and unneeded complication. =MID(G3,LEN(LEFT(G3,FIND("/",G3)-1))+LEN(MID(G3,LEN(LEFT(G3,FIND("/",G3)-1))+2,FIND("/",G3,LEN(LEFT(G3,FIND("/",G3)-1)))-2))+3,FIND("/",G3,LEN(LEFT(G3,FIND("/",G3)-1))+LEN(MID(G3,LEN(LEFT(G3,FIND("/",G3)-1))+2,FIND("/",G3,LEN(LEFT(G3,FIND("/",G3)-1)))-2))+3)-(LEN(LEFT(G3,FIND("/",G3)-1))+LEN(MID(G3,LEN(LEFT(G3,FIND("/",G3)-1))+2,FIND("/",G3,LEN(LEFT(G3,FIND("/",G3)-1)))-2))+2)-1)
I feel so limited that I am unable to keep variables, and I am not even at the point where I can pull all the numbers together to compute the average.
Here's a regex-based user defined function.
Option Explicit
Function avgNumsOnly(str As String, _
Optional delim As String = ", ")
Dim n As Long, nums() As Variant
Static rgx As Object, cmat As Object
'with rgx as static, it only has to be created once; beneficial when filling a long column with this UDF
If rgx Is Nothing Then
Set rgx = CreateObject("VBScript.RegExp")
End If
avgNumsOnly = vbNullString
With rgx
.Global = True
.MultiLine = False
.Pattern = "\d*\.?\d+"
If .Test(str) Then
Set cmat = .Execute(str)
'resize the nums array to accept the matches
ReDim nums(cmat.Count - 1)
'populate the nums array with the matches
For n = LBound(nums) To UBound(nums)
nums(n) = CDbl(cmat.Item(n))
Next n
'average the nums array
avgNumsOnly = Application.Average(nums)
End If
End With
End Function

Extract maximum number from a string

I am trying to extract all numbers from a string with a function in Excel.
As a second step, I would like to extract the maximum value contained in the string.
My string looks like:
ATCG=12.5,TTA=2.5,TGC=60.28
Desired output: 60.28
As a first step, I am trying to extract all numbers with my function, but it stops at the first number.
Function MyCode(ByVal txt As String) As String
With CreateObject("VBScript.RegExp")
.Pattern = "\d.+"
If .test(txt) Then MyCode = .Execute(txt)(0)
End With
End Function
Here is some VBA (not VBScript) that you can adapt to your needs:
Public Function MyCode(ByVal txt As String) As String
Dim maxi As Double, db As Double
maxi = -9999
arr = Split(Replace(txt, "=", ","), ",")
For Each a In arr
If IsNumeric(a) Then
db = CDbl(a)
If db > maxi Then maxi = db
End If
Next a
MyCode = CStr(maxi)
End Function
NOTE:
This gives a String and not a Number.
EDIT#1:
In Excel-VBA, the code must be placed in a standard module.
User Defined Functions (UDFs) are very easy to install and use:
ALT-F11 brings up the VBE window
ALT-I
ALT-M opens a fresh module
paste the stuff in and close the VBE window
If you save the workbook, the UDF will be saved with it.
If you are using a version of Excel later than 2003, you must save
the file as .xlsm rather than .xlsx
To remove the UDF:
bring up the VBE window as above
clear the code out
close the VBE window
To use the UDF from Excel:
=MyCode(A1)
To learn more about macros in general, see:
http://www.mvps.org/dmcritchie/excel/getstarted.htm
and
http://msdn.microsoft.com/en-us/library/ee814735(v=office.14).aspx
and for specifics on UDFs, see:
http://www.cpearson.com/excel/WritingFunctionsInVBA.aspx
Macros must be enabled for this to work!
You don't really need VBA for this. If you have a version of Excel (2010+) that includes the AGGREGATE function, you can do it with a worksheet formula:
=AGGREGATE(14,6,--TRIM(MID(SUBSTITUTE(SUBSTITUTE(A1,",",REPT(" ",99)),"=",REPT(" ",99)),seq_99,99)),1)
where seq_99 is a Named Formula that refers to:
=IF(ROW(INDEX($1:$65535,1,1):INDEX($1:$65535,255,1))=1,1,(ROW(INDEX($1:$65535,1,1):INDEX($1:$65535,255,1))-1)*99)
The inner expression returns an array in which some values are numeric and the rest are errors; the AGGREGATE function returns the largest value in the array, ignoring the errors.
The formulas below are for earlier versions of Excel and must be entered as array formulas, by holding down Ctrl + Shift while pressing Enter. If you do this correctly, Excel will place braces {...} around the formula.
If you have 2007, you can use IFERROR
=MAX(IFERROR(--TRIM(MID(SUBSTITUTE(SUBSTITUTE(A2,",",REPT(" ",99)),"=",REPT(" ",99)),seq_99,99)),0))
For earlier versions, you can use:
=MAX(IF(ISERROR(--TRIM(MID(SUBSTITUTE(SUBSTITUTE(A3,",",REPT(" ",99)),"=",REPT(" ",99)),seq_99,99))),0,--TRIM(MID(SUBSTITUTE(SUBSTITUTE(A3,",",REPT(" ",99)),"=",REPT(" ",99)),seq_99,99))))
Your decimal separator may be different from the US decimal separator.
Public Function MyCode(ByVal txt As String) As String
Dim maxi As Double, db As Double
maxi = -9 ^ 9
arr = Split(txt, ",")
For Each a In arr
If InStr(a, "=") Then
a = Mid(a, InStr(a, "=") + 1)
ar = Replace(a, ".", Format(0, "."))
If IsNumeric(ar) Then
db = ar
If db > maxi Then maxi = db: ok = True
End If
End If
Next a
If ok = True Then
MyCode = CStr(maxi)
End If
End Function
Collect all of the mixed numbers as doubles in an array and return the maximum value.
Option Explicit
Option Base 0 '<~~this is the default but I've included it because it has to be 0
Function maxNums(str As String)
Dim n As Long, nums() As Variant
Static rgx As Object, cmat As Object
'with rgx as static, it only has to be created once; beneficial when filling a long column with this UDF
If rgx Is Nothing Then
Set rgx = CreateObject("VBScript.RegExp")
End If
maxNums = vbNullString
With rgx
.Global = True
.MultiLine = False
.Pattern = "\d*\.?\d+" 'decimal point is optional so integers match too; at least one digit is required
If .Test(str) Then
Set cmat = .Execute(str)
'resize the nums array to accept the matches
ReDim nums(cmat.Count - 1)
'populate the nums array with the matches
For n = LBound(nums) To UBound(nums)
nums(n) = CDbl(cmat.Item(n))
Next n
'test array
'Debug.Print Join(nums, ", ")
'return the maximum value found
maxNums = Application.Max(nums)
End If
End With
End Function

Regex pattern to remove certain prefixes in a word from Excel

I am trying to clean up a set of strings in Excel to extract certain words after removing some prefixes and extra characters. Initially I was trying this with FIND, LEFT, MID, etc. Then I came across this helpful post and am now trying my hand at regex.
https://superuser.com/questions/794536/excel-formulas-for-stripping-out-prefix-suffix-around-number
I have used the UDF given there called Remove, which takes a regex argument. However, I am still not able to remove all the items I want to remove.
In the attached Excel file you can see what I have tried and the answer I am looking for.
Here are the prefixes I want to remove:
The numbers at the beginning surrounded by brackets. Ideally I want these in a separate column.
Any word before a hyphen; there are a number of them, such as 'l-' and 'al-'.
And then the prefixes below:
 bi
 bil
 fa
 wa
 wal
How do I write a single regex which would remove all the above prefixes?
Here is the UDF I am using:
Function Remove(objCell As Range, strPattern As String)
Dim RegEx As Object
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.Global = True
RegEx.Pattern = strPattern
Remove = RegEx.Replace(objCell.Value, "")
End Function
Here is the link to the XLSM file which contains the data I have:
https://www.dropbox.com/s/et9ee727ompj5fl/Regex%20Trials.xlsm?dl=0
and here is a screenshot to show you what I am looking for:
This is not 100% perfect for the words, but it should get you started.
Breakdown of the RegEx (\d+\:)+\d+
(\d+\:) finds any pattern that matches the format x: (one or more digits followed by a colon).
The plus after the bracket then tells it that this is a repeating pattern.
Lastly, the \d+ matches the final run of digits, so that the regex will find a pattern that matches x:x:x
The next RegEx, (?!l-|al-|a-|wa-|fa-|bi-)[a-z].*, is a lot more complex.
First of all, let's look at the [a-z]. This tells it to match any single lowercase character between a and z. We then want to capture the rest of the word, so .* captures everything from that first match to the end of the string (this includes non a-z characters). However, we don't want it to capture the first part of the string before the hyphen (in most cases), so we put (?! ... ) in front, which is called a negative lookahead. It makes the match fail at any position where the text that follows is one of the alternatives inside the brackets, so the match cannot start on one of those prefixes. | simply means "or".
Go to http://regexr.com/ if you want to have a play around; it is a handy site to learn and test RegEx.
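To see the two patterns in action before wiring them to a worksheet, a small sketch like the one below can be run from the Immediate window. The sample value "(2:10:3) al-kitab" is a made-up assumption (the real data lives in the linked workbook), and FirstMatch and DemoPatterns are hypothetical names.

```vba
Option Explicit

' Hypothetical helper: returns the first match of pat in src, or "" if none.
Public Function FirstMatch(ByVal src As String, ByVal pat As String) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = False
    re.IgnoreCase = False
    re.Pattern = pat
    If re.Test(src) Then FirstMatch = re.Execute(src)(0).Value
End Function

Public Sub DemoPatterns()
    ' Made-up sample in the shape described in the question.
    Const SAMPLE As String = "(2:10:3) al-kitab"
    ' The bracketed numbers: prints 2:10:3
    Debug.Print FirstMatch(SAMPLE, "(\d+\:)+\d+")
    ' The word after the hyphenated prefix: prints kitab
    Debug.Print FirstMatch(SAMPLE, "(?!l-|al-|a-|wa-|fa-|bi-)[a-z].*")
End Sub
```

In the second pattern the lookahead rejects the positions at "a" (followed by "al-") and "l" (followed by "l-"), the hyphen is not a letter, and so the match starts at "k".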
Public Sub test()
Dim rng As Range
Dim matches
Dim c
With Sheet1
Set rng = .Range(.Cells(2, 1), .Cells(.Cells(.Rows.Count, 1).End(xlUp).Row, 1))
End With
For Each c In rng
With c
.Offset(0, 6) = ExecuteRegEx(.Value2, "(\d+\:)+\d+")
.Offset(0, 7) = ExecuteRegEx(.Value2, "(?!l-|al-|a-|wa-|fa-|bi-)[a-z].*")
End With
Next c
End Sub
Public Function ExecuteRegEx(str As String, pattern As String) As String
Dim RegEx As Object
Dim matches
Set RegEx = CreateObject("VBScript.RegExp")
With RegEx
.Global = True
.ignorecase = False
.pattern = pattern
If .test(str) Then
Set matches = .Execute(str)
ExecuteRegEx = matches(0)
Else
ExecuteRegEx = vbNullString
End If
End With
End Function
I wouldn't use a regex for this: you can split the cell value and test it against a defined array of prefixes.
Note: the array values are ordered so that prefixes which are substrings of other prefixes come later in the list.
Public Function RemovePrefix(RngSrc As Range) As String
If RngSrc.Count > 1 Then Exit Function
On Error GoTo ExitFunction
Dim Prefixs() As String: Prefixs = Split("wal,wa',wa,bil,bi,fa", ",")
Dim Arr() As String, i As Long, Temp As String
Arr = Split(RngSrc, "-")
If UBound(Arr) > 0 Then
RemovePrefix = Arr(UBound(Arr))
Exit Function
End If
Arr = Split(RngSrc, " ")
For i = 0 To UBound(Prefixs)
Temp = Arr(UBound(Arr))
If InStr(Temp, Prefixs(i)) = 1 Then
RemovePrefix = Right(Temp, Len(Temp) - Len(Prefixs(i)))
Exit Function
End If
Next i
RemovePrefix = Temp
ExitFunction:
If Err Then RemovePrefix = "Error"
End Function

How to extract text within a string of text

I have a simple problem that I'm hoping to resolve without using VBA but if that's the only way it can be solved, so be it.
I have a file with multiple rows (all one column). Each row has data that looks something like this:
1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom
2 5.09E-09 >gi|168010496|ref|xp_00| 2-pyruvate
etc...
What I want is some way to extract the string of numbers that begin with "gi|" and end with a "|". For some rows this might mean as many as 5 gi numbers, for others it'll just be one.
What I would hope the output would look like would be something like:
297848936,297338191
168010496
etc...
Here is a very flexible VBA answer using the regex object. The function extracts every sub-group match it finds (the parts of the pattern inside parentheses) and joins them with whatever separator string you want (the default is ", "). You can find info on regular expressions here: http://www.regular-expressions.info/
You would call it like this, assuming that first string is in A1:
=RegexExtract(A1,"gi[|](\d+)[|]")
Since this looks for all occurrences of "gi|" followed by a series of digits and then another "|", for the first line in your question it would give you this result:
297848936, 297338191
Just run this down the column and you're all done!
Function RegexExtract(ByVal text As String, _
ByVal extract_what As String, _
Optional separator As String = ", ") As String
Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
Dim i As Long, j As Long
Dim result As String
RE.pattern = extract_what
RE.Global = True
Set allMatches = RE.Execute(text)
For i = 0 To allMatches.count - 1
For j = 0 To allMatches.Item(i).submatches.count - 1
result = result & (separator & allMatches.Item(i).submatches.Item(j))
Next
Next
If Len(result) <> 0 Then
result = Right$(result, Len(result) - Len(separator))
End If
RegexExtract = result
End Function
Here it is (assuming data is in column A)
=VALUE(LEFT(RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2),
FIND("|",RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2)) -1 ))
Not the nicest formula, but it will work to extract the number.
I just noticed that you have two values per row, with the output separated by commas. You will need to check whether there is a second match, a third match, and so on to make it work for multiple numbers per cell.
In reference to your exact sample (assuming 2 values maximum per cell) the following code will work:
=IF(ISNUMBER(FIND("gi|",$A1,FIND("gi|", $A1)+1)),CONCATENATE(LEFT(RIGHT($A1,LEN($A1)
- FIND("gi|",$A1) - 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ),
", ",LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1)
- 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1) - 2))
-1 )),LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2),
FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ))
How's that for ugly? A VBA solution may be better for you, but I'll leave this here for you.
To go up to 5 numbers, well, study the pattern and nest it manually in the formula. It will get long!
I'd probably split the data first on the | delimiter using the convert text to columns wizard.
In Excel 2007 that is on the Data tab, Data Tools group and then choose Text to Columns. Specify Other: and | as the delimiter.
From the sample data you posted it looks like after you do this the numbers will all be in the same columns so you could then just delete the columns you don't want.
As the others presented solutions without VBA, I'll present one that does use it. It's your call whether to use it or not.
I just saw that #Issun presented a solution with regex, very nice! Either way, I'll present a 'modest' solution to the question, using only 'plain' VBA.
Option Explicit
Option Base 0
Sub findGi()
Dim oCell As Excel.Range
Set oCell = Sheets(1).Range("A1")
'Loops through every row until empty cell
While Not oCell.Value = ""
oCell.Offset(0, 1).Value2 = GetGi(oCell.Value)
Set oCell = oCell.Offset(1, 0)
Wend
End Sub
Private Function GetGi(ByVal sValue As String) As String
Dim sResult As String
Dim vArray As Variant
Dim vItem As Variant
Dim iCount As Integer
vArray = Split(sValue, "|")
iCount = 0
'Loops through the array...
For Each vItem In vArray
'Searches for the 'Gi' factor...
If vItem Like "*gi" And UBound(vArray) > iCount + 1 Then
'Concatenates the results...
sResult = sResult & vArray(iCount + 1) & ","
End If
iCount = iCount + 1
Next vItem
'And removes trail comma
If Len(sResult) > 0 Then
sResult = Left(sResult, Len(sResult) - 1)
End If
GetGi = sResult
End Function
Open your Excel file in Google Sheets and use a regular expression with REGEXEXTRACT.
Sample Usage
=REGEXEXTRACT("My favorite number is 241, but my friend's is 17", "\d+")
Tip: REGEXEXTRACT will return 241 in this example because it returns the first matching case.
In your case
=REGEXEXTRACT(A1,"gi[|](\d+)[|]")
