My workplace is changing CMS systems and we have around 5,000 products to import. The problem is the image URL formatting, because the two systems are laid out very differently. I need a function or VB code to convert one cell:
into two cells containing:
and this is where it gets tricky:
Notice that "Main|1|" has also been removed. The tricky part is that not all of these begin with or contain "Main|1|", and not all of the options begin with or contain "Vaterra".
The main steps would be to remove each image's suffixes and then capture the line of text up to ".jpg" and move it to a separate cell.
Since you used the VBA tag, here is a quick VBA approach.
Assuming your data is in column A, starting from row 1 on Sheet1.
This macro will write the below two lines in column B and C respectively.
Column B
Column C
Here is the macro.
Public RegMatchArray

Sub test()
    Dim sh As Worksheet
    Dim rowCount As Long
    Dim i As Long, j As Long, iPos As Long
    Dim strValue As String, strValue1 As String, strValue2 As String

    Set sh = Sheets("Sheet1")
    rowCount = sh.Range("A1048576").End(xlUp).Row

    For i = 1 To rowCount
        strValue = sh.Cells(i, 1).Value
        ' Drop the "Main|1|" prefix when present
        If InStr(1, strValue, "Main|1|") > 0 Then
            strValue = Replace(strValue, "Main|1|", "")
        End If
        ' Split at the first ";": the left part goes to column B
        iPos = InStr(1, strValue, ";")
        strValue1 = Left(strValue, iPos - 1)
        strValue2 = Mid(strValue, iPos + 1, Len(strValue) - iPos - 1)
        ' Collect every "folder/file.jpg" match and join them with ";"
        Call splitUpRegexPattern(strValue2, "([\w\s-]+?)\/([\w\s-]+?\.jpg)")
        If IsArray(RegMatchArray) Then
            For j = LBound(RegMatchArray) To UBound(RegMatchArray)
                If j < 1 Then
                    strValue2 = RegMatchArray(j)
                Else
                    strValue2 = strValue2 & ";" & RegMatchArray(j)
                End If
            Next j
        End If
        sh.Cells(i, 2).Value = strValue1
        sh.Cells(i, 3).Value = strValue2
    Next i
    Set sh = Nothing
End Sub
Public Function splitUpRegexPattern(targetString, strPattern)
    Dim regEx As New RegExp
    Dim Matches As Object, Match As Object
    Dim arrArray()
    Dim i As Long

    i = 0
    regEx.Pattern = strPattern
    regEx.IgnoreCase = True
    regEx.Global = True

    Set Matches = regEx.Execute(targetString)
    If Matches.Count = 0 Then
        ' No match: store an empty string instead of an array
        RegMatchArray = ""
    Else
        ' Copy every full match into the public array
        For Each Match In Matches
            ReDim Preserve arrArray(i)
            arrArray(i) = Match.Value
            i = i + 1
        Next
        RegMatchArray = arrArray
    End If

    Set regEx = Nothing
    Set Matches = Nothing
End Function
Note: You have to add the "Microsoft VBScript Regular Expressions 5.5" reference by going into Tools -> References.
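If adding the reference is not an option, late binding is a possible alternative. The following is only a minimal sketch, not part of the macro above: JoinJpgMatches is a hypothetical helper name, and it assumes the same pattern the macro uses. It creates the regex object with CreateObject, so nothing has to be ticked under Tools -> References, and it returns the matches already joined with ";".

```vba
' Hypothetical helper (name is illustrative): returns every "folder/file.jpg"
' match in targetString, joined with ";". Late binding via CreateObject means
' no "Microsoft VBScript Regular Expressions 5.5" reference is required.
Function JoinJpgMatches(ByVal targetString As String) As String
    Dim regEx As Object
    Dim matches As Object
    Dim m As Object
    Dim result As String

    Set regEx = CreateObject("VBScript.RegExp")
    regEx.Pattern = "([\w\s-]+?)\/([\w\s-]+?\.jpg)"  ' same pattern as the macro
    regEx.IgnoreCase = True
    regEx.Global = True

    Set matches = regEx.Execute(targetString)
    For Each m In matches
        If Len(result) > 0 Then result = result & ";"
        result = result & m.Value
    Next m

    JoinJpgMatches = result   ' empty string when nothing matched
End Function
```

With this, the line that fills strValue2 could be reduced to a single call such as strValue2 = JoinJpgMatches(strValue2), and the public RegMatchArray would no longer be needed.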
If you don't want to keep the original column A, change these lines:
sh.Cells(i, 2).Value = strValue1
sh.Cells(i, 3).Value = strValue2
to:
sh.Cells(i, 1).Value = strValue1
sh.Cells(i, 2).Value = strValue2
This overwrites the original data and gives you the result in columns A and B.
With some tweaks, you can do this without VBA.
First, replace | and / with ; so that you have a consistent delimiter.
You can also remove Main|1| by replacing it with nothing.
Now choose Data => Text to Columns.
Choose the Delimited option.
Use the semicolon as the delimiter, and each piece of data will end up in its own cell.
You can now remove the unwanted entries.
As an alternative, here is a formula solution. Assuming the large single block of text is in cell A1, put this formula in cell B1 and copy it down until it starts giving you errors:
=TRIM(MID(SUBSTITUTE("|"&$A$1,";",REPT(" ",LEN($A$1))),LEN($A$1)*(ROW(A1)-1)+1+LOOKUP(2,1/(MID(SUBSTITUTE("|"&$A$1,";",REPT(" ",LEN($A$1))),LEN($A$1)*(ROW(A1)-1)+ROW(INDIRECT("1:"&LEN($A$1))),1)="|"),ROW(INDIRECT("1:"&LEN($A$1)))),LEN($A$1)))
The errors mean that there are no more entries to return, so you can delete the cells with errors, and then select all the cells with the formula -> Copy -> Right-click -> Paste Special -> Values to convert them to just be text instead of formulas. (I highly recommend doing that because the Indirect function is volatile and can greatly slow down your workbook if you have many formula cells with it.)
I've got a workbook containing a Summary sheet and 200 numbered sheets that the user fills in one after the other.
The following macro checks about 125 cell values on every numbered sheet, and fills in the Summary, one line per numbered sheet.
If a numbered sheet hasn't been used yet, the macro fills in every column from column D to column DV with the minus sign "-" and goes on to check every numbered sheet one after the other until there are no more to check.
Is there a way to set it so that if an arbitrary number (let's say 10 lines) of the newly generated lines contain only the minus sign "-" from D to DV (Iw,4 to Iw, 126), then the macro would reach its end as it means all the remaining numbered sheets aren't used yet?
Sub SummaryMacro()
    Dim Sh As Worksheet
    Iw = 2 ' Index Write
    For Each Sh In ActiveWorkbook.Sheets
        If Sh.Name = "Summary" Then GoTo EndConsolidation
        Cells(Iw, 1).Select
        With Selection
            .Hyperlinks.Add Anchor:=Selection, Address:="", SubAddress:="'" & Sh.Name & "'" & "!" & "A1", TextToDisplay:="Go to"
        End With
        Cells(Iw, 2) = Sh.Name
        If Sh.Range("D8") = "" Then
            Cells(Iw, 3) = "-"
        Else
            Cells(Iw, 3) = Sh.Range("D8")
        End If
        'Here the rest of the process (Iw, 4 till Iw, 125)
        'The process also includes a few variations:
        'Something like 20 of those with various text
        If Sh.CheckBoxes("Check Box 1").Value = 1 Then Cells(Iw, 40) = "Declared" Else Cells(Iw, 40) = "-"
        'Something like 30 of those with various text
        If Sh.Range("H33") = "Issued" Then
            Cells(Iw, 42) = "-"
        Else
            Cells(Iw, 42) = Sh.Range("H33")
        End If
        'But all in all they are mostly like that
        If Sh.Range("C134") = "" Then
            Cells(Iw, 126) = "-"
        Else
            Cells(Iw, 126) = Sh.Range("C134")
        End If
        Iw = Iw + 1
EndConsolidation:
    Next Sh
End Sub
Try adding this code at the end of your For loop, just before Iw = Iw + 1:
If (WorksheetFunction.CountIf(Range("D" & Iw & ":DV" & Iw), "-") = 123) Then
    Cntr = Cntr + 1 'Blank sheet found
Else
    Cntr = 0 'Not blank - Restart counter
End If
If (Cntr = 10) Then Exit For
This counts the number of - in your row and if it equals 123 (D-DV) then it increments the counter otherwise it clears the counter. When Cntr reaches 10 it exits the loop.
Add this code inside your For loop, just before Iw = Iw + 1. Declare and initialise the two variables once, before the loop starts; otherwise the counter is reset on every sheet.
'Declare and initialise once, before the For loop
Dim counter As Integer
Dim previousRowBlank As Boolean
counter = 0
previousRowBlank = True
'Inside the loop: count if all the 123 cells contain - string
If (WorksheetFunction.CountIf(Sheets("Summary").Range("D" & Iw & ":DV" & Iw), "-") = 123) Then
    If (counter = 0) Then
        counter = counter + 1
        previousRowBlank = True
    Else
        If (previousRowBlank = True) Then
            counter = counter + 1
        End If
    End If
Else
    previousRowBlank = False
    counter = 0
End If
'assuming you want to exit when 10 consecutive rows are blank
If (counter = 10) Then
    Exit Sub
End If
When I have something like this I Dim a Boolean variable (perhaps call it isPopulated) which only gets switched to true when one of the cells has a value to act on. Then for your case after 10 (or however many you choose) lines, insert an If isPopulated = False Then Exit For to skip the remaining sheets.
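A rough sketch of that flag idea, under stated assumptions: WriteOrDash and emptyRun are hypothetical names that do not appear in the original macro, and the helper only covers the plain "blank or value" checks, not the checkbox or "Issued" variations.

```vba
' Hypothetical helper (not in the original macro): writes the source value,
' or "-" when it is blank, and raises the flag when real data was written.
Sub WriteOrDash(ByVal target As Range, ByVal source As Variant, _
                ByRef isPopulated As Boolean)
    If Len(CStr(source)) = 0 Then
        target.Value = "-"
    Else
        target.Value = source
        isPopulated = True
    End If
End Sub

' Assumed usage inside the For Each loop, with emptyRun declared As Long
' and isPopulated declared As Boolean before the loop:
'     isPopulated = False
'     WriteOrDash Cells(Iw, 3), Sh.Range("D8").Value, isPopulated
'     ' ... remaining cells ...
'     If isPopulated Then emptyRun = 0 Else emptyRun = emptyRun + 1
'     If emptyRun >= 10 Then Exit For
```

The flag is reset for every sheet, so emptyRun only grows while consecutive sheets produce nothing but "-".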
EDIT: another idea I just had for you. If all the cells you're checking are supposed to have numeric values, then you could use the below:
If Not WorksheetFunction.Concat(Range("D8"), Range("C134"), etc) Like "*#*" Then
    'Code here to skip this and remaining sheets.
End If
Obviously you'd need to add the relevant ranges inside the Concat() brackets. What that will do is join the contents of those cells together, then check the result for any digit with "*#*" (you could also check for any letter using "*[A-Za-z]*"; note that "*?*" would match any character at all). That gives you a one-code-line answer to the basic question 'is this sheet populated or not'.
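To see how the Like patterns behave on their own, here is a small sketch. ContainsDigit and ContainsLetter are hypothetical names used only for illustration, and the letter check assumes the default Option Compare Binary.

```vba
' Hypothetical helpers showing the two Like patterns in isolation.
Function ContainsDigit(ByVal s As String) As Boolean
    ' "#" matches any single digit, "*" matches any run of characters
    ContainsDigit = (s Like "*#*")
End Function

Function ContainsLetter(ByVal s As String) As Boolean
    ' "[A-Za-z]" matches one ASCII letter
    ContainsLetter = (s Like "*[A-Za-z]*")
End Function
```

A string of joined "-" placeholders fails both checks, while anything containing a typed number or word passes the matching one.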
I'm sure it's a bad idea to terminate the macro prematurely based on such an imprecise criterion as the number of consecutive "empty" sheets. If data starts again on the 11th, 15th or 30th sheet, you will not process it and it will be lost.
Your macro is not very complex, so it shouldn't take longer than a few seconds. For modern Excel, 25K cells is very little.
Your code can be shortened and simplified a little. After all, you know all the addresses of the cells that you need to check on each sheet, and you enter them in the macro code sequentially, right? Write them on one line separated by commas and put them in a constant.
After that, the whole code will become much shorter:
Sub SummaryMacro()
Const REQUIRED_CELLS_ADDRESS As String = "D8,...<all other source cells>...,B6"
Const SUM_SHEETNAME As String = "Summary"
Dim ws As Worksheet
Dim wsSum As Worksheet
Dim rCell As Range
Dim oTargetCell As Range
Dim oSumCell As Range
Dim aAddress As Variant
Dim i As Integer
aAddress = Split(REQUIRED_CELLS_ADDRESS, ",")
Set wsSum = ActiveWorkbook.Worksheets(SUM_SHEETNAME)
wsSum.UsedRange.Offset(1, 0).ClearContents
Set oTargetCell = wsSum.Range("A1")
For Each ws In ActiveWorkbook.Worksheets
If ws.Name <> SUM_SHEETNAME Then
Set oTargetCell = oTargetCell.Offset(1, 0)
wsSum.Hyperlinks.Add Anchor:=oTargetCell, Address:="", SubAddress:="'" & ws.Name & "'" & "!" & "A1", TextToDisplay:="Go to"
oTargetCell.Resize(1, 123).Value = "-"
Set oSumCell = oTargetCell.Offset(0, 1)
oSumCell.Value = ws.Name
For i = LBound(aAddress) To UBound(aAddress)
Set rCell = ws.Range(aAddress(i))
Set oSumCell = oSumCell.Offset(0, 1)
If Not IsEmpty(rCell) Then oSumCell.Value2 = rCell.Value2
Next i
End If
Next ws
End Sub
Update: Everyone knows that working with an array in RAM is much faster than working with sheet cells. Therefore, the outer loop (iterating over the sheets of the book) remains the same, but we change the code inside the loop in this way:
Sub SummaryMacro()
Const SUM_SHEETNAME As String = "Summary"
Dim ws As Worksheet
Dim wsSum As Worksheet
Dim oTargetCell As Range
Dim aResData As Variant
Set wsSum = ActiveWorkbook.Worksheets(SUM_SHEETNAME)
wsSum.UsedRange.Offset(1, 0).ClearContents
Set oTargetCell = wsSum.Range("A1")
For Each ws In ActiveWorkbook.Worksheets
If ws.Name <> SUM_SHEETNAME Then
Set oTargetCell = oTargetCell.Offset(1, 0)
wsSum.Hyperlinks.Add Anchor:=oTargetCell, Address:="", SubAddress:="'" & ws.Name & "'" & "!" & "A1", TextToDisplay:="Go to " & ws.Name
aResData = validateData(ws.Range("A1:L140").Value2) ' Or "D8:C134" or any other
oTargetCell.Offset(0, 1).Resize(1, UBound(aResData) + 1).Value = aResData
End If
Next ws
End Sub
The main trick is hidden in this line: aResData = validateData(ws.Range("A1:L140").Value2)
We call our function and pass it as a parameter an array of cell values from the entire next sheet. Further work on analysis and processing will be carried out with the elements of this array. However, this is not the whole trick.
The validateData() function is very simple and looks like this:
Function validateData(aD As Variant) As Variant
validateData = validateValues(aD(1, 5), aD(2, 8), aD(3, 1), aD(2, 11), _
........ , _
aD(111, 3), aD(112, 8), aD(123, 9), aD(126, 10))
End Function
In other words, from the whole large array aD (the name is deliberately short, because it has to be repeated 123 times in this function) we select only the values that need to be analyzed and pass them on to the next function. Despite the seeming simplicity, this is the most time-consuming part: you need to find every source cell on the sheet ("D8", "C134", etc.) and write down its coordinates (row, column) as numbers: aD(8, 4), aD(134, 3), etc.
The R1C1 reference style may help here, or some kind of helper function used while preparing the code (not while executing it: we agreed that we will no longer access cells to get values or the .Row and .Column properties!)
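Such a preparation-time helper could look roughly like this. It is only a sketch: AddressesToIndexes is a hypothetical name, and the output is only valid if the array is read from a range that starts at A1, as in the "A1:L140" example.

```vba
' Hypothetical preparation-time helper: turns "D8,C134" into the text
' "aD(8, 4), aD(134, 3)" so it can be pasted into validateData().
' Assumes the array comes from a range whose top-left cell is A1.
Function AddressesToIndexes(ByVal addressList As String) As String
    Dim parts As Variant
    Dim i As Long
    Dim result As String

    parts = Split(addressList, ",")
    For i = LBound(parts) To UBound(parts)
        With ActiveSheet.Range(Trim$(parts(i)))
            If i > LBound(parts) Then result = result & ", "
            result = result & "aD(" & .Row & ", " & .Column & ")"
        End With
    Next i

    AddressesToIndexes = result
End Function
```

Run it once from the Immediate window, for example ? AddressesToIndexes("D8,C134"), and copy the result into the code; it is not meant to be called while the summary is being built.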
What will the validateValues() function get? A long one-dimensional array aData(0 To 122) of cell values in the listed order. That is, as many values as there are cells to be filled in the Summary row for this sheet.
The last trick is the process of processing values. It would seem that we have gained nothing from all these transformations. But you claim that there are three groups of checks - for an empty value, for a boolean value (checkbox) and for text lines. This is how it is handled:
Function validateValues(ParamArray aData() As Variant) As Variant
Dim i As Variant
Dim aResult As Variant
ReDim aResult(LBound(aData) To UBound(aData))
For i = LBound(aData) To UBound(aData)
Select Case i
Case 1, 5, 7, 9 ' Checking cells empty / value
aResult(i) = IIf(aData(i) = "", "-", aData(i))
Case 4, 6, 10 ' Checking cells boolean True / "not True" (False or blank)
aResult(i) = IIf(aData(i), "Declared", "-")
Case 0, 3, 8 ' Checking cells string "Issued" / other
aResult(i) = IIf(aData(i) = "Issued", "-", aData(i))
Case 2, 91, 118 ' Checking cells string "Pending" / other
aResult(i) = IIf(aData(i) = "Pending", "-", aData(i))
Case Else ' In a real macro, this line is not needed, it will never be executed because all the cells of the array are already listed above, this is useful only for debugging while all conditions will be written
aResult(i) = "-"
Debug.Print "Cell #" & i & " not processed yet"
End Select
Next i
validateValues = aResult
End Function
And now - again, in just one call! - we write a whole row of results:
oTargetCell.Offset(0, 1).Resize(1, UBound(aResData) + 1).Value = aResData
I am sure that these tricks will reduce the time it takes to form the summary sheet many times over. Please try this and let me know if it gets better?
I am new to both VBA and stackoverflow. So please be patient ;).
I searched for a solution but could not find it.
My problem is as follows:
I have a column (A) with names and then a column (B) where some cells contain an "X" and others do not. I want to know which names have an "X" beside them.
I now want a string as a result, in one cell.
In this example:
I did not get very far.
For r = 1 To 20
    If Cells(r, 2) = "X" Then A = Cells(r, 1)
Next r
Then "A" is "Noah" and I can write it in a cell, but I want it to find all values and then write them combined, preferably separated by ; in a cell.
Does anyone have any idea?
Create a string variable, then append your results to that variable based on "X" being in column B. Here's an example of how you could do it:
Sub Foo()
    Dim i As Integer
    Dim result As String
    For i = 1 To 20
        If UCase(Cells(i, 2).Value) = "X" Then
            result = result & Cells(i, 1).Value & ";"
        End If
    Next i
    '// output the result to C1, without the trailing ";"
    If Len(result) > 0 Then Range("C1").Value = Left$(result, Len(result) - 1)
End Sub
Excel's native worksheet formulas do not handle concatenating an unknown number of strings together and compensating for the maximum number possible can get messy. A User Defined Function¹ (aka UDF) takes advantage of VBA's ability to process loops through a large number of rows while making numerical or string comparisons 'on-the-fly'.
build_List UDF
Function build_List(rNAMs As Range, rEXs As Range, vEX As Variant, _
Optional delim As String = ";", _
Optional bCS As Boolean = False)
Dim str As String, rw As Long, cl As Long
With rNAMs.Parent
Set rNAMs = Intersect(.UsedRange, rNAMs)
Set rEXs = .Cells(rEXs.Rows(1).Row, rEXs.Columns(1).Column). _
Resize(rNAMs.Rows.Count, rNAMs.Columns.Count)
End With
With rNAMs
For rw = .Rows(1).Row To .Rows(.Rows.Count).Row
For cl = .Columns(1).Row To .Columns(.Columns.Count).Row
If (.Cells(rw, cl).Offset(0, rEXs.Column + (cl - 1) - cl) = vEX And bCS) Or _
(LCase(.Cells(rw, cl).Offset(0, rEXs.Column + (cl - 1) - cl)) = LCase(vEX)) Then _
str = str & .Cells(rw, cl).Value & delim
Next cl
Next rw
End With
build_List = Left(str, Len(str) - Len(delim))
End Function
In D7, use it as:
=build_List(A:A, B:B, "x")
Applying the build_List UDF to your sample data
¹ A User Defined Function (aka UDF) is placed into a standard module code sheet. Tap Alt+F11 and when the VBE opens, immediately use the pull-down menus to Insert ► Module (Alt+I,M). Paste the function code into the new module code sheet titled something like Book1 - Module1 (Code). Tap Alt+Q to return to your worksheet(s).
Mate Juhasz answered the question very nicely and simply, but now the answer has disappeared.
Mate wrote:
For r = 1 To 20
    If Cells(r, 2) = "X" Then A = A & "; " & Cells(r, 1)
Next r
And for me that solved it perfectly. Now "A" is a string as I wanted. Thank you so much!
Please be aware that I am working with a series of ~1000 line medical information databases. Due to the size of the databases, manual manipulation of the data is too time consuming. As such, I have attempted to learn VBA and code an Excel 2010 macro using VBA to help me accomplish parsing certain data. The desired output is to split certain characters from a provided string on each line of the database as follows:
will need to be split into
ActiveRow, ActiveColumn = 99204
ActiveRow, ActiveColumn+3 = OFFICE/OUTPATIENT VISIT, NEW
I have researched this topic using Walkenbach's "Excel 2013: Power Programming with VBA" and a fair amount of web resources, including this awesome site, but have been unable to develop a fully-workable solution using VBA in Excel. The code for my current macro is:
Sub EasySplit()
Dim text As String
Dim a As Integer
Dim name As Variant
text = ActiveCell.Value
name = Split(text, "-", 2)
For a = 0 To 1
Cells(1, a + 3).Value = Trim(name(a))
Next a
End Sub
The code uses the "-" character as a delimiter to split the input string into two substrings (I have limited the output strings to 2, as there exists in some input strings multiple "-" characters). I have trimmed the second string output to remove leading spaces.
The trouble that I am having is that the output is being presented at the top of the activesheet, instead of on the activerow.
Thank you in advance for any help. I have been working on this for 2 days and although I have made some progress, I feel that I have reached an impasse. I think that the issue is somewhere in the
Cells(1, a + 3).Value = Trim(name(a))
code, specifically with "Cells()".
Thank you Conrad Frix!
Yah.. funny enough. Just after I post I have a brainstorm.. and modify the code to read:
Sub EasySplit()
Dim text As String
Dim a As Integer
Dim name As Variant
text = ActiveCell.Value
name = Split(text, "-", 2)
For a = 0 To 1
ActiveCell.Offset(0, 3 + a).Value = Trim(name(a))
Next a
End Sub
Not quite the column1, column4 output that I want (it outputs to column3, column4), but it will work for my purpose.
Now I need to incorporate a loop so that the code runs on each successive cell in the column (downwards, step 1) skipping all bolded cells, until it hits an empty cell.
Modified answer to modified request.
This will start on row 1 and continue until a blank cell is found in column A. If you would like to start on a different row, perhaps row 2 if you have headers, change the
i = 1
line to
i = 2
I added a check on the upper bound of our variant before doing the output writes, in case the macro is run again on already formatted cells. (Does nothing instead of erroring out)
Sub EasySplit()
    Dim initialText As String
    Dim i As Long
    Dim name As Variant
    i = 1
    ' Walk down column A until the first blank cell
    Do While Trim(Cells(i, 1)) <> ""
        ' Skip bolded cells
        If Not Cells(i, 1).Font.Bold Then
            initialText = Cells(i, 1).text
            name = Split(initialText, "-", 2)
            ' Only write when the split produced two parts
            If Not UBound(name) < 1 Then
                Cells(i, 1) = Trim(name(0))
                Cells(i, 4) = Trim(name(1))
            End If
        End If
        i = i + 1
    Loop
End Sub
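The effect of the limit argument and of the UBound check can be seen in isolation with a small sketch. SplitOnFirstHyphen is a hypothetical helper name, and the sample string in the note below is made up to show a second hyphen; the sketch assumes the default Option Base 0.

```vba
' Hypothetical helper: returns a two-element array (code, description).
' When the text contains no "-", the description is an empty string.
Function SplitOnFirstHyphen(ByVal text As String) As Variant
    Dim parts As Variant

    parts = Split(text, "-", 2)   ' limit 2: only the first "-" splits
    If UBound(parts) < 1 Then
        ' Nothing to split (already processed or no hyphen at all)
        SplitOnFirstHyphen = Array(Trim$(text), "")
    Else
        SplitOnFirstHyphen = Array(Trim$(parts(0)), Trim$(parts(1)))
    End If
End Function
```

For "99204 - OFFICE/OUTPATIENT VISIT - NEW" the first element is "99204" and the second keeps the later hyphen: "OFFICE/OUTPATIENT VISIT - NEW". For "99204" alone, the second element is empty, which is the case the UBound check protects against.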
Just add a variable to keep track of the active row and then use that in place of the constant 1.
Dim iRow As Long
iRow = ActiveCell.Row
For a = 0 To 1
    Cells(iRow, a + 3).Value = Trim(name(a))
Next a
Alternate method utilizing TextToColumns. This code also avoids using a loop, making it more efficient and much faster. Comments have been added to assist with understanding the code.
EDIT: I have expanded the code to make it more versatile by using a temp worksheet. You can then output the two columns to wherever you'd like. As stated in your original question, the output is now to columns 1 and 4.
Sub tgr()
Const DataCol As String = "A" 'Change to the correct column letter
Const HeaderRow As Long = 1 'Change to be the correct header row
Dim rngOriginal As Range 'Use this variable to capture your original data
'Capture the original data, starting in Data column and the header row + 1
Set rngOriginal = Range(DataCol & HeaderRow + 1, Cells(Rows.Count, DataCol).End(xlUp))
If rngOriginal.Row < HeaderRow + 1 Then Exit Sub 'No data
'We will be using a temp worksheet, and to avoid a prompt when we delete the temp worksheet we turn off alerts
'We also turn off screenupdating to prevent "screen flickering"
Application.DisplayAlerts = False
Application.ScreenUpdating = False
'Move the original data to a temp worksheet to perform the split
'To avoid having leading/trailing spaces, replace all instances of " - " with simply "-"
'Lastly, move the split data to desired locations and remove the temp worksheet
With Sheets.Add.Range("A1").Resize(rngOriginal.Rows.Count)
    .Value = rngOriginal.Value
    .Replace " - ", "-"
    .TextToColumns .Cells, xlDelimited, Other:=True, OtherChar:="-"
    rngOriginal.Value = .Value
    rngOriginal.Offset(, 3).Value = .Offset(, 1).Value
    .Parent.Delete 'Remove the temp worksheet
End With
'Now that all operations have completed, turn alerts and screenupdating back on
Application.DisplayAlerts = True
Application.ScreenUpdating = True
End Sub
You can do this in a single shot without looping using the VBA equivalent of entering this formula, then taking values only
as a formula
=IF(NOT(ISERROR(FIND("-",A1))),RIGHT(A1,LEN(A1)-FIND("-",A1)-1 ),A1)
Sub Quicker()
Dim rng1 As Range
Set rng1 = Range([a1], Cells(Rows.Count, "A").End(xlUp))
With rng1.Offset(0, 3)
.FormulaR1C1 = "=IF(NOT(ISERROR(FIND(""-"",RC[-3]))),RIGHT(RC[-3],LEN(RC[-3])-FIND(""-"",RC[-3])-1 ),RC[-3])"
.Value = .Value
End With
End Sub
I have column H that contains long GET requests on sheet 1 such as:
GET /profiles/text/23493495_3492/g93id93kd
GET /edit/result/393493/te3903k4d
I would like to have a second sheet with the following type of list in columns A and B:
23493495 identifier1
3903k4 realid2
g93id realid3
Ultimately, I would like a function that will search sheet 1 column H for any of the values in sheet 2 column A. Most of the time there is no separator so I need it to search for strings within the GET string. Once a value in sheet 2 column A is matched with a value in sheet 1 column H, I would like the function to take the corresponding text in sheet 2 column B and print it in sheet 1 column I. There may be multiple matches in a cell, so that would need to be taken into account. So if using the example above:
In H1, there would be a match of 23493495 and g93id within the string. I would like sheet 1 column I to display:
identifier1, realid3
I initially started with the below code where I had to specify the list but it doesn't use a second sheet or print the corresponding text of the match. So I would rather have something that meets my needs above, but below is an example of what I have tried so far:
=ListSearchB(J2, "23493495 g93id")
This uses a module that I found and modified a little:
Function ListSearchB(text As String, wordlist As String, Optional caseSensitive As Boolean = False)
Dim strMatches As String
Dim res As Variant
Dim arrWords() As String
arrWords = Split(wordlist)
On Error Resume Next
For Each word In arrWords
    If caseSensitive = False Then
        res = InStr(LCase(text), LCase(word))
    Else
        res = InStr(text, word)
    End If
    If res > 0 Then
        strMatches = strMatches & word
    End If
Next word
If Len(strMatches) <> 0 Then
strMatches = Right(strMatches, Len(strMatches))
End If
ListSearchB = strMatches
End Function
That gives me:
23493495g93id in column I, and I wasn't sure how to separate the two with a comma.
In general though, I would prefer, to use some way to pull the list from sheet 2 and display the value in column I as specified initially.
Give this a try. Just adjust the sheet names where commented before running.
Sub your_sub()
    Dim rget As Range, rIds As Range
    Dim cget As Range, cIds As Range
    Dim mys As String
    Dim i As Long

    With Worksheets("GET") ' <-- sheet holding the GET requests in column H
        Set rget = .Range(.Range("H1"), .Range("H" & .Rows.Count).End(xlUp))
    End With
    With Worksheets("IDs") ' <-- sheet holding the lookup list in columns A:B
        Set rIds = .Range(.Range("A1"), .Range("A" & .Rows.Count).End(xlUp))
    End With

    mys = vbNullString
    i = 1
    For Each cget In rget
        ' Collect the column B label of every ID found in this request
        For Each cIds In rIds
            If InStr(cget.Value, cIds.Value) <> 0 Then
                mys = mys & ", " & cIds.Offset(0, 1).Value
            End If
        Next cIds
        If mys <> vbNullString Then
            ' Drop the leading ", " and write the list to column I
            mys = Right(mys, Len(mys) - 2)
            Worksheets("GET").Range("I" & i).Value = mys
        End If
        i = i + 1
        mys = vbNullString
    Next cget
End Sub
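Since the question asked for a function, the same lookup could also be wrapped in a UDF. This is only a sketch under assumptions: MatchLabels is a hypothetical name, the lookup list is passed in as a two-column range (search value, label), and the comparison is made case-insensitive, which the question does not specify.

```vba
' Hypothetical UDF: idTable is a two-column range (search value, label).
' Returns the labels of all search values found inside text, joined by ", ".
Function MatchLabels(ByVal text As String, ByVal idTable As Range) As String
    Dim r As Long
    Dim key As String
    Dim result As String

    For r = 1 To idTable.Rows.Count
        key = CStr(idTable.Cells(r, 1).Value)
        ' Skip blank lookup cells: InStr would report a match for ""
        If Len(key) > 0 Then
            If InStr(1, text, key, vbTextCompare) > 0 Then
                If Len(result) > 0 Then result = result & ", "
                result = result & CStr(idTable.Cells(r, 2).Value)
            End If
        End If
    Next r

    MatchLabels = result
End Function
```

In sheet 1 it would be called with something like =MatchLabels(H1, Sheet2!$A$1:$B$3) and filled down column I; the sheet name and range here are placeholders.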
I have a simple problem that I'm hoping to resolve without using VBA but if that's the only way it can be solved, so be it.
I have a file with multiple rows (all one column). Each row has data that looks something like this:
1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom
2 5.09E-09 >gi|168010496|ref|xp_00| 2-pyruvate
What I want is some way to extract the string of numbers that begin with "gi|" and end with a "|". For some rows this might mean as many as 5 gi numbers, for others it'll just be one.
What I would hope the output would look like would be something like:
Here is a very flexible VBA answer using the regex object. What the function does is extract every single sub-group match it finds (stuff inside the parenthesis), separated by whatever string you want (default is ", "). You can find info on regular expressions here:
You would call it like this, assuming that first string is in A1:
Since this looks for all occurrences of "gi|" followed by a series of numbers and then another "|", for the first line in your question, this would give you this result:
297848936, 297338191
Just run this down the column and you're all done!
Function RegexExtract(ByVal text As String, _
ByVal extract_what As String, _
Optional separator As String = ", ") As String
Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
Dim i As Long, j As Long
Dim result As String
RE.pattern = extract_what
RE.Global = True
Set allMatches = RE.Execute(text)
' Append every sub-group of every match, each prefixed by the separator
For i = 0 To allMatches.count - 1
    For j = 0 To allMatches.Item(i).submatches.count - 1
        result = result & (separator & allMatches.Item(i).submatches.Item(j))
    Next j
Next i
' Strip the leading separator
If Len(result) <> 0 Then
    result = Right$(result, Len(result) - Len(separator))
End If
RegexExtract = result
End Function
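As a self-contained check of a pattern that fits the description above, here is a small sketch. The pattern gi\|(\d+)\| is an assumption (the original call is not shown here), and GiNumbers is a hypothetical name; it uses the same late-bound regex object as RegexExtract but with the pattern fixed.

```vba
' Hypothetical helper with an assumed pattern: returns every run of digits
' that sits between "gi|" and the next "|", joined with ", ".
Function GiNumbers(ByVal text As String) As String
    Dim re As Object
    Dim m As Object
    Dim result As String

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "gi\|(\d+)\|"   ' the parentheses capture only the digits
    re.Global = True

    For Each m In re.Execute(text)
        If Len(result) > 0 Then result = result & ", "
        result = result & m.SubMatches(0)
    Next m

    GiNumbers = result
End Function
```

For the first sample row in the question this returns "297848936, 297338191", and for the second row "168010496".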
Here it is (assuming data is in column A)
=VALUE(LEFT(RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2),
FIND("|",RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2)) -1 ))
Not the nicest formula, but it will work to extract the number.
I just noticed that you have two values per row, with the output separated by commas. You will need to check if there is a second match, third match etc. to make it work for multiple numbers per cell.
In reference to your exact sample (assuming 2 values maximum per cell) the following formula will work:
- FIND("gi|",$A1) - 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ),
", ",LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1)
- 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1) - 2))
-1 )),LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2),
FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ))
How's that for ugly? A VBA solution may be better for you, but I'll leave this here for you.
To go up to 5 numbers, well, study the pattern and recurse manually in the formula. It will get long!
I'd probably split the data first on the | delimiter using the convert text to columns wizard.
In Excel 2007 that is on the Data tab, Data Tools group and then choose Text to Columns. Specify Other: and | as the delimiter.
From the sample data you posted it looks like after you do this the numbers will all be in the same columns so you could then just delete the columns you don't want.
As the other guys presented solutions without VBA, I'll present one that does use it. It's your call whether to use it or not.
I just saw that @Issun presented a solution with regex, very nice! Either way, I'll present a 'modest' solution to the question, using only 'plain' VBA.
Option Explicit
Option Base 0
Sub findGi()
    Dim oCell As Excel.Range
    Set oCell = Sheets(1).Range("A1")
    'Loops through every row until empty cell
    While Not oCell.Value = ""
        oCell.Offset(0, 1).Value2 = GetGi(oCell.Value)
        Set oCell = oCell.Offset(1, 0)
    Wend
End Sub
Private Function GetGi(ByVal sValue As String) As String
Dim sResult As String
Dim vArray As Variant
Dim vItem As Variant
Dim iCount As Integer
vArray = Split(sValue, "|")
iCount = 0
'Loops through the array...
For Each vItem In vArray
'Searches for the 'Gi' factor...
If vItem Like "*gi" And UBound(vArray) > iCount + 1 Then
'Concatenates the results...
sResult = sResult & vArray(iCount + 1) & ","
End If
iCount = iCount + 1
Next vItem
'And removes trail comma
If Len(sResult) > 0 Then
sResult = Left(sResult, Len(sResult) - 1)
End If
GetGi = sResult
End Function
Open your Excel file in Google Sheets and use a regular expression with REGEXEXTRACT.
Sample Usage
=REGEXEXTRACT("My favorite number is 241, but my friend's is 17", "\d+")
Tip: REGEXEXTRACT will return 241 in this example because it returns the first matching case.
In your case